Houssem Assadi and Thomas Beauvisage
France Télécom R&D – DIH/UCE, France
A Comparative Study of Six French-language Web Directories
Abstract:
This paper presents a comparative study of six French-language Web directories (MSN,
Nomade, Open Directory, Voila, Voila Pages Perso and Yahoo). The study focuses on the quantitative
and qualitative aspects of the organization of these directories, and on the way in which they describe
sites. It reveals a wide variety of structures, content and organizational principles. In this respect, Web
directories do not correspond to classic theories of classification. They highlight the difficulty of
proposing a structured representation of the heterogeneous content of the Web.
1. Introduction
Web directories, lists of web sites classified into categories, are extensively used by
Internet users. They are particularly useful for novices since they facilitate navigation and provide
highly relevant links, especially compared to "full text" search engines with their
problems of query complexity, noise, etc. The creation of Web directory categories and
the classification of sites are done manually, which contributes to an acceptable organization
of information: categories are created "rationally" and systematically, and the relevance of
listed sites is generally guaranteed.
We undertook a comparative study of six Web directories partially or totally dedicated
to the French-language Web: MSN, Nomade, Open Directory, Voila, Voila Pages Perso and
Yahoo. This study concerned data available in these Web directories in February 2001. We
first developed a specific software package in order to explore the structure and content of
directories (hierarchical links and cross-references between categories, listed addresses and
site descriptions). On the basis of these data we then performed a qualitative analysis of the
organizational principles of each directory, followed by quantitative investigations consisting
of: 1) calculation of statistical indicators representing the structure and complexity of each
directory; 2) calculation of the specific characteristics of each directory, based on the content
and presentation of the indexed sites. Our study included other analyses based on a formal
exploitation of directories as graphs, but the limited scope of this article does not enable us to
include those aspects as well.
2. What is a Web directory?
A Web directory offers users lists of sites grouped into hierarchically structured
categories forming a "tree". These categories contain the indexed sites or pages (identified by
their URL, i.e. their Internet address), along with a brief description of their content.
Directories differ not only in size and number of URLs presented and indexed, but also
in structure. The structure of a directory can be defined by the combination of three elements:
- Multi-indexing: some directories index the same URL in different categories; thus, the
  same address sometimes appears several times in different places in a directory.
- Position of the URLs indexed in the tree: some directories propose URLs in all their
  categories, while others classify them only in terminal categories (i.e. those which do not
  have sub-categories).
- Use of cross-references: directories such as Yahoo propose links, within categories, to
  other categories which may be situated anywhere in the directory and not necessarily
  directly beneath them in the tree.
Structurally, each directory is thus a combination of these three elements, as presented
in Table 1.
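[Table 1: one row per directory (MSN, Nomade, Open Directory, Voila, Voila PP, Yahoo) and three
columns (Multi-indexing; URLs are indexed only in terminal categories; Use of cross-references).
The cell marks are not recoverable here.]
Table 1 - Structural description of directories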
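The three structural elements described above can be checked mechanically once a directory is held in memory as a tree. The following is a minimal sketch under stated assumptions: the `Category` class, its field names and the toy data are hypothetical and are not the software package mentioned in the introduction.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Category:
    """One node of a directory tree (hypothetical model, for illustration only)."""
    name: str
    urls: list = field(default_factory=list)        # URLs listed in this category
    children: list = field(default_factory=list)    # sub-categories
    cross_refs: list = field(default_factory=list)  # names of categories linked to elsewhere in the tree


def walk(cat):
    """Yield a category and all of its descendants, depth-first."""
    yield cat
    for child in cat.children:
        yield from walk(child)


def structural_profile(root):
    """Return the three structural properties of a directory as booleans."""
    cats = list(walk(root))
    url_counts = Counter(url for c in cats for url in c.urls)
    return {
        # at least one URL is listed in more than one place
        "multi_indexing": any(n > 1 for n in url_counts.values()),
        # no category that has sub-categories lists URLs itself
        "urls_only_in_terminal": all(not c.urls for c in cats if c.children),
        # at least one category links to a category elsewhere in the tree
        "cross_references": any(c.cross_refs for c in cats),
    }


# Toy directory (invented data)
root = Category("Top", children=[
    Category("Arts", urls=["http://a.example"], children=[
        Category("Cinema", urls=["http://b.example"], cross_refs=["Leisure"]),
    ]),
    Category("Leisure", urls=["http://b.example"]),
])
profile = structural_profile(root)
print(profile)
```

In this toy tree, `http://b.example` is listed twice, the non-terminal category "Arts" lists a URL, and "Cinema" carries a cross-reference, so the directory combines multi-indexing, URLs at all levels and cross-references.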
3. Differences of size and structure
The first difference we observe is in the coverage of directories, calculated in terms of
the number of indexed URLs. This calculation must take into account multi-indexing: if a
directory lists the same URL in several places, it will present more addresses than individual
URLs actually indexed. That is why it is important to distinguish between the number of
URLs presented and the number of individual URLs indexed (cf. Table 2). For example,
whereas Nomade is the directory with the broadest coverage (139,000 indexed URLs), Yahoo
presents the highest number of URLs to the user (192,000) because it lists many of its URLs
in several categories (mean repetition rate of 1.8).
Directories also vary widely in terms of depth. Yahoo, with its 18-level tree, is the
deepest. The smallest directory, Voila Pages Perso, has a maximum depth of only 5.
But depth does not systematically correlate with the number of URLs (see Table 2). We
furthermore found wide variability in terms of density (mean number of URLs per category).
Whereas Nomade and Voila propose a mean of close to 20 URLs per category containing at
least one URL, Open Directory, Yahoo and MSN offer a mean of between 5 and 10.
Directory  Number of   Maximum  Total number of  Number of        Mean rate of    Categories with   Mean number of URLs per
           categories  depth    URLs presented   individual URLs  URL repetition  at least one URL  category with at least one URL
MSN        6,875       7        62,523           46,137           1.35            6,507             9.61
Nomade     9,165       10       183,590          138,832          1.32            8,754             20.97
Open Dir.  5,244       10       32,496           32,496           1               4,201             7.73
Voila      8,967       10       134,502          59,744           2.25            7,854             17.12
Voila PP   601         5        50,794           27,923           1.81            564               90.06
Yahoo      44,137      18       192,160          106,832          1.8             37,178            5.17
Table 2 - Indicators of size, depth and multi-indexing
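The two derived columns of Table 2 appear to follow directly from the raw counts: the repetition rate is the number of URLs presented divided by the number of individual URLs, and the density is the number of URLs presented divided by the number of categories that index at least one URL. These definitions are inferred from the column headings rather than stated by the authors; the sketch below (the function name is illustrative) reproduces the published values to within rounding.

```python
def size_indicators(urls_presented, individual_urls, categories_with_urls):
    """Return (mean repetition rate, mean URLs per non-empty category)."""
    repetition = urls_presented / individual_urls
    density = urls_presented / categories_with_urls
    return repetition, density


# (URLs presented, individual URLs, categories indexing at least one URL), from Table 2
table2 = {
    "MSN": (62_523, 46_137, 6_507),
    "Nomade": (183_590, 138_832, 8_754),
    "Open Dir.": (32_496, 32_496, 4_201),
    "Voila": (134_502, 59_744, 7_854),
    "Voila PP": (50_794, 27_923, 564),
    "Yahoo": (192_160, 106_832, 37_178),
}

indicators = {name: size_indicators(*row) for name, row in table2.items()}
for name, (repetition, density) in indicators.items():
    print(f"{name:10s} repetition={repetition:.2f}  density={density:.2f}")
```

A repetition rate of exactly 1, as for Open Directory, means that no URL is listed in more than one category.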
Finally, cross-references strongly influence the directory structure. They facilitate
navigation for the user and enable the creators of directories to compensate for the rigidity of
the overall organization of the tree. The four directories that use cross-references (Nomade,
Open Directory, Voila and Yahoo) do so differently (see Table 3). While Nomade and Voila
make little use of them (only 1.6% of Voila categories use cross-references, with a mean of
1.4 cross-references proposed), Open Directory and Yahoo use them on a large scale (20% of
the Yahoo categories, with a mean of 4 cross-references per category).
Directory  Total number of  Categories with a  Total number of   Mean cross-references per      Proportion of categories
           categories       cross-reference    cross-references  category with cross-reference  with cross-reference
MSN        6,875            -                  -                 -                              -
Nomade     9,165            586                997               1.70                           6.3%
Open Dir.  5,244            526                1,660             3.15                           10.0%
Voila      8,967            142                202               1.42                           1.6%
Voila PP   601              -                  -                 -                              -
Yahoo      44,137           8,806              34,840            3.95                           19.9%
Table 3 - Use of cross-references
4. Web directories have different organizational principles
In this section we consider the principles governing the organization and structuring of
directories. Several models can be used for the organization of data and knowledge, derived
from fields as varied as knowledge representation in Artificial Intelligence, the compilation of
thesauruses (e.g. in libraries) or the creation of indexes and other "paper" directories (e.g.
yellow pages, professional directories). These different models have been adopted by the
publishers of Web directories, in some cases intentionally, in others less so. Three modes of
organization can be distinguished:
- Systematic categorization of domains of human activities, objects of daily life, etc. in
  an ontological approach.
- Less systematic and more practical cataloguing, focused on human activities (e.g.
  business, recreation, various forms of sociability), in a "yellow pages" approach.
- Categorization of the "Internet world": mapping of sites and services available on the
  Internet, without any precise criteria for classification of objects of the world, human
  activities, etc.
Note that Internet directories, at least those that we studied, do not correspond strictly to
any one of these approaches; they combine the principles of the usual classificatory objects
(ontologies, thesauri, etc.). In the six directories that we studied, very different organizational
models were identified.
For example, if we compare Yahoo and Voila, Yahoo has a systematic classification
approach evident in a large number of categories (44,000 compared to 9,000 in Voila)
organized in an 18-level tree structure (compared to 10 levels in Voila). Yahoo also has a very
dense network based on a system of cross-references between categories (34,000
cross-references, compared to 200 in Voila). The main first-level categories in Yahoo are
"Regional" (Exploration géographique) and "Business and economy" (Commerce et économie).
This reveals a systematic classification approach. The encyclopaedic aspect of the
Yahoo directory is also evident in the presence of categories such as "Social science"
(Sciences humaines) at the top of the tree. By contrast, Voila has a pragmatic approach,
focused on services related to human activities: business, social activities and recreation. This
practical aspect of Voila is strengthened by the existence of a first-level category "Shopping,
daily life" (Achat, vie pratique) which accounts for 13% of all indexed sites and has no
equivalent at the top level in the Yahoo France directory.
5. Web directories have little in common
The six directories studied index a total of 283,000 distinct URLs. We found that in
general the directories overlap very little: only 25 URLs are common to all six; 70% of all the
indexed URLs are in only one directory and 89% in one or two directories.
Each directory has its specific characteristics, confirmed by a low pairwise overlap.
Since the sizes of directories differ, the overlap between two directories
is asymmetrical and must be analysed in both directions for each pair (Table 4).
The row directory shares n% of its URLs with the column directory:
                MSN     Nomade  Open Directory  Voila   Voila PP  Yahoo
MSN             100.0%  33.8%   13.2%           22.7%   1.5%      35.7%
Nomade          11.2%   100.0%  9.3%            19.7%   3.4%      30.4%
Open Directory  18.8%   39.9%   100.0%          22.9%   2.4%      33.6%
Voila           17.6%   45.8%   12.4%           100.0%  3.8%      37.7%
Voila PP        2.4%    16.6%   2.7%            8.0%    100.0%    8.7%
Yahoo           15.4%   39.5%   10.2%           21.1%   2.3%      100.0%
Key: 11.2% of the URLs of Nomade are also indexed by MSN, while 33.8% of the URLs of MSN are in
the Nomade base.
Table 4 - Overlaps between directories
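The asymmetry of Table 4 follows from dividing the same set of shared URLs by two different directory sizes. A minimal sketch, assuming each directory is available as a set of distinct URLs (the function names and toy data are illustrative):

```python
def overlap_share(a, b):
    """Share of the URLs of directory `a` that are also indexed by directory `b`."""
    return len(a & b) / len(a) if a else 0.0


def overlap_matrix(directories):
    """Map each ordered pair (row, column) to the share of the row's URLs found in the column."""
    return {
        row: {col: overlap_share(urls, directories[col]) for col in directories}
        for row, urls in directories.items()
    }


# Toy data: a small and a large directory sharing two URLs
directories = {
    "small": {"u1", "u2", "u3", "u4"},
    "big": {f"u{i}" for i in range(3, 13)},  # u3 .. u12
}
matrix = overlap_matrix(directories)
print(matrix["small"]["big"], matrix["big"]["small"])
```

Because both cells of a pair count the same shared URLs, the share from A to B multiplied by the size of A equals the share from B to A multiplied by the size of B. The smaller directory of a pair therefore always shows the larger percentage.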
The particularity of the Voila Pages Perso personal sites directory is strongly confirmed
by the very low level of overlap with other directories, especially from Voila Pages Perso to
other directories, even though Voila Pages Perso is the smallest directory of all.
The mean overlap rate between the different directories is 18.1%, or 22.8% if we
exclude the very specific Voila Pages Perso. As in the case of Voila Pages Perso, size does
not seem to be the decisive factor in overlaps between directories. For example, MSN and
Open Directory, both small, share less than a third of their URLs, on average, with other
directories up to four times bigger, an overlap equivalent to that between Nomade and
Yahoo, the two biggest. It thus appears that each directory indexes sites peculiar to it. This is
confirmed by the proportion, in each directory, of URLs indexed in no other directory (see
Table 5).
Directory       Number of sites  Number of sites peculiar  Percentage of sites
                indexed          to the directory          peculiar to the directory
MSN             46,223           21,174                    45.8%
Nomade          139,051          74,089                    53.3%
Open Directory  32,496           12,292                    37.8%
Voila           59,801           21,482                    35.9%
Voila PP        28,330           20,614                    72.8%
Yahoo           107,052          48,899                    45.7%
Table 5 - Proportion of indexed URLs peculiar to each directory
With the exception of Voila Pages Perso, which has a particular content (73% of URLs
peculiar to it), we note that Nomade, the biggest directory, is also the one with the highest
share of URLs peculiar to it (53.3%). This result was expected. Less predictable were the rates of
particularity of MSN (45.8% of URLs peculiar to it), which is three times smaller than
Nomade, and of Yahoo (45.7%), which is relatively low given its size. It thus seems that there is a
twofold effect contributing to the particularity of directories: their size, which statistically
increases their chances of indexing sites that others exclude, and their content, determined by
their choice of indexed sites.
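Both the share of URLs peculiar to each directory (Table 5) and the global figures quoted at the start of this section (URLs found in one directory only, in one or two, and so on) can be derived from a single count of how many directories index each URL. A minimal sketch on invented data, assuming each directory is a set of distinct URLs:

```python
from collections import Counter


def exclusivity(directories):
    """Return (share of peculiar URLs per directory, number of URLs indexed by exactly k directories)."""
    # membership[url] = number of directories that index this URL
    membership = Counter(url for urls in directories.values() for url in urls)
    peculiar = {
        name: (sum(1 for u in urls if membership[u] == 1) / len(urls) if urls else 0.0)
        for name, urls in directories.items()
    }
    spread = Counter(membership.values())
    return peculiar, spread


# Toy data (invented)
directories = {
    "A": {"u1", "u2", "u3", "u4"},
    "B": {"u3", "u4", "u5"},
    "C": {"u4", "u6"},
}
peculiar, spread = exclusivity(directories)
print(peculiar)
print(dict(spread))
```

Here four of the six distinct URLs appear in a single directory and one appears in all three, which is the toy counterpart of the "70% in only one directory" and "25 URLs in common" figures.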
6. Web directories have strong identities
In this part we examine the particularity of each directory in terms of content and style.
We first wished to establish whether, on a given topic (e.g. art, sport or politics), different
directories show marked differences in choice of content and, if so, to what extent. We
excluded Voila Pages Perso from this analysis, because of its intrinsic particularity, and
retained only the five general-interest directories.
We qualified the content of the directories on the basis of the short descriptions they
give of indexed Web sites on a given theme. The following method was applied:
1. We first chose topics present at the first level of the five directories studied;
2. We then extracted, for each directory and for all the sites classified under the chosen
topic, the directories' descriptions of those sites;
3. The corpus thus constituted was processed with a lexicometric tool, Alceste (Reinert
1993). This enabled us to identify the specific vocabulary used by each directory to
describe sites on a given topic.
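The idea of a "specific vocabulary" in step 3 can be illustrated with a much simpler measure than the one Alceste implements. The sketch below is not Alceste's method: it substitutes a plain relative-frequency ratio (how much more frequent a word is in one directory's descriptions than in the pooled corpus). The function names, the thresholds and the sample descriptions are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case a description and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def specific_vocabulary(descriptions, top=3, min_count=2):
    """For each directory, rank words by (frequency in directory) / (frequency in whole corpus)."""
    counts = {
        name: Counter(word for doc in docs for word in tokenize(doc))
        for name, docs in descriptions.items()
    }
    total = sum(counts.values(), Counter())  # pooled word counts
    grand = sum(total.values())              # pooled number of tokens
    result = {}
    for name, c in counts.items():
        size = sum(c.values())
        scores = {
            word: (n / size) / (total[word] / grand)
            for word, n in c.items()
            if n >= min_count  # ignore words seen too rarely to be meaningful
        }
        # highest ratio first, alphabetical order to break ties
        result[name] = sorted(scores, key=lambda w: (-scores[w], w))[:top]
    return result


# Invented site descriptions for one topic in two hypothetical directories
descriptions = {
    "A": ["Free MP3 music server", "Download free music"],
    "B": ["Order the museum catalogue", "Shopping catalogue of art bargains"],
}
vocabulary = specific_vocabulary(descriptions)
print(vocabulary)
```

A ratio above 1 means the word is over-represented in that directory. A real lexicometric analysis would also lemmatize words and test the statistical significance of each deviation, which this sketch leaves out.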
The first topic selected was "Art and Humanities" (Art et culture) present at the top level
of the five directories. At this level MSN focuses on North American museums (specific
vocabulary: "US", "Canada", "Montreal", "New York", etc.), Open Directory gives priority to
downloading of music ("MP3", "server", "free", etc.), Nomade emphasizes the accessibility of
art ("consult", "invite", "share", etc.), Voila is more business-oriented ("bargains",
"shopping", "catalogue", "order", etc.) and Yahoo seems to be more eclectic, with a slight
preference for cinema ("scenario", "critique", "synopsis", etc.).
The same analysis was applied to the category "Business and economy" (Commerce et
économie), also found at the first level of the five general-interest directories. Here again,
MSN focuses largely on North American sites, with particular emphasis on financial topics
("financial", "bank", "investment", "shares", etc.). Nomade seems to be oriented more towards
the tourism industry ("visit", "restaurant", "rating", "hotel", etc.). Yahoo and Open Directory
tend to favour international trade and the manufacturing sector ("truck", "rubber",
"development", "international", etc.). Lastly, Voila's position appears to be less specific
although it has a slight tendency towards local business (place names are over-represented).
            MSN     Nomade  Open Dir.  Voila   Voila PP  Yahoo
POS main categories
Nouns       78.5%   66.9%   73.8%      74.0%   70.6%     78.9%
Verbs       5.1%    16.4%   8.4%       9.8%    12.8%     5.6%
Adjectives  14.9%   12.3%   15.1%      13.2%   12.2%     14.2%
Adverbs     1.5%    4.4%    2.7%       3.0%    4.4%      1.3%
Verbs and pronouns: person/number
1 SG        1.3%    0.9%    0.9%       0.2%    8.8%      1.0%
2 SG        9.0%    0.9%    9.6%       2.3%    3.7%      26.0%
3 SG        79.7%   51.2%   60.6%      54.3%   46.8%     57.2%
1 PL        1.0%    1.3%    2.9%       0.2%    3.6%      0.9%
2 PL        2.2%    38.3%   15.5%      34.6%   30.1%     3.5%
3 PL        6.7%    7.4%    10.5%      8.5%    7.0%      11.5%
Table 6 - Distribution of main POS categories and person/number for each directory
In addition, we performed part-of-speech (POS) tagging of all the site descriptions for
each Web directory in order to identify its stylistic specificities. We thus analyzed the distribution
of the main POS categories and of the person/number of verbs and pronouns for each directory (see
Table 6). This analysis revealed a contrast between two presentational attitudes: on the one
hand, directories as "helpful guides", with an over-representation of verbs and the 2nd person
plural (typically: "You will find on this site…"): Nomade, Voila, Voila Pages Perso; on the
other hand, directories as "neutral information relays", with very few verbs and a dominance of
nouns, adjectives and the 3rd person singular: MSN, Yahoo and Open Directory.
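In Table 6 the four main POS percentages of each directory sum to 100, which suggests that the shares are computed over nouns, verbs, adjectives and adverbs only. The sketch below shows that normalization on tokens that have already been tagged; the tagger itself is not shown, and the tag labels and the sample description are assumptions.

```python
from collections import Counter

# Hypothetical tag labels; any tagger's own labels could be mapped to these four.
MAIN_POS = ("NOUN", "VERB", "ADJ", "ADV")


def pos_distribution(tagged_tokens):
    """Percentage of each main POS category among tokens tagged with one of the four."""
    counts = Counter(tag for _, tag in tagged_tokens if tag in MAIN_POS)
    total = sum(counts.values())
    return {tag: (100 * counts[tag] / total if total else 0.0) for tag in MAIN_POS}


# Invented, pre-tagged description in the "helpful guide" style
tagged = [
    ("vous", "PRON"), ("trouverez", "VERB"), ("sur", "ADP"), ("ce", "DET"),
    ("site", "NOUN"), ("un", "DET"), ("guide", "NOUN"), ("complet", "ADJ"),
]
distribution = pos_distribution(tagged)
print(distribution)
```

The person/number rows of the table could be obtained the same way by counting a person/number feature on verbs and pronouns instead of the POS tag.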
7. Discussion
Most earlier studies on Web directories were undertaken by researchers and specialists
in library science and documentation (Bertonèche 2001) (Chan, Xia et al. 1999) (Van der
Walt 1998) (Vizine-Goetz 1996). Other studies have focused on directories as systems of
reference classification, and have used them as a resource for automatic classification of
documents (Mladenic 1998), (Labrou and Finin 1999). Two studies in particular attracted our
attention in so far as they present a comparison of several Web directories (Bertonèche 2001),
(Van der Walt 1998). In both cases the approach is purely qualitative and the findings are
very close to our own, i.e. wide diversity, even heterogeneity, among the directories studied.
Our study has the particularity of combining a qualitative approach ("manual" analysis
of directories and critical analysis of their modes of organization), and a quantitative approach
based on statistical and formal processing: exploitation of the textual content of site
descriptions in directories and of the structure of the directory as a graph.
Specialists in cataloguing and library science recommend taking into account the
different classificatory and methodological theories developed by their disciplines, and
applying them to the cataloguing of Internet sites. Our view is that this classificatory approach
cannot be transposed as such to the Internet world, for at least two reasons. The first is that the
Web is not an encyclopedia of knowledge and is not comparable to a library. It offers content,
services and, more generally, resources of all kinds, with a very wide variety of topics and
quality. The second reason lies in the diversity of contexts in which the Internet is used, and
of user profiles and needs. This is a very different situation from that of readers in a library or other
documentary resource center, whose needs can be determined a priori and whose profiles
have been defined on the basis of a long history of practice.
We agree with library science specialists on the need for more rigor in the construction
of Web directories. We also think that general-interest Web directories, like those that
currently exist, have possibly reached their limits. The management of a large number of sites
and categories seems to pose problems manifested in a lack of coherence. Two areas appear to
be emerging for further development of Web directories suited to current trends in their use:
"regional" directories, specialized by sector of activity, geographic or cultural domain, etc.,
and systems based on collaborative evaluation and cataloguing of web sites within "interest
communities".
8. References
Bertonèche, J. (2001). “L'Internet-bibliothèque : accéder au savoir ou se l'approprier?”
SPIRALE - Revue de Recherches en Education (28): 195-214.
Chan, L. M., L. Xia, et al. (1999). Structural and multilingual approaches to subject access on
the web. 65th IFLA Council and General Conference, Bangkok, Thailand.
Labrou, Y. and T. Finin (1999). Yahoo! as an Ontology: Using Yahoo! Categories to Describe
Documents. CIKM.
Mladenic, D. (1998). Turning Yahoo into an Automatic Web-Page Classifier. European
Conference on Artificial Intelligence.
Reinert, M. (1993). “Les "mondes lexicaux" et leur logique.” Langage et société (66): 5-39.
Van der Walt, M. (1998). The Structure of Classification Schemes Used in Internet Search
Engines. Fifth International ISKO Conference, Lille, France, Ergon Verlag.
Vizine-Goetz, D. (1996). Using Library Classification Schemes for Internet Resources. OCLC
Internet Cataloging Project Colloquium, San Antonio (Texas).