se advances we choose tothey are restricted to rather small systems, and often
investigate in detail the collaboration network of scien-view networks as static graphs, whose nodes are individ-
tists.uals and links represent various quantiﬁable social inter-
Recently Newman has taken an important step to-actions.
wards applying modern network ideas to collaborationIn contrast, recent approaches with methodology
networks [10,11]. He studied several large database fo-rooted in statistical physics focus on large networks,
cusing on several ﬁelds of research over a ﬁve year pe-searching for universalities both in the topology of the
riod,establishingthatcollaborationnetworkshavealltheweb and in the dynamics governing it’s evolution. These
general ingredients of small world networks: they have acombined theoretical and empirical results have opened
1
arXiv:cond-mat/0104162v1 [cond-mat.soft] 10 Apr 2001surprisinglyshortnode-to-nodedistanceandalargeclus- the WWW.
tering coeﬃcient [10], much largerthan the one expected Our work stands on three pillars. First, we use direct
from a random Erdo˝s-R´enyi type network of similar size measurementsontheavailabledatatouncoverthemech-
and average connectivity. Furthermore, the degree dis- anismofnetworkevolution. Thisimpliesdeterminingthe
tribution appears to follow a power law [11]. diﬀerent parameters and uncovering the various compet-
Our study takes a diﬀerent, but complementary ap- ing processes present in the system. Second, building
proach to collaboration networks than that followed by on the mechanisms and parameters revealed by the mea-
Newman. We view collaboration networks as prototype surements we construct a model that allows us to inves-
of evolving networks, where the accent is on dynamics tigate the large scale topology the system, as well as its
and evolution. Indeed, the co-authorship network con- dynamical features. The predictions oﬀered by a contin-
stantly expands by the addition of new authors to the uum theory of the model allow us to explain some of the
database, as well as the addition of new internal links results that were uncovered by ours, as well Newman’s
representingpapers co-authoredby authorsthat wereal- measurements. Thethirdandﬁnalstep willinvolvecom-
readypartofthe database. The topologicalpropertiesof puter simulations of the model, serving several purposes:
these networks are determined by these dynamical and (i) It allows us to investigate quantities that could not
growth processes. Consequently, in order to understand be extracted from the continuum theory; (ii) Veriﬁes the
their topology, we ﬁrst need to understand the dynam- predictionsofthecontinuumtheory;(iii)Allowsustoun-
ical process that determines their evolution. In this as- derstandthe natureofthe measurementswecanperform
pect Newman’s study focuses on the static properties of on the network, explaining some apparent discrepancies
the collaboration graph, while our work investigates the between the theoretical and the experimental results.
dynamical properties of these networks. We show that
such dynamical approach can explain many of the static
II. DATABASES: CO-AUTHORSHIP INtopological features seen in the collaboration graph.
MATHEMATICS AND NEURO-SCIENCEIt is important to emphasize that the properties of the
co-authorship network are not unique. The WWW is
also a complex evolving network, where nodes and links For each research ﬁeld whose practitioners collaborate
areadded(andremoved)ataveryhighrate,thenetwork in publications one can deﬁne a co-authorship network
topology being profoundly determined by these dynami- which is a reﬂection of the professional links between the
calfeatures[3,20,21,25]. TheactornetworkofHollywood scientists. In this network the nodes are the scientists
is very similar to the co-authorship network, because it and two scientists are linked if they wrote a paper to-
grows through the addition of new nodes (actors) and gether. In order to get information on the topology of a
new links (movies linking existing actors) [2,4,14]. Sim- scientiﬁcco-authorshipweboneneedsacompletedataset
ilarly, the nontrivial scaling properties of many cellular ofthe published papers, ideally from the birth ofthe dis-
[23], ecological [24] or business networks are all deter- cipline until today. However, computer databases cover
mined by dynamical processes that contributed to the at most the past several decades. Thus any study of this
emergence of these networks. So why single out the col- kind needs to be limited to only a recent segment of the
laboration network as a case study? A number of fac- database. This will impose unexpected challenges, that
tors have contributed to this choice. First we needed a need to be addressed, since such limited data availability
network for which the dynamical evolution is explicitly is a general feature of most networks.
available. That is, in addition to a map of the network The databases considered by us contain article titles
topology, it is important to know the time at which the and authors of all relevant journals in the ﬁeld of mathe-
nodes and links have been added to the network, crucial matics (M) and neuro-science (NS), published in the pe-
forrevealingthenetworkdynamics. Thisrequirementre- riod 1991-98. We have chosen these two ﬁelds for several
duces the currently available databases to two systems: reasons. A ﬁrst factor was the size of the database: bio-
the actor network, where we can follow the dynamics logicalsciencesorphysicsareordersofmagnitude larger,
by recording the year of the movie release, and the col- toolargetoaddresstheirpropertieswithreasonablecom-
laboration network for which the paper publication year puting resources. Second, the selected two ﬁelds oﬀer
allows us to track the time evolution. Of these two, the suﬃcientdiversityby displayingdiﬀerentpublishing pat-
co-authorship data is closer to a prototypical evolving terns: in NS collaborationis intense, while mathematics,
network than the Hollywood actor database for the fol- although there is increasing tendency towards collabora-
lowing reasons: in the science collaboration network the tion [26], is still a basically single investigator ﬁeld.
co-authorship decision is made entirely by the authors, In mathematics our database contains 70,975 diﬀerent
i.e. decisionmakingisdelegatedtothelevelofindividual authors and 70,901 papers for an interval spanning eight
nodes. In contrast, for actors the decision often lies with years. In NS the number of diﬀerent authors is 209,293
the casting director, a level higher than the node. While and the number of published papers is 210,750. A com-
in the long run this diﬀerence is not particularly impor- plete statistics for the two considered database is sum-
tant, the collaboration network is still closer in spirit to marized in Fig. 1, where we plot the cumulative number
a prototypicalevolvingnetworksuch as social systemsor
2of papers and authors for the period 1991-98. We con- the parameters that are crucial to the understanding of
sider ”newauthor”an authorwho wasnotpresentin the the processes which determine the network topology, of-
database from 1991 up to a given year. feringinputfortheconstructionofanappropriatemodel.
(a) A. Degree distribution follows a power-law
0.3
NS
2
0.2
A quantity that has been much studied lately for vari-NSM
0.1 ous networks is the degree distribution, P(k), giving the
t probability that a randomly selected node has k links.
1 NetworksforwhichP(k)hasapower-lawtail, areknown
M as scale-freenetworks[3,13]. Onthe otherhand, classical
network models, including the Erdo˝s-R´enyi [27,28] and
the Watts and Strogatz [4] models have an exponentially
0 decaying P(k) and are collectively known as exponential1991 1992 1993 1994 1995 1996 1997 1998
t networks. ThedegreedistributionsofboththeMandNS
data indicate that collaboration networks are scale-free.
(b)
0.4 The power-law tail is evident from the raw, uniformly
2 NS
binned data (Fig. 2a,b), but the scaling regime is better
NS0.2 seen on the plot that uses logarithmic binning, reducing
M
the noise in the tail (Fig. 2c). The cumulative data with
t logarithmic binning indicates γ = 2.4 and γ = 2.1M NS
1 for the two databases [29].
M
5 5
10 10
(b)(a)
1998 1998
1993 19934 4
slope −210 10 slope −20 slope −3 slope −31991 1992 1993 1994 1995 1996 1997 1998
t 3 3
10 10
FIG. 1. (a) Cumulative number of papers for the M and
2 2
10 10
NS databases in the period 1991-98. The inset shows the
1 1number of papers published each year. (b) Cumulative num- 10 10
M NS
ber of authors (nodes) for the M and NS databases in the
0 0
10 10
0 1 2 3 0 1 2 3period 1991-98. The inset shows the number of new authors 10 10 10 10 10 10 10 10
kadded each year.
(c)
010
Before proceeding we need to clarify a few method-
ological issues that aﬀect the data analysis. First, in
−2
10the database the authors are represented by their sur-
NSname and initials of ﬁrst and middle name, thus there is
a source of error in distinguishing some of them. Two
−4
10
diﬀerent authors with the same initials and surname will M
appeartobethesamenodeinthedatabase. Thiserroris
−610important mainly for scientists of Chinese and Japanese
0 1 2 310 10 10 10
descent. Second, seldom a given author uses one or two k
initials in diﬀerentpublications, andin suchcaseshe/she
FIG. 2. Degree distribution for the (a) M and (b) NS
will appear as separate nodes. Newman [10] showed that database,showingthedatabasedonthecumulativeresultsup
the error introduced by those problems is of the order toyeas 1993 (×)and 1998 (•). (c) Degree distribution shown
of a few percents. Our results are also aﬀected by these with logarithmic binning computed from the full dataset cu-
methodological limitations, but we do not expect that it mulative up to 1998. The lines correspond do the best ﬁts,
will have a signiﬁcant impact on our results. and have the slope 2.1 (NS, dotted) and 2.4 (M, dashed).
III. DATA ANALYSIS We will see in the coming sections that the data indi-
cates the existence of two scaling regimes with two dif-
ferent scaling exponents. The combination of these twoIn this sectionwe investigatethe topologyand dynam-
regimescouldeasilygivetheimpressionofanexponentialicsofthetwodatabases,MandNS.Ourgoalistoextract
3
5 5
N ( )x10 Np ( ) x10
P(k)
P(k)cutoﬀ in the P(k) for large k. Further analysis, oﬀered asymptotic limit, in which diﬀerent relevant quantities
in sections V-VII, indicates that a consideration of two take up a stationary value. The smaller separation for
scaling regimes oﬀers a more accurate description. the NS ﬁeld is expected, since mathematicians tend to
work in smaller groups and write papers with fewer co-
authors.
B. Average separation decreases in time
C. Clustering coeﬃcient decays with timeTheabilityoftwonodes, iandj, tocommunicatewith
eachotherdependsonthelengthoftheshortestpath,l ,ij
between them. The average of l over all pairs of nodes An important phenomena characterizingthe deviationij
is denoted by d=, and we call it the averagesep- of real networks from the completely random ER modelij
aration of the network, characterizing the networks in- is clustering. The clustering coeﬃcient, a quantitative
terconnectedness. Large networks can have surprisingly measure of this phenomena, C, can be deﬁned as fol-
lows [4]: pick a node, i that has links to k other nodessmall separation,explainingthe originofthe small-world i
concept [2,19]. Determining the average separation in in the system. If these k nodes form a fully connectedi
a large network is a rather time-consuming procedure. clique, there are k (k − 1)/2 links between them, buti i
Usually sampling a fraction of all nodes and determining in reality we ﬁnd much fever. Let us denote by N thei
their distance from all other points gives reasonable re- number of links that connect the selected k nodes toi
sults. The results for the cumulative database are shown each other. The clustering coeﬃcient for node i is then
in Fig. 3. C = 2N /k (k − 1). The clustering coeﬃcient for thei i i i
whole network is obtained by averagingC overall nodesi
in the system, i.e. C =< C > . In simple terms thei i20
clustering coeﬃcient of a node in the co-authorship net-
18
work tells us how much a node’s collaborators are will-M
16 ing to collaborate with each other, and it represents the
14 probability that two of it’s collaborators wrote a paper
together. The clustering coeﬃcient for the cumulative
12
network as a function of time is shown in Fig. 4.
10
NS
8
1
6
4 0.9
1991 1992 1993 1994 1995 1996 1997 1998
t
NS
0.8FIG. 3. Average separation in the M and NS databases.
The separation is computed on the cumulative data up to the
0.7indicatedyear. Theerrorbarsindicatethestandarddeviation
of the distances between all pairs of nodes. M
0.6
0.5The ﬁgure indicates that d decreases with time, which 1991 1992 1993 1994 1995 1996 1997 1998
tis highly surprising because all network models so far
predict that the average separation should increase with FIG. 4. Clustering coeﬃcient of the M and NS database,
system size [20,28]. The decreasing trend observed by us determined for the cumulative data up to the year indicated
could have two diﬀerent origins. First, it is possible that on the t axis.
indeed, the separationdoesdecreaseasinternallinks, i.e.
papers written by authors that were previously part of
thedatabase,the increaseinterconnectivity,thusdecreas- Theresults,inagreementwiththeseparationmeasure-
ing the diameter. Second, the decreasing diameter could ments, suggest a stronger interconnectedness for the NS
beaconsequenceofthefactthatwehavenoaccesstothe compared with M, and a slow convergence in time to an
full database, but only starting from year 1991. As we asymptotic value.
demonstrate in sect. VI, such incomplete dataset could
resultin an apparentlydecreasingseparationeven if oth-
erwise for the full system the separation increases. D. Relative size of the largest cluster increases
Onecannotetheslowconvergenceofthediameterand
the moreconnected nature ofthe NS ﬁeld expressed by a It is important to realize that the collaboration net-
smaller separation. The slow convergence indicates that work is fragmented in many clusters. There are several
perhaps even longer time interval is needed to reach the reasons for this. First, in every ﬁeld there are scientists
4
d
Cthatdonotcollaborateatall,thatistheyaretheonlyau- the fact that we are reconstructing the already existing
thors of all papers on which their name appears. This is giantcluster,anditisonlyapartialmeasureofit’semer-
more frequent in mathematics, which despite an increas- gence.
ing tendency towardcollaboration[26], is still more frag- Finally, the fast convergence of the NS cluster size to
mented thanphysicsorneuralscience. Second, and most an approximately stationary value around 0.9 indicates
important, the database contains papers published only that after 1994the networkreached a roughly stationary
after 1990. Thus there is a possibility that two authors topology,i.e. thebasicalliancesareuncovered. Thisdoes
co-authored a paper before 1990, but in our database not seems to be the case for M, where after ten years r
they appear as disconnected. still increases, perhaps due to smaller publication and
If we look only at a single year, we see many isolated collaboration rate in the ﬁeld.
clusters of authors. The cumulative dataset containing
several years develops a giant cluster, that contains a
E. Average degree increaseslarge fraction of the authors. To investigate the emer-
gence of this giant connected component we measured
the relative size of the largest cluster, r, giving the ra- With time the number of nodes in our co-authorship
tio between the number of nodes in the largest cluster network increases due to arrival of new authors. The
and the total number of nodes in the system. A cluster total number of links also increases through the connec-
is deﬁned as a subset of nodes interconnected by links. tions made by new authors with old ones and by new
Results from our cumulative co-authorship networks are connections between old authors. A quantity character-
presented in Fig. 5. As expected, in M the fraction of izing the network’s interconnectedness is the average de-
clustered researchers is considerably smaller than in NS. gree < k >, giving the average number of links per au-
thor. The time dependence of < k > for the cumulative
network is shown in Fig. 6, indicating an approximately
1
linear increase of with time.
0.8
NS 12
0.6
10 NS
0.4 M
8
0.2 6
M
40
1991 1992 1993 1994 1995 1996 1997 1998
t
2
FIG. 5. Relative size of the largest cluster for the M and
0NS database. Results are computed on the cumulative data 1991 1992 1993 1994 1995 1996 1997 1998
tup to the given year.
FIG. 6. Average number of links per node (< k >) for the
Mand NS database. Resultsare computed on thecumulative
The continuous increase in r may appear as the sce-
data up to the given year.
nariocommonlydescribedaspercolation[30]orthemuch
studiedemergenceofthegiantcomponentinrandomnet-
works [28]. However, the process leading to this giant This is a rather important deviation from the major-
cluster is fundamentally diﬀerent from these much stud-
ity of currently existing evolving network models, that
iedphenomena. Inmostresearchﬁelds,apartfromavery assume a constant < k > as the network expands. As
small fraction of authors that do not collaborate, all au-
expected, the average degree for M is much smaller than
thors belong to a single giant cluster from the very early for NS.
stagesoftheﬁeld. Thatis,thesystemisalmostfullycon-
nected from the very ﬁrst moment. The only reason why
the giant cluster in our case grows so dramatically in the F. Node selection is governed by preferential
ﬁrst several years is that we are missing the information attachment
on the network topology before 1991. A good example is
the actor network, where the huge majority of the actors
Classical network models assume complete random-
are part of the large cluster at any stage of the network,
ness, i.e. the nodes are connected to each other inde-
starting from early 1900’s until today. However, if we
pendent of the the number of links they already had
would start recording collaborations only after 1990 for
[2,27,28]. The discovery of the power-law connectivity
example, the data would indicate, incorrectly, that many
distribution required the development of new modeling
actors are disconnected. The increasing r indicates only
5
r
3 6paradigms. A much used assumption is that in scale- 10 10
slope 1free networks nodes link with higher probability to those
slope 1
1999
2 1998 1997 nodes that already have a larger number of links, a phe- 10
4 1996 1995 10 1994
nomena labeled as preferential attachment [3,13]. Im-
1
10plicitly or explicitly, preferential attachment is part of
2
10all network models that aim to explain the emergence of
0
10theinhomogeneousnetworkstructureandpowerlawcon-
0nectivity distribution [5–8]. The availability of dynamic 10
−1
10dataonthe networkdevelopmentallowsustoinvestigate
M NS
its presence in the co-authorship network. For this net-
−2 −2
10 10
0 1 2 3 0 1 2 3 4work preferential attachment appears at two levels, that 10 10 10 10 10 10 10 10 10
kwe discuss separately.
(i) New nodes: For a new author, that appears for the FIG. 7. Cumulated preferential attachment (κ(k)) of in-
ﬁrst time on a publication, preferential attachment has coming new nodes for the M and NS database. Results com-
a simple meaning: it is more likely that the ﬁrst paper puted by considering the new nodes coming in the speciﬁed
will be co-authored with somebody that already has a year, and the network formed by nodes already present up to
large number of co-authors (links) that with somebody this year. In the absence of preferential attachment κ(k)∼ k,
lessconnected. As aresult”old”authorswith morelinks shown as continuous line on the ﬁgures.
will increase their number of co-authors at a higher rate
than those with fever links. To investigate this process
in quantitative terms we determined the probability that (ii) Internal links: A large number of new links ap-
an old author with connectivity k is selected by a new pear between old nodes as the network evolves, repre-
author for co-authorship. This probability deﬁnes the senting papers written by authors that were part of the
Π(k) distribution function. Calling ”old authors” those network, but did not collaborate before. Such internal
present up to the last year, and ”new author” those who linksareknowntoeﬀectboththetopologyanddynamics
wereaddedduringthelastyear,wedeterminethechange of the network [5]. These internal links are also subject
in the number of links, Δk, for an old author that at the to preferential attachment. We studied the probability
beginning of the last year had k links. Plotting Δk as a Π(k ,k ) that an old author with k links forms a new1 2 1
function of k gives the function Π(k), describing the na- linkwith anotheroldauthorwithk links. TheΠ(k ,k )2 1 2
ture of preferential attachment. Since the measurements probability map can be calculated by dividing N(k ,k ),1 2
are limited to only a ﬁnite (ΔT = 1 year) interval, we the number of new links between authors with k and k1 2
improve the statistics by plotting the integral of Π(k): links, with the D(k ,k ), number of pairs of nodes with1 2
connectivities k and k present in the system:1 2Z k
′ ′κ(k) = Π(k )dk . (1) N(k ,k )1 2
1 Π(k ,k )= . (2)1 2
D(k ,k )1 2
Ifpreferentialattachmentisabsent, Π(k)shouldbein-
ThethreedimensionalplotofΠ(k ,k )isshowninFig.8,dependent of k, as each node grows independently of it’s 1 2
the overall behavior indicating preferential attachment:degree,andκ(k)isexpectedtobelinear. AsFig.7shows,
ν+1 Π(k ,k ) increases with as either k or k ’s increase.1 2 1 2we ﬁnd that κ(k) isnonlinear, increasingasκ(k)∼ k ,
where the best ﬁt gives ν ≃ 0.8 for M and ν ≃ 0.75 for
νNS. This implies that Π(k) ∼ k , where ν is diﬀerent
0 M NS
10from 1 [31]. As simulations have shown, such nonlinear 010
−2
10
−2dependence generates deviations from a power law P(k) 10
−410
−4[31]. This was supported by analytical calculations [8], 10
−610
−6that demonstrated that the degree distribution follows 10
1000 10000a power law only for ν = 1. The consequence of this
1000
100
1 1 100nonlinearity will be discussed below. 1010 10 k 102 100 k2100 10001 1k k1 1000 1 10000
FIG. 8. Internal preferential attachment for the M and NS
database, 3D plots: Δk as a function of k and k . Results1 2
computed on the cumulative data in the last considered year.
A natural hypothesis is to assume that Π(k ,k ) fac-1 2
torizesinto the productk k . AsFig. 9shows, weindeed1 2
ﬁnd that
6
Δk6
Z k k1 2
node in unit time, we write the probability that between′ ′ ′ ′
κ(k k )= Π(k k )d(k k ) (3)1 2 1 2 1 2 node i and j a new internal link is created as
1
k ki jcan be well approximated with a slope 2 as a function Π = P N(t)a, (6)ij
′
of k k , indicating that for internal links the preferential k k1 2 s ms,m
attachment is linear in the degree.
where the prime sign indicates that the summation is
6
10000 10 done for s = m values.
slope 1 slope 1
1999 1998 The measurements also indicated (Fig. 7) that new
4
10 nodes link to the existing nodes with preferential attach-100 NSM
ν
2 ment, Π(k) follows k with ν ≃ 0.75−0.8. Aiming to
10
obtain an analytically solvable model, at this point we1
0
10 neglect this nonlinearity and we approximate Π(k) with
a linear k dependence. The eﬀect of the nonlinearities0
−2
10
will be discussed in sect. VII. Thus, if node i has ki
−4 links, the probabilitythat an incoming node will connect0 10 2 3 4 5 6 7
10 100 1000 10000 100000 10 10 10 10 10 10
to it is given byk k1 2
FIG. 9. Cumulated internal preferential attachment (κ(k)) kiPfor the M and NS database, scaling as a function of the k k Π = b , (7)1 2 i
kjjproduct. Resultscomputedonthecumulativedatainthelast
considered year. The straight lines have slope 1, expected for
where b is the average number of new links that an in-
κ(k k ) if there would be no preferential attachment.1 2 coming node creates.
We have thus formulated the dynamical rules that
govern our evolving network model, capturing the basic
mechanism governing the evolution of the co-authorshipIV. MODELING THE WEB OF SCIENCE
network:
In this section we use the obtained numerical results 1. Nodes join the network at a constant rate.
to construct a simple model for the evolution of the co-
2. Incoming nodes link to the already present nodesauthorship network. It is important to emphasize that
following preferential attachment (7).
the purpose of the model is to capture the main mech-
anisms that aﬀect the evolution and the scaling of the 3. Nodes already present in the network form new in-
network, and not to incorporate every numerical detail ternal links following preferential attachment (6).
ofthe measuredweb. However,the advantageofthe pro-
4. We neglect the aging of nodes, and assume that allposedmodelisitsﬂexibility: features,neglectedhere,can
nodes and links present in the system are active,be incorporated into the current modeling framework.
able to initiate and receive new links.We denote by k (t) the number of links node i has ati
time t; by T(t) and N(t) the total number of links and
In the model we assume that the number of authors
total numberofnodes attime t, respectively. We assume
on a paper is constant. In reality m is a stochastic vari-
that all nodes present in the system are active, i.e. they
able, as the number of authors varies from paper to pa-
can author further papers. This is a reasonable assump-
per. However, for the scale-free model the exponent γ is
tion as the time-span over which data is available to us
knowntobeindependentofm, thusmakingmastochas-
is shorter than the professional lifetime of a scientist. In
tic variable is not expected to change the scaling behav-
agreement with Fig. 1, we consider that new researchers
ior.
join the ﬁeld at a constant rate, leading to
N(t) =βt. (4)
V. CONTINUUM THEORY
The average number of links per node in the system at
Taking into account that new links join the systemtime t is thus given by:
with a constant rate, β, the continuum equation for the
T(t) evolutionofthenumberoflinksnodeihascanbewritten
= . (5)
as:N(t) Xdk bβk k ki i i j′Fig.9suggests,thattheprobabilitytocreateanewinter- P= +N(t)a P . (8)
′dt kj k ks mnal link between two existing nodes is proportional with j s,mj
the product of their connectivities. Consequently, denot-
Theﬁrsttermontherighthandsidedescribesthecontri-ing by a the number of newly created internal links per
bution due to new nodes (7) and the second term gives
7
κ (k k )Δk
1 2dtithe new links created with already existing nodes (6).