La lecture en ligne est gratuite
Le téléchargement nécessite un accès à la bibliothèque YouScribe
Tout savoir sur nos offres
Télécharger Lire

Evolution of the social network of scientific collaborations

14 pages
Evolution of the social network of scientific collaborations1,2 1 1,2,∗ 1 3 2,4A.L. Baraba´si , H. Jeong , Z. N´eda , E. Ravasz , A. Schubert , T. Vicsek1Department of Physics, University of Notre Dame, Notre Dame, IN 46556, USA2 Collegium Budapest, Institute of Advanced Study, Budapest, Hungary3 Bibliometric Service, Library of the Hungarian Academy of Sciences, Budapest, Hungary4Department of Biological Physics, E¨otv¨os Lor´and University, Budapest, Hungary(Last revised February 1, 2008)The co-authorship network of scientists represents a prototype of complex evolving networks. Inaddition, it offers one of the most extensive database to date on social networks. By mapping theelectronic database containing all relevant journals in mathematics and neuro-science for an eight-yearperiod(1991-98), weinferthedynamicandthestructuralmechanismsthatgoverntheevolutionandtopology ofthiscomplexsystem. Threecomplementaryapproachesallowustoobtainadetailedcharacterization. First, empirical measurements allow us to uncover the topological measures thatcharacterize the network at a given moment, as well as the time evolution of these quantities.The results indicate that the network is scale-free, and that the network evolution is governed bypreferential attachment, affecting both internal and external links. However, in contrast with mostmodel predictions the average degree increases in time, and the node separation decreases. Second,we propose a simple model that ...
Voir plus Voir moins
se advances we choose tothey are restricted to rather small systems, and often investigate in detail the collaboration network of scien-view networks as static graphs, whose nodes are individ- tists.uals and links represent various quantifiable social inter- Recently Newman has taken an important step to-actions. wards applying modern network ideas to collaborationIn contrast, recent approaches with methodology networks [10,11]. He studied several large database fo-rooted in statistical physics focus on large networks, cusing on several fields of research over a five year pe-searching for universalities both in the topology of the riod,establishingthatcollaborationnetworkshavealltheweb and in the dynamics governing it’s evolution. These general ingredients of small world networks: they have acombined theoretical and empirical results have opened 1 arXiv:cond-mat/0104162v1 [cond-mat.soft] 10 Apr 2001 surprisinglyshortnode-to-nodedistanceandalargeclus- the WWW. tering coefficient [10], much largerthan the one expected Our work stands on three pillars. First, we use direct from a random Erdo˝s-R´enyi type network of similar size measurementsontheavailabledatatouncoverthemech- and average connectivity. Furthermore, the degree dis- anismofnetworkevolution. Thisimpliesdeterminingthe tribution appears to follow a power law [11]. different parameters and uncovering the various compet- Our study takes a different, but complementary ap- ing processes present in the system. Second, building proach to collaboration networks than that followed by on the mechanisms and parameters revealed by the mea- Newman. We view collaboration networks as prototype surements we construct a model that allows us to inves- of evolving networks, where the accent is on dynamics tigate the large scale topology the system, as well as its and evolution. Indeed, the co-authorship network con- dynamical features. The predictions offered by a contin- stantly expands by the addition of new authors to the uum theory of the model allow us to explain some of the database, as well as the addition of new internal links results that were uncovered by ours, as well Newman’s representingpapers co-authoredby authorsthat wereal- measurements. Thethirdandfinalstep willinvolvecom- readypartofthe database. The topologicalpropertiesof puter simulations of the model, serving several purposes: these networks are determined by these dynamical and (i) It allows us to investigate quantities that could not growth processes. Consequently, in order to understand be extracted from the continuum theory; (ii) Verifies the their topology, we first need to understand the dynam- predictionsofthecontinuumtheory;(iii)Allowsustoun- ical process that determines their evolution. In this as- derstandthe natureofthe measurementswecanperform pect Newman’s study focuses on the static properties of on the network, explaining some apparent discrepancies the collaboration graph, while our work investigates the between the theoretical and the experimental results. dynamical properties of these networks. We show that such dynamical approach can explain many of the static II. DATABASES: CO-AUTHORSHIP INtopological features seen in the collaboration graph. MATHEMATICS AND NEURO-SCIENCEIt is important to emphasize that the properties of the co-authorship network are not unique. The WWW is also a complex evolving network, where nodes and links For each research field whose practitioners collaborate areadded(andremoved)ataveryhighrate,thenetwork in publications one can define a co-authorship network topology being profoundly determined by these dynami- which is a reflection of the professional links between the calfeatures[3,20,21,25]. TheactornetworkofHollywood scientists. In this network the nodes are the scientists is very similar to the co-authorship network, because it and two scientists are linked if they wrote a paper to- grows through the addition of new nodes (actors) and gether. In order to get information on the topology of a new links (movies linking existing actors) [2,4,14]. Sim- scientificco-authorshipweboneneedsacompletedataset ilarly, the nontrivial scaling properties of many cellular ofthe published papers, ideally from the birth ofthe dis- [23], ecological [24] or business networks are all deter- cipline until today. However, computer databases cover mined by dynamical processes that contributed to the at most the past several decades. Thus any study of this emergence of these networks. So why single out the col- kind needs to be limited to only a recent segment of the laboration network as a case study? A number of fac- database. This will impose unexpected challenges, that tors have contributed to this choice. First we needed a need to be addressed, since such limited data availability network for which the dynamical evolution is explicitly is a general feature of most networks. available. That is, in addition to a map of the network The databases considered by us contain article titles topology, it is important to know the time at which the and authors of all relevant journals in the field of mathe- nodes and links have been added to the network, crucial matics (M) and neuro-science (NS), published in the pe- forrevealingthenetworkdynamics. Thisrequirementre- riod 1991-98. We have chosen these two fields for several duces the currently available databases to two systems: reasons. A first factor was the size of the database: bio- the actor network, where we can follow the dynamics logicalsciencesorphysicsareordersofmagnitude larger, by recording the year of the movie release, and the col- toolargetoaddresstheirpropertieswithreasonablecom- laboration network for which the paper publication year puting resources. Second, the selected two fields offer allows us to track the time evolution. Of these two, the sufficientdiversityby displayingdifferentpublishing pat- co-authorship data is closer to a prototypical evolving terns: in NS collaborationis intense, while mathematics, network than the Hollywood actor database for the fol- although there is increasing tendency towards collabora- lowing reasons: in the science collaboration network the tion [26], is still a basically single investigator field. co-authorship decision is made entirely by the authors, In mathematics our database contains 70,975 different i.e. decisionmakingisdelegatedtothelevelofindividual authors and 70,901 papers for an interval spanning eight nodes. In contrast, for actors the decision often lies with years. In NS the number of different authors is 209,293 the casting director, a level higher than the node. While and the number of published papers is 210,750. A com- in the long run this difference is not particularly impor- plete statistics for the two considered database is sum- tant, the collaboration network is still closer in spirit to marized in Fig. 1, where we plot the cumulative number a prototypicalevolvingnetworksuch as social systemsor 2 of papers and authors for the period 1991-98. We con- the parameters that are crucial to the understanding of sider ”newauthor”an authorwho wasnotpresentin the the processes which determine the network topology, of- database from 1991 up to a given year. feringinputfortheconstructionofanappropriatemodel. (a) A. Degree distribution follows a power-law 0.3 NS 2 0.2 A quantity that has been much studied lately for vari-NSM 0.1 ous networks is the degree distribution, P(k), giving the t probability that a randomly selected node has k links. 1 NetworksforwhichP(k)hasapower-lawtail, areknown M as scale-freenetworks[3,13]. Onthe otherhand, classical network models, including the Erdo˝s-R´enyi [27,28] and the Watts and Strogatz [4] models have an exponentially 0 decaying P(k) and are collectively known as exponential1991 1992 1993 1994 1995 1996 1997 1998 t networks. ThedegreedistributionsofboththeMandNS data indicate that collaboration networks are scale-free. (b) 0.4 The power-law tail is evident from the raw, uniformly 2 NS binned data (Fig. 2a,b), but the scaling regime is better NS0.2 seen on the plot that uses logarithmic binning, reducing M the noise in the tail (Fig. 2c). The cumulative data with t logarithmic binning indicates γ = 2.4 and γ = 2.1M NS 1 for the two databases [29]. M 5 5 10 10 (b)(a) 1998 1998 1993 19934 4 slope −210 10 slope −20 slope −3 slope −31991 1992 1993 1994 1995 1996 1997 1998 t 3 3 10 10 FIG. 1. (a) Cumulative number of papers for the M and 2 2 10 10 NS databases in the period 1991-98. The inset shows the 1 1number of papers published each year. (b) Cumulative num- 10 10 M NS ber of authors (nodes) for the M and NS databases in the 0 0 10 10 0 1 2 3 0 1 2 3period 1991-98. The inset shows the number of new authors 10 10 10 10 10 10 10 10 kadded each year. (c) 010 Before proceeding we need to clarify a few method- ological issues that affect the data analysis. First, in −2 10the database the authors are represented by their sur- NSname and initials of first and middle name, thus there is a source of error in distinguishing some of them. Two −4 10 different authors with the same initials and surname will M appeartobethesamenodeinthedatabase. Thiserroris −610important mainly for scientists of Chinese and Japanese 0 1 2 310 10 10 10 descent. Second, seldom a given author uses one or two k initials in differentpublications, andin suchcaseshe/she FIG. 2. Degree distribution for the (a) M and (b) NS will appear as separate nodes. Newman [10] showed that database,showingthedatabasedonthecumulativeresultsup the error introduced by those problems is of the order toyeas 1993 (×)and 1998 (•). (c) Degree distribution shown of a few percents. Our results are also affected by these with logarithmic binning computed from the full dataset cu- methodological limitations, but we do not expect that it mulative up to 1998. The lines correspond do the best fits, will have a significant impact on our results. and have the slope 2.1 (NS, dotted) and 2.4 (M, dashed). III. DATA ANALYSIS We will see in the coming sections that the data indi- cates the existence of two scaling regimes with two dif- ferent scaling exponents. The combination of these twoIn this sectionwe investigatethe topologyand dynam- regimescouldeasilygivetheimpressionofanexponentialicsofthetwodatabases,MandNS.Ourgoalistoextract 3 5 5 N ( )x10 Np ( ) x10 P(k) P(k) cutoff in the P(k) for large k. Further analysis, offered asymptotic limit, in which different relevant quantities in sections V-VII, indicates that a consideration of two take up a stationary value. The smaller separation for scaling regimes offers a more accurate description. the NS field is expected, since mathematicians tend to work in smaller groups and write papers with fewer co- authors. B. Average separation decreases in time C. Clustering coefficient decays with timeTheabilityoftwonodes, iandj, tocommunicatewith eachotherdependsonthelengthoftheshortestpath,l ,ij between them. The average of l over all pairs of nodes An important phenomena characterizingthe deviationij is denoted by d=, and we call it the averagesep- of real networks from the completely random ER modelij aration of the network, characterizing the networks in- is clustering. The clustering coefficient, a quantitative terconnectedness. Large networks can have surprisingly measure of this phenomena, C, can be defined as fol- lows [4]: pick a node, i that has links to k other nodessmall separation,explainingthe originofthe small-world i concept [2,19]. Determining the average separation in in the system. If these k nodes form a fully connectedi a large network is a rather time-consuming procedure. clique, there are k (k − 1)/2 links between them, buti i Usually sampling a fraction of all nodes and determining in reality we find much fever. Let us denote by N thei their distance from all other points gives reasonable re- number of links that connect the selected k nodes toi sults. The results for the cumulative database are shown each other. The clustering coefficient for node i is then in Fig. 3. C = 2N /k (k − 1). The clustering coefficient for thei i i i whole network is obtained by averagingC overall nodesi in the system, i.e. C =< C > . In simple terms thei i20 clustering coefficient of a node in the co-authorship net- 18 work tells us how much a node’s collaborators are will-M 16 ing to collaborate with each other, and it represents the 14 probability that two of it’s collaborators wrote a paper together. The clustering coefficient for the cumulative 12 network as a function of time is shown in Fig. 4. 10 NS 8 1 6 4 0.9 1991 1992 1993 1994 1995 1996 1997 1998 t NS 0.8FIG. 3. Average separation in the M and NS databases. The separation is computed on the cumulative data up to the 0.7indicatedyear. Theerrorbarsindicatethestandarddeviation of the distances between all pairs of nodes. M 0.6 0.5The figure indicates that d decreases with time, which 1991 1992 1993 1994 1995 1996 1997 1998 tis highly surprising because all network models so far predict that the average separation should increase with FIG. 4. Clustering coefficient of the M and NS database, system size [20,28]. The decreasing trend observed by us determined for the cumulative data up to the year indicated could have two different origins. First, it is possible that on the t axis. indeed, the separationdoesdecreaseasinternallinks, i.e. papers written by authors that were previously part of thedatabase,the increaseinterconnectivity,thusdecreas- Theresults,inagreementwiththeseparationmeasure- ing the diameter. Second, the decreasing diameter could ments, suggest a stronger interconnectedness for the NS beaconsequenceofthefactthatwehavenoaccesstothe compared with M, and a slow convergence in time to an full database, but only starting from year 1991. As we asymptotic value. demonstrate in sect. VI, such incomplete dataset could resultin an apparentlydecreasingseparationeven if oth- erwise for the full system the separation increases. D. Relative size of the largest cluster increases Onecannotetheslowconvergenceofthediameterand the moreconnected nature ofthe NS field expressed by a It is important to realize that the collaboration net- smaller separation. The slow convergence indicates that work is fragmented in many clusters. There are several perhaps even longer time interval is needed to reach the reasons for this. First, in every field there are scientists 4 d C thatdonotcollaborateatall,thatistheyaretheonlyau- the fact that we are reconstructing the already existing thors of all papers on which their name appears. This is giantcluster,anditisonlyapartialmeasureofit’semer- more frequent in mathematics, which despite an increas- gence. ing tendency towardcollaboration[26], is still more frag- Finally, the fast convergence of the NS cluster size to mented thanphysicsorneuralscience. Second, and most an approximately stationary value around 0.9 indicates important, the database contains papers published only that after 1994the networkreached a roughly stationary after 1990. Thus there is a possibility that two authors topology,i.e. thebasicalliancesareuncovered. Thisdoes co-authored a paper before 1990, but in our database not seems to be the case for M, where after ten years r they appear as disconnected. still increases, perhaps due to smaller publication and If we look only at a single year, we see many isolated collaboration rate in the field. clusters of authors. The cumulative dataset containing several years develops a giant cluster, that contains a E. Average degree increaseslarge fraction of the authors. To investigate the emer- gence of this giant connected component we measured the relative size of the largest cluster, r, giving the ra- With time the number of nodes in our co-authorship tio between the number of nodes in the largest cluster network increases due to arrival of new authors. The and the total number of nodes in the system. A cluster total number of links also increases through the connec- is defined as a subset of nodes interconnected by links. tions made by new authors with old ones and by new Results from our cumulative co-authorship networks are connections between old authors. A quantity character- presented in Fig. 5. As expected, in M the fraction of izing the network’s interconnectedness is the average de- clustered researchers is considerably smaller than in NS. gree < k >, giving the average number of links per au- thor. The time dependence of < k > for the cumulative network is shown in Fig. 6, indicating an approximately 1 linear increase of with time. 0.8 NS 12 0.6 10 NS 0.4 M 8 0.2 6 M 40 1991 1992 1993 1994 1995 1996 1997 1998 t 2 FIG. 5. Relative size of the largest cluster for the M and 0NS database. Results are computed on the cumulative data 1991 1992 1993 1994 1995 1996 1997 1998 tup to the given year. FIG. 6. Average number of links per node (< k >) for the Mand NS database. Resultsare computed on thecumulative The continuous increase in r may appear as the sce- data up to the given year. nariocommonlydescribedaspercolation[30]orthemuch studiedemergenceofthegiantcomponentinrandomnet- works [28]. However, the process leading to this giant This is a rather important deviation from the major- cluster is fundamentally different from these much stud- ity of currently existing evolving network models, that iedphenomena. Inmostresearchfields,apartfromavery assume a constant < k > as the network expands. As small fraction of authors that do not collaborate, all au- expected, the average degree for M is much smaller than thors belong to a single giant cluster from the very early for NS. stagesofthefield. Thatis,thesystemisalmostfullycon- nected from the very first moment. The only reason why the giant cluster in our case grows so dramatically in the F. Node selection is governed by preferential first several years is that we are missing the information attachment on the network topology before 1991. A good example is the actor network, where the huge majority of the actors Classical network models assume complete random- are part of the large cluster at any stage of the network, ness, i.e. the nodes are connected to each other inde- starting from early 1900’s until today. However, if we pendent of the the number of links they already had would start recording collaborations only after 1990 for [2,27,28]. The discovery of the power-law connectivity example, the data would indicate, incorrectly, that many distribution required the development of new modeling actors are disconnected. The increasing r indicates only 5 r 3 6paradigms. A much used assumption is that in scale- 10 10 slope 1free networks nodes link with higher probability to those slope 1 1999 2 1998 1997 nodes that already have a larger number of links, a phe- 10 4 1996 1995 10 1994 nomena labeled as preferential attachment [3,13]. Im- 1 10plicitly or explicitly, preferential attachment is part of 2 10all network models that aim to explain the emergence of 0 10theinhomogeneousnetworkstructureandpowerlawcon- 0nectivity distribution [5–8]. The availability of dynamic 10 −1 10dataonthe networkdevelopmentallowsustoinvestigate M NS its presence in the co-authorship network. For this net- −2 −2 10 10 0 1 2 3 0 1 2 3 4work preferential attachment appears at two levels, that 10 10 10 10 10 10 10 10 10 kwe discuss separately. (i) New nodes: For a new author, that appears for the FIG. 7. Cumulated preferential attachment (κ(k)) of in- first time on a publication, preferential attachment has coming new nodes for the M and NS database. Results com- a simple meaning: it is more likely that the first paper puted by considering the new nodes coming in the specified will be co-authored with somebody that already has a year, and the network formed by nodes already present up to large number of co-authors (links) that with somebody this year. In the absence of preferential attachment κ(k)∼ k, lessconnected. As aresult”old”authorswith morelinks shown as continuous line on the figures. will increase their number of co-authors at a higher rate than those with fever links. To investigate this process in quantitative terms we determined the probability that (ii) Internal links: A large number of new links ap- an old author with connectivity k is selected by a new pear between old nodes as the network evolves, repre- author for co-authorship. This probability defines the senting papers written by authors that were part of the Π(k) distribution function. Calling ”old authors” those network, but did not collaborate before. Such internal present up to the last year, and ”new author” those who linksareknowntoeffectboththetopologyanddynamics wereaddedduringthelastyear,wedeterminethechange of the network [5]. These internal links are also subject in the number of links, Δk, for an old author that at the to preferential attachment. We studied the probability beginning of the last year had k links. Plotting Δk as a Π(k ,k ) that an old author with k links forms a new1 2 1 function of k gives the function Π(k), describing the na- linkwith anotheroldauthorwithk links. TheΠ(k ,k )2 1 2 ture of preferential attachment. Since the measurements probability map can be calculated by dividing N(k ,k ),1 2 are limited to only a finite (ΔT = 1 year) interval, we the number of new links between authors with k and k1 2 improve the statistics by plotting the integral of Π(k): links, with the D(k ,k ), number of pairs of nodes with1 2 connectivities k and k present in the system:1 2Z k ′ ′κ(k) = Π(k )dk . (1) N(k ,k )1 2 1 Π(k ,k )= . (2)1 2 D(k ,k )1 2 Ifpreferentialattachmentisabsent, Π(k)shouldbein- ThethreedimensionalplotofΠ(k ,k )isshowninFig.8,dependent of k, as each node grows independently of it’s 1 2 the overall behavior indicating preferential attachment:degree,andκ(k)isexpectedtobelinear. AsFig.7shows, ν+1 Π(k ,k ) increases with as either k or k ’s increase.1 2 1 2we find that κ(k) isnonlinear, increasingasκ(k)∼ k , where the best fit gives ν ≃ 0.8 for M and ν ≃ 0.75 for νNS. This implies that Π(k) ∼ k , where ν is different 0 M NS 10from 1 [31]. As simulations have shown, such nonlinear 010 −2 10 −2dependence generates deviations from a power law P(k) 10 −410 −4[31]. This was supported by analytical calculations [8], 10 −610 −6that demonstrated that the degree distribution follows 10 1000 10000a power law only for ν = 1. The consequence of this 1000 100 1 1 100nonlinearity will be discussed below. 1010 10 k 102 100 k2100 10001 1k k1 1000 1 10000 FIG. 8. Internal preferential attachment for the M and NS database, 3D plots: Δk as a function of k and k . Results1 2 computed on the cumulative data in the last considered year. A natural hypothesis is to assume that Π(k ,k ) fac-1 2 torizesinto the productk k . AsFig. 9shows, weindeed1 2 find that 6 Δk 6 Z k k1 2 node in unit time, we write the probability that between′ ′ ′ ′ κ(k k )= Π(k k )d(k k ) (3)1 2 1 2 1 2 node i and j a new internal link is created as 1 k ki jcan be well approximated with a slope 2 as a function Π = P N(t)a, (6)ij ′ of k k , indicating that for internal links the preferential k k1 2 s ms,m attachment is linear in the degree. where the prime sign indicates that the summation is 6 10000 10 done for s = m values. slope 1 slope 1 1999 1998 The measurements also indicated (Fig. 7) that new 4 10 nodes link to the existing nodes with preferential attach-100 NSM ν 2 ment, Π(k) follows k with ν ≃ 0.75−0.8. Aiming to 10 obtain an analytically solvable model, at this point we1 0 10 neglect this nonlinearity and we approximate Π(k) with a linear k dependence. The effect of the nonlinearities0 −2 10 will be discussed in sect. VII. Thus, if node i has ki −4 links, the probabilitythat an incoming node will connect0 10 2 3 4 5 6 7 10 100 1000 10000 100000 10 10 10 10 10 10 to it is given byk k1 2 FIG. 9. Cumulated internal preferential attachment (κ(k)) kiPfor the M and NS database, scaling as a function of the k k Π = b , (7)1 2 i kjjproduct. Resultscomputedonthecumulativedatainthelast considered year. The straight lines have slope 1, expected for where b is the average number of new links that an in- κ(k k ) if there would be no preferential attachment.1 2 coming node creates. We have thus formulated the dynamical rules that govern our evolving network model, capturing the basic mechanism governing the evolution of the co-authorshipIV. MODELING THE WEB OF SCIENCE network: In this section we use the obtained numerical results 1. Nodes join the network at a constant rate. to construct a simple model for the evolution of the co- 2. Incoming nodes link to the already present nodesauthorship network. It is important to emphasize that following preferential attachment (7). the purpose of the model is to capture the main mech- anisms that affect the evolution and the scaling of the 3. Nodes already present in the network form new in- network, and not to incorporate every numerical detail ternal links following preferential attachment (6). ofthe measuredweb. However,the advantageofthe pro- 4. We neglect the aging of nodes, and assume that allposedmodelisitsflexibility: features,neglectedhere,can nodes and links present in the system are active,be incorporated into the current modeling framework. able to initiate and receive new links.We denote by k (t) the number of links node i has ati time t; by T(t) and N(t) the total number of links and In the model we assume that the number of authors total numberofnodes attime t, respectively. We assume on a paper is constant. In reality m is a stochastic vari- that all nodes present in the system are active, i.e. they able, as the number of authors varies from paper to pa- can author further papers. This is a reasonable assump- per. However, for the scale-free model the exponent γ is tion as the time-span over which data is available to us knowntobeindependentofm, thusmakingmastochas- is shorter than the professional lifetime of a scientist. In tic variable is not expected to change the scaling behav- agreement with Fig. 1, we consider that new researchers ior. join the field at a constant rate, leading to N(t) =βt. (4) V. CONTINUUM THEORY The average number of links per node in the system at Taking into account that new links join the systemtime t is thus given by: with a constant rate, β, the continuum equation for the T(t) evolutionofthenumberoflinksnodeihascanbewritten = . (5) as:N(t) Xdk bβk k ki i i j′Fig.9suggests,thattheprobabilitytocreateanewinter- P= +N(t)a P . (8) ′dt kj k ks mnal link between two existing nodes is proportional with j s,mj the product of their connectivities. Consequently, denot- Thefirsttermontherighthandsidedescribesthecontri-ing by a the number of newly created internal links per bution due to new nodes (7) and the second term gives 7 κ (k k )Δk 1 2 dtithe new links created with already existing nodes (6).