Accepted Manuscript
Disentangling the evolution of MEDLINE bibliographic database: A complex network perspective
Andrej Kastrin, Dimitar Hristovski
PII: S1532-0464(18)30227-2
DOI: https://doi.org/10.1016/j.jbi.2018.11.014
Reference: YJBIN 3086
To appear in: Journal of Biomedical Informatics
Received Date: 1 June 2018
Accepted Date: 28 November 2018
Please cite this article as: Kastrin, A., Hristovski, D., Disentangling the evolution of MEDLINE bibliographic database: A complex network perspective, Journal of Biomedical Informatics (2018), doi: https://doi.org/10.1016/j.jbi.2018.11.014
Disentangling the evolution of MEDLINE bibliographic database: A complex network perspective
Andrej Kastrin a,∗, Dimitar Hristovski a
a Institute of Biostatistics and Medical Informatics, Faculty of Medicine, University of Ljubljana, Vrazov trg 2, SI–1000 Ljubljana, Slovenia
Abstract
Scientific knowledge constitutes a complex system that has recently been the topic of in-depth analysis. Empirical evidence reveals that little is known about the dynamic aspects of human knowledge. Precise dissection of the expansion of scientific knowledge could help us to better understand the evolutionary dynamics of science. In this paper, we analyzed the dynamic properties and growth principles of the MEDLINE bibliographic database using network analysis methodology. The basic assumption of this work is that the scientific evolution of the life sciences can be represented as a list of co-occurrences of MeSH descriptors that are linked to MEDLINE citations. The MEDLINE database was summarized as a complex system, consisting of nodes and edges, where the nodes refer to knowledge concepts and the edges symbolize corresponding relations. We performed an extensive statistical evaluation based on more than 25 million citations in the MEDLINE database, from 1966 until 2014. We based our analysis on node and community level in order to track temporal evolution in the network. The degree distribution of the network follows a stretched exponential distribution which prevents the creation of large hubs. Results showed that the appearance of new MeSH terms does not also imply new connections. The majority of new connections among nodes results from old MeSH descriptors. We suggest a wiring mechanism based on the theory of structural holes, according to which a novel scientific discovery is established when a connection is built among two or more previously disconnected parts of scientific knowledge. Overall, we extracted 142 different evolving communities. It is evident that new communities are constantly born, live for some time, and then die. We also provide a Web-based application that helps characterize and understand the content of extracted communities. This study clearly shows that the evolution of MEDLINE knowledge correlates with the network’s structural and temporal characteristics.
Keywords: complex networks, network evolution, science of science, bibliographic databases, MEDLINE
∗ Corresponding author. Email addresses: [email protected] (Andrej Kastrin), [email protected] (Dimitar Hristovski)
Preprint submitted to Journal of Biomedical Informatics, November 19, 2018
1. Introduction
Scientists circulate and publish their solutions to answer highly complex and intricate scientific questions. The scientific production of information and novel knowledge has become particularly dynamic and considerably interdisciplinary.
The physical and mental barriers that once isolated researchers are becoming more permeable. Consequently, the body of life sciences literature, nowadays known as the bibliome, is both very large and considerably complex. The number of published works is expanding rapidly. MEDLINE, the premier literature collection in the area of the life sciences, as of May 2018 includes over 27 million links to biomedical records, with more than 2000 references added daily. In this paper we analyzed the dynamic properties and growth principles of MEDLINE using network analysis methodology.
Complex systems research has inspired numerous scientists since the small-world [1] and scale-free [2] characteristics were found in various large-scale networks, including the World Wide Web and social networking services which represent friendship relations between users. We have shown previously [3, 4] that MEDLINE could be characterized as a complex network, containing nodes and edges, where the former refer to knowledge concepts and the latter represent relations among them. The network induced in this manner may be applied
to explain in detail the anatomy and dynamics of complex systems and assists us in decoding significant wiring patterns (also called topological properties), interesting behavioral characteristics, and predicting future trends. Knowledge in such networks is represented on the basis of co-occurrences between biomedical concepts, such as proteins, disorders, molecular functions, or drugs. A
co-occurrence methodology is built on an expectation that terms appearing together in the same sentence are somehow biologically related [5]. Namely, it is the distributional hypothesis which implies that terms that are related in connotation tend to occur in similar linguistic contexts [6]. Co-occurrence has many different nuances (e.g., co-occurrence in a given text window, paragraph
co-occurrence, keyword co-occurrence). In addition to pure co-occurrence, we can also include (i) the concept frequency, (ii) how close concepts are to one another (e.g., in a sentence), and (iii) a weighted combination of all concepts within a document [5]. Several services have been introduced that exploit co-occurrence methodology for unraveling significant motifs in the life science literature, including our BITOLA system [7]. Moreover, by considering the growing co-occurrence frequency between pairs of terms we may identify key patterns, topics, memes, and emerging trends within a given research field.
In a realistic scenario nodes and edges of the network are not distributed evenly but exhibit locally dense groups (so-called communities), which tend to
be integrated within the larger network structure [8]. Such community structures are usually of particular importance. For example, a community may be a set of Web pages related to a common topic [9] or a group of genes exhibiting similar function [10]. Detection of communities is an essential aspect in deciphering the structure of a complex network. We define community as a (sub)group
of nodes sharing comparable characteristics and perceived as being distinctive from the rest of the network [11]. Nodes in a particular community connect with each other more tightly than with other members of the network [12]. In other words, a community contains a thickly interlinked group of nodes that are only weakly tied to other segments of the network; we say that an effectively divided
community has high modularity [13, 14]. According to Leskovec et al. [15], an acceptable community has low conductance, i.e., it has numerous internal relations and only a few edges interacting with the rest of the network structure. Similarly, Kleinberg [16] has shown that communities contain a core of authoritative nodes which are linked by hub nodes. Research on different community
detection algorithms can be found in an excellent survey by Lancichinetti and Fortunato [17]. In addition to being large, complex networks also have a dynamic, temporal nature [18]. Complex networks are developing systems that transform over time either by adding new entities or by forming new edges among existing ones.
They can expand (or even shrink) very quickly in both size and extent. Unfortunately, understanding such network structure is very challenging because the interactions evolve rapidly. The study of dynamic evolution is a relatively new research field under the umbrella of network analysis. The way in which communities take shape and evolve over time is a theme
that runs through a large part of research. The open question fundamental to modern network analysis is understanding and deciphering the mechanisms underpinning community evolution. Communities could be born or die, grow or shrink, or have their community membership shift [19]. Network communities are not static but are evolving objects in time. The key question is how well the
communities allow us to track temporal evolution in the network. The problem of community detection is well known in static networks and can easily be generalized to temporal networks. For our purposes, we formally define an evolving network as a sequence of static networks where each snapshot represents the state of the network at a particular time. To elucidate community evolution
we can simply compute static communities on each timeslot independently and then match the communities between adjacent time points.
Science itself is a complex system and has been studied intensively in complex networks research [20, 21], mainly through the analysis of co-authorship networks [22, 23, 24] or single-word analysis [25, 26]. Precise dissection of the expansion of scientific knowledge may help us to construct theoretical frameworks of the collective dynamics of science. In this paper we analyze dynamic properties and expansion principles of MEDLINE, applying state-of-the-art complex network methodology and tools. The fundamental assumption of this study is that the timeline of the life sciences field can be represented as a list of co-occurrences of concepts that are associated with each MEDLINE citation (i.e., MeSH terms) and which identify its important topics. We analyzed research themes as developing clusters of MeSH descriptors over time. We examined how the properties of the MeSH-based network can be utilized to assist in understanding the temporal development of scientific thinking in the broad field
of biomedicine. To this end, we performed a statistical evaluation based on over 25 million citations in the MEDLINE database, from 1966 until 2014. As far as we know, this is the first such extensive work conducted on such a large segment of MEDLINE. Motivation and basic analytical tools for this study were already presented at the ASONAM conference [27]. However, this
study is of particular importance to scientists who want to learn more about research themes, novel directions and partnership opportunities, to funding organizations who try to follow effects of funding, and also for policy creators who try to understand the outcomes of science policies.
2. Related work
Nowadays, studies of science are growing exponentially with research on coauthorship and citation networks [28], knowledge diffusion [29], phylomemetic patterns in science evolution [30], culminating in fascinating maps of science [31] and knowledge [32]. For a recent comprehensive review of the field we refer the reader to Fortunato et al. [21] and Zeng et al. [20]. Science mapping has a long
tradition. One of the earliest studies, which uses a co-occurrence approach, was carried out by Bauin in the mid-1980s. The results of this study were reported in a book with the expressive title Mapping the Dynamics of Science and Technology, which traces a basis for the application of a co-occurrence approach in mapping knowledge structures [33]. Since then numerous researchers have used
co-occurrence analysis as a framework to analyze knowledge networks in different areas, for example in biology [34], scientometrics [35], and healthcare [36].
In past decades we have observed a great deal of attention being paid to research of topics, ideas, and scientific memes (i.e., significant ideas that emerge and spread through the research literature). Perc [25] analyzed more than half a
million bibliographic records published by the American Physical Society from the end of the 19th century until 2012. After identifying all unique terms and phrases and describing their usage trends, he obtained an understanding of the patterns of physical science research. Results indicate that both the rise and fall of scientific paradigms are driven by the robust regulation of self-organization
principles. Kuhn et al. [26] analyzed nearly 50 million titles and abstracts from the Web of Science, PubMed Central, and the American Physical Society, published between 1893 and 2009. Their analysis reveals that the dynamics of memes can be predicted by a straightforward relationship between count of occurrence and the degree to which they propagate through a citation network.
He et al. [37] found that scientific memes have a specific evolving process. From an initial meme, researchers’ consideration shifts to associated themes and then shifts back, in a cycle of shifts.
Our work has been inspired by papers in which authors tried to analyze dynamic aspects of topic modeling. Existing approaches for topic evolution
analysis mainly use a bag-of-words model [38]. The moment of evolution is then analyzed by comparing changes of topics over time. Some recent work tries to use a network analysis approach to analyze longitudinal data [39, 40], but there has been very limited work on combining topic modeling and community detection algorithms [41]. Gruhl et al. [42] discovered that themes in the blogosphere developed due to the evolution of topological communities. Similarly, Li et al. [43] demonstrated that actors within the same community are concerned about similar themes across time. Zhou et al. [44] propose an approach that integrates generative probabilistic modeling with community discovery. Their approach can detect communities and supports semantic topic descriptions of
the communities examined. Liu et al. [45] developed a Bayesian hierarchical model in which the creation of a relation between pairs of documents is seen as a mixture of topic similarity and community closeness. Li et al. [46] propose the Community Topic Model (CTM) which can reveal communities associated with similar themes and the Dynamic CTM which can identify dynamic attributes
of communities and topics. Recently, Nguyen et al. [47] presented an adaptive modularity-based framework for detecting and tracing the evolution of network communities in temporal networks.
3. Methods
In this section we first describe the data acquisition procedure and then detail the network reduction methodology. Next, we proceed with the measures used in the topological analysis of the constructed networks. After that, we discuss extraction and characterization of communities. The section concludes with the software description.
3.1. Data acquisition
MEDLINE is the premier bibliographic collection in the life science disciplines. As of May 2018, it includes more than 27 million citations that begin in the late 19th century. Since the mid-1940s, MEDLINE records have been indexed by qualified annotators from the U.S. National Library of Medicine using the MeSH thesaurus. MeSH is a hierarchically organized thesaurus containing
biomedical concepts at distinct levels of complexity. The vocabulary includes three different categories of terms: main headings (descriptors), supplementary concepts, and qualifiers. Descriptors are the core components of the MeSH thesaurus and signify the main subject of a particular bibliographic record. For example, for a citation which reports the results of bioinformatics analysis of
the effects of tobacco smoke on gene expression, MeSH descriptors might be Computational Biology, Oligonucleotide Array Sequence Analysis, Smoke, Tobacco, and Transcriptome among others. Qualifiers are attached to descriptors inside the MeSH field to express a special aspect of the biomedical concept. We narrow our exploration to MeSH descriptors only. Each MEDLINE record has
approximately 12 MeSH descriptors associated with it. In each record, some MeSH descriptors are labeled as major descriptors, which designate the principal subject of the record. The 2015 MeSH vocabulary, which was used in this work, consists of 27 455 descriptors.
We downloaded the MEDLINE Baseline Repository, up to the end of 2015, which contains 9 451 243 records annotated with major MeSH descriptors. For each record we extracted the following details: the unique identifier of the MEDLINE record, the record’s publication year, and a series of major MeSH descriptors. Next, we built a single MeSH co-occurrence network from all the MEDLINE records. In this MeSH network, each node represents a particular
MeSH descriptor and a link between two descriptors was established if they both occur within the same MEDLINE record at least once. We note that we did not examine the orientation of the relationships (i.e., the relationship between MeSH terms u and v is the same as the relationship between terms v and u), in other words, the edges are undirected. The constructed network was stored
in edge list format. A diagrammatic illustration of the assembled network is represented in Figure 1.
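The construction described above can be illustrated with a minimal Python sketch. It is not the authors' script: the record layout `(pmid, year, descriptors)`, the function name `build_cooccurrence_edges`, and the toy records are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_edges(records):
    """Count undirected co-occurrences of major MeSH descriptors.

    `records` is an iterable of (pmid, year, descriptors) tuples; this
    layout is a hypothetical one chosen for the example.
    """
    weights = Counter()
    for _pmid, _year, descriptors in records:
        # Deduplicate and sort so that (u, v) and (v, u) share one key
        # and no self-loops are produced.
        for u, v in combinations(sorted(set(descriptors)), 2):
            weights[(u, v)] += 1
    return weights


# Toy records (invented for illustration, not taken from MEDLINE).
records = [
    (1, 2001, ["Computational Biology", "Stem Cells"]),
    (2, 2003, ["Stem Cells", "Computational Biology", "Transcriptome"]),
    (3, 2005, ["Depressive Disorder", "Bilirubin"]),
]
edges = build_cooccurrence_edges(records)

# Write the weighted edge list, one tab-separated edge per line.
for (u, v), weight in sorted(edges.items()):
    print(f"{u}\t{v}\t{weight}")
```

Sorting the descriptors within each record gives every undirected pair a single canonical key, which matches the statement that the orientation of relationships is not examined.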
Figure 1: Illustration of the composed MeSH network. Nodes refer to major MeSH terms. A link between a pair of MeSH terms is established if they appear together in the same MEDLINE record. Co-occurrence frequency is designated as edge width. For example, the pair Computational Biology – Stem Cells appears in many more MEDLINE records than does the pair Depressive Disorder – Bilirubin.
In this setting, the network is represented by the graph G(V, E, T ) that is
assembled from the set of nodes V defining major MeSH descriptors, the set of edges E representing relations between the pairs of nodes, and the set T of linearly ordered timeslots. We also suppose that multiple edges and loops are not allowed. For an exhaustive survey of the literature on complex networks, see the excellent review by Newman [48].
3.2. Network reduction
The collected network was post-processed to remove all non-useful and non-informative nodes and edges. Here we introduce a methodology which helps to reduce the MeSH network. First, descriptors that appear highly frequently (e.g., Humans, Animals, Mice) were removed; we built a list of non-useful MeSH descriptors based on MEDLINE check tags [49]. Edge reduction is somewhat more involved. For edge elimination we applied Pearson’s χ² test for independence
for each co-occurrence pair to obtain a statistic which indicates whether a particular pair of MeSH terms occurs together more often than by chance. If an expected value was less than five, we applied Yates’s correction for continuity. If the statistic was greater than the critical value of 3.84 (p ≤ 0.05), we treated the particular MeSH relation as occurring more often than expected by chance at the 5% significance level.
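As a rough illustration of this edge filter, the following Python sketch computes the χ² statistic for a 2×2 contingency table built from record counts. The function name, its arguments, and the example counts are hypothetical; the sketch also assumes each term occurs in at least one but not all records, so that no expected count is zero.

```python
CRITICAL_VALUE = 3.84  # chi-square critical value, 1 degree of freedom, p = 0.05


def chi_square_2x2(n_uv, n_u, n_v, n):
    """Pearson chi-square statistic for the co-occurrence of terms u and v.

    n_uv: records containing both terms; n_u, n_v: records containing
    each term; n: total number of records. Yates's continuity correction
    is applied when any expected count is below five.
    """
    observed = [
        [n_uv, n_u - n_uv],                  # u present: v present / absent
        [n_v - n_uv, n - n_u - n_v + n_uv],  # u absent:  v present / absent
    ]
    rows = [n_u, n - n_u]
    cols = [n_v, n - n_v]
    expected = [[r * c / n for c in cols] for r in rows]
    use_yates = any(e < 5 for row in expected for e in row)

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            diff = abs(observed[i][j] - expected[i][j])
            if use_yates:
                diff = max(diff - 0.5, 0.0)
            statistic += diff ** 2 / expected[i][j]
    return statistic


# Invented counts: a frequent pair and a rare pair.
frequent = chi_square_2x2(n_uv=30, n_u=100, n_v=100, n=1000)
rare = chi_square_2x2(n_uv=3, n_u=10, n_v=10, n=100)
print(frequent > CRITICAL_VALUE, rare > CRITICAL_VALUE)
```

In this toy case the frequent pair passes the threshold and would be kept as an edge, while the rare pair, for which the continuity correction is triggered, would be dropped.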
The detailed methodology for edge reduction is presented elsewhere [50].
3.3. Topological analysis
The size of the network is the number of nodes in the network. The average degree refers to the mean number of neighbors of each node, in our case the average number of MeSH terms linked to each term. The magnitude
of the largest connected component represents the extent of the main cluster of the network. The largest connected component is also known as the giant component. Effective diameter is a robust alternative to the standard diameter; it is the number of edges needed on average to reach 90% of all other nodes.
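One common way to make the effective diameter concrete is to take the smallest distance within which at least 90% of all connected node pairs lie. The Python sketch below follows that formalization, which is an assumption here and may differ in detail (for example, interpolation between integer distances) from the implementation used in the study; the adjacency-dictionary input and the toy graph are likewise illustrative.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from `source` to all reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def effective_diameter(adj, fraction=0.9):
    """Smallest integer d such that `fraction` of connected pairs are within d."""
    counts = {}  # distance -> number of ordered node pairs at that distance
    for source in adj:
        for target, d in bfs_distances(adj, source).items():
            if target != source:
                counts[d] = counts.get(d, 0) + 1
    total = sum(counts.values())
    reached = 0
    for d in sorted(counts):
        reached += counts[d]
        if reached >= fraction * total:
            return d
    return 0  # graph without any connected pair


# Toy undirected path graph a - b - c - d, given as an adjacency dictionary.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(effective_diameter(adj))
```

Running one breadth-first search per node is quadratic in the worst case, so for networks of the size discussed here a sampled set of source nodes would be a more practical choice.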
Besides elementary metrics we also measure clustering and modularity to examine the small-world presence and the community architecture. Originally,
the clustering coefficient was a local characteristic to express cliquishness in the network, i.e., to what extent two neighbors of a particular MeSH term also interact mutually [1]. In this paper, we use Kaiser’s [51] definition of mean clustering
coefficient. Modularity is a measure that indicates the possible presence of community structure at the global scale [14]. A high level of modularity implies that the topology has pure community anatomy and is composed of several communities with dense connections within each cluster and less-dense links between clusters. In this study, we adopt the effective community-detection approach
developed by Blondel et al. [52] to discover communities with optimal level of modularity.
3.4. Tracking evolution of communities
The evolution of communities can be captured by recognizing critical events that characterize the changes in network structure over time. To track significant
events we convert the whole MeSH network into static subnetworks on a yearly basis, from 1966 until 2014. These subnetworks embed the communities of MeSH terms whose evolution we observe through the years. The total collection of subnetworks is the input data for our further experiments.
After detecting communities using the Louvain algorithm [52], we need to
discover the relationships between communities in order to track community evolution. To properly capture the evolution of communities, after detecting communities at time-snap level, it is essential to pair the communities between successive time snaps. In this work, we adapted a community evolution algorithm introduced by Konstantinidis et al. [53]. The procedure starts by selecting
the initial set of communities $C_{1,1}, C_{2,1}, \ldots, C_{k,1}$ using the Louvain community detection algorithm applied to the $G_1$ network. We employ the Louvain community detection algorithm as it can efficiently scale to several million nodes. Marker variable $T_i$ is then allocated to each community in the first snapshot. In the next step, the list of communities is detected from topology $G_2$ and a
pairing procedure is executed between all the combinations of communities from the two snapshots in sequence. This is performed to detect any potential evolution from the first snapshot to the second. The dynamic communities $T_{(1,2,\ldots,i)}$ are updated in the following step, based on that evolution. For instance, if $C_{a,1}$ does not emerge in the second snapshot, we do not update marker $T_a$. A split
is recorded if the community is detected twice in the new timeslot, and a merge is registered if two or more communities have combined into one. If the pairing procedure detects no matching community, the community is indexed as being dead. The evolution discovery algorithm is executed until all networks have been handled.
To determine if a pair of consecutive communities match we employed the Jaccard coefficient [54]. We adopt the Jaccard coefficient due to its efficiency and popularity in community matching [53]. The similarity measure s between two successive communities $C_{i,n}$ and $C_{i,n-t_d}$ is computed using the equation
$$ s(C_{i,n}, C_{i,n-t_d}) = \frac{|C_{i,n} \cap C_{i,n-t_d}|}{|C_{i,n} \cup C_{i,n-t_d}|}, $$
where $t_d \in [1, 3]$ is a timeslot delay.
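The similarity computation and the look-back over up to three earlier timeslots can be sketched in Python as follows. This is a simplified illustration rather than the adapted algorithm of Konstantinidis et al. [53]: the function names, the snapshot layout (a list of dictionaries mapping a label to a set of terms), and the `threshold` parameter are assumptions of the example.

```python
def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two collections of terms."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_matches(community, previous_snapshots, threshold, max_delay=3):
    """Return (delay, label, similarity) for matching earlier communities.

    previous_snapshots[-1] is the snapshot at delay 1, previous_snapshots[-2]
    the one at delay 2, and so on; each snapshot maps a community label to
    its set of member terms.
    """
    matches = []
    for delay in range(1, min(max_delay, len(previous_snapshots)) + 1):
        for label, members in previous_snapshots[-delay].items():
            similarity = jaccard(community, members)
            if similarity > threshold:
                matches.append((delay, label, similarity))
    return matches


# Toy snapshots (oldest first) and a community from the current timeslot.
previous = [{"T1": {"A", "B", "C"}}, {"T2": {"X", "Y"}}]
current = {"A", "B", "D"}
print(find_matches(current, previous, threshold=0.3))
```

Here the current community shares two of four distinct terms with `T1`, which was last seen two timeslots earlier, so it is matched with delay 2 and similarity 0.5; several matches for one community would correspond to a merge.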
If the computed similarity exceeds a threshold value (which was set to 0.3 in this study, following the suggestion of Greene et al. [55]), the pair is matched and $C_{i,n}$ is added to the timeline for the dynamic community $T_i$.
3.5. Software
Data preprocessing was done using Bash and Python scripts. The main
part of the network evolution analysis was performed in R and Python. The programming scripts to reproduce the results of the analysis and source code for Web application are freely available in the GitHub repository https://github.com/akastrin/medline-evolution. The Web application is also available at the URL http://akastrin.si/med2clu.
4. Results
In this section we first present results of basic topological analysis. Next, we offer results regarding network evolution and community detection analysis. Finally, we describe our Web application.
4.1. Topological analysis
We began our analysis with basic exploratory mining of the constructed data set. The 2015 MEDLINE distribution which was utilized in this study contains 20 903 177 bibliographic citations which are tagged with MeSH descriptors. Of these, 9 451 243 citations contain major MeSH terms and have a publication year between 1966 and 2014 inclusive;
only these citations were used for further study. The number of citations rises steeply across years as shown in Figure 2. A similar trend is observed for the number of co-occurrences, except between 1990 and 2000, when co-occurrences stay approximately constant. On average there are 1.74 ± 1.05 major MeSH terms per citation.
Figure 2: Number of citations and co-occurrences over time. The blue line represents the number of bibliographic citations tagged with major MeSH terms and published between 1966 and 2014. The red line depicts the number of co-occurrences among major MeSH terms in the same time period.
We first examined the global network, without considering the time component. The network consists of 20 928 nodes and 1 747 963 undirected edges among them. Mean and maximal node degree are c = 167.05 edges and kmax = 2774 edges, respectively. The diameter of the network is D = 7 edges. The network exhibits a rather short average path length between pairs of nodes;
there are around L = 2.62 hops from a randomly picked node to any other node in the network. The mean clustering of the network is C = 0.32. The whole network resembles a small world as a result of its small average path length and rather high clustering. The giant component of the network contains about 99% of all nodes.
As we said previously in the Methods section, the degree of a node is the number of links incident to a particular node. Degree distribution is the probability distribution of these degrees over the whole network. The degree distribution is a very important concept in studying the behavior of real networks; it can help to determine if the network follows specific regularities (e.g., power
law). Degree distribution for the MeSH network is represented on the left panel (A) in Figure 3. It is evident that the scale-free distribution is not a reasonable prototype for a degree distribution (D = 0.054). However, we obtained a reasonably good fit with Weibull distribution, especially for the tail of the distribution (D = 0.015). The distribution has similar shape in the weighted
case. The Weibull or stretched exponential distribution is a generalization of the exponential distribution. The stretched exponential distribution of degrees may be the result of a sublinear attachment growth type [56]. According to recent empirical evidence, Weibull and log-normal distributions are more often better models of degree distribution than is a power law [57]. Stretched exponential
distribution in our case may reflect the nature of the mechanism which prevents evolution of large hubs and keeps MeSH terms more specific [58]. The relationship between the clustering coefficient and degree is represented in the right panel (B) in Figure 3. It is evident that the MeSH network exhibits a weak but apparently plausible hierarchical structure. A full hierarchical network
will have a clustering coefficient that scales as a power of the degree, while a pure random network will have a clustering coefficient that is independent of degree. Overall, there is a statistically significant correlation between degree and average clustering coefficient (r = −0.38, p < 0.001). If we take into account that the MeSH vocabulary is indeed organized as a tree structure with 16 main branches
(e.g., Anatomy, Organisms, Diseases) this dependence is not surprising.
Figure 3: Cumulative degree distribution (A). Weibull distribution produces the best fit to the data. Right panel (B) depicts degree vs. average clustering, demonstrating that clustering has some dependence on degree.
In the next step we also included time components in the analysis. For a detailed analysis we dissected the network into one-year slices, from 1966 to 2014, obtaining 49 timestamps in total. We first present basic descriptive statistics, including node and edge distribution, average degree, clustering coefficient, giant
component size, number of communities, and modularity distribution. Networks displayed complex and intricate topology patterns as can be observed in their corresponding structure parameters in Figure 4.
Figure 4: Descriptive statistics of the networks over time. The figure represents a detailed structural analysis of the network over time for the period 1966–2014. On each panel two statistics are represented on the left and right y-axis respectively.
The number of nodes and the number of edges rapidly increase in time. In 1966 the network starts with 5606 nodes and 55 539 edges. In 2014 the network has 11 987 nodes and 154 397 edges. It is important to note that the number of edges grows superlinearly with respect to the number of nodes. Average degree first decreases, reaches a plateau in the 1980s, and then increases. The average distance among nodes, in terms of effective diameter and average path length, shrinks over time. The clustering coefficient has a very different life
history. Values first increase, reach a maximum in 1984, and then decrease. The giant component shows an almost linear increasing trend. The number of communities first increases through time, reaches a maximum in 1984, and then decreases. Modularity exhibits very high values across time. We will refer back to modularity later in the paper, when we try to describe quality of the derived
communities.
Figure 5 presents a heat map of the top 10 most highly linked MeSH descriptors for each time snap. The top-degree MeSH terms exhibit marked differences in their distribution over time. Some MeSH terms only appear in a single year, such as Enzymes, Water, or Protein Binding. Some MeSH terms
appear in phases, for instance Models, Biological and Decision Making. Finally, some MeSH terms are hot for a long period of time, for example Research or Public Policy. We can also observe that the top MeSH terms are much more active in recent years, with a higher number of connections; note the lighter color tones in Figure 5.
Figure 5: Dynamics of the top-10 MeSH terms with highest degree over the years. The y-axis corresponds to a list of MeSH terms with the highest degree from 1966 to 2014. The x-axis is the corresponding year. Rows are sorted using the hierarchical clustering algorithm. Lighter color tones represent nodes with higher degree.
4.2. Network evolution
To summarize global network evolution we adopted the approach presented by Choudhury and Uddin [59]. Figure 6 represents the evolution of the MeSH network in relation to the number of nodes and edges. The type of a connection between MeSH terms may be described using three mechanisms: (i) a new edge
may be formed between MeSH terms that arrived in a given year, (ii) a newly emerged MeSH term may form an edge with an already established MeSH term, and (iii) a new edge can be established between two old MeSH terms. The term ‘old MeSH term’ ($K_{old}$) refers to the set of MeSH terms introduced in the year(s) preceding a particular year. For instance, in the year 2000, the
total number of old MeSH terms was 9607. This means that these MeSH terms were introduced within the years 1966–1999. The numbers of edges formed by the above mechanisms are denoted as $E_{new}$, $E_{comm}$, and $E_{old}$, respectively. It is evident from Figure 6 that the $E_{old}$ mechanism governs the dynamics of link evolution, such that new links are established among the old MeSH terms. The $E_{old}$ type of mechanism
365
is amplified through the years. On the other hand, the number of edges formed within only new MeSH terms (Enew ) and the number of edges between new MeSH terms and old MeSH terms (Ecomm ) are negatively correlated with time. It is evident that the arrival of new MeSH terms does not imply new relations. To the contrary, the majority of new relations results from old MeSH terms. This
is a very important finding, because it supports the theory that scientific creativity is based (mainly) on new combinations of existing knowledge instances.
Include Figure 6 about here
Figure 6: Network evolution processes according to new MeSH terms and various edge formation processes. Knew = number of new MeSH terms in each year, Kold = number of MeSH terms that entered the network before a specified year, Enew = number of edges created among the new MeSH terms entered each year, Eold = number of edges among old MeSH terms that arrived in the year(s) preceding a given year, and Ecomm = number of edges between new MeSH terms and old MeSH terms from previous year(s). Note that Enew ∪ Eold ∪ Ecomm is equal to the total number of edges E in a particular year.
4.2.1. Community detection analysis
Finally, we performed community detection analysis on time-sliced network data. We analyzed networks in terms of community structure and their evolution. We define a community as a subnetwork that represents a set of tightly interconnected MeSH terms. Links between MeSH terms in a subgraph refer to their co-occurrences. For community detection we used the modularity optimization algorithm introduced by Blondel et al. [52], as described previously in the Methods section. The last panel (D) in Figure 4 shows the modularity of each network in the timeline together with the number of communities for each timestamp. Modularity reaches high values, suggesting dense connections among nodes within communities and weak connections between nodes from different communities. It is also interesting to measure the size of a particular community in time (i.e., the number of MeSH
terms that define the community), which may reflect the knowledge transition in time. The median size of communities for each time slice is presented in Figure 7. After a sharp drop that ends in the early 1980s we can observe a gradual increase of community size over time. In parallel we also measured the activity level of each community, which we define as the number of distinct MEDLINE citations indexed with community-specific MeSH terms. The activity trend correlates nearly perfectly with community size. Overall, we extracted 142 different evolving communities. The number of communities in each time period is also depicted in the last panel in Figure 4.
Figure 8 depicts a heat map of evolving communities for each timeslot. Each cell in the heat map represents a particular community at a given time. The diagonal trend line in the plot shows that the dynamics of community evolution are quite predictable. New communities are constantly born, live for some time, and then die. There are also some interesting communities that are present for a longer period of time. We refer to these communities as significant communities.
Include Figure 7 about here
Figure 7: Community size and activity levels. Activity level is defined as the number of distinct MEDLINE citations indexed with community-specific MeSH terms.
Include Figure 8 about here
Figure 8: Heat map of evolving communities for each timeslot. Each cell in the plot represents a particular community in a given year.
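As an illustration of the three edge-formation mechanisms introduced in Section 4.2, the following minimal Python sketch classifies the edges of one yearly snapshot as Enew, Ecomm, or Eold. It assumes that each citation is given as a set of MeSH terms, that an edge joins every pair of terms co-occurring in a citation of that year, and that a term counts as "old" if it appeared in any earlier year. The function name `classify_edges` and the toy data are hypothetical and are not taken from the authors' code.

```python
from itertools import combinations


def classify_edges(citations_by_year, year):
    """Count E_new, E_comm and E_old edges in the snapshot of `year`.

    citations_by_year: dict mapping a year to a list of sets of MeSH terms
    (one set per citation). This input layout is an assumption.
    """
    # Terms seen in any year before `year` are treated as old terms (K_old).
    old_terms = {
        term
        for y, citations in citations_by_year.items()
        if y < year
        for terms in citations
        for term in terms
    }
    # Every pair of terms co-occurring in a citation forms one undirected edge.
    edges = {
        frozenset(pair)
        for terms in citations_by_year.get(year, [])
        for pair in combinations(sorted(terms), 2)
    }
    labels = ("E_new", "E_comm", "E_old")  # indexed by number of old endpoints
    counts = {label: 0 for label in labels}
    for edge in edges:
        n_old = sum(term in old_terms for term in edge)
        counts[labels[n_old]] += 1
    return counts


toy = {
    1999: [{"Research", "Public Policy"}],
    2000: [
        {"Research", "Public Policy", "Genomics"},
        {"Genomics", "Proteomics"},
        {"Research", "Water"},
    ],
}
print(classify_edges(toy, 2000))
```

On this toy input the three counts sum to the total number of distinct edges in the 2000 snapshot, mirroring the note in the caption of Figure 6.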
To measure the quality of the extracted communities we would usually need a ground truth partition. Construction of such a partition is manual, expensive, and time-consuming work. Instead, we used an internal quality measure, which requires only an assumption about what constitutes a “good” community. Many internal quality evaluation metrics for community detection have been proposed in the literature [60]. However, there is no consensus on how they compare to each other. In this study we used modularity (Q) to evaluate cluster quality; see panel (D) in Figure 4. High modularity means that a “good” community has a larger than expected number of internal links and a smaller than expected number of inter-community links when compared to a random network with similar characteristics [13]. Our results show that average modularity was 0.76 in 1966 and 0.84 in 2014. Average modularity across years was 0.85 ± 0.04. In practice, researchers assume that a network possesses modular structure when Q > 0.25 [13]. In our setting, modularity is highly correlated with community size (r = 0.96, p < 0.001); correlations with other network invariants were small and nonsignificant. These results confirm the high quality of the extracted communities.
As noted previously in the Methods section, the MeSH vocabulary is a hierarchical structure. The depth of a MeSH descriptor coincides with the number of ancestors it has, so depth increases with traversal away from the root. Figure 9 illustrates the average depth of the MeSH terms that compose a particular community over the years. The average trend is increasing, meaning that terms become more specific over the years. Consequently, communities also become more topic-specific over the years.
Include Figure 9 about here
Figure 9: MeSH term depth. The blue line refers to the average depth of MeSH terms in the hierarchy over the years. The gray ribbon refers to the corresponding standard deviation.
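To make the modularity criterion used for quality evaluation more concrete, the sketch below computes Q for a given partition of an undirected network in the standard form attributed to Newman and Girvan [13], Q = Σ_c [L_c / m − (d_c / 2m)²], where m is the number of edges, L_c the number of edges inside community c, and d_c the sum of degrees of its nodes. It is only an illustration: edge weights are ignored as a simplifying assumption, the partition is supplied by hand rather than found by the algorithm of Blondel et al. [52], and the names `modularity`, `edges`, and `community_of` are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of a partition of an unweighted, undirected graph.

    edges: list of (u, v) pairs without duplicates or self-loops.
    community_of: dict mapping each node to its community label.
    """
    m = len(edges)
    internal = defaultdict(int)    # L_c: edges with both ends in community c
    degree_sum = defaultdict(int)  # d_c: sum of node degrees in community c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# Two triangles joined by a single bridge edge (c, d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
one_group = {node: 1 for node in two_groups}

print(modularity(edges, two_groups))  # 5/14, roughly 0.357
print(modularity(edges, one_group))   # 0.0: a single community adds nothing
```

Splitting the toy graph into its two triangles yields a clearly positive Q, while lumping all nodes together yields zero, which is the sense in which a higher Q indicates more internal and fewer inter-community links than expected at random.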
For illustration purposes we extracted 10 different significant communities with life spans from 12 to 25 years. These are the communities with the longest horizontal interval in Figure 8. Due to space limitations we present only the first six communities in Figure 10 in the form of word clouds. For each dynamic community we go through the timestamps and count the MeSH terms that appear in a particular year. The size of a word in a word cloud is proportional to the total count of that MeSH term in the dynamic community. The first community (A) clearly refers to genetics, in particular to cytogenetics. The second community collects terms around visual perception, while the community presented in (C) is clearly loaded with terms from statistical theory and applications. The fourth community is very specific; it collects many terms connected with psychoanalysis. The fifth community refers to dental medicine and the last presented community is relevant to economics and demography. It is evident that all communities are coherent and very topic-specific.
Include Figure 10 about here
Figure 10: Word clouds for significant communities. For description please see text.
4.3. Web-based application to track communities
4.3.1. Application overview
The Web application was implemented in the Python programming language using the Flask microframework. The complete programming code is available from the GitHub repository (please see Section 3.5 for details). The code can also be run locally on a personal computer. The application’s dashboard is composed of three panels (Figure 11): a strategic diagram, a word cloud, and a MeSH report table. The strategic diagram represents a global picture of communities for a selected year. Besides the year, the user may also select the minimal community size. The user can zoom in or out by selecting a rectangular region in the strategic diagram. The strategic diagram is embedded in a two-dimensional space defined by density and centrality. When a user selects a particular community in the strategic diagram, the word cloud in the top-right panel is created. The word cloud is composed of the MeSH terms that define the selected community. The size of each MeSH term is proportional to the exponential value of the term’s weight (please see Section 4.3.2 for details). At the same time we render the MeSH table in the bottom-left panel of the dashboard. The table lists the MeSH terms that define the selected community and presents information about the official DOI number (with the link to the external MeSH Browser), name, and weight.
Include Figure 11 about here
Figure 11: Screenshot of the Web application
4.3.2. Implementation details
In the next paragraphs we introduce the methodology necessary to understand and use our Web application. The Web application relies on two measures, namely density and centrality, to map the field of biomedical research. Each research theme is characterized by both measures.
Callon’s density, or internal cohesion, is a measure of the strength of the edges that link together the cluster of MeSH terms [61]. It reflects the inner strength of a group of nodes and its ability to self-organize and to evolve in time. The higher the density, the more coherent the cluster is. Density is formally defined as
$$d = 100 \times \frac{\sum e_{ij}}{w},$$
where $i$ and $j$ are MeSH terms belonging to the theme, $e_{ij}$ is the normalized co-occurrence frequency of both terms, and $w$ is the number of MeSH terms in the cluster. The normalized co-occurrence frequency is calculated as
$$e_{ij} = \frac{c_{ij}^2}{c_i c_j},$$
where $c_{ij}$ is the number of MEDLINE citations in which the two MeSH terms $i$ and $j$ co-occur and $c_i$ and $c_j$ represent the number of citations in which each one appears. The normalized frequency equals unity when the pair of MeSH terms always appears together and zero when they are never associated.
Callon’s centrality expresses the extent of connection of a community of nodes with other areas of the network [61]. More formally, it is the strength of the external links of a particular community to other communities. It can also be understood as a measure of the significance of a research community in the evolution of the entire scientific area. The greater the strength of a community’s connection with other research communities, the more central this community will be to the whole network. Centrality is formally defined as
$$c = 10 \times \sum e_{kh},$$
where $k$ is a MeSH term belonging to the theme, $h$ is a MeSH term belonging to other themes, and $e_{kh}$ is the normalized co-occurrence frequency of both terms.
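The density and centrality definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the code of the Web application: it assumes that the sums run over unordered pairs of terms, that co-occurrence counts are stored in a dictionary keyed by `frozenset` pairs, and that the names `equivalence`, `density`, `centrality`, `cooc`, and `freq` are hypothetical.

```python
from itertools import combinations


def equivalence(c_ij, c_i, c_j):
    """Normalized co-occurrence frequency e_ij = c_ij^2 / (c_i * c_j)."""
    return c_ij ** 2 / (c_i * c_j)


def density(theme, cooc, freq):
    """Callon's density: 100 * (sum of internal e_ij) / (number of terms)."""
    total = sum(
        equivalence(cooc.get(frozenset((a, b)), 0), freq[a], freq[b])
        for a, b in combinations(sorted(theme), 2)
    )
    return 100 * total / len(theme)


def centrality(theme, cooc, freq):
    """Callon's centrality: 10 * sum of e_kh over links leaving the theme."""
    total = 0.0
    for pair, c_kh in cooc.items():
        k, h = tuple(pair)
        if (k in theme) != (h in theme):  # exactly one endpoint in the theme
            total += equivalence(c_kh, freq[k], freq[h])
    return 10 * total


# Toy data: citation counts per term and per co-occurring pair.
freq = {"A": 10, "B": 10, "C": 20, "D": 5}
cooc = {
    frozenset(("A", "B")): 10,  # A and B always appear together, e = 1
    frozenset(("A", "C")): 5,   # e = 25 / 200 = 0.125
    frozenset(("C", "D")): 5,   # link outside the theme, ignored below
}
theme = {"A", "B"}
print(density(theme, cooc, freq))     # 50.0
print(centrality(theme, cooc, freq))  # 1.25
```

In the toy example the theme {A, B} is internally cohesive (its only internal pair has the maximal equivalence of 1) but weakly tied to the rest of the network, so it would sit high on the density axis and low on the centrality axis.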
A strategic diagram is a visual representation of the structural organization of a particular research area. It is constructed by plotting centrality and density in a two-dimensional plot, where the horizontal axis refers to centrality and the vertical axis designates density. The origin of the graph is at the mean of the respective axis values. We can distinguish four types of research communities according to the quadrant in which a particular community appears:
1. Communities in quadrant I exhibit high centrality and high density. Such motor communities are both well developed and essential for a particular research area. They are connected externally to scientific concepts pertinent to other communities that are conceptually tightly connected.
2. Communities in quadrant II have low centrality and high density. Such communities have well-evolved internal links but unimportant external connections and so are of only minor significance for the particular scientific field. These communities are very specific but peripheral.
3. Communities in quadrant III exhibit low centrality and low density. They are both weakly developed and peripheral, and primarily represent either emerging or disappearing scientific themes.
4. Communities in quadrant IV show high centrality and low density. These communities are significant for a particular scientific field but are not well evolved. This quadrant thus combines transversal and general, basic research themes.
The importance of a MeSH term $i$ is characterized by its standardized within-community degree $z$, which is formally defined as
$$z = \frac{k_{is} - \bar{k}_s}{\mathrm{SD}_{k_s}},$$
where $k_{is}$ is the number of edges between MeSH term $i$ and other MeSH terms in community $s$, $\bar{k}_s$ is the average within-community degree of all MeSH terms in $s$, and $\mathrm{SD}_{k_s}$ is the corresponding standard deviation.
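The quadrant assignment and the standardized within-community degree can be illustrated with the following sketch. Several choices here are assumptions made for the example: values exactly at the mean are treated as "high", the population standard deviation is used for SD, a zero standard deviation yields z = 0, and the names `classify_quadrants` and `within_community_z` are hypothetical.

```python
from statistics import mean, pstdev


def classify_quadrants(themes):
    """Assign each theme to a strategic-diagram quadrant.

    themes: dict mapping a theme name to a (centrality, density) pair.
    The origin of the diagram is placed at the mean of each axis.
    """
    c0 = mean(c for c, _ in themes.values())
    d0 = mean(d for _, d in themes.values())
    labels = {}
    for name, (c, d) in themes.items():
        if c >= c0 and d >= d0:
            labels[name] = "I"    # motor themes: central and well developed
        elif c < c0 and d >= d0:
            labels[name] = "II"   # specialized but peripheral
        elif c < c0 and d < d0:
            labels[name] = "III"  # emerging or disappearing
        else:
            labels[name] = "IV"   # central but not well developed
    return labels


def within_community_z(degrees):
    """Standardized within-community degree z for every term in a community.

    degrees: dict mapping a term to its number of within-community edges.
    """
    values = list(degrees.values())
    mu, sd = mean(values), pstdev(values)
    return {t: (k - mu) / sd if sd > 0 else 0.0 for t, k in degrees.items()}


themes = {"t1": (4.0, 60.0), "t2": (1.0, 70.0),
          "t3": (1.0, 10.0), "t4": (6.0, 20.0)}
print(classify_quadrants(themes))
print(within_community_z({"a": 1, "b": 2, "c": 3}))
```

With the toy values, the axis means are 3.0 for centrality and 40.0 for density, so the four themes fall into quadrants I to IV in order; in the second call, the term with the average degree receives z = 0 and the best-connected term receives the largest positive z.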
5. Discussion
Human knowledge has been documented in detail by a huge amount of literature produced over several hundred years. Understanding how topics in the scientific literature evolve is an important and interesting research problem. In this paper we present a methodology for unveiling the historical evolution of the MEDLINE bibliographic database and report the results of an extensive evaluation of the proposed approach. This study provides a statistical map and the dynamics of the entire area of the life sciences from the point of view of co-occurring MeSH descriptors, community detection, and adapted co-word analysis. The evidence from this work demonstrates that (i) the MeSH descriptors approach is suitable for tracking the historical development of the life sciences and (ii) the evolution of MEDLINE knowledge clearly correlates with the network’s structural and temporal characteristics. In comparison to the body of existing research, this paper is novel in two respects: (i) we incorporate a longitudinal framework based on dynamic communities into the experimental design and (ii) we provide a Web interface for visualizing the evolution of research communities.
What can we conclude about the state of life-science research based on this evolution study? First, the biomedical domain evolves very briskly, as can be seen from the exponential growth in the number of publications in MEDLINE. Second, the cognitive extent is evolving nearly linearly, as is demonstrated by the linear trend of distinct MeSH terms over decades and by branched wiring. The distinction between exponential growth of publications and linear expansion of ideas was also stressed recently by Fortunato et al. [21] for the Web of Science database. The life sciences combine many diverse directions and research fields under their umbrella. This is evident from the high modularity across the years. High modularity indicates that MEDLINE knowledge is driven by some non-trivial self-organizing process. Moreover, this knowledge is also tightly integrated, as demonstrated by the relatively high clustering coefficients. We also observe that a particular research field needs a long time to establish its roots, including its base of knowledge and vocabulary in particular. We may speculate that a particular research field must establish a strong knowledge background before this is reflected in community structure. Networks also become more connected over time, with the average degree increasing. This is in line with Leskovec’s notion of densification [62]. At the same time the effective diameter as well as the average path length decrease as the network grows. This is in contrast to the traditional opinion that such distances should increase slowly as a function of the number of nodes [62].
Modern biomedical research stands on the shoulders of countless giants. In science, many research topics evolve from existing ones. We say that science is a cumulative process. Life science history contains a number of epochs, each dominated by certain themes. Theories and models that are reasonable when formulated are properly discarded when new experimental data and evidence make them untenable. There is no evidence for rejecting this assumption in our analysis. Using a very simple approach in which we draw a heat map of evolving communities for each timestamp, we observe that a community builds its existence on the basis of previous communities. Over the 50 years covered by the study, more and more co-occurrence links are generated among MeSH terms, indicating that knowledge grew steadily, at a practically linear rate. However, nascent or incoming themes are difficult if not impossible to detect using this approach.
Examining the dynamics of the MEDLINE network we found that new relations are established more frequently among existing MeSH descriptors than among newly established MeSH terms. This is in line with the well-known theory of structural holes [63], which states that structural holes are disconnected parts of a network that exist between strongly connected components of nodes. According to the interpretation of the theory of structural holes by Chen et al. [64], a novel scientific discovery is formed when a connection is created between two or more previously discordant pieces of scientific knowledge. Researchers therefore actively look for structural holes in existing knowledge as a way of finding new discoveries. This finding supports the idea that forthcoming discoveries are somehow constrained by the current topology of the knowledge network. For example, Uzzi et al. [65], based on an analysis of nearly 18 million scientific papers, revealed that science is driven by (mainly conventional) combinations of prior work. Moreover, the assessment of grant applications demonstrated that truly novel ideas are consistently given lower scores [21]. This finding diverges from mainstream theory on network evolution models, which states that the growth of the network is explained by a preferential attachment model. This model explains the growth process of a (real) network by adding connections one at a time, preferentially to nodes of high degree. However, we suggest an alternative mechanism that forms a link between disparate sets of scientific concepts. In this way we disregard a basic assumption of preferential attachment, namely that links added to the network are somehow aware of the degree status of every existing node [64]. The evolution of large hubs, which is typical for networks described by preferential attachment, is thus prevented, which is also evident from the degree distribution in Figure 3. Indeed, Broido and Clauset [57], contrary to popular belief, recently found that scale-free networks (which result from the preferential attachment process) are rare in nature.
Previous work in the area of network evolution has focused mainly on the exploration of the dynamics of co-authorship networks and on citation relationships. A large body of this work is also oriented towards qualitative analysis. To the best of our knowledge, we are the first to apply such an analysis to such a large part of MEDLINE and infer its structural and dynamic properties. We identified only one similar study: Siqueiros-García et al. [66] performed an analysis of the MeSH terms of about 50 000 papers from MEDLINE, but limited to genomics only. Although we use partially similar methods and techniques, our results are not directly comparable due to the different scope of the analysis. On the other hand, state-of-the-art research on topic evolution has mainly leveraged latent Dirichlet allocation (LDA) to extract topics from a collection of documents and predict future topics. Leydesdorff and Nerghes [67] argue that LDA models are currently more popular than models based on semantics or co-word analysis. Although LDA can effectively process massive amounts of data, it still struggles with the problem of output validation.
Several improvements are possible, and we address some methodological limitations of this study. First of all, we observed that some MeSH terms are too specific (e.g., Brain Neoplasms, Melanoma) and thus emerge as outliers in the mapping process. It may be promising to include only the very first layers of the MeSH hierarchy in the analysis. Our further preliminary work shows that the first two or three MeSH layers will suffice. This may result in a better and more complete map of community evolution. Secondly, Kostoff et al. [68] reported that a significant amount of information in abstracts is omitted from the MeSH terms. In addition, Westergaard et al. [69] recently demonstrated that text mining of full-text papers consistently outperforms using MEDLINE abstracts only. This suggests that a substantial part of relevant information may be missed with MeSH terms. Our study is based exclusively on MeSH terms, so it is premature to talk about the entire life sciences vocabulary. It is therefore worthwhile to repeat our work using the text words in titles, abstracts, and even full texts of biomedical papers. Moreover, a decade ago, statistical topic modeling was introduced as an approach to outline the contents of large document corpora. The methodology is now mature and it is worthwhile to include it in our further work. We may speculate that using this approach we could obtain a more fine-grained analysis of the structure of community evolution.
There are many directions for further work. First, we must focus on the problem of selecting (filtering) co-occurrence patterns more rigorously. Some relationships under the current settings are superfluous (e.g., co-occurrences among MeSH descriptors that are too specific) or are not meaningful from the semantic point of view. As described elsewhere [4], we will approach this problem by using UMLS resources. Using the UMLS Metathesaurus, each particular biomedical concept could be mapped to its semantic type. The UMLS Semantic Network could then be employed to provide a set of allowable links among these concepts. We assume that such an approach will significantly enhance the reliability of our analysis. Second, we mentioned previously in the Results section that the manual unified description of particular cluster content is a tedious, time-consuming, and expensive task. As such, manual cluster characterization represents the main limitation of the present methodology. To navigate the vast number of MeSH terms in each cluster, the optimal choice would be automatic text summarization (ATS). ATS is a promising methodology for helping researchers seeking information to obtain the ‘gist’ of a given topic by producing a textual summary from documents with minimal redundancy [70]. Our intention is also to incorporate temporal frequent pattern mining into the workflow. Using this approach it would be possible to observe the evolution of MeSH terms that are involved in shifting research topics. The presented approach to community detection relies only on pure co-occurrences between MeSH terms. Using recent approaches in biomedical knowledge discovery research it would be possible to take a semantic view into account. For example, using the SemRep [71] system, the relations among biomedical concepts could be characterized in a more semantically expressive way and with greater precision. Last but not least, our approach is descriptive and not (yet) predictive. The data presented provide an excellent dataset for further analysis and the forecasting of scientific trends.
6. Conclusions
In this paper we examined the topic evolution problem for large-scale scientific literature. We conducted an extensive evolution analysis in order to understand the development of the MEDLINE bibliographic database from 1966 until 2014. In our experiment on the MEDLINE network, we showed how communities could be exploited to examine the evolution of the network. This work highlights the current status, main points of research attraction, and trends in the field of biomedicine. The major result of this study is that the appearance of new MeSH terms does not imply new connections. The majority of new edges result from old MeSH descriptors. We suggest a wiring mechanism based on the theory of structural holes, according to which a novel scientific discovery is established when an edge is built between two or more previously disjunct areas of scientific knowledge. We showed that the historical evolution of biomedical knowledge correlates with the network’s structural and temporal features. Further work is needed in order to provide automatic summarization of the content of the derived communities and also to include more sophisticated statistical methods (e.g., frequent pattern mining) to discover possible universal and domain-specific patterns in the network.
Acknowledgement
We would like to thank Thomas C. Rindflesch for improving the quality of the paper. Without Tom’s help the manuscript would never have been finished. This work was conducted while the first author was supported by a postdoctoral grant from the Slovenian Research Agency (grant no. Z5-9352).
References
[1] D. J. Watts, S. H. Strogatz, Collective dynamics of ‘small-world’ networks, Nature 393 (6684) (1998) 440–442. doi:10.1038/30918.
[2] A.-L. Barabási, R. Albert, Emergence of scaling in random networks, Science 286 (5439) (1999) 509–512. doi:10.1126/science.286.5439.509.
[3] A. Kastrin, T. C. Rindflesch, D. Hristovski, Large-scale structure of a network of co-occurring MeSH terms: Statistical analysis of macroscopic properties, PLoS ONE 9 (7) (2014) e102188. doi:10.1371/journal.pone.0102188.
[4] A. Kastrin, T. C. Rindflesch, D. Hristovski, Link prediction on a network of co-occurring MeSH terms: Towards literature-based discovery, Methods of Information in Medicine 55 (4) (2016) 340–346. doi:10.3414/ME15-01-0108.
[5] B. T. Alako, A. Veldhoven, S. van Baal, R. Jelier, S. Verhoeven, T. Rullmann, J. Polman, G. Jenster, CoPub Mapper: Mining MEDLINE based on search term co-publication, BMC Bioinformatics 6 (1) (2005) 51. doi:10.1186/1471-2105-6-51.
[6] Z. S. Harris, Distributional structure, Word 10 (2-3) (1954) 146–162. doi:10.1080/00437956.1954.11659520.
[7] D. Hristovski, B. Peterlin, J. A. Mitchell, S. M. Humphrey, Using literature-based discovery to identify disease candidate genes, International Journal of Medical Informatics 74 (2) (2005) 289–298. doi:10.1016/j.ijmedinf.2004.04.024.
[8] M. E. Newman, J. Park, Why social networks are different from other types of networks, Physical Review E 68 (3) (2003) 036122. doi:10.1103/PhysRevE.68.036122.
[9] J. Kleinberg, S. Lawrence, The structure of the Web, Science 294 (5548) (2001) 1849–1850. doi:10.1126/science.1067014.
[10] D. M. Wilkinson, B. A. Huberman, A method for finding communities of related genes, Proceedings of the National Academy of Sciences of the United States of America 101 (suppl 1) (2004) 5241–5248. doi:10.1073/pnas.0307740100.
[11] S. Fortunato, Community detection in graphs, Physics Reports 486 (3) (2010) 75–174. doi:10.1016/j.physrep.2009.11.002.
[12] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, D. Parisi, Defining and identifying communities in networks, Proceedings of the National Academy of Sciences of the United States of America 101 (9) (2004) 2658–2663. doi:10.1073/pnas.0400054101.
[13] M. E. Newman, M. Girvan, Finding and evaluating community structure in networks, Physical Review E 69 (2) (2004) 026113. doi:10.1103/PhysRevE.69.026113.
[14] M. E. Newman, Modularity and community structure in networks, Proceedings of the National Academy of Sciences of the United States of America 103 (23) (2006) 8577–8582. doi:10.1073/pnas.0601602103.
[15] J. Leskovec, K. J. Lang, A. Dasgupta, M. W. Mahoney, Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters, Internet Mathematics 6 (1) (2009) 29–123. doi:10.1080/15427951.2009.10129177.
[16] J. M. Kleinberg, Authoritative sources in a hyperlinked environment, Journal of the ACM (JACM) 46 (5) (1999) 604–632.
[17] A. Lancichinetti, S. Fortunato, Community detection algorithms: A comparative analysis, Physical Review E 80 (5) (2009) 056117. doi:10.1103/PhysRevE.80.056117.
[18] P. Holme, J. Saramäki, Temporal networks, Physics Reports 519 (3) (2012) 97–125. doi:10.1016/j.physrep.2012.03.001.
[19] G. Palla, A.-L. Barabási, T. Vicsek, Quantifying social group evolution, Nature 446 (7136) (2007) 664–667. doi:10.1038/nature05670.
[20] A. Zeng, Z. Shen, J. Zhou, J. Wu, Y. Fan, Y. Wang, H. E. Stanley, The science of science: From the perspective of complex systems, Physics Reports 714-715 (2017) 1–73. doi:10.1016/j.physrep.2017.10.001.
[21] S. Fortunato, C. T. Bergstrom, K. Börner, J. A. Evans, D. Helbing, S. Milojević, A. M. Petersen, F. Radicchi, R. Sinatra, B. Uzzi, A. Vespignani, L. Waltman, D. Wang, A.-L. Barabási, Science of science, Science 359 (6379) (2018) eaao0185. doi:10.1126/science.aao0185.
[22] M. E. Newman, The structure of scientific collaboration networks, Proceedings of the National Academy of Sciences of the United States of America 98 (2) (2001) 404–409. doi:10.1073/pnas.98.2.404.
[23] M. E. Newman, Coauthorship networks and patterns of scientific collaboration, Proceedings of the National Academy of Sciences of the United States of America 101 (suppl 1) (2004) 5200–5205. doi:10.1073/pnas.0307545100.
[24] X. Sun, J. Kaur, S. Milojević, A. Flammini, F. Menczer, Social dynamics of science, Scientific Reports 3. doi:10.1038/srep01069.
[25] M. Perc, Self-organization of progress across the century of physics, Scientific Reports 3. doi:10.1038/srep01720.
[26] T. Kuhn, M. Perc, D. Helbing, Inheritance patterns in citation networks reveal scientific memes, Physical Review X 4 (4) (2014) 041036. doi:10.1103/PhysRevX.4.041036.
[27] A. Kastrin, T. C. Rindflesch, D. Hristovski, Evolution of MEDLINE bibliographic database: Preliminary results, in: 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), IEEE, 2016, pp. 644–646. doi:10.1109/ASONAM.2016.7752305.
[28] T. Martin, B. Ball, B. Karrer, M. Newman, Coauthorship and citation patterns in the Physical Review, Physical Review E 88 (1) (2013) 012814. doi:10.1103/PhysRevE.88.012814.
[29] A. Mazloumian, D. Helbing, S. Lozano, R. P. Light, K. Börner, Global multi-level analysis of the ‘Scientific Food Web’, Scientific Reports 3. doi:10.1038/srep01167.
[30] D. Chavalarias, J.-P. Cointet, Phylomemetic patterns in science evolution—the rise and fall of scientific fields, PLoS ONE 8 (2) (2013) e54847. doi:10.1371/journal.pone.0054847.
[31] K. Börner, Atlas of Science: Visualizing What We Know, MIT Press, 2010.
[32] K. Börner, Atlas of Knowledge: Anyone Can Map, MIT Press, 2015.
[33] M. Callon, A. Rip, J. Law, Mapping the Dynamics of Science and Technology: Sociology of Science in the Real World, Springer, 1986.
[34] X. Y. An, Q. Q. Wu, Co-word analysis of the trends in stem cells field based on subject heading weighting, Scientometrics 88 (1) (2011) 133–144. doi:10.1007/s11192-011-0374-1.
[35] S. Ravikumar, A. Agrahari, S. Singh, Mapping the intellectual structure of scientometrics: A co-word analysis of the journal Scientometrics (2005–2010), Scientometrics 102 (1) (2015) 929–955. doi:10.1007/s11192-014-1402-8.
[36] Y. Hong, Q. Yao, Y. Yang, J.-j. Feng, W.-x. Ji, L. Yao, Z.-y. Liu, et al., Knowledge structure and theme trends analysis on general practitioner research: A co-word perspective, BMC Family Practice 17 (1) (2016) 10. doi:10.1186/s12875-016-0403-5.
[37] D. He, X. Zhu, D. S. Parker, How does research evolve? Pattern mining for research meme cycles, in: 2011 IEEE 11th International Conference on Data Mining (ICDM), IEEE, 2011, pp. 1068–1073. 785
[38] D. M. Blei, A. Y. Ng, M. I. Jordan, Latent dirichlet allocation, Journal of Machine Learning research 3 (Jan) (2003) 993–1022. [39] G. Kossinets, D. J. Watts, Empirical analysis of an evolving social network, Science 311 (5757) (2006) 88–90. doi:10.1126/science.1116869. [40] O. Mryglod, Y. Holovatch, R. Kenna, B. Berche, Quantifying the evolution
790
of a scientific topic: Reaction of the academic community to the Chornobyl disaster, Scientometrics 106 (3) (2016) 1151–1166. doi:10.1007/s11192015-1820-2. [41] Y. Ding, Community detection: Topological vs. topical, Journal of Informetrics 5 (4) (2011) 498–514. doi:10.1016/j.joi.2011.02.006.
795
[42] D. Gruhl, R. Guha, D. Liben-Nowell, A. Tomkins, Information diffusion through blogspace, in: Proceedings of the 13th international conference on World Wide Web, ACM, 2004, pp. 491–501.
[43] D. Li, B. He, Y. Ding, J. Tang, C. Sugimoto, Z. Qin, E. Yan, J. Li, T. Dong, Community-based topic modeling for social tagging, in: Proceedings of the 19th ACM international conference on Information and knowledge management, ACM, 2010, pp. 1565–1568.
[44] D. Zhou, E. Manavoglu, J. Li, C. L. Giles, H. Zha, Probabilistic models for discovering e-communities, in: Proceedings of the 15th international conference on World Wide Web, ACM, 2006, pp. 173–182.
[45] Y. Liu, A. Niculescu-Mizil, W. Gryc, Topic-link LDA: Joint models of topic and author community, in: Proceedings of the 26th annual international conference on machine learning, ACM, 2009, pp. 665–672.
[46] D. Li, Y. Ding, X. Shuai, J. Bollen, J. Tang, S. Chen, J. Zhu, G. Rocha, Adding community and dynamic to topic models, Journal of Informetrics 6 (2) (2012) 237–253. doi:10.1016/j.joi.2011.11.004.
[47] N. P. Nguyen, T. N. Dinh, Y. Shen, M. T. Thai, Dynamic social community detection and its applications, PLoS ONE 9 (4) (2014) e91431. doi:10.1371/journal.pone.0091431.
[48] M. E. Newman, The structure and function of complex networks, SIAM Review 45 (2) (2003) 167–256. doi:10.1137/S003614450342480.
[49] H. J. Lowe, G. O. Barnett, Understanding and using the medical subject headings (MeSH) vocabulary to perform literature searches, JAMA 271 (14) (1994) 1103–1108.
[50] A. Kastrin, B. Peterlin, D. Hristovski, Chi-square-based scoring function for categorization of MEDLINE citations, Methods of Information in Medicine 49 (04) (2010) 371–378.
[51] M. Kaiser, Mean clustering coefficients: The role of isolated nodes and leafs on clustering measures for small-world networks, New Journal of Physics 10 (8) (2008) 083042. doi:10.1088/1367-2630/10/8/083042.
[52] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, E. Lefebvre, Fast unfolding of communities in large networks, Journal of Statistical Mechanics: Theory and Experiment 2008 (10) (2008) P10008. doi:10.1088/1742-5468/2008/10/P10008.
[53] K. Konstantinidis, S. Papadopoulos, Y. Kompatsiaris, Exploring Twitter communication dynamics with evolving community analysis, PeerJ Computer Science 3 (2017) e107. doi:10.7717/peerj-cs.107.
[54] P. Jaccard, The distribution of the flora in the alpine zone, New Phytologist 11 (2) (1912) 37–50. doi:10.1111/j.1469-8137.1912.tb05611.x.
[55] D. Greene, D. Doyle, P. Cunningham, Tracking the evolution of communities in dynamic social networks, in: Proceedings of the 2010 international conference on Advances in social networks analysis and mining (ASONAM), IEEE, 2010, pp. 176–183.
[56] M. Newman, A.-L. Barabasi, D. J. Watts, The structure and dynamics of networks, Princeton University Press, Princeton, NJ, 2011.
[57] A. D. Broido, A. Clauset, Scale-free networks are rare (2018). arXiv:1801.03400.
[58] M. Herrera, D. C. Roberts, N. Gulbahce, Mapping the evolution of scientific fields, PLoS ONE 5 (5) (2010) e10355. doi:10.1371/journal.pone.0010355.
[59] N. Choudhury, S. Uddin, Time-aware link prediction to explore network effects on temporal knowledge evolution, Scientometrics 108 (2) (2016) 745–776. doi:10.1007/s11192-016-2003-5.
[60] S. Emmons, S. Kobourov, M. Gallant, K. Börner, Analysis of network clustering algorithms and cluster quality metrics at scale, PLoS ONE 11 (7) (2016) e0159161.
[61] M. Callon, J.-P. Courtial, F. Laville, Co-word analysis as a tool for describing the network of interactions between basic and technological research: The case of polymer chemistry, Scientometrics 22 (1) (1991) 155–205. doi:10.1007/BF02019280.
[62] J. Leskovec, J. Kleinberg, C. Faloutsos, Graphs over time, in: Proceeding of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining - KDD ’05, ACM Press, New York, New York, USA, 2005, pp. 177–187. doi:10.1145/1081870.1081893.
[63] R. S. Burt, Structural holes and good ideas, American Journal of Sociology 110 (2) (2004) 349–399. doi:10.1086/421787.
[64] C. Chen, Y. Chen, M. Horowitz, H. Hou, Z. Liu, D. Pellegrino, Towards an explanatory and computational theory of scientific discovery, Journal of Informetrics 3 (3) (2009) 191–209. doi:10.1016/j.joi.2009.03.004.
[65] B. Uzzi, S. Mukherjee, M. Stringer, B. Jones, Atypical combinations and scientific impact, Science 342 (6157) (2013) 468–472. doi:10.1126/science.1240474.
[66] J. M. Siqueiros-García, E. Hernández-Lemus, R. García-Herrera, A. Robina-Galatas, Mapping the structure and dynamics of genomics-related MeSH terms complex networks, PLoS ONE 9 (4) (2014) e92639. doi:10.1371/journal.pone.0092639.
[67] L. Leydesdorff, A. Nerghes, Co-word maps and topic modeling: A comparison using small and medium-sized corpora (N < 1,000), Journal of the Association for Information Science and Technology 68 (4) (2017) 1024–1035. doi:10.1002/asi.23740.
[68] R. N. Kostoff, J. A. Block, J. A. Stump, K. M. Pfeil, Information content in Medline record fields, International Journal of Medical Informatics 73 (6) (2004) 515–527. doi:10.1016/j.ijmedinf.2004.02.008.
[69] D. Westergaard, H.-H. Stærfeldt, C. Tønsberg, L. J. Jensen, S. Brunak, A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts, PLoS Computational Biology 14 (2) (2018) e1005962. doi:10.1371/journal.pcbi.1005962.
[70] R. Mishra, J. Bian, M. Fiszman, C. R. Weir, S. Jonnalagadda, J. Mostafa, G. Del Fiol, Text summarization in the biomedical domain: A systematic review of recent research, Journal of Biomedical Informatics 52 (2014) 457–467. doi:10.1016/j.jbi.2014.06.009.
[71] T. C. Rindflesch, M. Fiszman, The interaction of domain knowledge and linguistic structure in natural language processing: Interpreting hypernymic propositions in biomedical text, Journal of Biomedical Informatics 36 (6) (2003) 462–477. doi:10.1016/j.jbi.2003.11.003.
Highlights
Scientific knowledge constitutes a dynamic complex system
We analyzed the dynamic properties and growth principles of the MEDLINE
Novel discovery is usually established among disconnected parts of knowledge
Evolution of MEDLINE correlates with the structural and temporal characteristics