Discovering discoveries: Identifying biomedical discoveries using citation contexts

Journal of Informetrics 11 (2017) 46–62 Contents lists available at ScienceDirect Journal of Informetrics journal homepage: www.elsevier.com/locate/...


Journal of Informetrics journal homepage: www.elsevier.com/locate/joi

Regular article

Henry Small a,∗, Hung Tseng b,1, Mike Patek c

a SciTech Strategies, Inc., 105 Rolling Road, Bala Cynwyd, PA 19004, USA
b National Institute of Arthritis and Musculoskeletal and Skin Diseases, National Institutes of Health, 6701 Democracy Boulevard, Bethesda, MD 20892, USA
c SciTech Strategies, Inc., 58 Russell Street, Keene, NH 03431, USA

Article info

Article history: Received 25 August 2016; received in revised form 12 October 2016; accepted 6 November 2016.

Keywords: Discovery; Biomedicine; Citation contexts; Citances; Machine learning; PubMed Central

Abstract

A procedure for identifying discoveries in the biomedical sciences is described that makes use of citation context information, or more precisely citing sentences, drawn from the PubMed Central database. The procedure focuses on use of specific terms in the citing sentences and the joint appearance of cited references. After a manual screening process to remove non-discoveries, a list of over 100 discoveries and their associated articles is compiled and characterized by subject matter and by type of discovery. The phenomenon of multiple discovery is shown to play an important role. The onset and timing of recognition of the articles are studied by comparing the number of citing sentences with and without discovery terms, and show both early onset and delays in recognition. A comparative analysis of the vocabularies of the discovery and non-discovery sentences reveals the types of words and concepts that scientists associate with discoveries. A machine learning application is used to efficiently extend the list. Implications of the findings for understanding the nature and justification of scientific discoveries are discussed.

© 2016 Elsevier Ltd. All rights reserved.

∗ Corresponding author. E-mail addresses: [email protected] (H. Small), [email protected] (H. Tseng), [email protected] (M. Patek).
1 Hung Tseng’s views expressed here are personal and do not represent those of the NIAMS/NIH.
http://dx.doi.org/10.1016/j.joi.2016.11.001
1751-1577/© 2016 Elsevier Ltd. All rights reserved.

1. Introduction

Discoveries are the sine qua non of science. They are how scientists learn new things about the world, make sense of reality, and advance the boundaries of the known into the realm of the unknown. They are also what scientists strive to make, what advances their careers and status, and what they fight over for priority (Merton, 1957). Discoveries can be solutions to known problems or solutions to problems that only later become manifest. But not every discovery pans out. Like polywater, cold fusion, or N-rays, some fail to gain acceptance and fall by the wayside. However, some are so compelling that they command immediate assent and even astonishment, like Archimedes jumping out of the bath tub exclaiming “Eureka!” (Koestler, 1964, p. 106) or Jim Watson saying that the double helix model of DNA is too beautiful not to be true (1968, p. 205).

But what are the hallmarks of scientific discovery and how do we know when a discovery is made? Philosophers of science have drawn the distinction between the context of discovery and the context of justification (Losee, 1972, p. 115; Reichenbach, 1949). The context of discovery, or nascent moment as Holton (1973, p. 17) calls it, can be governed by chance, erroneous information, and even dreams. The context of justification, on the other hand, is where

cooler heads must evaluate cold facts. Popper (1959, p. 31) argued that philosophy can only deal with the latter period where hypotheses are put to the most stringent tests. The crucial point is that discovery is more than just an insight, inspiration, or lucky guess. It must also pass some initial threshold of justification and survive a process of ongoing challenges. This second stage may involve corroboration, confirmation by others, and demonstrating consistency with existing experiment and theory (Stent, 1972). In this paper we will be primarily concerned with the context of justification, not the “aha” moment of initial inspiration, and with the process by which the scientific community comes to label a finding as a “discovery”, although in the end we will call the separation of these contexts into question.

Kuhn (1962, p. 52) said that discovery is not possible without a paradigm which sets our expectations. When an expectation is violated, a problem is born which we can then attempt to solve. The solution to the anomaly may require a revolution or revamping of our understanding, which then opens up new questions. Problems that arise within the context of a paradigm are called puzzles. When DNA became recognized as critical to inheritance (Dubos, 1976), the natural question arose “What is the molecular structure of DNA and how does it enable inheritance?” When questions crystallize within a community, a competition among scientists can ensue to find a solution. Of course, the recognition of an unsolved problem or open question requires a deep understanding of the current state of knowledge. Scientists may even lack the framework to ask a question such as “How does gravity affect time?”, a question which would be unlikely to come up without relativity theory. Thus, earlier discoveries can set the stage for later problems and discoveries. As Olby (1974, p. 426) said about Watson and Crick’s double helix structure of DNA, it was not just that it fit with the known facts about DNA but that it opened up new questions and set the framework for future work. This has been variously described as the fruitfulness of a theory (Kuhn, 1977, p. 322). Others have attempted to model the discovery process in computer programs, conceiving all problems as puzzles whose solutions could be found by some kind of heuristic search (Langley, Simon, Bradshaw, & Zytkow, 1987).

Another research tradition sees discovery as the finding of novel combinations. In 1964 Arthur Koestler introduced his ideas on “bisociation”, the joining of two frames of reference to arrive at a novel synthesis. Swanson’s work (1986) is in a similar vein, involving the connecting of previously unconnected areas of biomedical knowledge (or, more accurately, indirectly connected areas) to gain new knowledge. More recently Foster, Rzhetsky and Evans (2015) explore scientists’ problem choices and show that risky choices that pay off result in greater recognition than conservative choices that remain within the paradigm. They operationalize this on a network of chemical entities that have been connected in article abstracts and look for novel combinations: the riskier and more unlikely the combination, the greater the surprise and the reward.

Perhaps all discoveries involve a degree of surprise, and the source of this may be the unexpected convergence between the conjecture and the evidence, or, as Ziman (1968, p. 48) describes it, the falsification of a preconceived or vague notion. However, because all new knowledge is tentative and subject to revision, it can take a period of time and contributions by many researchers until the initial conjecture comes to be regarded as a “discovery” by the community. Hence all discoveries are retrospective designations even though some lags may be very short and others quite long.
Despite efforts to construct a theory, the ability to systematically identify discoveries has remained elusive. No comprehensive inventory has been created. The usual approach is to rely on the pronouncements and press releases of scientists themselves or their interpretation by science writers. It would appear at first glance that citation analysis, where we can observe the impact of scientific articles over time, is an ideal tool for identification, and indeed simple citation counts do identify many scientific discoveries (Garfield, 1979). However, there are numerous reasons for citation, and highly cited lists tend to be dominated by methods, reviews and data compilations (see footnote 2). Thus, simple citation counts do not provide enough information for a definitive identification. In this paper we propose a method that augments citation counting with the language used by citing authors, namely words that explicitly label referenced items as discoveries. This method, together with machine learning to omit false positives, greatly improves our ability to automate discovery identification. Once an accurate list is in hand we can begin to work backwards to find common characteristics that can shed light on the nature of discovery.

2. Data and methods

Fortunately, we are now gaining access to an expanding corpus of machine readable scientific articles in full text, and we can use this resource to study the contexts in which articles are cited, so-called citation context analysis. An important source of curated full text for the analysis of scientific papers is PubMed Central® (PMC). This open repository was created in 2000 and includes papers that were required to be publicly available under the National Institutes of Health public access policy and legislative mandates. We limited our study of biomedical discoveries to the full text from PubMed Central called the “open access subset”. This subset includes 1.1 million full texts of primarily biomedical articles, covering publications mainly in the most recent period but with some coverage extending back several decades. The oldest article found in the subset was from 1896, but 90% of articles are from the last 13 years (counting through mid-2015). Over the time period, the coverage rapidly expanded from 4500 articles in 2000 to about 200,000 in 2014.

2 Of the 100 most cited articles in PubMed Central, only about four are discoveries; the remaining 96 are methods, reviews or data compilations.


In addition, the references cited by these source papers have been captured, and PMC processing adds codes to the references that allow the user to connect the reference within the text to the bibliographic information at the end of the article, as well as, in most cases, providing a unique article identifier for each reference, the “PubMed ID”, which can be used to find the item in the National Library of Medicine’s PubMed database. The references, of course, span a much wider time period than do the source articles from which they are taken, going back to 1980 and earlier. However, about 56% of references come from the last 13 years.

The PMC “open access subset” obtained from NLM was downloaded and formatted for loading into a MySQL database. The subset included data up through mid-2015. In addition to the bibliographic information for each article, all reference lists from the articles were parsed, as well as all sentences from each of the full texts, including sentences where references appeared. Roughly 38 million references from the articles were loaded into the MySQL database, or 34 references per article, along with 166 million sentences, or 149 sentences per article. About 19% of these sentences contain one or more references.

Our objective was to compile a list of biomedical discoveries and the cited articles associated with those discoveries. To ascertain whether a cited article was a discovery, the citing sentences were searched for the string “*discover*” (where * denotes a wild card), which includes terms such as “discovery”, “discover”, and “discovered”. We will call these “discovery words”. A citing sentence containing “discovery words” will be called a “discovery citance”. A citance is defined as a single sentence from full text that contains one or more references (Nakov, Schwartz & Hearst, 2004), in contrast to the more general term, citation context, which may refer to more extended text around a given reference (Small & Klavans, 2011).
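The wildcard search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper ran the search against a MySQL database, whereas the sketch uses a regular expression; case-insensitive matching, the function name `is_discovery_citance`, and the toy sentences and reference ids are all hypothetical.

```python
import re

# Stand-in for the "*discover*" wildcard; case-insensitivity is an assumption.
DISCOVERY_PATTERN = re.compile(r"discover", re.IGNORECASE)


def is_discovery_citance(sentence, reference_ids):
    """Return True if the sentence cites at least one reference
    and contains a discovery word."""
    return bool(reference_ids) and DISCOVERY_PATTERN.search(sentence) is not None


# Toy sentences with hypothetical reference ids (not taken from PMC).
sentences = [
    ("MicroRNAs were first discovered in C. elegans [1].", ["ref-A"]),
    ("We applied a false discovery rate threshold of 0.05 [2].", ["ref-B"]),
    ("The samples were stored at 4 degrees.", []),
    ("This rediscovery prompted new work [3, 4].", ["ref-C", "ref-D"]),
]
flags = [is_discovery_citance(text, refs) for text, refs in sentences]
print(flags)
```

The second toy sentence matches even though it reports no discovery, which is the kind of false association that a purely lexical search cannot rule out on its own.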
Earlier studies of citation contexts were focused on categorizing the attitudes or sentiments of citing authors toward cited works (Moravcsik & Murugesan, 1975), while others were concerned with characterizing their types and content (Small, 1982). More recently Radev and Abu-Jbara (2012) have outlined a variety of applications for citation contexts including the characterization of discoveries in scientiﬁc ﬁelds. Our study builds on earlier research on citation contexts that viewed cited documents as shared symbols for speciﬁc methods or theories (Small, 1978). Hence in this study we seek to identify articles that have come to symbolize discoveries for a number of citing authors and thus reﬂect a shared deﬁnition of what constitutes a discovery. These earlier studies suffered from the lack of context data in electronic form, and precluded performing the kind of comprehensive analysis reported here. With the growing availability of full text databases, however, the information and computer sciences communities are now making increasing use of citation context and citance data (Teufel, 2010), and the PMC database in particular is being used as a test bed for various analyses, for example, using contexts to enhance information retrieval (Liu et al., 2014). We use citances rather than longer text excerpts so that the association of the discovery words and the references cited will be as close as possible. This strategy can still fail to ﬁnd associations between the discovery words and a reference if they occur in separate sentences, or, conversely, create a false association if the reference and the discovery words occurring in the same sentence are semantically unrelated. These problems can be compensated for to some extent by requiring that the association of discovery words and speciﬁc references occur in multiple citances. For this analysis the minimum number of discovery citances was set to 20. 
Later on we will dip below the threshold of 20 in a machine learning experiment. This second set will consist of 100 additional articles going down to a threshold of 16 discovery citances.

About 0.4% of the sentences having references contained “discovery words”, compared to 0.2% of sentences without references, suggesting that discoveries are more often than not associated with specific articles. While extending the search to sentences not containing references would perhaps have allowed us to find discoveries not associated with specific articles, this task was set aside for future study.

The search for discovery citances in the PMC subset yielded around 126,000 citing sentences associated with one or more cited references. Because an individual citing sentence can contain more than one reference, the number of citance-reference pairs expands to about 238,000 records, that is, nearly two references per citance. Roughly 50,000 (21%) of these references were not given a PubMed identifier in the PMC subset, and were dropped from the analysis. This left 188,000 citance-reference pairs for which full article information was readily available. By summarizing these data on the cited article and counting the number of discovery citances for each article, a list of roughly 126,000 articles was generated and ranked from the most to the fewest citances. Applying the cutoff of 20 citances per article, a list of 293 potential discovery articles was obtained.

In this paper we will describe a series of exploratory analyses focused on identifying and characterizing discoveries with the help of citance data. First we will describe how discoveries are manually selected from the list of 293 articles. Then we will discuss the list of selected discoveries and their topic scope. We will show the role that multiple discovery plays in the list and how this is revealed by co-citation within citances. This is followed by an attempt to classify the discoveries by type. The timing of discovery recognition is addressed in the next section, followed by a word analysis of the citances to reveal associated concepts. Finally, a machine learning experiment is carried out to see if the manual process of discovery identification can be automated.
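The counting and cutoff steps described in this section can be sketched as follows. This is a minimal illustration, not the authors' MySQL procedure: the function name `rank_candidate_articles`, the pair layout, and the toy data are hypothetical, and counting each citance at most once per article is an assumption.

```python
from collections import Counter


def rank_candidate_articles(pairs, min_citances=20):
    """Count discovery citances per cited article and apply a cutoff.

    pairs: iterable of (citance_id, pubmed_id) tuples, where pubmed_id
    is None when the reference carried no PubMed identifier.
    Returns (pubmed_id, count) tuples ordered from most to fewest citances.
    """
    counts = Counter()
    seen = set()
    for citance_id, pubmed_id in pairs:
        if pubmed_id is None:
            continue  # references without an identifier are dropped
        if (citance_id, pubmed_id) in seen:
            continue  # assumption: one count per citance per article
        seen.add((citance_id, pubmed_id))
        counts[pubmed_id] += 1
    return [(pmid, n) for pmid, n in counts.most_common() if n >= min_citances]


# Toy data with a lowered cutoff so the example stays small.
toy_pairs = [(1, "A"), (1, "B"), (2, "A"), (3, None), (4, "A"), (4, "A"), (5, "B")]
candidates = rank_candidate_articles(toy_pairs, min_citances=2)
print(candidates)
```

With the default `min_citances=20` the same function corresponds to the cutoff used for the main list; a lower value such as 16 would correspond to the extended set mentioned for the machine learning experiment.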


3. Results

3.1. Manual screening

Of course this relatively crude search for discovery terms does not guarantee that the retrieved sentence discusses a scientific discovery, and a second step is needed to screen out cases where the discovery words were being used in some other way. The screening process was done manually by retrieving and scanning a sample of citances for each cited article, since at least 20 sentences were available per article. Only a small number of citances had to be scanned for each article because of the uniform or repetitive nature of the language used by citing authors (Small, 1978). If one or more citances explicitly identified the referenced article as a discovery, it was coded as such. Surprisingly, however, only about 135 of the 293 articles (46%) meeting the threshold of 20 could be classified as scientific discoveries. The majority of cases fell under the broad rubric of methods or tools designed to assist in the discovery process. Prominent among these were databases and computational algorithms for gene or drug discovery whose citances included discovery words but did not indicate that a discovery was actually made. For example, the article with the largest number of discovery citances concerned a tool called “Database for Annotation, Visualization and Integrated Discovery”. The second highest ranked article was a method for calculating False Discovery Rate statistics. The third ranked article, however, was the type of item we were seeking, namely the first identification of a micro-RNA in 1993. Micro-RNAs soon became one of the dominant themes in the list. However, not until the ninth ranked article did we find another actual discovery, namely RNA-interference from 1998. The authors of this article were awarded the Nobel Prize in 2006. While fewer than one-half of the articles turned out to be scientific discoveries, the importance of those discoveries for biomedicine provided some reassurance.
Eleven of the 135 (8%) discoveries identified turned out to have been awarded the Nobel Prize. A potential ambiguity in the designation of scientific discovery versus discovery method comes about when application of a method resulted in an actual discovery, for example, when the method of the genome-wide association study (GWAS) was successfully applied to discover several genetic loci for type-2 diabetes. In such cases where methods or tools led to specific discoveries, the article was coded as a discovery. Fortunately only a few cases presented this difficulty.

Another surprise was finding two physical science discoveries, specifically graphene and high temperature superconductivity. The PMC database has no stated subject matter scope, only the stipulation that the work was supported by Federal funding; hence physical science papers can be included. There is also some indication that graphene is playing a role in some biomedical applications such as biosensors. It is, however, evident from the list of discoveries (see the Appendix) that the predominant focus of the PMC is molecular genomics and especially the role of DNA and RNA in disease.

3.2. The discovery list

A listing of the top 128 biomedical discoveries and their associated articles is ranked by the number of “discovery citances” (see Appendix and footnote 3). The publication years of the articles range from 1950 to 2012. In addition to eliminating the methods and tools articles, the list also excludes the two physical science articles noted above, four review articles which review discoveries but do not announce them, and one database description. For each discovery the Appendix shows the number of “discovery citances” and the percentage of discovery citances of the total citances for each article retrieved from the PMC database. For example, for the first article listed in the Appendix dealing with the first identification of a micro-RNA, the percentage of “discovery citances” is 38%.
This compares to an overall average for the list of 13%. The total number of citances, or citing sentences, for an article can be thought of as a citation count which has been weighted by how often the article is cited within each citing text. The listing also includes the year the article was published, a brief description of the discovery, its primary author, and the article title (full bibliographic information is available online). The last column characterizes the type of discovery which will be described below. The ﬁrst column is a unique id number which we will refer to in the discussion. A number of prominent themes emerge on inspection of the list which reﬂect both the special nature of the PMC database as well as focal points of modern biomedical research (Table 1). Micro-RNAs and RNA research, in general, are the most prominent themes, comprising about 13% of the discoveries. Topics include the growing identiﬁcation of micro-RNAs performing regulatory functions in gene expression and development, their use as disease markers, and their role in RNA interference or gene silencing. Other RNA-related work includes the identiﬁcation of large non-protein-coding RNAs, pre-messenger RNA splicing, “interrupted genes”, and extra cellular RNA or shuttle RNA. Another prominent theme is cancer genetics, including mutations, and a number of genetic defects linked with the development of cancer. Subtopics include the study of oncogenes, gene fusions and mutation accumulation in cancer. If we broaden this theme to include all diseases with an apparent genetic basis, not just cancer, this would perhaps be the most prevalent theme in the list, including Crohn’s and Parkinson’s diseases, and type-2 diabetes with ﬁve discoveries on the list. Obesity and dementia also have genetic components and are classiﬁed elsewhere. A third broad theme is the role of viruses in disease, particularly in cancers, where we ﬁnd seven discoveries. 
3 The full list of discoveries with bibliographic information is available online.

Table 1. The discoveries listed in the Appendix are categorized by topic, giving the number of discoveries for each topic and the id numbers of the corresponding entries in the Appendix.

Topic | Discoveries | IDs (see Appendix)
Micro-RNA | 13 | 1, 2, 3, 5, 11, 13, 18, 20, 25, 73, 75, 123, 124
RNA general | 5 | 24, 40, 49, 117, 119
Cancer genetics | 13 | 22, 26, 30, 41, 45, 57, 64, 70, 98, 107, 110, 112, 115
Genetic diseases (non-cancer) | 7 | 61, 67, 79, 84, 93, 103, 113
Cancer viruses | 7 | 8, 28, 34, 36, 42, 81, 92
Disease viruses (non-cancer) | 12 | 10, 17, 31, 37, 60, 63, 69, 72, 85, 94, 109, 121
Stem cells | 5 | 4, 7, 23, 106, 127
Neurobiology | 12 | 14, 16, 43, 46, 52, 54, 58, 76, 78, 86, 114, 126
Regulatory molecules/receptors | 11 | 29, 35, 44, 48, 74, 88, 90, 96, 99, 105, 118
DNA sequencing, genes | 11 | 19, 27, 33, 56, 71, 80, 82, 87, 91, 102, 122
Copy number variations | 5 | 47, 50, 51, 62, 128
Immunology | 5 | 68, 95, 104, 108, 125
Microbiology | 4 | 38, 39, 53, 116
Epigenetics | 4 | 9, 12, 83, 120
Obesity/metabolism | 7 | 6, 15, 21, 66, 77, 100, 111
Other | 7 | 32, 55, 59, 65, 89, 97, 101

Non-cancer disease viruses include new respiratory viruses with six entries, work on HIV/AIDS with four entries, and two for

giant DNA viruses. Stem cell research, including the newer induced pluripotent stem cells, is well represented with five discoveries (Chen, Hu, Liu, & Tseng, 2012). Neurobiology has a very strong showing with discoveries of mirror neurons, grid and place cells, synaptic plasticity, and peptides that regulate sleep. Another strong category consists of receptors and regulatory molecules. This includes several articles on toll-like receptors, the second estrogen receptor, cannabinoid receptors and three interleukins. A large, but somewhat diffuse, group deals with various aspects of DNA structure and sequencing including the classic discoveries of the double helix of Watson and Crick, “jumping genes” by McClintock, the initial draft of the human genome sequence, telomerase, and ribozymes. Smaller categories of discoveries deal with copy number variations in the genome, immunology, microbiology, epigenetics, specifically Tet family proteins, and aspects of obesity and metabolism, including peroxisomal proliferating-activated receptors and irisin. The immunology category includes the well-known CRISPR finding of a new prokaryotic immune mechanism which later led to gene editing. Only about seven discoveries do not fall into any of the above categories, and although there is no overarching theme that connects all of them, there is a strong emphasis on processes at the cellular and sub-cellular level. One item on this list classified in the cancer virus category (id 42) is a retracted article concerning an alleged virus associated with prostate tumors, and is hence a false discovery.

3.3. Multiple discoveries

While we have listed each discovery separately in the Appendix, it would be a mistake to regard them as independent entities. In fact, citing authors see many of these as closely related and dependent on one another.
Our description of subject matter groupings above can be replicated to some extent by an analysis of their co-citation patterns within individual citing sentences. There are a total of 6574 citances that cite two or more of the 128 discoveries, and 42 citances cite ﬁve or more. Such a within-context co-citation analysis has been shown to have higher accuracy than the standard within-article co-citation (Boyack, Klavans & Small, 2013). For example, when a micro-RNA discovery article is cited in a speciﬁc citing sentence, very often other micro-RNA discovery articles are cited. In the highest co-citing citances, citing ﬁve or more discoveries, the main topics are micro-RNAs, new respiratory viruses, genes for type-2 diabetes, disease-genes connections found by genome wide association studies, copy number variations, and human genome sequencing. It is also evident that a number of topics repeat. These can be independent co-discoveries if they are published in the same year, or follow-on discoveries that build on or extend the original discovery if they appear in successive years. Examples of follow-on discoveries are the creation of induced pluripotent stem cells ﬁrst in mice (id 4) and then in humans the following year (id 7). Also, mirror neurons were posited in 1992 (id 14) and conﬁrmed and posited in humans in 1996 (id 16). We can study why discovery articles are cited together by examining the co-citing sentences. For example, the most highly co-cited pair involves the creation of human induced pluripotent stem cells (ids 7 and 23). This discovery occurred nearly simultaneously by two groups, to quote: “Human iPS cells were ﬁrst independently produced by Yamanaka’s and Thomson’s groups from human ﬁbroblasts in late 2007.” (Ye, Swingen & Zhang, 2013). In another independent co-discovery, the two aptamers articles on the list (ids 89 and 101) were published in 1990 by two independent labs (Bai, Wang, Hargis, Lu, & Li, 2012). 
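The within-citance co-citation counts discussed above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function name `cocitation_within_citances`, the input layout, and the toy citances are hypothetical (only the ids 7, 23, 13, 18 and 25 echo entries mentioned in the text).

```python
from collections import Counter
from itertools import combinations


def cocitation_within_citances(citances, discovery_ids):
    """Count how often pairs of discovery articles share a citing sentence.

    citances: iterable of collections of cited article ids, one per sentence.
    discovery_ids: set of ids on the discovery list.
    Returns (pair_counts, n_multi), where n_multi is the number of
    citances citing two or more listed discoveries.
    """
    pair_counts = Counter()
    n_multi = 0
    for cited in citances:
        hits = sorted(set(cited) & discovery_ids)
        if len(hits) >= 2:
            n_multi += 1
            pair_counts.update(combinations(hits, 2))
    return pair_counts, n_multi


# Toy citances; ids 99 and 100 stand for articles not on the list.
discovery_ids = {7, 13, 18, 23, 25}
toy_citances = [[7, 23], [7, 23, 99], [13, 18, 25], [7], [99, 100]]
pair_counts, n_multi = cocitation_within_citances(toy_citances, discovery_ids)
print(pair_counts.most_common(1), n_multi)
```

Sorting the ids before forming pairs keeps each pair in a single canonical order, so (7, 23) and (23, 7) are never counted separately.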
Based on co-cited articles having the same publication year and different authors, we estimate that about one-third of the articles on the list are in fact part of independent multiple discoveries, a phenomenon studied by Merton (1963). Independent discoveries involving three or more separate groups are prominent, for example, for a new class of micro-RNAs in C. elegans (ids 13, 18 and 25): “In 2001, three groups published independent reports on the discovery of a new class of small non-coding RNAs (sRNAs), which were named micro-RNAs (miRNAs).” (Agirre & Eyras, 2011). Also, three groups (ids 22, 64 and 107) found that a mutation in the epidermal growth factor receptor made non-small-cell lung cancer responsive


to drug treatment (Carcereny et al., 2015). One of the biggest multiples involved five groups as indicated by this context: “In the spring of 2005, almost simultaneously, five reports appeared describing a mutation in the gene coding for JAK2... in patients with chronic myeloproliferative diseases.” (Bennett & Stroncek, 2006). Three of the relevant articles were on the list (ids 57, 98 and 110).

3.4. Characterization of discoveries

Another type of question concerns the nature of these discoveries, whether paradigm breaking in Kuhn’s sense, or puzzle solving within the context of normal science. This is a difficult task since it involves knowing the state of knowledge prior to and following each discovery, and some way of gauging the novelty and unexpected character of the findings (Foster, Rzhetsky & Evans, 2015). Each article was categorized as one of three types (violation, innovation or extension) based on a reading of the discovery articles themselves.

An article was labeled as a “violation” if the finding seemed to run counter to, or violate, the then prevalent view, and the authors discussed alternative views or opinions held by others. This need not break a paradigm, but it should necessitate an adjustment to it. By this criterion, the Watson and Crick double helix (id 19) was classified as a “violation” because other authors were cited who espoused a triple helix structure of DNA, among them Linus Pauling (Olby, 1974, p. 376). An article was coded as an “innovation” if the finding was unexpected, marked a major leap in understanding, provided a new direction in research, but did not violate the then accepted opinion. For example, the discovery of RNA-interference in C. elegans (id 2) was categorized as an “innovation”, because it was a new phenomenon caused by a double-stranded RNA, and not merely an extension of the 1993 discovery of the first micro-RNA in the same organism (id 1).
The category of “extension” was assigned to articles that built on similar earlier discoveries or were new examples of entities that followed earlier models. For example, the ﬁnding of a second micro-RNA in C. elegans in 2000 (id 3) that also controlled worm development was an extension of the earlier 1993 ﬁnding. The three categories should not be thought of as separate boxes but rather as degrees on a scale of unexpectedness or surprise, with violations the most surprising and extensions the least. From this reading only 16 of 128 articles appear to be “violations” or counter to paradigms. For example, the article that identiﬁed the ﬁrst micro-RNA in 1993 (id 1) was coded a “violation” because the function of the RNA controlling the worm’s developmental staging ran counter to the then expected function of RNA which was to code a protein, according to the “central dogma” in molecular biology (Olby, 1974, p. 432). Similarly, the discovery of gene product leptin (id 6) which physiologically controls body weight was coded a “violation” because it ran counter to the idea that obesity was psychologically determined. The discovery of circulating endothelial progenitor cells (id 32) ran counter to the idea that new blood vessels could only arise from fully differentiated endothelial cells derived from preexisting blood vessels. The discovery of grid cells (id 46) in a particular brain region that allowed animals to navigate in space ran counter to the accepted notion that this function resided in the hippocampus. The discovery of Archaea (id 53) that can live in cold water by aerobically oxidizing ammonia to nitrite violated the idea that Archaea were extremophiles only living in extreme ocean environments such as hydrothermal vents. These and other instances of “violations” have in common some ﬁnding that breaks an existing dogma or challenges a point of view and provides an alternative path. 
We might hypothesize that articles deemed “violations” are more discovery-like by virtue of their more revolutionary effects on their fields, and that “innovations” should be more discovery-like than “extensions”. This is borne out if we look at the percentage of all citances received by articles in each category that contain “discovery words”: 17.4% for violations, 8.2% for innovations and 7.6% for extensions. Furthermore, the log likelihoods of “discovery words” were highly significant when “violation” citances were compared with citances of the other types. The word “first” also occurred at a statistically significant rate in citances for violation articles, and is used to indicate priority for a finding, such as “first identified”, “first reported” or “first discovered”. We will have more to say about word usage in citances in Section 3.6.

3.5. Timing of recognition

The question of when articles begin to be called “discoveries” by citing authors is also of interest, as well as the age of articles when that labeling is at its peak. It is evident that many articles in our list are quite old. The oldest article is from 1950 (McClintock’s “jumping genes”, id 81), and the average publication year is about 1998. This compares to an average publication year of 2006 for articles in the original list of 293 which were omitted because they were tools or methods. Recognition of discovery appears to be a process that extends over a period of years. We could hypothesize that the labeling of a finding as a “discovery” is made only in retrospect, in light of subsequent findings and theories, much as Lakatos (1970, p. 158) argued that so-called “crucial experiments” are recognized only in retrospect. Some discoveries appear to be strengthened by subsequent discoveries of related entities, as exemplified by micro-RNAs, or multiple novel respiratory viruses. A case in point is induced pluripotent stem cells published in 2006 (id 4).
This paper was cited very soon after publication and had accumulated 67 citances in the PMC database by 2008. However, none of these were “discovery citances”. In 2009, three years after publication, the paper began being cited as a discovery, with five out of 92 citances (5%) labeling it as such. The year with peak “discovery citances” was 2013, when 7.3% of citances contained discovery words. Similarly, RNA-interference published in 1998 (id 2) accumulated 28 citances by 2002, but only began being referred to as a discovery in 2003. The peak year for discovery citances was 2011 at 30%, thirteen years after publication.
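The onset and peak of recognition for a single article can be read off a table of yearly citance counts. The following is a minimal sketch of that bookkeeping, not the authors' code. The function name `recognition_profile` and all counts in the example are hypothetical, loosely patterned on the 2006 case described above.

```python
def recognition_profile(pub_year, yearly_counts):
    """Summarize when an article starts being called a discovery.

    yearly_counts maps citing year -> (discovery_citances, all_citances).
    """
    # Percentage of citances in each year that contain a discovery word.
    rates = {
        year: 100.0 * disc / total
        for year, (disc, total) in yearly_counts.items()
        if total > 0
    }
    # First year in which at least one discovery citance appears.
    onset_year = min(
        (year for year, (disc, _) in yearly_counts.items() if disc > 0),
        default=None,
    )
    # Year with the highest percentage of discovery citances.
    peak_year = max(rates, key=rates.get) if rates else None
    return {
        "onset_year": onset_year,
        "onset_lag": None if onset_year is None else onset_year - pub_year,
        "peak_year": peak_year,
        "peak_rate": None if peak_year is None else rates[peak_year],
    }


# Hypothetical counts for an article published in 2006.
counts = {2007: (0, 25), 2008: (0, 42), 2009: (5, 92), 2010: (12, 160)}
profile = recognition_profile(2006, counts)
```

In this made-up example the onset lag is three years, because the first discovery citance appears in 2009.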


Fig. 1. Percentage discovery citances by age range (vertical axis: percentage discovery citances, 0% to 25%; horizontal axis: article age range at time of citation in years, in five-year ranges from 1-5 to 36-40). Percentage of discovery citances is the number of citances containing discovery words divided by the total number of citances in the PMC database. Age is determined by subtracting the publication year of the article from the year of publication of the citance and grouping the data by five year periods. Only articles listed in the Appendix are included.

An interesting case is the XMRV article (id 42) published in 2006 and retracted in 2012. This paper accumulated 26 citances in the ﬁrst four years after publication. In 2009, 5 of 14 citances (36%) labeled it as a discovery, and in 2010, the peak citation year, 13 of 109 (12%). Starting in 2011 the decline in citation began as questions about contamination and lack of replication were put forward. In the three years after the retraction by the journal in 2012, 4 of 32 (12%) citances found in PMC still called it a discovery. We can conclude that some discoveries show a time lag in recognition and even papers that are discredited continue to be labeled as discoveries, suggesting considerable inertia in the system. An overview of the time evolution of discovery recognition can be obtained by calculating the percentage of discovery citances for articles of different ages. Age is computed by subtracting the year of publication of the cited article from the year of publication of the citing sentence. Summing across all discovery papers, we ﬁnd that the percentage of discovery citances increases from about 5% in the ﬁrst ﬁve years after publication to a peak of 22% in the ﬁve-year period 26–30 years after publication (Fig. 1). In the ten-year period thereafter, the rate falls to about 18%. It is of interest to examine the discoveries that are recognized early on, within the ﬁrst ﬁve year period. The most prominent among these are discoveries on epigenetic Tet family proteins (ids 9 and 12), polyomavirus (id 8), the alleged new virus found in prostate tumors and later retracted (id 42), and the gene for Miller’s syndrome found by exome sequencing (id 27). Because the PMC coverage is biased toward the current period, to get a more accurate picture of the onset of discovery recognition, we excluded all discoveries published prior to 1997 and recomputed the age distribution of percentage discovery citances for 18 years on a year by year basis (Fig. 2). 
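The age calculation behind Fig. 1 can be sketched as follows. This is an illustrative reconstruction under stated assumptions rather than the authors' procedure: the names `age_bin` and `discovery_share_by_age` are hypothetical, the input is assumed to be one record per citance, and citances at age zero are assumed to fall outside the five-year ranges because the figure's axis starts at 1-5.

```python
from collections import defaultdict


def age_bin(age, width=5):
    """Map an age in years (1, 2, ...) to a label such as '1-5' or '6-10'."""
    start = ((age - 1) // width) * width + 1
    return f"{start}-{start + width - 1}"


def discovery_share_by_age(citances, pub_year):
    """Percentage of citances containing a discovery word, per age range.

    citances: iterable of (article_id, citing_year, has_discovery_word).
    pub_year: dict mapping article_id -> publication year of the cited article.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for article_id, citing_year, has_word in citances:
        age = citing_year - pub_year[article_id]
        if age < 1:
            continue  # assumption: age zero is not part of the 1-5 range
        label = age_bin(age)
        totals[label] += 1
        hits[label] += int(has_word)
    return {label: 100.0 * hits[label] / totals[label] for label in totals}


# Hypothetical records for a single article published in 2000.
example_pub_year = {"a": 2000}
example_citances = [
    ("a", 2000, True),
    ("a", 2003, False),
    ("a", 2004, True),
    ("a", 2008, True),
    ("a", 2009, True),
    ("a", 2010, False),
]
shares = discovery_share_by_age(example_citances, example_pub_year)
```

Summing over all articles before dividing, as done here, weights heavily cited articles more than lightly cited ones. Averaging per-article percentages would be a different design choice.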
Age zero is the year of publication of the article, and the 4.5% discovery citance rate shows that recognition can begin very soon after publication, with increasing intensity into the second decade after publication. Examining the articles gaining discovery recognition at age zero, that is, in the year they were published, we find again the Miller’s syndrome gene (id 27), irisin and brown fat development (id 77), and gene fusions in cancer (id 45). Thus, when we look at the overall pattern across all discoveries, delays in recognition of individual cases, such as induced pluripotent stem cells and RNA interference, do not seem to be the norm. This indicates that recognition of some discoveries by at least some subset of the citing community is immediate, and hence the initial evidence presented by the discoverer is sufficiently compelling to win converts even before replication or confirmation by others. On the other hand, only one age zero reference out of a total of 47 was found for an article classified as a “violation”, namely for brain grid cells (id 46). This rate is lower than expected based on the number of “violation” articles within the post-1996 time frame, suggesting a delay in recognition for discoveries of this type.

3.6. Word analysis

Citances provide a rich source of information on how citing authors characterize and utilize earlier literature. We expect that articles describing discoveries will be cited differently from, for example, methodologies, reflecting perhaps their cognitive significance and epistemological status, and revealing associated concepts and terms. An initial approach is to analyze the individual words used when discovery articles are cited compared to the vocabulary used in non-discovery citances. For non-discovery citances we use those associated with the methods and tools articles which were not classified as discoveries.


Fig. 2. Percentage discovery citances by age for articles in Appendix published after 1996 (vertical axis: percentage discovery citances, 0 to 12; horizontal axis: article age at time of citation, 0 to 18 years). Age is determined by subtracting the publication year of the article from the year of publication of the citance. Percentage of discovery citances is the number of citances containing discovery words divided by the total number of citances in the PMC database.

Table 2
The top general scientific method words, excluding technical words and function words, are ranked by log likelihood as calculated by the Wordsmith Tools software. Words for the discovery citance set with respect to the non-discovery set are listed on the left, and non-discovery set words are listed on the right. The ratio of percentage frequencies is computed by dividing the frequency of the word by the total number of running words in the set.

| Discovery set: Word | Ratio of % frequencies | Log likelihood | Non-discovery set: Word | Ratio of % frequencies | Log likelihood |
|---|---|---|---|---|---|
| discovered | 5.0 | 2122 | using | 8.0 | 19,257 |
| first | 3.0 | 1940 | analysis | 8.0 | 10,450 |
| important | 3.0 | 1088 | data | 7.2 | 8549 |
| mechanism | 4.3 | 895 | used | 4.1 | 5761 |
| recently | 2.7 | 848 | performed | 6.7 | 4255 |
| cause | 5.5 | 827 | value | 37.6 | 3593 |
| shown | 2.2 | 816 | version | 41.5 | 3555 |
| demonstrated | 2.5 | 750 | algorithm | 60.1 | 2648 |
| found | 1.8 | 548 | project | 6.5 | 2585 |
| reported | 2.0 | 456 | tool | 10.0 | 2417 |

To perform this analysis we retrieved all the citing sentences for each of the discovery and non-discovery articles, not just those containing the discovery words. This increased the number of citances associated with each article by a factor of about 14. The frequency and percentage of individual words were analyzed for the separate sets using Wordsmith Tools, a software package widely used in corpus linguistics (Scott, 2004). The software calculates the percentage of each word by dividing the single word frequencies by the total number of running words in the set. To find words that occurred prominently in one set but not the other, so-called keywords, the software computes the log likelihood statistic of one corpus relative to the other and also gives the p-value for each word. It is also interesting to compute the ratio of percentages of word frequencies. For example, the word “discovered” had a percentage of 0.1% in the discovery set and a percentage of 0.02% in the non-discovery set for a ratio of 5.0 (see Table 2), which means that the word occurs five times more often on a relative basis in the discovery set than in the non-discovery set. With a log likelihood of 2122, “discovered” was the 30th ranked word in the discovery set relative to the non-discovery set. Of course, it is not surprising that the verb “discovered” appears at a higher rate in discovery citances since this set is concerned with descriptions of acts of discovery, for example, in phrases such as “X was discovered by Y”. Using this approach we can thus identify the words that occur at statistically significant rates in the discovery citance set vis-à-vis the non-discovery set as a baseline, and vice versa. Examining the highest log likelihood words in the discovery citances, using the non-discovery set as the baseline, the most prominent are the technical words “cells”, “miRNAs”, “stem”, and “pluripotent”.
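The two quantities used above, the ratio of percentage frequencies and the log likelihood of a word in one corpus relative to another, can be sketched in a few lines. The log likelihood below is the two-term form commonly used for keyword analysis in corpus linguistics. Whether Wordsmith Tools computes it in exactly this way is an assumption, and the corpus sizes in the example are hypothetical values chosen only to reproduce the 0.1% and 0.02% rates quoted in the text.

```python
import math


def pct(freq, total_words):
    """Percentage frequency: occurrences per 100 running words."""
    return 100.0 * freq / total_words


def ratio_of_pct(freq_a, total_a, freq_b, total_b):
    """Percentage frequency in corpus A divided by that in corpus B."""
    return pct(freq_a, total_a) / pct(freq_b, total_b)


def log_likelihood(freq_a, total_a, freq_b, total_b):
    """Two-term log likelihood (G2) of one word across two corpora."""
    # Expected counts if the word occurred at the same rate in both corpora.
    combined_rate = (freq_a + freq_b) / (total_a + total_b)
    expected_a = total_a * combined_rate
    expected_b = total_b * combined_rate
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


# Hypothetical corpus sizes: 1,000 hits in 1,000,000 words (0.1%)
# versus 1,000 hits in 5,000,000 words (0.02%).
ratio = ratio_of_pct(1_000, 1_000_000, 1_000, 5_000_000)
keyness = log_likelihood(1_000, 1_000_000, 1_000, 5_000_000)
```

The ratio describes the size of the difference, while the log likelihood also grows with corpus size, which is one reason the two columns of Table 2 do not rank words in the same order.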
Of course these technical words reﬂect the subject matters of the discoveries, and the prominence of the word “cells” indicates that many of them deal with cell biology. By contrast, the highest ranking words in the non-discovery set, using the discovery set as the baseline, are “using”, “analysis”, “blast” and “data”. These words are predominantly general scientiﬁc words or so-called scientiﬁc method words (with the exception of “blast” which stands for “Basic Local Alignment Search Tool”, a tool used in sequence analysis) that reﬂect actions and activities around the application of methods. The top word “using” is consistent with the methods and tools focus, appearing in formulations such as “X was determined using Y”. Further insights into the differences between the discovery and non-discovery sets can be gained by looking at the top 10 highest ranked general scientiﬁc terms by log likelihood in each set (Table 2). General scientiﬁc terms, a subset of “English for


academic purposes” (Paquot & Bestgen, 2009), are words that reﬂect scientiﬁc practices but are, as far as possible, neutral in terms of technical content. Of course, such neutrality is hard to achieve because even general scientiﬁc terms can be content dependent. Nevertheless, in Table 2 the cognitive aims of each set are clearly visible. On the right side of Table 2 words such as “used”, “using”, “performed”, “data”, “analysis”, “algorithm” and “tool” clearly reﬂect a methodological and application orientation of the non-discovery set. These words have much higher log likelihoods due to the homogeneity of this set in terms of its methodological goals. By contrast, the discovery set on the left side of Table 2 shows much higher word diversity due to the variety of topics and also a predominance of technical vocabulary. In this set the general scientiﬁc words “ﬁrst” and “recently” reﬂect concerns with issues of priority and a focus on the latest research. These might be called “timing” words. These modiﬁers are often used to describe discoveries in formulations such as “the recent discovery of X” or “X was ﬁrst discovered by Y”. The word “important” is a value judgment used to describe the ﬁnding itself or to underline some critical feature such as the important role of a process or substance. The word “mechanism” reveals how scientists think about the inner workings of the cell. The word “cause” reﬂects the concern with ﬁnding the causes of diseases. The last four words in Table 2 could be called “outcome words”, or “evidentials” (Hyland, 2004, p. 191), describing the outcomes of research and evidence for ﬁndings. The term “discovered” could itself be considered an outcome word. The words “shown”, “demonstrated”, and “found” imply that a ﬁnding has achieved a degree of certainty or corroboration, while the term “reported” is a more neutral expression of outcome. 
Common expressions are “it has been shown that”, “has been demonstrated to play a role”, “have been found in”, and “was first reported to be”. We can think of these words as aspects of the discovery process that give credibility to the results of research. In short, the work can be “important”, uncover a “cause”, “demonstrate” or “show” something, or describe a “mechanism”. Each of these auxiliary concepts associated with descriptions of discoveries reinforces the certainty, importance and primacy of the new knowledge.

Building on the findings of Section 3.5 on the timing of recognition, it is interesting to compare the word usage in citances made soon after the publication of a discovery versus those made during a later period. Here we compare citances made zero to five years after publication with those made 10 to 24 years after publication. Again, we have augmented the data to include all citances, not just those containing “discovery words”. As in the comparison of discovery to non-discovery sets, we use the early citances as a baseline against which to gauge the later set and vice versa. Top general scientific terms in the late versus early corpus with significant log likelihood include “first”, “discovered”, “since”, “discovery” and “early”. Thus, “timing” words emphasizing past events are prominent, as well as “discovery” words, which show that labeling articles as discoveries gains strength with the passage of time. Top words in the early versus late corpus, in contrast, stress the urgency and engagement in active research: “recently”, “recent”, “previously”, “replicated”, “reported”, “detected” and “consistent”. The words “replicated” and “consistent” are especially interesting because they show the process of justification of discovery is underway, involving both the repetition of findings as well as showing consistency with prior knowledge.
Although citances occurring zero years after publication are fewer in number than those at zero to five years (600 versus 21,000), they illustrate the immediate concerns of citing authors. In addition to most of the general words listed above for the zero to five year set, keywords with significant log likelihoods include “predicted”, “novel”, “confirmed”, and “validated”. This shows that early discovery recognition may be influenced by the novelty of findings and the ability to make predictions which can be confirmed or validated.

3.7. Machine learning

The analysis of the word frequencies associated with the discovery and the non-discovery citances reveals that the word profiles for each type differ in systematic ways. These differences in word patterns suggest that a machine learning approach may be useful for differentiating discoveries from non-discoveries, thus facilitating our identification task. Machine learning is a widely used tool in computer and information sciences, and its use in text classification is particularly relevant (Demarest & Sugimoto, 2015). The plan was to use all citances associated with the 293 articles having 20 or more “discovery citances” as input to the training task for machine learning. This included 128 articles coded as discoveries and 165 non-discoveries. Input to machine learning training consisted of all citing sentences, including those with and without “discovery words” for the 293 articles. The citances for a particular article were aggregated to form a single “bag of words” for each article. In other words, each article was represented as a concatenation of all its citing sentences, without regard to word order. This resulted in a total of 188,566 sentences for the 293 articles or an average of 643 sentences per article. Articles in the training set were coded “1” for discovery and “0” for non-discovery. As a test set, an additional 100 articles from below the cutoff of 20 were manually coded but were not used for training.
This set served as “unseen” data. This set of 100 included articles down to 16 “discovery citances” per article below the initial threshold of 20 and contained 37,988 total citances, considerably fewer than the training set because of its lower citation rate. The topics represented in the 100 additional articles were similar to those found in the larger training set. A total of 48 items were manually coded as discoveries.4 Notable examples are the BRCA1 and 2 genes in breast cancer, the molecular basis for motility in eukaryotic ﬂagellum, proteins in age-related macular degeneration, reverse transcriptase, and the Ebola virus. Again many were concerned with disease causation by genes or viruses. The Scikit-learn package was used for machine learning (Pedregosa et al., 2011). This software processes each document by applying a stop-word list and converting each document to a vector of features. Each vector consists of words (coded


as numbers) and weights given by tf-idf scores. The document vectors deﬁne points in a hyper-dimensional space whose axes are individual words. The objective of the training is to ﬁnd an optimal hyperplane in feature space with instances of discovery documents on one side of the plane and non-discoveries on the other side. After training on the feature vectors for the 293 articles using various classiﬁers which deﬁned coefﬁcients for the hyperplane, the resulting solutions were applied to the test sample consisting of 100 additional documents which had not been used in the training. Ten different classiﬁers available in the Scikit-learn package were tested separately giving accuracies ranging from 0.6 to 0.94 with eight of the ten near 0.90. Best results were obtained with the ridge regression classiﬁer which is similar to standard linear regression but uses a noise reduction factor. An inspection of the coefﬁcients of the optimal hyperplane revealed that the words having the highest positive coefﬁcients are strongly correlated with the words having the highest log likelihoods for the discovery set (see Section 3.6), and words with the most negative coefﬁcients were correlated with the highest log likelihoods for the non-discoveries. For example, the word “cells” had the highest hyperplane coefﬁcient and highest log likelihood for the discovery set, and the word “used” had the most negative coefﬁcient and the highest log likelihood for the non-discoveries. The ridge regression classiﬁer gave an accuracy of 94%. This means that for 94 of 100 articles in the second set of data, the machine and manual classiﬁcations were identical. There were 44 true positives, two false positives and four false negatives, for an F1 value of 94%. The two false positives, where the machine learning said it was a discovery but the manual classiﬁcation said it was not, dealt with a micro-RNA discovery by deep sequencing and discovery of cryptic species in taxonomy. 
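The reported accuracy and F1 value can be checked directly from the counts given above. This is a minimal sketch with a hypothetical helper name, `classification_metrics`. The number of true negatives is not stated in the text. It is inferred here as 100 - 44 - 2 - 4 = 50, which is an assumption that follows from the test set size of 100.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# 44 true positives, 2 false positives and 4 false negatives are from the text;
# the 50 true negatives are inferred from the test set size of 100.
metrics = classification_metrics(tp=44, fp=2, fn=4, tn=50)
```

With these counts the accuracy is 94 out of 100 and F1 is 88/94, which rounds to the 94% quoted in the text. The 48 manually coded discoveries in the test set equal the sum of true positives and false negatives.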
The four false negatives, where the manual classification said it was a discovery but the machine did not, concerned a draft sequence of the human genome, a drug discovery of tuberculosis inhibitors, discovery of type-2 diabetes SNPs using genome-wide association studies, and discovery of autism genes using exome sequencing. While the four false negatives might be considered borderline cases, machine learning may have been influenced by numerous cases in the training data where DNA sequencing and genome-wide association were not associated with discovery. On the other hand, the human classifier may have been influenced by well-known articles and biased against systematic biology. Nevertheless, we can conclude that machine learning can be of considerable assistance in discovery identification. It remains to be seen how far down the list we can go in terms of number of “discovery citances”, that is, whether articles with only a few citances can be accurately classified and what the lower limit is, although classification of another 1000 articles seems within reach. The ability to accurately classify discoveries using as few citances as possible is important for identification of very recent discoveries which have only a short time to be cited.

4. Discussion

While discovery in science can be an extremely private matter taking place in the mind of the discoverer, we have taken the view that it is the community of citing authors that ultimately decides what is and is not a discovery, and this community designation extends over a number of years and is potentially subject to revision, as in the case of retractions. Of course, this means that if a “discovery” is made and is overlooked or goes unrecognized by the community, it will be missed by this method. There have been well known cases in the history of science where discoveries have been overlooked only later to be resurrected by embarrassed scientists.
Examples include the discovery of the kinetic theory of gases by Waterston in the 19th century (Gribbin, 2002, p. 389), and the discovery of the units of inheritance by Gregor Mendel (Mukherjee, 2016, p. 59). In these cases, papers were either rejected, or ignored and uncited until decades later. Gunther Stent’s theory of premature discovery in science nicely covers such cases, where scientific findings that do not fit with the “canonical knowledge”, or paradigm, in a given period can go unrecognized. On the other side of the coin, we could ask if spurious findings can be labeled as discoveries simply because they seem consistent with contemporary knowledge. Certainly such is the case with articles that have been labeled as discoveries only later to be retracted as errors. In a different sense, could there be cases of “bandwagon” discoveries when a particular topic becomes popular and a field is inundated by incremental additions to knowledge? These are difficult questions we will not be able to settle here. A possible deficiency in our method may be its failure to identify important methodologies such as recombinant DNA or the polymerase chain reaction. Other search strategies will be needed to systematically identify this type of innovative work, perhaps by the use of other key word filters. It is also possible that discoveries are missed, or are overly represented, because of the limitations or biases inherent in the database. It should be recalled that about one-fifth of the discovery references were left out of the analysis due to the lack of source data in the PMC. An example of a missed discovery due to thresholding is the 2016 Nobel Prize in Physiology or Medicine which was awarded to Yoshinori Ohsumi for autophagy. One article associated with this discovery from 1993 received only eight discovery citances, and thus falls below the threshold of 20 used to compile our list. Such a low citance rate is possibly indicative of a subject bias in the database.
Finally, Merton’s “obliteration by incorporation” effect may also be at work, where findings become so embedded and second nature in the working habits of scientists that explicit mention is no longer thought necessary, leading to the premature disappearance of discoveries in the collective consciousness. Regarding the timing of discovery recognition, we have seen that, in some cases, there can be a delay of a few years in the onset of discovery citances. Whether the community is withholding judgment until further evidence is forthcoming or some other factor is at work, such as the Stent effect, is not known. However, looking at all discovery articles in aggregate, this delay is not apparent, and recognition is often immediate, though at a low level, with an increasing rate of recognition across the years which may only begin to decay decades later. Such nearly immediate recognition requires us to examine


our assumptions about the separation of the contexts of discovery and justiﬁcation as usually conceived by philosophers of science, and suggests that they cannot be so easily separated. A similar conclusion was reached by Simon and colleagues (Langley et al., 1987; p. 57) where discovery is conceived as a stepwise process where positive or negative conﬁrmation is generated at each step. Discovery then, if it is to be quickly labeled as such by others, must be accompanied by some degree of justiﬁcation, corroboration and ﬁt with the evidence. The picture may not be complete upon publication, with many blanks to ﬁll in and implications to be followed up, but some degree of ﬁt must be evident in the nascent moment. Such was certainly the case with Watson and Crick’s DNA double helix where numerous models were rejected until one that satisﬁed inter-atomic distances, base ratios and x-ray crystallography evidence ﬁnally emerged (Watson, 1968). Our approach to discovery identiﬁcation is also complicated by the phenomenon of multiple discovery, diluting the number of distinct discoveries on a given list. However, co-citation methods are an effective tool to identify multiples and offer a new approach to systematically identifying these events. Multiples, as Merton calls them, are an important phenomenon and can reveal how competition and cooperation operate in science (Small, 2016). On the question of the nature of discovery, it proved difﬁcult to ﬁnd evidence for new ideas from the linking of unconnected literatures or new combinations of existing chemical entities. Most discoveries examined here appeared to involve some kind of novelty. Combinations, to the extent they were in evidence, were of a more complex nature, involving biological processes, mechanisms, methods, diseases, viruses, genes, enzymes, and other entities. 
A theory of discovery as a combinatorial or bisociation process, as envisioned by Arthur Koestler in 1964, will require a more expanded definition of combining units. One way this might be investigated is to examine the references cited by discovery articles and, by comparing these with a cluster analysis of those references, to see if different topics were drawn upon by the discoveries.

5. Conclusions

We have explored the use of citation contexts, or more precisely citances, in the identification of discoveries in biomedical science using full text from the PMC database. While we have found that the appearance of “discovery words” is not a reliable indicator of whether a given citance describes a discovery (the success rate is about 46%), it is relatively easy to differentiate these manually by inspecting a sample of citances. Because the majority of false hits are methodologies and tools for making discoveries, rather than actual scientific discoveries, the vocabulary used in citances in such cases has been shown to be highly instrumental, and to differ markedly from the vocabulary used when actual discoveries are cited. Hence, machine learning based on citance vocabulary has proved to be an effective tool for classifying articles into the two sets, achieving an accuracy of 94% on unseen data. This opens the door to a much broader discovery identification task using machine learning that could possibly expand the list to 1000 discoveries using PMC data. Finally, other tropes or common themes in science are likely to be amenable to the type of analysis outlined in this paper. General scientific words associated with discoveries such as timing words (“first” and “recently”), value judgements such as “importance”, or outcome or evidential words such as “shown”, “demonstrated” and “found” could themselves be explored by retrieving sentences, manual classification, and application of machine learning to differentiate usage and automate the process.
One direction for future research would be to explore whether outcome words can be correlated with the cognitive certainty or degree of corroboration of scientific findings, and the types and strength of evidence that are brought to bear. This sort of research agenda might take us closer to a literature based computational approach to the “context of justification”. Discovery in science is the critical engine that drives the acquisition of new knowledge, combining the creative efforts of individual researchers and the importance of the scientific community in validating that knowledge. A systematic framework for identifying discoveries, such as the one we describe here, can contribute to the deeper study of the psychological process of making discoveries as well as the sociological process of bestowing recognition, and open these up to greater public and policy scrutiny. Also at stake is the question of how discoveries are justified and how the community is convinced to grant recognition, unless we are to assume that this is merely a matter of rhetorical persuasion.

Author contributions

Henry Small: conceived and designed the analysis, collected the data, contributed data or analysis tools, performed the analysis, wrote the paper.
Hung Tseng: collected the data, performed the analysis.
Mike Patek: collected the data, contributed data or analysis tools, performed the analysis.

Acknowledgements

We would like to thank Richard Klavans and Kevin Boyack for valuable discussions, and the reviewers for suggesting additional analyses.


Appendix. List of Discoveries.

This Appendix lists the 128 articles having 20 or more discovery citances and classified as discoveries. The number of discovery citances is given under “Disc Citances”, and the percentage of these of the total number of citances is in the “% Disc Citances” column. The publication year of the article is under “Pub Year”, and a short description is given under “Discovery”. The first author of the article is given under “First Author”, followed by the Title in the next column. The final column, “Type”, is the classification of the article as “violation”, “innovation” or “extension”, abbreviated as “vio”, “inn” and “ext”.

| ID | Disc Citances | % Disc Citances | Pub Year | Discovery | First Author | Title | Type |
|---|---|---|---|---|---|---|---|
| 1 | 436 | 38.0 | 1993 | miRNA lin4 | Lee, RC | The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14 | vio |
| 2 | 217 | 22.0 | 1998 | RNA interference | Fire, A | Potent and specific genetic interference by double stranded RNA in Caenorhabditis elegans | inn |
| 3 | 141 | 22.7 | 2000 | miRNA let7 | Reinhart, BJ | The 21-nucleotide let-7 RNA regulates developmental timing in Caenorhabditis elegans | ext |
| 4 | 123 | 5.6 | 2006 | induced pluripotent stem cells | Takahashi, K | Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors | inn |
| 5 | 90 | 25.1 | 1993 | miRNA regulation of lin-14 by lin-4 | Wightman, B | Posttranscriptional regulation of the heterochronic gene lin-14 by lin-4 mediates temporal pattern formation in C. elegans | ext |
| 6 | 77 | 13.4 | 1994 | leptin | Zhang, Y | Positional cloning of the mouse obese gene and its human homologue | vio |
| 7 | 74 | 4.0 | 2007 | iPSC in human somatic cells | Takahashi, K | Induction of pluripotent stem cells from adult human fibroblasts by defined factors | ext |
| 8 | 73 | 17.1 | 2008 | polyomavirus | Feng, H | Clonal integration of a polyomavirus in human Merkel cell carcinoma | inn |
| 9 | 72 | 13.7 | 2009 | epigenetic Tet family proteins | Tahiliani, M | Conversion of 5-methylcytosine to 5-hydroxymethylcytosine in mammalian DNA by MLL partner TET1 | inn |
| 10 | 60 | 20.7 | 2005 | human bocavirus | Allander, T | Cloning of a human parvovirus by molecular screening of respiratory tract samples | ext |
| 11 | 59 | 7.0 | 2001 | mammalian RNA interference | Elbashir, SM | Duplexes of 21-nucleotide RNAs mediate RNA interference in cultured mammalian cells | inn |
| 12 | 57 | 18.8 | 2009 | epigenetic Tet family in brain | Kriaucionis, S | The nuclear DNA base 5-hydroxymethylcytosine is present in Purkinje neurons and the brain | inn |
| 13 | 57 | 11.4 | 2001 | novel genes coding for miRNAs | Lagos-Quintana, M | Identification of novel genes coding for small expressed RNAs | ext |
| 14 | 54 | 30.7 | 1992 | mirror neurons | di Pellegrino, G | Understanding motor events: a neurophysiological study | inn |
| 15 | 49 | 8.7 | 1999 | hormone ghrelin | Kojima, M | Ghrelin is a growth-hormone-releasing acylated peptide from stomach | inn |
| 16 | 47 | 16.3 | 1996 | mirror neurons confirmation | Gallese, V | Action recognition in the premotor cortex | ext |
| 17 | 45 | 27.1 | 2004 | human coronavirus | van der Hoek, L | Identification of a new human coronavirus | ext |
| 18 | 45 | 10.5 | 2001 | abundance of miRNAs in C. elegans | Lau, NC | An abundant class of tiny RNAs with probable regulatory roles in Caenorhabditis elegans | ext |
| 19 | 45 | 20.8 | 1953 | double helix | Watson, JD | Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid | vio |
| 20 | 44 | 17.8 | 2000 | evolutionary conserved miRNAs | Pasquinelli, AE | Conservation of the sequence and temporal expression of let-7 heterochronic regulatory RNA | ext |
| 21 | 43 | 4.5 | 2007 | FTO obesity gene | Frayling, TM | A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity | inn |
| 22 | 42 | 5.2 | 2004 | EGFR mutations and lung cancer treatment | Lynch, TJ | Activating mutations in the epidermal growth factor receptor underlying responsiveness of non-small-cell lung cancer to gefitinib | inn |
| 23 | 42 | 3.9 | 2007 | human iPSC confirmation | Yu, J | Induced pluripotent stem cell lines derived from human somatic cells | inn |
| 24 | 42 | 14.1 | 2004 | first histone demethylase LSD1 | Shi, Y | Histone demethylation mediated by the nuclear amine oxidase homolog LSD1 | inn |
| 25 | 42 | 14.8 | 2001 | miRNA and RNA sequencing | Lee, RC | An extensive class of small RNAs in Caenorhabditis elegans | ext |
| 26 | 42 | 9.9 | 2007 | EML4-ALK gene fusion in cancer | Soda, M | Identification of the transforming EML4-ALK fusion gene in non-small-cell lung cancer | inn |
| 27 | 41 | 12.5 | 2010 | human exomes sequence and Miller’s syndrome | Ng, SB | Exome sequencing identifies the cause of a mendelian disorder | inn |
| 28 | 40 | 15.3 | 1994 | Kaposi’s sarcoma-associated herpes virus | Chang, Y | Identification of herpesvirus-like DNA sequences in AIDS-associated Kaposi’s sarcoma | inn |
| 29 | 40 | 12.0 | 1997 | TLR and human immune response | Medzhitov, R | A human homologue of the Drosophila Toll protein signals activation of adaptive immunity | ext |
| 30 | 40 | 8.4 | 2005 | TMPRSS-ETS gene fusion and prostate cancer | Tomlins, SA | Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer | ext |
| 31 | 40 | 9.1 | 2002 | APOBEC3G gene inhibits HIV-1 | Sheehy, AM | Isolation of a human gene that inhibits HIV-1 infection and is suppressed by the viral Vif protein | inn |
| 32 | 38 | 5.7 | 1997 | endothelial progenitor cells and vascular regeneration | Asahara, T | Isolation of putative progenitor endothelial cells for angiogenesis | vio |
| 33 | 38 | 31.9 | 1997 | cell free fetal DNA | Lo, YM | Presence of fetal DNA in maternal plasma and serum | ext |
| 34 | 38 | 21.6 | 1980 | HTLV-1/HTLV-2 human retroviruses | Poiesz, BJ | Detection and isolation of type C retrovirus particles from fresh and cultured lymphocytes of a patient with cutaneous T-cell lymphoma | ext |
| 35 | 38 | 14.4 | 1996 | immune function for Drosophila Toll | Lemaitre, B | The dorsoventral regulatory gene cassette spatzle/Toll/cactus controls the potent antifungal response in Drosophila adults | ext |
| 36 | 37 | 19.0 | 1989 | hepatitis C virus | Choo, QL | Isolation of a cDNA clone derived from a blood-borne non-A, non-B viral hepatitis genome | inn |
| 37 | 36 | 24.0 | 2007 | respiratory polyomavirus | Allander, T | Identification of a third human polyomavirus | ext |
| 38 | 36 | 25.2 | 2000 | bacterial rhodopsin | Beja, O | Bacterial rhodopsin: evidence for a new type of phototrophy in the sea | inn |
| 39 | 36 | 5.1 | 2004 | microbial diversity in seawater | Venter, JC | Environmental genome shotgun sequencing of the Sargasso Sea | ext |
| 40 | 36 | 3.9 | 2007 | exosomal shuttle RNA | Valadi, H | Exosome-mediated transfer of mRNAs and microRNAs is a novel mechanism of genetic exchange between cells | inn |
| 41 | 36 | 2.0 | 2000 | breast cancer molecular subtypes | Perou, CM | Molecular portraits of human breast tumours | ext |
| 42 | 35 | 10.9 | 2006 | new human retrovirus (retracted) | Urisman, A | Identification of a novel Gammaretrovirus in prostate tumors of patients homozygous for R462Q RNASEL variant | inn |
| 43 | 35 | 25.0 | 1996 | mirror neurons confirmation | Rizzolatti, G | Premotor cortex and the recognition of motor actions | ext |
| 44 | 35 | 13.9 | 2003 | new interleukins | Kotenko, SV | IFN-lambdas mediate antiviral protection through a distinct class II cytokine receptor complex | inn |
| 45 | 35 | 21.3 | 2009 | gene fusions in cancer | Maher, CA | Transcriptome sequencing to detect gene fusions in cancer | ext |
| 46 | 34 | 7.4 | 2005 | grid cells in the cortex | Hafting, T | Microstructure of a spatial map in the entorhinal cortex | vio |
| 47 | 33 | 5.0 | 2010 | copy number variation types | Conrad, DF | Origins and functional impact of copy number variation in the human genome | ext |
| 48 | 33 | 15.2 | 2003 | new interleukins | Sheppard, P | IL-28, IL-29 and their class II cytokine receptor IL-28R | ext |
| 49 | 33 | 5.0 | 2009 | large non-coding RNAs | Guttman, M | Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals | ext |
| 50 | 33 | 8.5 | 2011 | copy number variations | Mills, RE | Mapping copy number variation by population-scale genome sequencing | ext |
| 51 | 32 | 3.2 | 2006 | copy number variations | Redon, R | Global variation in copy number in the human genome | ext |
| 52 | 32 | 6.0 | 2011 | genetic mutation in dementia | DeJesus-Hernandez, M | Expanded GGGGCC hexanucleotide repeat in noncoding region of C9ORF72 causes chromosome 9p-linked FTD and ALS | inn |
| 53 | 32 | 15.8 | 2005 | ammonia-oxidizing Archaea | Konneke, M | Isolation of an autotrophic ammonia-oxidizing marine archaeon | vio |
| 54 | 31 | 5.4 | 2006 | TDP-43 in amyotrophic lateral sclerosis | Neumann, M | Ubiquitinated TDP-43 in frontotemporal lobar degeneration and amyotrophic lateral sclerosis | inn |
| 55 | 31 | 19.4 | 1984 | H. pylori and peptic ulcer disease | Marshall, BJ | Unidentified curved bacilli in the stomach of patients with gastritis and peptic ulceration | vio |
| 56 | 31 | 1.1 | 2001 | human genome sequence | Lander, ES | Initial sequencing and analysis of the human genome | ext |
| 57 | 30 | 13.6 | 2005 | mutation in myeloproliferative disorders | James, C | A unique clonal JAK2 mutation leading to constitutive signalling causes polycythaemia vera | inn |
| 58 | 30 | 12.6 | 1971 | hippocampal place cells | O’Keefe, J | The hippocampus as a spatial map: Preliminary evidence from unit activity in the freely-moving rat | inn |
| 59 | 30 | 18.1 | 1965 | bone morphogenetic proteins | Urist, MR | Bone: formation by autoinduction | vio |
| 60 | 30 | 21.1 | 1983 | HIV causes AIDS | Barre-Sinoussi, F | Isolation of a T-lymphotropic retrovirus from a patient at risk for acquired immune deficiency syndrome (AIDS) | inn |
| 61 | 30 | 7.2 | 1997 | mutation in Parkinson’s disease | Polymeropoulos, MH | Mutation in the alpha-synuclein gene identified in families with Parkinson’s disease | inn |
| 62 | 30 | 5.3 | 2004 | copy number variations | Iafrate, AJ | Detection of large-scale variation in the human genome | inn |
| 63 | 30 | 19.7 | 2007 | polyomavirus respiratory virus | Gaynor, AM | Identification of a novel polyoma virus from patients with acute respiratory tract infections | ext |
| 64 | 30 | 4.6 | 2004 | EGFR mutations | Paez, JG | EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy | inn |
| 65 | 29 | 29.6 | 1973 | dendritic cells | Steinman, RM | Identification of a novel cell type in peripheral lymphoid organs of mice. I. Morphology, quantitation, tissue distribution | inn |
| 66 | 29 | 12.8 | 1990 | Peroxisomal proliferating-activated receptors | Issemann, I | Activation of a member of the steroid hormone receptor superfamily by peroxisome proliferators | inn |
| 67 | 29 | 4.6 | 2007 | TCF7L2 gene and type-2 diabetes | Saxena, R | Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels | ext |
| 68 | 29 | 14.6 | 2004 | autoantibodies in neuromyelitis optica | Lennon, VA | A serum autoantibody marker of neuromyelitis optica: Distinction from multiple sclerosis | inn |
| 69 | 28 | 10.8 | 2001 | Human metapneumovirus respiratory virus | van den Hoogen, BG | A newly discovered human pneumovirus isolated from young children with respiratory tract disease | ext |
| 70 | 28 | 3.2 | 2002 | mutation in melanoma | Davies, H | Mutations of the BRAF gene in human cancer | ext |
| 71 | 28 | 36.4 | 1982 | ribozymes and self-splicing RNA | Kruger, K | Self-splicing RNA: autoexcision and autocyclization of the ribosomal RNA intervening sequence of Tetrahymena | vio |
| 72 | 28 | 12.3 | 2004 | giant DNA mimivirus | Raoult, D | The 1.2-megabase genome sequence of Mimivirus | vio |
| 73 | 27 | 10.8 | 2004 | virus-encoded miRNAs in Epstein-Barr Virus | Pfeffer, S | Identification of virus-encoded microRNAs | ext |
| 74 | 27 | 14.3 | 1998 | orexins in sleep/wake state | de Lecea, L | The hypocretins: hypothalamus-specific peptides with neuroexcitatory activity | inn |
| 75 | 27 | 2.6 | 2008 | miRNAs as biomarkers | Mitchell, PS | Circulating microRNAs as stable blood-based markers for cancer detection | inn |
| 76 | 27 | 6.8 | 2011 | mutation in dementia and ALS | Renton, AE | A hexanucleotide repeat expansion in C9ORF72 is the cause of chromosome 9p21-linked ALS-FTD | inn |
| 77 | 27 | 5.7 | 2012 | irisin and protection from metabolic disease | Bostrom, P | A PGC1-alpha-dependent myokine that drives brown-fat-like development of white fat and thermogenesis | inn |
| 78 | 27 | 11.1 | 1980 | nitric oxide as a relaxing factor | Furchgott, RF | The obligatory role of endothelial cells in the relaxation of arterial smooth muscles by acetylcholine | vio |
| 79 | 26 | 7.7 | 2006 | TCF7L2 polymorphism and type-2 diabetes | Grant, SF | Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes | ext |
| 80 | 26 | 24.5 | 1983 | ribozymes and catalysis | Guerrier-Takada, C | The RNA moiety of ribonuclease P is the catalytic subunit of the enzyme | vio |
| 81 | 26 | 19.8 | 2005 | HTLV-3/HTLV-4 retroviruses in Africa | Wolfe, ND | Emergence of unique primate T-lymphotropic viruses among central African bushmeat hunters | inn |
| 82 | 26 | 41.9 | 1950 | jumping genes | McClintock, B | The origin and behavior of mutable loci in maize | inn |
| 83 | 26 | 9.9 | 2011 | epigenetic Tet family proteins | Ito, S | Tet proteins can convert 5-methylcytosine to 5-formylcytosine and 5-carboxylcytosine | inn |
| 84 | 26 | 4.8 | 2007 | type-2 diabetes genome wide association study | Sladek, R | A genome-wide association study identifies novel risk loci for type 2 diabetes | ext |
| 85 | 25 | 22.7 | 2005 | coronavirus respiratory virus | Woo, PC | Characterization and complete genome sequence of a novel coronavirus, coronavirus HKU1, from patients with pneumonia | ext |
| 86 | 25 | 10.6 | 1973 | synaptic plasticity in memory | Bliss, TV | Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path | inn |
| 87 | 25 | 14.0 | 1985 | telomerase enzyme | Greider, CW | Identification of a specific telomere terminal transferase activity in Tetrahymena extracts | vio |
| 88 | 24 | 41.4 | 1984 | protein-saccharide linkage on lymphocyte proteins | Torres, CR | Topography and polypeptide distribution of terminal N-acetylglucosamine residues on the surfaces of intact lymphocytes: evidence for O-linked GlcNAc | inn |
| 89 | 24 | 5.6 | 1990 | aptamers that bind target molecules | Ellington, AD | In vitro selection of RNA molecules that bind specific ligands | inn |
| 90 | 24 | 17.0 | 1996 | second estrogen receptor | Kuiper, GGJM | Cloning of a novel estrogen receptor expressed in rat prostate and ovary | ext |
| 91 | 24 | 5.0 | 2004 | ultraconserved DNA sequences | Bejerano, G | Ultraconserved elements in the human genome | inn |
| 92 | 23 | 41.1 | 1964 | Epstein-Barr virus in Burkitt’s lymphoma cells | Epstein, MA | Virus particles in cultured lymphoblasts from Burkitt’s lymphoma | ext |
| 93 | 23 | 4.6 | 2007 | TCF7L2 variation and type-2 diabetes | Zeggini, E | Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes | ext |
| 94 | 23 | 23.0 | 2003 | Mimivirus in amoebae | La Scola, B | A giant virus in amoebae | vio |
| 95 | 23 | 5.0 | 2005 | Interleukin 17 producing T helper cells | Harrington, LE | Interleukin 17-producing CD4+ effector T cells develop via a lineage distinct from the T helper type 1 and 2 lineages | inn |
| 96 | 23 | 4.9 | 1998 | Toll-like receptor sensors for bacterial lipopolysaccharide | Poltorak, A | Defective LPS signaling in C3H/HeJ and C57BL/10ScCr mice: mutations in Tlr gene | inn |
| 97 | 23 | 2.9 | 1956 | Warburg effect | Warburg, O | On the origin of cancer cells | inn |
| 98 | 23 | 11.3 | 2005 | mutation in human myeloproliferative disorders | Baxter, EJ | Acquired mutation of the tyrosine kinase JAK2 in human myeloproliferative disorders | inn |
| 99 | 23 | 9.6 | 1991 | odorant receptors | Buck, L | A novel multigene family may encode odorant receptors: a molecular basis for odor recognition | inn |
| 100 | 22 | 6.5 | 2001 | adipocyte-secreted factor and insulin resistance | Steppan, CM | The hormone resistin links obesity to diabetes | inn |
| 101 | 22 | 4.4 | 1990 | aptamers bind to novel ligands | Tuerk, C | Systemic evolution of ligands by exponential enrichment: RNA ligands to bacteriophage T4 DNA polymerase | inn |
| 102 | 22 | 3.4 | 2004 | extracellular traps (NETs) | Brinkmann, V | Neutrophil extracellular traps kill bacteria | inn |
| 103 | 22 | 4.0 | 2007 | genome wide association study for type-2 diabetes | Scott, LJ | A Genome-Wide Association Study of type 2 diabetes in Finns detects multiple susceptibility variants | ext |
| 104 | 22 | 4.1 | 2007 | CRISPR and immunity | Barrangou, R | CRISPR provides acquired resistance against viruses in prokaryotes | inn |
| 105 | 22 | 6.3 | 1998 | orexins regulate feeding behavior | Sakurai, T | Orexins and orexin receptors: a family of hypothalamic neuropeptides and G protein-coupled receptors that regulate feeding behavior | inn |
| 106 | 21 | 5.0 | 1992 | neural stem cells and neuron generation | Reynolds, BA | Generation of neurons and astrocytes from isolated cells of the adult mammalian central nervous system | inn |
| 107 | 21 | 5.8 | 2004 | epidermal growth factor and trastuzumab | Pao, W | EGF receptor gene mutations are common in lung cancers from never smokers and are associated with sensitivity of tumors to gefitinib and erlotinib | ext |
| 108 | 21 | 4.9 | 2005 | Interleukin 17 producing T helper cells | Park, H | A distinct lineage of CD4 T cells regulates tissue inflammation by producing interleukin 17 | inn |
| 109 | 21 | 20.4 | 1988 | HIV-1 transactivating protein Tat | Frankel, AD | Cellular uptake of the tat protein from human immunodeficiency virus | inn |
| 110 | 21 | 10.8 | 2005 | mutation in myeloproliferative disorders | Kralovics, R | A gain-of-function mutation of JAK2 in myeloproliferative disorders | inn |
| 111 | 21 | 7.2 | 1998 | PGC-1 coactivator regulating thermogenesis | Puigserver, P | A cold-inducible coactivator of nuclear receptors linked to adaptive thermogenesis | inn |
| 112 | 21 | 4.5 | 1997 | Cancer stem cells | Bonnet, D | Acute myeloid leukemia is organized as a hierarchy that originates from a primitive hematopoietic cell | vio |
| 113 | 21 | 6.3 | 2001 | mutations in Crohn’s disease | Hugot, JP | Association of NOD2 leucine-rich repeat variants with susceptibility to Crohn’s disease | inn |
| 114 | 21 | 9.3 | 1998 | tau gene and familial neurodegenerative disorder | Hutton, M | Association of missense and 5′-splice-site mutations in tau with the inherited dementia FTDP-17 | inn |
| 115 | 20 | 27.8 | 1982 | oncogene Wnt in breast cancer | Nusse, R | Many tumors induced by the mouse mammary tumor virus contain a provirus integrated in the same region of the host genome | inn |
| 116 | 20 | 23.8 | 2005 | ammonia-oxidizing archaea | Treusch, AH | Novel genes for nitrite reductase and Amo-related proteins indicate a role of uncultivated mesophilic crenarchaeota in nitrogen cycling | inn |
| 117 | 20 | 45.5 | 1977 | interrupted genes in Adenovirus2 mRNA | Chow, LT | An amazing sequence arrangement at the 5′ ends of adenovirus 2 messenger RNA | vio |
| 118 | 20 | 16.7 | 1992 | cannabinoid receptors | Devane, WA | Isolation and structure of a brain constituent that binds to the cannabinoid receptor | ext |
| 119 | 20 | 43.5 | 1977 | pre-mRNA splicing | Berget, SM | Spliced segments at the 5′ terminus of adenovirus 2 late mRNA | inn |
| 120 | 20 | 7.8 | 2011 | epigenetic Tet family proteins | He, YF | Tet-mediated formation of 5-carboxylcytosine and its excision by TDG in mammalian DNA | inn |
| 121 | 20 | 34.5 | 1988 | HIV-1 transactivating protein Tat | Green, M | Autonomous functional domains of chemically synthesized human immunodeficiency virus tat trans-activator protein | inn |
| 122 | 20 | 11.2 | 2011 | mitochondrial calcium uniporter | Baughman, JM | Integrative genomics identifies MCU as an essential component of the mitochondrial calcium uniporter | inn |
| 123 | 20 | 3.4 | 2008 | miRNAs in bodily fluids for diagnosis | Chen, X | Characterization of microRNAs in serum: a novel class of biomarkers for diagnosis of cancer and other diseases | inn |
| 124 | 20 | 4.1 | 2002 | miRNA in chronic lymphocytic leukemia | Calin, GA | Frequent deletions and down-regulation of micro-RNA genes miR15 and miR16 at 13q14 in chronic lymphocytic leukemia | inn |
| 125 | 20 | 13.9 | 2005 | neuromyelitis optica-IgG binds to aquaporin-4 water channel | Lennon, VA | IgG marker of optic-spinal multiple sclerosis binds to the aquaporin-4 water channel | inn |
| 126 | 20 | 12.3 | 1987 | nitric oxide as endothelium-derived relaxing factor | Palmer, RM | Nitric oxide release accounts for the biological activity of endothelium-derived relaxing factor | ext |
| 127 | 20 | 1.9 | 1998 | embryonic stem cells in therapy | Thomson, JA | Embryonic stem cell lines derived from human blastocysts | inn |
| 128 | 20 | 5.7 | 2004 | copy number variation in evolution | Sebat, J | Large-Scale Copy Number Polymorphisms in the Human Genome | inn |
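The two citance columns of the Appendix are related by a simple ratio, and inclusion in the list depends on a count threshold. The following is a minimal Python sketch of that arithmetic, not the authors' code. The names `Article`, `pct_disc_citances`, `meets_threshold` and `implied_total` are hypothetical, and the example record with 25 discovery citances out of 200 is invented purely for illustration. Only the threshold of 20 and the values 436 and 38.0 (row 1) come from the Appendix.

```python
from dataclasses import dataclass

# Inclusion threshold stated in the Appendix caption: 20 or more discovery citances.
MIN_DISC_CITANCES = 20


@dataclass
class Article:
    """Hypothetical record for one cited article (field names are assumptions)."""
    first_author: str
    disc_citances: int   # citing sentences that contain a discovery term
    total_citances: int  # all citing sentences that reference the article


def pct_disc_citances(article: Article) -> float:
    """Percentage of an article's citances that are discovery citances."""
    if article.total_citances <= 0:
        raise ValueError("total_citances must be positive")
    return 100.0 * article.disc_citances / article.total_citances


def meets_threshold(article: Article) -> bool:
    """True if the article has enough discovery citances to be listed."""
    return article.disc_citances >= MIN_DISC_CITANCES


def implied_total(disc_citances: int, pct: float) -> float:
    """Approximate total citances implied by a table row.

    The result is only an estimate because the published percentage is rounded.
    """
    if pct <= 0:
        raise ValueError("pct must be positive")
    return disc_citances / (pct / 100.0)


if __name__ == "__main__":
    example = Article("Example, A", disc_citances=25, total_citances=200)  # invented values
    print(pct_disc_citances(example), meets_threshold(example))
    # Row 1 of the Appendix reports 436 discovery citances at 38.0%.
    print(round(implied_total(436, 38.0)))
```

Because the percentages in the table are rounded to one decimal place, a total recovered this way is approximate: row 1 implies roughly 1,147 citances in all, but the exact count cannot be read off the Appendix.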

