Journal of Informetrics 11 (2017) 46–62
Journal of Informetrics journal homepage: www.elsevier.com/locate/joi
Regular article
Discovering discoveries: Identifying biomedical discoveries using citation contexts
Henry Small a,∗, Hung Tseng b,1, Mike Patek c
a SciTech Strategies, Inc., 105 Rolling Road, Bala Cynwyd, PA 19004, USA
b National Institute of Arthritis and Musculoskeletal and Skin Diseases, National Institutes of Health, 6701 Democracy Boulevard, Bethesda, MD 20892, USA
c SciTech Strategies, Inc., 58 Russell Street, Keene, NH 03431, USA
Article info
Article history: Received 25 August 2016; Received in revised form 12 October 2016; Accepted 6 November 2016
Keywords: Discovery; Biomedicine; Citation contexts; Citances; Machine learning; PubMed Central
Abstract
A procedure for identifying discoveries in the biomedical sciences is described that makes use of citation context information, or more precisely citing sentences, drawn from the PubMed Central database. The procedure focuses on use of specific terms in the citing sentences and the joint appearance of cited references. After a manual screening process to remove non-discoveries, a list of over 100 discoveries and their associated articles is compiled and characterized by subject matter and by type of discovery. The phenomenon of multiple discovery is shown to play an important role. The onset and timing of recognition of the articles are studied by comparing the number of citing sentences with and without discovery terms, and show both early onset and delays in recognition. A comparative analysis of the vocabularies of the discovery and non-discovery sentences reveals the types of words and concepts that scientists associate with discoveries. A machine learning application is used to efficiently extend the list. Implications of the findings for understanding the nature and justification of scientific discoveries are discussed. © 2016 Elsevier Ltd. All rights reserved.
1. Introduction

Discoveries are the sine qua non of science. They are how scientists learn new things about the world, make sense of reality, and advance the boundaries of the known into the realm of the unknown. They are also what scientists strive to make, what advances their careers and status, and what they fight over for priority (Merton, 1957). Discoveries can be solutions to known problems or solutions to problems that only later become manifest. But not every discovery pans out. Like polywater, cold fusion, or N-rays, some fail to gain acceptance and fall by the wayside. However, some are so compelling that they command immediate assent and even astonishment, like Archimedes jumping out of the bathtub exclaiming “Eureka!” (Koestler, 1964, p. 106) or Jim Watson saying that the double helix model of DNA is too beautiful not to be true (1968, p. 205).

But what are the hallmarks of scientific discovery and how do we know when a discovery is made? Philosophers of science have drawn the distinction between the context of discovery and the context of justification (Losee, 1972, p. 115; Reichenbach, 1949). The context of discovery, or nascent moment as Holton (1973, p. 17) calls it, can be governed by chance, erroneous information, and even dreams. The context of justification, on the other hand, is where
∗ Corresponding author. E-mail addresses: [email protected] (H. Small), [email protected] (H. Tseng), [email protected] (M. Patek).
1 Hung Tseng’s views expressed here are personal and do not represent those of the NIAMS/NIH.
http://dx.doi.org/10.1016/j.joi.2016.11.001
1751-1577/© 2016 Elsevier Ltd. All rights reserved.
cooler heads must evaluate cold facts. Popper (1959, p. 31) argued that philosophy can only deal with the latter period where hypotheses are put to the most stringent tests. The crucial point is that discovery is more than just an insight, inspiration, or lucky guess. It must also pass some initial threshold of justification and survive a process of ongoing challenges. This second stage may involve corroboration, confirmation by others, and demonstrating consistency with existing experiment and theory (Stent, 1972). In this paper we will be primarily concerned with the context of justification, not the “aha” moment of initial inspiration, and with the process by which the scientific community comes to label a finding as a “discovery”, although in the end we will call the separation of these contexts into question.

Kuhn (1962, p. 52) said that discovery is not possible without a paradigm which sets our expectations. When an expectation is violated, a problem is born which we can then attempt to solve. The solution to the anomaly may require a revolution or revamping of our understanding, which then opens up new questions. Problems that arise within the context of a paradigm are called puzzles. When DNA became recognized as critical to inheritance (Dubos, 1976), the natural question arose “What is the molecular structure of DNA and how does it enable inheritance?” When questions crystallize within a community, a competition among scientists can ensue to find a solution. Of course, the recognition of an unsolved problem or open question requires a deep understanding of the current state of knowledge. Scientists may even lack the framework to ask a question such as “How does gravity affect time?”, a question which would be unlikely to come up without relativity theory. Thus, earlier discoveries can set the stage for later problems and discoveries. As Olby (1974, p. 426) said about Watson and Crick’s double helix structure of DNA, it was not just that it fit with the known facts about DNA but that it opened up new questions and set the framework for future work. This has been variously described as the fruitfulness of a theory (Kuhn, 1977, p. 322). Others have attempted to model the discovery process in computer programs, conceiving all problems as puzzles whose solutions could be found by some kind of heuristic search (Langley, Simon, Bradshaw, & Zytkow, 1987).

Another research tradition sees discovery as the finding of novel combinations. In 1964 Arthur Koestler introduced his ideas on “bisociation”, the joining of two frames of reference to arrive at a novel synthesis. Swanson’s work (1986) is in a similar vein, involving the connecting of previously unconnected areas of biomedical knowledge (or, more accurately, indirectly connected areas) to gain new knowledge. More recently Foster, Rzhetsky and Evans (2015) explore scientists’ problem choices and show that risky choices that pay off result in greater recognition than conservative choices that remain within the paradigm. They operationalize this on a network of chemical entities that have been connected in article abstracts and look for novel combinations: the riskier and more unlikely the combination, the greater the surprise and the reward. Perhaps all discoveries involve a degree of surprise, and the source of this may be the unexpected convergence between the conjecture and the evidence, or, as Ziman (1968, p. 48) describes it, the falsification of a preconceived or vague notion. However, because all new knowledge is tentative and subject to revision, it can take a period of time and contributions by many researchers until the initial conjecture comes to be regarded as a “discovery” by the community. Hence all discoveries are retrospective designations even though some lags may be very short and others quite long.
Despite efforts to construct a theory, the ability to systematically identify discoveries has remained elusive. No comprehensive inventory has been created. The usual approach is to rely on the pronouncements and press releases of scientists themselves or their interpretation by science writers. It would appear at first glance that citation analysis, where we can observe the impact of scientific articles over time, is an ideal tool for identification, and indeed simple citation counts do identify many scientific discoveries (Garfield, 1979). However, there are numerous reasons for citation, and highly cited lists tend to be dominated by methods, reviews and data compilations.² Thus, simple citation counts do not provide enough information for a definitive identification. In this paper we propose a method that augments citation counting with the language used by citing authors, namely words that explicitly label referenced items as discoveries. This method, together with machine learning to omit false positives, greatly improves our ability to automate discovery identification. Once an accurate list is in hand we can begin to work backwards to find common characteristics that can shed light on the nature of discovery.

2. Data and methods

Fortunately, we are now gaining access to an expanding corpus of machine readable scientific articles in full text and we can use this resource to study the contexts in which articles are cited, so-called citation context analysis. An important source of curated full text for the analysis of scientific papers is PubMed Central® (PMC). This open repository was created in 2000 and includes papers that were required to be publicly available under the National Institutes of Health public access policy and legislative mandates. We limited our study of biomedical discoveries to the portion of the PubMed Central full text called the “open access subset”.
This subset includes 1.1 million full texts of primarily biomedical articles covering publications mainly in the most recent period but also some coverage extending back several decades. The oldest article found in the subset was from 1896, but 90% of articles are from the last 13 years (counting through mid-2015). Over the time period, the coverage rapidly expanded from 4500 articles in 2000 to about 200,000 in 2014.
2 Of the 100 most cited articles in PubMed Central, only about four are discoveries; the remaining 96 are either methods, reviews or data compilations.
In addition, the references cited by these source papers have been captured, and PMC processing adds codes to the references that allow the user to connect the reference within the text to the bibliographic information at the end of the article, as well as, in most cases, providing a unique article identifier for each reference, the “PubMed ID”, which can be used to find the item in the National Library of Medicine’s PubMed database. The references, of course, span a much wider time period than do the source articles from which they are taken, going back to 1980 and earlier. However, about 56% of references come from the last 13 years.

The PMC “open access subset” obtained from NLM was downloaded and formatted for loading into a MySQL database. The subset included data up through mid-2015. In addition to the bibliographic information for each article, all reference lists from the articles were parsed, as well as all sentences from each of the full texts, including sentences where references appeared. Roughly 38 million references from the articles were loaded into the MySQL database (34 references per article), along with 166 million sentences (149 sentences per article). About 19% of these sentences contain one or more references.

Our objective was to compile a list of biomedical discoveries and the cited articles associated with those discoveries. To ascertain whether a cited article was a discovery, the citing sentences were searched for the string “*discover*” (where * denotes a wild card), which includes terms such as “discovery”, “discover”, and “discovered”. We will call these “discovery words”. A citing sentence containing “discovery words” will be called a “discovery citance”. A citance is defined as a single sentence from full text that contains one or more references (Nakov, Schwartz & Hearst, 2004), in contrast to the more general term, citation context, which may refer to more extended text around a given reference (Small & Klavans, 2011).
Earlier studies of citation contexts were focused on categorizing the attitudes or sentiments of citing authors toward cited works (Moravcsik & Murugesan, 1975), while others were concerned with characterizing their types and content (Small, 1982). More recently Radev and Abu-Jbara (2012) have outlined a variety of applications for citation contexts including the characterization of discoveries in scientific fields. Our study builds on earlier research on citation contexts that viewed cited documents as shared symbols for specific methods or theories (Small, 1978). Hence in this study we seek to identify articles that have come to symbolize discoveries for a number of citing authors and thus reflect a shared definition of what constitutes a discovery. These earlier studies suffered from the lack of context data in electronic form, and precluded performing the kind of comprehensive analysis reported here. With the growing availability of full text databases, however, the information and computer sciences communities are now making increasing use of citation context and citance data (Teufel, 2010), and the PMC database in particular is being used as a test bed for various analyses, for example, using contexts to enhance information retrieval (Liu et al., 2014). We use citances rather than longer text excerpts so that the association of the discovery words and the references cited will be as close as possible. This strategy can still fail to find associations between the discovery words and a reference if they occur in separate sentences, or, conversely, create a false association if the reference and the discovery words occurring in the same sentence are semantically unrelated. These problems can be compensated for to some extent by requiring that the association of discovery words and specific references occur in multiple citances. For this analysis the minimum number of discovery citances was set to 20. 
Later on we will dip below the threshold of 20 in a machine learning experiment. This second set will consist of 100 additional articles going down to a threshold of 16 discovery citances.

About 0.4% of the sentences having references contained “discovery words”, compared to 0.2% of sentences without references, suggesting that discoveries are more often than not associated with specific articles. While extending the search to sentences not containing references would perhaps have allowed us to find discoveries not associated with specific articles, this task was set aside for future study.

The search for discovery citances in the PMC subset yielded around 126,000 citing sentences associated with one or more cited references. Because an individual citing sentence can contain more than one reference, the number of citance-reference pairs expands to about 238,000 records, that is, nearly two references per citance. Roughly 50,000 (21%) of these references were not given a PubMed identifier in the PMC subset, and were dropped from the analysis. This left 188,000 citance-reference pairs for which full article information was readily available. By summarizing these data on the cited article and counting the number of discovery citances for each article, a list of roughly 126,000 articles was generated and ranked from the most to the fewest citances. Applying the cutoff of 20 citances per article, a list of 293 potential discovery articles was obtained.

In this paper we will describe a series of exploratory analyses focused on identifying and characterizing discoveries with the help of citance data. First we will describe how discoveries are manually selected from the list of 293 articles. Then we will discuss the list of selected discoveries and their topic scope. We will show the role that multiple discovery plays in the list and how this is revealed by co-citation within citances. This is followed by an attempt to classify the discoveries by type.
The timing of discovery recognition is addressed in the next section, followed by a word analysis of the citances to reveal associated concepts. Finally, a machine learning experiment is carried out to see if the manual process of discovery identification can be automated.
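The candidate-selection step described in this section (wildcard search for “*discover*” in citing sentences, dropping references without a PubMed identifier, counting discovery citances per cited article, and applying a cutoff of 20) can be illustrated with a minimal Python sketch. This is not the authors' MySQL implementation: the input layout (a sentence paired with a list of PubMed IDs, with None for a missing ID), the function names, the case-insensitive matching, and the choice to count a reference at most once per sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: the wildcard "*discover*" is treated as a case-insensitive
# substring match, so "discovery", "discovered" and "rediscover" all match.
DISCOVERY_RE = re.compile(r"discover", re.IGNORECASE)


def is_discovery_citance(sentence):
    """Return True if the citing sentence contains a discovery word."""
    return bool(DISCOVERY_RE.search(sentence))


def count_discovery_citances(citances):
    """Count discovery citances per cited article.

    `citances` is an iterable of (sentence, pmids) pairs, where `pmids`
    lists the PubMed IDs of the references cited in that sentence and
    None stands for a reference without an identifier (hypothetical layout).
    """
    counts = Counter()
    for sentence, pmids in citances:
        if not is_discovery_citance(sentence):
            continue
        # Assumption: a reference is counted once per sentence.
        for pmid in set(pmids):
            if pmid is None:
                continue  # references lacking a PubMed ID are dropped
            counts[pmid] += 1
    return counts


def candidate_articles(citances, min_citances=20):
    """Rank cited articles by discovery citances and apply the cutoff."""
    counts = count_discovery_citances(citances)
    return [(pmid, n) for pmid, n in counts.most_common() if n >= min_citances]
```

With real data, `candidate_articles(citances)` would return the ranked list of potential discovery articles; lowering `min_citances` to 16 corresponds to the extended set mentioned for the machine learning experiment.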
3. Results

3.1. Manual screening

Of course this relatively crude search for discovery terms does not guarantee that the retrieved sentence discusses a scientific discovery, and a second step is needed to screen out cases where the discovery words were being used in some other way. The screening process was done manually by retrieving and scanning a sample of citances for each cited article since at least 20 sentences were available per article. Only a small number of citances had to be scanned for each article because of the uniform or repetitive nature of the language used by citing authors (Small, 1978). If one or more citances explicitly identified the referenced article as a discovery it was coded as such. Surprisingly, however, only about 135 of the 293 articles (46%) meeting the threshold of 20 could be classified as scientific discoveries. The majority of cases fell under the broad rubric of methods or tools designed to assist in the discovery process. Prominent among these were databases and computational algorithms for gene or drug discovery whose citances included discovery words but did not indicate that a discovery was actually made. For example, the article with the largest number of discovery citances concerned a tool called “Database for Annotation, Visualization and Integrated Discovery”. The second highest ranked article was a method for calculating False Discovery Rate statistics. The third ranked article, however, was the type of item we were seeking, namely the first identification of a micro-RNA in 1993. Micro-RNAs soon became one of the dominant themes in the list. However, not until the ninth ranked article did we find another actual discovery, namely RNA-interference from 1998. The authors of this article were awarded the Nobel Prize in 2006. While fewer than one-half of the articles turned out to be scientific discoveries, the importance of those discoveries for biomedicine provided some reassurance.
Eleven of the 135 (8%) discoveries identified turned out to have been awarded the Nobel Prize.

A potential ambiguity in the designation of scientific discovery versus discovery method comes about when application of a method resulted in an actual discovery, for example, when the method of the genome-wide association study (GWAS) was successfully applied to discover several genetic loci for type-2 diabetes. In such cases where methods or tools led to specific discoveries, the article was coded as a discovery. Fortunately only a few cases presented this difficulty.

Another surprise was finding two physical science discoveries, specifically graphene and high temperature superconductivity. The PMC database has no stated subject matter scope, only the stipulation that the work was supported by Federal funding, and hence physical science papers can be included, although there is some indication that graphene is playing a role in some biomedical applications such as biosensors. It is, however, evident from the list of discoveries (see the Appendix) that the predominant focus of the PMC is molecular genomics and especially the role of DNA and RNA in disease.

3.2. The discovery list

A listing of the top 128 biomedical discoveries and their associated articles, ranked by the number of “discovery citances”, is given in the Appendix.³ The article publication years range from 1950 to 2012. In addition to eliminating the methods and tools articles, the list also excludes the two physical science articles noted above, four review articles which review discoveries but do not announce them, and one database description. For each discovery the Appendix shows the number of “discovery citances” and the percentage of discovery citances of the total citances for each article retrieved from the PMC database. For example, for the first article listed in the Appendix dealing with the first identification of a micro-RNA, the percentage of “discovery citances” is 38%.
This compares to an overall average for the list of 13%. The total number of citances, or citing sentences, for an article can be thought of as a citation count which has been weighted by how often the article is cited within each citing text. The listing also includes the year the article was published, a brief description of the discovery, its primary author, and the article title (full bibliographic information is available online). The last column characterizes the type of discovery which will be described below. The first column is a unique id number which we will refer to in the discussion. A number of prominent themes emerge on inspection of the list which reflect both the special nature of the PMC database as well as focal points of modern biomedical research (Table 1). Micro-RNAs and RNA research, in general, are the most prominent themes, comprising about 13% of the discoveries. Topics include the growing identification of micro-RNAs performing regulatory functions in gene expression and development, their use as disease markers, and their role in RNA interference or gene silencing. Other RNA-related work includes the identification of large non-protein-coding RNAs, pre-messenger RNA splicing, “interrupted genes”, and extra cellular RNA or shuttle RNA. Another prominent theme is cancer genetics, including mutations, and a number of genetic defects linked with the development of cancer. Subtopics include the study of oncogenes, gene fusions and mutation accumulation in cancer. If we broaden this theme to include all diseases with an apparent genetic basis, not just cancer, this would perhaps be the most prevalent theme in the list, including Crohn’s and Parkinson’s diseases, and type-2 diabetes with five discoveries on the list. Obesity and dementia also have genetic components and are classified elsewhere. A third broad theme is the role of viruses in disease, particularly in cancers, where we find seven discoveries. 
Noncancer disease viruses include new respiratory viruses with six entries, work on HIV/AIDS with four entries, and two for
3 The full list of discoveries with bibliographic information is available online.
Table 1
The discoveries listed in the Appendix are categorized by topic, giving the number of discoveries for each topic and the id numbers of the corresponding entries in the Appendix.

Topic                             Discoveries  IDs (see Appendix)
Micro-RNA                         13           1,2,3,5,11,13,18,20,25,73,75,123,124
RNA general                       5            24,40,49,117,119
Cancer genetics                   13           22,26,30,41,45,57,64,70,98,107,110,112,115
Genetic diseases (non-cancer)     7            61,67,79,84,93,103,113
Cancer viruses                    7            8,28,34,36,42,81,92
Disease viruses (non-cancer)      12           10,17,31,37,60,63,69,72,85,94,109,121
Stem cells                        5            4,7,23,106,127
Neurobiology                      12           14,16,43,46,52,54,58,76,78,86,114,126
Regulatory molecules/receptors    11           29,35,44,48,74,88,90,96,99,105,118
DNA sequencing, genes             11           19,27,33,56,71,80,82,87,91,102,122
Copy number variations            5            47,50,51,62,128
Immunology                        5            68,95,104,108,125
Microbiology                      4            38,39,53,116
Epigenetics                       4            9,12,83,120
Obesity/metabolism                7            6,15,21,66,77,100,111
Other                             7            32,55,59,65,89,97,101
giant DNA viruses. Stem cell research, including the newer induced pluripotent stem cells, is well represented with five discoveries (Chen, Hu, Liu, & Tseng, 2012). Neurobiology has a very strong showing with discoveries of mirror neurons, grid and place cells, synaptic plasticity, and peptides that regulate sleep. Another strong category is comprised of receptors and regulatory molecules. This includes several articles on toll-like receptors, the second estrogen receptor, cannabinoid receptors and three interleukins. A large, but somewhat diffuse, group deals with various aspects of DNA structure and sequencing including the classic discoveries of the double helix of Watson and Crick, “jumping genes” by McClintock, the initial draft of the human genome sequence, telomerase, and ribozymes. Smaller categories of discoveries deal with copy number variations in the genome, immunology, microbiology, epigenetics, specifically Tet family proteins, and aspects of obesity and metabolism, including peroxisome proliferator-activated receptors and irisin. The immunology category includes the well-known CRISPR finding of a new prokaryotic immune mechanism which later led to gene editing. Only about seven discoveries do not fall into any of the above categories, and although there is no overarching theme that connects all of them, there is a strong emphasis on processes at the cellular and sub-cellular level. One item on this list classified in the cancer virus category (id 42) is a retracted article concerning an alleged virus associated with prostate tumors, and is hence a false discovery.

3.3. Multiple discoveries

While we have listed each discovery separately in the Appendix, it would be a mistake to regard them as independent entities. In fact, citing authors see many of these as closely related and dependent on one another.
Our description of subject matter groupings above can be replicated to some extent by an analysis of their co-citation patterns within individual citing sentences. There are a total of 6574 citances that cite two or more of the 128 discoveries, and 42 citances cite five or more. Such a within-context co-citation analysis has been shown to have higher accuracy than the standard within-article co-citation (Boyack, Klavans & Small, 2013). For example, when a micro-RNA discovery article is cited in a specific citing sentence, very often other micro-RNA discovery articles are cited. In the highest co-citing citances, citing five or more discoveries, the main topics are micro-RNAs, new respiratory viruses, genes for type-2 diabetes, disease-genes connections found by genome wide association studies, copy number variations, and human genome sequencing. It is also evident that a number of topics repeat. These can be independent co-discoveries if they are published in the same year, or follow-on discoveries that build on or extend the original discovery if they appear in successive years. Examples of follow-on discoveries are the creation of induced pluripotent stem cells first in mice (id 4) and then in humans the following year (id 7). Also, mirror neurons were posited in 1992 (id 14) and confirmed and posited in humans in 1996 (id 16). We can study why discovery articles are cited together by examining the co-citing sentences. For example, the most highly co-cited pair involves the creation of human induced pluripotent stem cells (ids 7 and 23). This discovery occurred nearly simultaneously by two groups, to quote: “Human iPS cells were first independently produced by Yamanaka’s and Thomson’s groups from human fibroblasts in late 2007.” (Ye, Swingen & Zhang, 2013). In another independent co-discovery, the two aptamers articles on the list (ids 89 and 101) were published in 1990 by two independent labs (Bai, Wang, Hargis, Lu, & Li, 2012). 
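The within-citance co-citation counts described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each citance is represented as a list of cited article ids (a hypothetical layout), and `cocitation_counts` is a name introduced here, not a function from the study.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citances, discovery_ids):
    """Count how often pairs of discovery articles share a citing sentence.

    `citances` is an iterable of lists of cited article ids (one list per
    citing sentence); `discovery_ids` is the set of ids on the discovery
    list. Returns (pair_counts, n_multi), where n_multi is the number of
    citances that cite two or more discoveries.
    """
    pair_counts = Counter()
    n_multi = 0
    for cited_ids in citances:
        # Keep only the discoveries, deduplicated and in a stable order.
        hits = sorted(set(cited_ids) & set(discovery_ids))
        if len(hits) >= 2:
            n_multi += 1
            pair_counts.update(combinations(hits, 2))
    return pair_counts, n_multi
```

Calling `pair_counts.most_common(1)` on the result would surface the most highly co-cited pair, the kind of pair discussed for the human induced pluripotent stem cell articles.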
Based on co-cited articles having the same publication year and different authors, we estimate that about one-third of the articles on the list are in fact part of independent multiple discoveries, a phenomenon studied by Merton (1963). Independent discoveries involving three or more separate groups are prominent, for example, for a new class of micro-RNAs in C. elegans (ids 13, 18 and 25): “In 2001, three groups published independent reports on the discovery of a new class of small non-coding RNAs (sRNAs), which were named micro-RNAs (miRNAs).” (Agirre & Eyras, 2011) Also, three groups (ids 22, 64 and 107) found that a mutation in the epidermal growth factor receptor made non-small-cell lung cancer responsive
to drug treatment (Carcereny et al., 2015). One of the biggest multiples involved five groups as indicated by this context: “In the spring of 2005, almost simultaneously, five reports appeared describing a mutation in the gene coding for JAK2... in patients with chronic myeloproliferative diseases.” (Bennett & Stroncek, 2006) Three of the relevant articles were on the list (ids 57, 98 and 110).

3.4. Characterization of discoveries

Another type of question concerns the nature of these discoveries, whether paradigm breaking in Kuhn’s sense, or puzzle solving within the context of normal science. This is a difficult task since it involves knowing the state of knowledge prior to and following each discovery, and some way of gauging the novelty and unexpected character of the findings (Foster, Rzhetsky & Evans, 2015). Each article was categorized as one of three types: a violation, innovation or extension based on a reading of the discovery articles themselves. An article was labeled as a “violation” if the finding seemed to run counter to, or violate, the then prevalent view, and the authors discussed alternative views or opinions held by others. This need not break a paradigm, but it should necessitate an adjustment to it. By this criterion, the Watson and Crick double helix (id 19) was classified as a “violation” because other authors were cited who espoused a triple helix structure of DNA, among them Linus Pauling (Olby, 1974, p. 376). An article was coded as an “innovation” if the finding was unexpected, marked a major leap in understanding, provided a new direction in research, but did not violate the then accepted opinion. For example, the discovery of RNA-interference in C. elegans (id 2) was categorized as an “innovation”, because it was a new phenomenon caused by a double-stranded RNA, and not merely an extension of the 1993 discovery of the first micro-RNA in the same organism (id 1).
The category of “extension” was assigned to articles that built on similar earlier discoveries or were new examples of entities that followed earlier models. For example, the finding of a second micro-RNA in C. elegans in 2000 (id 3) that also controlled worm development was an extension of the earlier 1993 finding. The three categories should not be thought of as separate boxes but rather as degrees on a scale of unexpectedness or surprise, with violations the most surprising and extensions the least. From this reading only 16 of 128 articles appear to be “violations” or counter to paradigms. For example, the article that identified the first micro-RNA in 1993 (id 1) was coded a “violation” because the function of the RNA controlling the worm’s developmental staging ran counter to the then expected function of RNA which was to code a protein, according to the “central dogma” in molecular biology (Olby, 1974, p. 432). Similarly, the discovery of gene product leptin (id 6) which physiologically controls body weight was coded a “violation” because it ran counter to the idea that obesity was psychologically determined. The discovery of circulating endothelial progenitor cells (id 32) ran counter to the idea that new blood vessels could only arise from fully differentiated endothelial cells derived from preexisting blood vessels. The discovery of grid cells (id 46) in a particular brain region that allowed animals to navigate in space ran counter to the accepted notion that this function resided in the hippocampus. The discovery of Archaea (id 53) that can live in cold water by aerobically oxidizing ammonia to nitrite violated the idea that Archaea were extremophiles only living in extreme ocean environments such as hydrothermal vents. These and other instances of “violations” have in common some finding that breaks an existing dogma or challenges a point of view and provides an alternative path. 
We might hypothesize that articles deemed “violations” are more discovery-like by virtue of their more revolutionary effects on their fields, and that “innovations” should be more discovery-like than “extensions”. This is borne out if we look at the percentage of citances for articles in each category that contain “discovery words” of the total number of citances received: 17.4% for violations, 8.2% for innovations and 7.6% for extensions. Furthermore, the log likelihoods of “discovery words” were highly significant when “violation” citances were compared with citances of the other types. The word “first” also occurred at a statistically significant rate in citances for violation articles, and is used to indicate priority for a finding, such as “first identified”, “first reported” or “first discovered”. We will have more to say about word usage in citances in Section 3.6.

3.5. Timing of recognition

The question of when articles begin to be called “discoveries” by citing authors is also of interest, as well as the age of articles when that labeling is at its peak. It is evident that many articles in our list are quite old. The oldest article is from 1950 (McClintock’s “jumping genes”, id 81), and the average publication year is about 1998. This compares to an average publication year of 2006 for articles in the original list of 293 which were omitted because they were tools or methods. Recognition of discovery appears to be a process that extends over a period of years. We could hypothesize that the labeling of a finding as a “discovery” is made only in retrospect, in light of subsequent findings and theories, much as Lakatos (1970, p. 158) argued that so-called “crucial experiments” are recognized only in retrospect. Some discoveries appear to be strengthened by subsequent discoveries of related entities, as exemplified by micro-RNAs, or multiple novel respiratory viruses. A case in point is induced pluripotent stem cells published in 2006 (id 4).
This paper was cited very soon after publication and the paper had accumulated 67 citances in the PMC database by 2008. However, none of these were “discovery citances”. In 2009, three years after publication, the paper began being cited as a discovery having five out of 92 citances (5%) labeling it as such. The year with peak “discovery citances” was 2013 when 7.3% of citances contained discovery words. Similarly RNA-interference published in 1998 (id 2) accumulated 28 citances by 2002, but only began being referred to as a discovery in 2003. The peak year for discovery citances was 2011 at 30%, nine years after publication.
[Figure: Percentage discovery citances by age range. Vertical axis: percentage discovery citances (0.00% to 25.00%). Horizontal axis: article age range at time of citation in years (1-5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35, 36-40).]
Fig. 1. Percentage of discovery citances is the number of citances containing discovery words divided by the total number of citances in the PMC database. Age is determined by subtracting the publication year of the article from the year of publication of the citance and grouping the data by five year periods. Only articles listed in the Appendix are included.
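The quantity plotted in Fig. 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the function names `age_bin` and `discovery_share_by_age`, the choice to count age 0 in the first range, and the toy records are all hypothetical and are not taken from the study.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (publication year of the cited article,
# publication year of the citance, whether the citance has a discovery word).
Citance = Tuple[int, int, bool]


def age_bin(age: int, width: int = 5) -> str:
    """Label the age range an age falls in, e.g. '1-5' or '6-10'.

    Assumption: age 0 is counted in the first range.
    """
    index = max(age - 1, 0) // width
    low = index * width + 1
    return f"{low}-{low + width - 1}"


def discovery_share_by_age(citances: Iterable[Citance], width: int = 5) -> Dict[str, float]:
    """Percentage of citances with discovery words in each age range."""
    discovery = defaultdict(int)
    total = defaultdict(int)
    for pub_year, cite_year, has_discovery_word in citances:
        label = age_bin(cite_year - pub_year, width)
        total[label] += 1
        discovery[label] += int(has_discovery_word)
    return {label: 100.0 * discovery[label] / total[label] for label in total}


# Toy input, not data from the study.
toy_citances = [
    (1998, 2000, False),
    (1998, 2001, False),
    (1998, 2002, True),
    (1998, 2003, False),
    (1998, 2005, True),
    (1998, 2007, False),
]
shares = discovery_share_by_age(toy_citances)
print(shares)
```

Setting `width=1` and keeping age 0 as its own category would give a year-by-year distribution instead of five-year ranges.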
An interesting case is the XMRV article (id 42) published in 2006 and retracted in 2012. This paper accumulated 26 citances in the first four years after publication. In 2009, 5 of 14 citances (36%) labeled it as a discovery, and in 2010, the peak citation year, 13 of 109 (12%). Starting in 2011 the decline in citation began as questions about contamination and lack of replication were put forward. In the three years after the retraction by the journal in 2012, 4 of 32 (12%) citances found in PMC still called it a discovery. We can conclude that some discoveries show a time lag in recognition and even papers that are discredited continue to be labeled as discoveries, suggesting considerable inertia in the system. An overview of the time evolution of discovery recognition can be obtained by calculating the percentage of discovery citances for articles of different ages. Age is computed by subtracting the year of publication of the cited article from the year of publication of the citing sentence. Summing across all discovery papers, we find that the percentage of discovery citances increases from about 5% in the first five years after publication to a peak of 22% in the five-year period 26–30 years after publication (Fig. 1). In the ten-year period thereafter, the rate falls to about 18%. It is of interest to examine the discoveries that are recognized early on, within the first five year period. The most prominent among these are discoveries on epigenetic Tet family proteins (ids 9 and 12), polyomavirus (id 8), the alleged new virus found in prostate tumors and later retracted (id 42), and the gene for Miller’s syndrome found by exome sequencing (id 27). Because the PMC coverage is biased toward the current period, to get a more accurate picture of the onset of discovery recognition, we excluded all discoveries published prior to 1997 and recomputed the age distribution of percentage discovery citances for 18 years on a year by year basis (Fig. 2). 
Age zero is the year of publication of the article, and the 4.5% discovery citance rate shows that recognition can begin very soon after publication with increasing intensity into the second decade after publication. Examining the articles gaining discovery recognition at age zero, that is, in the year they were published, we find again the Miller’s syndrome gene (id 27), irisin and brown fat development (id 77), and gene fusions in cancer (id 45). Thus, when we look at the overall pattern across all discoveries, delays in recognition of the kind seen in individual cases, such as induced pluripotent stem cells and RNA interference, do not seem to be the norm. This indicates that recognition of some discoveries by at least some subset of the citing community is immediate, and hence the initial evidence presented by the discoverer is sufficiently compelling to win converts even before replication or confirmation by others. On the other hand, only one age zero reference out of a total of 47 was found for an article classified as a “violation”, namely for brain grid cells (id 46). This rate is lower than expected based on the number of “violation” articles within the post-1996 time frame, suggesting a delay in recognition for discoveries of this type.

3.6. Word analysis

Citances provide a rich source of information on how citing authors characterize and utilize earlier literature. We expect that articles describing discoveries will be cited differently from, for example, methodologies, reflecting perhaps their cognitive significance and epistemological status, and revealing associated concepts and terms. An initial approach is to analyze the individual words used when discovery articles are cited compared to the vocabulary used in non-discovery citances. For non-discovery citances we use those associated with the methods and tools articles which were not classified as discoveries.
[Figure: Percentage discovery citances by age, publication year > 1996. Vertical axis: percentage discovery citances (0 to 12). Horizontal axis: article age at time of citation (0 to 18 years).]
Fig. 2. Percentage discovery citances for articles in Appendix published after 1996. Age is determined by subtracting the publication year of the article from the year of publication of the citance. Percentage of discovery citances is the number of citances containing discovery words divided by the total number of citances in the PMC database.

Table 2
The top general scientific method words, excluding technical words and function words, are ranked by log likelihood as calculated by the Wordsmith Tools software. Words for the discovery citance set with respect to the non-discovery set are listed on the left, and non-discovery set words are listed on the right. The ratio of percentage frequencies is computed by dividing the frequency of the word by the total number of running words in the set.

Discovery set: Word | Ratio of % frequencies | Log likelihood | Non-discovery set: Word | Ratio of % frequencies | Log likelihood
discovered | 5.0 | 2122 | using | 8.0 | 19,257
first | 3.0 | 1940 | analysis | 8.0 | 10,450
important | 3.0 | 1088 | data | 7.2 | 8549
mechanism | 4.3 | 895 | used | 4.1 | 5761
recently | 2.7 | 848 | performed | 6.7 | 4255
cause | 5.5 | 827 | value | 37.6 | 3593
shown | 2.2 | 816 | version | 41.5 | 3555
demonstrated | 2.5 | 750 | algorithm | 60.1 | 2648
found | 1.8 | 548 | project | 6.5 | 2585
reported | 2.0 | 456 | tool | 10.0 | 2417
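The two statistics reported in Table 2 can be illustrated with a short Python sketch. It assumes the standard two-corpus log likelihood (G2) commonly used for keyword analysis; the exact computation and settings used by the Wordsmith Tools software are not given in the text, so the function below is a stand-in rather than a reproduction. The word counts are hypothetical and chosen only to match percentages of 0.1% and 0.02%, which give the ratio of 5.0 listed for “discovered”.

```python
import math


def frequency_ratio(freq_a: int, words_a: int, freq_b: int, words_b: int) -> float:
    """Relative frequency of a word in set A divided by that in set B."""
    return (freq_a / words_a) / (freq_b / words_b)


def log_likelihood(freq_a: int, words_a: int, freq_b: int, words_b: int) -> float:
    """Two-corpus log likelihood (G2) for one word (assumed formulation)."""
    pooled = freq_a + freq_b
    all_words = words_a + words_b
    # Counts expected in each set if the word were equally frequent in both.
    expected_a = words_a * pooled / all_words
    expected_b = words_b * pooled / all_words
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


# Hypothetical counts: 0.1% of one million words versus 0.02% of one million.
ratio = frequency_ratio(1000, 1_000_000, 200, 1_000_000)
g2 = log_likelihood(1000, 1_000_000, 200, 1_000_000)
print(round(ratio, 2), round(g2, 1))
```

The ratio depends only on the two relative frequencies, whereas the log likelihood also grows with corpus size, which is one reason the two columns of Table 2 rank words differently.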
To perform this analysis we retrieved all the citing sentences for each of the discovery and non-discovery articles, not just those containing the discovery words. This increased the number of citances associated with each article by a factor of about 14. The frequency and percentage of individual words were analyzed for the separate sets using Wordsmith Tools, a software package widely used in corpus linguistics (Scott, 2004). The software calculates the percentage of each word by dividing the single word frequencies by the total number of running words in the set. To find words that occurred prominently in one set but not the other, so-called keywords, the software computes the log likelihood statistic of one corpus relative to the other and also gives the p-value for each word. It is also interesting to compute the ratio of percentages of word frequencies. For example, the word “discovered” had a percentage of 0.1% in the discovery set and a percentage of 0.02% in the non-discovery set for a ratio of 5.0 (see Table 2), which means that the word occurs five times more often on a relative basis in the discovery set than in the non-discovery set. With a log likelihood of 2122, “discovered” was the 30th ranked word by log likelihood in the discovery set relative to the non-discovery set. Of course, it is not surprising that the verb “discovered” appears at a higher rate in discovery citances since this set is concerned with descriptions of acts of discovery, for example, in phrases such as “X was discovered by Y”. Using this approach we can thus identify the words that occur at statistically significant rates in the discovery citance set vis-à-vis the non-discovery set as a baseline, and vice versa. Examining the highest log likelihood words in the discovery citances, using the non-discovery set as the baseline, the most prominent are the technical words “cells”, “miRNAs”, “stem”, and “pluripotent”.
Of course these technical words reflect the subject matters of the discoveries, and the prominence of the word “cells” indicates that many of them deal with cell biology. By contrast, the highest ranking words in the non-discovery set, using the discovery set as the baseline, are “using”, “analysis”, “blast” and “data”. These words are predominantly general scientific words or so-called scientific method words (with the exception of “blast” which stands for “Basic Local Alignment Search Tool”, a tool used in sequence analysis) that reflect actions and activities around the application of methods. The top word “using” is consistent with the methods and tools focus, appearing in formulations such as “X was determined using Y”. Further insights into the differences between the discovery and non-discovery sets can be gained by looking at the top 10 highest ranked general scientific terms by log likelihood in each set (Table 2). General scientific terms, a subset of “English for
academic purposes” (Paquot & Bestgen, 2009), are words that reflect scientific practices but are, as far as possible, neutral in terms of technical content. Of course, such neutrality is hard to achieve because even general scientific terms can be content dependent. Nevertheless, in Table 2 the cognitive aims of each set are clearly visible. On the right side of Table 2 words such as “used”, “using”, “performed”, “data”, “analysis”, “algorithm” and “tool” clearly reflect a methodological and application orientation of the non-discovery set. These words have much higher log likelihoods due to the homogeneity of this set in terms of its methodological goals. By contrast, the discovery set on the left side of Table 2 shows much higher word diversity due to the variety of topics and also a predominance of technical vocabulary. In this set the general scientific words “first” and “recently” reflect concerns with issues of priority and a focus on the latest research. These might be called “timing” words. These modifiers are often used to describe discoveries in formulations such as “the recent discovery of X” or “X was first discovered by Y”. The word “important” is a value judgment used to describe the finding itself or to underline some critical feature such as the important role of a process or substance. The word “mechanism” reveals how scientists think about the inner workings of the cell. The word “cause” reflects the concern with finding the causes of diseases. The last four words in Table 2 could be called “outcome words”, or “evidentials” (Hyland, 2004, p. 191), describing the outcomes of research and evidence for findings. The term “discovered” could itself be considered an outcome word. The words “shown”, “demonstrated”, and “found” imply that a finding has achieved a degree of certainty or corroboration, while the term “reported” is a more neutral expression of outcome. 
Common expressions are “it has been shown that”, “has been demonstrated to play a role”, “have been found in”, and “was first reported to be”. We can think of these words as aspects of the discovery process that give credibility to the results of research. In short, the work can be “important”, uncover a “cause”, “demonstrate” or “show” something, or describe a “mechanism”. Each of these auxiliary concepts associated with descriptions of discoveries reinforces the certainty, importance and primacy of the new knowledge.

Building on the findings of Section 3.5 on the timing of recognition, it is interesting to compare the word usage in citances made soon after the publication of a discovery versus those made during a later period. Here we compare citances made zero to five years after publication with those made 10 to 24 years after publication. Again, we have augmented the data to include all citances, not just those containing “discovery words”. As in the comparison of discovery to non-discovery sets, we use the early citances as a baseline against which to gauge the later set and vice versa. Top general scientific terms in the late versus early corpus with significant log likelihood include “first”, “discovered”, “since”, “discovery” and “early”. Thus, “timing” words emphasizing past events are prominent, as well as “discovery” words, which show that labeling articles as discoveries gains strength with the passage of time. Top words in the early versus late corpus, in contrast, stress the urgency and engagement in active research: “recently”, “recent”, “previously”, “replicated”, “reported”, “detected” and “consistent”. The words “replicated” and “consistent” are especially interesting because they show the process of justification of discovery is underway, involving both the repetition of findings as well as showing consistency with prior knowledge.
Although citances occurring zero years after publication are fewer in number than those occurring zero to five years after publication (600 versus 21,000), they illustrate the immediate concerns of citing authors. In addition to most of the general words listed above for the zero to five year set, keywords with significant log likelihoods include “predicted”, “novel”, “confirmed”, and “validated”. This shows that early discovery recognition may be influenced by the novelty of findings and the ability to make predictions which can be confirmed or validated.

3.7. Machine learning

The analysis of the word frequencies associated with the discovery and the non-discovery citances reveals that the word profiles for each type differ in systematic ways. These differences in word patterns suggest that a machine learning approach may be useful for differentiating discoveries from non-discoveries, thus facilitating our identification task. Machine learning is a widely used tool in computer and information sciences, and its use in text classification is particularly relevant (Demarest & Sugimoto, 2015). The plan was to use all citances associated with the 293 articles having 20 or more “discovery citances” as input to the training task for machine learning. This included 128 articles coded as discoveries and 165 non-discoveries. Input to machine learning training consisted of all citing sentences, including those with and without “discovery words” for the 293 articles. The citances for a particular article were aggregated to form a single “bag of words” for each article. In other words, each article was represented as a concatenation of all its citing sentences, without regard to word order. This resulted in a total of 188,566 sentences for the 293 articles or an average of 643 sentences per article. Articles in the training set were coded “1” for discovery and “0” for non-discovery. As a test set, an additional 100 articles falling below the cutoff of 20 were manually coded but were not used for training.
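The set-up used in this section can be sketched with Scikit-learn as follows: citances are grouped per article into one bag-of-words document, converted to tf-idf features with a stop-word list, and passed to a ridge classifier. This is a minimal sketch, not the authors' code. The helper `aggregate_citances`, the article ids, the toy sentences and the use of default parameters are all assumptions made for illustration.

```python
from collections import defaultdict

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import make_pipeline


def aggregate_citances(citances):
    """Join all citing sentences of each article into one document."""
    grouped = defaultdict(list)
    for article_id, sentence in citances:
        grouped[article_id].append(sentence)
    return {aid: " ".join(sentences) for aid, sentences in grouped.items()}


# Toy citances (hypothetical): ids starting with "d" are coded 1 (discovery),
# ids starting with "m" are coded 0 (non-discovery).
train_citances = [
    ("d1", "The miRNA gene was discovered in worms"),
    ("d1", "This discovery revealed a novel mechanism in cells"),
    ("d2", "Pluripotent stem cells were discovered in fibroblasts"),
    ("d2", "The discovery of this gene changed stem cell biology"),
    ("m1", "Sequences were aligned using the software tool"),
    ("m1", "Alignment was performed using this algorithm"),
    ("m2", "Data analysis was performed using the software package"),
    ("m2", "The algorithm was used for statistical analysis of the data"),
]
train_labels = {"d1": 1, "d2": 1, "m1": 0, "m2": 0}

documents = aggregate_citances(train_citances)
article_ids = sorted(documents)
X_train = [documents[aid] for aid in article_ids]
y_train = [train_labels[aid] for aid in article_ids]

# tf-idf features with an English stop-word list, then a linear ridge classifier.
model = make_pipeline(TfidfVectorizer(stop_words="english"), RidgeClassifier())
model.fit(X_train, y_train)

# Unseen toy articles, aggregated in the same way as the training articles.
unseen = aggregate_citances([
    ("u1", "The virus was discovered in stem cells"),
    ("u2", "Reads were mapped using the alignment software"),
])
unseen_ids = sorted(unseen)
predicted = model.predict([unseen[aid] for aid in unseen_ids])
predictions = {aid: int(label) for aid, label in zip(unseen_ids, predicted)}
print(predictions)
```

Wrapping the vectorizer and classifier in one pipeline means the unseen documents are transformed with the vocabulary and weights learned from the training documents only.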
This set served as “unseen” data. This set of 100 included articles down to 16 “discovery citances” per article below the initial threshold of 20 and contained 37,988 total citances, considerably fewer than the training set because of its lower citation rate. The topics represented in the 100 additional articles were similar to those found in the larger training set. A total of 48 items were manually coded as discoveries.4 Notable examples are the BRCA1 and 2 genes in breast cancer, the molecular basis for motility in eukaryotic flagellum, proteins in age-related macular degeneration, reverse transcriptase, and the Ebola virus. Again many were concerned with disease causation by genes or viruses. The Scikit-learn package was used for machine learning (Pedregosa et al., 2011). This software processes each document by applying a stop-word list and converting each document to a vector of features. Each vector consists of words (coded
as numbers) and weights given by tf-idf scores. The document vectors define points in a hyper-dimensional space whose axes are individual words. The objective of the training is to find an optimal hyperplane in feature space with instances of discovery documents on one side of the plane and non-discoveries on the other side. After training on the feature vectors for the 293 articles using various classifiers which defined coefficients for the hyperplane, the resulting solutions were applied to the test sample consisting of 100 additional documents which had not been used in the training. Ten different classifiers available in the Scikit-learn package were tested separately giving accuracies ranging from 0.6 to 0.94 with eight of the ten near 0.90. Best results were obtained with the ridge regression classifier which is similar to standard linear regression but uses a noise reduction factor. An inspection of the coefficients of the optimal hyperplane revealed that the words having the highest positive coefficients are strongly correlated with the words having the highest log likelihoods for the discovery set (see Section 3.6), and words with the most negative coefficients were correlated with the highest log likelihoods for the non-discoveries. For example, the word “cells” had the highest hyperplane coefficient and highest log likelihood for the discovery set, and the word “used” had the most negative coefficient and the highest log likelihood for the non-discoveries. The ridge regression classifier gave an accuracy of 94%. This means that for 94 of 100 articles in the second set of data, the machine and manual classifications were identical. There were 44 true positives, two false positives and four false negatives, for an F1 value of 94%. The two false positives, where the machine learning said it was a discovery but the manual classification said it was not, dealt with a micro-RNA discovery by deep sequencing and discovery of cryptic species in taxonomy. 
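The reported accuracy and F1 value follow directly from the counts given above. The short check below uses the 44 true positives, two false positives and four false negatives from the text; the 50 true negatives are not stated explicitly and are inferred here as the remainder of the 100 test articles. The function name is illustrative.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# tp, fp and fn are from the text; tn = 100 - 44 - 2 - 4 is inferred.
metrics = classification_metrics(tp=44, fp=2, fn=4, tn=50)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision (44 of 46) is somewhat higher than recall (44 of 48), reflecting that false negatives outnumbered false positives in this test sample.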
The four false negatives, where the manual classification said it was a discovery but the machine did not, concerned a draft sequence of the human genome, a drug discovery of tuberculosis inhibitors, discovery of type-2 diabetes SNPs using genome-wide association studies, and discovery of autism genes using exome sequencing. While the four false negatives might be considered borderline cases, machine learning may have been influenced by numerous cases in the training data where DNA sequencing and genome-wide association were not associated with discovery. On the other hand, the human classifier may have been influenced by well-known articles and biased against systematic biology. Nevertheless, we can conclude that machine learning can be of considerable assistance in discovery identification. It remains to be seen how far down the list we can go in terms of number of “discovery citances”, that is, whether articles with only a few citances can be accurately classified and what the lower limit is, although classification of another 1000 articles seems within reach. The ability to accurately classify discoveries using as few citances as possible is important for identification of very recent discoveries which have only a short time to be cited.

4. Discussion

While discovery in science can be an extremely private matter taking place in the mind of the discoverer, we have taken the view that it is the community of citing authors that ultimately decides what is and is not a discovery, and this community designation extends over a number of years and is potentially subject to revision, as in the case of retractions. Of course, this means that if a “discovery” is made and is overlooked or goes unrecognized by the community, it will be missed by this method. There have been well known cases in the history of science where discoveries have been overlooked only later to be resurrected by embarrassed scientists.
Examples include the discovery of the kinetic theory of gases by Waterston in the 19th century (Gribbin, 2002, p. 389), and the discovery of the units of inheritance by Gregor Mendel (Mukherjee, 2016, p. 59). In these cases, papers were either rejected, or ignored and uncited until decades later. Gunther Stent’s theory of premature discovery in science nicely covers such cases, where scientific findings that do not fit with the “canonical knowledge”, or paradigm, in a given period can go unrecognized. On the other side of the coin, we could ask if spurious findings can be labeled as discoveries simply because they seem consistent with contemporary knowledge. Certainly such is the case with articles that have been labeled as discoveries only later to be retracted as errors. In a different sense, could there be cases of “bandwagon” discoveries when a particular topic becomes popular and a field is inundated by incremental additions to knowledge? These are difficult questions we will not be able to settle here.

A possible deficiency in our method may be its failure to identify important methodologies such as recombinant DNA or the polymerase chain reaction. Other search strategies will be needed to systematically identify this type of innovative work, perhaps by the use of other key word filters. It is also possible that discoveries are missed, or are overly represented, because of the limitations or biases inherent in the database. It should be recalled that about one-fifth of the discovery references were left out of the analysis due to the lack of source data in the PMC. An example of a missed discovery due to thresholding is the 2016 Nobel Prize in Physiology or Medicine which was awarded to Yoshinori Ohsumi for autophagy. One article associated with this discovery from 1993 received only eight discovery citances, and thus falls below the threshold of 20 used to compile our list. Such a low citance rate is possibly indicative of a subject bias in the database.
Finally, Merton’s “obliteration by incorporation” effect may also be at work, where findings become so embedded and second nature in the working habits of scientists that explicit mention is no longer thought necessary, leading to the premature disappearance of discoveries in the collective consciousness.

Regarding the timing of discovery recognition, we have seen, in some cases, there can be a delay of a few years in the onset of discovery citances. Whether the community is withholding judgment until further evidence is forthcoming or some other factor is at work, such as the Stent effect, is not known. However, looking at all discovery articles in aggregate, this delay is not apparent, and recognition is often immediate, though at a low level, with an increasing rate of recognition across the years which may only begin to decay decades later. Such nearly immediate recognition requires us to examine
our assumptions about the separation of the contexts of discovery and justification as usually conceived by philosophers of science, and suggests that they cannot be so easily separated. A similar conclusion was reached by Simon and colleagues (Langley et al., 1987, p. 57) where discovery is conceived as a stepwise process where positive or negative confirmation is generated at each step. Discovery then, if it is to be quickly labeled as such by others, must be accompanied by some degree of justification, corroboration and fit with the evidence. The picture may not be complete upon publication, with many blanks to fill in and implications to be followed up, but some degree of fit must be evident in the nascent moment. Such was certainly the case with Watson and Crick’s DNA double helix where numerous models were rejected until one that satisfied inter-atomic distances, base ratios and x-ray crystallography evidence finally emerged (Watson, 1968).

Our approach to discovery identification is also complicated by the phenomenon of multiple discovery, diluting the number of distinct discoveries on a given list. However, co-citation methods are an effective tool to identify multiples and offer a new approach to systematically identifying these events. Multiples, as Merton calls them, are an important phenomenon and can reveal how competition and cooperation operate in science (Small, 2016).

On the question of the nature of discovery, it proved difficult to find evidence for new ideas from the linking of unconnected literatures or new combinations of existing chemical entities. Most discoveries examined here appeared to involve some kind of novelty. Combinations, to the extent they were in evidence, were of a more complex nature, involving biological processes, mechanisms, methods, diseases, viruses, genes, enzymes, and other entities.
A theory of discovery as a combinatorial or bisociation process, as envisioned by Arthur Koestler in 1964, will require a more expanded definition of combining units. One way this might be investigated is to examine the references cited by discovery articles, and by comparing these with a cluster analysis of those references, seeing if different topics were drawn upon by the discoveries.

5. Conclusions

We have explored the use of citation contexts, or more precisely citances, in the identification of discoveries in biomedical science using full text from the PMC database. While we have found that the appearance of “discovery words” is not a reliable indicator of whether a given citance describes a discovery (the success rate is about 46%), it is relatively easy to differentiate these manually by inspecting a sample of citances. Because the majority of false hits are methodologies and tools for making discoveries, rather than actual scientific discoveries, the vocabulary used in citances in such cases has been shown to be highly instrumental, and to differ markedly from the vocabulary used when actual discoveries are cited. Hence, machine learning based on citance vocabulary has proved to be an effective tool for classifying articles into the two sets, achieving an accuracy of 94% on unseen data. This opens the door to a much broader discovery identification task using machine learning that could possibly expand the list to 1000 discoveries using PMC data.

Finally, other tropes or common themes in science are likely to be amenable to the type of analysis outlined in this paper. General scientific words associated with discoveries such as timing words (“first” and “recently”), value judgements such as “important”, or outcome or evidential words such as “shown”, “demonstrated” and “found” could themselves be explored by retrieving sentences, manual classification, and application of machine learning to differentiate usage and automate the process.
One direction for future research would be to explore whether outcome words can be correlated with the cognitive certainty or degree of corroboration of scientific findings, and the types and strength of evidence that are brought to bear. This sort of research agenda might take us closer to a literature based computational approach to the “context of justification”.

Discovery in science is the critical engine that drives the acquisition of new knowledge, combining the creative efforts of individual researchers and the importance of the scientific community in validating that knowledge. A systematic framework for identifying discoveries, such as the one we describe here, can contribute to the deeper study of the psychological process of making discoveries as well as the sociological process of bestowing recognition, and open these up to greater public and policy scrutiny. Also at stake is the question of how discoveries are justified and how the community is convinced to grant recognition, unless we are to assume that this is merely a matter of rhetorical persuasion.

Author contributions

Henry Small: Conceived and designed the analysis, collected the data, contributed data or analysis tools, performed the analysis, wrote the paper.
Hung Tseng: Collected the data, performed the analysis.
Mike Patek: Collected the data, contributed data or analysis tools, performed the analysis.

Acknowledgements

We would like to thank Richard Klavans and Kevin Boyack for valuable discussions, and the reviewers for suggesting additional analyses.
Appendix. − List of Discoveries. This Appendix lists the 128 articles having 20 or more discovery citances and classified as discoveries. The number of discovery citances is given under “Disc Citances”, and the percentage of these of the total number of citances is in the “% Disc Citances” column. The publication year of the article is under “Pub Year”, and a short description is given under “Discovery”. The first author of the article is given under “First Author”, followed by the Title in the next column. The final column, “Type” is the classification of the article as “violation”, “innovation” and “extension” abbreviated as “vio”, “inn” and “ext”. ID
Disc Citances
% Disc Citances
Pub Year
Discovery
First Author
Title
Type
1
436
38.0
1993
miRNA lin4
Lee, RC
vio
2
217
22.0
1998
RNA interference
Fire, A
3
141
22.7
2000
miRNA let7
Reinhart, BJ
4
123
5.6
2006
Takahashi, K
5
90
25.1
1993
induced pluripotent stem cells miRNA regulation of lin-14 by lin-4
6
77
13.4
1994
leptin
Zhang, Y
7
74
4.0
2007
Takahashi, K
8
73
17.1
2008
iPSC in human somatic cells polyomavirus
9
72
13.7
2009
epigenetic Tet family proteins
Tahiliani, M
10
60
20.7
2005
human bocavirus
Allander, T
11
59
7.0
2001
Elbashir, SM
12
57
18.8
2009
mammalian RNA interference epigenetic Tet family in brain
13
57
11.4
2001
14
54
30.7
15
49
16
The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14 Potent and specific genetic interference by double stranded RNA in Caenorhabditis elegans The 21-nucleotide let-7 RNA regulates developmental timing in Caenorhabditis elegans Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors Posttranscriptional regulation of the heterochronic gene lin-14 by lin-4 mediates temporal pattern formation in C. elegans Positional cloning of the mouse obese gene and its human homologue Induction of pluripotent stem cells from adult human fibroblasts by defined factors Clonal integration of a polyomavirus in human Merkel cell carcinoma Conversion of 5-methylcytosine to 5-hydroxymethylcytosine in mammalian DNA by MLL partner TET1 Cloning of a human parvovirus by molecular screening of respiratory tract samples Duplexes of 21-nucleotide RNAs mediate RNA interference in cultured mammalian cells The nuclear DNA base 5-hydroxymethylcytosine is present in Purkinje neurons and the brain Identification of novel genes coding for small expressed RNAs Understanding motor events: a neurophysiological study Ghrelin is a growth-hormone-releasing acylated peptide from stomach Action recognition in the premotor cortex
Wightman, B
Feng, H
Kriaucionis, S
Lagos-Quintana, M
1992
novel genes coding for miRNAs mirror neurons
di Pellegrino, G
8.7
1999
hormone ghrelin
Kojima, M
47
16.3
1996
Gallese, V
17 18
45 45
27.1 10.5
2004 2001
19
45
20.8
1953
mirror neurons confirmation human coronavirus abundance of miRNAs in C. elegans double helix
20
44
17.8
2000
evolutionary conserved miRNAs
Pasquinelli, AE
21
43
4.5
2007
FTO obesity gene
Frayling, TM
22
42
5.2
2004
Lynch, TJ
23
42
3.9
2007
24
42
14.1
2004
25
42
14.8
2001
26
42
9.9
2007
EGFR mutations and lung cancer treatment human iPSC confirmation first histone demethylase LSD1 miRNA and RNA sequencing EML4-ALK gene fusion in cancer
inn ext
inn
ext
vio ext inn inn
ext inn inn
ext inn inn ext
van der Hoek, L Lau, NC
Identification of a new human coronavirus An abundant class of tiny RNAs with probable regulatory roles in Caenorhabditis elegans
ext ext
Watson, JD
Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid Conservation of the sequence and temporal expression of let-7 heterochronic regulatory RNA A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity Activating mutations in the epidermal growth factor receptor underlying responsiveness of non-small-cell lung cancer to gefitinib Induced pluripotent stem cell lines derived from human somatic cells Histone demethylation mediated by the nuclear amine oxidase homolog LSD1 An extensive class of small RNAs in Caenorhabditis elegans Identification of the transforming EML4-ALK fusion gene in non-small-cell lung cancer
vio
Yu, J Shi, Y Lee, RC Soda, M
ext
inn
inn
inn inn ext inn
human exomes sequence and Miller’s syndrome Kaposi’s sarcoma-associated herpes virus TLR and human immune response TMPRSS-ETS gene fusion and prostate cancer APOBEC3G gene inhibits HIV-1
Ng, SB
Exome sequencing identifies the cause of a mendelian disorder
inn
Chang, Y
Identification of herpesvirus-like DNA sequences in AIDS-associated Kaposi’s sarcoma
inn
Medzhitov, R
A human homologue of the Drosophila Toll protein signals activation of adaptive immunity Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer
ext
Isolation of a human gene that inhibits HIV-1 infection and is suppressed by the viral Vif protein. Isolation of putative progenitor endothelial cells for angiogenesis
inn
27
41
12.5
2010
28
40
15.3
1994
29
40
12.0
1997
30
40
8.4
2005
31
40
9.1
2002
32
38
5.7
1997
33
38
31.9
1997
34
38
21.6
1980
35
38
14.4
1996
36
37
19.0
1989
hepatitis C virus
Choo, QL
37
36
24.0
2007
Allander, T
38
36
25.2
2000
respiratory polyomavirus bacterial rhodopsin
39
36
5.1
2004
40
36
3.9
2007
41
36
2.0
2000
42
35
10.9
2006
43
35
25.0
1996
44
35
13.9
2003
45
35
21.3
2009
46
34
7.4
2005
47
33
5.0
2010
48
33
15.2
2003
49
33
5.0
2009
large non-coding RNAs
Guttman, M
50
33
8.5
2011
Mills, RE
51
32
3.2
2006
52
32
6.0
2011
copy number variations copy number variations genetic mutation in dementia
53
32
15.8
2005
54
31
5.4
2006
endothelial progenitor cells and vascular regeneration cell free fetal DNA HTLV-1/HTLV-2 human retroviruses immune function for Drosophila Toll
Tomlins, SA
Sheehy, AM
Asahara, T
Lo, YM Poiesz, BJ
Lemaitre, B
Beja, O
microbial diversity in seawater exosomal shuttle RNA
Venter, JC
breast cancer molecular subtypes new human retrovirus (retracted) mirror neurons confirmation new interleukins
Perou, CM
gene fusions in cancer grid cells in the cortex copy number variation types new interleukins
Maher, CA
ammonia-oxidizing Archaea TDP-43 in amyotrophic lateral sclerosis
Valadi, H
Urisman, A
Rizzolatti, G Kotenko, SV
Hafting, T Conrad, DF Sheppard, P
Redon, R DeJesus-Hernandez, M
Konneke, M
Neumann, M
Presence of fetal DNA in maternal plasma and serum Detection and isolation of type C retrovirus particles from fresh and cultured lymphocytes of a patient with cutaneous T-cell lymphoma. The dorsoventral regulatory gene cassette spatzle/Toll/cactus controls the potent antifungal response in Drosophila adults. Isolation of a cDNA clone derived from a blood-borne non-A, non-B viral hepatitis genome Identification of a third human polyomavirus Bacterial rhodopsin: evidence for a new type of phototrophy in the sea Environmental genome shotgun sequencing of the Sargasso Sea Exosome-mediated transfer of mRNAs and microRNAs is a novel mechanism of genetic exchange between cells Molecular portraits of human breast tumours Identification of a novel Gammaretrovirus in prostate tumors of patients homozygous for R462Q RNASEL variant Premotor cortex and the recognition of motor actions IFN-lambdas mediate antiviral protection through a distinct class II cytokine receptor complex Transcriptome sequencing to detect gene fusions in cancer Microstructure of a spatial map in the entorhinal cortex Origins and functional impact of copy number variation in the human genome IL-28, IL-29 and their class II cytokine receptor IL-28R Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals Mapping copy number variation by population-scale genome sequencing. Global variation in copy number in the human genome. Expanded GGGGCC hexanucleotide repeat in noncoding region of C9ORF72 causes chromosome 9p-linked FTD and ALS Isolation of an autotrophic ammonia-oxidizing marine archaeon Ubiquitinated TDP-43 in frontotemporal lobar degeneration and amyotrophic lateral sclerosis
ext
vio
ext ext
ext
inn
ext inn ext inn
ext inn
ext inn
ext vio ext ext ext
ext ext inn
vio
inn
55
31
19.4
1984
56
31
1.1
2001
57
30
13.6
2005
58
30
12.6
1971
59
30
18.1
1965
60
30
21.1
1983
61
30
7.2
1997
62
30
5.3
2004
63
30
19.7
2007
64
30
4.6
65
29
66
H. pylori and peptic ulcer disease human genome sequence mutation in myeloproliferative disorders hippocampal place cells
Marshall, BJ Lander, ES James, C
O’Keefe, J
bone morphogenetic proteins HIV causes AIDS
Urist, MR
Polymeropoulos, MH
2004
mutation in Parkinson’s disease copy number variations polyomavirus respiratory virus EGFR mutations
29.6
1973
dendritic cells
Steinman, RM
29
12.8
1990
Issemann, I
67
29
4.6
2007
68
29
14.6
2004
69
28
10.8
2001
70
28
3.2
2002
71
28
36.4
1982
Peroxisome proliferator-activated receptors TCF7L2 gene and type-2 diabetes autoantibodies in neuromyelitis optica Human metapneumovirus respiratory virus mutation in melanoma ribozymes and self-splicing RNA
72
28
12.3
2004
73
27
10.8
2004
74
27
14.3
1998
75
27
2.6
2008
76
27
6.8
2011
77
27
5.7
2012
78
27
11.1
1980
79
26
7.7
2006
80
26
24.5
1983
81
26
19.8
2005
82
26
41.9
1950
Barre-Sinoussi, F
Iafrate, AJ Gaynor, AM Paez, JG
Saxena, R Lennon, VA
van den Hoogen, BG
Davies, H Kruger, K
giant DNA mimivirus virus-encoded miRNAs in Epstein-Barr Virus orexins in sleep/wake state miRNAs as biomarkers mutation in dementia and ALS
Raoult, D
irisin and protection from metabolic disease nitric oxide as a relaxing factor
Bostrom, P
TCF7L2 polymorphism and type-2 diabetes ribozymes and catalysis HTLV-3/HTLV-4 retroviruses in Africa jumping genes
Grant, SF
Pfeffer, S
de Lecea, L Mitchell, PS Renton, AE
Furchgott, RF
Guerrier-Takada, C Wolfe, ND
McClintock, B
Unidentified curved bacilli in the stomach of patients with gastritis and peptic ulceration Initial sequencing and analysis of the human genome A unique clonal JAK2 mutation leading to constitutive signalling causes polycythaemia vera The hippocampus as a spatial map: Preliminary evidence from unit activity in the freely-moving rat Bone: formation by autoinduction
Isolation of a T-lymphotropic retrovirus from a patient at risk for acquired immune deficiency syndrome (AIDS) Mutation in the alpha-synuclein gene identified in families with Parkinson’s disease Detection of large-scale variation in the human genome Identification of a novel polyoma virus from patients with acute respiratory tract infections EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy Identification of a novel cell type in peripheral lymphoid organs of mice. I. Morphology, quantitation, tissue distribution. Activation of a member of the steroid hormone receptor superfamily by peroxisome proliferators
vio ext inn
inn
vio
inn
inn inn ext inn inn
inn
Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels A serum autoantibody marker of neuromyelitis optica: Distinction from multiple sclerosis
ext
A newly discovered human pneumovirus isolated from young children with respiratory tract disease. Mutations of the BRAF gene in human cancer
ext
Self-splicing RNA: autoexcision and autocyclization of the ribosomal RNA intervening sequence of Tetrahymena The 1.2-megabase genome sequence of Mimivirus. Identification of virus-encoded microRNAs
The hypocretins: hypothalamus-specific peptides with neuroexcitatory activity Circulating microRNAs as stable blood-based markers for cancer detection A hexanucleotide repeat expansion in C9ORF72 is the cause of chromosome 9p21-linked ALS-FTD A PGC1-alpha-dependent myokine that drives brown-fat-like development of white fat and thermogenesis The obligatory role of endothelial cells in the relaxation of arterial smooth muscles by acetylcholine Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. The RNA moiety of ribonuclease P is the catalytic subunit of the enzyme Emergence of unique primate T-lymphotropic viruses among central African bushmeat hunters The origin and behavior of mutable loci in maize
inn
ext vio
vio ext
inn inn inn
inn
vio
ext
vio inn
inn
Ito, S
83
26
9.9
2011
84
26
4.8
2007
85
25
22.7
2005
86
25
10.6
1973
synaptic plasticity in memory
Bliss, TV
87
25
14.0
1985
telomerase enzyme
Greider, CW
88
24
41.4
1984
Torres, CR
89
24
5.6
1990
90
24
17.0
1996
91
24
5.0
2004
92
23
41.1
1964
93
23
4.6
2007
94
23
23.0
2003
95
23
5.0
2005
96
23
4.9
1998
97 98
23 23
2.9 11.3
1956 2005
99
23
9.6
1991
protein-saccharide linkage on lymphocyte proteins aptamers that bind target molecules second estrogen receptor ultraconserved DNA sequences Epstein-Barr virus in Burkitt’s lymphoma cells TCF7L2 variation and type-2 diabetes Mimivirus in amoebae Interleukin 17-producing T helper cells Toll-like receptor sensors for bacterial lipopolysaccharide Warburg effect mutation in human myeloproliferative disorders odorant receptors
100
22
6.5
2001
101
22
4.4
1990
102
22
3.4
2004
103
22
4.0
2007
104
22
4.1
2007
105
22
6.3
1998
106
21
5.0
1992
107
21
5.8
2004
108
21
4.9
2005
epigenetic Tet family proteins type-2 diabetes genome wide association study coronavirus respiratory virus
adipocyte-secreted factor and insulin resistance aptamers bind to novel ligands extracellular traps (NETs) genome wide association study for type-2 diabetes CRISPR and immunity orexins regulate feeding behavior
Sladek, R
Woo, PC
Ellington, AD Kuiper, GGJM Bejerano, G Epstein, MA
Zeggini, E
La Scola, B Harrington, LE
Poltorak, A
Tet proteins can convert 5-methylcytosine to 5-formylcytosine and 5-carboxylcytosine A genome-wide association study identifies novel risk loci for type 2 diabetes
inn ext
Characterization and complete genome sequence of a novel coronavirus, coronavirus HKU1, from patients with pneumonia Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path Identification of a specific telomere terminal transferase activity in Tetrahymena extracts Topography and polypeptide distribution of terminal N-acetylglucosamine residues on the surfaces of intact lymphocytes: evidence for O-linked GlcNAc In vitro selection of RNA molecules that bind specific ligands Cloning of a novel estrogen receptor expressed in rat prostate and ovary Ultraconserved elements in the human genome Virus particles in cultured lymphoblasts from Burkitt’s lymphoma.
ext
Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes A giant virus in amoebae
ext
Interleukin 17-producing CD4+ effector T cells develop via a lineage distinct from the T helper type 1 and 2 lineages Defective LPS signaling in C3H/HeJ and C57BL/10ScCr mice: mutations in Tlr gene.
inn
inn vio
inn inn ext inn
ext vio
inn
Warburg, O Baxter, EJ
On the origin of cancer cells Acquired mutation of the tyrosine kinase JAK2 in human myeloproliferative disorders
inn inn
Buck, L
A novel multigene family may encode odorant receptors: a molecular basis for odor recognition The hormone resistin links obesity to diabetes
inn
Steppan, CM
Tuerk, C
Brinkmann, V Scott, LJ
Barrangou, R Sakurai, T
neural stem cells and neuron generation epidermal growth factor and trastuzumab
Reynolds, BA
Interleukin 17-producing T helper cells
Park, H
Pao, W
Systemic evolution of ligands by exponential enrichment: RNA ligands to bacteriophage T4 DNA polymerase Neutrophil extracellular traps kill bacteria A Genome-Wide Association Study of type 2 diabetes in Finns detects multiple susceptibility variants CRISPR provides acquired resistance against viruses in prokaryotes Orexins and orexin receptors: a family of hypothalamic neuropeptides and G protein-coupled receptors that regulate feeding behavior Generation of neurons and astrocytes from isolated cells of the adult mammalian central nervous system EGF receptor gene mutations are common in lung cancers from never smokers and are associated with sensitivity of tumors to gefitinib and erlotinib A distinct lineage of CD4 T cells regulates tissue inflammation by producing interleukin 17.
inn
inn
inn ext
inn inn
inn
ext
inn
HIV-1 transactivating protein Tat mutation in myeloproliferative disorders PGC-1 coactivator regulating thermogenesis Cancer stem cells
Frankel, AD
Cellular uptake of the tat protein from human immunodeficiency virus
inn
Kralovics, R
A gain-of-function mutation of JAK2 in myeloproliferative disorders
inn
Puigserver, P
A cold-inducible coactivator of nuclear receptors linked to adaptive thermogenesis
inn
Bonnet, D
vio
mutations in Crohn’s disease tau gene and familial neurodegenerative disorder oncogene Wnt in breast cancer
Hugot, JP
Acute myeloid leukemia is organized as a hierarchy that originates from a primitive hematopoietic cell Association of NOD2 leucine-rich repeat variants with susceptibility to Crohn’s disease Association of missense and 5′-splice-site mutations in tau with the inherited dementia FTDP-17
109
21
20.4
1988
110
21
10.8
2005
111
21
7.2
1998
112
21
4.5
1997
113
21
6.3
2001
114
21
9.3
1998
115
20
27.8
1982
116
20
23.8
2005
ammonia-oxidizing archaea
Treusch, AH
117
20
45.5
1977
Chow, LT
118
20
16.7
1992
119
20
43.5
1977
interrupted genes in Adenovirus 2 mRNA cannabinoid receptors pre-mRNA splicing
120
20
7.8
2011
121
20
34.5
1988
122
20
11.2
2011
123
20
3.4
2008
miRNAs in bodily fluids for diagnosis
Chen, X
124
20
4.1
2002
Calin, GA
125
20
13.9
2005
126
20
12.3
1987
127
20
1.9
1998
128
20
5.7
2004
miRNA in chronic lymphocytic leukemia neuromyelitis optica-IgG binds to aquaporin-4 water channel nitric oxide as endothelium-derived relaxing factor embryonic stem cells in therapy copy number variation in evolution
epigenetic Tet family proteins HIV-1 transactivating protein Tat mitochondrial calcium uniporter
Hutton, M
Nusse, R
Devane, WA Berget, SM He, YF Green, M
Baughman, JM
Lennon, VA
inn inn
Many tumors induced by the mouse mammary tumor virus contain a provirus integrated in the same region of the host genome Novel genes for nitrite reductase and Amo-related proteins indicate a role of uncultivated mesophilic crenarchaeota in nitrogen cycling. An amazing sequence arrangement at the 5′ ends of adenovirus 2 messenger RNA
inn
Isolation and structure of a brain constituent that binds to the cannabinoid receptor Spliced segments at the 5′ terminus of adenovirus 2 late mRNA Tet-mediated formation of 5-carboxylcytosine and its excision by TDG in mammalian DNA Autonomous functional domains of chemically synthesized human immunodeficiency virus tat trans-activator protein Integrative genomics identifies MCU as an essential component of the mitochondrial calcium uniporter Characterization of microRNAs in serum: a novel class of biomarkers for diagnosis of cancer and other diseases Frequent deletions and down-regulation of micro-RNA genes miR15 and miR16 at 13q14 in chronic lymphocytic leukemia IgG marker of optic-spinal multiple sclerosis binds to the aquaporin-4 water channel
ext
inn
vio
inn inn inn
inn
inn
inn
inn
Palmer, RM
Nitric oxide release accounts for the biological activity of endothelium-derived relaxing factor
ext
Thomson, JA
Embryonic stem cell lines derived from human blastocysts Large-Scale Copy Number Polymorphisms in the Human Genome.
inn
Sebat, J
inn