Computers and Chemistry 26 (2002) 511–519 www.elsevier.com/locate/compchem
Deciphering Arabidopsis thaliana gene neighborhoods through bibliographic co-citations

A. Louis a,*, H. Chiapello b, C. Fabry a,c, E. Ollivier a, A. Hénaut a

a Laboratoire Génome et Informatique, Tour Évry 2, 523 place des Terrasses, 91034 Évry Cedex, France
b Laboratoire MIG, INRA, 78026 Versailles Cedex, France
c Institut Pasteur, Unité Génétique des Génomes Bactériens, 28 rue du Docteur Roux, 75724 Paris Cedex 15, France
Received 25 May 2001; received in revised form 02 November 2001; accepted 20 November 2001
Abstract

In the framework of genome annotation, the scientific literature is the major source of biological knowledge. The aim of the work described in this paper is to exploit this source of data for the model plant Arabidopsis thaliana. The first step consisted in assembling a relevant dataset of bibliographic references for plant genomic research. Gene co-citations were then systematically annotated in this reference dataset, starting from the simple idea that if genes are cited in the same publication, they probably share some related functional properties. To deal with the problem of synonymous gene names, a gene name reference list was built starting from A. thaliana SwissProt entries. This list was used to build clusters of co-cited genes by a single linkage procedure such that any gene in a given cluster possesses at least one co-cited partner in the same cluster. Analysis of the clusters demonstrates the biological consistency of this approach, with only very few fortuitous links. As an example, a cluster including genes related to flowering time is described in more detail in the paper. Finally, a graphical representation of each cluster was produced, which provides a convenient way to retrieve the genes (the nodes of the graphs) and the references in which they were co-cited (the edges of the graphs). All the results can be accessed at the URL http://chlora.lgi.infobiogen.fr:1234/bib – arath/. © 2002 Elsevier Science Ltd. All rights reserved.

Keywords: Information retrieval; Gene names; Single linkage procedure; Representation of knowledge
1. Introduction

In December 2000, the first plant genome sequence was published: the complete DNA sequence of the small crucifer Arabidopsis thaliana (The Arabidopsis Genome Initiative, 2000). Despite this important achievement, the annotation of the A. thaliana genome and the exploitation of the results derived from it remain incomplete. In the framework of genome annotation, bibliographic data are the primary source of biological knowledge. Progress in genomic research (transcriptome, proteome, metabolome, two-hybrid, …) leads to an accumulation of biological results in the literature, and bibliographic information is growing at an exhausting pace. Efficient methods are needed to access this information. The NCBI has already provided one with the PubMed-ENTREZ retrieval system, an efficient tool linking literature and sequence databases (Wheeler et al., 2001). However, the PubMed database was initially developed in the framework of the human genome project and biomedical research. Consequently, bibliographic knowledge in plant biology is not systematically present in PubMed. This can have serious consequences, since PubMed is the bibliographic database most commonly used by biologists, even in the plant community. A first aim of our work was therefore to identify the pertinent sources of bibliographic information in the framework of plant genomic research.

Genome projects make it possible to consider the genes of an organism as a whole. This is the major consequence of this new type of study, and it changes the way biological objects are dealt with. In particular, the biological function of a gene can be described more precisely through the set of properties this gene shares with other genes: we call this concept ‘neighborhood’. Many kinds of neighborhood can be considered between two genes: similarity of their protein sequences (Louis et al., 2001), similarity of their codon usage, physical proximity on the chromosome, involvement of their corresponding proteins in a common metabolic pathway, or similar expression profiles. This definition of neighborhood makes it possible to build families of genes and constitutes a powerful tool for analyzing their biological function through the discovery of unexpected kinship. This concept has been described and included in the Indigo database (Nitschke et al., 1998). The bibliographic neighborhood has been efficiently implemented in the Entrez retrieval system which, given a reference, finds the ‘related papers’ according to the occurrence of common words (Wilbur and Coffee, 1994; Schuler et al., 1996).

In a similar way, we consider in this paper the co-citations of genes in the scientific plant literature, starting from the idea that if two genes are cited in the same bibliographic reference, then they probably share some biological properties (sequence similarity, involvement in the same metabolic pathway, interaction, …). Obviously, this is not always true. However, this initial assumption makes it possible to develop a new method, one that takes into account the huge source of knowledge that is the scientific literature, to help annotate the A. thaliana genome. As shown in this paper, the A. thaliana gene co-citations are indeed of biological relevance, and indirect relations between genes may make it possible to infer functional hypotheses.

A major problem we faced is plant gene nomenclature. Despite many efforts to define standards (CPGN¹, TAIR²; Meinke and Koornneef, 1997), it is not yet possible to define lexical rules to extract plant gene names from bibliographic references. Generally, a gene is cited in the literature under several synonymous names. Moreover, most of the main bibliographic databases (MEDLINE, BIOSIS, CURRENT CONTENTS, …) do not index gene names in their references. Thus, in order to identify the different nomenclatures used in plant gene citations, a manual extraction of gene names was first performed. This constituted a first step in a more ambitious project, which is to develop automatic tools to identify gene names in texts written in natural language.

As expected, it proved essential to supplement the pool of (free) PubMed references with (paid) BIOSIS references. This made it possible to assemble a relevant dataset for A. thaliana gene name extraction. Then, the references in which at least two different A. thaliana gene names were cited were manually selected. The genes cited in this dataset were subsequently compared to a reference gene list to take possible synonyms into account. Finally, clusters of co-cited genes were built by a single linkage procedure such that any gene in a given cluster possesses at least one co-cited partner in the same cluster, ending with a graphical representation of each cluster. Results can be retrieved from http://chlora.lgi.infobiogen.fr:1234/bib – arath/.

* Corresponding author. Tel.: +33-1-6087-3713; fax: +33-1-6087-3799. E-mail address: [email protected] (A. Louis).
¹ http://mbclserver.rutgers.edu/CPGN/Guide.html.
² http://www.arabidopsis.org/info/guidelines.html.

0097-8485/02/$ - see front matter © 2002 Elsevier Science Ltd. All rights reserved. PII: S0097-8485(02)00011-6
A. Louis et al. / Computers & Chemistry 26 (2002) 511–519
2. Materials and methods

The main hypothesis of this work is that if two gene names are co-cited in the same document or abstract, then they are likely to share a related biological function (whatever the kind of relationship between them). To check this hypothesis, it is necessary to build a complete dataset of documents pertaining to the genomic study of A. thaliana.
2.1. Identification of the relevant source of bibliographic documents

Free access to the PubMed bibliographic database over the World Wide Web makes this system the main bibliographic tool for biomolecular researchers. However, we have to keep in mind that this databank was initially developed in the framework of the human genome and biomedical research. Thus, plant bibliographic knowledge is not exhaustively present in this system. In a previous work (unpublished), we indeed found that PubMed has severe gaps in the indexing of papers in plant biology. According to the results described in Table 1, we decided that the BIOSIS bibliographic database was the most relevant supplement to PubMed. A set of 3352 PubMed documents containing the word ‘arabidopsis’ in the title or abstract and published before 1999 constituted the first document collection for the following work. A set of 7000 references (also published before 1999) was retrieved from the BIOSIS database, and the redundancy with the PubMed dataset was removed. A final set of 2025 references containing the term ‘arabidopsis’ in the title or in the abstract was then kept and submitted to the gene name retrieval process.
Table 1 Comparative study of bibliographic databases pertaining to A. thaliana

| Database | Number of references containing ‘arabidopsis’ in fields ti or de before July 1997 | ≤1990 | 1991 | 1992 | 1993 | 1994 | 1995 | 1996 | First semester 1997 |
|---|---|---|---|---|---|---|---|---|---|
| SciSearch | 3245 | 493 | 168 | 271 | 426 | 448 | 601 | 624 | 214 |
| CAB ABSTRACTS | 3140 | 414 | 236 | 343 | 413 | 500 | 577 | 569 | 90 |
| AGRICOLA | 2968 | 109 | 185 | 292 | 331 | 425 | 394 | 421 | 28 |
| PASCAL | 2302 | 607 | 171 | 293 | 354 | 312 | 244 | 259 | 62 |
| MEDLINE | 2015 | 154 | 74 | 134 | 247 | 334 | 423 | 510 | 139 |
| BIOSIS previews | 6328 | 1694 | 430 | 483 | 617 | 826 | 1006 | 1051 | 234 |
The DIALOG information retrieval system (http://www.dialog.com) allows the same query to be run on multiple databases. The term ‘ARABIDOPSIS’ was searched in the title (ti) and description (de) fields. The number of references matching this query is reported for each year, from the creation of the database to mid-1997. As shown by the totals (second column), the BIOSIS database seems to be the most complete for knowledge about this crucifer.
2.2. Retrieval of the A. thaliana gene names from the dataset

For each BIOSIS and PubMed reference, several fields were added containing the gene names (GE), the protein names (PE) and the mutant names (MU) as found in the title and abstract of the bibliographic reference. In the work reported here, we focused on the occurrence of gene names and used only the GE field (see Fig. 1).

Since the retrieval of specific terms in a bibliographic reference is far from trivial, we resorted to a manual extraction procedure. First, the citations provided by authors in an article are in natural language and not standardized. Although a nomenclature database exists for sequenced plant genes (Price and Reardon, 2001), it is too recent to be widely used. Moreover, for A. thaliana there are no lexical rules such as those used in the bacterial field (three lower-case letters followed by one upper-case letter, e.g. argC) that could be used for automatic retrieval. Here are some examples of the lexical ambiguities we have to face: (i) names referring to the same gene can differ in case, hyphenation or spacing, e.g. Apetala1, apetala-1, APETALA 1, AP1, Ap-1, …; (ii) part of the name can belong to natural English, e.g. ‘transparent testa glabra’, ‘unusual floral organs’, …

Moreover, gene synonyms are common. To complicate the retrieval process a little more, gene names can be found in abstracts under three categories: complete gene names, acronyms or symbols, and synonyms (Table 2). Finally, members of a multigene family are generally not cited individually but, more often, in compacted forms. For example, the occurrence of PHY(A–E) means the citation of PHYA, PHYB, PHYC, PHYD and PHYE. In the same way, ‘APR1,-2,-3’ refers to the genes APR1, APR2 and APR3. Unfortunately, the hyphen does not always indicate the occurrence of different genes belonging to the same family. As an example, the sentence ‘Previously, we described the isolation and characterization of two three-member gene families, designated AtUBC1–3 and AtUBC4–6, encoding two of these E2 types’ refers to six different genes (AtUBC1, AtUBC2, AtUBC3, AtUBC4, AtUBC5, AtUBC6), but ‘Expression of the Arabidopsis AtAux2-11 auxin-responsive gene in transgenic plants’ refers to only one (AtAux2-11). These observations also hold for gene products, with the addition that a given entity may there be referenced through very different types of constructions.

In order to deal with the problem of synonymous gene names and to study the relevance of our hypothesis, we built a gene reference list containing a sample of 1000 A. thaliana gene names. This list was drawn up from the A. thaliana sequences available in SWISSPROT in 1999. Common synonyms were added to this list during the manual gene name retrieval step. For the clustering step, our document datasets were cleaned up to keep only the most commonly used gene name: the bibliographic references were filtered, and the gene names originally quoted were replaced by generic names (generally extracted from the GN field of the SWISSPROT file). We call this gene name standardization step ‘FILTER’. The genes that were not in our local database and for which we had no other kind of information were removed, in order to allow a biological study of the results (see Fig. 1). The gene names in our reference datasets were thus homogenized. We finally kept a set of 295 PubMed references containing a total of 495 different genes, such that in a given PubMed document at least two genes were cited in the title or abstract (Fig. 2). For the BIOSIS dataset, 313 references containing 274 genes were kept.
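The compacted family notations discussed in this section can be expanded mechanically only in their unambiguous forms. The following Python sketch is an illustration and not part of the manual procedure used in this work: the two regular expressions and the function name `expand_compact` are assumptions chosen to cover the two quoted examples, PHY(A–E) and ‘APR1,-2,-3’. Hyphenated forms such as AtUBC1–3 or AtAux2-11 are deliberately returned unchanged, since the text shows that a hyphen alone cannot tell a range from a single gene name.

```python
import re

# Hypothetical patterns covering only the two compact notations quoted in the text.
LETTER_RANGE = re.compile(r"^([A-Za-z]+)\(([A-Z])\s*[-–]\s*([A-Z])\)$")  # e.g. PHY(A–E)
NUMBER_LIST = re.compile(r"^([A-Za-z]+)(\d+)((?:,\s*-\d+)+)$")           # e.g. APR1,-2,-3


def expand_compact(mention):
    """Expand an unambiguous compact mention into individual gene names.

    Any other mention (including hyphenated ones, which may or may not be
    ranges) is returned unchanged as a one-element list for manual review.
    """
    match = LETTER_RANGE.match(mention)
    if match:
        stem, first, last = match.groups()
        return [stem + chr(code) for code in range(ord(first), ord(last) + 1)]
    match = NUMBER_LIST.match(mention)
    if match:
        stem, first, rest = match.groups()
        return [stem + first] + [stem + n for n in re.findall(r"-(\d+)", rest)]
    return [mention]


print(expand_compact("PHY(A–E)"))
print(expand_compact("APR1,-2,-3"))
print(expand_compact("AtAux2-11"))
```

Leaving ambiguous mentions untouched keeps the sketch conservative: it never invents a gene name, at the cost of deferring those cases to a curator.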
Fig. 1. Retrieval of the A. thaliana gene names from two PubMed references (truncated here). In the first step, the names are retrieved (field GE) as they appear in the title (field TI) or in the abstract (field AB). In the second step, they are standardized by comparison with the local database of gene names and synonyms; generally, the gene names are replaced by the acronym found in the SWISSPROT databank (FILTER). In this example, both references refer to the gene ‘AP1’, each with a different spelling (APETALA1 and APETALA 1). The FILTER step makes it possible to index the two references under the same gene name (AP1, the gene name defined in SWISSPROT). The PMID provides a link to PubMed.
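The FILTER standardization step can be pictured as a dictionary lookup. The Python sketch below is an assumption about how such a step might be coded, not the procedure actually used: the names `REFERENCE_LIST`, `filter_reference` and `keep_for_clustering` are hypothetical, the two entries are a small excerpt taken from Table 2 and Fig. 1, and the choice to ignore case, blanks and hyphens when comparing spellings is a simplification that could merge distinct names in a larger list.

```python
import re

# Illustrative excerpt of a reference list: canonical name -> known synonyms.
# Entries come from Table 2 and Fig. 1; the data structure itself is an assumption.
REFERENCE_LIST = {
    "AP1": ["APETALA1", "Apetala 1", "Apetala-1", "AP 1", "AGL7", "ATAP1"],
    "PHYA": ["Phytochrome A", "FHY2", "HY8", "FRE1"],
}


def _key(name):
    """Comparison key that ignores case, blanks and hyphens."""
    return re.sub(r"[\s\-]+", "", name).lower()


# Map every known spelling (canonical names included) to its canonical name.
LOOKUP = {}
for canonical, synonyms in REFERENCE_LIST.items():
    for spelling in [canonical] + synonyms:
        LOOKUP[_key(spelling)] = canonical


def filter_reference(ge_field):
    """Replace extracted names by canonical names; drop names not in the list."""
    genes = {LOOKUP[_key(name)] for name in ge_field if _key(name) in LOOKUP}
    return sorted(genes)


def keep_for_clustering(ge_field):
    """A reference is kept only if at least two distinct genes remain."""
    return len(filter_reference(ge_field)) >= 2


print(filter_reference(["APETALA1", "APETALA 1"]))
print(keep_for_clustering(["Ap-1", "phytochrome A"]))
```

With this excerpt, the two spellings shown in Fig. 1 both map to AP1, so a reference citing only them would count as citing a single gene and would not be kept for clustering.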
2.3. Clustering of the gene names

We consider the co-citation of gene names in a bibliographic document as likely to be significant. Furthermore, if a given gene A is related to another gene B, and if gene B is related to yet another gene C, it is possible that the three genes have some related function even if A does not co-occur with C. In order to test this hypothesis, we built clusters of co-cited genes by a single linkage procedure such that any gene in a given cluster possesses at least one co-cited partner in the same cluster. In that way, we can gather genes together even if they are not explicitly co-cited in a reference. A Java program was developed to build the clusters from the dataset of gene occurrences in references. This program creates the files needed for the graphical representation.
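The single linkage procedure amounts to finding the connected components of the co-citation graph. The original program was written in Java and is not reproduced here; the Python sketch below is only an illustration of the principle, using a union-find structure, with hypothetical reference identifiers and the abstract gene names A, B and C from the paragraph above.

```python
from collections import defaultdict


def single_linkage_clusters(references):
    """Group genes so that each gene has at least one co-cited partner
    in its cluster.

    references: dict mapping a reference id to the set of gene names it cites.
    Returns a list of clusters (sorted lists of gene names), largest first.
    """
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Genes cited in the same reference are linked to one another.
    for genes in references.values():
        genes = sorted(genes)
        for other in genes[1:]:
            union(genes[0], other)

    clusters = defaultdict(set)
    for gene in parent:
        clusters[find(gene)].add(gene)
    # Keep clusters of at least two genes, as in the results reported.
    kept = [sorted(c) for c in clusters.values() if len(c) >= 2]
    return sorted(kept, key=lambda c: (-len(c), c))


# Toy input: A is co-cited with B, B with C, but A never with C.
toy = {
    "ref1": {"geneA", "geneB"},
    "ref2": {"geneB", "geneC"},
    "ref3": {"geneD", "geneE"},
    "ref4": {"geneF"},
}
print(single_linkage_clusters(toy))
```

In the toy input, geneA and geneC end up in the same cluster through geneB, which is exactly the transitivity that the hypothesis relies on.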
2.4. Graphical representation of the extracted knowledge

In order to summarize the neighborhoods between the genes, we devised a graphical representation of each cluster based on the daVinci software³ (Fröhlich and Werner, 1994). The nodes of the graph represent the genes, and the edges represent the references in which two genes were co-cited. This software produces a high-quality, well-optimized graph layout for representing objects and the relationships between them. The graph of each cluster was then saved in PostScript format, available from http://chlora.lgi.infobiogen.fr:1234/bib – arath/.

³ http://www.informatik.uni-bremen.de/daVinci/.
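To make the node and edge convention concrete, the Python sketch below builds the edge list of a cluster, with each edge carrying the references in which the two genes were co-cited. It then writes the graph in Graphviz DOT syntax, which is a stand-in chosen for this illustration: the input format of daVinci is not described in this paper and is not reproduced. The function names and the reference identifiers are hypothetical.

```python
from itertools import combinations


def cocitation_edges(references):
    """Return a dict mapping each co-cited gene pair to its list of references.

    references: dict mapping a reference id to the set of gene names it cites.
    """
    edges = {}
    for ref_id, genes in references.items():
        for gene_a, gene_b in combinations(sorted(genes), 2):
            edges.setdefault((gene_a, gene_b), []).append(ref_id)
    return edges


def to_dot(edges, name="cocitations"):
    """Serialize the edges as an undirected graph in Graphviz DOT syntax."""
    lines = [f"graph {name} {{"]
    for (gene_a, gene_b), refs in sorted(edges.items()):
        label = ", ".join(sorted(refs))
        lines.append(f'  "{gene_a}" -- "{gene_b}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


toy = {"ref1": {"geneA", "geneB"}, "ref2": {"geneA", "geneB", "geneC"}}
print(to_dot(cocitation_edges(toy)))
```

Labelling each edge with its references preserves the path back to the literature, which is what makes the graph usable as a retrieval aid rather than only as a picture.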
3. Results and discussion

From the PubMed set of documents, 83 clusters containing at least two genes were built, with only four of them containing more than six genes. From the BIOSIS set, only 60 clusters containing at least two gene names were built.
Table 2 The gene names can be found under different forms and often have several synonyms

| Complete gene name | Acronym | Synonyms | Entry in Mendel database | TAIR |
|---|---|---|---|---|
| APETALA1 | AP1 | Apetala 1, Apetala-1, AP 1, AGL7, ATAP1 | ARAth;Madsl;8 | AP1 |
| Phytochrome A | PHYA | PHYA, FHY2, HY8, F14J9.23, FRE1 | ARAth;PhyA;l | PHYA |
| Tubulin alpha 1, Tubulin-1 alpha | TUBA1 | TUA1, Tub1A | ARAth;TubA;l | Not found |
| Unusual floral organs | UFO | | Not found | UFO |

The synonyms in column 3 are found in sequence databanks and in the literature. The underlined synonyms are those referenced in the specialized Mendel database.
Fig. 2. The different steps in the representation of co-citation of gene names of A. thaliana in literature. (1) Two non-redundant sets of bibliographic references relevant to A. thaliana are retrieved from BIOSIS and PubMed databases. (2) Each reference is indexed with the gene names that occurred in the title and abstract field. (3) A standardization of gene names is done with a local list of generic and synonymous gene names. (4) The clusters of genes are built by a single linkage procedure.
3.1. Evaluation of the results: relevant references and cluster analysis

The first observation is that the clusters built with this single linkage strategy are biologically coherent. Genes belonging to a multigene family are grouped in individual clusters or in subclusters that are easily identifiable on a graph (see the tubulin beta family in Fig. 3). Because gene co-citations were extracted manually, few fortuitous links between entities are observed. Those that occur are mainly due to genes being involved in experimental constructs. Such a case can be seen in Fig. 3, with the link between PHYA and UBQ1, where the promoters of UBQ1 are used to study the expression of PHYA.
3.2. Analysis of a cluster containing genes related to flowering time

As an example, the cluster represented in Fig. 3 is the largest one built from the PubMed dataset. It aggregates 63 gene names and contains, in particular, two interesting subclusters: the first is made up of genes related to floral development, the second of genes related to phytochrome activity. The gene CO (CONSTANS), which links the two subclusters, induces the floral identity genes in A. thaliana (AP1, LFY) (Simon et al., 1996). Its expression is regulated by photoperiod through the antagonistic action of PHYB and CRY2 (Guo et al., 1998). Interestingly, the key role of the CO gene, which acts between the circadian clock and the control of flowering time, has been confirmed in a recent paper (Suarez-Lopez et al., 2001). The link between CO and the genes related to flavonoid biosynthesis (CHS) is due to their physical proximity on the chromosome, as they were sequenced together (Putterill et al., 1993). A subgroup of floral homeotic protein genes (AGL1, AGL2, AGL3, …) is clearly visible in the graphical representation. As these genes are expressed early during flower development, the link to the cluster containing the AP1 gene is biologically relevant. Finally, another subcluster contains genes related to tubulin, the major constituent of microtubules. This may look surprising at first sight, but the link between TUBB1 and PHYB (and PHYA) is not fortuitous, since PHYA and PHYB mediate the hypocotyl-specific downregulation of TUBB1 (Leu et al., 1995).
3.3. Discussion

The main limitation of the work described in this paper is the manual extraction of gene names. This step is highly time consuming and raises the problem of future releases. But it has some advantages. The major one is that the gene name extraction is exhaustive and that the co-citations are extracted from the abstract in its entirety, not only from single sentences as in the case of automatic methods. Another benefit comes from the construction of the reference list of genes with their common synonyms found in the literature. This list can be very useful in many contexts, for example the development of an automatic extraction method based on a gene list, or the annotation of a new plant genome. The manual construction of this list of gene names and synonyms, as well as its use in the indexing process, could probably be improved by using linguistic analysis tools such as Intex (Silberztein, 1993).

Recently, several papers have described automatic methods to identify gene or protein names in the scientific literature (Fukuda et al., 1998; Proux et al., 1998). From these works, it turns out that this step can be automated only in two situations: either a dictionary of gene names is available for the species considered, or syntactic rules are systematically applied for gene naming. Typically, the FlyBase system (The FlyBase Consortium, 1999) makes it possible to build a dictionary of Drosophila gene symbols, which can be used to extract the relevant sentences from the literature (Pillet, 2000). The same ‘dictionary method’ has recently been applied to Saccharomyces cerevisiae (Stapley and Benoit, 2000; Ono et al., 2001). In the case of bacterial species (Bacillus subtilis, Escherichia coli), simple lexical rules can be defined for gene name extraction. In practice, a manual step is still needed. For example, gene names can be cited in an article while bearing no relation to its main topic. They can be cited either in the framework of an experimental construct (reporter genes) or in the framework of a comparison with a reference gene model (lactose operon).

The use of the BIOSIS bibliographic database raises the problem of free access to the literature for the scientific community.
For example, it is currently not possible to include the BIOSIS references in a public WWW database without paying high annual fees. Fortunately, the number of articles concerning A. thaliana in Medline has grown rapidly in recent years. The most probable explanation is that articles on this plant are increasingly published in scientific journals dealing with molecular biology and genetics. These journals are indexed in Medline, while the older agronomy or pure plant physiology journals were not.

The clustering process makes it possible to establish relations between genes or proteins that did not initially co-occur in the same document. The results obtained in the present work show that this process is valid. However, due to the growth of biological knowledge on A. thaliana, this method will probably lead to a single cluster including all A. thaliana genes grouped by transitivity. So in the future it will probably be necessary to limit the graph depth for convenient visualization.
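One possible way to limit the graph depth, given here only as a sketch of the idea and not as a feature of the present work, is to display the neighborhood of a chosen gene up to a fixed number of co-citation links. The Python function below does this with a breadth-first search; the adjacency structure, the function name `neighborhood` and the gene names are hypothetical.

```python
from collections import deque


def neighborhood(adjacency, start, max_depth):
    """Return the genes reachable from `start` in at most `max_depth`
    co-citation links, mapped to their distance from `start`.

    adjacency: dict mapping each gene to the set of genes it is co-cited with.
    """
    depth = {start: 0}
    queue = deque([start])
    while queue:
        gene = queue.popleft()
        if depth[gene] == max_depth:
            continue  # do not expand beyond the requested depth
        for neighbor in adjacency.get(gene, ()):
            if neighbor not in depth:
                depth[neighbor] = depth[gene] + 1
                queue.append(neighbor)
    return depth


# Toy chain of co-citations: A - B - C - D.
chain = {
    "geneA": {"geneB"},
    "geneB": {"geneA", "geneC"},
    "geneC": {"geneB", "geneD"},
    "geneD": {"geneC"},
}
print(neighborhood(chain, "geneA", 2))
```

The recorded distance also indicates how indirect a relation is, so links obtained only through long chains can be displayed or weighted differently from direct co-citations.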
Fig. 3. Graph representing the cluster that contains the greatest number of gene names (63) from the PubMed dataset. The subfamily in blue represents the multigene family of the tubulin beta chain. This subset is related to the rest of the cluster by the fact that TUBB1 is regulated by the phytochromes. Groups of genes involved in the same functional process can be distinguished (here, each color represents a functional group). The entire cluster is related to floral development.
4. Conclusion and perspectives

The first aim of this work was to exploit bibliographic knowledge in the context of A. thaliana genome annotation. Despite many difficulties related to the plant context, this approach turned out to be useful. At this stage, it would be interesting to develop and evaluate an automatic extraction pipeline on the same dataset. Linguistic tools have recently been developed and open new prospects for the exploitation of bibliographic data (Thomas et al., 2000; Sekimizu et al., 1998; Blaschke et al., 1999).

A. thaliana is a rather recent model organism compared to other model species such as Drosophila, yeast or E. coli. In this context, many gene studies have been performed on other species of agronomic interest (tobacco, rice, colza, maize, tomato, …). For this reason, it would be useful to take into account the textual information available on other plants. Although our document dataset was selected on the basis of the citation of ‘Arabidopsis’, many of the documents in fact refer to other plant species. The corresponding references therefore contain gene names specific to other plant species. A fairly simple improvement would be to systematically index these other plant gene names in our dataset. Another, more powerful, approach would be to rebuild a more complete dataset including all bibliographic references related to plant biology. This would be more difficult, since the dataset would become very large and gene nomenclature is even more anarchic for non-model organisms.

In this paper, we described a study of co-occurrences of gene names in the abstracts or titles of biological papers. The study of one cluster shows the relevance of the method and implies the need to develop automatic methods to perform such a study on bigger datasets (at this time, other clusters are available, but they mainly group genes belonging to multigene families). Such an improvement will probably require collaboration between scientists with different skills (biologists, linguists, computer scientists, statisticians, …). We are aware that the next interesting step would be to extract the biological nature of the interaction between the co-cited genes. This would make it possible both to remove false-positive co-occurrences (e.g. those resulting from two physically close genes) and to classify the genes on the basis of the nature of their biological interactions.
Acknowledgements

We thank the Unité Centrale de Documentation of INRA Versailles for its help in retrieving and indexing the BIOSIS dataset, especially Gérard Grozel, Claudine Mader, Marie-Claude Dieulot and Angelica Onteniente. We are grateful to Marie-Odile Delorme and Jean-Loup Risler for their linguistic corrections and to Pierre Brezellec for his supportive help.
References

Blaschke, C., Andrade, M., Ouzounis, C., Valencia, A., 1999. Automatic extraction of biological information from scientific text: protein–protein interactions. ISMB, pp. 60–67.
Fröhlich, M., Werner, M., 1994. The Graph Visualization System daVinci - A User Interface for Applications. Technical Report No. 594, Department of Computer Science, Universität Bremen.
Fukuda, K., Tamura, A., Tsunoda, T., Takagi, T., 1998. Toward information extraction: identifying protein names from biological papers. Pac. Symp. Biocomput., pp. 707–718.
Guo, H., Yang, H., Mockler, T., Lin, C., 1998. Regulation of flowering time by Arabidopsis photoreceptors. Science 279, 1360–1363.
Leu, W., Cao, X., Wilson, T., Snustad, D., Chua, N., 1995. Phytochrome A and phytochrome B mediate the hypocotyl-specific downregulation of TUB1 by light in arabidopsis. Plant Cell 7, 2187–2196.
Louis, A., Ollivier, E., Aude, J.-C., Risler, J.-L., 2001. Massive sequence comparisons as a help in annotating genomic sequences. Genome Res. 11, 1296–1303.
Meinke, D., Koornneef, M., 1997. Community standards for Arabidopsis genetics. Plant J. 12, 247–253.
Nitschke, P., Guerdoux-Jamet, P., Chiapello, H., Faroux, G., Henaut, C., Henaut, A., Danchin, A., 1998. Indigo: a World-Wide-Web review of genomes and gene functions. FEMS Microbiol. Rev. 22, 207–227.
Ono, T., Hishigaki, H., Tanigami, A., Takagi, T., 2001. Automated extraction of information on protein–protein interactions from the biological literature. Bioinformatics 17, 155–161.
Pillet, V., 2000. Méthodologie d’extraction automatique d’information à partir de la littérature scientifique en vue d’alimenter un nouveau système d’information. PhD Thesis, Aix-Marseille III.
Price, C., Reardon, E., 2001. Mendel, a database of nomenclature for sequenced plant genes. Nucleic Acids Res. 29, 118–119.
Proux, D., Rechenmann, F., Julliard, L., Pillet, V., Jacq, B., 1998. Detecting gene symbols and names in biological texts: a first step toward pertinent information extraction. Genome Inform. Ser. Workshop Genome Inform. 9, 72–80.
Putterill, J., Robson, F., Lee, K., Coupland, G., 1993. Chromosome walking with YAC clones in Arabidopsis: isolation of 1700 kb of contiguous DNA on chromosome 5, including a 300 kb region containing the flowering-time gene CO. Molecular Gen. Genet. 239, 145–157.
Schuler, G., Epstein, J., Ohkawa, H., Kans, J., 1996. Entrez: molecular biology database and retrieval system. Methods Enzymol. 266, 141–162.
Sekimizu, T., Park, H., Tsujii, J., 1998. Identifying the interaction between genes and gene products based on frequently seen verbs in Medline abstracts. Genome Inform. Ser. Workshop Genome Inform. 9, 62–71.
Silberztein, M., 1993. Dictionnaires Électroniques et Analyse Automatique de Langues: le Système Intex. Masson, Paris.
Simon, R., Igeno, M., Coupland, G., 1996. Activation of floral meristem identity genes in Arabidopsis. Nature 384, 59–62.
Stapley, B., Benoit, G., 2000. Biobibliometrics: information retrieval and visualization from co-occurrences of gene names in Medline abstracts. Pac. Symp. Biocomput., pp. 529–540.
Suarez-Lopez, P., Wheatley, K., Robson, F., Onouchi, H., Valverde, F., Coupland, G., 2001. CONSTANS mediates between the circadian clock and the control of flowering in Arabidopsis. Nature 410, 1116–1120.
The Arabidopsis Genome Initiative, 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815.
The FlyBase Consortium, 1999. The FlyBase database of the Drosophila Genome Projects and community literature. Nucleic Acids Res. 27, 85–88.
Thomas, J., Milward, D., Ouzounis, C., Pulman, S., Carroll, M., 2000. Automatic extraction of protein interactions from scientific abstracts. Pac. Symp. Biocomput., pp. 541–552.
Wheeler, D., Church, D., Lash, A., Leipe, D., Madden, T., Pontius, J., Schuler, G., Schriml, L., Tatusova, T., Wagner, L., Rapp, B., 2001. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 29, 11–16.
Wilbur, W.J., Coffee, L., 1994. The effectiveness of document neighboring in search enhancement. Inform. Process. Management 30, 253–266.