Searching biotechnology information: A case study


World Patent Information 31 (2009) 36–47

Contents lists available at ScienceDirect

World Patent Information journal homepage: www.elsevier.com/locate/worpatin

Luca Falciola *

RiboVax Biotechnologies SA, 12 Avenue Morgines, 1213 Petit-Lancy, Switzerland

Article info

Keywords: Patents; Patent classification; Antibody; Internet; Search engines; Time factors; Literature; Biotechnology; Text mining; Databases

Abstract

The elaboration of strategies for the effective search of biotechnology information is a challenging task. The large amount of public-domain data on biotechnology products and technologies is scattered among many databases and provided in different document formats. This situation can make it particularly difficult to identify, extract, and aggregate the information needed for detailed patent or scientific analyses. This article presents a case study of different text-based approaches for searching and analyzing biotechnology information in patent and scientific literature, using a series of exemplary searches on antibodies, a class of biological products of broad scientific and commercial interest. The results show the complexity of defining how, and to what extent, biotechnology information is actually searchable through the variety of available resources. Some major factors are identified that should be taken into consideration when searches are performed for evaluating scientific/patent trends, selecting documents potentially relevant for patentability, or identifying valuable technical information.
© 2008 Elsevier Ltd. All rights reserved.

* Present address: Genfit SA, 885 av Eugène Avinée, 59120 Loos, France. Tel.: +33 3 20 16 40 00; fax: +33 3 20 16 40 01. E-mail address: luca.falciola@genfit.com
0172-2190/$ - see front matter © 2008 Elsevier Ltd. All rights reserved. doi:10.1016/j.wpi.2008.05.006

1. Introduction

1.1. Defining and searching biotechnology information

Modern biotechnology provides a large variety of products and technologies broadly applicable in scientific, medical, and industrial environments. Biotechnology is also a model for studies of social and economic research due to its high degree of innovation and its important commercial perspectives [1,2]. Previous studies combined quantitative and qualitative analysis of scientific and patent activity in biotechnology for determining their impact on innovation [3] as well as on science policies [4]. According to the National Center for Biotechnology Information (NCBI): “Biotechnology is the body of knowledge related to the use of organisms, cells or cell-derived constituents for the purpose of developing products which are technically, scientifically and clinically useful.” [5] This definition acknowledges the many opportunities, mainly due to genetic engineering, for providing a large variety of products and technologies.

The search and the analysis of the information on these subjects is an increasingly challenging task. In fact, the high number of public and private entities that are active in biotechnology, together with the decreasing cost and time needed for generating and analyzing biological data, leads to steady growth in the amount and the complexity of biotechnology information that is available in the public domain. Moreover, such information can be:

– Provided in different (non-)textual formats (articles, biological sequences, patent documents, tables summarizing and comparing biological data, images of biological samples, graphics representing experiments, etc.).
– Scattered among many types of publications and databases.
– Published directly through the Internet.

These factors considerably increase the effort required for comparing and aggregating the information for the purpose of evaluating patents, products and technologies, or of interpreting trends in patent and scientific production.

1.2. Biotechnology information and biotechnology inventions

The search and the analysis of biotechnology information is a major element in the proceedings for establishing the patentability of biotechnology inventions which, according to the EPO Guidelines for Examination, are: “Inventions which concern a product consisting of or containing biological material or a process by means of which biological material is produced, processed or used”. [6]

L. Falciola / World Patent Information 31 (2009) 36–47

The Trilateral Search Guidebook in Biotechnology, a document recently published by the Trilateral Project EPO-JPO-USPTO [7], provides some indications on how the information on such products, uses, and processes can be searched for establishing the patentability of biotechnological inventions. Exemplary searches are presented therein, giving a sufficiently broad overview of techniques, of major databases, and of the keywords and search criteria that are applied at EPO, JPO and USPTO. This guidebook is particularly helpful since both EPO and PCT guidelines give limited technical information on search issues for biotechnology-related subject-matters.

Patent Offices have defined specific examination criteria for biotechnology inventions. The Trilateral web site provides a comparative study on the requirements for disclosure and claims in biotechnology, including legal and technical details on the patentability criteria at the three Patent Offices [8]. This document can be compared with the specific Examination Guidelines that are issued by USPTO [9] and other authorities, such as CIPO [10] or UKIPO [11].

1.3. Searching biotechnology information: the example of antibodies

Biotechnology information is associated with a large and heterogeneous panel of subject-matter: protein and DNA sequences, medical uses, cells, biochemical analysis, microbiological processes, etc. How are all these technical aspects actually captured and made searchable in patent and scientific resources? The present article suggests how such a question can be addressed by providing an overview of text-based information resources that are searched using different “biotextology” [12] techniques.

In particular, quantitative analyses of documents and/or database records were performed by extracting information using specific criteria related to antibodies, a specific class of glycoproteins that circulate in blood and biological fluids and represent the principal effectors of the mammalian immune system. Antibodies have very distinctive molecular and biological features that have been studied, modified, and exploited for many different uses, as summarized in reviews [13,14]. Due to their strong binding to organic or inorganic targets (defined as antigens) and their great structural and functional adaptability, antibodies are possibly the molecular tools most extensively used in modern biology for both medical and non-medical applications having scientific and, at least in some cases, major commercial importance [15,16].

Structurally, antibodies are multimeric protein complexes formed by heavy and light chains that are held together by disulfide and non-covalent bonds (Fig. 1). Amongst the different types of antibodies, those best characterized and most widely used are monoclonal antibodies, which are clonally derived from a single cell and produced using cell biology and recombinant DNA technologies. The high sequence diversity of monoclonal antibodies is responsible for the large number of distinct antigens that they can bind in vivo and/or in vitro. The complexity and the amount of information and of specific technical language needed to define all the new antibody-based products and technologies have increased dramatically in the last 30 years.

The main areas of interest when searching information on antibodies can be summarized and related to each other (Fig. 1). An antibody is a biological product characterized by a structure and a format that provide an activity potentially useful for (non-)medical applications.
Distinct R&D technologies allow generating and characterizing antibodies having a specific activity for the desired uses (for example, biological assays for selecting, within a library, the antibodies that bind a pathogenic agent).

General aspects of the search and analysis of antibody information have been described in the Trilateral Search Guidebook on Biotechnology [7], as well as in materials prepared by information providers on biotechnology in general [17] or more specifically on antibodies [18]. Compared to these references, the present study intends to identify some general issues when searching biotechnology information by analyzing the results of identical search strategies in different databases and/or platforms for text-based searches.

This article does not take into consideration the search of biological sequences (i.e. DNA and protein sequences), which deserves specific consideration. Some general guidance on the main databases and search techniques in connection with specific scopes has been provided in previous articles [19,20] and in the Trilateral Search Guidebook on Biotechnology [7]. Most recently, a study has critically compared the content of databases that allow searching for biological sequences disclosed within patent documents [21].

2. Methodologies

Patent databases were accessed directly through their specific websites, as in the case of the PATENTSCOPE (http://www.wipo.int/pctdb/en/), EPO (http://www.epo.org) and USPTO (http://www.uspto.gov) databases. Commercial databases were accessed through the Internet for MICROPATENT (http://www.micropat.com) or through the STN on the Web platform (http://stnweb.fiz-karlsruhe.de/html/english/) for WPINDEX. The literature databases were accessed through the STN on the Web platform in the case of MEDLINE, EMBASE, BIOSIS, and SCISEARCH, or directly from the Internet in the case of HIGHWIRE (http://highwire.stanford.edu/). Other relevant websites are indicated in the text.

The search criteria applied in each analysis are indicated in the description of the figure and/or in the text. Such criteria were assembled using the operators, the field restrictions, and the format appropriate for the specific database. Searches performed with truncated terms are indicated with an asterisk (e.g. antibod*), while terms used as a phrase are indicated within quotation marks (e.g. “monoclonal antibody”).
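The two query conventions just described (right truncation with an asterisk and quoted phrases) can be illustrated with a small matcher. This is a minimal sketch in Python, not the syntax of any of the databases used in the study: the function names `query_to_regex` and `matches` are hypothetical, and the handling of word boundaries and whitespace is an assumption made for illustration.

```python
import re


def query_to_regex(query: str) -> re.Pattern:
    """Translate a simple search term into a case-insensitive regex.

    Assumed syntax (illustrative only): a trailing * means right
    truncation, and double quotes mark a phrase. Unquoted multi-word
    input is also treated as a phrase in this sketch.
    """
    query = query.strip()
    if len(query) >= 2 and query[0] == '"' and query[-1] == '"':
        query = query[1:-1]
    parts = []
    for token in query.split():
        if token.endswith("*"):
            # Truncated term: fixed stem followed by any word characters.
            parts.append(re.escape(token[:-1]) + r"\w*")
        else:
            parts.append(re.escape(token))
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)


def matches(query: str, text: str) -> bool:
    """Return True if the text contains a match for the query."""
    return query_to_regex(query).search(text) is not None
```

With this sketch, `antibod*` matches both "antibody" and "antibodies", while the untruncated phrase "monoclonal antibody" does not match "monoclonal antibodies", which is one reason the choice between truncated and exact terms affects record counts.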
The graphs were elaborated using the Microsoft Office suite.

3. Searching for antibody information in patent literature

3.1. Patent protection & disclosures related to antibodies

The task of drafting patent applications that claim inventions arising from biological research is complicated by the steady pace with which biotechnology evolves and knowledge is accumulated. In fact, a quick search in the main databases shows that several thousand articles and several hundred patent documents are published weekly in this area. The amount, the heterogeneity, and the complexity of such disclosures can make it particularly difficult to define correctly the prior art at a specific date, and consequently the patentable subject-matter, becoming a source of frustration [22]. For their part, Patent Offices face the problem of preparing meaningful Search Reports as a basis for the examination. Many patent applications in the field of biotechnology are considered as “complex applications” since they contain large sets of claims and contemplate multiple possibilities or sophisticated technical parameters [23]. This inherent complexity seems to have a direct effect on patent proceedings, since biotechnology-related patent applications are those combining the highest number of claims and of pages with the slowest patent prosecution [24].

The highly competitive field of antibody research and discovery is no exception, given also that technical progress now provides a large variety of antibody formats, applications, and activities [13–15]. As summarized in recent presentations [25,26], the case law in both Europe and the USA demonstrates how applicants are now required to consider carefully enablement and inventive step for establishing patentability in view of the maturity of antibody technologies. For example, the relevance of disclosures that are provided in both the prior art and the examined application on the antigen characterization [27] or on elaborated functional definitions [28] has to be critically evaluated.

[Figure 1: schematic of Immunoglobulin G (domains VH, VL, CH1, CL, CH2, CH3) surrounded by four linked boxes. PRODUCT STRUCTURE: antibody sequence; antibody format. R&D TECHNOLOGIES: discovery, screening platforms; cloning, expression systems. PRODUCT ACTIVITY: antigen, epitope binding; functional, biological activity. COMMERCIAL USES: non-/medical applications; administration, formulations.]

Fig. 1. Structure and properties of antibodies. A schematic representation of the relationships among the different antibody-related subject-matters is provided together with the structure of Immunoglobulin G (in the center), an example of a human antibody. An Immunoglobulin G results from the interactions of two identical heavy chains with each other and with two identical copies of light chains. Each heavy chain comprises a variable region (VH) and three domains (CH1, CH2, and CH3) forming the constant region. Each light chain comprises a variable region (VL) and a constant region (CL). The variable regions comprise the sequences that are responsible for the specific binding to antigens (see [5,6] for more details).

3.2. Use of antibody-related keywords for searching patent documentation

Antibodies represent a particularly telling example of the many ways biological products can be structurally and functionally defined in both scientific and patent literature. In particular, keyword-based searches need to take into consideration the terminology that inventors and patent attorneys have elaborated for defining antibodies and their activities in patent applications [29]. Even a quick review of claims that have been granted by EPO and USPTO clearly shows how large the variety of functional and/or structural definitions for antibodies is, sometimes applied in an equivocal or questionable manner. Claims are drafted and, in certain cases, allowed in formats specifically adapted to uses, methods, and combinations of features that may be related to the antigen, biological assays, methods of production, and relevant prior art. Some examples:

– Immunoglobulins, in general and/or of specific isotypes (IgG, IgM, etc.).
– Specific phrases (e.g. “monoclonal antibody”, “human antibody”).
– Names of synthetic, recombinant variants (e.g. diabody, peptabody).
– Names including “anti-” (even shortened as “α-” or “a-”) coupled with the recognized antigen (e.g. for TNF: anti-TNF, antiTNF, a-TNF, aTNF) or with the biological activity (e.g. antitumoral).
– The DNA/protein sequence characterizing a single antibody or a class of antibodies, with or without indicating a specific SEQ ID NO.
– A generic wording for defining a compound binding a specific (non-)biological antigen with a measurable affinity and/or showing a specific activity (e.g. listing a ligand, a heterodimer, an agonist, an antigen-binding compound, etc. in independent claims and then indicating that it can be an antibody in dependent claims or in the description only).
– Process-related features (e.g. antibodies that are isolated and/or produced using specific methods).
– (Non-)competition with a known antibody for the binding to an antigen.
– Presence/absence of a biological activity with reference to specific antibody concentration, antigen variants, and/or biological assays.

The precise meaning and applicability of such descriptors may be extensively analyzed during patent proceedings. In fact, many decisions issued by the EPO Technical Board of Appeal [26], as well as comparative studies on practices at major patent offices [8], deal with the problem of the information that is required for establishing which antibodies fall within the scope of the claims. This aspect is connected to the analysis of the disclosure in both the prior art and the patent application, wherein the technical contribution supporting an inventive step (e.g. involving the specificity and/or the biological consequences of the antigen binding) should be clearly identified.
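The “anti-” naming variants in the list above lend themselves to mechanical enumeration when a keyword query is assembled. The following is a minimal sketch, assuming Python; the helper names are hypothetical, the “α” spellings assume that the shortened prefix in the text stands for the Greek letter alpha, and the `OR` operator with quoted terms is only a placeholder, since each database has its own query syntax.

```python
from typing import List

# Prefix spellings modelled on the examples in the text (anti-TNF, antiTNF,
# a-TNF, aTNF). The "α" forms are an assumption about the shortened prefix.
# The list is illustrative, not exhaustive.
ANTI_PREFIXES = ["anti-", "anti", "α-", "α", "a-", "a"]


def antigen_keyword_variants(antigen: str) -> List[str]:
    """Return the spelling variants of an anti-<antigen> keyword."""
    return [prefix + antigen for prefix in ANTI_PREFIXES]


def or_query(terms: List[str]) -> str:
    """Join terms into a single OR query (placeholder syntax)."""
    return " OR ".join(f'"{term}"' for term in terms)
```

Short variants such as "aTNF" are likely to retrieve unrelated records, so in practice they would probably be combined with classification codes or other keywords rather than used alone.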

3.3. Use of patent classification codes for searching antibody information

Patent classification systems are valuable supports to keyword-based strategies for searching patent documents, given the large variety of patentable antibody-related subject-matters (Fig. 1). The following analysis will be limited to IPC codes, given the extensive use of this classification system for restricting the prior art analysis to a subset of patent documents that are potentially more relevant for establishing patentability [30]. The recent IPC8 reform involved the revision of subclass definitions, the introduction of core and advanced levels, and the reclassification of patent documents [31]. IPC provides several codes for classifying patent applications with antibody-related subject-matter, but at least one of the following three IPC codes is commonly present:

– C07K 16/* (immunoglobulins, as a class of proteins).
– A61K 39/395 (medicinal preparations containing antibodies; see also A61K 39/40 and 39/42).
– G01N 33/53 (assays involving the use of antibodies; see also G01N 33/531-577).


These codes do not appear to be always assigned consistently, and sometimes they broadly cover a panel of subject-matters, but they distinguish patent applications quite precisely, since they coexist in no more than 10% of PCT applications (data not shown). IPC-based antibody searches can be improved by using other relevant and more precise IPC codes that are present in a significant number of patent documents and that indicate, for example, the preparation of monoclonal antibodies (C12P 21/08), recombinant DNA technologies involving antibody genes (C12N 15/13), or methods for screening antibody–antigen binding (C40B 30/04). As also indicated in the Trilateral Search Guidebook in Biotechnology [7], ECLA and USPC codes may give additional help by providing more detailed definitions, for example in connection with specific categories of antibodies and antigens (within C07K 16/*) or therapeutic uses (within A61K 39/395).
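How often the three main codes coexist can be checked on any exported set of classification data. The sketch below, in Python, assumes that each record is simply a list of IPC code strings and that “coexist” means at least two of the three main codes on the same application; both are assumptions, as are the function names, and the article does not describe how its own figure was computed.

```python
from fnmatch import fnmatchcase

# The three main antibody-related IPC codes named in the text; the
# trailing * is treated as a shell-style wildcard.
ANTIBODY_IPC_PATTERNS = ["C07K 16/*", "A61K 39/395", "G01N 33/53"]


def matched_patterns(ipc_codes, patterns=ANTIBODY_IPC_PATTERNS):
    """Return the subset of patterns matched by a record's IPC codes."""
    return {p for p in patterns if any(fnmatchcase(c, p) for c in ipc_codes)}


def cooccurrence_share(records, patterns=ANTIBODY_IPC_PATTERNS, min_codes=2):
    """Share of antibody-related records that carry several main codes.

    Only records matching at least one pattern are counted in the
    denominator; min_codes=2 is an assumed reading of "coexist".
    """
    hits = [matched_patterns(codes, patterns) for codes in records]
    hits = [h for h in hits if h]
    if not hits:
        return 0.0
    return sum(1 for h in hits if len(h) >= min_codes) / len(hits)
```

For example, with the hypothetical records `[["C07K 16/28"], ["A61K 39/395", "C07K 16/18"], ["G01N 33/53"], ["C12N 15/13"]]`, three records are antibody-related and one of them carries two main codes, giving a share of one third.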

3.4. Identifying antibody-related trends in patent documents

The trends in antibody information present in PCT applications can be quickly analyzed using PATENTSCOPE, the search interface provided by WIPO. The number of PCT applications claiming priority documents filed during the last 10 relevant years and including “antibod*” in the original abstract has a chronological distribution showing a peak for the year 2000, followed by a steep decrease. Interestingly, this trend is similar to those observed for both PCT applications including “antibod*” within the corresponding WPINDEX abstract and PCT applications that generically claim peptides or proteins under the IPC code C07K */*. However, a different trend is observed by searching PATENTSCOPE with a combination of the main antibody-related IPC codes defined in Section 3.3. In fact, there is a constant growth in the number of these PCT applications, only interrupted in the most recent years, when this combination of codes still identifies more antibody-related PCT applications than the simple presence of “antibod*” in the original abstract (Fig. 2A).

When the trend in the use of each major antibody-related IPC code is analyzed separately, the continuous growth that was previously identified appears to result from the combination of distinct trends. The search shows a quickly growing number of PCT applications containing IPC codes that are related to novel pharmaceutical compositions and assays based on antibodies, rather than to new antibodies, whose production appears to be stagnating or even decreasing in the most recent years (Fig. 2B).

At the level of granted patents, the “monoclonal antibody” phrase search shows how the intensity of patent granting in a specific technological field may present major differences over time and between patent offices (Fig. 3). During the last 20 years, the number of EP patents peaked a first time in 1994–1995, and then decreased until 2002–2003, when the growth restarted.
Moreover, this phrase is present in the title, abstract and/or claims of at least 40% of the EP patents that were published in the late 1980s and have this phrase in the description, while most recently the value has decreased to about 20%. The trends in US patents appear shifted in a peculiar manner, with a peak in 1998–1999, followed by a decrease that stopped only in 2006–2007. The disappearance of the “monoclonal antibody” phrase from the title, abstract, and/or claims is even more evident in US patents, which present it in slightly more than 10% of all those including this phrase in the description.

This chronological analysis can be extended to the distribution of USPC and ECLA codes that correspond to the selected IPC codes (see Fig. 2). Not so surprisingly, the trends for both EP and US patents having codes associated with antibodies as a class of proteins are similar to those identified when searching “monoclonal antibody” in title, abstract, and/or claims. However, the situation clearly differs between EP and US patents when looking at codes associated with pharmaceutical preparations or immunoassays, with a strong growth in US patents that is not observed in EP patents (Fig. 4).

Previous chronological analyses of patent production, related to either monoclonal antibodies or biotechnology in general, have been generated by means of different databases and search criteria, showing various trends that are hardly comparable to each other and to the present results [3,4,32]. However, confirming previous analysis of the prosecution of biotechnology patent applications [24], it has also been observed that the grant rate for antibody-related EP patent applications is significantly lower than the overall grant rate (data not shown).

Even though it is tempting to draw some general conclusions from the data presented above, a much more detailed analysis is necessary to identify to what extent these simple searches are indicative of deeper trends in patent filing and prosecution, which may actually be due to a combination of factors, for instance:

– The choices made by applicants when establishing and performing research and patent strategies (e.g. more oriented towards generating and protecting patentable matter using existing antibodies rather than generating novel antibodies).
– The technical evolution (i.e. the availability of new technologies, and possibly related new wording, for antibody-related products and uses).
– The wording chosen by patent attorneys when drafting patent applications and claims (i.e. introducing antibodies within the description of any drafted patent application or identifying more precisely antibody-related inventions).
– The attitude of patent examiners when assigning patent classification codes, as well as when searching and examining claims.
Finally, these trends may affect patent searching, since the different distributions that are identified over time, among patent classification codes, and at different patent offices indicate the importance of “fine tuning” (e.g. combining more keywords, using full-text search and/or patent classification codes in the most appropriate manner) for performing a detailed analysis of the patent literature, both in general and for specific technologies or time frames.
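The percentages discussed in this section (the share of patents that carry a phrase in the title, abstract and/or claims among those carrying it in the description) reduce to a per-year ratio once search results are exported. A minimal sketch in Python follows; the record layout of `(year, in_front_fields, in_description)` tuples and the function name are assumptions made for illustration, not the procedure used to produce Fig. 3.

```python
from collections import defaultdict


def front_field_share_by_year(records):
    """Per-year share of documents with the phrase in front fields.

    records: iterable of (year, in_front_fields, in_description) tuples,
    where in_front_fields means title, abstract and/or claims.
    Documents without the phrase in the description are ignored, so the
    description hits form the denominator.
    """
    with_phrase = defaultdict(int)  # phrase found in the description
    in_front = defaultdict(int)     # ... and also in the front fields
    for year, in_front_fields, in_description in records:
        if not in_description:
            continue
        with_phrase[year] += 1
        if in_front_fields:
            in_front[year] += 1
    return {year: in_front[year] / with_phrase[year] for year in sorted(with_phrase)}
```

Comparing these ratios across years and across patent offices is one simple way to decide whether a title/abstract/claims search is still sufficient for a given time frame or whether a full-text search is needed.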

4. Searching for antibody information in the scientific literature

4.1. Accessibility of biotechnology information in the scientific literature

The complexity of searching patent literature is counterbalanced by at least a few certainties:

– The dates on which a patent application is filed, published, abandoned, or granted.
– The document stability, unless specific and regulated proceedings are performed (e.g. translation, amendments to the claims).
– The limited number of issuing authorities (i.e. the patent offices).

These features do not clearly apply to scientific literature, in general and particularly in biotechnology, given the increasingly extensive dissemination of data and articles through literature-specific or generic Internet platforms. In fact, several circumstances can make it difficult to establish the information actually disclosed and searchable by means of an article at a certain date, since more and more publishers give different opportunities by:

– Including alternative versions of the article (i.e. prior to official publication, with supplementary materials in the electronic version only, with comments later added by authors or readers).
– Releasing the electronic version of an article weeks or months before the official printed publication (or even not having such an official printed version at all).

[Figure 2: two line charts. x-axis: Priority Year (1996–2005); y-axis: Published PCT applications. Series in (A): Antibod* in WPINDEX; Antibod* in original abstract; C07K 16/* or A61K 39/395 or G01N 33/53; C07K */*. Series in (B): C07K 16/*; A61K 39/395; G01N 33/53.]

Fig. 2. Trends of antibody-related patent production examined in PCT applications using WPINDEX or PATENTSCOPE. (A) Number of published PCT applications claiming priority patent applications filed in the years 1996–2005 and comprising antibod* in the WPINDEX abstract, in the original abstract, classified using any of the main antibody-related IPC codes (C07K 16/*, A61K 39/395, or G01N 33/53; see Section 3.3), or classified using the generic IPC code for peptides and proteins (C07K */*). (B) Number of published PCT applications claiming priority patent applications filed in the years 1996–2005 and classified using specific antibody-related IPC codes. Original data on published PCT applications have been extracted from the PATENTSCOPE database.

Fig. 3. Analysis of US and EP patents that contain “monoclonal antibod*” and that were granted in the period 1988–2007.


[Figure 4: two charts, "EP Patents: ECLA Codes" (series C07K16/*, A61K39/395*, G01N33/53) and "US Patents: USPC Codes" (series 530/387, 530/388, 530/389; 424/130; 435/7.1). x-axis: Publication Year, in two-year intervals from 1988–1989 to 2006–2007; y-axis: Documents (x 100).]

Fig. 4. Analysis of patent production according to the major antibody-related ECLA (for EP patents; A) and USPC (for US patents; B) during the period 1988–2007.

In fact, the actual searchability of biotechnology information in scientific literature depends on the choices made by:

– Publishers
  • Policies on date and content of printed and electronic version(s);
  • Requirements for submitting sequences to databases;
  • Availability of the article for full-text search.
– Authors
  • Anticipating the information on a web site or at a congress;
  • Use of terms in title, abstract, keywords, and in the full text;
  • Selection of the journal where the article is submitted.
– Database providers
  • Type, completeness, and update of indexing;
  • Selection and coverage of journals;
  • Frequency of database update.

These aspects need appropriate analysis, as demonstrated in recent EPO presentations with reference, for instance, to electronic literature archives such as those available at SCIENCEDIRECT (http://www.sciencedirect.com) or arXiv (http://www.arxiv.org) [33], and more generally on the search in non-patent literature [34].

4.2. Searching database indexing and full-text literature

Databases of scientific literature offer different search and indexing features to retrieve records for potentially relevant articles from the thousands of journals they cover, acting as interfaces between the authors and the public. In general, there is a proliferation of databases that are becoming difficult to compare. For instance, according to a recent article [35], at least 1000 molecular biology databases, covering increasingly specific topics, are potentially searchable. However, databases are also becoming helpful as additional research tools that may be instrumental for studying biological processes or generating biological products, since they allow aggregating heterogeneous data in a meaningful manner for the purpose of elaborating and/or testing biological hypotheses [36].

In the case of biotechnology information, the most commonly used tool is certainly PubMed (http://www.ncbi.nlm.nih.gov/sites/entrez), a freely available Internet interface for accessing the MEDLINE database and many other related resources, for example those related to biological sequences [37,38]. At least 2 million searches are performed daily in PubMed/MEDLINE, which now contains almost 17 million records, more than 0.7 million of them added in 2007 alone. In addition to bibliographic information, most of the PubMed/MEDLINE records are indexed using MeSH (Medical Subject Headings) terms. Full text searching may allow a deeper analysis of scientific terminology (in general or for selected topics/authors), as well as of the citing and/or cited articles, in particular for identifying references that describe antibody-related matters (e.g. immunoglobulins, variable region, heavy chain, monoclonal).

Given the growing usage and size of PubMed/MEDLINE, different authors have presented more elaborate approaches for improving the access to and retrieval of relevant articles, for example by integrating text and MeSH search [39] and text mining technologies [38,40,41]. Platforms for searching and analyzing PubMed/MEDLINE using such techniques are now freely available, like eTBLAST [42] or GoPubMed [43], and provide alternative means to search and select records from this database. Moreover, scientists and patent examiners are now presented with web-based alternatives to PubMed/MEDLINE that couple multiple search limitations with full text search (including Google Scholar, HIGHWIRE, and SCIRUS); these have been compared in terms of search features, retrieval speed and precision, showing considerable differences [44,45]. Full text searching can allow a deeper analysis for evaluating the use of specific wording (in general or for selected topics/authors) and specific citing and/or cited references, in particular for identifying articles that describe the use of specific biological materials (antibodies, cell lines, proteins, etc.) or technical details that are hard to identify using database searching only. The next section shows how multi-database and full-text searching can provide useful (but sometimes equivocal) results when searching a general topic like antibodies.

4.3. Identifying antibody-related trends in scientific literature

Several databases of scientific literature are simultaneously accessible using a platform such as STN on the Web, where it is easy to perform different search strategies and compare the results across databases in terms of number of records. In this manner, the antibody-related content was searched in the MEDLINE, EMBASE, SCISEARCH, and BIOSIS databases for the period 1988–2007, similarly to what has been done above for patent literature (Figs. 3 and 4).

Major quantitative differences can be observed in the number of records that are returned when searching these databases using a simple keyword (antibod*) or phrase (“monoclonal antibod*”) and when distinguishing records containing the search query in the original title/abstract or in the indexing only (Fig. 5). Apart from the differences in the absolute number of records, the indexing contribution for “antibod*” is clearly higher for certain databases (e.g. MEDLINE) compared to others (e.g. BIOSIS). The contribution of database annotation appears even more important for “monoclonal antibod*”, for example in SCISEARCH, where approximately 40% of the records are retrieved only by searching in the indexing. However, the number of articles that are additionally retrieved using database indexing is still a small fraction of the whole scientific literature in which antibody-related wording appears.
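The distinction between records retrieved through the original title/abstract and records retrieved through the indexing only can be expressed with plain set operations on record identifiers. The following is a minimal sketch in Python; the function name and the use of integer identifiers are hypothetical, and it assumes that both result sets come from the same database so that identifiers are comparable.

```python
def indexing_only_share(title_abstract_hits: set, indexing_hits: set) -> float:
    """Share of all retrieved records found through the indexing only.

    title_abstract_hits: identifiers of records matching in title/abstract.
    indexing_hits: identifiers of records matching in the database indexing.
    """
    all_hits = title_abstract_hits | indexing_hits
    if not all_hits:
        return 0.0
    # Records that the title/abstract search alone would have missed.
    indexing_only = indexing_hits - title_abstract_hits
    return len(indexing_only) / len(all_hits)
```

For instance, with title/abstract hits `{1, 2, 3}` and indexing hits `{3, 4, 5}`, two of the five retrieved records come from the indexing only, i.e. a share of 0.4.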
In fact, if the search is repeated in a platform that allows the full-text search of large sets of biomedical journals, such as HIGHWIRE, the number of records presenting the relevant keyword is multiplied by a factor of 4 (for antibody) and even by a factor of 6 (for monoclonal antibody) when extending the search to the full text.

The same analysis can be performed chronologically for comparing the number of records that have been generated for articles published in the last twenty years. When searching literature databases with “antibod*” (Fig. 6A) or “monoclonal antibod*” (Fig. 6B) in title/abstract only, there is a common trend (even more evident for the search phrase) towards a general decrease and a convergence in the number of records that are retrieved for the most recent years. However, more complex trends are observed when focusing on the database annotation, especially for “monoclonal antibod*”. The trend that is observed in the title/abstract search is confirmed in SCISEARCH only, while other databases present more diversified trends, suggesting some important differences in indexing strategy among databases, and possibly over time. If the same analysis is repeated in HIGHWIRE, the growth in the use of antibody-related wording in the full-text articles appears constant throughout the last 20 years, when compared to the stagnating or even decreasing number of potentially searchable titles and abstracts using such wording (Fig. 7). In fact, while 30–40% of the articles including antibody-related wording in the text also presented such wording in title and abstract in the late 1980s, this percentage decreased to 10–15% in recent years, a percentage quite similar to that observed in patent literature (see Fig. 3).
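The percentages discussed above are simple ratios between two record counts obtained for the same period. As a minimal sketch, assuming hypothetical counts (the numbers below are placeholders for illustration only and are not data from Fig. 7, and the class and function names are invented for this example), the share of full-text hits that a title/abstract search would also find, and the corresponding full-text multiplication factor, could be computed as follows:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeriodCounts:
    """Record counts for one publication period (hypothetical values)."""
    period: str
    title_abstract_hits: int  # articles with the term in title/abstract
    full_text_hits: int       # articles with the term anywhere in the text


def title_abstract_share(counts: PeriodCounts) -> float:
    """Fraction of full-text hits that a title/abstract search also finds."""
    if counts.full_text_hits == 0:
        raise ValueError("full_text_hits must be positive")
    return counts.title_abstract_hits / counts.full_text_hits


def full_text_factor(counts: PeriodCounts) -> float:
    """How many times more records the full-text search returns."""
    if counts.title_abstract_hits == 0:
        raise ValueError("title_abstract_hits must be positive")
    return counts.full_text_hits / counts.title_abstract_hits


# Placeholder counts chosen only to illustrate the calculation.
examples = [
    PeriodCounts("1988-1989", title_abstract_hits=350, full_text_hits=1000),
    PeriodCounts("2006-2007", title_abstract_hits=125, full_text_hits=1000),
]

for c in examples:
    print(f"{c.period}: share={title_abstract_share(c):.1%}, "
          f"factor={full_text_factor(c):.1f}x")
```

Keeping the two counts together per period makes it straightforward to repeat the calculation for each two-year interval and each search phrase.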
The antibody-specific trends can be interpreted in various ways (“commoditization” of antibodies as both a word and a research tool; preference of authors for using acronyms or alternative wording in titles and abstracts), but they basically show how important it can be to extend the search appropriately to the full text and/or indexing for retrieving potentially relevant information, for recent publications in particular. Thus, an antibody-related search may be performed using strategies that are adapted not only to the specific scope but also to the specific database indexing and time frame that are searched. Moreover, it becomes increasingly important for the search strategy to decide whether and how to analyze scientific literature in the four possible formats in which articles can be searched:

– As a printed publication.
– As a record in a database.
– As a full-text searchable document through the Internet.
– As a citation in a list of cited/citing references.

Finally, these data suggest that, in general, the searchability of relevant records for a given topic in literature databases:

– Should be carefully compared across databases in order to perform a complete search, since it may be affected by the criteria that each database applies for selecting, amongst the large number of articles, those most “deserving” antibody-specific indexing.
– May change not only amongst databases but also within the same database over time.
– May be insufficient for searching those specific technical details that only full-text searching can identify in scientific literature.
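The comparison across databases suggested above can be expressed as the share of records that are retrieved through indexing only. The following is a minimal sketch under stated assumptions: the database names and counts are hypothetical placeholders (not values read from Fig. 5), and the function name is invented for this illustration.

```python
def indexing_only_share(title_abstract_hits: int, indexing_only_hits: int) -> float:
    """Share of all retrieved records that are found only through indexing."""
    total = title_abstract_hits + indexing_only_hits
    if total == 0:
        raise ValueError("no records retrieved")
    return indexing_only_hits / total


# Hypothetical counts per database: (title/abstract hits, indexing-only hits).
counts_by_database = {
    "DATABASE_A": (600, 400),
    "DATABASE_B": (900, 100),
}

shares = {
    name: indexing_only_share(ta, idx)
    for name, (ta, idx) in counts_by_database.items()
}

# Rank databases by how much their indexing adds to a title/abstract search.
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {share:.0%} of records retrieved via indexing only")
```

Repeating the same calculation per publication period would show whether the indexing contribution of a given database also changes over time.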

5. Searching for antibody information in the World Wide Web

5.1. Internet-based retrieval of scientific information: the Googleome

Every day, thousands of researchers explore the World Wide Web searching for relevant biotechnology information. The searches are performed not only in dedicated web pages and databases (see Section 4.2) but also in various Internet-based platforms that are intended to foster both the rapid release and the direct exchange of information. Examples of such platforms are the Public Library of Science, PLoS (http://www.plos.org), Nature Precedings (http://precedings.nature.com), BioMed Central (http://www.biomedcentral.com), or other “Wikimedia” platforms for publishing, sharing, and aggregating scientific and technical information [46]. Such tools can be important instruments for accessing and distributing scientific information, even though it is still uncertain if and how they can guarantee long-term stability and preservation of documents [47].

We are possibly entering the Science 2.0 era [48], during which scientists will more routinely disclose raw experimental results, large sets of data, preliminary hypotheses, or draft papers directly on the Internet for others to see, comment on, and maybe use for further investigations. Such Open Access is supposed to make research more collaborative and productive, even though, as a consequence of the increase in the number of online journals, the articles now referenced tend to be more recent, and fewer journals and articles appear to be cited [49]. In any case, the most immediate effect of this type of electronic publishing is the increase of Internet-only/e-gray biomedical literature.

Patent Offices are both providers, through their websites, and users of Internet content. Specific considerations for searching the Internet are presented in the PCT International Search and Preliminary Examination Guidelines [50].
The EPO perspective on prior art search has been presented in connection with technical and legal aspects of disclosure through the Internet [51]. However, a recent decision of the EPO Technical Board of Appeal insisted on the importance of taking Internet content into consideration whenever it is made available through the websites of “regulated and trusted publishers”, which allow the date and content of the retrieved information to be established with a high degree of certainty [52]. This decision, together with other ones and “tricks” for using the Internet as a source of prior art, has been the object of presentations made by EPO representatives [34,53]. Regarding the importance of Internet searching, the USPTO Search Templates for USPC codes often cite Internet search engines such as Google or Yahoo [54]. In fact, these templates clearly state that:

“An Internet search should be considered when a search of the resources listed above fails to locate relevant prior art. A preliminary Internet search may be performed to obtain an overview of the technology, and to identify additional search terms and related product information.”

Fig. 5. Quantitative analysis of antibody-related scientific production in literature databases in the years 1988–2007. The databases were searched using “antibod*” or “monoclonal antibod*”, and limiting the search to title/abstract only, to indexing only, or to full text only.

[Fig. 6 plot: panel A, “Antibod*”; panel B, “Monoclonal antibod*”. Each panel shows Records (x 1000) against Publication Year, in two-year periods from 1988–1989 to 2006–2007, for MEDLINE, BIOSIS, EMBASE, and SCISEARCH.]

Fig. 6. Trends of antibody-related scientific production in different literature databases for the years 1988–2007. The databases were searched using “antibod*” (A) or “monoclonal antibod*” (B), further limiting the search to title/abstract (normal line) or to the indexing only (dotted line) for the indicated publication years.

Some general conditions for citing an electronic document as a printed publication for prior art purposes, as applied at the USPTO and at the EPO, have been summarized in recent articles [55,56]. The following section will focus on the most used Internet tool for searching information: Google.

Fig. 7. Quantitative analysis of the articles within full-text searchable journals that are hosted by HIGHWIRE. The database was searched for articles containing “antibody” or “monoclonal antibody” and published in the indicated years, with the indicated limitations.

5.2. Overview of the antibody-related Googleome

Technical features and possible developments of Google, and of Internet search technologies in general, have been recently reviewed, showing how the means for ranking and filtering search results should be improved for finding the most relevant information among billions of web pages having different structures, formats, and content quality [57]. However, a specific aspect not fully evaluated in the literature is the importance of “basic” Google in the dissemination and searchability of scientific information in connection with the specific electronic format. In fact, even though HTML files are largely predominant on the Internet, other file formats are indexed by Google and searchable using the Advanced Search page in order to retrieve documents that present specific keywords and file formats [58]. In particular, the PDF file format often has the advantage of presenting formal aspects that make these files more trustworthy for analyzing patent matters (at least when compared to HTML pages), since they are usually characterized not only by precise authorship and a publication date but also by their stability over time. The possibility of citing documents provided in the PDF format for prior art analysis has been described in a decision of the EPO Technical Board of Appeal, even though specific conditions apply [59], similarly to documents available through FTP servers, as recently decided in the USA [60].

My experience in searching Google is that millions of articles in the PDF format are directly searchable as full-text documents by adding the limitation to PDF files to a combination of keywords. In addition to official resources (publishers, conference organizers, research institutions), Google explores documents made available in the public domain by many other types of websites, including those of the authors themselves (to complete their bibliography), university students or professors (as materials for journal clubs or courses), or companies (for scientific and/or business communications).

In general, a search through Google provides a specific “Googleome” [61], a term that has been proposed for indicating the Google hits returned for a search query (e.g. all the information found on the Internet on a person, a company, or a scientific subject). The Googleome is a continuously evolving aggregation of information that can be analyzed in a more controlled manner within subsets of web pages by combining:

– One or more keywords or a distinctive phrase for a search topic.
– A specific file format (e.g. PDF).
– One or more descriptors for a specific type of document, e.g.:
  • For articles or books, they can be “abstract”, “summary”, “bibliography”, “references”, “chapter”, “acknowledgements”, “conclusions”, journal name, etc.
  • For company communications or legal decisions, they can be the company and/or product name(s), “patent”, “patentability”, etc.
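The combination of a phrase, a file format, and document descriptors listed above can be written as a single query string. The sketch below is only an illustration: the article itself refers to the Google Advanced Search page, whereas this example assumes the commonly used `filetype:` and `site:` operators and `OR` grouping typed directly in the search box, and the helper name `build_query` is hypothetical. How a search engine interprets such operators may change over time.

```python
from typing import Optional, Sequence


def build_query(phrase: str,
                descriptors: Sequence[str] = (),
                file_format: Optional[str] = None,
                site: Optional[str] = None) -> str:
    """Assemble a search string from a phrase, descriptors, format, and site."""
    parts = [f'"{phrase}"']
    if descriptors:
        # Alternative descriptors are grouped with OR.
        quoted = " OR ".join(f'"{d}"' for d in descriptors)
        parts.append(f"({quoted})" if len(descriptors) > 1 else quoted)
    if file_format:
        # Assumed operator restricting results to one file format.
        parts.append(f"filetype:{file_format}")
    if site:
        # Assumed operator restricting results to one website.
        parts.append(f"site:{site}")
    return " ".join(parts)


query = build_query("monoclonal antibody",
                    descriptors=["abstract", "summary"],
                    file_format="pdf")
print(query)
```

The optional `site` argument allows the same phrase to be restricted to a single journal website, which corresponds to the kind of single-journal comparison discussed for PNAS.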

Such full-text searches can be sufficiently complex, given the number of characters that the Google search window accepts and the power of the Google search engine. This approach may accelerate the retrieval of relevant documents by avoiding the reading of sets of documents that are identified using databases when searching for specific technical details. Instead, very different resources, which nevertheless present information in the same format, are directly and simultaneously searchable through the Internet by making use of Google.

Some interesting observations can be made by searching the Internet for quantifying the antibody-related Googleome. In February 2008, the “antibody” and “monoclonal antibody” search queries provided more than 70 million pages and more than 19 million pages, respectively. However, if such searches are limited to PDF files, the difference is reduced (about 1.9 and 1.7 million pages). Moreover, the use of descriptors for articles or books in the search strategy allows a further limitation to subsets of documents that may be considered for defining prior art at a specific date (Fig. 8). These numbers can give a broad idea of the full-text information in the form of books, articles, or other documents in PDF format that are searchable using Google.

[Fig. 8 diagram: central node “Monoclonal Antibody” PDF Googleome (approx. 1.7 million documents), surrounded by literature-related descriptors: Summary, Abstract; References, Bibliography; Keywords; Chapter (0.02 million); Acknowledgments; Introduction, background; Thesis, dissertation (0.05 million).]

Fig. 8. Representation of the number of hits that are obtained by searching the “monoclonal antibody” PDF Googleome with different literature-related keywords. The search was performed in February 2008 without limitations on the publication date.

This approach can also be used to demonstrate that, in addition to the differences in the relevant content among databases, interesting observations can be made even at the level of the searchable information that is present in a single journal on a specific subject. Proc. Natl. Acad. Sci. USA (PNAS) is one of the world’s most-cited multidisciplinary scientific weekly journals, which covers biological, physical, and social sciences and is included in most scientific databases. PNAS articles are fully searchable using both the search interface within the journal website (http://www.pnas.org) and Google.

[Fig. 9 bar chart: Documents (x 1000), axis from 1 to 14, for the following searches: Title and/or Abstract of PNAS (using www.pnas.org); Title, abstract, indexing of PNAS (using SCISEARCH); Title, abstract, indexing of PNAS (using BIOSIS); Title, abstract, indexing of PNAS (using EMBASE); Title, abstract, indexing of PNAS (using MEDLINE); Full text search in PNAS (using www.pnas.org); Full text search in PNAS (using Google Scholar); Full-text PDF search in PNAS (using Google in www.pnas.org only).]

Fig. 9. Quantitative analysis of the records or articles in Proc. Natl. Acad. Sci. USA (PNAS) that were retrieved by using the search phrase “monoclonal antibody” in different databases and different search platforms. The searches were performed in February 2008 without limitations on the publication date.

The searchable information in PNAS on monoclonal antibody can be quickly evaluated by applying this phrase in various search platforms for accessing the journal content (Fig. 9). In this manner it is possible to show some substantial differences in the antibody-related content amongst databases, in the non-negligible range of some hundreds of records (possibly due to the coverage of the journal over time, coupled to the indexing strategy for each database; see Section 4.3). Moreover, when the “monoclonal antibody” search is extended to the full text, the result depends on the specific search engine that is used, with Google finding even more hits than the PNAS search engine itself, both as PDF and HTML files. Thus, Google can be considered a useful tool not only for identifying additional search criteria but also for completing more traditional searches, by identifying documents with precise technical details (e.g. a protein name, a drug, etc.) that are rarely, late, or not at all included in database records and that only full-text searches

may quickly identify. Google would be even more useful if structured searches within PDF files were possible, or if the use of Digital Object Identifiers (DOI) for providing an immutable reference to an Internet-available document (see http://www.doi.org; [62]) were more common.

6. Conclusions

This article provides an overview of the different tools and approaches for extracting potentially relevant biotechnology information in general, using antibodies as an exemplary topic and simple search strategies. Despite the fact that large amounts of information are unceasingly added to databases and websites, a simple keyword such as “antibody” or a search phrase such as “monoclonal antibody” may provide some general indications on the information made available in the public domain on a complex subject. In fact, these search terms are still the most generally recognized and applied in any kind of literature and database indexing, as well as in technical and scientific communications. Consequently, they may generate the most comparable results across the different resources.

Without being exhaustive or intended to show any superiority of one database over others, this analysis of resources and data shows some major issues of general interest for a professional searcher, in particular:

– The actual “searchability” of a document within different resources is a variable that may affect the search outcome, and may heavily depend on the type of:
  • Information format (e.g. text, sequence, figure, table, structure);
  • Documentation (e.g. scientific literature or patent documents);
  • Database (e.g. due to specific indexing, availability of full-text search features);
  • Document format (e.g. classical scientific journal, PDF file, HTML page).
– A structured approach should take into account the different degrees of usefulness of each search criterion before applying and combining them in the most appropriate manner [63].
– An appropriate technical/scientific analysis (prior to and during the search) helps in establishing cause-effect relationships, the technical subject, and the related use of search terms in the literature.
– The wording used within documents and/or databases may evolve over time.
– The scientific literature should be analyzed not only in relationship with citing/cited documents, but more broadly in its online context [64], since the Internet provides powerful means for rapidly modifying, searching, and linking information to documents.

The examples provided in this article point to the fact that the type and the amount of biotechnology information actually searchable heavily depend on complex criteria that may be difficult to identify, select, and combine efficiently. Professional searchers should be aware that they play a central role in defining a specific knowledge base including the information that is disclosed in a variety of formats, each having its own features and importance according to the type of evaluation that is requested (Fig. 10). The complexity of searching biotechnology information may be, if not eliminated, understood and reduced to a manageable dimension by performing initial exploratory, general searches within databases (e.g. phrase searching, IPC main/advanced codes, limited to indexed information). These searches may provide an insight into any hard-to-anticipate, “deep” biasing effects that may substantially affect the time needed and the outcome of the search. Similar considerations can be applied to searches that are intended for establishing either claim patentability or trends in the scientific and patent production.

[Fig. 10 diagram: a knowledge base formed by text/tabulated information, structure/sequence information, graphic/visual information, and cited/linked information is explored through structured/sequential searches by means of patent, literature, structure, and sequence databases; techniques for text mining, cascade searching, and Internet analysis; and human analysis of technical problems and of documents. The searches provide support for specific strategic, technical, and patent evaluations.]

Fig. 10. The knowledge base and the central role of search activities in supporting the different types of evaluation.

Disclaimer

The views and the opinions expressed in this article are the author’s personal thoughts on these subjects. They are not intended to be considered opinions and positions of RiboVax Biotechnologies or Genfit, nor do they imply any commitment by RiboVax Biotechnologies or Genfit to any particular course of action.

Acknowledgements

I would like to thank RiboVax management and Marianne Murphy for the support and the review of the article. This article is partly based on the author’s presentations at IQPC Pharmaceutical Patent Litigation Strategies (London, UK; March 2008), PIUG Meeting 2007 (Costa Mesa, USA; May 2007), and Pharma-Bio-Med 2006 (Lisbon, Portugal; November 2006). Thanks to the organizers of these conferences for the opportunity they gave me to present and discuss the initial studies that led to this article.

References

[1] Various authors. Biotechnology: its origins, organization and outputs. In: Ebers M, Powell W, editors. Research policy, vol. 36; 2007 [4, special issue].
[2] Van Beuzekom B, Arundel A. OECD biotechnology statistics; 2006.
[3] Tansey M, Stembridge B. The challenge of sustaining the research and innovation process. World Pat Inform 2005;27:212–26.
[4] Oldham P, Cutter A. Mapping global status and trends in patent activity for biologic and genetic material. Genomics Soc Policy 2006;2:62–91.
[5] Biotechnology Medical Subject Heading (MeSH; Scope Note); NCBI.
[6] Guidelines for Examination in the European Patent Office, December 2007. CIV, 3.1.
[7] Trilateral Search Guidebook in Biotechnology, Ver. 2; 28 November, 2007.
[8] Biotechnology Patent Practices project 24.1 comparative study.
[9] Manual of Patent Examining Procedure, Chapter 2400, Biotechnology, August 2001; USPTO.
[10] Manual of Patent Office Practice, Chapter 17, Biotechnology, version revised on November 2007; CIPO.
[11] Examination Guidelines for Patent Applications relating to Biotechnological Inventions in the UK Intellectual Property Office, September 2007; UKIPO.
[12] Biotextology – Text Search Techniques for Biological Information (presentation available on ).
[13] Kim S et al. Antibody engineering for the development of therapeutic antibodies. Mol Cells 2005;20:17–29.
[14] Laffly E, Sodoyer R. Monoclonal and recombinant antibodies, 30 years after. Human Antibodies 2005;14:33–55.
[15] Baker M. Upping the ante on antibodies. Nat Biotech 2005;23:1065–72.
[16] Van Brunt J. The monoclonal maze. Signals 2005.
[17] STN Workshop Biotechnology Searching on STN, April 2002.
[18] STN: Finding Antibodies & Immunoglobulins. Webinar; February 2006.
[19] Xu G et al. Patent sequence databases. World Pat Inform 2002;24:95–101.
[20] Yoo H et al. Intellectual property management of biosequence information from a patent searching perspective. World Pat Inform 2005;27:203–11.
[21] Andree PJ et al. A comparative study of patent sequence databases. World Pat Inform 2008;30:300–8.
[22] Latimer M. Patenting inventions arising from biological research. Genome Biol 2004;6:203.
[23] Scott J. When is a search not a search? The EPO approach. World Pat Inform 2007;29:108–16.

[24] Van Zeebroeck N et al. Patent inflation in Europe. World Pat Inform 2008;30:43–52.
[25] Shea T. Patenting antibodies: the past, present, and future. In: Presentation at 3rd annual antibody therapeutics IBC life sciences conference; 2005.
[26] Yeats S. Patenting antibodies: EPO practice. In: Presentation at Swiss Biotech Association; April 25, 2007.
[27] Noelle v. Lederman, 355 F.3d 1343. CAFC; 2004.
[28] Technical Board of Appeal of European Patent Office. T1300/05; 2006.
[29] Webber P. Patenting antibodies. Nat Rev Drug Disc 2006;5:97.
[30] Sternitzke C. Reducing uncertainty in the patent application procedure: Insights from invalidating prior art in European patent applications. World Pat Inform 2009;31:48–53.
[31] Foglia P. Patentability search strategies and the reformed IPC: a patent office perspective. World Pat Inform 2006;29:33–53.
[32] Lawrence S. Biotech patenting upturn. Nat Biotech 2006;24:1190.
[33] Van Staveren M. Academic publications on the internet. Contribution to search matters conference, 10/11 September 2007, The Hague, The Netherlands. Available as Macromedia file in .
[34] Verbandt Y, Martin E. Non-patent literature. Contribution to search matters conference, 10/11 September 2007, The Hague, The Netherlands. Available as Macromedia file in .
[35] Galperin M. The molecular biology database collection: 2008 update. Nucl Acids Res 2008;36(Database Issue):D2–4.
[36] Philippi S, Köhler J. Addressing the problems with life-science databases for traditional uses and systems biology. Nat Rev Genet 2006;7:482–8.
[37] Wheeler D et al. Database resources of the National Center for Biotechnology Information. Nucl Acids Res 2007;35(Database Issue):D5–D12.
[38] Ananiadou S, Kell D, Tsujii J. Text mining and its potential applications in systems biology. Trends Biotechnol 2006;24:571–9.
[39] Camous F et al. On combining text and MeSH searches to improve the retrieval of MEDLINE documents. In: Proceedings of the third conférence en recherche d’Informations et applications. Lyon, France: CORIA; 2006.
[40] Krallinger M, Valencia A. Text-mining and information retrieval services for molecular biology. Genome Biol 2005;6:224.
[41] Cohen K, Hunter L. Getting started in text mining. PLoS Comput Biol 2008;4:e20.
[42] Errami M et al. eTBLAST: a web server to identify expert reviewers, appropriate journals and similar publications. Nucl Acids Res 2007;35(Web Server Issue):W12–5.
[43] Doms A, Schroeder M. GoPubMed: exploring PubMed with the Gene Ontology. Nucl Acids Res 2005;33(Web Server Issue):W783–6.
[44] Felter L. Google scholar, scirus, and the scholarly search revolution. Searcher 2005;13:43–8.
[45] Vanhecke T et al. PubMed vs. highwire press: a head-to-head comparison of two medical literature search engines. Comput Biol Med 2006;37:1252–8.
[46] Keim B. Wikimedia. Nat Med 2007;13:231–3.


[47] Commission of European Communities. Scientific information in the digital age: access, dissemination and preservation. COM 56 (Final); 2007.
[48] Waldrop M. Science 2.0. Sci Am 2008;298:68–73.
[49] Evans JA. Electronic publication and the narrowing of science and scholarship. Science 2008;321:395–9.
[50] Guidelines for the processing by international searching and preliminary examining authorities of international applications under the patent cooperation treaty; 2004. p. 11.01–11.26.
[51] Archontopoulos E. Prior art search tools on the Internet and legal status of the results: a European Patent Office perspective. World Pat Inform 2004;26:113–21.
[52] Technical Board of Appeal of European Patent Office; 2007. T1134/06.
[53] Andlauer D, Tsitsilonis L. Internet citations. Contribution to search matters conference, 10/11 September 2007, The Hague, The Netherlands. Available as Macromedia file in .
[54] .
[55] Leoni R. The Internet as a source of prior art: The European Patent Office’s view; 2008.
[56] Coggins W. United States: When is an electronic document a printed publication for prior art purposes?; 2007.
[57] Henzinger M. Search technologies for Internet. Science 2007;317:468–71.
[58] See .
[59] Technical Board of Appeal of European Patent Office; 2005. T373/03.
[60] SRI International v. Internet Security Systems. United States Court of Appeals for the Federal Circuit 2007-1065.
[61] BioLOG blog in RG Discovery web site; 2004.
[62] Bourne P. Will a Biological Database Be Different from a Biological Journal? PLoS Comput Biol 2005;1:e34.
[63] Nijhof E. Subject analysis and search strategies – Has the searcher become the bottleneck in the search process. World Pat Inform 2007;29:20–5.
[64] Kleinberg J. Analyzing the scientific literature in its online context. Nature Web Focus on the Access to the Literature; 2004.

Luca Falciola is Director for Intellectual Property at Genfit (Loos, France). He previously held positions at RiboVax Biotechnologies (Petit-Lancy, Geneva, Switzerland) as Director for Intellectual Property & Collaborative Research, and at Serono as Patent Information Specialist and Patent Attorney. Luca holds a PhD in applied genetics and has done post-doc research in molecular biology in Italian and Swiss scientific institutions. He is a member of AIDB (Italian Association of Patent Searchers; http://www.aidb.it/en/home.htm), where he acts as editor of the electronic quarterly “AIDB Newsletter” and coordinator of the AIDB working group on professional certification. He has provided courses and presentations on searching patent and scientific information to PhD students and professionals.