Expert Systems with Applications 26 (2004) 269–277 www.elsevier.com/locate/eswa
Automated categorization of German-language patent documents C.J. Fall a,*, A. Törcsvári b, P. Fiévet c, G. Karetka c
a ELCA, E-doc division, Avenue de la Harpe 22-24, CH-1000 Lausanne 13, Switzerland
b Arcanum Development, Baranyai utca 10, H-1117 Budapest, Hungary
c World Intellectual Property Organization, 34 Chemin des Colombettes, CH-1211 Genève 20, Switzerland
Abstract The categorization of patent documents is a difficult task, and we study how to automate it most accurately. We report the results of applying a variety of machine learning algorithms for training expert systems in German-language patent classification tasks. The taxonomy employed is the International Patent Classification, a complex hierarchical classification scheme in which we make use of 115 classes and 367 subclasses. The system is designed to handle natural language input in the form of the full text of a patent application. The effect on the categorization precision of indexing either the patent claims or the patent descriptions is reported. We describe several ways of measuring categorization success that account for the attribution of multiple classification codes to each patent document. We show how the hierarchical information inherent in the taxonomy can be used to improve automated categorization precision. Our results are compared to an earlier study of automated English-language patent categorization. © 2003 Elsevier Ltd. All rights reserved. Keywords: Patent; Intellectual Property; Categorization; Hierarchical taxonomy; International Patent Classification
1. Introduction When a patent application is considered or submitted, the search for previous inventions in the field is facilitated by prior accurate patent classification. Most classified patent documents are made available publicly over the Internet, and can be browsed within each category. National and international patent offices worldwide currently spend large amounts on manually classifying all patent applications in complex taxonomies. The number of patent applications is currently rising rapidly worldwide, creating the need for patent categorization expert systems to help speed up and streamline the process (Calvert and Makarov, 2001; Hull et al., 2001; Smith, 2002). The retrieval of classified patent documents is crucial to patent-issuing authorities, potential inventors, research and development units, and others concerned with the application or development of technology. In industry, patents are a major source for gathering intelligence about competitors' activities, but this source necessitates sophisticated tools for meaningful data mining (Teichert and Mittermayer, 2002; Vachon, Grandjean, & Parisot, 2001). A patent categorization expert system would * Corresponding author. Tel.: +41-21-613-2111; fax: +41-21-613-2100. E-mail address:
[email protected] (C.J. Fall). 0957-4174/$ - see front matter © 2003 Elsevier Ltd. All rights reserved. doi:10.1016/S0957-4174(03)00141-6
also be valuable to individual inventors or patent attorneys, to assist them in identifying similar earlier patents and to help them make sense of the very detailed classification systems employed. The World Intellectual Property Organization (WIPO) is currently developing an automated patent categorization tool, designed to classify documents in the International Patent Classification (IPC), which is a standard taxonomy developed and administered by WIPO. The IPC covers all areas of technology and is currently used by the industrial property offices of more than 90 countries. The expert system is designed to handle raw natural-language text input and to avoid using any manually selected keywords for classification. A tool for automating or assisting patent categorization ideally needs to support the classification of documents in a variety of languages. This report presents the research into German-language patent categorization that was performed in the course of this project. German serves as a demanding test case, as it has a very diverse vocabulary due to its extensive use of compound words. In view of the large vocabulary, the range of topics, and the number of categories, this application serves as a demanding test of machine learning categorization algorithms. A number of authors have previously reported on procedures for automating English-language patent
classification using machine learning techniques. Chakrabarti, Dom, Agrawal, and Raghavan (1997, 1998a) developed a Bayesian hierarchical patent classification system covering 12 subclasses organized in three levels. In these small-scale tests, the authors found that by accounting for the already-known classifications of cited patents, the effectiveness of the categorization could be improved (Chakrabarti, Dom, & Indyk, 1998b). Larkey created a tool for attributing US patent codes based on a k-nearest-neighbors (k-NN) approach (Larkey, 1998, 1999). The inclusion of phrases during indexing is reported to have increased the system's precision for patent searching but not for categorization (Larkey, 1998). The overall system precision is, however, not described. Gey, Buckland, Chen, and Larson (2001) have created a web-based solution for attributing patent codes in the US and IPC taxonomies, but have not performed detailed tests of precision. Kohonen et al. (2000) achieved a precision of 60.6% when classifying patents into 21 categories in the course of creating a two-dimensional map of patent documents. Teichert and Mittermayer (2002) have performed comparative tests of k-NN and centroid-based algorithms applied to patent documents split into five technology categories, achieving 74–78% precision. A comprehensive set of patent categorization tests is reported in Krier and Zaccà (2002). These authors organized a comparative study of various academic and commercial categorizers, but do not disclose detailed results. The participant with the best results has published his findings separately (Koster, Seutter, & Beney, 2001). Categorization is performed at the level of 44 or 549 categories specific to the internal administration of the European Patent Office, with around 78 and 68% precision, respectively, when measured with a customized success criterion. Overall, published results on patent categorization are often small-scale or focused on an industry-specific application.
We reported earlier on the automation of IPC categorization with English-language documents (Fall, Törcsvári, Benzineb, & Karetka, 2003) and found good results with a Support Vector Machine (SVM) algorithm. We also found that the categorization difficulty varied throughout the IPC and that the precision obtained with expert systems could be predicted from various statistics computed on the training set. To our knowledge, there is only one previous report about automated categorization of German-language patents (Zakaria, 1989). In this early work, a categorization assistance system identifying keywords in the titles and descriptions of patents is proposed but not implemented. In this paper, we provide an overview of the complex IPC taxonomy used for classifying patents in Section 2. We describe a collection of German-language documents we have created to train and test IPC categorization expert systems in Section 3, and report our methodology in Section 4 and our results with a variety of algorithms and parameters in Section 5. We compare these results with those obtained when classifying English-language patent documents, and finally conclude in Section 6.
2. IPC Taxonomy The IPC is a complex hierarchical classification system comprising sections, classes, subclasses and groups. The latest edition of the IPC contains eight sections, about 120 classes, about 630 subclasses, and approximately 69,000 groups. To date, the taxonomy has been updated every 5 years. The IPC taxonomy specification, listing all categories and placement rules, is published online (www.wipo.int/classifications) and in printed form by WIPO (1999). Complete texts of the IPC are available in English, French, German, and several other major world languages. The German Patent and Trade Mark Office (DPMA, Deutsches Patent- und Markenamt, www.dpma.de) publishes the IPC specification in German (www.depatisnet.de/ipc/index.html) (Deutsches Patent- und Markenamt, 1999a). German patent documents are available online in the so-called DEPAROM collection (www.deparom.de), which contains several hundred thousand records with IPC codes manually attributed by the DPMA. The IPC divides all technological fields into sections designated by one of the capital letters A to H, according to A: ‘Human necessities’; B: ‘Performing operations, transporting’; C: ‘Chemistry, metallurgy’; D: ‘Textiles, paper’; E: ‘Fixed constructions’; F: ‘Mechanical engineering, lighting, heating, weapons, blasting’; G: ‘Physics’; H: ‘Electricity’. Each section is subdivided into classes, whose symbols consist of the section symbol followed by a two-digit number, such as A01. In turn, each class is divided into several subclasses, whose symbols consist of the class symbol followed by a capital letter, for example, A01B. Table 1 shows a portion of the IPC specification at the start of Section A. A detailed introduction to the taxonomy may be found elsewhere (Adams, 2000). In this work we do not make use of IPC subdivisions below subclass level, which consist of main groups and of a hierarchy of subgroups.
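The symbol structure just described (a section letter, a two-digit class number, and a subclass letter) can be illustrated with a short parser. This is only a sketch; the function name is ours, and group-level suffixes are ignored, as in this work.

```python
import re

def parse_ipc_symbol(symbol: str) -> dict:
    """Split an IPC symbol such as 'A01B' into its hierarchical parts.

    Section: one capital letter A-H; class: section plus two digits;
    subclass: class plus one capital letter.  Anything below subclass
    level (groups, subgroups) is ignored here.
    """
    m = re.match(r"^([A-H])(\d{2})([A-Z])?", symbol)
    if m is None:
        raise ValueError(f"not a valid IPC symbol: {symbol!r}")
    section, class_digits, subclass_letter = m.groups()
    return {
        "section": section,                                  # e.g. 'A'
        "ipc_class": section + class_digits,                 # e.g. 'A01'
        "subclass": (section + class_digits + subclass_letter)
                    if subclass_letter else None,            # e.g. 'A01B'
    }

print(parse_ipc_symbol("A01B"))
# {'section': 'A', 'ipc_class': 'A01', 'subclass': 'A01B'}
```

A class-only symbol such as "A01" parses with `subclass` left as `None`.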
As we will see below, training expert systems to predict accurate subclasses is difficult enough. A list of keywords for facilitating the manual classification of German-language patents has been collected by hand. This index contains a compilation of over 130,000 words and phrases linking technical subject matters and processes with the corresponding IPC codes. It is available in printed (Deutsches Patent- und Markenamt, 1999b) and electronic form from DPMA. Similar indexes are also available in English and French. As we are most interested in building language-independent solutions avoiding the use of keywords, the use of this information for enhancing categorization accuracies has yet to be investigated. Suggestions for its application are given in Zakaria (1989). Patent categorization in the IPC is complicated by the following factors: References. Many IPC categories contain references and notes, which serve to guide the classification procedure, as illustrated in Table 1. There are two main types of references: limitations of scope, which serve to restrict
Table 1
Sample portion of the IPC taxonomy at the start of Section A, in English and German

Section A
  English: Human necessities
  German:  Täglicher Lebensbedarf

Class A01
  English: Agriculture, forestry, animal husbandry, hunting, trapping, fishing
  German:  Landwirtschaft, Forstwirtschaft, Tierzucht, Jagen, Fallenstellen, Fischfang

Subclass A01B
  English: Soil working in agriculture or forestry; parts, details, or accessories of agricultural machines or implements, in general
  German:  Bodenbearbeitung in Land- oder Forstwirtschaft; Teile, Einzelheiten oder Zubehör von landwirtschaftlichen Maschinen oder Geräten allgemein

References for this subclass
  English: Making or covering furrows or holes for sowing, planting or manuring A01C 5/00; machines for harvesting root crops A01D; mowers convertible to soil working apparatus or capable of soil working A01D 42/04; mowers combined with soil working implements A01D 43/12; soil working for engineering purposes E01, E02, E21
  German:  Herstellen oder Zudecken von Furchen oder Löchern zum Säen, Pflanzen oder Düngen A01C 5/00; Maschinen zum Ernten von Wurzelfrüchten A01D; Mähmaschinen, umrüstbar zu Bodenbearbeitungsmaschinen oder einsetzbar zur Bodenbearbeitung A01D 42/04; Mähmaschinen, kombiniert mit Bodenbearbeitungsgeräten A01D 43/12; Bodenbearbeitung für bauliche Zwecke E01, E02, E21
the patents classified in the category and which indicate related categories where some patents should preferably be placed, and guidance references, which list related categories where similar patents are classified. The references may list categories that are distant in the IPC hierarchy. In Table 1, for example, a reference to class E01 exists in subclass A01B. The IPC can thus be thought to contain a multitude of hyperlinks at all levels of classification. Placement rules. Patent classification is governed by placement rules. In certain parts of the IPC, a last-place rule governs the classification of documents relating to two categories at the same hierarchical level, for example in class C07. This rule indicates that the second of two categories should always be selected if two are found to concord with the subject of the patent application. In other parts of the IPC different specific rules hold, for example in subclass B32B where a first-place rule is applied. Secondary codes. A majority of patents do not only have a main IPC code, but are also associated with a set of secondary classifications, relating to other aspects expressed in the patent. Human experts classifying patents are usually free to attribute any number of additional codes. The taxonomy thus contains large overlapping categories (Teichert and Mittermayer, 2002). In some parts of the IPC, it is obligatory to assign more than one category to a patent document if it meets certain criteria. For example, in subclass A61K, a cosmetic preparation with therapeutic properties must also be classified in A61P. Vocabulary. The terms used in patents are unlike those in other documents, such as the news articles that have been widely used for past categorization benchmarks (Lewis, Yang, & Rose, 2003). Many vague or general terms are often used in order to avoid narrowing the scope of the invention. 
For example, in the pharmaceutical industry, patent applications tend to recite all possible therapeutic uses for a given compound (Vachon et al., 2001). Unlike news stories, patents are necessarily all different on a semantic level, as each must describe a new invention. This rule may complicate categorizer training. In addition, recent documents may contain much new vocabulary.
Furthermore, the IPC categories often cover a vast and sometimes disparate area, thereby creating a large vocabulary in some categories. For example, class G09 refers to “Educating, Cryptography, Display, Advertising, Seals”. To overcome such variety, large collections of documents are needed to train automated categorization engines.
3. German patent document collection In this report, we make use of a collection of 78,386 German-language patent documents from the DEPAROM source, dating from 1987 to 2002, to train and test machine learning algorithms for patent categorization. As technology evolves, some parts of the IPC become outdated, and there is little point in retaining these when building an expert system for future use. In constituting this corpus, we have retained the 115 IPC classes and 367 subclasses that have been most frequently used in recent years. Both granted patents and patent applications are included. A slightly larger collection of German patent documents is made freely available for further research on our web site (www.wipo.int/ibis/datasets). The statistics of our document distribution are shown in Table 2. In view of the difference between the average and median numbers of documents per category, we see that the collection contains a large number of sparsely populated classes and a few heavily-used ones (such as B60, F16, G01, H01, and H04). The documents are split into training and test collections by patent application date, with earlier documents used for training and later documents used for testing. The document distributions follow the current frequency of use of the IPC, subject to the following rules: above 450 available documents per category, a 2/1 ratio was used for training/test numbers; between 300 and 450 documents, we take 300 training documents and use the remainder as test documents; below 300 documents, we use 10 documents for testing, and all others for training. It should be noted that these rules are not the same as taking all documents before a single given date as training documents
Table 2
Statistics of the document collection, showing the number of categories, the number of documents, and the maximum, minimum, average, and median numbers of documents per category, for training and test collections at class and subclass levels

IPC depth  Dataset   Categories  No. docs  Max docs  Min docs  Ave docs  Med docs
Class      Training  115         55,215    3904      20        480       231
Class      Test      115         23,171    1952      10        201       10
Subclass   Training  367         59,895    1467      20        163       105
Subclass   Test      367         16,147    734       10        43        10
and those after that date as test documents. Nevertheless, the scenario used here ensures that a sufficient number of training documents are obtained in each category. Note, in Table 2, that the test collections at class and subclass level are different and that we do not limit the number of documents in large categories. The DEPAROM documents contain a title, lists of applicant companies and inventors, a claims section defining the legal boundaries of the invention protection, and a full description. It should be noted that no patent abstract is available. A sample patent record in XML format is shown in Table 3. The main IPC code is seen in the mc attribute of the <ipcs> tag and a secondary code is found in the ic attribute of the <ipc> tag (in this example, both are in subclass B02B).
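The per-category training/test split rules described above can be sketched as follows. This is a sketch only; the function name and the treatment of the exact 450-document boundary are our assumptions.

```python
def split_category(docs):
    """Split one category's documents (assumed sorted by application
    date, earliest first) into training and test sets:
      - above 450 documents: a 2/1 training/test ratio;
      - between 300 and 450: exactly 300 training, the rest test;
      - below 300: all but 10 for training, 10 for testing.
    """
    n = len(docs)
    if n > 450:
        n_train = round(n * 2 / 3)   # 2/1 training/test ratio
    elif n >= 300:
        n_train = 300                # fixed number of training docs
    else:
        n_train = n - 10             # keep 10 for testing
    return docs[:n_train], docs[n_train:]
```

Because earlier documents fill the training portion, each category has its own effective cut-off date, which is why the split differs from using a single global date.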
4. Methodology We make use of the ‘rainbow’ package, part of the bow (bag-of-words) toolkit (McCallum, 1996), for indexing the patent documents prior to algorithm training. It should be noted that by default this package does not support accented characters, so we make use of a preliminary customized indexing package for German documents. Indexing is performed at word level since earlier studies indexing patent phrases, rather than words, have shown disappointing results (Koster and Seutter, 2003; Larkey, 1998). We make use of two alternative indexing approaches: (i) we index all different words without any linguistic treatment other than a conversion to lower-case; (ii) we additionally implement a rudimentary German stemming algorithm that accounts for the large variety of German suffixes used for distinguishing cases and plurals (Schewe, 1997). Essentially, this involves removing common suffixes (es, em, er, en, ern, s, e, lich, isch, heit, keit) and converting ß to ss. While more sophisticated German algorithms have been published (Lezius, Rapp, & Wettler, 1998), we require simple and fast procedures to cope with our volume of data. In this second approach we also remove from the dictionary all words that appear only once or twice. In both cases, we remove 231 common German stopwords. The rainbow package implements multinomial Naïve Bayes (NB), k-NN, and SVM algorithms. Word indexing that accounts for word frequencies in each document is
Table 3
Sample German-language XML patent document, abridged

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE records SYSTEM "../../../ipctraining.dtd">
<records>
<record cy="DE" an=" " date="19991013" pn="DE 19950316 A1 20010426" dnum="19950316" kind="A1">
<ipcs ed="7" mc="B02B00300">
<ipc ic="B02B00700"/>
</ipcs>
<ins><in>Kuris, Isaak, Dr.-Ing., 14471 Potsdam, DE</in></ins>
<pas><pa>Kuris, Isaak, Dr.-Ing., 14471 Potsdam, DE</pa></pas>
<tis><ti xml:lang="DE">Verfahren und Vorrichtung zum Schälen und/oder Schleifen von Korn, Samen und von ähnlichen Naturprodukten</ti></tis>
<cls><cl xml:lang="DE">1. Verfahren zum Schälen und/oder Schleifen von Korn, Samen und von ähnlichen Naturprodukten bei dem das zu bearbeitende Produkt über einen Einlass (1) in einen Behandlungsraum (2) zwischen einem Behandlungsrotor (3) und einen diesen umgebenden gelochten Siebkessel (4) eingebracht wird, diesen durchläuft und dabei geschält und/oder geschliffen wird, das geschälte und/oder geschliffene Produkt gesammelt und durch einen Auslass (5) abgeführt wird und die Schäl- und/oder Schleifrückstände mittels zugeführter Luft über den gelochten Siebkessel (4) abgesaugt werden, dadurch gekennzeichnet, dass während des Durchlaufs des Produktes durch den Behandlungsraum (2) auf das Produkt künstlich erzeugte Zwangsschwingungen einwirken, die die Fließfähigkeit des Produktes erhöhen. 2. Verfahren nach Anspruch 1, dadurch gekennzeichnet, dass die Zwangsschwingungen vom Behandlungsrotor (3) und/oder dem Siebkessel (4) ausgehen, die mit einer Frequenz vorzugsweise zwischen 20 und 220 Hz schwingen. […]</cl></cls>
<txts><txt xml:lang="DE">Die Erfindung betrifft ein Verfahren und eine Vorrichtung zur Abnahme der äußeren und/oder inneren Schale von Korn, Samen und ähnlichen Naturprodukten. Derartige Verfahren und Vorrichtungen sind bekannt. So wird von E. N. Grinberg “Herstellung von Graupen”, Agropromisdat, Moskau 1986, eine Kornschleifmaschine beschrieben, deren Behandlungskammer einen Behandlungsraum aufweist, der zwischen einer senkrechten Hohlwelle mit darauf montiertem Behandlungsrotor und einem diesen mit einem Abstand umgebenden Siebkorb liegt. Der Behandlungsrotor besteht dabei aus auf der Hohlwelle mit Zwischenabständen angeordneten Schleifscheiben. Das zu bearbeitende Korn wird von oben in die Maschine eingebracht und durch die eigene Schwerkraft durch den Behandlungsraum bewegt. Der Nachteil dieser Maschine besteht darin, dass der Wirkungsgrad abhängig ist von der Menge des eingebrachten Korns. Wird wenig Korn eingebracht, ist die Schälwirkung und die Produktivität gering; wird viel Korn eingebracht, steigen Energieverbrauch und Kornbruch. Mit der Reisschleifmaschine BSRB der Firma Bühler Deutschland GmbH (Betriebsanleitung TZ 974de/9510) wurde der Versuch unternommen, einen Kompromiss zwischen Schälwirkung, Produktivität, Energieverbrauch und Kornbruch zu finden. In diese Maschine wird das Korn von oben außer mit Schwerkraft noch mit Hilfe einer Vorschubschnecke in den Behandlungsraum eingebracht. […]</txt></txts>
</record>
</records>
employed for the NB and SVM algorithms. By contrast, we found in initial tests that the k-NN algorithm performs better with word occurrence indexing (binary weighting). The rainbow k-NN algorithm samples the 30 closest neighbors throughout all tests. The SVM algorithm uses a linear kernel, and for performance reasons we limit the vocabulary to 50,000 words, selected on the basis of information gain, and at most 500/100 training documents per IPC class/subclass, respectively. We have also implemented our own categorization package, which includes a k-NN algorithm, referred to herein as k-NN-1, and a linear least squares fit method (LLSF) (Sebastiani, 2002). These algorithms both make use of word occurrence indexing. k-NN-1 uses a cosine distance to find neighboring documents. We choose an optimal value of k = 30, following initial tests with k = 1, 10, 20, 30, 40. The LLSF method was previously found to be an excellent performer, but computationally intensive (Sebastiani, 2002). By implementing a sparse matrix optimizer (Paige and Saunders, 1982), we find that several hundred categories, tens of thousands of documents, and hundreds of thousands of words in the vocabulary can nowadays be handled on a desktop PC with this approach. This range is well suited to our needs here. Our algorithms are trained with documents pre-classified on the basis of their main IPC code only. Each document thus appears only once in the training set, despite sometimes being associated with several IPC categories. We index two different collections of XML fields of the patent documents and then perform categorization tests on them: (a) the first 300 different words of the titles, inventors, companies, and claims sections are used to describe each document; (b) the first 300 different words of the titles, inventors, companies, and full descriptions are used to represent each document.
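The field-indexing step just described, building each document's representation from its first 300 different words, can be sketched as follows. This is a sketch; the helper name is ours, and whitespace tokenization stands in for the toolkit's actual tokenizer.

```python
def first_n_different_words(fields, n=300):
    """Collect the first `n` distinct words across the given XML field
    texts (titles, inventors, companies, then claims or description),
    lower-cased and in order of first appearance.
    """
    seen = []
    for text in fields:
        for word in text.lower().split():
            if word not in seen:
                seen.append(word)
                if len(seen) == n:
                    return seen
    return seen
```

Truncating every document to the same number of distinct words is what normalizes document length across training and test sets.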
This procedure normalizes the length of the training and test documents and limits the vocabulary variety often found in the full descriptions. We choose not to make use of the full descriptions as these were found to yield poor overall accuracies in our earlier work (Fall et al., 2003). We retain the first 300 words for simplicity and computational efficiency. The collections of fields are indexed with the same weight within each document and added to the same feature vector. The output from the expert categorizers consists of a ranked list of categories for each test document. In the following investigations, we consider three different evaluations to flag a categorization success, as shown in Fig. 1. These evaluations serve different purposes, depending on whether an autonomous classifier is sought or whether a human-assistance tool is required. They have been chosen because they are felt to be best adapted to the case at hand, where documents are often associated with more than one category but always have a single main classification. The (micro-averaged) precision is defined here as the proportion of test documents that fulfill one of three conditions. In the top-prediction scheme (1), we compare the top predicted
Fig. 1. Three evaluations of categorization success. In the top-prediction scheme, we compare the top predicted category with the main IPC symbol; in the three-guesses scheme, we compare the top three predicted categories with the main IPC symbol; in the all-categories scheme, we compare the top predicted category with all IPC symbols attributed to each document.
category with the main IPC category. In the three-guesses approach (2), we compare the top three categories predicted by the classifier, chosen on the basis of the confidence levels computed by the various algorithms, with the main IPC category. If a single match is found, the categorization is deemed successful. This measure is adapted to evaluating categorization assistance, where a user ultimately makes the decision of the correct category. In this case, it is tolerable that the correct guess appears second or third in the list of suggestions. In the all-categories scheme (3), we compare the top prediction of the classifier with all categories associated with the document and present in the main IPC symbol and in additional IPC symbols. If a single match is found, the categorization is deemed successful. This last evaluation was used earlier by other authors (Krier and Zaccà, 2002).
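The three success criteria can be sketched as a small evaluation routine; the function and key names are ours, and predictions are assumed to arrive as ranked category lists.

```python
def evaluate(predictions, main_codes, all_codes):
    """Micro-averaged precision under the three success criteria:
      - top-prediction: the top-ranked category equals the main code;
      - three-guesses: the main code appears in the top three;
      - all-categories: the top-ranked category matches any code
        attributed to the document (main or secondary).
    `predictions` is a ranked category list per document, `main_codes`
    the single main IPC category, `all_codes` the full attributed set.
    """
    n = len(predictions)
    top = sum(p[0] == m for p, m in zip(predictions, main_codes))
    three = sum(m in p[:3] for p, m in zip(predictions, main_codes))
    any_cat = sum(p[0] in codes for p, codes in zip(predictions, all_codes))
    return {"top-prediction": top / n,
            "three-guesses": three / n,
            "all-categories": any_cat / n}
```

By construction, three-guesses and all-categories precision can never fall below top-prediction precision on the same predictions.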
5. Results and analysis We first determine the best fields on which to index patent documents for building high-accuracy patent classifiers. In Tables 4 and 5, we display the categorization precisions obtained at IPC class and subclass levels, respectively. In runs 1 and 2 we compare the precisions obtained with the claims and descriptions, respectively (both including titles, inventors, and companies). The results are grouped by each evaluation of categorization success. For both the NB and k-NN algorithms, at class and subclass level, we find that the start of the descriptions is best at differentiating patents. The drop in precision when using the claims section is largest for the k-NN algorithm. In our work on English-language patents (Fall et al., 2003), we also found that the description field was the most adequate. This corresponds with the expected behavior, as the start of the descriptions is often where the authors define the main scope of their inventions and provide a short history of the particular field, as seen, for example, in Table 3. For the rest of our experiments, we thus use only the start of the descriptions. The vocabulary obtained when indexing all the words in the patents is huge, with over 2,100,000 different words
Table 4
Categorization accuracies at IPC class level using a variety of algorithms, indexing fields, and evaluations of categorization success (NB, k-NN, and SVM: rainbow; k-NN-1 and LLSF: custom package)

Run  Field            Indexing       Evaluation      NB (%)  k-NN (%)  SVM (%)  k-NN-1 (%)  LLSF (%)
1    (a) Claims       (i) All        Top-prediction  42      41        –        –           –
2    (b) Description  (i) All        Top-prediction  44      48        –        –           –
3    (b) Description  (ii) Filtered  Top-prediction  56      50        51       49          60
3b   (b) Description  (ii) Filtered  Top-prediction  59      52        51       52          65
1    (a) Claims       (i) All        3-Guesses       58      64        –        –           –
2    (b) Description  (i) All        3-Guesses       59      71        –        –           –
3    (b) Description  (ii) Filtered  3-Guesses       79      73        70       72          86
3b   (b) Description  (ii) Filtered  3-Guesses       78      72        71       71          86
1    (a) Claims       (i) All        All-categories  56      52        –        –           –
2    (b) Description  (i) All        All-categories  56      61        –        –           –
3    (b) Description  (ii) Filtered  All-categories  68      63        64       62          74
3b   (b) Description  (ii) Filtered  All-categories  68      62        61       61          76

For computation reasons, the SVM runs use a smaller vocabulary and a limited number of training examples. Results for run 3b are aggregated from the predicted categories obtained at subclass level (run 3 in Table 5).
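The run 3b aggregation mentioned in the note, which maps each predicted subclass to its parent class via the symbol prefix, can be sketched as follows. Summing confidence scores per class is our assumption; the aggregation rule itself is not specified here.

```python
from collections import defaultdict

def aggregate_to_class(subclass_ranking):
    """Aggregate ranked (subclass, confidence) predictions into
    class-level predictions, exploiting the fact that an IPC subclass
    symbol (e.g. 'A01B') contains its class symbol ('A01') as a prefix.
    Returns (class, score) pairs sorted by descending score.
    """
    scores = defaultdict(float)
    for subclass, confidence in subclass_ranking:
        scores[subclass[:3]] += confidence   # 'A01B' -> 'A01'
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Because several predicted subclasses may share one class, a classifier that scatters its errors within the correct class still produces a correct class-level prediction after aggregation.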
in the full descriptions, and around 850,000 different words in the first 300 words of each patent description. For comparison, the first 300 words in an English-language patent collection of similar size contained only 200,000 different words. The vocabulary in German documents is much larger because of the multiple German suffixes and because of compound word formations. The longest words in the collection consist of chemical compound names, DNA sequences, and German compound words, which account for ~60% of the longest words and do not occur in English-language documents. For example, our dataset contains words such as Datenkommunikationsgeschwindigkeitsänderungsaufforderungsabschnitt (66 characters long). In runs 2 and 3 of Tables 4 and 5, we compare the results of using this raw diverse German vocabulary with those obtained after our word stemming and filtering procedure. In this case, the vocabulary is reduced to about 260,000 words. The k-NN algorithm precision improves slightly due to the stemming. However, the NB algorithm precision rises dramatically, both at class and subclass level, bettering the k-NN results throughout. Such word filtering was not implemented in our earlier study (Fall et al., 2003). With the vocabulary filtering, we exploit the SVM, k-NN-1, and LLSF algorithms in run 3. Note that the SVM algorithm is most computationally demanding, and thus operates on a smaller vocabulary and training set. Because of these limitations, results are disappointing, below those of the NB algorithm. The k-NN-1 algorithm provides very similar results to rainbow k-NN. On the other hand, the LLSF algorithm provides the overall best results for IPC classes and subclasses. Another advantage of the LLSF approach is that there are no parameters to tune after the document indexing is completed. We can make use of the hierarchical nature of the IPC taxonomy to improve automated categorization results. In run 3b in Table 4, we report the results of IPC class predictions obtained from the categorization engines trained to perform at subclass level. The predicted IPC subclasses are
Table 5
Categorization precisions at IPC subclass level using a variety of algorithms, indexing fields, and evaluations of categorization success (NB, k-NN, and SVM: rainbow; k-NN-1 and LLSF: custom package)

Run  Field            Indexing       Evaluation      NB (%)  k-NN (%)  SVM (%)  k-NN-1 (%)  LLSF (%)
1    (a) Claims       (i) All        Top-prediction  31      34        –        –           –
2    (b) Description  (i) All        Top-prediction  31      41        –        –           –
3    (b) Description  (ii) Filtered  Top-prediction  48      42        39       42          56
1    (a) Claims       (i) All        3-Guesses       51      52        –        –           –
2    (b) Description  (i) All        3-Guesses       53      59        –        –           –
3    (b) Description  (ii) Filtered  3-Guesses       70      61        56       56          78
1    (a) Claims       (i) All        All-categories  38      44        –        –           –
2    (b) Description  (i) All        All-categories  39      53        –        –           –
3    (b) Description  (ii) Filtered  All-categories  60      55        53       53          71

For computation reasons, the SVM runs use a smaller vocabulary and a limited number of training examples. For SVM, we also index only the first 150 words of each document (rather than 300).
aggregated into predictions for classes using the taxonomy hierarchy. We observe that this approach usually provides an increase in precision over results obtained by direct categorization at class level, particularly for all results measured with the top-prediction success criterion and for all LLSF results. Not only are the LLSF results best at subclass level, but subclass prediction errors made by the LLSF algorithm are also more likely to fall in the right class than with the other algorithms (as evidenced by the largest increase in precision when aggregating subclass predictions). In this sense too, the LLSF predictions are of higher quality. By using the confidence levels that the categorization algorithms provide for each category prediction, we can choose to classify only a fraction of the test documents, retaining for automatic classification only those documents with high confidence levels and flagging the others for manual treatment. Fig. 2 shows the categorization precision as a function of the fraction of documents classified, obtained by aggregating the LLSF subclass predictions to class level using the taxonomy hierarchy. We filter the documents retained for automated categorization on the basis of the confidence level of the top-category subclass prediction. We note that by lowering the fraction of documents classified, and flagging the others for manual treatment, the precision rises sharply. For 80% of documents classified, this procedure attains over 90% precision when evaluated with the three-guesses criterion. By again making use of the taxonomy hierarchy, we can provide a better technique for filtering class-level predictions aggregated from a subclass expert categorizer. This time we retain for categorization only those documents where the top predicted class, found by aggregating subclass
predictions, matches the top predicted class obtained directly from the classifier trained at class level. This situation arises for 75.5% of all test documents, which sets the upper bound on the fraction of documents that can be filtered in this way (see Fig. 2). We then filter this test subset using the subclass confidence levels. We find that by filtering the documents retained for categorization in this way, we achieve a better precision on the remaining test documents than by filtering directly on confidence levels. Indeed, in Fig. 2, the top-prediction and all-categories precisions improve by over 3% and 4%, respectively, in absolute terms when classifying 75.5% of all test documents. This difference is due solely to a better choice of which documents to classify automatically, not to any improvement in category predictions. This example again highlights the utility of building expert classifiers at several levels of detail within the IPC hierarchy.

We now compare our results with those obtained earlier when classifying English-language patent documents in the IPC using a training and test corpus of similar size (Fall et al., 2003). At class level, we previously obtained 51% (55%) precision with documents in English and the k-NN algorithm (SVM, respectively, with a vocabulary of 20,000 words) when using the top-prediction evaluation. This compares with 50% (51%, with 50,000 words) precision obtained here with German documents. The precisions are very similar in the two languages, with the German results only slightly worse. In German, the SVM algorithm requires a larger vocabulary to provide similar levels of performance. Subclass-level results are more difficult to compare between German and English because of the smaller number of categories used here (367 subclasses, against 424 for English). When comparing
Fig. 2. Class-level IPC categorization precision as a function of the fraction of the German-language documents categorized with the LLSF algorithm. The predictions are obtained by aggregating subclass categorization results to class level (lines). Also displayed (symbols) are precisions obtained only for documents where the subclass predictions aggregated to class level match the direct IPC class predictions (see text for details). Results are shown for the three evaluations of categorization success discussed in the text.
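The subclass-to-class aggregation and confidence filtering behind Fig. 2 rely only on the fact that an IPC subclass code (e.g. G06F) contains its parent class (G06) as a prefix. The following sketch illustrates the idea; the function names, scores, and threshold are invented for illustration, not taken from the authors' system:

```python
# Aggregate per-subclass confidence scores into a class-level prediction,
# and decide whether a document is confident enough to classify
# automatically or should be flagged for manual treatment.

def aggregate_to_class(subclass_scores):
    # Sum the confidences of all subclasses belonging to the same class
    # (the class is the three-character prefix of the subclass code),
    # then return the best-scoring class.
    class_scores = {}
    for code, score in subclass_scores.items():
        class_scores[code[:3]] = class_scores.get(code[:3], 0.0) + score
    return max(class_scores, key=class_scores.get)

def retain_for_automation(subclass_scores, threshold):
    # Classify automatically only when the top subclass confidence is high.
    return max(subclass_scores.values()) >= threshold

def agreement_filter(subclass_scores, direct_class):
    # Stricter filter: keep a document only when the class aggregated from
    # subclass predictions agrees with the directly predicted class.
    return aggregate_to_class(subclass_scores) == direct_class

scores = {"G06F": 0.48, "G06K": 0.22, "H04L": 0.30}
best = aggregate_to_class(scores)           # "G06" (0.48 + 0.22 = 0.70)
keep = retain_for_automation(scores, 0.40)  # True
```

Summing subclass confidences is one plausible aggregation rule; a maximum or weighted combination would fit the same framework.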
these precisions across languages, one should remember that the IPC codes for the German documents were allotted by a single institution (the DPMA), whereas those for our English-language corpus were allotted by a collection of patent authorities worldwide. The latter case could potentially lead to less consistent codes across the collection. A comparison of the confusion matrices in German and English shows that, although the distributions of documents differ, categorization errors are overall similarly distributed, indicating that categorization difficulty is probably related to the nature of the categories rather than to linguistic details. We also find that in both languages the categorization accuracy varies strongly from class to class.

In order to compare the performance of an automated categorizer with that obtained manually, it is instructive to know how consistent the manual categorization is. A fraction of the German patent applications in our collection were also submitted to WIPO under the Patent Cooperation Treaty for protection in several countries worldwide. These documents have been separately classified by the European Patent Office (EPO) in the European Classification (ECLA) system, which is essentially a more detailed version of the IPC, with additional subdivisions within IPC groups. Patents can have multiple ECLA codes, but no main one. By comparing the ECLA codes with the German IPC codes, we can gauge the difficulty and consistency of IPC code assignment. A comparison on 3071 DEPAROM documents shows that 21% of patents have been given exactly the same IPC and ECLA subclasses, 74% have some but not all IPC and ECLA subclasses in common (this includes cases where the ECLA codes are a superset of the IPC codes and vice versa), and 5% have no allotted IPC and ECLA subclasses in common. Similar inconsistencies in IPC use have been noted earlier (Oppenheimer, Jones, & Carpenter, 1978).
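This consistency check reduces to set operations on subclass codes. The sketch below, with invented codes rather than DEPAROM data, illustrates the three cases counted above:

```python
# Classify each document's pair of code sets as identical, overlapping,
# or disjoint at subclass level. ECLA refines the IPC, so each ECLA code
# can be truncated to an IPC subclass before the comparison.

def compare(ipc_subclasses, ecla_subclasses):
    ipc, ecla = set(ipc_subclasses), set(ecla_subclasses)
    if ipc == ecla:
        return "identical"   # exactly the same subclasses (the 21% case)
    if ipc & ecla:
        return "overlap"     # some subclasses in common (the 74% case)
    return "disjoint"        # no subclass in common (the 5% case)

pairs = [
    (["G06F"], ["G06F"]),          # identical
    (["G06F", "H04L"], ["G06F"]),  # overlap (IPC is a superset here)
    (["A61K"], ["C07D"]),          # disjoint
]
results = [compare(i, e) for i, e in pairs]
```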
We thus understand that the attribution of multiple IPC codes is extremely difficult to do manually in a consistent way. In this light, the automated results presented above can be judged most favourably.
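For reference, the LLSF technique that performed best above amounts to fitting a linear map from document term vectors to category indicator vectors by least squares. The following sketch uses a dense NumPy solver on random stand-in data; a full-scale system would use a sparse iterative solver such as LSQR (Paige & Saunders, 1982), and all dimensions and data here are artificial:

```python
import numpy as np

# Minimal LLSF sketch: learn a linear map W from term vectors X to
# category indicators Y by minimising ||XW - Y||, then rank categories
# for a document by the magnitude of its projected scores.

rng = np.random.default_rng(0)
n_docs, n_terms, n_cats = 40, 30, 4

X = rng.random((n_docs, n_terms))          # term weights per document
labels = rng.integers(0, n_cats, size=n_docs)
Y = np.eye(n_cats)[labels]                 # one-hot category targets

W, *_ = np.linalg.lstsq(X, Y, rcond=None)  # fit the linear map

x_new = X[0]                               # score one document
scores = x_new @ W
ranked = np.argsort(scores)[::-1]          # best category first
```

As noted above, once indexing is fixed this approach has no further parameters to tune, which matches its appeal in the experiments.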
6. Conclusions

Categorization in the International Patent Classification is difficult to automate perfectly because of the large number of overlapping categories and the particular categorization rules applied in parts of the taxonomy. Nevertheless, we have seen that computer-assisted classification, where an expert system suggests a small number of possible codes for manual validation, can provide good levels of precision when trained with tens of thousands of natural language full-text examples. This form of semi-automation will be implemented in the expert system WIPO is currently building. Such human support is invaluable when deciding among hundreds of possible categories and should help to overcome the difficulty of handling this complex taxonomy
manually, which is evidenced by the differing codes sometimes attributed when several human experts independently classify the same documents. Our research has shown that the automated categorization of German-language patent documents can achieve very similar results to those obtained in English. The statistical algorithms employed here for the automated categorization of patent documents in the IPC are language-independent, insensitive to word order, avoid the use of humanly selected keywords, and allow systems supporting many different languages to be developed quickly. As most tuning parameters used here are the same as those we employed for English-language documents, we find that language-independent parameters provide an acceptable level of performance. Nevertheless, some language dependence is necessary for document indexing. In order to cope with the naturally larger vocabulary of German documents, very infrequent words should in particular be removed, and German stemming should be used to handle the variety of German suffixes. The linear least-squares fit algorithm we implemented is a promising approach that deserves further investigation. We have shown how the hierarchical nature of the IPC taxonomy can be exploited both for predicting categories and for evaluating the accuracies of those predictions. In doing so, we have highlighted the general importance of building expert categorizers at several layers of detail when a hierarchical taxonomy is employed.

Acknowledgements

The authors would like to thank Antonios Farassopoulos and Mikhail Makarov at WIPO for helpful discussions, Axel Okelmann at DPMA for access to the dataset, Jean-Luc Ernst at EPO for the ECLA codes, and Bob Jones at the University of Exeter for access to computational resources.
References

Adams, S. (2000). Using the international patent classification in an online environment. World Patent Information, 22, 291–300.
Calvert, J., & Makarov, M. (2001). The reform of the IPC. World Patent Information, 23, 133–136.
Chakrabarti, S., Dom, B., Agrawal, R., & Raghavan, P. (1997). Using taxonomy, discriminants, and signatures for navigating in text databases. In M. Jarke, M. J. Carey, K. R. Dittrich, F. H. Lochovsky, P. Loucopoulos, & M. A. Jeusfeld (Eds.), VLDB'97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25–29, Athens, Greece (pp. 446–455). Morgan Kaufmann.
Chakrabarti, S., Dom, B., Agrawal, R., & Raghavan, P. (1998a). Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. VLDB Journal, 7, 163–178.
Chakrabarti, S., Dom, B. E., & Indyk, P. (1998b). Enhanced hypertext categorization using hyperlinks. In L. M. Haas, & A. Tiwary (Eds.), Proceedings of SIGMOD-98, ACM International Conference on Management of Data (pp. 307–318). New York, US: ACM Press.
Deutsches Patent- und Markenamt (Eds.) (1999a). Internationale Patentklassifikation (7th ed.) (Vols. 1–9). Köln: Carl Heymanns KG.
Deutsches Patent- und Markenamt (Eds.) (1999b). Stich- und Schlagwörterverzeichnis zur IPC-Internationale Patentklassifikation (7th ed.). Köln: Carl Heymanns KG.
Fall, C. J., Törcsvári, A., Benzineb, K., & Karetka, G. (2003). Automated categorization in the international patent classification. ACM SIGIR Forum, 37.
Gey, F. C., Buckland, M., Chen, C., & Larson, R. (2001). Entry vocabulary – a technology to enhance digital search. Proceedings of the First International Conference on Human Language Technology (pp. 91–95).
Hull, D., Aït-Mokhtar, S., Chuat, M., Eisele, A., Gaussier, E., Grefenstette, G., Isabelle, P., Samuelsson, C., & Segond, F. (2001). Language technologies and patent search and classification. World Patent Information, 23, 265–268.
Kohonen, T., Kaski, S., Lagus, K., Salojärvi, J., Honkela, J., Paatero, V., & Saarela, A. (2000). Self organization of a massive document collection. IEEE Transactions on Neural Networks, 11, 574–585.
Koster, C. H. A., & Seutter, M. (2003). Taming wild phrases. Proceedings of ECIT03 Conference.
Koster, C. H. A., Seutter, M., & Beney, J. (2001). Classifying patent applications with Winnow. Proceedings of Benelearn 2001 Conference.
Krier, M., & Zaccà, F. (2002). Automatic categorization applications at the European Patent Office. World Patent Information, 24, 187–196.
Larkey, L. S. (1998). Some issues in the automatic classification of US patents. AAAI-98 Working Notes.
Larkey, L. S. (1999). A patent search and classification system. In E. A. Fox, & N. Rowe (Eds.), Proceedings of DL-99, Fourth ACM Conference on Digital Libraries (pp. 179–187). New York, US: ACM Press.
Lewis, D. D., Yang, Y., Rose, T. G., & Li, F. (2003). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, (in press).
Lezius, W., Rapp, R., & Wettler, M. (1998). A freely available morphological analyzer, disambiguator and context sensitive lemmatizer for German. In COLING-ACL (pp. 743–748).
McCallum, A. K. (1996). Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. www.cs.cmu.edu/~mccallum/bow.
Oppenheimer, C., Jones, M. C., & Carpenter, A. M. (1978). Consistency of use of the international patent classification. International Classification, 5, 30–32.
Paige, C. C., & Saunders, M. A. (1982). Algorithm 583: LSQR: Sparse linear equations and least squares problems. ACM Transactions on Mathematical Software, 8, 195–209.
Schewe, S. (1997). Automatische Kategorisierung von Volltexten unter Anwendung von NLP-Techniken. Master's thesis, Fachbereich Informatik, Universität Dortmund.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34, 1–47.
Smith, H. (2002). Automation of patent classification. World Patent Information, 24, 269–271.
Teichert, T., & Mittermayer, M.-A. (2002). Text mining for technology monitoring. Proceedings of the 2002 IEEE International Engineering Management Conference, IEMC-2002 (pp. 596–601).
Vachon, T., Grandjean, N., & Parisot, P. (2001). Interactive exploration of patent data for competitive intelligence: Applications in Ulix (Novartis Knowledge Miner). Proceedings of the International Chemical Information Conference.
WIPO (1999). Guide, Survey of Classes and Summary of Main Groups (7th ed.) (Vol. 9). Geneva: World Intellectual Property Organization.
Zakaria, S. (1989). Automated patent classification for German patent documents. PhD thesis, University of Saarland, Germany.