A patent examining expert system using pattern recognition


Expert Systems with Applications 38 (2011) 4302–4311

Contents lists available at ScienceDirect

Expert Systems with Applications journal homepage: www.elsevier.com/locate/eswa

Sang-Sung Park, Won-Gyo Jung, Young-Geun Shin, Dong-Sik Jang ⇑

Division of Information Management Engineering, Korea University, 1, 5-Ka, Anam-dong Sungbuk-ku, Seoul 136-701, Republic of Korea

Article info

Keywords: Patent Examining System Pattern recognition Database Text mining

Abstract

Whether a patent will be registered or rejected is usually judged subjectively by patent examiners and patent attorneys. To overcome this drawback, we propose an algorithm that automatically predicts the outcome of a patent examination from objective patent data. The proposed system is composed of three procedures: Weight Value Selection, Rejection Criterion Value Selection, and Prediction. In Weight Value Selection, the algorithm extracts the core words that carry the main content of rejected patent documents. It computes the appearance rate of each word in a document, keeps the words whose appearance rate is at least the document’s average appearance rate, and merges the core words of all documents into an integration core word database. In Rejection Criterion Value Selection, the algorithm extracts the core words from other patent documents that were not used to generate the integration core word database, and computes each document’s similarity value from the extracted core words and the weights in the integration core word database. Each document’s similarity value, together with its known result (registration or rejection), is then used as input to pattern recognition algorithms such as K-means, Perceptron, and Regularized Discriminant Analysis, which set the boundary value between the two classes. In the third procedure, Prediction, the algorithm extracts the core words from the patent document to be predicted, computes its similarity value using the weights from the first procedure, and compares it with the boundary value from the second procedure to predict acceptance or rejection.

The proposed Automated Patent Examining System derives objective prediction results from past patent data, so that one does not have to rely on the subjective judgments of patent examiners and patent attorneys to know whether a patent will be registered. © 2010 Elsevier Ltd. All rights reserved.

1. Introduction

A patent right is an exclusive right granted to an inventor for a certain period in order to compensate the inventor’s efforts, research expenses, and time involved in creating an invention (Archibugi & Pianta, 1996). A person who desires to obtain patent rights must submit a patent application, including a detailed technical description, and request an examination by the Korea Intellectual Property Office (KIPO). A registered patent examiner then decides whether to accept or reject the application (Korea Software Copyright Committee, 2007). With patents a company can create economic benefits, but it can also suffer economic losses due to patent infringement suits (Korea Intellectual Property Office, 2007). Centocor Inc. is a good example of this. Centocor is a subsidiary of Johnson & Johnson (J&J) which instituted a suit against Abbott Laboratories in 2009. A federal jury in Texas ruled that Abbott Laboratories should pay $1.67 billion to J&J for copying its blockbuster rheumatoid arthritis drug Remicade. J&J’s Centocor subsidiary alleged that Abbott’s Humira was made with exclusive J&J technology licensed under a patent (Centocor Inc., 2009). This example shows that when a company develops technology, it should carefully determine whether the technology infringes on other patent rights in order to avoid economic losses (Bédécarrax & Huot, 1994).

As a company’s patents become an important means of profit generation, companies are investing more resources to establish a patent strategy. Patents strengthen not only the company: because they are used as indicators of national technological competitiveness, they also aid the country and the development of national policies (Lai & Wu, 2005). Inventors are also applying for more patents every year, which further increases the importance of patents (Fattori, Pedrazzi, & Turra, 2003). Despite the many applications, fewer than 50% are registered after patent examination (Lent, Agrawal, & Srikant, 1997).

⇑ Corresponding author. Tel.: +82 2 3290 3900; fax: +82 2 953 4750. E-mail addresses: [email protected] (S.-S. Park), [email protected] (W.-G. Jung), [email protected] (Y.-G. Shin), [email protected] (D.-S. Jang). 0957-4174/$ - see front matter © 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2010.09.099


Research and development of specific patentable technologies consume a great deal of manpower and capital, which a company loses if the patent application is refused. To prevent the economic losses caused by patent infringement or failed patent examinations, companies usually search patent information databases using combinations of core words (Clifton & Cooley, 1999). This approach is passive and depends on the subjective judgments of patent experts. For these reasons, we suggest an Automated Patent Examining System which can automatically predict whether a patent will be accepted or rejected. The results are derived from past patent data, so that we do not have to rely on subjective judgments. In addition, the proposed system helps to remedy the passive method of patent examination. The proposed algorithm is composed of three procedures: Weight Value Selection, Rejection Criterion Value Selection, and Prediction. Based on the results of the prediction, we can tell whether a patent document is similar to rejected patent documents, and if so predict that it has a higher possibility of being rejected during patent examination.

This paper is organized as follows. Section 1 gives an overview of existing studies and describes the necessity of this study. Section 2 briefly describes the trends in related literature. Section 3 describes the proposed algorithm. Section 4 describes empirical tests using real-world data sets and summarizes the results. Finally, Section 5 concludes this study and outlines directions for future study.

2. Related literature

Analysis of the technical content of patent data, such as the words used in patents, citations of earlier applications, and the scope of accepted applications, can predict changing technology trends and the advent of new technology. Yoon and Park (2004) extracted the main words that explain the properties of patents from previous patent citations by data mining and then estimated the relevance between patents. The links are the main words extracted from patents, and the duplication rate and link distance of the main words are processed by existing conversion techniques. As a result, patent impact, technology span, and technical relevance between companies that hold patents can be evaluated using general patent indexes such as the Current Impact Index (CII) (Yoon & Park, 2004). Lee, Ahn, and Lee (2003) analyzed the quantity and quality of Korean patents using a patent index derived from the CII, and the level of Korean technology innovation from 1980 to 2001 based on the frequency of corresponding patents (Lee et al., 2003). Yoo, Lee, and Won (2006) used dictionary definition classification to measure the life span that a specific technology might have. They classified common technology groups according to the technology information of specific patents and then estimated a technology life cycle (Yoo et al., 2006). Black and Ciccolo (2004) applied machine learning to text classification of United States patent information to automatically identify patents relating to the biotech industry (Black & Ciccolo, 2004). Tseng, Lin, and Lin (2007) made a patent map and analyzed each technology using text mining. They not only automated the whole process of creating a final patent map for topic analyses, but also facilitated or improved other patent analysis tasks such as patent classification, organization, knowledge sharing, and prior art searches (Tseng et al., 2007).

Researchers have extensively studied patent maps, patent citation indexes, and patent technology spans. However, there have been insufficient studies on the likelihood that a patent application is accepted. As noted in the introduction, companies are very interested in preventing economic losses from rejected patent applications. For these reasons, this paper suggests an Automated Patent Examining System which can automatically predict whether a patent will be registered or rejected.

3. The proposed algorithm

The proposed algorithm is composed of three procedures: Weight Value Selection, Rejection Criterion Value Selection, and Prediction. The flow chart in Fig. 1 describes these three procedures. Each procedure has an array frequency step, a word removal step, an appearance rate step, and a core word extraction step. In the array frequency step, the algorithm arranges words by frequency. In the word removal step, it eliminates unnecessary words. In the appearance rate step, it computes the appearance rate of each word in a document. In the core word extraction step, it removes the words whose appearance rate is below the document’s average. These steps are applied in the same way in each procedure.

In Weight Value Selection, the first procedure, the algorithm extracts the core words, which carry the main content of a document, based on the appearance rate. The appearance rate is the appearance probability of each word in a rejected patent document, and the extraction criterion is the average appearance rate, which is the reciprocal of the number of words in that document. The algorithm then generates the integration core word database by merging the core words extracted from each document and selecting a weight for each word.

In Rejection Criterion Value Selection, the second procedure, the algorithm extracts the core words from patent documents that were not used to generate the integration core word database. It then computes each document’s similarity value from the extracted core words and the weights in the integration core word database. After that, the algorithm takes each document’s similarity value as an input value and the registration (0) or rejection (1) of each patent document as an output value, and selects a Rejection Criterion Value, which is the boundary value between the two classes. The Rejection Criterion Value is selected by running pattern recognition algorithms such as K-means, Perceptron, and Regularized Discriminant Analysis (RDA). The Weight Value Selection and Rejection Criterion Value Selection procedures thus yield the similarity value, which is the feature value of a patent document. Based on the similarity value, the prediction procedure predicts whether a patent will be accepted or rejected.

In the third procedure, Prediction, the algorithm extracts the core words from the patent document to be predicted. It computes the similarity value from the extracted core words and the weights of the integration core word database from the first procedure, and compares it with the rejection criterion value from the second procedure to determine whether the patent will be accepted or rejected.

3.1. Weight Value Selection

3.1.1. Array frequency

The patent documents, converted to text files, are arrayed by word frequency, and each document is stored separately. If there are documents 1 to m, y denotes each patent document, that is, y = (1, 2, 3, 4, ..., m). The higher the frequency of a word, the more likely it is that the word is part of the document’s core content. The database screen showing the extracted word frequencies of hard disk technology documents is shown in Fig. 2. In the figure, ‘did’ represents the document’s unique ID, ‘word’ represents the extracted word, ‘cnt’ represents the frequency of the word within a document, and ‘m_cnt’ represents the appearance rate of the extracted word.

Fig. 1. The flow chart for proposed algorithm.

Fig. 2. DB of extracted word frequency.

3.1.2. Word removal

Unnecessary words are eliminated in this step. Fig. 3 shows an example of an array by word frequency, in which insignificant words such as ‘the’, ‘a’, and ‘of’ rank high. Therefore, in this second step the algorithm must eliminate unnecessary, non-technological words. For example, articles such as ‘a’ and ‘the’; prepositions such as ‘in’, ‘with’, and ‘of’; and words which are commonly used in patent applications such as ‘invention’, ‘claim’, and ‘fig’ must be eliminated. Table 1 shows the word removal criteria in this procedure. All words with a frequency of 1 are also eliminated. Fig. 3 shows the database screen that lists unnecessary words.

Fig. 3. Example of array by word frequency.

Table 1. Example of word removal criteria.

| Elimination criterion | Example |
|---|---|
| Article | a, an, the, etc. |
| Conjunction | and, but, so, etc. |
| Preposition | in, the, at, with, etc. |
| Numeral | one, two, three, etc. |
| Other noun (including pronoun) | it, this, invention, bluetooth, etc. |
| Other unnecessary words | same, all, easier, etc. |

3.1.3. Appearance rate and core word extraction

In this step the algorithm computes the appearance rate, which is each word’s appearance probability in the document. It then extracts the core words, which carry the main content of the document. The algorithm for computing the appearance rate and extracting core words is as follows:

y = Set of patent documents
i = Set of words in document y

Step 1. $P_{iy} = \dfrac{W_{iy}}{\sum_{i=1}^{n} W_{iy}}$ (i = 1, 2, 3, ..., n)
  $P_{iy}$ = Word frequency rate of i in document y
  $W_{iy}$ = Word frequency of i in document y
Step 2. $C_y = \dfrac{1}{n}$
  n = the number of whole words in document y
  $C_y$ = Extracting criterion value of core word in document y
Step 3. Eliminate every word for which $P_{iy} < C_y$. The $P_{iy}$ values of the remaining words are denoted $\bar{P}_{iy}$.

To find the appearance rate of word i, divide the frequency of the word by the sum of all word frequencies (Step 1). For example, Eq. (1) is the appearance rate $P_{1y}$ of ‘AAA’, which is the first word in Table 2:

$$P_{1y} = \frac{90}{90 + 80 + 70 + \cdots + X} \tag{1}$$

Table 2. Example of word extraction using criterion value.

| i | Word | Frequency ($W_{iy}$) | Appearance rate ($P_{iy}$) |
|---|---|---|---|
| 1 | AAA | 90 | 0.4 |
| 2 | BBB | 80 | 0.3 |
| 3 | CCC | 70 | 0.2 |
| 4 | DDD | 60 | 0.17 |
| ... | ... | ... | ... |
| n | ZZZ | $W_{iy}$ | $P_{iy}$ |
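Steps 1 to 3 of the core word extraction algorithm can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' PHP implementation: the function name `extract_core_words`, the dictionary representation of a document, and the sample frequencies are assumptions made for the example, and the input is assumed to have already passed the word removal step.

```python
from typing import Dict


def extract_core_words(freqs: Dict[str, int]) -> Dict[str, float]:
    """Return {word: appearance rate} for the core words of one document.

    `freqs` maps each word to its frequency W_iy (hypothetical input format).
    """
    total = sum(freqs.values())
    n = len(freqs)
    if total == 0 or n == 0:
        return {}
    # Step 1: appearance rate P_iy = W_iy / (sum of all W_iy in the document).
    rates = {word: count / total for word, count in freqs.items()}
    # Step 2: criterion value C_y = 1 / n, i.e. the average appearance rate.
    criterion = 1.0 / n
    # Step 3: eliminate words with P_iy < C_y; the survivors are the core words.
    return {word: p for word, p in rates.items() if p >= criterion}


# Hypothetical word frequencies for a single document.
doc = {"disk": 6, "head": 3, "servo": 1}
core = extract_core_words(doc)
print(core)
```

With three words the criterion is 1/3, so only the word whose rate is at least one third survives. Keeping words whose rate equals the criterion follows Step 3, which eliminates only the words strictly below it.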

To extract the core words that make up the content of the document, Step 2 finds the criterion value for core word extraction. The criterion value $C_y$ is set to the average appearance rate, which is the reciprocal of the number of words (n) in the document. A word with an above-average proportion in the document is considered part of the core content of the document. Step 3 eliminates the words for which $P_{iy} < C_y$, using the criterion value $C_y$ obtained in Step 2. From then on, the appearance rate of each word that remains in document y is written as $\bar{P}_{iy}$. For example, suppose the word frequencies and appearance rates are as in Table 2. If the criterion value $C_y$ is 0.18, the appearance rates $\bar{P}_{iy}$ are 0.4, 0.3, and 0.2, which are greater than 0.18, and the core words of this document are ‘AAA’, ‘BBB’, and ‘CCC’, which have these values.

3.1.4. Generating the integration core word database

In this step, the algorithm collects the patent documents which have passed through the core word extraction step, constructs the database, and selects the weight of each core word. Namely, the core words that survive the $P_{iy} < C_y$ elimination, together with their appearance rates $\bar{P}_{iy}$ in the rejected documents, are used to generate the integration core word database. In this study, the appearance rate of a core word in the integration core word database serves as the weight of that core word, written as $V_i$. The algorithm for generating the weights of the integration core word database ($V_i$) is as follows:

y = Set of rejected patent documents
i = Set of words in document y

Step 1. $\bar{P}_{iy}$ = The result of the core word extraction algorithm
Step 2. $V_i = \dfrac{\sum_{y=1}^{m} \bar{P}_{iy}}{R_i}$ (y = 1, 2, ..., m)
  $V_i$ = Weight of word i
  $R_i$ = Number of times core word i is duplicated across documents
Step 3. If there is no duplicated core word, $V_i = \bar{P}_{iy}$

Table 3. Example of duplicated word.

| Document (y) | Word | $\bar{P}_{iy}$ |
|---|---|---|
| 1 | AAA | 0.5 |
| 2 | AAA | 0.3 |
| 3 | AAA | 0.1 |
| 4 | BBB | 0.2 |


Fig. 4. Integration core word DB.

The appearance rate $\bar{P}_{iy}$ is obtained from the core word extraction algorithm in Step 1. When the integration core word database is constructed, many duplicated words arise because the core words of different patent documents are merged. Step 2 resolves these duplicates by setting a single weight for each core word: if a core word is duplicated in the database, its weight is the average of its appearance rates. If a core word is not duplicated in the integration core word database, its $\bar{P}_{iy}$ value is used directly as its weight $V_i$ (Step 3). For example, Table 3 contains a duplicated word in the integration core word database. The word ‘AAA’ appears three times, so $R_i$ is 3, and its $\bar{P}_{iy}$ values are 0.5, 0.3, and 0.1 for documents y = 1, 2, 3. Therefore, the weight ($V_i$) of the word ‘AAA’ can be obtained by Eq. (2):

$$V_i = \frac{\bar{P}_{i1} + \bar{P}_{i2} + \bar{P}_{i3}}{3} = \frac{0.5 + 0.3 + 0.1}{3} = 0.3 \tag{2}$$
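The weight computation of Eq. (2) amounts to averaging the appearance rates of each core word over the documents in which it occurs. The following Python sketch illustrates this under stated assumptions: the function name `build_weight_db` and the list-of-dictionaries input are hypothetical, and the sample data reproduce Table 3.

```python
from collections import defaultdict
from typing import Dict, List


def build_weight_db(docs: List[Dict[str, float]]) -> Dict[str, float]:
    """Merge per-document core words into {word: weight V_i}.

    Each element of `docs` maps the core words of one rejected document to
    their appearance rates (hypothetical input format).
    """
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for core_rates in docs:
        for word, rate in core_rates.items():
            sums[word] += rate      # numerator of Step 2
            counts[word] += 1       # R_i: number of documents containing the word
    # A word found in a single document has R_i = 1, so V_i equals its own
    # appearance rate, which matches Step 3.
    return {word: sums[word] / counts[word] for word in sums}


# The four rows of Table 3, one dictionary per document.
table3 = [{"AAA": 0.5}, {"AAA": 0.3}, {"AAA": 0.1}, {"BBB": 0.2}]
weights = build_weight_db(table3)
print(weights)
```

Treating Step 3 as the special case R_i = 1 of Step 2 keeps the sketch to a single code path.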

Fig. 4 shows the integration core word database constructed from hard disk technology patents as a result of the integration core word database generation step. In the figure, ‘word’ represents a core word and ‘m_cnt’ represents the weight of each word.

3.2. Rejection Criterion Value Selection

The procedure for selecting a rejection criterion value uses the results of applying the core word extraction algorithm to patent documents which were not used to construct the integration core word database.

3.2.1. Selecting the similarity value algorithm

The algorithm for selecting the similarity value is as follows:

y = Patent documents not used to construct the integration core word database
i = Words duplicated between the integration core word database and y
$V_i$ = Weight of each word in the integration core word database

Step 1. Obtain each $\bar{P}_{iy}$ value by applying the core word extraction algorithm to the patent documents not used to construct the integration core word database
Step 2. Select the $\bar{P}_{iy}$ values of the duplicated words and their weights $V_i$ by comparing with the integration core word database
Step 3. $S_y = \sum_{i=1}^{n} (V_i \times \bar{P}_{iy})$
  $V_i$ = Weight of word i
  $S_y$ = Similarity value of document y

The similarity value is the sum, over the duplicated words, of $\bar{P}_{iy}$ (the result of applying the core word extraction algorithm to a patent document not used to construct the integration core word database) multiplied by $V_i$ (the weight in the integration core word database). For example, suppose the integration core word database and the comparison patent document are as in Table 4. The words duplicated between them are ‘BBB’ and ‘DDD’, and the similarity value is given by Eq. (3):

$$S_y = (V_2 \times \bar{P}_{1y}) + (V_4 \times \bar{P}_{3y}) = (0.6 \times 0.4) + (0.3 \times 0.25) = 0.315 \tag{3}$$

Table 4. Example of comparing similarity value.

Integration core word database

| i | Word | $V_i$ |
|---|---|---|
| 1 | AAA | 0.5 |
| 2 | BBB | 0.6 |
| 3 | CCC | 0.2 |
| 4 | DDD | 0.3 |
| ... | ... | ... |

Comparison patent (y)

| i | Word | $\bar{P}_{iy}$ |
|---|---|---|
| 1 | BBB | 0.4 |
| 2 | EEE | 0.3 |
| 3 | DDD | 0.25 |
| 4 | FFF | 0.2 |
| ... | ... | ... |

Fig. 5. Selecting similarity value from each document.

Fig. 5 shows a database screen with the similarity value of each document related to hard disk technology. In the figure, ‘did’ represents the document’s unique ID, ‘r_flag’ represents either registration (0) or rejection (1), ‘m_cnt’ represents the similarity value, and ‘file1’ represents the file name of the patent document.
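The similarity value of Step 3 is a weighted sum over the words that a document shares with the integration core word database. A minimal Python sketch follows, using the values of Table 4; the function name `similarity` and the dictionary inputs are assumptions made for illustration.

```python
from typing import Dict


def similarity(weights: Dict[str, float], core_rates: Dict[str, float]) -> float:
    """S_y: sum of V_i * P_iy over words present in both dictionaries."""
    return sum(
        weights[word] * rate
        for word, rate in core_rates.items()
        if word in weights  # only words duplicated in the database contribute
    )


# Values taken from Table 4.
weights = {"AAA": 0.5, "BBB": 0.6, "CCC": 0.2, "DDD": 0.3}
comparison = {"BBB": 0.4, "EEE": 0.3, "DDD": 0.25, "FFF": 0.2}
s_y = similarity(weights, comparison)
print(round(s_y, 3))
```

Only ‘BBB’ and ‘DDD’ occur in both dictionaries, so the two products of Eq. (3) are the only terms in the sum.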

3.2.2. Rejection Criterion Value Selection

To select the rejection criterion value, we used the K-means clustering algorithm, RDA, and the Perceptron algorithm, compared them, and selected the algorithm which produced the best result. Section 4 describes the experimental results of each algorithm separately. For a better understanding, the K-means algorithm is used here as an example to explain the procedure of selecting the rejection criterion value. The number of clusters k is set to 2, one cluster for registration and one for rejection. The input data are $S_y$, the similarity value of each patent document, and 0 or 1, the result indicating whether the patent document was registered or rejected. The output is the boundary between the registration cluster and the rejection cluster, and this boundary value becomes the rejection criterion value. Table 5 shows an example of input data for K-means. Registration and rejection are indicated with 0 and 1, respectively.

Table 5. Input data of K-means.

| | Doc 1 | Doc 2 | Doc 3 | Doc n |
|---|---|---|---|---|
| Similarity value | 3.13 | 8.56 | 4.34 | ... |
| Registration or rejection | 0 | 1 | 0 | ... |

3.3. Prediction

In the prediction procedure, the algorithm predicts the possibility of registration by passing the patent document through the core word extraction algorithm and the similarity value selection algorithm, and comparing the result with the rejection criterion value selected in the earlier procedure. The algorithm for prediction is as follows:

y = Set of patent documents
i = Set of words in document y

Step 1. $P_{iy} = \dfrac{W_{iy}}{\sum_{i=1}^{n} W_{iy}}$ (i = 1, 2, 3, ..., n)
  $P_{iy}$ = Word frequency rate of i in document y
  $W_{iy}$ = Word frequency of i in document y
Step 2. $C_y = \dfrac{1}{n}$
  n = the number of whole words in document y
  $C_y$ = Extracting criterion value of core word in document y
Step 3. If $P_{iy} < C_y$, eliminate the word. The $P_{iy}$ values of the remaining words are denoted $\bar{P}_{iy}$.
Step 4. Select the $\bar{P}_{iy}$ values of the duplicated words and their weights $V_i$ by comparing with the integration core word database
Step 5. $S_y = \sum_{i=1}^{n} (V_i \times \bar{P}_{iy})$
  $V_i$ = Weight of word i
  $S_y$ = Similarity value of document y
Step 6. Compare the rejection criterion value C, obtained in the Rejection Criterion Value Selection procedure, with the similarity value $S_y$
  If (C > $S_y$) registration
  Else if (C < $S_y$) rejection
  Else impossible to predict
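The decision rule of Step 6 can be written as a three-way comparison. The sketch below is illustrative only: the function name `predict`, the returned labels, and the criterion value 6.0 are assumptions, while the similarity values 3.13 and 8.56 are taken from Table 5.

```python
def predict(similarity_value: float, criterion: float) -> str:
    """Apply Step 6: compare the similarity value S_y with the criterion C."""
    if similarity_value < criterion:
        return "registration"
    if similarity_value > criterion:
        return "rejection"
    # S_y equal to C: the rule gives no prediction.
    return "undetermined"


# Hypothetical rejection criterion value, chosen only for this example.
criterion = 6.0
print(predict(3.13, criterion))
print(predict(8.56, criterion))
```

A document whose similarity to rejected patents lies below the criterion is predicted to be registered, and one above it is predicted to be rejected.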

In order to determine whether a patent document will be registered, the algorithm computes the similarity value of the document by applying the core word extraction algorithm and the similarity value selection algorithm. It then compares the document’s similarity value with the rejection criterion value. If the similarity value is greater than the rejection criterion value, rejection is predicted; if it is smaller, the possibility of registration is considered high. If the similarity value and the rejection criterion value are the same, it is impossible to predict registration. The similarity value $S_y$ measures the degree to which the document in question resembles rejected patent documents. A high similarity value $S_y$ therefore means that the patent document contains many features associated with rejection.

Fig. 6. Automated Patent Examining System.

Fig. 7. The rejection criterion value of Bluetooth technology by using K-means, RDA, and Perceptron.
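To make the K-means example of Section 3.2.2 concrete, the following Python sketch clusters one-dimensional similarity values into two groups and returns a boundary between them. Several points are assumptions rather than the authors' method: the paper used a Matlab toolbox, whereas this is a plain two-cluster Lloyd iteration; the centroids are initialised at the minimum and maximum values; the boundary is taken as the midpoint of the two centroids; and the 0/1 labels are not used, because K-means itself is unsupervised. The function name `kmeans_boundary` is hypothetical.

```python
from typing import Sequence


def kmeans_boundary(values: Sequence[float], max_iter: int = 100) -> float:
    """Two-cluster 1-D K-means; returns the midpoint between the centroids."""
    if len(values) < 2 or min(values) == max(values):
        raise ValueError("need at least two distinct similarity values")
    # Assumed initialisation: one centroid at each end of the data range.
    low, high = min(values), max(values)
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        low_cluster = [v for v in values if abs(v - low) <= abs(v - high)]
        high_cluster = [v for v in values if abs(v - low) > abs(v - high)]
        # Update step: centroids move to the mean of their cluster.
        new_low = sum(low_cluster) / len(low_cluster)
        new_high = sum(high_cluster) / len(high_cluster)
        if new_low == low and new_high == high:
            break  # converged
        low, high = new_low, new_high
    # Assumed boundary rule: halfway between the two centroids.
    return (low + high) / 2.0


# Similarity values of Doc 1 to Doc 3 in Table 5.
boundary = kmeans_boundary([3.13, 8.56, 4.34])
print(round(boundary, 4))
```

For these three values the low cluster is {3.13, 4.34} and the high cluster is {8.56}, so the boundary falls between the registered and rejected examples of Table 5.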

4. Experiment and result

The described algorithm is implemented as a web-based program written in the server-side scripting language PHP, using MySQL for the database. The program’s interface is shown in Fig. 6. The program performs the word frequency extraction step by retrieving the patent documents from the server, the step of constructing the integration core word database, and the step of selecting the similarity value by comparing each document with the integration core word database. In order to select a rejection criterion value, the K-means, RDA, and Perceptron algorithms were used. These algorithms were not implemented by us but were taken from the Classifier Toolbox of Matlab.

4.1. Data acquisition

The word database is constructed from 200 registered patent documents and 200 rejected patent documents on each of four different topics: Bluetooth, solar cells, LCDs, and hard disks. The total of 1600 documents is from the US patent process. The integration core word database was constructed from 100 documents, and the rejection criterion value was created by comparing these documents to 50 other documents.

There are a total of 140,216 words in the 200 patent documents about Bluetooth technology, and the integration core word database constructed for that subject using the algorithm described above consisted of 2411 core words. There are a total of 173,145 words in the 200 patent documents about solar cell technology, and the corresponding integration core word database consisted of 3636 core words. There are a total of 198,564 words in the 200 patent documents about LCD technology, and the corresponding integration core word database consisted of 5284 core words. There are a total of 132,481 words in the 200 patent documents about hard disk technology, and the corresponding integration core word database consisted of 2356 core words.

4.2. Bluetooth technology

Fig. 7 shows the rejection criterion value for documents dealing with Bluetooth technology, determined using the K-means, RDA, and Perceptron algorithms. The integration core word database was constructed from 100 documents, and the rejection criterion value was selected by comparing these documents with 50 new documents. Table 6 shows the predicted result of registration or rejection for the 50 new patent documents according to the rejection criterion value. As shown in Table 6, the predicted rate was 70% for K-means, 76% for RDA, and 76% for Perceptron.

Table 6. Predicted rate for Bluetooth technology.

| Algorithm | Test data | Success | Failure | Predicted rate (%) |
|---|---|---|---|---|
| K-means | 50 | 35 | 15 | 70 |
| RDA | 50 | 38 | 12 | 76 |
| Perceptron | 50 | 38 | 12 | 76 |

Fig. 8. The rejection criterion value of solar cell technology by using K-means, RDA, and Perceptron.

Table 7. Predicted rate for solar cell technology.

| Algorithm | Test data | Success | Failure | Predicted rate (%) |
|---|---|---|---|---|
| K-means | 50 | 38 | 12 | 76 |
| RDA | 50 | 41 | 9 | 82 |
| Perceptron | 50 | 42 | 8 | 84 |

4.3. Solar cell technology

Fig. 8 shows the rejection criterion value for solar cell technology, determined using the K-means, RDA, and Perceptron algorithms. The integration core word database was constructed from 100 documents, and the rejection criterion value was selected by comparing these documents with 50 new documents. Table 7 shows the predicted result of registration or rejection for the 50 new patent documents on solar cell technology according to the rejection criterion value. As shown in Table 7, the predicted rate was 76% for K-means, 82% for RDA, and 84% for Perceptron.

4.4. LCD technology

Fig. 9 shows the rejection criterion value for LCD technology, determined using the K-means, RDA, and Perceptron algorithms. The integration core word database was constructed from 100 documents, and the rejection criterion value was selected by comparing these documents with 50 new documents. Table 8 shows the predicted result of registration or rejection for the 50 new patent documents according to the rejection criterion value. The predicted rate was 78% for K-means, 84% for RDA, and 88% for Perceptron.

Fig. 9. The rejection criterion value of LCD technology by using K-means, RDA, and Perceptron.

Table 8. Predicted rate for LCD technology.

| Algorithm | Test data | Success | Failure | Predicted rate (%) |
|---|---|---|---|---|
| K-means | 50 | 39 | 11 | 78 |
| RDA | 50 | 42 | 8 | 84 |
| Perceptron | 50 | 44 | 6 | 88 |

4.5. Hard disk technology

Fig. 10 shows the rejection criterion value for hard disk technology, determined using the K-means, RDA, and Perceptron algorithms. The integration core word database was constructed from 100 documents, and the rejection criterion value was selected by comparing these documents with 50 new documents. Table 9 shows the predicted result of registration or rejection for the 50 new patent documents on hard disk technology according to the rejection criterion value selected by K-means, RDA, and Perceptron. As shown in Table 9, the predicted rate was 74% for K-means, 86% for RDA, and 92% for Perceptron.

Fig. 10. The rejection criterion value of hard disk technology by using K-means, RDA, and Perceptron.

Table 9. Predicted rate for hard disk technology.

| Algorithm | Test data | Success | Failure | Predicted rate (%) |
|---|---|---|---|---|
| K-means | 50 | 37 | 13 | 74 |
| RDA | 50 | 43 | 7 | 86 |
| Perceptron | 50 | 46 | 4 | 92 |

4.6. Comparison of experiments by each technology

In this paper, the K-means, RDA, and Perceptron algorithms were each used to generate the rejection criterion value, and their predicted rates were compared. As shown in Table 10, RDA and Perceptron shared the highest predicted rate for Bluetooth technology, and Perceptron had the highest predicted rate for solar cell, LCD, and hard disk technology. Among the four technologies, hard disk technology had the highest average predicted rate (84%), and the average predicted rate over the four technologies was 80.5%.

Table 10. Predicted rate for each technology.

| Algorithm | Bluetooth (%) | Solar cell (%) | LCD (%) | Hard disk (%) | Average of predicted rate (%) |
|---|---|---|---|---|---|
| K-means | 70 | 76 | 78 | 74 | 74.5 |
| RDA | 76 | 82 | 84 | 86 | 82 |
| Perceptron | 76 | 84 | 88 | 92 | 85 |
| Average of predicted rate | 74 | 80.7 | 83.3 | 84 | 80.5 |

5. Conclusion

This paper proposed an Automated Patent Examining System using pattern recognition techniques to predict registration or rejection of a patent. The proposed algorithm assumes that rejected patent documents share features associated with rejection, and applies those characteristics as criteria for examination. We generated weights by extracting the word frequencies of rejected patent documents through text preprocessing and building up the integration core word database. After that, we computed each document’s similarity value by comparing it with the integration core word database. We used each document’s similarity value and its result (registration or rejection) as input values, then selected a rejection criterion value using the K-means, Perceptron, and RDA algorithms. In the experiments on Bluetooth, solar cell, LCD, and hard disk patent applications, Perceptron showed the highest predicted rate. The average predicted rate over the four technologies was 80.5%, which indicates that the proposed system can predict registration or rejection of a patent.

The proposed Automated Patent Examining System derives objective prediction results from past patent data, so that one does not have to rely on the subjective judgments of patent examiners and patent attorneys to know whether a patent will be registered. The system is expected to help a company that intends to profit from developing new technology and manufacturing products to assess the investment value of a patent in advance. In addition, it can keep the company from applying for patents that have a low possibility of being registered, and thereby reduce the economic losses associated with patent applications. We will continue this study to improve the selection of rejection criterion values by comparing various other algorithms and by running more experiments on different technologies.

Acknowledgements

This work was supported by the Brain Korea 21 Project in 2010. Supported by a Korea University Grant. This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (20100024163).

References

Archibugi, D., & Pianta, M. (1996). Measuring technological change through patents and innovation survey. Technovation, 16(9), 451–468.
Bédécarrax, C., & Huot, C. (1994). A new methodology for systematic exploitation of technology databases. Information Processing and Management, 30(3), 407–418.
Black, D., & Ciccolo, P. (2004). Machine learning for patent classification. Camus. Available from http://www.stanford.edu/class/cs229/proj2005/BlackCiccoloMachineLearningForPatentClassification.pdf.
Centocor Inc. (2009). v. Abbott Laboratories case number 07-cv-00139. US District Court for the Eastern District of Texas.
Clifton, C., & Cooley, R. (1999). TopCat: Data mining for topic identification in a text corpus. In Proceedings of the 3rd European conference of principles and practice of knowledge discovery in databases, Prague, Czech Republic.
Fattori, M., Pedrazzi, G., & Turra, R. (2003). Text mining applied to patent mapping: A practical business case. World Patent Information, 25, 335–342.
Korea Intellectual Property Office (2007). Patent and information analysis (pp. 46–58).
Korea Software Copyright Committee (2007). Patent Troll and SW patent. Institute for Information Technology Advancement, Science Information (pp. 29–30).
Lai, K.-K., & Wu, S.-J. (2005). Using the patent co-citation approach to establish a new patent classification system. Information Processing and Management, 41(2), 313–330.
Lee, W. H., Ahn, G. J., & Lee, M. H. (2003). Technology innovation in Korea through patent citation analysis (Vol. 1). KIIE (pp. 1007–1013).
Lent, B., Agrawal, R., & Srikant, R. (1997). Discovering trends in text databases. In Proceedings of international conference on knowledge discovery and data mining, Newport Beach, California, USA (pp. 14–17).
Tseng, Y. H., Lin, C. J., & Lin, Y. I. (2007). Text mining techniques for patent analysis. Information Processing and Management, 43, 1216–1247.
Yoon, B., & Park, Y. (2004). A text-mining-based patent network: Analytical tool for high-technology trend. Journal of High Technology Management Research, 15, 37–50.
Yoo, S. H., Lee, Y. H., & Won, D. G. (2006). A study on estimation of technology life span using analysis of patent citation. Information Management Research, 35, 93–112.