Information Processing and Management 41 (2005) 217–242 www.elsevier.com/locate/infoproman
Information extraction with automatic knowledge expansion

Hanmin Jung *, Eunji Yi, Dongseok Kim, Gary Geunbae Lee

Department of Computer Science and Engineering, Pohang University of Science and Technology, San 31, Hyoja-dong, Nam-gu, Pohang, Kyungbuk 790-784, South Korea

Received 28 February 2003; accepted 15 July 2003; available online 6 September 2003

doi:10.1016/S0306-4573(03)00066-9

* Corresponding author. Tel.: +82-54-279-5581; fax: +82-54-279-2299. E-mail addresses: [email protected] (H. Jung), [email protected] (E. Yi), [email protected] (D. Kim), [email protected] (G.G. Lee).
Abstract

POSIE (POSTECH Information Extraction System) is an information extraction system that uses multiple learning strategies, i.e., SmL, user-oriented learning, and separate-context learning, in a question answering framework. POSIE replaces laborious annotation with automatic instance extraction by SmL from structured Web documents, and places the user at the end of the user-oriented learning cycle. Information extraction as question answering simplifies the extraction procedures for a set of slots. We introduce techniques verified in the question answering framework, such as domain knowledge and instance rules, into the information extraction problem. To incrementally improve extraction performance, a sequence of user-oriented learning and separate-context learning produces context rules and generalizes them in both the learning and extraction phases. Experiments on the "continuing education" domain initially show an F1-measure of 0.477 and a recall of 0.748 with no user training. However, as the size of the training document set grows, the F1-measure rises beyond 0.75 with a recall of 0.772. We also obtain an F-measure of about 0.9 for five out of seven slots on the "job offering" domain.

© 2003 Elsevier Ltd. All rights reserved.

Keywords: Information extraction; Question answering; User-oriented learning; Lexico-semantic pattern; Machine learning
1. Introduction

Information extraction is a process that takes unseen documents as input and produces a tabular structure as output. As Internet growth accelerates, information extraction is attracting
considerable attention from the Web intelligence community. Traditional information extraction tasks involve locating specific information in plain text written in a natural language; this view casts information extraction merely as one area of natural language processing. On the Web, however, information extraction, as a fundamental front-end technique for knowledge discovery, data mining, and natural language interfaces to databases, has become a major Web application technology (Jung, Lee, Choi, Min, & Seo, 2003; Nahm, 2001; Nahm & Mooney, 2000).

One crucial challenge for information extraction as a Web application technology is domain portability. Since most previous systems require human-annotated data to learn extraction rules or patterns, domain experts must manually annotate the training data. Worse, when a new domain is added, a considerable portion of the porting time is poured into laborious annotation. To circumvent this problem, recent research has developed weakly supervised and unsupervised learning algorithms. However, these new techniques do not yet satisfy the back-end applications (Eikvil, 1999; Zechner, 1997).

Domain portability is greatly affected by the manner in which Web document types are used and by the point in time at which domain experts or users become involved. Thus, we propose two strategies: first, replacing laborious annotation with automatic knowledge extraction from structured Web documents, 1 and second, placing users at the end of a learning cycle in the deployment phase. To incrementally improve extraction performance, POSIE 2 combines user-oriented learning, which produces context rules, with separate-context learning, which generalizes them.

The remainder of the paper is organized as follows. Section 2 reviews important related research on information extraction. Section 3 proposes our information extraction model based on a question answering framework. The detailed architecture and the knowledge of POSIE are respectively described in Sections 4 and 5. Section 6 explains the techniques to expand this knowledge through user-oriented and separate-context machine learning. Section 7 analyzes experimental results for the practical "continuing education" domain. To conclude the paper, Section 8 discusses the functional characteristics of information extraction systems and future work.

1 Documents containing attributes which can be correctly extracted based on some uniform syntactic clues, for example, tables in the form of separated attributes and their contents.
2 POSIE (POSTECH Information Extraction System).

2. Related research

Information extraction (IE) systems using an automatic training approach (Grishman, 1997; Sasaki, 1999; Yangarber & Grishman, 1998) share a common goal: to formulate effective rules for recognizing relevant information. They achieve this goal by annotating training data and running a learning algorithm (Knoblock, Lerman, Minton, & Muslea, 2000; Riloff, 1996; Riloff & Jones, 1999; Sudo, Sekine, & Grishman, 2001). Recent IE research concentrates on the development of trainable information extraction systems for the following reasons. First, annotating texts is simpler and faster than writing rules by hand, and the rapid growth of Web content increases the need for a series of automatic processing steps. Second, automatic training ensures domain portability and full coverage of examples.
However, training data, which is expensive to acquire, prevents trainable IE systems from predominating over other approaches. Core machine learning algorithms that reduce the burden of training data have therefore been adopted in many NLP 3 applications, including information extraction. Weakly supervised learning algorithms such as co-training and co-EM were developed for text categorization (Blum & Mitchell, 1998; Nigam & Ghani, 2000). They reduce the amount of annotated data by using a small set of seed rules and by adding unlabeled text with the best score. Despite these efforts, error propagation in the total learning cycle is too severe to obtain refined rules. Another strategy is active learning, a fully supervised learning algorithm in which the user, as a domain expert, is required to locate the examples (RayChaudhuri & Hamey, 1997). Due to this laborious work, knowledge changes are difficult to incorporate into the system.

Pierce and Cardie indicate some limitations of co-training for natural language processing and propose user-oriented learning (Pierce & Cardie, 2001a, 2001b). They are concerned with the scalability and portability of information extraction systems. During the learning cycle, the user confirms the desirability of new examples. Although the user is an expert neither in machine learning nor in information extraction, he or she is competent to identify the specific information to extract as an end user. To leverage the user's ability, user-oriented learning puts the user into the deployment cycle. However, the algorithm still requires the user both to annotate seed training data and to select new candidate examples. The user is only sporadically involved in the learning process and cannot be expected to concentrate on further handcrafted work.

Recently, some systems address the automatic construction of extraction rules without any annotation, which is more desirable in practical-level applications (Jones, McCallum, Nigam, & Riloff, 1999). For the DIPRE system, Brin (1998) uses a bootstrapping method to obtain patterns and relations from Web documents without pre-annotated data. The process is initiated with small samples, such as the relation of (author, title) pairs. Next, DIPRE searches a large corpus for patterns in which such a pair appears. Similarly, Yangarber and Grishman apply automatic bootstrapping to seek patterns for name classification (Yangarber & Grishman, 2000). This method requires a named-entity tagger and a parser to mark all the instances of people's names, companies, and locations. Kim et al. improve these automatic bootstrapping algorithms using the types of the Web documents (Kim, Cha, & Lee, 2002; Kim, Jung, & Lee, 2003). They focus more on declarative-style knowledge, which can be extended with human interaction for practical-level performance in a deployed commercial system. To generate extraction patterns, this model combines declarative DTD-style patterns and an unsupervised learning algorithm, SmL. Eliminating human pre-processing of training documents yields great portability to new domains. However, the model sacrifices a portion of extraction precision to acquire this high domain portability; without a dedicated manual process, a fully automatic extraction system does not always ensure stable results. 4

User-oriented learning is a promising strategy which eliminates the deficiency of task coverage and provides feedback to the extraction system in both the learning and extraction phases.
3 Natural Language Processing.
4 SmL shows an F1-measure above 0.8 for the semi-structured "audio–video" domain, but below 0.2 for the more difficult, free-style "continuing education" domain.
On the other hand, the automatic acquisition of rules from structured Web documents is a great benefit on the WWW. To maximize the efficiency of information extraction on the Web, we propose a hybrid technique of automatic bootstrapping and user-oriented learning: the annotation aspect of user-oriented learning is replaced with bootstrapping, so that the user is involved only in the confirmation aspect of learning. The combination of user-oriented learning and separate-context learning incrementally expands the domain knowledge. As the framework of our IE system, a question answering system (Lee et al., 2001) is redesigned and adopted in the following four steps: first, automatically extract instances from structured Web documents; second, construct instance rules through sentence-to-LSP 5 transfer; third, confirm context rules by user-oriented learning; finally, generalize the context rules with separate-context learning.

3. Information extraction as question answering

The goal of question answering is to develop a system that retrieves answers rather than documents in response to a question (Lee et al., 2001). As an ordinary procedure, a question answering system classifies possible answers, designs a method to determine the answer types, and searches for answer candidates. There are three major steps: question processing, passage selection, and answer processing. Question processing analyzes the input question to understand what the user wishes to find. Passage selection ranks the passages in retrieved documents. Answer processing selects and ranks answer candidates matching the answer type.

Question answering is closely related to information extraction in that its purpose concerns the acquisition of user-requested information. However, information extraction as question answering is much easier than question answering itself, for the following reasons. First, information extraction has a set of pre-selected questions for a target, which removes the need for question processing. Second, a pre-classified document as input is ready to be processed directly, while question answering must identify related documents in unrestricted open domains. Third, the relation between slots is available in information extraction, i.e., pre-defined slots help to determine their instances by using the relation. Thus, recasting question answering into information extraction can produce better performance than state-of-the-art question answering systems 6 (Harabagiu et al., 2000; Moldovan et al., 1999). Information extraction as question answering also simplifies the extraction processes: we can easily introduce the techniques verified in question answering, such as domain knowledge and instance rules.

Fig. 1 shows the similarities between information extraction and question answering. A slot in information extraction corresponds to a pre-selected question in question answering. Thus, information extraction can exclude the question processing step, which generates many ambiguities in answer types. Domain knowledge, which is common to the two applications, includes a category dictionary, a thesaurus, and collocation information.
5 LSP (lexico-semantic patterns): a knowledge encoding technique used in the SiteQ (Lee et al., 2001) question answering system.
6 0.6–0.7 in reciprocal score. (The score for an individual question is the reciprocal of the rank at which the first correct response was found, or 0 if no correct response was found in the top five responses.)
Fig. 1. Information extraction as question answering. Italics mark the information extraction counterparts of the question answering components.
As a shared feature between the two, instance rules are applied to obtain instance hypotheses or answer candidates from the input document. The IE model on a question answering framework improves the building process of domain knowledge by separately exploiting the types of Web documents. Structured Web documents provide a set of instances for each slot, and instance-to-LSP transfer automatically constructs instance rules for IE from the instances obtained by automatic bootstrapping. The following section explains the system architecture based on the IE-as-question-answering model described here.
4. System architecture for extraction

Our system, POSIE, consists of three major phases: building, learning, and extraction. The building phase constructs several classes of extraction knowledge (see Section 5), such as a collocation DB (database) for NE (named entity) tagging and an instance rule DB for instance finding. The learning phase generalizes the rules to enhance extraction coverage (see Section 6). Fig. 2 shows the system architecture used to extract target frames with the knowledge obtained and generalized in the building and learning phases.

4.1. HTML pre-processing and morphological analysis

DQTagger, an HTML pre-processor, removes most HTML tags, except table-related tags, from an HTML document (Shim, Kim, Cha, Lee, & Seo, 2002).
Fig. 2. System architecture in the extraction phase.
Table 1
Examples of the error DB (English italics are approximate translations)

Incorrect morpheme sequences → Correct morpheme sequences
(Korean entries; glosses: "Art practical trainer," "Becoming parents")
The pre-processor keeps the layout of tables and determines the boundary of the body. All processes after this pre-processing are performed on tag-removed documents, that is, almost plain text.

A morphological analyzer (MA) segments and analyzes Korean sentences. Each eojeol 7 in a sentence produces pairs of morphemes and part-of-speech (POS) tags. MA post-editing then restores incorrect morpheme sequences using an error DB (Table 1); a small sketch of this step follows below.
7 Segmented phrases and words in Korean that become a spacing unit.
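The paper does not show the post-editing code, so the following is a minimal sketch under our own assumptions; the dictionary entries are placeholders, since the real error DB holds Korean morpheme/POS sequences such as those glossed in Table 1.

```python
# Sketch of MA post-editing with an error DB: a known-bad morpheme sequence
# produced by the analyzer is rewritten to its corrected form.
# The entries below are illustrative placeholders, not the real (Korean) DB.
ERROR_DB = {
    ("art/NN", "practical/NN", "train/VV+er"): ("art/NN", "practical-trainer/NN"),
}

def post_edit(morphemes: tuple) -> tuple:
    """Replace the first sub-sequence found in the error DB (longest key first)."""
    for bad, good in sorted(ERROR_DB.items(), key=lambda kv: -len(kv[0])):
        for i in range(len(morphemes) - len(bad) + 1):
            if morphemes[i:i + len(bad)] == bad:
                return morphemes[:i] + good + morphemes[i + len(bad):]
    return morphemes

print(post_edit(("art/NN", "practical/NN", "train/VV+er")))
# ('art/NN', 'practical-trainer/NN')
```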
4.2. Category dictionary and thesaurus

SiteQ, a question answering system, uses a category dictionary and a thesaurus to construct lexico-semantic patterns for both questions and retrieved passages (Lee et al., 2001; Kim, Kim, Lee, & Seo, 2001). Since POSIE shares this style of language processing with a question answering system such as SiteQ, we use a category dictionary and a thesaurus as the main semantic information sources. In TREC 10, 8 SiteQ used 66 semantic tags and many user-defined semantic classes. The semantic tags have been expanded to 83 in POSIE:

Semantic tags

66 tags for Q/A: Artificial language, action, artifact, belief, bird, book, building, city, color, company, continent, country, date, direction, disease, drug, event, family, fish, food, game, god, group, language, living thing, location, magazine, mammal, month, mountain, movie, music, nationality, nature, newspaper, ocean, organization, person, phenomenon, planet, plant, position, reptile, school, season, sports, state, status, subject area, substance, team, transport, weekday, unit for area, unit for count, unit for date, unit for length, unit for money, unit for power, unit for rate, unit for size, unit for speed, unit for temperature, unit for time, unit for volume, unit for weight

Extended 17 tags: Address, appliance, art, computer, course, deed, examination, hobby, law, level, living part, method, picture, river, room, sex, unit for age
The category dictionary has approximately 67,280 entries, each consisting of four components: semantic tag, user-defined semantic class, part-of-speech tag, 9 and lexical form. The semantic tags form a flat structure. In a lexico-semantic pattern, each semantic tag follows a "@" symbol. User-defined semantic classes are tags for syntactically or semantically similar lexical groups; for example, the user-defined semantic class "%each" groups several Korean words of equivalent function.

The thesaurus, an assistant to the category dictionary, discovers sense codes for general unknown words. The codes are matched against the category sense-code mapping table (CSMT) to acquire the most similar semantic tags. Currently, the thesaurus has about 90,000 words.

[Thesaurus entries with sense codes]
386DX: 03010173091001010o0202
(approval): 010A6M0E090H1a01|0B01010102030c02|0B0K0Q070p0B

[Category sense-code mapping table (CSMT)]
@computer: 03010173091001010o02
@action: 0B0E|0B0K0Q062C04|0B0K0Q07

[Mapping results]
386DX → @computer
(approval) → @action

The vertical bar means "or."
8 TREC: Text Retrieval Conference, http://itl.nist.gov/.
9 We use 32 part-of-speech tags.
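To make the lookup order concrete, here is a toy sketch of our own (not the authors' code): the category dictionary is consulted first, and for unknown words the thesaurus sense codes are mapped to semantic tags through the CSMT. All entries are illustrative stand-ins for the 67,280-entry dictionary and the 90,000-word thesaurus.

```python
# Toy sketch of NE-type lookup: category dictionary first, thesaurus fallback.
CATEGORY_DICT = {"YMCA": ["@organization"]}          # lexical form -> tags
THESAURUS = {"386DX": ["03010173091001010o0202"]}    # word -> sense codes
CSMT = {"@computer": "03010173091001010o02",         # tag -> sense code
        "@action": "0B0E"}

def semantic_tags(word: str) -> list[str]:
    """Category dictionary first; thesaurus sense codes via CSMT as fallback."""
    if word in CATEGORY_DICT:
        return CATEGORY_DICT[word]
    tags = []
    for code in THESAURUS.get(word, []):
        for tag, tag_code in CSMT.items():
            # the real system scores candidates with the semantic distance of
            # Section 4.3; a shared-prefix test stands in for it here
            if code.startswith(tag_code):
                tags.append(tag)
    return tags

print(semantic_tags("386DX"))   # ['@computer']
```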
Table 2
Example of sentence-to-LSP transfer (English italics are approximate translations)

Phrases (glosses of Korean)    Lexico-semantic pattern
Reading trainer                @hobby @position (@level)
Fairy tale oral narrator
Fairy tale oral narrator
Recreation coach
4.3. Sentence-to-LSP transfer

A lexico-semantic pattern is a structure in which linguistic entries and semantic types can be used in combination to abstract certain sequences of words in a text (Lee et al., 2001; Mikheev & Finch, 1995). Linguistic entries consist of words, phrases, and part-of-speech tags, such as "YMCA," "Young Men's Christian Association," and "nq_loc." 10 Semantic types include slot name instances, semantic tags (categories), and user-defined semantic classes, for example, "#ce_c_teacher," 11 "@person," and "%each."

Sentence-to-LSP transfer makes a lexico-semantic pattern from a given sentence (Jung et al., 2003). Lexico-semantic patterns enhance the coverage of extraction through information abstraction: many phrases map onto one lexico-semantic pattern (Table 2). The lexico-semantic patterns obtained from structured Web documents become the left-hand sides of instance rules. The average compression ratio 12 is about 50%, i.e., about two unique sentences are transferred into one lexico-semantic pattern. The results show that the compression ratio deviates widely across slot names, and experimentally, the type of slot name influences recall and precision (Table 3; see Section 7).

The transfer consists of two phases: named entity (NE) recognition and NE tagging. NE recognition discovers all possible semantic types for each word by consulting a category dictionary and a thesaurus (Rim, 2001).
10 A part-of-speech tag denoting a location or organization.
11 A slot name instance for the "teacher" slot in the "continuing education" domain.
12 The compression ratio suggests the degree of abstraction.
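The many-to-one abstraction of Table 2 can be sketched as follows; this is our illustration, not the authors' implementation, and the type table is a toy stand-in for the NE recognition step described above.

```python
# Toy sketch of sentence-to-LSP transfer: each token is replaced by its
# semantic type when one is known, so distinct surface phrases collapse
# into one abstract pattern (cf. Table 2). Entries are illustrative only.
TYPE_OF = {
    "reading": "@hobby", "recreation": "@hobby",
    "trainer": "@position", "coach": "@position", "narrator": "@position",
}

def to_lsp(phrase: str) -> str:
    return " ".join(TYPE_OF.get(tok.lower(), tok) for tok in phrase.split())

# Different surface phrases, one lexico-semantic pattern:
print(to_lsp("Reading trainer"))     # @hobby @position
print(to_lsp("Recreation coach"))    # @hobby @position
```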
Table 3
Sentence-to-LSP transfer compression ratio from structured Web documents (the slot names are our target slots on the "continuing education" domain in this paper)

Slot name   # Sentences   # Unique sentences   # LSPs   Compression ratio
$TEACHER    931           579                  228      60.62%
$NAME       1982          1662                 1287     22.56%
$START      226           78                   15       80.77%
$PERIOD     841           158                  32       79.75%
$MONEY      1458          277                  43       84.48%
$TIME       1904          964                  186      80.71%
$NUMBER     834           68                   5        92.65%
Total       8176          3715                 1796     51.73%
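Reading the table, the compression ratio is computed as 1 − (# LSPs / # unique sentences), a definition consistent with every per-slot row: for $TEACHER, 1 − 228/579 ≈ 60.62%, and for $NAME, 1 − 1287/1662 ≈ 22.56%.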
When a semantic type for a given word does not exist in the category dictionary, we attempt to discover its semantic types using the thesaurus. The category sense-code mapping table converts the sense codes of the thesaurus into the semantic tags used in the category dictionary; the table consists of pairs of semantic tags and sense codes. Each word without a semantic type becomes the key for the thesaurus search. If the search succeeds and some sense codes are retrieved, we calculate the semantic distance 13 (similarity) between the retrieved codes and the codes contained in each semantic tag in the CSMT. The semantic distance is defined as follows:

Sim(A, B) = 2 × common_level(A, B) / (level(A) + level(B))

13 The current threshold is 0.7.
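A minimal sketch of this similarity, under our assumption (not stated explicitly in the paper) that each two-character chunk of a sense code encodes one level of the thesaurus hierarchy:

```python
# Sketch of the semantic distance of Section 4.3, assuming one thesaurus
# level per two-character chunk of a sense code.
def levels(code: str) -> list[str]:
    return [code[i:i + 2] for i in range(0, len(code), 2)]

def sim(a: str, b: str) -> float:
    """Sim(A,B) = 2 * common_level(A,B) / (level(A) + level(B))."""
    la, lb = levels(a), levels(b)
    common = 0
    for x, y in zip(la, lb):
        if x != y:
            break
        common += 1
    return 2 * common / (len(la) + len(lb))

# The CSMT match is accepted when the similarity reaches the 0.7 threshold:
print(sim("03010173091001010o0202", "03010173091001010o02") >= 0.7)  # True
```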
NE tagging then selects one semantic type for each word so that a sentence maps into exactly one lexico-semantic pattern. The collocation DB has the form of trigrams and is utilized for the tagging. The components of a trigram, like those of lexico-semantic patterns, are lexical entries and semantic types. Example trigrams and their frequencies for the "continuing education" domain are given in Table 4.

Table 4
Trigram as collocation information for tagging

Trigram                    Frequency
NULL num @unit_money       138
@unit_date num @weekday    41
sym_* num @unit_time       25

4.4. Instance finding

To find extractable instances, we apply two major features: instance rules and context rules. The instance rules automatically obtained from structured Web documents discover instance hypotheses in a given document (see Section 6.1). A lexico-semantic pattern and a slot name are the components of an instance rule, as follows:

{num @unit_date num @unit_date sym_par @living_part sym_par} → $START
{ukall @position sym_par @position sym_par} → $NAME
We match the left-hand sides with a lexico-semantic pattern from the sentence-to-LSP transfer. If matching succeeds, the lexico-semantic pattern becomes an instance hypothesis for the slot
name on the right-hand side. Next, to expand the coverage of the instance rules, POSIE merges the instance hypotheses. The instance merge is applied recursively, using the algorithm given below.

Let A and B be instance hypotheses.

[Basic conditions]
1. The two are instance hypotheses of the same slot name.
2. The two are in the same sentence of a document. (We do not consider HTML tags because they were removed during the HTML pre-processing.)

The two are merged into a new hypothesis:
1. If the scope 14 of A includes that of B, or vice versa. 15
2. If the scope of A overlaps with that of B, or vice versa.
3. If A's end position meets B's start position, or vice versa.
4. If there is a symbol between A's end position and B's start position, or vice versa.

The following shows an example of the instance merge:

[Sentence for "course time" slot] 16
Lecture time: Saturday. 09:30–16:30 (1 day/week, 7 hours/day)

[Lexico-semantic pattern]
{#ce_c_time sym_: @weekday sym_. num sym_: num sym_- num sym_: num sym_par @unit_date num @unit_date sym_, num @unit_date num @unit_time sym_par}

[Instance hypotheses]
34 instance hypotheses (8 of them have the correct slot name), including:
{@weekday} ($TIME) // Saturday
{num sym_: num sym_- num sym_: num} ($TIME) // 09:30–16:30
{@unit_date num @unit_date} ($TIME) // 1 day/week
{num @unit_date num @unit_time} ($TIME) // 7 hours/day

[The result of instance merges]
{@weekday sym_. num sym_: num sym_- num sym_: num sym_par @unit_date num @unit_date sym_, num @unit_date num @unit_time sym_par} // Saturday. 09:30–16:30 (1 day/week, 7 hours/day)

After instance search and merge, we have all possible instance hypotheses.
14 The scope means the length (from the start position to the end position) of an instance hypothesis. For example, the instance hypothesis "09:30–16:30" has a length of 7 because its LSP "num sym_: num sym_- num sym_: num" consists of 7 elements.
15 For example, if A is "09:30" and B is "09:30–16:30" in the input sentence "Lecture time: Saturday. 09:30–16:30 (1 day/week, 7 hours/day)," then the scope of B includes that of A. Thus, the two would be merged.
16 "Lecture time: Saturday. 09:30–16:30 (1 day/week, 7 hours/day)."
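A compact sketch of these merge conditions, in our own simplified representation: each hypothesis is a (slot, start, end) span over the LSP elements of one sentence.

```python
# Sketch of the instance merge: two hypotheses of the same slot in the same
# sentence merge if one span contains, overlaps, or touches the other, or
# if only a single symbol element separates them.
def mergeable(a, b, elements):
    (slot_a, s_a, e_a), (slot_b, s_b, e_b) = a, b
    if slot_a != slot_b:                 # basic condition 1: same slot
        return False
    if s_a > s_b:                        # make a the earlier hypothesis
        (s_a, e_a), (s_b, e_b) = (s_b, e_b), (s_a, e_a)
    if e_a >= s_b:                       # inclusion, overlap, or touching
        return True
    gap = elements[e_a:s_b]              # a lone symbol in between also merges
    return len(gap) == 1 and gap[0].startswith("sym_")

def merge(a, b):
    slot, s1, e1 = a
    _, s2, e2 = b
    return (slot, min(s1, s2), max(e1, e2))

# "Saturday" and "09:30-16:30", separated only by "sym_.", become one $TIME span
elems = ["@weekday", "sym_.", "num", "sym_:", "num", "sym_-", "num", "sym_:", "num"]
a, b = ("$TIME", 0, 1), ("$TIME", 2, 9)
if mergeable(a, b, elems):
    print(merge(a, b))   # ('$TIME', 0, 9)
```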
The remaining modules of the extraction phase filter the hypotheses using context rules, dynamic slot grouping, and a slot relation check. We have two versions of context rules (see Section 5.2): context rules from user-oriented learning (see Section 6.2) and generalized context rules from separate-context learning (see Section 6.3). These rules verify the extracted instance hypotheses using the left and right context.

4.5. Target frame filling

A context rule represents only one slot instance using the left and right context. On the other hand, WHISK (Soderland, 2001) permits rule descriptions for multi-slots, which is a major reason why WHISK gives accurate results in discovering multi-slots and their relations. However, WHISK requires learning all types of permutations, because the rule description depends on the ordering of slots. Our dynamic slot grouping removes the two major limitations of previous systems such as WHISK: the number of slots to describe and the learning load of permutation. Two or more context rules are woven into one rule after the instance hypotheses are discovered. For example, if the two context rules "{#ce_c_teacher sym_:} $TEACHER {sym_par %picture @action @action sym_, %picture @position sym_par #ce_c_period}" and "{#ce_c_period sym_:} $PERIOD {#ce_c_time}" share the same boundary "#ce_c_period," then slot grouping dynamically combines them as "{#ce_c_teacher sym_:} $TEACHER {sym_par %picture @action @action sym_, %picture @position sym_par #ce_c_period sym_:} $PERIOD {#ce_c_time}." There is no restriction on the number of slots to describe, because slots are freely grouped at run time. This eventually lets POSIE extract multi-slots without any training or rules for them. The ordering of slots does not affect the learning load of permutation, because the source of learning is a simple context rule, not a combined form of two or more rules. Thus, dynamic slot grouping is a promising algorithm for the multi-slot extraction that other information extractors regard as a burdensome chore. The following example shows dynamic slot grouping:

[Input document after HTML pre-processing] 17
Child art medical cure primary class
Teacher: Yoon, Youngok (major on art medical cure, art medical curer)
Lecture period: 15 weeks
Lecture time: Saturday. 09:30–16:30 (1 day/week, 7 hours/day)
Registration fee: 450,000 Won
17 The underlined phrases are slot instances.
[Slot instances and their context]

{NULL} $NAME {#ce_c_teacher} 18
{#ce_c_teacher sym_:} $TEACHER {sym_par %picture @action @action sym_, %picture @position sym_par #ce_c_period}
{#ce_c_period sym_:} $PERIOD {#ce_c_time}
{#ce_c_time sym_:} $TIME {#ce_c_money}
{#ce_c_money sym_:} $MONEY {NULL}

[Dynamic slot grouping with left and right context 19]

Combined rule for multi-slots:
{NULL} $NAME {#ce_c_teacher sym_:} $TEACHER {sym_par %picture @action @action sym_, %picture @position sym_par #ce_c_period sym_:} $PERIOD {#ce_c_time sym_:} $TIME {#ce_c_money sym_:} $MONEY {NULL}

18 The italicized words are the context boundaries for dynamic grouping.
19 Identical patterns indicate that the rules share context.
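A sketch of the grouping as we read it: rules are chained at run time whenever one rule's right context contains the boundary that opens the next rule's left context. The rule representation below is our own simplification.

```python
# Sketch of dynamic slot grouping: chaining context rules on shared
# boundaries, so multi-slot rules never need to be trained.
rules = [  # (left context, slot, right context), simplified from the example
    ("NULL", "$NAME", "#ce_c_teacher"),
    ("#ce_c_teacher sym_:", "$TEACHER", "... #ce_c_period"),
    ("#ce_c_period sym_:", "$PERIOD", "#ce_c_time"),
    ("#ce_c_time sym_:", "$TIME", "#ce_c_money"),
    ("#ce_c_money sym_:", "$MONEY", "NULL"),
]

def shared_boundary(right: str, left: str) -> bool:
    boundary = left.split()[0]                 # e.g. "#ce_c_period"
    return boundary != "NULL" and boundary in right.split()

def group(rules):
    groups, current = [], [rules[0]]
    for prev, rule in zip(rules, rules[1:]):
        if shared_boundary(prev[2], rule[0]):
            current.append(rule)               # chain into the same group
        else:
            groups.append(current)
            current = [rule]
    groups.append(current)
    return groups

for g in group(rules):
    print(" ".join(slot for _, slot, _ in g))
# $NAME $TEACHER $PERIOD $TIME $MONEY  -> one combined multi-slot group
```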
Finally, the slot relation check (SR) determines the number of target frames and the slots to fill by inspecting the groups to which the slots belong, as follows (a small sketch appears at the end of this section):

Let M be the group with the largest number of slots.
Let slot-num(A) be the number of slots belonging to group A.
Let n be the number of groups where slot-num(A) > 1.
For each group A:
  If slot-num(A) is 1 and the slot name in A is one of the slot names in M, then remove group A.
If n is 1, the number of target frames is also 1.

After the target frame filling, the user can confirm the extraction results. POSIE provides several types of information (instance type, position, context score, and group number) to help the user with this confirmation. The feedback updates the context rules and then the generalized context rules to incrementally improve extraction performance.
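The sketch below follows our reading of the slot relation check above; the frame count for the case n > 1 is not spelled out in the text, so the sketch simply returns n in that case.

```python
# Sketch of the slot relation check: singleton groups whose slot name already
# appears in the largest group are discarded; if exactly one multi-slot group
# remains, exactly one target frame is produced.
def slot_relation_check(groups):
    largest = max(groups, key=len)
    kept = [g for g in groups
            if not (len(g) == 1 and g[0] in largest and g is not largest)]
    n_multi = sum(1 for g in kept if len(g) > 1)
    n_frames = 1 if n_multi == 1 else n_multi   # n > 1 case: our assumption
    return kept, n_frames

groups = [["$NAME", "$TEACHER", "$PERIOD", "$TIME", "$MONEY"], ["$TIME"]]
kept, n = slot_relation_check(groups)
print(len(kept), n)   # 1 1  -> one group remains, one target frame
```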
5. Rules for information extraction

Information extraction requires several pieces of knowledge closely tied to a pre-defined target. Many systems attempt to minimize domain-dependent knowledge and handcrafted features; machine learning algorithms such as co-training (Blum & Mitchell, 1998) and inductive learning (Michalski, Carbonell, & Mitchell, 1983) are widely used in this effort. We also follow this mainstream research by reusing semantic information, minimizing manual annotation, and generalizing rules with machine learning. In contrast to other systems, we use two kinds of rules to extract and filter instances: instance rules and context rules, including a generalized version of the latter. A context rule consists of a slot name and two contexts, left and right. This two-level rule description increases extraction coverage without keeping all the possible permutations between instances and contexts.

5.1. Instance rules

From each table field in structured Web documents, instance rules are automatically acquired through sentence-to-LSP transfer (see Section 6.1). Instance rules, the tool for instance finding, consist of lexico-semantic pattern and slot name pairs. The automatic acquisition of instance rules overcomes a major barrier of current information extraction: achieving domain portability with minimal human intervention while maintaining high extraction performance. POSIE automatically extracts the instances for each slot from structured Web documents and uses them as seed instance examples.

Instance rules are the knowledge required to find slot instance hypotheses. Their role in information extraction resembles that of the rules in question answering, which carry named entities to discover answer candidates; in information extraction, however, the rules consist of slot instances. The rules are applied to test documents to obtain all possible instance hypotheses; instance filtering is then the role of the context rules. Feedback from user-oriented learning helps to select the instance rules to be added later. With the user's positive confirmation, new instance rules from the merged instances are added to the rule set. The system then automatically applies these newly updated rules to extract instance hypotheses. This process assures an incremental improvement of the system through enhanced recall.
Table 5
Examples of context rules

. . . Teacher: David Lee (picture remedy major, picture curer) Lecture period . . .
. . . Lecture time: Saturday: 09:30–16:30 √ Registration fee . . .

Left context          Slot name a   Right context
#ce_c_teacher sym_:   $TEACHER      sym_par %picture @action @action sym_, %picture @position sym_par sym_h #ce_c_period
#ce_c_time sym_:      $TIME         sym_h #ce_c_money

a The bold and underlined words in the examples above are slot names; the text to their left is the left context, and the text to their right is the right context.
5.2. Context rules and generalized context rules

In this section, we describe both the context rules produced by user-oriented learning and the generalized context rules produced by separate-context learning. The two learning algorithms are applied sequentially to produce the two kinds of rules. Context rules, composed of a left and a right context, are the knowledge that represents the context of selected instances. Sample contexts and their context rules are given in Table 5.

The rules proposed by Califf and Mooney (1998) consist of a filler, a pre-filler, and a post-filler. The context rule of POSIE also has three similar parts. However, several differences exist between the two systems' rules. First, the context style and components: their rules consist of part-of-speech tags, semantic classes, and words regarded as independent features, whereas our rules are lexico-semantic patterns tightly coupled with linguistic components. Second, instance representation: they represent the filler in the same format as the pre-filler and post-filler. Their rules are, in a sense, zeroth-order, i.e., more rules are required to represent various different contexts. In POSIE, instance rules and context rules are separated; a context rule, a meta-instance rule, has only a slot name. This two-level architecture enhances coverage and reduces the size of the rule set. Third, context range: they define the range as the number of common features between the examples, which makes the context too short to include all of the clue words. POSIE instead selects the furthest slot name instance 20 within a pre-defined window size, currently 10, as the context boundary. 21

As described above, POSIE adopts a two-level rule architecture. However, without rule generalization, reliable coverage is not ensured. We propose separate-context learning, a sequential covering algorithm, to produce a generalized version of the context rules. The formats of generalized context rules (Table 6) differ from those of context rules: the latter consist of three parts (a left context pattern, a slot name, and a right context pattern), but the former of four parts (slot name, context type, context pattern, and coverage score; see Section 6.3). As with context rules, the slot name is the upper level of the two-level rule description. The context type carries the direction of the current rule (left or right) and its affirmativeness (positive or negative). Negative rules play a role in filtering instance hypotheses.
20 Slot name instances are the variations of a given slot name, for example, "professor," "teacher," and "lecturer" for the slot name "$TEACHER."
21 When no slot name instance is found, the context becomes NULL.
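Spelled out as data structures, the two rule formats look as follows. This is a sketch; the field names are ours, chosen to match the columns of Tables 5 and 6 below.

```python
# The two-level rule formats of Section 5, as data structures (field names ours).
from dataclasses import dataclass

@dataclass
class ContextRule:             # produced by user-oriented learning
    left: str                  # left-context LSP, e.g. "#ce_c_teacher sym_:"
    slot: str                  # slot name, e.g. "$TEACHER"
    right: str                 # right-context LSP

@dataclass
class GeneralizedContextRule:  # produced by separate-context learning
    slot: str                  # slot name
    direction: str             # "LEFT" or "RIGHT"
    positive: bool             # negative rules filter hypotheses out
    pattern: str               # possibly an incomplete common string, e.g. "#ce_c_t"
    coverage: tuple            # (covered, total) among rules of the same slot

rule = GeneralizedContextRule("$TEACHER", "LEFT", True, "#ce_c_teacher sym_:", (5, 7))
print(rule.coverage[0] / rule.coverage[1])   # ~0.71, the 5/7 coverage of Table 6
```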
Table 6
Examples of generalized context rules

Slot name   Context type   Context pattern                      Coverage score
$TEACHER    LEFT (+)       #ce_c_teacher sym_:                  5/7
$PERIOD     RIGHT (+)      #ce_c_time                           4/6
$START      RIGHT (−)      #ce_c_period @weekday #ce_c_start    2/5
If the current hypothesis matches a negative rule, the hypothesis is discarded. Where no context rules apply, we selectively match generalized rules according to the context type. The source of a generalized rule is the set of context rules described with lexico-semantic patterns. Occasionally, however, a generalized rule includes an incomplete component, for example, "#ce_c_t" derived from "#ce_c_teacher" or "#ce_c_time," because our learning algorithm produces the common string among context patterns. The coverage score is the number of contexts covered by the current generalized rule among all contexts with the same slot name.

6. Incremental expansion of knowledge

To ensure incremental extraction performance, new reliable knowledge should be added to the extraction system as training proceeds. POSIE automatically extracts instances, the source of instance rules, for each slot using mDTD 22 (Kim et al., 2003). Whenever a Web robot gathers documents for a given domain, the mDTD rules extract instances from the structured documents among them. This process gradually increases the instance rules through sentence-to-LSP transfer. Further, POSIE incrementally expands domain knowledge using a sequence of user-oriented learning and separate-context learning. We adapt the original user-oriented learning to reduce the user's involvement by replacing manual annotation with automatic bootstrapping. User-oriented learning, a promising algorithm which applies to both the learning and extraction phases, is combined with separate-context learning to produce a generalized version of the context rules confirmed by the user. Our knowledge expansion on instances and context is similar to the work of Jones and his colleagues in that they also use two distinct kinds of knowledge: phrases and extraction patterns (Jones et al., 1999). However, we do not use their mutual bootstrapping-like methodology, because an iterative bootstrapping loop over different knowledge would cause error propagation even though each loop chooses the highest-scoring pattern. We prevent error propagation by excluding iterative mutual learning between the two kinds of knowledge, and we filter instance hypotheses by applying dynamic slot grouping and validation with both instance and context rules.

6.1. Extracting instances from automatic bootstrapping

The instance extractor focuses more on declarative-style knowledge, which can be extended with human interaction for practical-level performance in an actual deployed commercial system.
22 mDTD (modified Document Type Definition): an analytical interpretation to identify target information from the textual fragments of Web documents.
The extractor applies a new extraction method that combines declarative DTD-style extraction patterns with a machine learning algorithm, requiring no annotated corpus to generate the extraction patterns. The DTD concept is generally used for markup languages such as SGML, XML, and HTML. In these documents, the DTD is usually located in an external file and defines the elements which belong to the document type (Flynn, 1998). Using a DTD, SGML documents can encode the elements they include, and those elements can be parsed wherever they appear in a document. We introduce the concept of mDTD, an extension of the conventional DTD concept of SGML, modified for applicability to HTML-based Web document extraction. The background idea of mDTD is similar to DTD usage in SGML: mDTD is used to encode and decode the textual elements of the extraction target. In the learning phase, mDTD rules are learned and added to the set of seed mDTDs for the extraction task. In the extraction phase, the learned mDTD rule set is used as extraction patterns to identify the elements in HTML documents from Web sites. The idea of mDTD gives a more structured encoding ability to an otherwise degenerate HTML document.

A Web robot gathers Web pages according to the seed URL lists for a given domain, downloading only the structured Web documents among them. Next, the instance extractor parses seed mDTD rules and then uses token sequences to construct hierarchical mDTD object graphs. Over these token sequences, the extraction process is the same as an instance classification task, where each token is classified into an instance of the extraction target based on the HTML table structure and the possible name instances of each class. The extractor identifies various types of extraction targets, which are defined by the template with slots (analogous to schema attributes in a relational database), and then fills the empty template slots with the identified instances. Table 7 shows an example of the output template with its filled slots. We then perform part-of-speech tagging and rule matching tests.

Table 7
Example of a slot name and its instances extracted from structured Web documents (English italics are approximate translations)

Slot name: $NAME (course name)
Instances: C++ programming language; CNC Processing Technology; Park, Jongbok; Piping Work; Park, Seungri; Confucius's love study; Lecture of scientific investigation; Course of study for tourism; Church music––piano; Beading
Basically, the extractor uses exact matching between the input token (with its POS tag information) and the symbolic rule object. If the input token only partially matches the symbolic object, the matching decision depends on the ratio of matched characters to the total length of the input token, with the threshold set at half the total length. The POS tag sequence rules are used only in exact-matching tests. If none of the symbolic rules match by lexical similarity, this module evaluates the POS tag rules; otherwise, the POS tag rules are applied to confirm the matching results between the token and the mDTD rules. SmL (Kim et al., 2002; Kim et al., 2003), the forerunner of POSIE, describes the whole process in detail.

6.2. Adapting user-oriented learning

User-oriented learning, a moderately supervised learning algorithm introduced by Pierce and Cardie, concentrates on two main issues in information extraction: scalability and portability (Pierce & Cardie, 2001b). Real users are deployed to identify the targets they wish to locate and extract. Real users may not be experts at machine learning or text processing, but they are qualified experts at judging their goals. The authors believe that users can specify their information needs by providing training examples; the user is proficient at judging an information structure as adequate or inadequate. User-oriented learning performs three steps: annotation, location, and confirmation. Users confirm examples as positive, negative, or unconfirmed (no decision) (Fig. 3). Distinct from active learning, the user merely confirms the desirability of new examples. The need for definitive judgments from the user also differentiates this approach from weakly supervised learning.
Fig. 3. Extraction results before and after applying the user confirmation.
However, user involvement in the final decision step is inevitable to acquire an acceptable quality of target extraction, as shown in Fig. 3. The context score (confirmation score) is calculated as follows:

Context score = (# positive decisions − # negative decisions) / # total decisions
(the threshold is applied if # total decisions is 0, the initial condition; the current threshold is 0.25)

Using structured Web documents as annotated sources removes the need for manual annotation (Kim et al., 2002). Users can concentrate on confirmation without manually annotating any training corpus. Thus, the learning steps in POSIE are reduced to two, differing from the original user-oriented learning: location and confirmation. The judgment of the user greatly influences the location of new candidate examples. The context score ranges from −1 to 1.

6.3. Generalization of context rules

The sequential covering algorithm, a family of algorithms for inductive learning, learns one rule and then removes the examples which that rule covers (Mitchell, 1998). This iteration is called a one-rule-learning–discarding process. Using this process, we introduce a separate-context learning algorithm to generalize the context rules (Fig. 4). CalculateMinScore( ) and CalculateMaxScore( ) calculate the minimal (s_min) and maximal (s_max) covering scores for the current context rule (c_i). The covering score represents the number of rules covered by the current rule:

Covering score (for positive) = # positive rules covered by the given context
  = −1 if any negative rule is covered by the context

Covering score (for negative) = # negative rules covered by the given context
  = −1 if any positive rule is covered by the context
Fig. 4. Separate-context learning algorithm.
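Since Fig. 4 is reproduced here only by its caption, the following is our minimal reconstruction of the loop from the description in the text, not the authors' code; the substring-based covering score and the prefix-growth strategy are simplifications.

```python
# Minimal reconstruction of separate-context learning: sequential covering
# that, for one example set (e.g. positive + left contexts), repeatedly
# generalizes a context and discards the context rules it covers.
def covering_score(pattern, positives, negatives):
    if any(pattern in c for c in negatives):
        return -1                      # must not cover any opposite example
    return sum(pattern in c for c in positives)

def generalize(seed, positives, negatives):
    """Grow the seed pattern as long as the covering score does not drop."""
    best, best_score = seed, covering_score(seed, positives, negatives)
    for size in range(len(seed) + 1, max(len(c) for c in positives) + 1):
        for c in positives:
            cand = c[:size]
            if cand.startswith(best) and \
               covering_score(cand, positives, negatives) >= best_score:
                best = cand
    return best, best_score

def learn(positives, negatives):
    rules, remaining = [], list(positives)
    while remaining:
        rule, score = generalize(remaining[0][:1], remaining, negatives)
        rules.append((rule, score))
        remaining = [c for c in remaining if rule not in c]  # discard covered
    return rules

left_pos = ["#ce_c_teacher sym_:", "#ce_c_teacher sym_: nq_loc", "#ce_c_time sym_:"]
print(learn(left_pos, negatives=[]))
# [('#ce_c_t', 3)] -- the incomplete common string noted in Section 5.2
```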
Table 8
Separate-context learning compared with other sequential covering algorithms

Algorithm                   Example selection                                        Rule scoring          Example types
Separate-context learning   Best context score and frequency (specific-to-general)   Coverage              Positive/negative and left/right
SmL                         No selection                                             Coverage and length   Only positive
CN2                         Best complex (general-to-specific)                       Entropy               Positive and negative
GeneralizeContext( ) returns a new context rule that is as long as possible while keeping the maximal (s_max) covering score. The function repeatedly grows the size of a given rule and recalculates its covering score as long as the maximal (s_max) covering score does not drop; the enlargement stops when the maximal (s_max) covering score decreases. Like standard separate-and-conquer algorithms such as IREP (Cohen, 1995), our separate-context learning trains rules in a greedy fashion. However, we attempt to find the best context rule set by strictly holding to the highest context score.

Table 8 compares three sequential covering algorithms on three important features (example selection, rule scoring, and example types) to demonstrate the generality of our algorithm. CN2 (Clark & Niblett, 1989) measures the performance of generated rules using information gain, resembling FOIL (Quinlan, 1990), while sequential mDTD learning (SmL) (Kim et al., 2002) calculates lexical similarity and coverage rate. Our separate-context learning selects examples by both context score and rule frequency. The positive and negative examples for learning are the field instances extracted from structured Web documents. Unlike the other two algorithms, ours learns four sets of examples: positive + left, positive + right, negative + left, and negative + right. When each set of generalized rules entirely covers its own examples, the learning stops.
7. Experimental results

A Web robot searched Web sites, such as universities and education centers, which provide information on "continuing education." We manually gathered and filtered 431 Web documents on course information from tens of education-related Korean Web sites such as http://oun.knou.ac.kr/, http://www.ajou.ac.kr/~lifetime/, and http://ncle.kedi.re.kr/. Of these, 248 were semi-structured Web documents 23 and the others were structured Web documents.
23 Documents containing tuples with missing attributes, attributes with multiple values, variant attribute permutations, and exceptions.
Table 9
Incremental user-oriented learning (total)

Measure      Technique      No user training   24 docs   54 docs   78 docs
Recall       CS             0.748              0.756     0.756     0.789
             CS + GC        0.748              0.78      0.78      0.78
             CS + GC + SR   0.748              0.772     0.772     0.772
Precision    CS             0.35               0.489     0.508     0.61
             CS + GC        0.35               0.653     0.681     0.727
             CS + GC + SR   0.35               0.674     0.704     0.731
F1-measure   CS             0.477              0.594     0.608     0.688
             CS + GC        0.477              0.711     0.727     0.753
             CS + GC + SR   0.477              0.72      0.736     0.751
In total, 1796 instance rules were automatically extracted from the 3715 instances in the structured Web documents (Table 3). These rules determine instance hypotheses from the semi-structured Web documents in the first extraction phase. POSIE extracts instances for seven slots: prescribed number ($NUMBER), teacher ($TEACHER), course name ($NAME), start time ($START), period ($PERIOD), tuition fee ($MONEY), and school hours ($TIME). The semi-structured Web documents include several multi-slots to handle. 24 We divided the documents into a training and a test set: 170 of the 248 were randomly selected as the test set, and the training set consists of three subsets of 24, 30, and 24 documents. POSIE measures the extraction performance after learning each training subset. Finally, we applied the context score (CS; see Section 6.2), generalized context (GC; see Section 6.3), and slot relation check (SR; see Section 4.5) strategies. 25

Table 9 shows that user-oriented learning enhances extraction performance. As performance criteria, we measure recall, precision, and F1-measure for three techniques: context score (CS), context score + generalized context (CS + GC), and context score + generalized context + slot relation check (CS + GC + SR). We define the baseline of the system as the performance with no user training, i.e., with no manual confirmation applied. The F1-measure at the baseline is 0.477. Recall is sufficiently high to apply to this domain without human intervention; the high recall and low precision imply that the automatic knowledge construction from structured documents helps to find instance hypotheses but does not carry enough information to select among them. As the size of the user training set grows, the system reaches a higher performance, up to 0.75 in F1-measure. The performance increases rapidly after learning the first 24 documents: while the user training documents amount to less than 10% of the total, precision and F1-measure almost reach the peak of our extraction performance. Almost all of the improvement comes in precision, while recall stays almost completely flat. Only merged instances are added to the instance rule set upon the user's positive confirmation; that is, instance rules completely separate from the existing rule set are not added.
24 More than one target frame in a document.
25 All the strategies include the dynamic slot grouping strategy.
Fig. 5. IE performance on each slot.
Providing a way to consider such separated rules during user-oriented learning would certainly increase recall and overall performance; this is one of our future works.

The CS + GC + SR strategy is always superior to CS and CS + GC, except at 78 documents in F1-measure; no exception exists in precision. The reason would be that the high performance of CS and GC diminishes the effect of SR. Indeed, when CS or GC is omitted, the performance gap due to SR distinctly increases on several randomly selected documents. Generalized context rules ensure high precision: user-oriented learning makes a set of context rules, and separate-context learning generates the generalized version that determines whether a current instance hypothesis extracted from an unseen document is a real instance. From the above results, we can see that the greater the number of user learning documents, the smaller the role of the slot relation check. We also experimented with the case in which CS + GC + SR does not include the dynamic slot grouping strategy: recall is the same as in Table 9, but precision drops to 0.721 at 78 docs. This small effect of the grouping is caused by the high performance of the context rules, that is, CS and GC. As expected, without the CS and GC strategies, SR with dynamic slot grouping reaches 0.61 in precision at 78 docs, as given in the table above, while SR without dynamic slot grouping reaches only 0.47.

Fig. 5 shows the extraction performance on each of the seven slots. For four slots ($NUMBER, $PERIOD, $MONEY, and $TIME), we obtain F1-measures higher than 0.8. On the other hand, the course name and teacher slots lie in the range of 0.5–0.6 in F1-measure. These two slots have more variation in their forms than the other slots, which accords with their compression ratios being lower than the others. However, the low performance on some specific slots does not discourage POSIE. In an indirect comparison with other systems on semi-structured documents, WHISK (Soderland, 2001) and SRV (Freitag, 1998a, 1998b), for the teacher slot, 26 POSIE achieves a much better performance: for WHISK, the recall for the "speaker" slot is only 0.111 at precision 0.526, and for SRV, precision is 0.62, whereas POSIE reaches precision 0.769 for the "teacher" slot at recall 0.435. This result is noteworthy because the "teacher" slot often requires more than one name,
26 They call it the "speaker" slot.
Table 10
Added extraction knowledge after user-oriented learning

                        No user training   24 docs   54 docs   78 docs
New instance rules      0                  112       144       200
Context rules           0                  26        317       364
General context rules   0                  16        19        14
occasionally even five or six persons, for one instance in the "continuing education" domain. The relatively high recall for the "teacher" slot is obtained from the category dictionary, which is crucial knowledge for answering questions. Even in the worst case, precision and recall are always above 0.4, which ensures the reliability of our extraction algorithm. While previous IE systems are weak at extracting persons, locations, and organizations, question answering systems endeavor to discover them; to extract these types, POSIE adopts many methodologies from the question answering system, which eventually ensures higher recall than other IE systems.

User-oriented learning and separate-context learning enrich the extraction knowledge (see Table 10). New instance rules and context rules increase incrementally as learning proceeds. The general context rules oscillate in size because of conflicts between context rules: our learning produces only correct generalized rules, which must not cover any opposite rules.

We also experimented on the "job offering" domain. We manually gathered and filtered 190 Web documents from sites such as http://www.jobkorea.co.kr/, http://www.joblink.co.kr/, and http://www.guinbank.com/. POSIE extracts instances for seven slots: category, number, age, schooling, salary, area, and period. As in the experiment above, we divided the documents into a training and a test set: 135 of the 190 were randomly selected as the test set, and the training subsets cumulate to 24 and 55 documents. POSIE measures the extraction performance after learning each training subset. Finally, we applied the mixture of the context score, generalized context, and slot relation check strategies to the test documents (Table 11).
8. Conclusion

8.1. Characteristics of information extraction systems

Table 12 summarizes the functional characteristics of well-known information extraction systems (Eikvil, 1999). 27 The systems in the first four rows have their background in the wrapper generation community (Kushmerick, 2000; Muslea, Minton, & Knoblock, 1998), i.e., they generate wrappers for structured Web documents with delimiter-based patterns; the others come from the traditional information extraction community. RAPIER (Califf & Mooney, 1998), SRV, and WHISK adopt relational learning algorithms to handle a wider range of texts. The last two rows are systems developed in our recent research. SmL applies automatic bootstrapping to the instances in structured Web documents (Kim et al., 2003).
27 The table entries, except the last two rows, are cited from Eikvil's survey.
Table 11
Incremental user-oriented learning on "job offering"

Measure     Technique      Slot        No user training   24 docs   55 docs
Recall      CS + GC + SR   Category    1.000              1.000     1.000
                           Number      0.600              0.400     0.400
                           Age         0.600              0.556     0.578
                           Schooling   1.000              1.000     1.000
                           Salary      0.911              0.911     0.911
                           Area        1.000              1.000     1.000
                           Period      1.000              1.000     1.000
Precision   CS + GC + SR   Category    0.771              0.800     0.821
                           Number      0.234              0.909     1.000
                           Age         0.639              0.644     0.736
                           Schooling   0.699              0.788     0.849
                           Salary      0.706              0.869     0.883
                           Area        0.663              0.744     0.808
                           Period      0.600              0.600     0.957
F-measure   CS + GC + SR   Category    0.871              0.889     0.902
                           Number      0.337              0.556     0.572
                           Age         0.619              0.597     0.648
                           Schooling   0.823              0.881     0.918
                           Salary      0.800              0.890     0.897
                           Area        0.797              0.853     0.894
                           Period      0.750              0.750     0.978
Table 12
Functional characteristics of information extraction systems (O*, requires all possible permutations to be trained to process; O**, groups related information after extraction)

Name        Structured document   Semi-structured document   Free text   Multi-slot   Missing items   Permutations
ShopBot     O                     –                          –           –            –               –
WIEN        O                     –                          –           O            –               –
SoftMealy   O                     O                          –           –            O               O
STALKER     O                     O                          –           O            O               O
RAPIER      O                     O                          –           –            O               O
SRV         O                     O                          –           –            O               O
WHISK       O                     O                          O           O            O               O
SmL         O                     O                          –           –            O               O
POSIE       O                     O                          O a         O            O               O

a POSIE can handle free text documents with lexico-semantic patterns although it does not include any syntactic chunker or parser.
While SmL optimally reduces human intervention and guides an adequate use of the Web document types, it suffers from unstable extraction results due to its lack of natural language processing capabilities.
POSIE satisfies all the functions required by Table 12. To handle multi-slot extraction, POSIE links related instances using shared boundaries between context rules and the slot relation check (see the sketch at the end of this section), and it detects the missing instances that frequently appear in semi-structured and free text documents. Since POSIE dynamically combines the context rules extracted from a document, 28 the specific ordering of the instances does not degrade its performance.

28 This implies that POSIE does not need to be trained on all possible permutations.

8.2. Discussion

POSIE is an information extraction system that hybridizes automatic bootstrapping with a sequence of learning algorithms on a question answering framework. POSIE uses structured Web documents, dictionaries, and semantic information to seed patterns of instances and context. Minimal human effort is required to validate these patterns and then to iteratively discover new ones.

The system has several strong points. First, minimal intervention is required from users, who are domain experts. Second, question answering techniques give high performance with reliable recall of slot instances. Third, a wide range of linguistic information, from lexical forms to semantic features, is employed. Fourth, a sequence of learning algorithms in both the learning and extraction phases ensures incremental improvement of extraction performance.

Future work includes the following topics: adding new domains to ascertain domain portability, 29 designing a flexible generalization algorithm to obtain maximal coverage of unseen documents, and updating the user-oriented learning interface so that separated instance rules can be added to increase recall.

29 We are currently preparing the "job offering and hunting" domains, and some experiments on them showed performance similar to the "continuing education" domain.
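As promised above, the following sketch links related instances into multi-slot records through shared span boundaries: two instances are grouped when one span ends exactly where the other begins. The record-grouping criterion, data layout, and names are our illustrative assumptions rather than POSIE's exact algorithm.

# Hypothetical multi-slot linking by shared boundaries: instances whose
# character spans are adjacent are grouped into one candidate record.
from typing import List, Tuple

Instance = Tuple[str, str, int, int]  # (slot, value, start, end)

def link_by_shared_boundary(instances: List[Instance]) -> List[List[Instance]]:
    """Group instances whose spans touch into candidate records."""
    records: List[List[Instance]] = []
    for inst in sorted(instances, key=lambda i: i[2]):
        for record in records:
            if record[-1][3] == inst[2]:  # previous span ends where this starts
                record.append(inst)
                break
        else:
            records.append([inst])
    return records

spans = [("course", "Java Programming", 0, 16),
         ("teacher", "H. Kim", 16, 22),
         ("course", "Linear Algebra", 40, 54)]
for record in link_by_shared_boundary(spans):
    print([f"{slot}={value}" for slot, value, _, _ in record])
# ['course=Java Programming', 'teacher=H. Kim']
# ['course=Linear Algebra']

Because records are assembled from whatever boundary-sharing instances a document actually contains, no fixed ordering of slots needs to be trained.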
Acknowledgements This work was supported by BK21 (Ministry of Education) and mid-term strategic funding (MOCIE, ITEP).
References

Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Proceedings of the conference on computational learning theory.
Brin, S. (1998). Extracting patterns and relations from the World Wide Web. In Proceedings of the international workshop on the Web and databases.
Califf, M., & Mooney, R. (1998). Relational learning of pattern-match rules for information extraction. In Proceedings of AAAI spring symposium on applying machine learning to discourse processing.
Clark, P., & Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3(4).
Cohen, W. (1995). Fast effective rule induction. In Proceedings of the 12th international conference on machine learning.
Eikvil, L. (1999). Information extraction from World Wide Web: A survey. Technical Report 945, Norwegian Computing Center.
Flynn, P. (1998). Understanding SGML and XML tools: Practical programs for handling structured text. Kluwer Academic Publishers.
Freitag, D. (1998a). Information extraction from HTML: Application of a general machine learning approach. In Proceedings of the 15th conference on artificial intelligence.
Freitag, D. (1998b). Toward general-purpose learning for information extraction. In Proceedings of the 17th conference on computational linguistics and the 36th annual meeting of the association for computational linguistics.
Grishman, R. (1997). Information extraction: Techniques and challenges. Materials for information extraction, International Summer School SCIE-97.
Harabagiu, S., Moldovan, D., Pasca, M., Mihalcea, R., Surdeanu, M., Bunescu, R., Gîrju, R., Rus, V., & Morarescu, P. (2000). FALCON: Boosting knowledge for answer engines. In Proceedings of the 9th text retrieval conference.
Jones, R., McCallum, A., Nigam, K., & Riloff, E. (1999). Bootstrapping for text learning tasks. In Proceedings of the IJCAI-99 workshop on text mining: Foundations, techniques and applications.
Jung, H., Lee, G., Choi, W., Min, K., & Seo, J. (2003). Multi-lingual question answering with high portability on relational databases. IEICE Transactions on Information and Systems, E86-D(2).
Kim, D., Cha, J., & Lee, G. (2002). Learning mDTD extraction patterns for semi-structured Web information extraction. Computer Processing of Oriental Languages, 15(1).
Kim, D., Jung, H., & Lee, G. (2003). Unsupervised learning of mDTD extraction patterns for Web text mining. Information Processing and Management, 39(4).
Kim, H., Kim, K., Lee, G., & Seo, J. (2001). A fast and reliable question-answering system based on predictive answer indexing and lexico-syntactic pattern matching. Computer Processing of Oriental Languages, 14(4).
Knoblock, C., Lerman, K., Minton, S., & Muslea, I. (2000). Accurately and reliably extracting data from the Web: A machine learning approach. Data Engineering Bulletin, 23(4).
Kushmerick, N. (2000). Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118.
Lee, G., Seo, J., Lee, S., Jung, H., Cho, B., Lee, C., Kwak, B., Cha, J., Kim, D., Ahn, J., Kim, H., & Kim, K. (2001). SiteQ: Engineering high performance QA system using lexico-semantic pattern matching and shallow NLP. In Proceedings of the 10th text retrieval conference.
Michalski, R., Carbonell, J., & Mitchell, T. (1983). Machine learning: An artificial intelligence approach. Tioga Publishing Company.
Mikheev, A., & Finch, S. (1995). Towards a workbench for acquisition of domain knowledge from natural language. In Proceedings of the 7th conference of the European chapter of the association for computational linguistics.
Mitchell, T. (1998). Machine learning. McGraw-Hill.
Moldovan, D., Harabagiu, S., Pasca, M., Mihalcea, R., Goodrum, R., Gîrju, R., & Rus, V. (1999). LASSO: A tool for surfing the answer net. In Proceedings of the 8th text retrieval conference.
Muslea, I., Minton, S., & Knoblock, C. (1998). STALKER: Learning extraction rules for semistructured, Web-based information sources. In Proceedings of AAAI workshop on AI and information integration.
Nahm, U. (2001). Text mining with information extraction: Mining prediction rules from unstructured text. PhD Proposal, The University of Texas at Austin.
Nahm, U., & Mooney, R. (2000). Using information extraction to aid the discovery of prediction rules from text. In Proceedings of the KDD (knowledge discovery in databases) 2000 workshop on text mining.
Nigam, K., & Ghani, R. (2000). Understanding the behavior of co-training. In Proceedings of the KDD (knowledge discovery in databases) 2000 workshop on text mining.
Pierce, D., & Cardie, C. (2001a). Limitations of co-training for natural language learning from large datasets. In Proceedings of the conference on empirical methods in natural language processing.
Pierce, D., & Cardie, C. (2001b). User-oriented machine learning strategies for information extraction: Putting the human back in the loop. Working notes of the IJCAI workshop on adaptive text extraction and mining.
Quinlan, J. (1990). Learning logical definitions from relations. Machine Learning, 5.
RayChaudhuri, T., & Hamey, L. (1997). Active learning: Approaches and issues. Intelligent Systems, 7.
Riloff, E. (1996). Automatically generating extraction patterns from untagged text. In Proceedings of the 13th national conference on artificial intelligence.
Riloff, E., & Jones, R. (1999). Learning dictionaries for information extraction by multi-level bootstrapping. In Proceedings of the 16th national conference on artificial intelligence.
Rim, H. (2001). Language resources in Korea. In Proceedings of the symposium on language resources in Asia.
Sasaki, Y. (1999). Applying type-oriented ILP to IE rule generation. In Proceedings of the AAAI-99 workshop on machine learning and information extraction.
Shim, J., Kim, D., Cha, J., Lee, G., & Seo, J. (2002). Multi-strategic integrated Web document pre-processing for sentence and word boundary detection. Information Processing and Management, 38(4).
Soderland, S. (2001). Learning information extraction rules for semi-structured and free text. Machine Learning, 34.
Sudo, K., Sekine, S., & Grishman, R. (2001). Automatic pattern acquisition for Japanese information extraction. In Proceedings of the conference on human language technology.
Yangarber, R., & Grishman, R. (1998). Transforming examples into patterns for information extraction. In Proceedings of TIPSTER text program phase III.
Yangarber, R., & Grishman, R. (2000). Machine learning of extraction patterns from unannotated corpora: Position statement. In Proceedings of the 14th European conference on artificial intelligence workshop on machine learning for information extraction.
Zechner, K. (1997). A literature survey on information extraction and text summarization. Paper for direct reading.