Learning the “Whys”: Discovering design rationale using text mining — An algorithm perspective


Computer-Aided Design 44 (2012) 916–930 Contents lists available at SciVerse ScienceDirect Computer-Aided Design journal homepage: www.elsevier.com/...



Computer-Aided Design journal homepage: www.elsevier.com/locate/cad

Yan Liang (a), Ying Liu (b,∗), Chun Kit Kwong (a), Wing Bun Lee (a)

(a) Department of Industrial and Systems Engineering, Hong Kong Polytechnic University, Hong Kong SAR, China

(b) Department of Mechanical Engineering, National University of Singapore, Singapore 117576, Singapore

Keywords: Design rationale; Rationale representation; Rationale discovery; Text mining; Patent mining

Abstract

Collecting design rationale (DR) and making it available in a well-organized manner will better support product design, innovation and decision-making. Many DR systems have been developed to capture DR since the 1970s. However, the DR capture process relies heavily on human involvement. In addition, with the increasing amount of DR available in archived design documents, there is an acute need for a new computational approach that can capture DR from free textual contents effectively. In our previous study, we proposed an ISAL (issue, solution and artifact layer) model for DR representation. In this paper, we focus on algorithm design to discover DR from design documents according to the ISAL model. For the issue layer of the ISAL model, we define a semantic sentence graph to model sentence relationships through language patterns. Based on this graph, we improve the manifold-ranking algorithm to extract issue-bearing sentences. To discover solution–reason bearing sentences for the solution layer, we propose building two sentence graphs based on candidate solution-bearing sentences and reason-bearing sentences respectively, and propagating information between them. For artifact information extraction, we propose two term relations, i.e. positional term relation and mutual term relation. Using these relations, we extend our document profile model to score the candidate terms. The performance and scalability of the proposed algorithms are tested using patents as research data, together with an example of prior art search to illustrate their application prospects.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

To assist engineering design, many computer-aided design and engineering (CAD/E) systems have been developed since the 1960s. Based on the techniques of computer graphics, traditional CAD systems have been helpful in modeling and simulating design objects in 2D or 3D contexts [1]. While these CAD systems can help designers to represent their ideas by means of formal geometrical models, they are also expected to provide means of designing new products. Since the 1980s, artificial intelligence techniques have been applied to CAD systems to suggest possible solutions from design knowledge bases. As more design information becomes available in digital form, there is a need to integrate such information into the design knowledge bases to better assist design analysis, innovation and decision-making. It is therefore considered that one of the major concepts for future CAD systems is to build design knowledge bases with a variety of useful engineering design knowledge [2].



Corresponding author. Tel.: +65 65167812. E-mail address: [email protected] (Y. Liu).

0010-4485/$ – see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.cad.2011.08.002

Among such design information and knowledge, design rationale (DR) is regarded as an important kind of knowledge for next-generation product development systems [1]. DR generally refers to the explanation of why an artifact is designed the way it is [3]. It helps designers to understand the design know-how and technology of an artifact, and it facilitates the reuse of design knowledge in decision-making and product innovation. Without a careful record of useful design information, significant time and effort are spent searching for relevant answers [4]. Since the 1970s, many DR approaches have been developed with this goal in mind, such as SEURAT for software development [5] and DRed for industrial engineering [4]. However, such DR systems have not been widely adopted in industry [6]. One of the most critical reasons is that they require heavy human involvement to interpret and load DR information into the systems. In addition, they mainly attempt to record DR along design processes, while DRs stored in other archival design documents, such as design reports and patents, are often neglected. Although DR in documents can be interpreted into a predefined DR structure by designers, inconsistency will occur and consequently affect the storage and retrieval of rationales. In order to make the DR process more effective and tractable, one of the promising approaches relies on computational algorithms to discover DR from a large number of archival design documents using text mining techniques. We have also observed that although text mining and machine learning techniques have been applied in design document processing, only a limited number of studies focus on mining deep information from design documents, and none of them addresses DR.

In our previous study, we proposed a layered rationale representation model, Issue, Solution and Artifact Layer (ISAL) [7], and included a conceptual comparison framework between our ISAL model and the classical DR model, i.e. the IBIS (Issue based Information System) model. In this paper, we focus on algorithm design to automatically extract DR information from a large amount of archival design documents according to our ISAL model. For each single document, we extract a single set of issue, solution and artifact. The performance evaluation and scalability test of the algorithms are also detailed.

The rest of this paper is organized as follows. Section 2 reviews the state of the art on several relevant topics, i.e. DR models and systems, design document processing and patent processing for design, and highlights the challenges and opportunities. In Section 3, we detail our algorithm design for the ISAL model. Next, an ISAL-based DR retrieval framework is described in Section 4. Using patent documents as our research data, Section 5 then reports the performance of our DR discovery approach and Section 6 reports an example of DR extracted and a case study on DR retrieval based on the ISAL model. Section 7 discusses some issues in our current DR strategy, including using patents as research data, structure issues in patent texts, DR and process in presenting a holistic view of DR development, etc. Section 8 concludes the paper.

2. Related work

2.1. DR representation models and systems

How to represent DR effectively is the most important issue in DR management.
It affects the reuse of DR information and knowledge [3]. DR representation models vary greatly as they support different design activities. The first approach is argumentation-based representation. Issue based Information System (IBIS) [8] is the earliest argumentation-based method to represent DR and it is the original model for most DR approaches [9]. In IBIS, issues, positions, arguments and their relationships are used to represent DR. Several DR systems with graphical user interface support, such as Compendium tool [10] and DR editor (DRed) [4], have been implemented based on the concept of IBIS. Furthermore, several IBIS variants, such as Procedural Hierarchy of Issues (PHI) [11] and Kuaba approach [12], have also been proposed. In addition, Decision Representation Language (DRL), which is concerned with computer usability, defines the primary elements as decision problems, alternatives, goals, claims and groups [13]. It was further extended by Software Engineering Using Design RATionale (SEURAT) which includes argument ontology to predefine claim types [5]. Question, Option and Criteria (QOC) is another argumentation-based model, which focuses on the discussions of alternatives regarding specific artifact features [13]. Other representation approaches include models that integrate rationale information with detailed engineering information, e.g. programming codes, sketches and product specifications. Most of these representation models are related to particular domains, like product design and software engineering. For example, functional representations are centered on describing how the device works [14]. A Rationale Construction Framework (RCF) was proposed to acquire rationale information by monitoring designers’ interaction with a CAD tool [15]. 
In order to support design traceability in software architecture design, a rationale-based architecture model was proposed by linking up architecture elements, such as requirements, assumptions and design objects [16].


2.2. Design document processing using text mining

The wide use of computer-aided design documentation tools has enabled the recording of many e-design documents, such as patents, journals, design reports, drawing notes and design logbooks. They often contain useful information for design, e.g. customer demand analysis, design concepts and rationale. To effectively support engineering design, design document processing has been studied using text mining techniques. We classify the applications into three classes based on the contents that they produce, i.e. document-level, fragment-level and semantic-level processing.

Design document processing at the document level often focuses on helping designers to organize or retrieve relevant design documents. The most typical way is to index documents based on a domain taxonomy. Yang et al. [17] introduced a thesauri-based approach for design document indexing using vector space model and singular value decomposition techniques. Also, McMahon et al. [18] introduced the Waypoint system for engineering information searches. In this system, the relevant documents were retrieved through a constraint-based classification, which provided a mapping between classification nodes and term phrases. However, manual effort is required to build and maintain the mappings. In our previous study, we also investigated automatically classifying design engineering papers [19], for which we proposed a new term weighting scheme. Although document-level processing can help designers to search for relevant documents, designers still need much time to locate the segments of interest if the document is long.

Design document processing at the fragment level allows users to access and retrieve document content in fragments rather than the whole document. Liu et al. [20] proposed a computational framework for retrieval of document fragments. In this framework, the documents were decomposed into elements from multiple views, e.g. physical structure view, technical description view and logical content view. Then the fragments were associated with the multiple views to support fragment retrieval.

Document processing at the semantic level often focuses on extracting useful information for different applications. For example, in order to organize design documents and support ontology-based design retrieval, a domain ontology, which includes function taxonomy, device taxonomy, material taxonomy, etc., was built using text processing techniques [21]. In order to identify the product portfolio to capture and understand customer needs, association rule mining techniques were introduced to discover useful patterns from past sale and product records [22]. In addition, in order to understand the information needs of users of an engineering information system, the patterns of users' transactions were extracted by combining Latent Semantic Analysis with analyses of online user transaction logs like online reviews and download records [23]. However, little attempt has been made to discover deep design information and knowledge like design rationale for product design.

2.3. Patent processing for design

Patent documents are well recognized as quality data sources for technology trend analysis and design innovation. In this context, the topics of patent classification, retrieval and analysis have been studied for engineering design purposes. Patent classification and patent retrieval can help to organize and produce relevant information at document level or fragment level. For instance, information extraction techniques, such as keyword and phrase extraction as well as term association techniques, were introduced to support topic-based patent retrieval [24]. In addition, ontology annotation, e.g. domain ontology, patent document structure ontology and meta data

ontology, was introduced to organize patent documents [25]. Such ontology was used for content-based patent processing, e.g. patent retrieval, classification and clustering. Patent classification has also been studied for better patent document organization. For instance, machine learning algorithms, such as KNN (K-Nearest Neighbor) and back-propagation networks, have been used for patent classification and search platforms [24,26]. Patent semantic structural information, e.g. claim and purpose, has also been integrated to further facilitate patent categorization [27]. Although the ontology and structural information can help to index patents in a semantic manner, the approaches for building ontology and discovering semantic segments still need to be improved to reduce human involvement and ensure accuracy.

In addition, the studies on patent analysis attempt to discover useful patterns from patent documents. Patent citation information is often used for this purpose. For example, citation information was analyzed to show the inter-organizational transfer of knowledge [28]. Also, information about assignees, attorneys and various section contents, such as claims and detailed descriptions, was studied to suggest potential fields for design [29]. In addition, patent documents were used to identify technology opportunities [30]. In Yoon's study, the technical terms were extracted by term frequency and then confirmed by experts. They were next organized in a hierarchical structure. The values of the technical terms were used to suggest technology opportunities. However, the discovery of design patterns, such as technical trends and technology opportunities, still needs to be further validated.

2.4. Challenges and opportunities

Although the existing DR systems have made significant advances, they have not been widely used in industry [6]. We have observed that several reasons have led to this situation. Firstly, existing DR systems require heavy human involvement. Designers themselves have to manually input rationales into knowledge bases. However, recording DR is often not considered an essential design task, and designers often cannot afford to spend much time annotating their design intentions and outputs and explaining their motivations [1]. In addition, existing DR systems can only start to record rationales along the design processes, so rationales stored in a large amount of past design documents are often neglected and rarely included in a DR system. It requires much effort to transfer such rationale information into DR knowledge bases. Furthermore, the manual approach cannot handle in a timely manner the rationales stored in ongoing design archival documents. Therefore, with the increasing number of digital design documents, there is a significant demand to process and manage DR information in a more effective and tractable manner.

We believe that text mining techniques are promising in handling such challenging issues in DR studies. Currently, in design document processing, text mining techniques have received increasing academic attention in the community. In our previous work, we also studied relevant topics, such as concept extraction for product family ontology [31], semantic relationship identification for manufacturing information management [32] and gathering customer concerns from online product reviews [33]. In view of the challenging issues in DR studies that we have highlighted, we aim to explore the discovery of deep design information, i.e. DR information, from digital design documents using text mining and machine learning techniques.

3. Algorithm design for ISAL model

3.1. ISAL model

Fig. 1. ISAL model for DR representation.

In our previous study, we proposed a computational model, i.e. ISAL [7], which provides the foundation for our further research, such as DR discovery, retrieval and analysis. The ISAL model consists of three layers to represent DR, i.e. issue layer, solution layer and artifact layer, as shown in Fig. 1.

The issue layer describes the motivational reasons and objectives of designing an artifact. These can be needs for the artifact, limitations of prior relevant artifacts, problems and difficulties that need to be solved, and opportunities to come up with the new artifact.

The design solution layer describes how the motivations presented in the issue layer can be addressed and fulfilled. It includes ''solutions'' and their ''causes and effects''. ''Solutions'' represent thoughts, ideas, possible approaches and mechanisms. ''Causes and effects'' refer to the reasons for introducing a ''solution'', e.g. criteria, considerations and arguments.

The artifact layer refers to the artifact information that is described in the issue and solution layers. An artifact can be a solid product, such as a printer, airplane or camera, and it can also refer to a ''soft'' product, such as code, diagram charts and software systems in software engineering.

The connections between layers represent ''support'' or ''related to'' relationships between elements in different layers. For example, links from the solution layer to the issue layer indicate that design solutions are introduced to stress the improvements or significant ideas of the design. Correspondingly, links from the artifact layer to the solution layer identify the components that are associated with the design solutions. Design solutions link up the issue and artifact components, which offers clues about problems and their corresponding designed artifact.

3.2. Technical blueprints

In order to discover DR information according to the ISAL model from digital design documents, particularly those with significant textual contents, we propose an approach using text mining and machine learning techniques for artifact information identification, issue summarization and solution–reason pair discovery. Given a single free-text design document, our DR discovery approach aims to extract a corresponding set of issue, solution and artifact that constitutes the essential information of a DR. We report our rationales and efforts in algorithm design in the following subsections.

3.2.1. Artifact information identification and extraction

Artifact information extraction aims to identify entities as artifact components from a single design document. In our previous study, we proposed a document profile (DP) model to extract word sequences that often bear semantic meaning for


Fig. 2. Algorithm for artifact information extraction.

representing document profile [31]. In this study, an extended version of the DP model is proposed to extract artifact information. We first form the candidate term set with our DP model. Then we model the relationships among terms as graphs by defining two relations, i.e. positional term relation and mutual term relation. Based on these graphs, we propose to use the PageRank algorithm [34] to select terms from the candidate terms to represent artifact information.

Fig. 2 shows the process for artifact information extraction. Firstly, a document D is preprocessed into a set of individual sentences and stop words are removed. The second process then generates the candidate term set F using our DP model. Given a sentence support threshold σ and a word gap g, we count the frequencies of terms, either single words or word sequences with 0 to g gap co-occurrence. Next, the candidate term set F is formed by selecting the terms whose frequencies are equal to or larger than σ [31].

We next rank and retain terms that represent artifact information from F using a graph-based approach, i.e. PageRank [34]. We create a term graph with the candidate terms as nodes. If term j is located before term i and they are in the same sentence, there is a directed link from j to i. In order to leverage the term graph to score terms, it is critical to model the relations among terms. We define two relations, i.e. positional term relation (PTR) $w^{(PTR)}_{(i,j)}$ and mutual term relation (MTR) $w^{(MTR)}_{(i,j)}$, under the assumption that if terms often co-occur in the same sentences, they are likely to have similar scores.

$$ w^{(PTR)}_{(i,j)} = \begin{cases} \dfrac{1}{Z} \sum_{x_l \in X} \exp\left(-\mathrm{dist}(i, j, x_l)\right), & \text{if } i \text{ links to } j \text{ and } i \neq j \\ 0, & \text{otherwise.} \end{cases} \tag{1} $$

The positional term relation shown in formula (1) is defined based on the average distances between term i and term j at the sentence level. $\mathrm{dist}(i, j, x_l)$ denotes the absolute positional distance between i and j in sentence $x_l$. Z is the normalization factor, which is the summation of absolute positional distances between all the candidate terms. In formula (2), the mutual term relation $w^{(MTR)}_{(i,j)}$ is defined using pointwise mutual information (PMI) between term i and term j (formula (3)). p(i) denotes the probability of term i in a document. p(i, j) represents the probability that term i and term j co-occur at the sentence level.

$$ w^{(MTR)}_{(i,j)} = \begin{cases} \dfrac{\mathrm{PMI}(i, j)}{\sum_{i,j} \mathrm{PMI}(i, j)}, & \text{if } i \text{ links to } j \text{ and } i \neq j \\ 0, & \text{otherwise} \end{cases} \tag{2} $$

$$ \mathrm{PMI}(i, j) = \log \frac{p(i, j)}{p(i)\,p(j)}, \quad i \neq j. \tag{3} $$

Based on the term relations, the ranking process iteratively updates a term score s(i) by the ranking function shown in formula (4). The initial score s(i) for term i is its term frequency $tf_i$. λ is a coefficient. w(i, j) represents the term relation. |F| denotes the number of candidate terms in set F. In our study, we examine the performance of the two relations, i.e. PTR and MTR, separately. The ranking process continues until the summed difference between the scores of two successive iterations is lower than a given threshold, 0.001 in our study.

$$ s(i) = (1 - \lambda)\,\frac{1}{|F|} + \lambda \sum_{j} w(j, i)\, s(j). \tag{4} $$

In the final step, we use part-of-speech (POS) tagging to remove the verbs in F. Then we select the top k% terms in the candidate term list and combine the co-occurring single words to form the final artifact information A.
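The term-ranking step of Section 3.2.1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: terms are single tokens rather than word sequences, only the first occurrence of a term in a sentence is used for its position, Z is taken as the sum of distances over all linked pairs, and the value `lam=0.85` is an arbitrary choice because the paper only calls λ "a coefficient". The function names `ptr_weights` and `rank_terms` are hypothetical.

```python
import math
from collections import defaultdict


def ptr_weights(sentences, terms):
    """Positional term relation in the spirit of formula (1).

    An edge (a, b) exists when term a appears before term b in a sentence.
    Assumption: Z is the sum of positional distances over all linked pairs.
    """
    raw = defaultdict(float)
    total_dist = 0.0
    for sent in sentences:
        pos = {}
        for idx, tok in enumerate(sent):
            if tok in terms and tok not in pos:
                pos[tok] = idx  # first occurrence only (assumption)
        for a in pos:
            for b in pos:
                if a != b and pos[a] < pos[b]:
                    d = pos[b] - pos[a]
                    raw[(a, b)] += math.exp(-d)
                    total_dist += d
    z = total_dist or 1.0
    return {edge: v / z for edge, v in raw.items()}


def rank_terms(weights, tf, lam=0.85, tol=1e-3, max_iter=100):
    """Iterate formula (4): s(i) = (1 - lam)/|F| + lam * sum_j w(j, i) s(j)."""
    terms = list(tf)
    n = len(terms)
    s = {t: float(tf[t]) for t in terms}  # initial score is the term frequency
    for _ in range(max_iter):
        new = {}
        for i in terms:
            incoming = sum(weights.get((j, i), 0.0) * s[j] for j in terms if j != i)
            new[i] = (1 - lam) / n + lam * incoming
        diff = sum(abs(new[t] - s[t]) for t in terms)
        s = new
        if diff < tol:  # 0.001 is the threshold quoted in the paper
            break
    return s


sentences = [["ink", "cartridge", "nozzle"], ["ink", "nozzle"], ["cartridge", "holder"]]
tf = {"ink": 2, "cartridge": 2, "nozzle": 2, "holder": 1}
weights = ptr_weights(sentences, set(tf))
scores = rank_terms(weights, tf)
```

In this toy input, "nozzle" receives links from both "ink" and "cartridge", so it ends up with the highest score, while "ink" has no incoming links and keeps only the base score. Swapping in MTR weights would only require a different `weights` dictionary.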

3.2.2. Issue summarization

The issue summarization task for our ISAL model aims to extract sentences that represent motivations of the artifact designed. It is different from existing text summarization, which is mainly meant for general purposes like news summarization. In text summarization, graph-based ranking, such as PageRank [35] and the manifold-ranking process [36], is a popular kind of approach. It ranks the sentences iteratively based on a sentence graph with sentences as nodes and with sentence similarity as link weights. However, traditional sentence similarity based on the vector space model is not sufficient to reflect the semantic relationships between sentences and to direct issue-bearing sentences to the top. Therefore, to extract issue-bearing sentences, we define a semantic sentence graph to model sentence relations and propose to use the manifold-ranking algorithm based on the semantic sentence graph.

Fig. 3. Algorithm for issue summarization generation.

The algorithm for issue summarization generation is shown in Fig. 3. Firstly, a document D is segmented into n individual sentences $X = \{x_1, x_2, \ldots, x_n\}$. After stop word removal and term stemming, the next process is to calculate the term frequency $tf_t$, inverse sentence frequency $isf_t$ and term weight $tw_t$ for each term t, as shown in formula (5), where $n_t$ is the number of sentences that contain the term t.

$$ tw_t = tf_t \times isf_t; \qquad isf_t = 1 + \log(n / n_t). \tag{5} $$
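As a quick illustration of formula (5), the sketch below computes the term weights for a tokenized document. It assumes that sentences are already stemmed and stripped of stop words, and that the logarithm is natural (the paper does not state the base). The function name `term_weights` is hypothetical.

```python
import math
from collections import Counter


def term_weights(sentences):
    """tw_t = tf_t * isf_t with isf_t = 1 + log(n / n_t), as in formula (5)."""
    n = len(sentences)
    tf = Counter(tok for sent in sentences for tok in sent)        # term frequency
    n_t = Counter(tok for sent in sentences for tok in set(sent))  # sentences containing t
    return {t: tf[t] * (1 + math.log(n / n_t[t])) for t in tf}


doc = [["battery", "drain"], ["battery", "life", "battery"], ["cost"]]
tw = term_weights(doc)
```

Here "battery" occurs three times in two of the three sentences, so its weight is 3 × (1 + log(3/2)), while "cost" occurs once in one sentence and gets 1 + log 3.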

Then we extract and generate language patterns, i.e. terms and phrases that are often used to convey motivational meanings. We initialize the language pattern set T with the frequent terms ($tf_t \geq 2$) in the human-tagged sample data. On the one hand, we exclude from T the terms that appear in the artifact information set, since artifact components are discussed throughout the document and their related terms can hardly help to distinguish issue-bearing sentences. On the other hand, we add to T some general words that encode motivational statements, e.g. ''limitation'', ''disadvantage'' and ''need''. In order to get a general pattern set, we expand T by including their hyponyms from WordNet (http://wordnet.princeton.edu/), a lexical database of English. This process can be repeated to enrich the language pattern set as more training data become available.

Next, we define the semantic sentence graph G(X, W) to model sentence relations by integrating issue language patterns. G(X, W) is built with the sentences in X as nodes. The link weights, defined by a matrix W, indicate semantic sentence similarity, where $w(x_i, x_j)$ is the issue-oriented sentence relevance between sentences $x_i$ and $x_j$, as shown in formulas (6) and (7). It measures how strongly two sentences share issue language patterns. We let $w_{ii} = 0$ to avoid loops in the graph in the later steps. The similarity matrix W is symmetrically normalized by $M = Q^{-1/2} W Q^{-1/2}$, where Q is the diagonal matrix whose entry $q_{ii}$ is equal to the sum of the ith row of W.

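$$ w(x_i, x_j) = \begin{cases} rs(x_i \cap T) + rs(x_j \cap T), & x_i \cap T \neq \emptyset \text{ and } x_j \cap T \neq \emptyset \\ 0, & \text{otherwise} \end{cases} \tag{6} $$

$$ rs(x_i \cap T) = \sum_{t_k \in x_i \cap T} I(t_k), \qquad I(t_k) = -\log \frac{tw_k}{\sum_{t} tw_t}. \tag{7} $$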
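The construction of W in formulas (6) and (7) and its symmetric normalization can be sketched as follows. This is an illustrative reading only: the denominator in formula (7) is assumed to sum the weights of all terms in the document, rows with zero sum are left as zeros to avoid division by zero, and the names `sentence_graph` and `normalize` are hypothetical.

```python
import math


def sentence_graph(sentences, patterns, tw):
    """Build W: w(x_i, x_j) = rs(x_i) + rs(x_j) when both sentences hit the pattern set T."""
    total = sum(tw.values())  # assumption: the sum in formula (7) runs over all terms
    rs = []
    for sent in sentences:
        hits = set(sent) & patterns
        rs.append(sum(-math.log(tw[t] / total) for t in hits) if hits else None)
    n = len(sentences)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and rs[i] is not None and rs[j] is not None:
                W[i][j] = rs[i] + rs[j]  # diagonal stays 0 to avoid self-loops
    return W


def normalize(W):
    """Symmetric normalization M = Q^{-1/2} W Q^{-1/2}, with q_ii the ith row sum."""
    q = [sum(row) for row in W]
    n = len(W)
    return [
        [W[i][j] / math.sqrt(q[i] * q[j]) if q[i] > 0 and q[j] > 0 else 0.0
         for j in range(n)]
        for i in range(n)
    ]


sents = [["battery", "limitation"], ["need", "cheaper", "battery"], ["housing", "plastic"]]
T = {"limitation", "need"}
tw = {"battery": 2.0, "limitation": 1.0, "need": 1.0,
      "cheaper": 1.0, "housing": 1.0, "plastic": 1.0}
W = sentence_graph(sents, T, tw)
M = normalize(W)
```

Only the first two sentences contain a pattern term, so they are linked to each other and the third sentence stays isolated.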

Then, based on the semantic sentence graph, we improve the manifold-ranking method [37] to rank the sentences using sentence scores and sentence relationships in the graph. Our prior assumptions are: (1) if a sentence contains issue language patterns, it is likely to have a higher score; and (2) nearby sentences are likely to have the same ranking scores. In our semantic sentence graph, we also introduce a vector $y = [y_0, y_1, \ldots, y_n]^T$, where $y_i$ represents whether a sentence $x_i$ is similar to any issue language patterns. For each $x_i$, we set $y_i = 1$ if $x_i$ contains a term in the language pattern set T; otherwise, $y_i = 0$. We denote a vector $f = [f_0, f_1, \ldots, f_n]^T$, where $f_i$ denotes the ranking score for sentence $x_i$, and $f(r)$ represents the sentence score vector f in the rth iteration. For the initial iteration, we set $f_i(0) = 1/n$.

$$ f(r + 1) = \eta M f(r) + (1 - \eta)\, y. \tag{8} $$
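The update in formula (8) is a simple fixed-point iteration. The sketch below runs it on a normalized matrix M given as nested lists. The value `eta=0.6` is an arbitrary assumption (the paper treats η as a tunable coefficient), the stopping tolerance follows the 0.001 threshold used elsewhere in the paper, and `manifold_rank` is a hypothetical name.

```python
def manifold_rank(M, y, eta=0.6, tol=1e-3, max_iter=1000):
    """Iterate f(r+1) = eta * M f(r) + (1 - eta) * y from f_i(0) = 1/n."""
    n = len(y)
    f = [1.0 / n] * n
    for _ in range(max_iter):
        new = [
            eta * sum(M[i][j] * f[j] for j in range(n)) + (1 - eta) * y[i]
            for i in range(n)
        ]
        diff = sum(abs(a - b) for a, b in zip(new, f))
        f = new
        if diff < tol:  # summed change between successive iterations
            break
    return f


# Two mutually linked pattern-bearing sentences and one isolated sentence.
M = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
y = [1, 1, 0]
f = manifold_rank(M, y)
```

The two linked sentences reinforce each other and approach a score of 1, while the isolated sentence without pattern terms drops to 0.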

After the parameters have been initialized, the ranking process iteratively updates each $f_i$ in the vector f at each iteration r based on the ranking function (8). η is the coefficient (0 ≤ η ≤ 1). The ranking process stops when the summed difference of f between two successive iterations is lower than a given threshold, 0.001 in our study. Finally, the sentences are sorted based on their final scores $f^* = [f_0^*, f_1^*, \ldots, f_n^*]^T$ and the top k sentences are extracted as the issue summarization I of the document D. They are presented to users according to their order in D.

3.2.3. Discovery of solution and reason pairs

The solution–reason pair approach focuses on discovering solutions, which address the said issues, as well as reasons, which include effects of the solutions and arguments for why the solutions are proposed. We have noticed that solution sentences often have different patterns, but reasons for a solution are usually expressed by causality sentences. In the literature, approaches for causality identification have been reported, including methods based on cue terms and classification [38,39]. However, few of them touch on the topic of solution and reason extraction as in our case. To discover the solution–reason pairs related to an issue, we first separate the sentences into a candidate reason-bearing set and a candidate solution-bearing set. Next, we direct the solution and reason sentences to the top by using the ranking process and propagating the information from the candidate reason sentences to the candidate solution sentences.

Fig. 4. Algorithm for generating solution and reason sentences.

The algorithm for solutions and reasons discovery is shown in Fig. 4. Firstly, a document D is preprocessed into a sentence set $X = \{x_1, x_2, \ldots, x_n\}$ and term weights are calculated. Next, in order to identify reason language patterns, we apply the process of language pattern generation addressed in Algorithm 2 to the reason-bearing sample data. It thus forms the reason-bearing pattern set C.

Secondly, we separate the sentence set X into two non-overlapping subsets. One set is the candidate reason sentence set $G_e = \{x_{e0}, x_{ei}, \ldots, x_{ep}\}$, where the sentences contain one or more words in the reason-bearing pattern set C. The other set is the candidate solution sentence set $G_s = X - G_e = \{x_{s0}, x_{sj}, \ldots, x_{sq}\}$. Next, a matrix CM is built to model the connections between $G_e$ and $G_s$. We assume that if sentence $x_{ei}$ has a high correlation with sentence $x_{sj}$, they are likely to be a solution–reason pair. The matrix CM is defined by positional distances between $G_e$ and $G_s$, where $\mathrm{position}(x_{ei})$ is the absolute position of sentence $x_{ei}$ in the document D, as shown in formula (9):

$$ CM_{ij} = \frac{1}{\left| \mathrm{position}(x_{ei}) - \mathrm{position}(x_{sj}) \right|}, \quad i = 1, 2, \ldots, p; \quad j = 1, 2, \ldots, q. \tag{9} $$
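The split into candidate sets and the connection matrix of formula (9) can be sketched as below. It assumes that a sentence's position is its index in the document and that the distance in formula (9) is taken as an absolute value; the names `split_candidates` and `connection_matrix` are hypothetical.

```python
def split_candidates(sentences, reason_patterns):
    """Return index lists (G_e, G_s): reason candidates hit the pattern set C, the rest are solution candidates."""
    ge = [i for i, sent in enumerate(sentences) if set(sent) & reason_patterns]
    ge_set = set(ge)
    gs = [i for i in range(len(sentences)) if i not in ge_set]
    return ge, gs


def connection_matrix(ge, gs):
    """CM[i][j] = 1 / |position(x_ei) - position(x_sj)|; the sets are disjoint, so the distance is never 0."""
    return [[1.0 / abs(e - s) for s in gs] for e in ge]


doc = [
    ["spring", "is", "added"],
    ["because", "friction", "drops"],
    ["cover", "is", "sealed"],
    ["housing"],
]
C = {"because", "therefore"}
ge, gs = split_candidates(doc, C)
CM = connection_matrix(ge, gs)
```

Adjacent sentences get the strongest connection (1.0), and the value decays with the number of sentences separating a reason candidate from a solution candidate.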

Next, we use the sentence ranking process discussed in Algorithm 2 to assign scores to sentences in Ge and Gs respectively. We first set the initial parameters for this ranking process. For Ge , we define a sentence score vector fe = [fe0 , fei , . . . , fep ]T where fei = 1/p. Then we introduce a vector ye = [ye0 , yei , . . . , yep ]T to represent how strongly a candidate reason sentence shares reason language patterns. Each yei is set with the number of terms that occur in both sentence xei and language patterns set C . Then ye is normalized. For Gs , we define the sentence score vector fs = [fs0 , fsj , . . . , fsq ]T , where fsj = 1/q. Then we define a vector ys = [ys0 , ysj , . . . , ysq ]T , where ysj is equal to the cosine similarity

922

Y. Liang et al. / Computer-Aided Design 44 (2012) 916–930

4. An ISAL-based DR discovery, retrieval and management framework

Fig. 5. The information propagation from reason sentence set Ge to solution sentence set Gs . (A black node represents a sentence xei in reason sentence set Ge , while a gray node denotes a sentence xsj in solution sentence set Gs . The dotted links between Ge and Gs indicate connections between sentences in Ge and Gs . This connection is defined by a matrix CM . If CM ij > thc , which means a connection between xei and xsj , then the sentence score ei of xei will be propagated to nodes in Gs through xsj . The sentence score ei will decrease as it travels through nodes in Gs . The lines with arrowheads show an example of a travel path.)

between sentence xsj and the issue summarization I of D. Here, we assume that the solution sentences should be relevant to the issue. Given the initial parameters, the ranking process is performed to obtain the candidate sentence ranking score vectors fe∗ and fs∗ for Ge and Gs respectively. We assume that the scores of the candidate reason sentences will affect the possible solution and reason pairs. The candidate reason sentences with their ranking values can serve as a reference data set to transfer their information to the candidate solution sentences in Gs, as shown in Fig. 5. Therefore, we introduce an information propagation process to re-rank the sentences in Gs by propagating the correlation information from Ge to Gs. The propagation process is based on a metadata propagation algorithm [40]. Using the connection information, if CMij is larger than a threshold thc, the particle xei begins its journey to transfer its associated energy (score) ei = fei∗ from the node xsj to other nodes of Gs. The next node is selected according to the probability distribution over the outgoing edge weights, rand(out(xsj)). Each time an edge is traversed, the energy of the particle xei decays according to a propagation rate δ (0 ≤ δ ≤ 1) over the step st, which is the number of nodes that xei has traveled, as shown in formula (10).

$$ e_i(st + 1) = (1 - \delta)\, e_i(st). \tag{10} $$

The energy value of a particle defines how much influence the ranking score of the particle has on a visited node through the connection. Each time a particle xei traverses a node xsj in Gs, it increments the ranking score fsj∗ of node xsj with its current energy value ei, as shown in formula (11):

$$ f_{sj}^{*} = \begin{cases} f_{sj}^{*} + e_i, & \text{if } x_{ei} \text{ traverses node } x_{sj} \\ f_{sj}^{*}, & \text{otherwise.} \end{cases} \tag{11} $$
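The propagation described by formulas (10) and (11) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `propagate`, the dictionary-based graph layout, the default threshold, the `min_energy` cutoff and the `max_steps` guard are assumptions added for illustration (with 0 < δ < 1 the energy never reaches exactly zero, so some cutoff is needed in practice). The sketch also assumes that the entry node xsj is credited before the first edge is traversed.

```python
import random


def propagate(reason_scores, cm, out_edges, solution_scores,
              delta=0.6, th_c=0.5, min_energy=1e-3, max_steps=1000, seed=0):
    """Re-rank solution sentences by propagating reason-sentence scores.

    reason_scores[i]   -- ranking score of reason sentence i (initial energy e_i)
    cm[i][j]           -- connection strength CM_ij between reason i and solution j
    out_edges[j]       -- {k: weight} outgoing edges of solution node j in Gs
    solution_scores[j] -- ranking score of solution sentence j before propagation
    """
    rng = random.Random(seed)          # fixed seed: an assumption, for repeatability
    scores = list(solution_scores)
    for i, start_energy in enumerate(reason_scores):
        for j, strength in enumerate(cm[i]):
            if strength <= th_c:       # no connection between reason i and solution j
                continue
            node, energy = j, start_energy
            for _ in range(max_steps):  # guard against endless walks (assumption)
                if energy <= min_energy:
                    break
                scores[node] += energy            # formula (11)
                targets = out_edges.get(node, {})
                if not targets:                   # no outgoing edges: journey stops
                    break
                nodes = list(targets)
                weights = [targets[t] for t in nodes]
                # next node drawn according to the outgoing edge weights
                node = rng.choices(nodes, weights=weights, k=1)[0]
                energy *= 1.0 - delta             # formula (10)
    return scores


# Toy chain 0 -> 1 -> 2 with a single reason sentence connected to node 0.
chain = {0: {1: 1.0}, 1: {2: 1.0}}
print(propagate([1.0], [[1.0, 0.0, 0.0]], chain, [0.0, 0.0, 0.0], delta=0.5))
```

On the toy chain the particle deposits its full energy on the entry node and a decayed share on each node it reaches afterwards, which is the behaviour sketched by the arrowed path in Fig. 5.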

The propagation journey of a particle xei stops at the current node xsk if xsk has no outgoing edges or ei = 0. After the information propagation process, fsj∗ is obtained as the final score for the candidate solution sentence xsj. After we have obtained the final sentence score vectors fe∗ and fs∗, the final step is to select ks solution sentences from the solution sentence set Gs and ke reason sentences from the reason sentence set Ge as the solution and reason sentences S of the document D. We obtain ks and ke according to formula (12).

$$ k_s = \mathrm{round}(qk/n); \quad k_e = k - k_s. \tag{12} $$

Based on our ISAL representation model and the aforementioned algorithm design, we propose a framework for DR discovery, retrieval and management, as shown in Fig. 6. This framework consists of two modules, i.e. a DR information organization module and a DR search and retrieval module. The DR information organization module aims to capture and secure DR information from e-design documents by two basic processes: a DR discovery process and a manual DR annotation process. The DR discovery process extracts DR information from digital design documents using the approaches proposed in Section 3.2. It can also suggest tags of issues, solutions and artifacts for ongoing design documents. In this context, designers only need to confirm the discovered DR or correct it if necessary. In addition to the automatic approach, our ISAL model can support manual DR annotation and capture. Designers can manually highlight the relevant segments of DR and save them in the DR repository.

Based on the DR repository, the DR search and retrieval module is designed to facilitate designers' search for DR information. The query process can handle queries in relation to multiple aspects, i.e. issues, solutions and artifacts. The expanded query terms are matched in the context of the DR repository and then the relevant rationale information is retrieved. Furthermore, in order to help designers better understand design information, the retrieved DRs are integrated based on their connections. One way to connect rationales is to measure the relevancy between the retrieved DRs (DRi, DRj) based on the ISAL structure. It can be calculated from the aggregate similarity of their corresponding layers, i.e. the similarity between issues (Ii, Ij), the similarity between design solutions (Si, Sj) and the similarity between artifacts (Ai, Aj). Examples of DR extracted using ISAL, including a retrieval example, are presented in Section 6.

5. Experimental study, results and discussion

5.1. Experimental setup

In our study, we use patent documents as our research data. Unlike internal design documents, e.g. design reports, which are confidential, patent documents are a high-quality, openly accessible data source containing critical rationale information. We randomly collected 18 290 patent documents that were patented by Hewlett-Packard Company or Epson on the topic of inkjet printer design from the United States patent database as our research data. Among these 18 290 patents, we randomly selected 300 patents that are centered on inkjet printhead design. These 300 patents were manually tagged, and the tagging approach largely follows our previous work in building a manufacturing corpus intended for manufacturing knowledge discovery and management [41]. The tagged patents served as the baseline for assessing the effectiveness of the proposed algorithms, while the entire set of 18 290 patents was used mainly for scalability testing. Fig. 7 shows the distribution profile of document length in these two data sets, with 18 290 patents and 300 patents respectively.

As for the performance measurement, in artifact information extraction we compare the generated results with the human annotations in terms of precision, recall and F value. In the evaluation of the issue summarization and solution discovery approaches, we use the ROUGE-1 measurement [42]. It matches the unigram co-occurrences between the system-generated results and the human annotation data in terms of precision, recall and F value. ROUGE-1 is used since it has been shown to agree most closely with human judgment in document summarization [42]. In addition to the performance evaluations, we analyze the scalability of the proposed DR discovery approach, i.e. load scalability and functional scalability.
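The measures named above can be made concrete with a short Python sketch. It is an illustration under stated assumptions rather than the evaluation code used in the study: the function names are hypothetical, the F value is assumed to be the balanced harmonic mean of precision and recall (the figures reported in Table 1 are consistent with that form), and the ROUGE-1 part uses plain lowercase whitespace tokenization with clipped unigram counts, without the stemming or stop-word options of the official ROUGE toolkit.

```python
from collections import Counter


def precision_recall_f(n_overlap, n_reference, n_generated):
    """Return (precision, recall, F) from raw counts; F is the balanced F1."""
    recall = n_overlap / n_reference if n_reference else 0.0
    precision = n_overlap / n_generated if n_generated else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value


def rouge1(candidate, reference):
    """Unigram-overlap precision, recall and F between two texts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((cand & ref).values())
    return precision_recall_f(overlap, sum(ref.values()), sum(cand.values()))


# Counts taken from the DP extension (MTR) row of Table 1.
p, r, f = precision_recall_f(1688, 4770, 13480)
print(round(r, 4), round(p, 4), round(f, 4))

# Toy sentence pair, invented for illustration only.
print(rouge1("the heater resistor wastes energy",
             "low heater resistance wastes energy"))
```

Feeding the counts of any row of Table 1 into `precision_recall_f` should reproduce that row's recall, precision and F value up to rounding.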


Fig. 6. The ISAL-based framework of DR discovery, retrieval and management.

Fig. 7. Distribution percentage of our dataset.

5.2. Performance of DR discovery approaches

5.2.1. Performance of artifact information extraction

We first evaluate the four methods listed in Table 1 for artifact information extraction. They are our DP extension (MTR) using the mutual term relation, DP extension (PTR) using the positional term relation, DP and TextRank [35]. TextRank also uses the PageRank algorithm to score candidate terms. One difference between our DP extensions and TextRank is that in our approach the candidate terms are generated by our DP model, while TextRank uses syntactic filters, such as nouns and adjectives, to form the candidate terms as graph nodes. Another difference is that we model term relationships by two term relations at a sentence level, while in the TextRank approach the graph edges are formed if terms are within a window size, i.e. 2–10 words, and the edge weights are randomly assigned in the interval of 0–10.

For parameter tuning, we conduct several trials with combinations of σ = 2, 3, g = 0, 1, 2, λ = 0.85, and k = 10%, 20%, 30%, 40%. For comparison purposes, Table 1 reports the best results, which are selected based on the F value and the equivalent volume of terms generated. The second column shows the total number of terms annotated by experts, which is 4770 in the sample data. The third column shows that the four approaches generated an equivalent volume of terms, around 13 400 each. The fourth column shows the number of overlapping terms between human annotation and system generation.
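As a rough illustration of the ranking step that the graph-based methods share, the sketch below runs a weighted PageRank-style update over a small term graph. It is a generic TextRank-style scorer, not the DP extension itself: the function name, the toy terms and the edge weights are hypothetical, and treating λ = 0.85 as the damping factor is an assumption, since this section does not define λ.

```python
def rank_terms(edges, damping=0.85, iterations=50):
    """Score terms on an undirected weighted graph with a PageRank-style update.

    edges -- {(term_a, term_b): weight}; how the weights are derived
             (co-occurrence window, MTR, PTR, ...) is outside this sketch.
    """
    neighbours = {}
    for (a, b), weight in edges.items():
        neighbours.setdefault(a, {})[b] = weight
        neighbours.setdefault(b, {})[a] = weight

    scores = {term: 1.0 for term in neighbours}
    for _ in range(iterations):
        updated = {}
        for term, linked in neighbours.items():
            # Each neighbour passes on its score in proportion to the edge weight.
            received = sum(
                weight / sum(neighbours[other].values()) * scores[other]
                for other, weight in linked.items()
            )
            updated[term] = (1.0 - damping) + damping * received
        scores = updated
    return scores


# Toy graph: "printhead" is related to three other candidate terms.
toy_edges = {
    ("printhead", "nozzle"): 1.0,
    ("printhead", "heater resistor"): 1.0,
    ("printhead", "firing chamber"): 1.0,
}
toy_scores = rank_terms(toy_edges)
print(sorted(toy_scores, key=toy_scores.get, reverse=True))
```

In such a scheme the choice of graph nodes and edge weights is what distinguishes one method from another, which is where the candidate terms from the DP model and the two term relations enter in the approach described above.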


Table 1
The performance of different approaches for artifact information extraction. (MTR denotes the mutual term relation using pair mutual information. PTR denotes the positional term relation.)

Approach            | # of terms tagged | # of terms generated | # of overlapping terms | Recall | Precision | F value
DP extension (MTR)  | 4770              | 13 480               | 1688                   | 0.3539 | 0.1252    | 0.1850
DP extension (PTR)  | 4770              | 13 480               | 1606                   | 0.3367 | 0.1191    | 0.1760
DP                  | 4770              | 13 315               | 214                    | 0.0449 | 0.0161    | 0.0237
TextRank            | 4770              | 13 754               | 540                    | 0.1132 | 0.0393    | 0.0583

Table 2
The ROUGE-1 values of different approaches for issue summarization.

Method             | Recall | Precision | F value
Our approach       | 0.5172 | 0.5217    | 0.5195
Baseline           | 0.1638 | 0.2111    | 0.1845
Similarity-ranking | 0.2845 | 0.4342    | 0.3438

Our first observation is that, among the four methods, much better performance was achieved when the ranking process was used (i.e. DP extension (MTR), DP extension (PTR) and TextRank with 0.1850, 0.1760 and 0.0583 in terms of F value respectively), compared to using term frequency to score the terms (i.e. DP with 0.0237 in F value). Secondly, when the ranking process is used, our DP extension methods, both MTR and PTR, achieved better performance than TextRank. Our DP extension approaches recover over 1000 more annotated terms than TextRank does. The recall values of our approaches using MTR and PTR are 0.3539 and 0.3367 respectively, which are more than 20 percentage points higher than TextRank's 0.1132. In terms of precision, our approaches obtain 0.1252 and 0.1191 by using MTR and PTR respectively, which are about 8 percentage points higher than TextRank's 0.0393. This indicates that the two term relations defined in our DP extension approaches model term relations better for the ranking process. Thirdly, of the two DP extension approaches, MTR performed slightly better than PTR in artifact information extraction.

5.2.2. Performance of issue summarization

Our next experiment evaluates the performance of our approach based on the semantic sentence graph for the issue summarization task. For comparison purposes, we implement two other methods, i.e. the baseline method and the similarity-ranking method. The baseline method takes the first k sentences of a document as the issue summarization. The similarity-ranking method also uses the manifold-ranking algorithm to score sentences. However, it uses a different sentence graph, in which the sentence similarity is calculated based on the vector space model and the vector y is not defined. Our approach differs from the similarity-ranking method in taking advantage of issue language patterns and the semantic sentence graph for the ranking process.

The aim of the first experiment is to compare the proposed approach with the other two approaches. Table 2 presents their ROUGE-1 results. In the parameter setting, we set η = 0.9 and k = 3. In addition, we chose to use paragraphs as nodes in the semantic sentence graph, in order to minimize the bias of the scoring process towards long sentences in a patent document. Our first observation is that the proposed approach outperformed the baseline and the similarity-ranking methods. The overall performance of our approach in F value is 0.5195, which is about 33 percentage points higher than the baseline's 0.1845 and 17 percentage points higher than the similarity-ranking method's 0.3438. In terms of precision, our approach achieved 0.5217, which is about 9 percentage points higher than the similarity-ranking method's 0.4342. In terms of recall, our approach obtained 0.5172, an increase of around 23 percentage points over the similarity-ranking method's 0.2845. This shows that our approach manages to retain sentences that are more relevant to the human annotation results.

Fig. 8. The ROUGE-1 F value of semantic similarity and cosine similarity under different setting of η.
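The role of η examined in Fig. 8 can be sketched in a few lines of Python. The sketch assumes that the ranking function has the usual manifold-ranking form f ← ηSf + (1 − η)y, which is consistent with the description that η = 0 uses only the vector y and η = 1 uses only the link weights; Eq. (8) itself is not reproduced here and may differ in detail. The symmetric normalization of the weight matrix, the fixed number of iterations, the function name and the toy weights are all assumptions for illustration.

```python
def manifold_rank(weights, y, eta=0.9, iterations=100):
    """Iterate f <- eta * S f + (1 - eta) * y on a weighted sentence graph.

    weights -- symmetric matrix of link weights between sentences (zero diagonal)
    y       -- prior score vector (e.g. derived from issue language patterns)
    """
    n = len(y)
    degree = [sum(row) for row in weights]
    # Symmetric normalization S = D^(-1/2) W D^(-1/2) (an assumption here).
    s = [[weights[i][j] / (degree[i] * degree[j]) ** 0.5
          if degree[i] > 0 and degree[j] > 0 else 0.0
          for j in range(n)] for i in range(n)]
    f = [1.0] * n   # uniform start, so that eta = 1 still has something to spread
    for _ in range(iterations):
        f = [eta * sum(s[i][j] * f[j] for j in range(n)) + (1.0 - eta) * y[i]
             for i in range(n)]
    return f


# Toy graph: sentence 0 matches an issue pattern; sentence 1 is strongly
# linked to it, sentence 2 only weakly.
w = [[0.0, 1.0, 0.1],
     [1.0, 0.0, 0.1],
     [0.1, 0.1, 0.0]]
prior = [1.0, 0.0, 0.0]
print(manifold_rank(w, prior, eta=0.0))   # only the prior y is used
print(manifold_rank(w, prior, eta=0.9))   # prior spread along the link weights
```

With η between 0 and 1 a sentence that is strongly linked to a pattern-bearing sentence inherits part of its score, which is the effect the combination of link weights and y is meant to capture.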

We then further examine how the parameter η and the vector y in the ranking function allow the information in the semantic graph to help score sentences. We compare the performance of two link-weighting schemes combined with the vector y for the ranking process. One link-weighting scheme is our semantic similarity defined in Section 3.2.2. The other uses cosine similarity as link weights. The parameter η in the ranking function, i.e. Eq. (8), tunes the proportion of combination between the link weights and the vector y. Fig. 8 shows the ROUGE-1 F values of this experiment. When we set η = 0, which means only the values in y are used to score the sentences, the lowest F value, i.e. 0.0504, is generated. When we switch to η = 1, which means we omit the vector y and use only the link weights to score the sentences, both link-weighting schemes improve the results significantly compared with η = 0. In particular, when our semantic similarity is used, the F value reaches 0.5195, which is about 17 percentage points higher than the cosine similarity's 0.3438 (i.e. the similarity-ranking method). This reveals that our semantic similarity can better model the sentence relations based on their semantic meanings, which go beyond the lexical aspects. When the value of η is turned down from 1 towards 0, which means we gradually add the value of y to the ranking function, we notice that at η = 0.9 the F value of cosine similarity is 0.5086, which is about 16 percentage points higher than its result with η = 1. This suggests that integrating the vector y defined in our approach is likely to generate better results. Overall, the experiments indicate that our semantic graph can help to improve issue summarization by taking advantage of relationships between sentences as well as issue language patterns.

5.2.3. Performance of solution and reason identification

Our third experiment addresses the performance of three approaches for solution and reason identification.
For comparison, we also use the baseline method and the similarity-ranking method. We set k = 10 in this experiment. In the parameter setting, we set η = 0.9 according to the experiments on issue summarization, and the propagation rate δ = 0.6. Table 3 shows the overall performance for solution and reason pair discovery. We notice that our approach is able to generate better results than the other two approaches. In terms of overall performance, the F value of our approach is 0.5613, which is about 19 percentage points higher than the baseline's 0.3743 and the similarity-ranking method's 0.3784. The results indicate that by utilizing and integrating prior information, such as the artifact information and language patterns, our approach can help to suggest the possible boundary of the candidate solution and reason sentences. In addition, we notice that the similarity-ranking method obtained results similar to those of the baseline method. This indicates that the ranking process based on traditional cosine sentence similarity cannot reveal the solution and reason semantics of sentences well. When we compare our approach with the similarity-ranking method, we notice that our approach achieves better results, with 0.4095 and 0.8921 in terms of recall and precision respectively. Its recall value is about 15 percentage points higher than the similarity-ranking method's 0.2518 and its precision value is around 13 percentage points higher than the similarity-ranking method's 0.7615. This is largely because the semantic sentence graph and the propagation process are able to model the sentence relationships for suggesting the solution and reason sentences.

Table 3
The ROUGE-1 values of different approaches for solution and reason discovery.

Method             | Recall | Precision | F value
Our approach       | 0.4095 | 0.8921    | 0.5613
Baseline           | 0.2498 | 0.7462    | 0.3743
Similarity-ranking | 0.2518 | 0.7615    | 0.3784

5.3. Scalability

With the ever-increasing number of electronic design documents available and the need for information management for design reuse, there is concern about the ability of DR approaches to handle a large number of documents and the potential to expand the system functions. In this subsection, we discuss the load scalability and functional scalability of our approach. The load scalability refers to the capability of the proposed approach to handle different document lengths.
The functional scalability focuses on analyzing how our ISAL-based DR management framework can be extended by adding new functions.

5.3.1. Load scalability

We first analyze the load scalability of our approaches with respect to their worst-case computational complexity. Let N be the number of single terms in a document, n be the number of sentences, R be the number of iterations needed for convergence of the ranking process and F be the number of candidate terms. With respect to our algorithm for artifact information extraction, if we consider g = 0 as in our experiment, it has a space complexity of O(N^2 + F^2) to store the term positional information and the term relation matrix for its computation. In addition, it takes time O(N^2 + RF^2) to generate the artifact information. The issue summarization algorithm needs space O(n^2) to store the semantic sentence graph and takes time O(Rn^2) to score the sentences. As for the solution and reason discovery algorithm, it has a space complexity of O(n^2) to store the sentence graph information and a time complexity of O(Rn^2).

Secondly, to test the load scalability, we ran our DR discovery approach over 18 290 patents using high performance computing facilities (see footnote 2) hosted at the National University of Singapore. For each document, we recorded three processing times for our three subtasks, i.e. artifact information extraction, issue summarization and solution discovery respectively. We studied the relationship between processing time Tei and document length Li for each task. Since we do not have prior knowledge about this relationship, we adopt non-parametric regression to estimate the relationship between Tei and Li based on the experiment results. We assume that

$$ T_{ei} = m(L_i) + \varepsilon_i, $$

where εi is a noise term and m is a function of Li. In order to estimate the model, we apply the local polynomial modeling method [43]. Fig. 9 shows the estimated results, which illustrate the relationships between the sample Tei and Li, as well as the cumulative distribution percentage of document length. When we investigated the processing time of the artifact information extraction approaches in terms of document length, we observed that our DP extension approaches require much less time than TextRank as the number of words grows. It takes about 10 s to process a document with around 15 000 words, while the TextRank approach needs about 120 s. For issue summarization, the processing time of our approach grows linearly from about 1 s to around 20 s as the document length increases from 2500 words to about 15 000 words. In solution discovery, our approach needs around 5 s to 160 s to process documents when the document size ranges from 2500 words to 15 000 words.

While we observed that our approaches for issue summarization and solution discovery need more time to extract relevant information, we believe that the load scalability of our DR discovery is acceptable. Firstly, our approach manages to produce better results, as per the experimental results shown in Section 5.2. Secondly, the proposed DR discovery approach generates DRs in an almost linear processing time. We notice that about 80% of the documents contain fewer than 9000 words. When the document length is less than 9000 words, the processing time for extracting artifact, issue and solution information is less than 3 s, 8 s and 50 s respectively, which is about one-third of the corresponding processing time when the document length reaches its maximum in the data set, i.e. 15 000 words.

2 http://www.nus.edu.sg/comcen/HPC/index.html.

5.3.2. Functional scalability

In functional scalability, we analyze the possible ways to expand our proposed DR-based framework. Firstly, our DR repository can evolve with incremental design documents.
New rationale information will be incrementally added to the DR repository and connections between rationales will be updated accordingly. The second potential function is to provide DR information summarization for different document clusters based on different organizations and domains. By comparing DR summarizations in similar clusters, designers can identify the design focuses of their competitors. In addition, we can integrate our DR repository with other design documents, such as product family ontology and CAD files, which may contain other forms of information like graphics. One possible way is to get design documents semantically tagged based on our multifaceted ontology, such as structure ontology, functional ontology and manufacturing ontology [31]. Then, by mapping the artifact information to the structure ontology, the multifaceted ontology enables designers to associate product DR with functional and manufacturing information.

Fig. 9. Load scalability of DR discovery approaches over different document lengths and the cumulative distribution of document length. (a) Load scalability in terms of document length (artifact information extraction). (b) Load scalability in terms of document length (issue summarization). (c) Load scalability in terms of document length (solution–effect pair discovery).

6. An example using inkjet printhead

In order to illustrate our approach for DR discovery and the ISAL-based retrieval framework, we demonstrate example DRs extracted by our algorithm and a DR retrieval case study. Fig. 10 shows the DR extracted from a patent that focuses on a high print quality printhead. The issue layer indicates the motivations of a new design. For example, it includes the general requirement of higher quality printing from the market. It also indicates some detailed design considerations. An example is that using low heater resistance can lead to energy waste. In summary, the issue layer suggests that a redesign of heater resistors is needed to make a high quality printhead. The solution layer introduces more details of the design. For instance, in order to rapidly deposit ink dots on the medium, the heater resistors should be energized at a high rate. In addition, each switch circuit is connected to an address pad to allow the heater resistor to be fired. In general, the solution layer gives the information on how to select the heater resistors and how to connect them with other components. The artifact layer shows the components that are involved in the design. The component information is associated with the solutions where the components are discussed. In this case, it reveals that the firing chamber, firing resistor and heater resistor may well be the design focuses (e.g., key components or mechanisms) of this invention.

Fig. 10. An example of DR information extracted by our algorithm.

To demonstrate how professionals could benefit from the proposed DR retrieval scheme, we adopt as an example a prior art search on the issue of improving print quality in the redesign of an ink jet cartridge part. Although the example is simple at this stage, our intention is to reveal the merits of our research efforts in discovering in-depth design information and patterns that existing design information search and management systems cannot match, and to show the great potential to help design, especially when DRs, technical know-how and intellectual property claims are at the center of such activities, e.g., medical device design, by pushing the research frontier in this regard. Fig. 11 shows our initial interface design for DR retrieval. We have studied interface design for DR retrieval, since how to visualize the DR information and how to guide users in navigating DRs are further challenges in DR studies [44]. In Fig. 11, the main window allows users to navigate the DR space by year and company name. The search bar on the left side provides both basic search and rationale search. The basic search allows designers to perform traditional keyword-based search. We focus on rationale-based search here. It permits designers to search from three rationale facets, i.e. issue, design solution and artifact. In this scenario, we consider the following possible queries.

Issue query: improve print quality printhead
Artifact query: ink jet printer, printhead, cartridge

Fig. 11. Initial interface for DR search and retrieval.

Our DR retrieval module starts with query preprocessing. To form two query sets from the issue facet and the artifact facet respectively, the queries are separated into query terms, which are expanded based on their co-occurrence with other issue or artifact concepts. For example, "high drop generator density" and "high quality print output" may also be included in the issue query set, as they co-occurred with the given issue query in the DR repository. In the rationale facet retrieval, the query terms are matched against the issue and artifact facets respectively based on similarity measurement. Instead of being listed directly, the results are integrated and assigned to multiple clusters. The left side of Fig. 12 shows the result clusters based on product structure. By clicking into "ink cartridge", the relevant rationales are shown based on their issue similarity. For example, the first issue group includes documents that intend to "increase the ink drop" or "reduce the manufacturing cost" in order to "improve the high quality printhead". The second issue group shows another angle from which to tackle the problem, like "ink removal system". From the issue layer, the clustered results can help designers to figure out the issues that are of critical concern.

Fig. 12. Rationale clusters based on product structure and similar issues.

By navigating into a specific issue group, more detailed rationale information is shown. The left side of Fig. 13 illustrates some relevant documents in the first issue group. Under each document title, it provides an issue description snapshot. To look further into rationale details, designers can browse each document individually. The right side of Fig. 13 shows the artifact information that is related to the given query. The larger font sizes suggest the key components of relevant issues. Through rationale search, designers can navigate DR within a single design document or across multiple documents.

Fig. 13. Relevant document list with issue and artifact snapshot.

7. Discussions

Our DR approach has taken advantage of several timely research efforts in text mining, machine learning, information retrieval and text processing at large, and it is technically quite different from the traditional systems that rely on manual efforts in DR capture while design archives are often left intact. A few issues deserve our immediate attention, particularly those related to the technical strengths, merits and limitations of the current approach, and we hope that they shed light on some possible future research directions.

Being part of academic research, we always promote the practice of using research data that are accessible to the public. Our past attempts to obtain design texts from several brands in consumer electronics were either unsuccessful or did not yield a sufficient quantity. Unlike classified design documents, patents are quality data sources, openly accessible, and they contain critical rationales for resolving technical challenges and supporting design innovations. These are the primary reasons why we turn to patents. However, patent documents represent only one type of design document. They are written in polished language with comprehensive details. It would be interesting to explore how our algorithms perform over other forms of design texts, such as texts in design log books, and texts harvested from mobile devices and emails within design teams.

While we are using patent texts as the research data, we have not made use of any information related to the internal structure of patents in our algorithm design and research study. This differs greatly from research attempts that assume input DR data are formatted or annotated using HTML, XML or any other markup language or ad hoc semantic labels. While it is certainly helpful if the internal structure of a patent document, like the patent abstract and claims, can be segmented, annotating such a semantic structure is a non-trivial task if the input data are loaded as free text. In our study, we follow the general practice in the text processing community and regard free text, the most generic form of text representation, as the standard data input format for our algorithms to process.

Laying down DR identification and extraction from a single free text as our research foundation highlights several interesting future research extensions and application possibilities. One challenging example is to help professionals capture the evolution of a design from relevant patent and design documents for design analysis and innovation. In a snapshot, we consider this as an issue of a scanning window for observing how DR is developed along the way when design activities are actually carried out. Zooming in, within each single DR, we focus more on the solution–reason reasoning in restoring the logical development of a specific DR; zooming out, by taking into account the sequence of DRs (for example, timestamps and other sequential information), we help to recover the holistic view of DR development by identifying crucial clues such as the issues mentioned, the key artifacts involved and how these overlap.
For the time being, while we are focusing exclusively on mining DR from free texts, the aforementioned ISAL-based framework for DR discovery, retrieval and management can be easily extended to handle other elements like graphics and CAD model elements in design and design documents. While dealing with the graphic content in patents is widely acknowledged as a challenging task, partly because many graphic contents in patents are bmp files, our previous work [31,32] reveals that it is promising to count on a joint approach of information extraction and semantic annotation in building a semantic-based link between graphic contents and different elements in the ISAL model. In the end, such semantic relations, once automatically recognized and labeled, enable the search and retrieval of graphic contents embedded in design texts and CAD elements under the same framework.

Finally, one crucial factor, which is often neglected by design researchers, is the human factor issue in DR management. What would be a user-friendly interface, mechanism and process that facilitates the gathering, search, navigation, retrieval and analysis of DR through a desktop PC or a mobile device, on the go or based on dialog and conversation? We have initiated some preliminary work in this regard [44]. However, much more effort certainly needs to be invested. Otherwise, who would be willing to load potentially DR-rich design documents into the system and use it?

8. Conclusions

In this paper, we have focused on algorithm design for DR discovery and management from a large amount of digitized design documents with rich textual content. Our research efforts in algorithm design, i.e. artifact information extraction, issue summarization and solution–reason pair identification, are structured based on a computational DR model, ISAL, which was introduced in our previous study. Further experimental studies have been conducted to assess the performance of our proposed algorithms, including the scalability of our DR approach.


The experimental results show that among the graph-based methods for information extraction, our approach produces better overall performance on DR extraction, and new DRs can be incrementally appended to the DR repository. To show the merits of our DR process, we have given an example of DR information extracted using our approach and presented a case study of DR retrieval for cartridge redesign in ink jet. Discussions on relevant concerns have also been given, in order to highlight some future research possibilities.

Acknowledgment

The work described in this paper was supported by a research grant from the National University of Singapore (R-265-000-362133) and was partially supported by an open project of the State Key Lab of CAD&CG, Zhejiang University, China (Grant No: A1013).

References

[1] Szykman S, Sriram RD, Regli WC. The role of knowledge in next-generation product development systems. Journal of Computing and Information Science in Engineering 2001;1:3–11.
[2] Tomiyama T. Intelligent computer-aided design systems: past 20 years and future 20 years. Artificial Intelligence for Engineering Design, Analysis and Manufacturing 2007;21:27–9.
[3] Regli WC, Hu X, Atwood M, Sun W. A survey of design rationale systems: approaches, representation, capture and retrieval. Engineering with Computers 2000;16:209–35.
[4] Bracewell R, Wallace K, Moss M, Knott D. Capturing design rationale. Computer-Aided Design 2009;41:173–86.
[5] Burge JE, Brown DC. Software engineering using rationale. Journal of Systems and Software 2008;81:395–413.
[6] Burge JE. Design rationale: researching under uncertainty. Artificial Intelligence for Engineering Design, Analysis and Manufacturing 2008;22:311–24.
[7] Liu Y, Liang Y, Kwong CK, Lee WB. A new design rationale representation model for rationale mining. Journal of Computing and Information Science in Engineering 2010;10.
[8] Kunz W, Rittel HWJ. Issues as elements of information systems. Center for Planning and Development Research. University of California at Berkeley. 1970.
[9] Louridas P, Loucopoulos P. A generic model for reflective design. ACM Transactions on Software Engineering and Methodology 2000;9:199–237.
[10] Shum SJB, Selvin AM, Sierhuis M, Conklin J, Haley CB, Nuseibeh B. Hypermedia support for argumentation-based rationale. In: Dutoit Allen, McCall Raymond, Mistrík Ivan, Paech Barbara, editors. Rationale management in software engineering. Springer; 2006. p. 111–32.
[11] McCall RJ. PHI: a conceptual foundation for design hypermedia. Design Studies 1991;12:30–41.
[12] de Medeiros AP, Schwabe D. Kuaba approach: integrating formal semantics and design rationale representation to support design reuse. Artificial Intelligence for Engineering Design, Analysis and Manufacturing 2008;22:399–419.
[13] Moran TP, Carroll JM. Design rationale: concepts, techniques, and use. Mahwah (New Jersey): Lawrence Erlbaum Associates, Inc.; 1996.
[14] Chandrasekaran B, Goel A, Iwasaki Y. Functional representation as design rationale. Computer 1993;26:48–56.
[15] Myers KL, Zumel NB, Garcia P. Acquiring design rationale automatically. Artificial Intelligence for Engineering Design, Analysis and Manufacturing 2000;14:115–35.
[16] Tang A, Jin Y, Han J. A rationale-based architecture model for design traceability and reasoning. Journal of Systems and Software 2007;80:918–34.
[17] Yang MC, Wood III WH, Cutkosky MR. Design information retrieval: a thesauri-based approach for reuse of informal design information. Engineering with Computers 2005;21:177–92.
[18] McMahon C, Lowe A, Culley S, Corderoy M, Crossland R, Shah T, et al. Waypoint: an integrated search and retrieval system for engineering documents. Journal of Computing and Information Science in Engineering 2004;4:329–38.
[19] Liu Y, Loh HT, Sun A. Imbalanced text classification: a term weighting approach. Expert Systems with Applications 2009;36:690–701.
[20] Liu S, McMahon CA, Darlington MJ, Culley SJ, Wild PJ. A computational framework for retrieval of document fragments based on decomposition schemes in engineering information management. Advanced Engineering Informatics 2006;20:401–13.
[21] Li Z, Ramani K. Ontology-based design information extraction and retrieval. Artificial Intelligence for Engineering Design, Analysis and Manufacturing 2007;21:137–54.
[22] Jiao J, Zhang Y. Product portfolio identification based on association rule mining. Computer-Aided Design 2005;37:149–72.
[23] Song S, Dong A, Agogino A. Modeling information needs in engineering databases using tacit knowledge. Journal of Computing and Information Science in Engineering 2002;2:199–207.

[24] Tseng Y-H, Lin C-J, Lin Y-I. Text mining techniques for patent analysis. Information Processing & Management 2007;43:1216–47.
[25] Wanner L, Baeza-Yates R, Brügmann S, Codina J, Diallo B, Escorsa E, et al. Towards content-oriented patent document processing. World Patent Information 2008;30:21–33.
[26] Trappey AJC, Hsu F-C, Trappey CV, Lin C-I. Development of a patent document classification and search platform using a back-propagation network. Expert Systems with Applications 2006;31:755–65.
[27] Kim J-H, Choi K-S. Patent document categorization based on semantic structural information. Information Processing & Management 2007;43:1200–15.
[28] Chakrabarti AK, Dror I, Eakabuse N. Interorganizational transfer of knowledge: an analysis of patent citations of a defense firm. IEEE Transactions on Engineering Management 1993;40:91–4.
[29] Li Y-R, Wang L-H, Hong C-F. Extracting the significant-rare keywords for patent analysis. Expert Systems with Applications 2009;36:5200–4.
[30] Yoon B. On the development of a technology intelligence tool for identifying technology opportunity. Expert Systems with Applications 2008;35:124–35.
[31] Lim JSC, Liu Y, Lee WB. Multi-facet product information search and retrieval using semantically annotated product family ontology. Information Processing & Management 2009.
[32] Yu W, Liu Y. Automatic identification of semantic relationships for manufacturing information management. In: The 6th international conference on manufacturing research. ICMR08. Brunel University. 2008.
[33] Zhan J, Loh HT, Liu Y. Gather customer concerns from online product reviews: a text summarization approach. Expert Systems with Applications 2009;36:2107–15.
[34] Brin S, Page L. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems 1998;30:107–17.
[35] Mihalcea R, Tarau P. TextRank: bringing order into texts. In: Conference on empirical methods in natural language processing. 2004.
[36] Wan X, Yang J, Xiao J. Manifold-ranking based topic-focused multi-document summarization. In: Proceedings of the 20th international joint conference on artificial intelligence. Hyderabad (India): Morgan Kaufmann Publishers Inc.; 2007.
[37] Zhou D, Weston J, Gretton A, Bousquet O, Schölkopf B. Ranking on data manifolds. Advances in neural information processing systems, vol. 16. MIT Press; 2004.
[38] Chang D-S, Choi K-S. Incremental cue phrase learning and bootstrapping method for causality extraction using cue phrase and word pair probabilities. Information Processing & Management 2006;42:662–78.
[39] Cole SV, Royal MD, Valtorta MG, Huhns MN, Bowles JB. A lightweight tool for automatically extracting causal relationships from text. In: Proceedings of the IEEE SoutheastCon 2006. 2006. p. 125–9.
[40] Rodriguez MA, Bollen J, Sompel HVD. Automatic metadata generation using associative networks. ACM Transactions on Information Systems 2009;27:1–20.
[41] Liu Y, Loh HT. Corpus building for corporate knowledge discovery and management: a case study of manufacturing. In: Proceedings of the 11th international conference on knowledge-based and intelligent information & engineering systems. KES. Lecture notes in artificial intelligence. LNAI 4692. Vietri sul Mare (Italy). 2007. p. 542–50.
[42] Lin C-Y, Hovy E. Automatic evaluation of summaries using N-gram co-occurrence statistics. In: Proceedings of the 2003 conference of the North American chapter of the association for computational linguistics on human language technology. 2003. p. 71–8.
[43] Fan J, Gijbels I. Local polynomial modeling and its applications. London: Chapman and Hall; 1996.
[44] Liang Y, Liu Y, Lu WF, Lim SCJ. Interactive interface design for design rationale search and retrieval. In: ASME 2010 international design engineering technical conferences & computers and information in engineering conference. 2010.