Management and analysis of unstructured construction data types

Lucio Soibelman a,*, Jianfeng Wu a, Carlos Caldas b, Ioannis Brilakis c, Ken-Yu Lin d

a Department of Civil and Environmental Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA
b Department of Civil, Architecture and Environmental Engineering, University of Texas at Austin, Austin, TX 78712, USA
c Department of Civil and Environmental Engineering, University of Michigan, Ann Arbor, MI 48109, USA
d Department of Civil Engineering, National Taiwan University, Taipei City 115, Taiwan

Advanced Engineering Informatics 22 (2008) 15–27, doi:10.1016/j.aei.2007.08.011

Received 30 November 2006; received in revised form 6 July 2007; accepted 17 August 2007

* Corresponding author. Tel.: +1 412 268 2952; fax: +1 412 268 7813. E-mail addresses: [email protected] (L. Soibelman), [email protected] (J. Wu), [email protected] (C. Caldas), [email protected] (I. Brilakis), [email protected] (K.-Y. Lin).

Abstract

Compared with structured data sources that are usually stored and analyzed in spreadsheets, relational databases, and single data tables, unstructured construction data sources such as text documents, site images, web pages, and project schedules have been less intensively studied, owing to the additional challenges they pose in data preparation, representation, and analysis. In this paper, our vision for data management and mining that addresses these challenges is presented, together with related research results from previous work and our recent developments in data mining on text-based, web-based, image-based, and network-based construction databases.

© 2007 Elsevier Ltd. All rights reserved.

Keywords: Data mining; Construction; Information management

1. Introduction

Currently, the construction industry is experiencing explosive growth in its capability to both generate and collect data. Advances in scientific data collection, the new generation of sensor systems, radio frequency tags, construction extranets, bar codes, and computerization have generated a flood of data. Advances in data storage technology, such as faster, higher-capacity, and less expensive storage devices, better database management systems, and data warehousing technology, have allowed the transformation of this enormous amount of data into computerized database systems. As the construction industry adapts to new computer technologies in terms of hardware and software, computerized construction data are becoming more and more available. As a consequence, we have seen an explosive increase of data in both volume and complexity. Some research efforts have focused on developing data mining tools for construction data to discover novel and useful knowledge in support of decision making by project managers. However, the majority of such efforts focused on analyzing structured data sources, such as productivity records and cost estimates available in spreadsheets, simple relational databases, or single data tables. Unfortunately, the large majority of construction documents are available as complex unstructured data sources, such as text documents, digital images, web pages, and project schedules. Because of the intricacies of managing and mining this type of data, it has been less intensively studied, although it also carries important and abundant information from construction projects. In this paper, our vision for data management and analysis of unstructured construction data sources is first introduced, including its definition, importance, and related issues. Previous research efforts relevant to data analysis on various construction data sources are then reviewed. Our recent work on graphical analysis of network-based project schedules is also presented, and initial results from this ongoing research are demonstrated for further discussion.


2. Definition and vision

Traditionally, research in construction data analysis has focused on transactional databases, in which each object/instance is represented by a list of values for a given set of features in a data table or a spreadsheet. Comprehensive Knowledge Discovery in Databases (KDD) processes, as well as many statistical and machine learning algorithms, have been introduced and applied in previous research [1,2]. In reality, there are a variety of construction data sources besides those available in spreadsheets or data tables, including semi-structured or unstructured text documents such as contracts, specifications, change orders, requests for information, and meeting minutes; unstructured multimedia data such as 2D or 3D drawings, binary pictures, and audio and video files; and structured data for specific applications that need more complicated data structures for storage and analysis than single data tables, e.g., network-based schedules. Currently, a large percentage of project information is still kept in these sources, for two major reasons. First, the large numbers of features, or high dimensionalities, of such information make it inefficient to store and manage the related data, such as text files with hundreds of words or images with thousands of pixels, in transactional databases. Second, information contained in the original data formats, such as the graphical architecture of schedules and the texture information of images, could be lost if the data were transferred into a single data table or spreadsheet. This situation requires that KDD processes, originally developed for traditional databases, be adapted so that they are applicable to non-traditional construction data sources.

Compared with transactional construction databases, which have been analyzed in a number of existing research efforts, non-traditional construction data sources are relatively less studied for information retrieval or lessons learning, even though they contain important and abundant historical information from ongoing and previous projects. As a result, useful data can be buried in large volumes of records because of difficulties in extracting relevant files for various purposes in a timely manner, while much information-intensive historical data is left intact after projects are finished, without being further analyzed to improve practices in future projects. Three major challenges may explain this relative scarcity of related research. First, in addition to the data preparation operations of general KDD processes [3], special care must be taken to remove incorrect or irrelevant data from non-traditional construction data sources while ensuring the integrity and completeness of the original information. Second, extra effort is needed to reorganize project data, originally recorded in application-oriented formats, into analysis-friendly representations that support efficient and effective KDD implementation. Third, data analysis techniques, which were initially developed to identify patterns and trends in single tables or spreadsheets, should be adapted in order to learn useful knowledge from text-based, image-based, web-based, network-based, and other construction data sources.

In our vision, a complete framework should be designed, developed, and validated for data analysis on each particular type of non-traditional construction data source. Three essential parts should be included in such a framework. First, careful problem investigation should be done in order to understand the data analysis issues and related work in previous research. Second, detailed guidelines for special data preparation, representation, and analysis operations are needed to ensure the generality and applicability of the developed framework. Finally, scientific criteria and techniques have to be employed to evaluate the validity of the data analysis results and the feasibility of extending such a framework to the same or similar types of construction data sources.

3. Related research areas and previous work

In this section, previous research on the analysis of text-based, web-based, and image-based construction data is presented, reviewing work done by other researchers together with frameworks developed in our research group to support advanced data analysis on non-traditional construction data sources.

3.1. Text mining for managing project documents

In construction projects, a high percentage of project information is exchanged using text documents, including contracts, change orders, field reports, requests for information, and meeting minutes, among many others [4]. Management of these documents in model-based information systems, such as Industry Foundation Classes (IFC) based building information models, is a challenging task due to difficulties in establishing relations between such documents and project model objects. Manually building the desired connections is impractical, since these information systems typically store thousands of text documents and hundreds of project model objects. Current technologies used for project document management, such as project websites, document management systems, and project contract management systems, do not provide direct support for this integration. There are some critical issues involved, most of them due to the large number of documents and project model objects and to differences in vocabulary. Search engines based on term matching are available in many information systems used in construction. However, these tools have limitations in cases where multiple words share the same meaning, where words have multiple meanings, and where relevant documents do not contain the user-defined search terms [4].

Recent research has addressed some issues related to text data management. Information systems and algorithms were designed to improve document management [5–7]; controlled vocabularies (thesauri) were used to integrate heterogeneous data representations including text documents [8–10]; and various data analysis tools were applied to text data to create thesauri, extract hierarchical concepts, and group similar files for reusing past design information and construction knowledge [11–13].

In one of the previous studies conducted by our research group, a framework was devised to explore the linguistic features of text documents in order to automatically classify, rank, and associate them with objects in project models [4,14]. This framework involved several methods, as discussed in our vision for data analysis on non-traditional construction data sources:

– Special Data Preparation operations for text documents were identified, such as transferring text-based information into flat text files from their original formats, including word processors, spreadsheets, emails, and PDF files; removing irrelevant tags and punctuation in the original documents; removing stop words that are used too frequently to carry useful information for text analysis, such as articles, conjunctions, pronouns, and prepositions; and, finally, performing word stemming to remove prefixes and/or suffixes and group words that have the same conceptual meaning.

– The preprocessed text data were transformed into a specific Data Representation using a weighted frequency matrix A = {aij}, where aij was defined as the weight of word i in document j. Various weighting functions were investigated based on two empirical observations regarding text documents: (1) the more frequent a word is in a document, the more relevant it is to the topic of the document; and (2) the more frequent a word is throughout all documents in the collection, the more poorly it differentiates between documents. By selecting and applying appropriate weighting functions, project documents were represented as vectors in a multi-dimensional space. Query vectors could then be constructed to identify similar documents based on similarity measures such as Euclidean distance and the cosine between vectors (a small sketch of this representation follows the list).

– Data Analysis tools were applied to develop a framework for integrating text documents in model-based information systems. The main goal of this integration framework is to improve the identification and analysis of relevant project documents. The large number of objects in a project model and of text documents in construction projects makes the proposed automated integration framework desirable. It generates significant savings in the time and effort required to link all objects contained in a project model to all relevant documents generated during the project’s life cycle.
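A small sketch of this preprocessing and weighting, in Python, is given below. It is an illustration under assumed details (abbreviated stop-word list, no stemming, one common tf-idf style weighting), not the exact functions investigated in [4,14]; the documents and the query are hypothetical:

    import math
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "for"}  # abbreviated

    def tokenize(text):
        # Lowercase, strip punctuation, drop stop words (stemming omitted).
        words = [w.strip(".,;:()") for w in text.lower().split()]
        return [w for w in words if w and w not in STOP_WORDS]

    def vectorize(tokens, df, n_docs):
        # Weight a_ij grows with in-document frequency and shrinks with
        # collection-wide frequency, matching observations (1) and (2) above.
        tf = Counter(tokens)
        return {t: tf[t] * math.log(n_docs / df[t]) for t in tf if t in df}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    docs = [  # hypothetical project documents
        "Change order for the door hardware in opening 12.",
        "Meeting minutes: concrete pour schedule for level 2.",
        "RFI regarding the door frame dimensions.",
    ]
    tokenized = [tokenize(d) for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = [vectorize(toks, df, len(docs)) for toks in tokenized]

    query = vectorize(tokenize("door opening"), df, len(docs))
    for i, v in enumerate(vectors):
        print(f"doc {i}: cosine similarity = {cosine(query, v):.3f}")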

The proposed framework, called the Text Information Integration Methodology (TIIM), has the following main components: project model preparation, project documents preparation, classification, retrieval and ranking, and association. Fig. 1 displays a workflow diagram that includes the main steps of the text information integration methodology. The methodology is summarized in the following paragraphs.

Fig. 1. Text information integration methodology. The workflow proceeds from project documents preparation (create project document repository, pre-process documents, create a vector space model of the project documents) and project model preparation (create project model, define classification system, add references from project model objects to classes), through classification (create a document classifier for each class, classify all documents in the project document repository) and retrieval and ranking (select object, create query vector, identify all documents that have the same class as the selected object, calculate similarity between the query vector and the project document vectors, retrieve and rank documents), to association (review retrieved and ranked documents, associate project model objects and related documents).

Project documents preparation involves the identification of the project document repository and the implementation of the data preparation and data representation steps explained previously. The goal is to create a vector space model of the project document repository. Project model preparation involves the creation of the project model. TIIM requires that each object in the project model, such as an instance of an IFC entity (e.g., IfcDoor), be linked to a class (e.g., Openings) defined in an external classification source (e.g., CSI MasterFormat [15]).

During classification, document classifiers are created for each of the classes of the external classification source. Then, project documents are automatically classified according to each of these classes. Project document classification creates a higher semantic level that facilitates the identification of documents that should be linked to project model objects. This is one of the major characteristics of the proposed methodology and represents a major distinction from existing project document management approaches.

The proposed approach for retrieval and ranking combined the vector space model for information retrieval with pattern classification techniques to develop a specialized information retrieval and ranking method for the AEC/FM domain. Specialized information retrieval systems tend to perform better than traditional web search engines in domain-specific document collections. The vector space model was selected to represent documents because the resulting representation can be uniformly applied to both the classification and the retrieval and ranking components of the proposed integration methodology. In the retrieval and ranking phase, a query vector is constructed using terms extracted from the description of a selected project model object (e.g., Door). This vector and the class of the selected object (e.g., Openings) are used as input for retrieval and ranking. During retrieval and ranking, all project documents are analyzed, but a higher weight is given to documents belonging to the same class as the object.

Association is the last component of the text information integration methodology. After project documents have been retrieved and ranked, they are reviewed and a reference to the relevant documents is added to the project model objects. It is important to emphasize that the actual document is not added to the project model, just a reference to it. This facilitates future updates, saves storage space and computer memory, and helps the integration with existing information systems.

The proposed methodology uses both the project model object’s description terms and the project model object’s class as input for the data analysis process. In methods based solely on text search, only documents that contain the search terms can be located; documents that use synonyms would not be found. If only the project model object’s class were considered, some documents that belong to the object’s class but are not related to the object would be mistakenly retrieved. By analyzing all project model classes, but giving a higher weight to the project model object’s class, related documents in other classes can also be identified.

A prototype system, the Unstructured Data Integration System (UDIS), was developed to verify and validate the TIIM methodology. The prototype system is able to classify, retrieve, rank, and associate documents according to their relevance to project model objects [14]. Fig. 2 illustrates UDIS. The validation of the proposed methodology was based on the analysis of more than 25 construction databases and 30,000 electronic documents. Two measures were used for comparison: recall and precision. Precision represents the percentage of the retrieved documents that were related to the selected project model object, while recall indicates the percentage of the related documents that were actually retrieved. Statistical analyses were conducted using paired tests. A summary of these analyses is presented in this section, while detailed results are reported in another publication [14].

The UDIS prototype achieved better performance in both recall and precision than typical project contract management systems, project websites, and commercial information retrieval systems. The average precision measures were 46.33% for UDIS, 28.55% for the project contract management system, 29.20% for the project website, 31.45% for information retrieval system 1, and 29.58% for information retrieval system 2. The difference between the system that implements the TIIM model and each of the four other technologies was considered significant, as indicated by the low significance levels (q-values < 0.05) obtained from the statistical tests. The average recall measures were 66.86% for UDIS, 44.17% for the project contract management system, 31.06% for the project website, 41.64% for information retrieval system 1, and 31.70% for information retrieval system 2. Once again, this difference was considered significant according to the results of the paired tests (q-values < 0.001).

Some of the major advantages of the TIIM framework include: (1) it does not involve manual assignment of metadata to any text documents; (2) it does not require a controlled vocabulary, which would only be effective if adopted by all users of an information system; and (3) it provides semi-automated mechanisms to map text documents to project objects using their internal characteristics instead of user-defined search terms. The research results demonstrated that the TIIM framework provides an effective mechanism for the integration of project documents in model-based information systems. However, some aspects need to be addressed for the comprehensive implementation of this methodology in real working environments. These include access to distributed project databases, reuse of classifiers, use of different types of project documents, incorporation of version control mechanisms, management of document access privileges, and consideration of data privacy and security issues.
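The two evaluation measures used above can be computed directly; a minimal sketch with hypothetical document identifiers:

    def precision_recall(retrieved, relevant):
        # Precision: fraction of retrieved documents that are relevant.
        # Recall: fraction of relevant documents that are retrieved.
        retrieved, relevant = set(retrieved), set(relevant)
        hits = retrieved & relevant
        precision = len(hits) / len(retrieved) if retrieved else 0.0
        recall = len(hits) / len(relevant) if relevant else 0.0
        return precision, recall

    # 5 documents retrieved for a model object, 6 actually relevant (made up).
    p, r = precision_recall(retrieved=[1, 2, 3, 4, 5], relevant=[2, 3, 5, 8, 9, 10])
    print(f"precision = {p:.2f}, recall = {r:.2f}")  # precision = 0.60, recall = 0.50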


Fig. 2. Unstructured data integration system [14].

3.2. Text mining for ontology-based online A/E/C product information search

In addition to in-house document collections such as that of Section 3.1, online documents, especially those containing Architecture/Engineering/Construction (A/E/C) product information, have also received increasing attention as more and more manufacturers/suppliers host such information on the Internet. The web provides boundary-less information sources that industry practitioners can use to survey the A/E/C product market for product specification, selection, and procurement. Advances in material science research and new manufacturing processes allow an increasing number of better, safer, or more affordable products to be developed nationally and globally. Despite this formation of a world market and the potential of abundant online product information, no adequate approaches have been available for A/E/C industry practitioners to survey this virtual market, and concerns about the effectiveness of general search engines for retrieving domain-specific information have been raised [16]. Therefore, one of the studies within our research group established a domain-specific approach for this purpose and created a specialized A/E/C online product search engine to help industry practitioners collect product information from the virtual market.

In the study, some challenges related to searching for A/E/C product information online were similar to those encountered when dealing with an in-house project document collection. These challenges included the ambiguity of natural language (e.g., polysemy, synonymy, and structural ambiguity) and the lack of consistent product descriptions (e.g., definitions of product properties, domain lexicons for describing the products, and product information formatting) within A/E/C. Researchers [17] also pointed out that searching A/E/C data is difficult because numerous parties hold pieces of information in a variety of different formats and most information is not available for indexing due to its proprietary nature. Besides those issues, the enormous search space of the Internet, together with industry practitioners' varied ability to conduct effective queries, presented additional difficulties. If a query contained Boolean operations or nested structures, query formulation became even more tedious.

Domain knowledge represented in re-applicable formats provided great help in tackling the above challenges. Ontology, commonly used in knowledge engineering to represent the knowledge-level characterization of a domain, was therefore utilized in the reported study. The follow-up question was which formality of ontology should be adopted. In ontology's simplest form, taxonomies such as the popular construction information classification systems MasterFormat™ [15] and the OmniClass Construction Classification System (OCCS) [18] were defined at the industry level to cover all project objects. This would make the task of covering each satellite domain of A/E/C products very difficult, although other successful examples [4,19–24] of utilizing domain knowledge derived in the form of a taxonomy have been reported. Inspired by previously reported successes in the domain, such as [8,11], our study [25] used a thesaurus, a more elaborate form of ontology, to help organize domain concepts related to a given A/E/C product.


By doing so, the study was able to couple the represented knowledge with extended Boolean models in order to automatically formulate effective queries describing the product domain. As reported in [26], formulating effective extended Boolean queries with expertise in the query domain makes the extended Boolean model outperform the classic Boolean model and the vector space model. Thesaurus creation for the product domain was therefore a necessary step along the path to handling online product information.

The demonstration below explains how our vision for analyzing non-traditional construction data sources was applied in [25]. First, the product thesaurus was constructed following the data preparation, representation, and analysis procedures. Second, domain knowledge in the form of a thesaurus was utilized to support data preparation, representation, and analysis to identify highly relevant online product documents for a given product domain.

For the product thesaurus construction, the goal was to identify the terms highly relevant to, or frequently mentioned in, the product domain. To produce these terms in an unbiased manner, an automatic process was developed to extract them from a set of domain documents, derived by collecting existing product documents from identified or known product manufacturers.

– Data Preparation was first conducted by applying linguistic techniques such as removing stop words (e.g., at, on, of) and term conversion (e.g., 'insulating' being converted into its original form 'insulate' via the use of the external dictionary WordNet) to each individual term appearing in the collected documents.

– Data Representation was then conducted with the goal of representing each converted, remaining term in its most frequently derived form. During this step, the noun form was preferred over the adjective form, with the verb form being the least preferred, because nouns or noun phrases are most desirable when constructing a thesaurus [27]. For example, the term 'insulate' was replaced by 'insulation', and the statistical analysis of term occurrences in the next step was based on the term 'insulation'.

– Data Analysis was conducted according to the term-document frequency matrix, in which the frequency of each term in each collected document was recorded. When a term's occurrence passed a minimum threshold within a document, and this condition held true most of the time across the collected set of documents, the term was added to the list of candidate thesaurus terms. A searcher could then select from this list the terms that he/she considered appropriate to make up the product domain thesaurus (a small sketch of this candidate-term filter follows the list).
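A minimal sketch of that candidate-term filter; the thresholds and token lists are made up for illustration, and the values actually used in [25] are not reproduced here:

    from collections import Counter

    def candidate_terms(doc_tokens, min_count=2, min_doc_fraction=0.6):
        # Keep a term when its within-document count reaches min_count in at
        # least min_doc_fraction of the collected domain documents.
        n_docs = len(doc_tokens)
        passing = Counter()
        for tokens in doc_tokens:
            counts = Counter(tokens)
            passing.update(t for t, c in counts.items() if c >= min_count)
        return sorted(t for t, d in passing.items() if d / n_docs >= min_doc_fraction)

    docs = [  # hypothetical prepared manufacturer documents
        ["insulation", "panel", "roof", "insulation", "panel", "panel"],
        ["panel", "roof", "panel", "translucent", "roof"],
        ["insulation", "panel", "panel", "skylight", "roof", "roof"],
    ]
    print(candidate_terms(docs))  # ['panel', 'roof']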

After the domain knowledge was constructed in the form of a product thesaurus, the study [25] re-applied the data preparation, representation, and analysis steps from our vision to handle the remaining research challenges and identify relevant online product documents for surveying the product market.

– Data Preparation operations were conducted automatically at three levels. Level one leveraged the represented thesaurus and lexical variations of the query terms, with the support of an external dictionary (i.e., WordNet), to expand a single product query into multiple queries (the 'grow' strategy). In level one, word operations using linguistic techniques such as stop-word removal and word stemming, similar to those introduced in Section 3.1, were also performed. In level two, each expanded query was submitted to a general search engine, and the top-k documents returned for each expanded query were gathered to form a pool of highly relevant documents (the 'trim' strategy). As an example, after the first two levels of data preparation, the product query 'translucent roof panels' was expanded into eight queries and gathered a pool of 514 online product documents in one of the validation tests conducted in [25]. In level three, these documents were converted into flat text files and processed with word operations to remove document noise such as stop words and punctuation. By doing so, the developed approach was able to expand and then reduce the search space both efficiently and effectively, preparing a high-quality pool of possibly relevant online product documents.

– During Data Representation, for each expanded query, each prepared document was transformed into a document vector. The document vectors were assumed to exist in a multi-dimensional space whose dimensions were defined by the concepts from the expanded query under consideration. The values of the vector were defined by a weighted term-frequency matrix A = {aij}, where aij was the normalized weight of query term i in document j. Assume that a search query was composed of two terms, i1 and i2, and that the two normalized weights associated with i1 and i2 for document j were ai1j and ai2j, respectively. Then the document was considered to be the vector (ai1j, ai2j) in that space, as illustrated in Fig. 3. This representation was necessary and helpful for the application of the extended Boolean model during data analysis.

Fig. 3. Illustration of the document vector representation. A document point Doc (ai1j, ai2j) lies in the square spanned by the normalized weights for query terms i1 and i2, between Min (0, 0) and Max (1, 1); D1 is its distance to Max and D2 its distance from Min.


– Data Analysis was achieved in two stages. In the first stage, each document vector was evaluated against each expanded query, considering the thesaurus terms as well as the Boolean connectors residing within the query. Taking Fig. 3 as an example, the basic evaluation principles were applied as follows: (1) when the AND Boolean operator connected the two terms i1 and i2 in a given query, point (1, 1) became the most desirable location for a document point, so the closer (ai1j, ai2j) is to (1, 1), the more relevant the document is for the query; in this case, the distance D1 between (ai1j, ai2j) and (1, 1) was used to measure the dissimilarity between the document and the search query. (2) When the OR Boolean operator connected the two terms instead, point (0, 0) became the least desirable location for a document point, so the further (ai1j, ai2j) is from (0, 0), the more relevant the document is for the query; in that case, the distance D2 between (0, 0) and (ai1j, ai2j) was used to measure the similarity between the document and the search query. Special distance normalization considerations, as well as accommodation of complicated query structures (e.g., hierarchical queries), were also designed into the developed approach. In the second stage, a synthesis evaluation derived from data fusion techniques was calculated for each document vector, considering all the queries expanded with the use of the domain thesaurus.
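A minimal sketch of the stage-one AND/OR evaluation described above, using the standard quadratic (p = 2) extended Boolean form; the weights are hypothetical and the paper's exact normalization is not reproduced:

    import math

    def and_score(weights):
        # AND: closeness to the ideal point (1, ..., 1); this inverts the
        # distance D1 of Fig. 3 so that higher scores mean more relevant.
        n = len(weights)
        return 1 - math.sqrt(sum((1 - a) ** 2 for a in weights) / n)

    def or_score(weights):
        # OR: normalized distance D2 from the worst point (0, ..., 0).
        n = len(weights)
        return math.sqrt(sum(a ** 2 for a in weights) / n)

    # Hypothetical normalized weights (a_i1j, a_i2j) for three documents.
    for doc, w in {"doc1": (0.9, 0.8), "doc2": (0.9, 0.1), "doc3": (0.4, 0.5)}.items():
        print(doc, f"AND: {and_score(w):.3f}", f"OR: {or_score(w):.3f}")

Note how the AND score penalizes doc2, which nearly lacks one of the two terms, far more than the OR score does; this is exactly the intended contrast between the two connectors.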


Fig. 4. User interface for the developed prototype search engine [25].

As described, a semi-automated thesaurus generation method was developed in this research in order to supply domain knowledge for retrieving relevant A/E/C online product information. With the aid of the domain knowledge in the generated thesaurus, the enormous web space was successfully reduced to a manageable set of candidate online documents during data preparation. The use of a thesaurus also helped solve language ambiguity problems, so that synonyms like ''lift'' and ''elevator'' were treated interchangeably. For validation, five data sets originating from different A/E/C product domains were processed in [25] by a developed prototype to return a list of search results containing highly relevant online product information. The interface of the developed prototype is displayed in Fig. 4. To assess the true relevance of the returned search results, each web document was examined and labeled manually as 'relevant' or 'non-relevant'. When a document was considered 'relevant', the product manufacturers introduced by the document were also recorded. Using these labels and the recorded product manufacturer information, retrieval performance was evaluated. The performance evaluation focused particularly on the number of distinct product manufacturers identified, because the goal of the research was to help A/E/C industry practitioners obtain more product information from the available marketplace. Hence, for each data set, the numbers of distinct product manufacturers derived from (1) the search results generated by the established prototype, (2) the search results returned by Google, and (3) an existing catalog compiled by SWEETS were compared, as displayed in Table 1.

Table 1
Prototype performance comparisons for each of the five data sets [25]: number of distinct product manufacturers discovered

    Data set                1     2     3     4     5
    This research           8    12    28    26    30
    Searching via Google    3     7    19    17    36
    SWEETS                  5     1     2     6     1

It was observed that, in most cases, the developed research prototype retrieved more distinct product manufacturers than the other two approaches. Compared with approaches that enforce standardized data representation, the developed framework [25] has more flexibility in dealing with text-based and heterogeneous A/E/C product specifications on the Internet. The automatic query expansion process in the framework also relieves searchers of the burden of structuring complex queries using highly related terms. An industry practitioner can therefore take advantage of the developed A/E/C domain-specific search engine to survey the virtual product market and make more informed decisions.

3.3. Image reasoning for search of jobsite pictures

Pictures are valuable sources of accurate and compact project information [28].


It is becoming a common practice for jobsite images to be gathered periodically, stored in central databases, and utilized in project management tasks. However, the volume of images stored in an average project is increasing rapidly, making it difficult to browse such images manually. Moreover, the labeling systems used by digital equipment are tweaked to provide distinct labels for all acquired images to avoid overlaps, but they do not record any information regarding visual content. Consequently, images stored in central databases cannot be queried using conventional data and text search methods, and retrieval of jobsite images has so far depended on manual labeling and indexing by engineers.

Research efforts have addressed the image retrieval concerns described above. A prototype system, for example, was developed by Abudayyeh [29] based on an MS Access database that allowed engineers to manually link construction multimedia data, including images, with other items in an integrated database. The use of thesauri was also proposed by Kosovac et al. [8] to assist image indexing by enabling users to label images with specific standards. However, both solutions are difficult to use in practice due to the large number of manual operations needed to classify images in meaningful ways. Also, besides the manual operations still involved in both solutions, the issue of how to index images in different ways was not addressed. This issue is important because it is not unusual for an image to be used for multiple purposes, even if it was taken for just one use. For example, an engineer may have taken a picture of protective measures to prove their existence, but the same picture might contain other useful content needed in the future, such as materials or structural components in its background [28].

A novel construction site image retrieval methodology based on material recognition was developed to address these issues [28,30,31]. A similar framework was developed based on our vision for data analysis on non-traditional construction data sources (a preprocessing sketch follows the list):

– Data Preparation: construction site images were first separated into their basic elements (red, green, blue), and each element was separately denoised, contrast-enhanced, and sharpened. Denoising is necessary to remove random pixel values (salt-and-pepper noise) caused by imperfections of the camera, the lens, the CCD sensor, and the compression/storage algorithm (e.g., DCT, DWT); it was achieved by convolving each element with a low-sigma Gaussian. Contrast enhancement is necessary to take advantage of the entire color/intensity spectrum and was achieved by ''stretching'' the histogram of each element. Sharpening is necessary to separate the objects in the image more clearly and was achieved by convolving each element with a high-pass filter.

– Data Representation: image features were then extracted from the image elements. Specifically, the normalized colors, the intensity, and the texture response were isolated. Normalized colors are the percentage contribution of each color to the total (intensity); this normalization automatically removes brightness/illumination effects, which are unrelated to image content, such as the intensity of lights and shades. Intensity is the average of the elements. The texture response is the response of each element when convolved with a texture filter, such as a horizontal or diagonal line filter; in total, four line filters were used to cover all directionalities and two spot filters to cover unidirectional items. The preprocessed images were then divided into separate regions based on uniformity. Since most construction materials have relatively uniform color and texture, each identified region contained an object (or part of one) or a non-relevant item (e.g., sky, cloud). In this way, a variety of construction objects were isolated using region-growing-based clustering. The features of each region then represent the features of the object. These regional features were further compressed into quantifiable, compact, and accurate descriptors, also known as signatures, using statistical analysis. As a result, each image was transformed into several clusters, each with a list of values for a given set of feature signatures.

– Data Analysis: the feature signatures of each image region were compared with signatures in an image collection of material samples. Using proper thresholds and/or distance functions, the actual material of each region was recognized. An example is given in Fig. 5 of how a site image is decomposed into clusters, represented with signatures, and identified with the actual materials in each separated region.
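A compact sketch of the per-channel preparation steps in the first bullet, using NumPy and SciPy; the sigma, kernel, and scaling choices here are illustrative assumptions, not the published parameters:

    import numpy as np
    from scipy import ndimage

    def preprocess(image):
        # Split into red/green/blue elements; denoise, stretch, sharpen each.
        channels = []
        for c in range(3):
            band = image[..., c].astype(float)
            band = ndimage.gaussian_filter(band, sigma=0.8)  # low-sigma denoise
            lo, hi = band.min(), band.max()
            band = (band - lo) / (hi - lo + 1e-9) * 255.0    # histogram stretch
            sharpen = np.array([[-1, -1, -1],
                                [-1,  9, -1],
                                [-1, -1, -1]], float)        # high-pass kernel
            band = np.clip(ndimage.convolve(band, sharpen), 0, 255)
            channels.append(band)
        rgb = np.stack(channels, axis=-1)
        intensity = rgb.sum(axis=-1) + 1e-9
        normalized = rgb / intensity[..., None]  # removes brightness/illumination
        return rgb, normalized, intensity / 3.0

    # Random stand-in for a jobsite photograph (a real image would be loaded).
    rgb, norm, inten = preprocess(np.random.randint(0, 256, (64, 64, 3), np.uint8))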

The material information was then combined with other spatial, temporal, and design data to enable automated, fast, and accurate retrieval of relevant site images for various management tasks. In addition to image attributes (e.g., materials) that can be automatically extracted using the above method, time data can be obtained from digital cameras when images are taken on construction sites; 2D/3D locations can be obtained from positioning technologies; and as-planned and as-built information for construction products can be retrieved from a model-based system or a construction database. By combining all such information, images in the database can be compared with a query's criteria in order to filter out irrelevant images and rank the selected images according to their relevance. The results are then displayed on the screen, and the user can easily browse through a small number of sorted images to identify and choose the desired ones for specific tasks. Fig. 6 provides a detailed example of retrieving images of a brick wall using the developed prototype. This method was tested on a collection of more than a thousand images from several projects and evaluated with criteria similar to the recall and precision measures discussed in Section 3.1. The results showed that images can be successfully classified according to the construction materials visible within the image content [30].

Fig. 5. An example for image data preparation, representation, and analysis [31].

The original contribution of this research work on construction site image indexing and retrieval could be extended to solve other related problems. For example, our research group is currently investigating a new application of image processing, pattern recognition, and computer vision technologies for fully automated detection and classification of defects found in acquired images and video clips of underground pipelines. Due to the rapid aging of pipeline infrastructure systems, it is desirable to develop and use more effective methods for assessing the condition of pipelines, evaluating the level of deterioration, and determining the type and probability of defects, in order to facilitate decision making for maintenance/repair/rehabilitation strategies that are least disruptive, most cost-effective, and safe [32]. Currently, trained human operators are required to examine acquired visual data and identify the detailed defect class/subclass according to a standard pipeline condition grading system, such as the Pipeline Assessment and Certification Program (PACP) [33], which makes the process error-prone and labor-intensive. Building on our previous research efforts in image analysis [26] and former research efforts in detection and classification of pipe defects [34–36], we are studying more advanced data preparation, representation, and analysis solutions to boost automated detection and classification of pipe defects, based on improved problem investigation, feature selection, data representation, and multi-level classification of pipeline images.

Fig. 6. Image retrieval based on image contents and other attributes [30].

4. A recent development in scheduling data analysis

This section presents our recent work applying the same vision, which has been successfully applied to the analysis of unstructured text and image data, to the analysis of network-based project schedules. A preliminary study including problem investigation, data preparation, data representation, data analysis, and pattern evaluation is detailed below.

4.1. Problem investigation

Since computer software for Critical Path Method (CPM) based scheduling started to be used in the construction industry decades ago, the planning and schedule control history of previous projects has become increasingly available in computerized systems in many companies. However, there is currently a lack of appropriate data analysis tools for scheduling networks with thousands of activities and complicated dependencies among them. As a result, schedule history data are left intact without being further analyzed to improve future practices. At the same time, companies still rely on human planners to make critical decisions in various scheduling tasks manually and case by case. Such practices, however, are human-dependent and error-prone in many cases, due to the subjective, implicit, and incomplete scheduling knowledge that planners have to use for decision making within a limited timeframe.

Many research efforts have focused on improving current scheduling practices. Delay analysis tools were applied to identify fluctuations in project implementation and the parties responsible for delays, and to predict possible consequences [37,38]; machine learning tools like genetic algorithms (GA) were applied to explore optimal or near-optimal solutions for resource leveling and/or time–cost tradeoff problems by effectively searching only a small part of the large space of construction alternatives [39–41]; simulation and visualization methods were developed on domain-specific process models to identify and prevent potential problems during implementation [37,42,43]; and knowledge-based systems were introduced to collect, process, and represent knowledge from experts and other sources for automated generation and review of project schedules [44,45]. Such research efforts helped improve the efficiency and quality of scheduling work in many respects, but none of them addressed the 'data rich but knowledge scarce' issue described above; i.e., none has worked on learning explicit and objective knowledge from the abundant planning and schedule control history of previous projects. Our research is intended to address this issue of scheduling knowledge discovery from historical network-based schedules, so that more explicit and objective patterns can be identified from previous schedules in support of project planners' decision making. A case study was implemented on a project control database to identify special problems in preparing, representing, and analyzing project schedules for this purpose.

4.2. Data preparation

The project control database for this case study was collected from a large capital facility project. Three data tables were used in the scheduling analysis: one with detailed information about individual activities; one with a complete list of predecessor and successor dependencies among the activities; and a third with historical records about activities that were not completed as planned during implementation, together with the reasons for their non-completion. In this step, we first separated nearly 21,000 activities into about 2600 networks based on their connectivity. Another special data preparation task was the removal of redundant dependencies in networks [46]. As shown in Fig. 7, a dependency from activity X to activity Y is redundant if there is another activity Z such that Z succeeds X and Y succeeds Z, directly or indirectly. The dependency from X to Y is then unnecessary because Y cannot be started immediately after X anyway.
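A minimal sketch of this redundancy removal, using plain reachability tests (adequate for illustration, though not tuned for 21,000-activity data sets):

    from collections import defaultdict

    def remove_redundant(edges):
        # Drop X -> Y when Y is still reachable from X without that edge,
        # i.e., some Z succeeds X and Y succeeds Z, directly or indirectly.
        succ = defaultdict(set)
        for x, y in edges:
            succ[x].add(y)

        def reaches(a, b, skip):
            # Depth-first search from a to b, ignoring the edge under test.
            stack, seen = [a], set()
            while stack:
                n = stack.pop()
                for m in succ[n]:
                    if (n, m) == skip or m in seen:
                        continue
                    if m == b:
                        return True
                    seen.add(m)
                    stack.append(m)
            return False

        return [(x, y) for x, y in edges if not reaches(x, y, (x, y))]

    # X -> Y is redundant because the path X -> Z -> Y exists.
    print(remove_redundant([("X", "Y"), ("X", "Z"), ("Z", "Y")]))
    # [('X', 'Z'), ('Z', 'Y')]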

4.3. Data representation

The architecture of a scheduling network may greatly influence its risks and performance during implementation. A challenge here was to describe such networks in a concise and precise way, so that networks having similar architectures and performance could be abstracted into the same or similar descriptions. On the other hand, it would be difficult, if not impossible, to identify common patterns among all 2600 scheduling networks, with their varied sizes and complicated graphical architectures, by reviewing them visually or in their original formats. As with the solutions developed for text-based and image-based construction data sources, a general data representation is necessary to transform scheduling information into formats appropriate for the desired data analysis.

Intuitively, how a network splits into branches or converges from branches could be a good indicator of its reliability during implementation. A possible explanation for this intuition is that such splits or convergences generate or eliminate parallelism, if we assume that parallel implementation is a major cause of resource conflicts, and thus of variability during actual implementation. We used this intuition as the major hypothesis in this case study, and developed a data representation that describes scheduling networks in an abstract way, focusing on how a work flow passes through a network by splitting/converging and finally arriving at its end. In this study, we developed a novel abstraction process comprising two major stages (a sketch of one possible type encoding follows):

– Recursive division in a top-down manner: the network in Fig. 7 can be divided into two sequential sub-networks, N1/N2, such that N1 precedes N2 only through activity B, and N2 can be further divided into two parallel sub-networks, N21/N22, which have only the common starting task B and ending task C; such divisions continue until a sub-network eventually consists of only 'atomic' sequences of tasks without any branches (e.g., the three sequences from A to B).

– Type description in a bottom-up manner: in this stage, task identifications were removed, since we were concerned only with the abstract description of the network architecture. In the reverse direction of the recursive divisions, the type description of a network/sub-network is obtained by following the rules illustrated in Fig. 8.

Fig. 7. An example of recursive division of a scheduling network.
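Because the Fig. 8 rules are shown graphically, the following sketch commits to one possible encoding only: an atomic sequence is written '1', parallel branches are joined as 'n(...)', and sequential composition is chained with '->'. The nested-tuple input format is purely illustrative:

    def describe(node):
        # node is ("atom",), ("par", [subnetworks]) or ("seq", [subnetworks]).
        kind = node[0]
        if kind == "atom":
            return "1"  # a branch-free sequence of tasks
        parts = [describe(p) for p in node[1]]
        if kind == "par":
            return f"{len(parts)}({','.join(sorted(parts))})"
        return "->".join(parts)  # "seq"

    # The network of Fig. 7: three parallel atomic sequences from A to B (N1),
    # followed by two parallel sub-networks from B to C (N2 = N21/N22).
    n1 = ("par", [("atom",), ("atom",), ("atom",)])
    n2 = ("par", [("atom",), ("atom",)])
    print(describe(("seq", [n1, n2])))  # 3(1,1,1)->2(1,1)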


Fig. 8. Examples of type description rules.

4.4. Data analysis and primary pattern evaluations

Type descriptions for all 2600 networks in this study were generated using a computer program developed on a Matlab® 7.0 platform. By comparing the type descriptions of networks with their probabilities of non-conformance, i.e., of having non-completed tasks during implementation, some interesting patterns were found. For example, in this project, the non-conformance rate of networks that started with a single task (type description: '1 → n → …') was significantly lower than that of networks that started with multiple tasks (type description: 'n → 1 → …'). This pattern is simple to state, but it would have been hard to discover if the complicated network structures had not been abstracted into precise and concise descriptions. And it makes sense, considering that a schedule that starts with multiple parallel tasks could be more risky due to a lack of sufficient preparation by the managerial team at an early stage. Comparisons between the non-completion rates of activities at different positions also yielded meaningful results: the non-completion rate of tasks connecting sequential or parallel sub-networks, i.e., those with more than one preceding and/or succeeding task, was significantly higher than that of tasks within 'atomic' sequences, i.e., those with at most one preceding and one succeeding activity. This also verified our previous hypothesis, made during data representation, that how a network splits into branches or converges from branches could be a good indicator of the reliability of specific networks and activities during implementation.
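The comparison behind the first pattern takes only a few lines; the records below are fabricated to show the computation and are not project data:

    from collections import defaultdict

    def rates_by_start(records):
        # Compare non-conformance rates of networks starting with a single
        # task ('1 -> ...') against those starting with several ('n -> ...').
        tally = defaultdict(lambda: [0, 0])  # group -> [non-conforming, total]
        for desc, nonconforming in records:
            group = "single-start" if desc.split("->")[0] == "1" else "multi-start"
            tally[group][0] += int(nonconforming)
            tally[group][1] += 1
        return {g: bad / total for g, (bad, total) in tally.items()}

    records = [  # (type description, had a non-completed task?)
        ("1->3(1,1,1)", False), ("1->2(1,1)", True), ("1->2(1,1)", False),
        ("3(1,1,1)->1", True), ("2(1,1)->1", True), ("4(1,1,1,1)->1", False),
    ]
    print(rates_by_start(records))  # {'single-start': 0.33..., 'multi-start': 0.66...}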

These initial results also demonstrated the potential for extending our current work in many directions: (1) the recursive division method makes it possible to decompose large and complicated scheduling networks into multiple levels of components for further representation of project planning and schedule control history; (2) the abstract descriptions can be viewed as extracted graphical features of scheduling networks, which could be integrated with other features of activities and dependencies to identify other general and useful patterns in scheduling data; and (3) the complete process of scheduling data preparation, representation, and analysis provided primary insights and experience for the subsequent development of this research on a larger scale.

5. Conclusions

It is certainly no surprise that a construction project, even a small one, requires a huge amount of information from specifications, plans, construction documents, process control, inventory management, cost estimating and control, and scheduling. As the construction industry adopts new computer technologies, computerized construction data are becoming more and more available, and there exist numerous opportunities to exploit and extract knowledge from this vast amount of construction data. Unlike many domains in which Knowledge Discovery in Databases (KDD) has been successfully applied, however, construction data are of multiple types and come from many different sources, some of very low quality. The general direction of our research has been, and will continue to be, the organization of such diverse data into forms that are amenable to the application of knowledge discovery methods.

Unstructured construction data types differ from structured data in their structure, information content, and storage size. This paper presented text-based, web-based, image-based, and network-based construction data as examples of our vision for advanced data management and analysis of these unstructured types of data. Several steps essential for achieving this goal were presented and discussed, including problem investigation to define the relevant issues; data preparation to remove features that do not carry useful information; data representation to transform data sources into new precise and concise descriptions; and proper data analysis for information search or for the extraction of new knowledge. The success of the prototypes developed to test our vision suggests that further research should be conducted to extend the results obtained and to develop the tools required to deal with other non-traditional construction data sources.

Our research concentrates on studying the increasing amount of available construction data; developing Data Mining and Knowledge Discovery in Databases methods, processes, and tools to discover novel knowledge from large construction databases; and organizing the large amount of available construction data through the development of improved information retrieval methods.


Our contributions to understanding construction data, the life cycle of data acquisition, data management, and knowledge generation suggest a new paradigm for automated knowledge discovery in construction projects. We expect our research to help construction management professionals and researchers understand the opportunities for improved data management, achieve improved information retrieval by providing access to the required information at the right time to the right project member, and learn from best examples and avoid common mistakes through the use of data mining and construction knowledge discovery methods and processes. The ultimate goal of our research is to develop new concepts, principles, methodologies, and tools that will help the AEC industry analyze large amounts of construction data. This analysis will support the development of needed theoretical models of construction management. We are currently working on frameworks that allow the development of data warehouses from complex unstructured construction data, and on data modeling, management, and mining tools to support the analysis of geospatial data, another common construction data type.

Acknowledgements

The authors would like to thank the US National Science Foundation for its support under Grants No. 0201299 and No. 0093841, and all industrial collaborators who provided the data sources for our previous and current studies.

References

[1] R.B. Buchheit et al., A knowledge discovery case study for the intelligent workplace, in: Proc. of the 8th Int. Conf. on Computing in Civil & Building Eng., Stanford, CA, 2000.
[2] L. Soibelman, H. Kim, Data preparation process for construction knowledge generation through knowledge discovery in databases, J. Comput. Civil Eng. 1 (2002) 39–48.
[3] L. Soibelman, H. Kim, J. Wu, Knowledge discovery for project delay analysis, Bauingenieur, Springer VDI Verlag, February 2005.
[4] C. Caldas, L. Soibelman, J. Han, Automated classification of construction project documents, J. Comput. Civil Eng. 4 (2002) 234–243.
[5] D. Hajjar, S. AbouRizk, Integrating document management with project and company data, J. Comput. Civil Eng. 1 (2000) 70–77.
[6] Y. Zhu, R. Issa, R. Cox, Web-based construction document processing via malleable frame, J. Comput. Civil Eng. 3 (2001) 157–169.
[7] R. Fruchter, K. Reiner, Model-centered world wide web coach, in: Proc. of 3rd Congress Computing in Civil Engineering, ASCE, Reston, VA, 1996.
[8] B. Kosovac, T. Froese, D. Vanier, Integrating heterogeneous data representations in model-based AEC/FM systems, in: Proc. of CIT 2000, Reykjavik, Iceland.
[9] G. Lee, C.M. Eastman, R. Sacks, S.B. Navathe, Grammatical rules for specifying information for automated product data modeling, Adv. Eng. Inform. 20 (2) (2006) 155–170.

[10] Ž. Turk, Construction informatics: definition and ontology, Adv. Eng. Inform. 20 (2) (2006) 187–199.
[11] M. Yang, W. Wood, M. Cutkosky, Data mining for thesaurus generation in informal design information retrieval, in: Proc. Int. Computing Conference, 1998.
[12] R. Scherer, S. Reul, Retrieval of project knowledge from heterogeneous AEC documents, in: Proc. of the 8th Int. Conf. on Comp. in Civil & Building Eng., Stanford, CA, 2000.
[13] W. Wood, The development of modes in textual design data, in: Proc. of ICCCBE-VIII, Palo Alto, CA, 2000.
[14] C. Caldas, L. Soibelman, L. Gasser, Methodology for the integration of project documents in model-based information systems, J. Comput. Civil Eng. 1 (2005) 25–33.
[15] MasterFormat™, Construction Specifications Institute, Alexandria, VA, 2004.
[16] R. Dewan, M. Freimer, A. Seimann, Portal Kombat: the battle between web pages to become the point of entry to the World Wide Web, in: Proc. of the 32nd Hawaii Int. Conf. System Science, IEEE, Los Alamitos, CA, USA, 1999.
[17] D. Harrison, M. Donn, H. Skates, Applying web services within the AEC industry: enabling semantic searching and information exchange through the digital linking of the knowledge base, in: Proc. of the CIB W78's 20th Int. Conf. on Construction IT, CIB Report 284, Waiheke Island, New Zealand, 2003, pp. 145–153.
[18] OmniClass Construction Classification System (OCCS), 2006 (April 15, 2007).
[19] T.A. El-Diraby, C. Lima, S. Feis, Domain taxonomy for construction concepts: toward a formal ontology for construction knowledge, J. Comput. Civil Eng. 4 (2005) 394–406.
[20] C. Lima, J. Stephens, M. Böhms, The bcXML: supporting ecommerce and knowledge management in the construction industry, ITCON, 2003, pp. 293–308.
[21] S. Staub-French, M. Fischer, J. Kunz, B. Paulson, K. Ishii, An ontology for relating features of building product models with construction activities to support cost estimating, CIFE Working Paper #70, Stanford University, 2002.
[22] Ž. Turk, T. Cerovšek, Mapping the W78 papers onto the construction informatics topic map, in: Proc. of the CIB W78's 20th Int. Conf. on Construction IT, CIB Report 284, Waiheke Island, New Zealand, 2003, pp. 423–432.
[23] F. Meziane, Y. Rezgui, A document management methodology based on similarity contents, Inform. Sci. 158 (2004) 15–36.
[24] Y. Rezgui, Ontology-centered knowledge management using information retrieval techniques, J. Comput. Civil Eng. 20 (4) (2006) 261–270.
[25] K. Lin, L. Soibelman, Promoting transactions for A/E/C product information, Automat. Construct. 15 (6) (2006) 746–775.
[26] E. Greengrass, Information retrieval: a survey, 2000 (November 20, 2003).
[27] T. Craven, Thesaurus construction, online thesaurus construction tutorial, 1997 (January 15, 2005).
[28] I. Brilakis, L. Soibelman, Material-based construction site image retrieval, J. Comput. Civil Eng. 4 (2005) 341–355.
[29] O. Abudayyeh, Audio/visual information in construction project control, J. Adv. Eng. Software 2 (1997).
[30] I. Brilakis, L. Soibelman, Multi-modal image retrieval from construction databases and model-based systems, J. Construct. Eng. Manage. 4 (2006).
[31] I. Brilakis, L. Soibelman, Shape-based retrieval of construction site photographs, J. Comput. Civil Eng., in press.
[32] Water Environment Federation (WEF), Existing sewer evaluation and rehabilitation, ASCE Manuals and Rep. on Eng. Pract., Rep. No. 62, Alexandria, VA, 1994.
[33] Pipeline Assessment and Certification Program (PACP) Manual, second ed., Reference Manual, NASSCO, 2001.

[34] M.J. Chae, D.M. Abraham, Neuro-fuzzy approaches for sanitary sewer pipeline condition assessment, J. Comput. Civil Eng. 15 (1) (2001) 4–14.
[35] S.K. Sinha, P.W. Fieguth, Automated detection of cracks in buried concrete pipe images, Automat. Construct. 15 (1) (2005) 58–72.
[36] S.K. Sinha, P.W. Fieguth, Neuro-fuzzy network for the classification of buried pipe defects, Automat. Construct. 15 (1) (2005) 73–83.
[37] T. Hegazy, K. Zhang, Daily windows delay analysis, J. Construct. Eng. Manage., ASCE 5 (2005) 505–512.
[38] J. Shi, S. Cheung, D. Arditi, Construction delay computation method, J. Comput. Civil Eng. 1 (2001) 60–65.
[39] T. Hegazy, M. Kassab, Resource optimization using combined simulation and genetic algorithms, J. Construct. Eng. Manage. 6 (2003) 698–705.
[40] C. Feng, L. Liu, S. Burns, Using genetic algorithms to solve construction time–cost trade-off problems, J. Comput. Civil Eng. 3 (1997) 184–189.


[41] K. El-Rayes, A. Kandil, Time–cost–quality trade-off analysis for highway construction, J. Construct. Eng. Manage. 4 (2005) 477–486.
[42] B. Akinci, M. Fischer, J. Kunz, Automated generation of work spaces required by construction activities, J. Construct. Eng. Manage. 4 (2002) 306–315.
[43] V. Kamat, J. Martinez, Large-scale dynamic terrain in three-dimensional construction process visualizations, J. Comput. Civil Eng. 2 (2005) 160–171.
[44] R. Dzeng, I. Tommelein, Boiler erection scheduling using product models and case-based reasoning, J. Construct. Eng. Manage. 3 (1997) 338–347.
[45] R. Dzeng, H. Lee, Critiquing contractors' scheduling by integrating rule-based and case-based reasoning, Automation in Construction, vol. 5, Elsevier Science, 2004, pp. 665–678.
[46] R. Kolisch, A. Sprecher, A. Drexl, Characterization and generation of a general class of resource-constrained project scheduling problems, Manage. Sci. (1995) 1693–1703.