Highlights
• An intelligent and automatic process to rank performance in asset management.
• An intelligent system to automatically find the most suitable practice for benchmarking.
• An improved approach to determine the centers of clusters.
• Outlier factors are defined and analyzed so that the best and poorest performers can be identified.
An intelligent and improved density and distance-based clustering approach for industrial survey data classification

Jingjing Zhong a, Peter W. Tse a,*, Yiheng Wei b

a Department of Systems Engineering & Engineering Management, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China
b Department of Automation, University of Science and Technology of China, Huangshan Road, Hefei, 230027, China
* Corresponding Author
E-mail address:
[email protected] (J.J. Zhong),
[email protected] (P.W. Tse),
[email protected] (Y.H. Wei)
Abstract
Engineering Asset Management (EAM) emphasizes achieving sustainable business outcomes and competitive advantages by applying systematic and risk-based processes to decisions concerning an organization's physical assets. At present there is no specific method to evaluate EAM performance and no benchmark against which performance can be ranked. To fill this gap, an improved density and distance-based clustering approach is proposed. The proposed approach is intelligent and efficient. It greatly simplifies the current evaluation method, so that the resources committed to manual data analysis and performance ranking can be significantly reduced. Moreover, the proposed approach provides a benchmarking basis for measuring and ranking performance in EAM. Additionally, by using this intelligent approach, companies can avoid paying expensive fees to external consultancy companies for the necessary EAM auditing and performance benchmarking.
Keywords: Engineering asset management, Clustering, Performance evaluation, Density and distance-based clustering, Outlier analysis, K-means
1. Introduction
Engineering Asset Management (EAM), as a discipline, addresses the value contribution of asset management to an organization's success (Amadi-Echendu, et al., 2010). Good asset management has become an expected practice in mature organizations all over the world. PAS 55:2008, the first publicly available specification for the optimized management of physical assets, was developed by a consortium of 50 organizations from 15 industry sectors in 10 countries. It offers a 28-point checklist of requirements for an effective asset management system, defined terms, and practical guidance on the implementation of the standard (PAS, 2008).

1.1 Background
PAS 55:2008 is increasingly recognized as a generically applicable definition of good practice in the whole-life-cycle and optimized management of physical assets. Given the popularity of PAS 55, and after consultation with industry and professional bodies around the world, the specification was put forward in 2009 to the International Standards Organization (ISO) as the basis for a new ISO standard for asset management. This was approved, and the resulting ISO 55000-2 family of standards has been developed with 31 participating countries (Woodhouse, 2013). Since PAS 55 only lists a general guideline on which elements must be accomplished to obtain EAM certification, it does not provide any evaluation method for scoring EAM performance, nor examples of best practice for a particular type of public service. In order to evaluate the degree of achievement in EAM and benchmark against similar types of service providers, a number of public utility service providers have obtained their EAM certificates by hiring professional consultancy companies to perform auditing and performance assessment. However, many small and medium-sized enterprises (SMEs), such as building services management companies, cannot afford to hire such professional companies to help them obtain EAM certificates. Even if they could afford a consultancy, they do not know how to choose one, or which consultancy is qualified to provide such EAM auditing and performance assessment. Therefore, the purpose of our current research is to build an intelligent system that can automatically classify the stage of performance of a particular company and then identify the most suitable EAM practice for that company after benchmarking against the information and performance given by other companies. First, a number of suitable Key Performance Indicators (KPIs) for SMEs to promote the implementation of EAM according to asset management standards have been determined. Second, an electronic questionnaire (survey form) to evaluate the EAM performance of SMEs has been designed and implemented. With the help of the e-based survey form, each SME-based candidate can fill in the questionnaire through the Internet. Third, our intelligent method can
automatically classify the success level of EAM performance for each candidate and then provide the candidate with the most suitable EAM practice from a company of similar business nature. Based on this most suitable practice, the candidate can identify the gap between its current practice and the most suitable practice in EAM. Hence, the candidate will know how to improve its performance to match that of the most suitable practice. With the foundation work done by the above three processes, an AI- and web-based solution can be implemented in the future to help a particular SME automatically evaluate its EAM performance, score its degree of success, and benchmark its performance against the current most suitable EAM practice. Hence, the need to hire an expensive consultancy company to certify the practice of asset management can be avoided.
1.2 Related work
The main purpose of using data clustering techniques is to improve the performance of data access by summarizing data objects into groups (Bouguettaya, Yu, Liu, Zhou, & Song, 2015). Clustering is applied in many fields, such as medicine and telecommunications, and is much in demand in expert systems and their applications (Binu, 2015; Khanmohammadi, Adibeig, & Shanehbandy, 2017; Carvalho, Barbon, de Souza Mendes, & Proença, 2016). The basic idea of density-based clustering algorithms is that data lying in a high-density region of the data space are considered to belong to the same cluster (Kriegel, Kröger, Sander, & Zimek, 2011). A typical example is DBSCAN (Ester, Kriegel, Sander, & Xu, 1996). Its disadvantages are a low-quality clustering result when the density of the data space is uneven, a large memory requirement when the data volume is big, and high sensitivity of the clustering result to the parameters (Duan, Xu, Guo, Lee, & Yan, 2007). Khan and Ahmad use a density-based multiscale data condensation approach with Hamming distance to extract cluster centers from datasets; however, their method has quadratic complexity with respect to the number of data objects (Khan & Ahmad, 2003; Khan & Ahmad, 2013). Lately, Ros and Guillaume provided a new density-based sampling method for clustering, named DENDIS (Ros & Guillaume, 2016), which achieves higher accuracy than K-means. The Density and Distance-based Clustering (DDC) algorithm can identify cluster centers by investigating the local density and the intra-cluster distance of each data point (Rodriguez & Laio, 2014). This method relies on two reasonable assumptions. The first is that the cluster centers must have the highest local density. The second is that these centers have relatively larger distances to points of higher density. Based on these two assumptions, a simple criterion has been adopted to find the independent density peaks: each data point is ranked by the product of its local density and its distance from points of higher density. Then the cluster
centers are recognized as points with anomalously large ranking scores (Jia, Tang, Zhu, & Li, 2015). The remaining data points are assigned to the same cluster as their nearest neighbor of higher density. This assignment step is performed in a single pass, which is much faster than other clustering algorithms such as K-means (MacQueen, 1967). Meanwhile, the DDC approach only requires the measurement of the distance between all pairs of data points. Because of the ranking behavior in the clustering process, the DDC algorithm can be considered a ranking-based clustering method. It has been shown to generate good results in several non-spherical clustering problems (Jia, et al., 2015). Due to its simple design and excellent clustering performance, the DDC approach has been adopted in many applications since it was proposed in 2014, such as hyperspectral band selection (Jia, et al., 2015), anomalous cell detection (Miao, Qin, & Wang, 2015), age estimation (Chen, Wang, & Du, 2015), image processing (Lu, Liong, Zhou & Zhou, 2015; Lu, Wang, Deng & Jia, 2015) and fault diagnosis in cloud computing (Wang, et al., 2015). Kang, Xiu, and Lu (2015) computed an exemplar score to achieve higher classification accuracy by applying DDC, but their approach requires human supervision to determine the cluster centers, and the clustering quality is sensitive to the cutoff distance parameter. The algorithm has also been used to deal with mixed data by self-determining cluster centers (Chen & He, 2015; Chen & He, 2016). Wang, Zuo and Wang improved the DDC method and applied it to social networks (Wang, Zuo & Wang, 2016). The DDC method was also improved by using kernel density estimation (Xu, Yan & Xu, 2015) via heat diffusion (Mehmood, Zhang, Bie, Dawood & Ahmad, 2016). Although these results are satisfactory, there is room for further investigation. First, this method
does not provide any efficient way to select the threshold value of the cutoff distance $d_c$; that is, one must estimate the value of $d_c$. Hence, the selection of the proper threshold value
becomes dependent on the selector's subjective experience. The concept of entropy was introduced by Wang to determine $d_c$ (Wang, Wang, Li, & Li, 2015), but it greatly increases the
computational complexity and does not provide an easy-to-understand rationale. Second, in a decision graph, the data points that satisfy the following two criteria are named the cluster centers: the cluster centers must simultaneously have a high nearest-neighbor distance and a relatively high local density. It is necessary to make a tradeoff between these two indicators. An indicator measuring the possibility of a sample being a cluster center was defined in (Kang, et al., 2015), but the differences in scale and in the degree of concern were ignored. Furthermore, outliers can be caused by noise, so it is essential to determine the outliers accurately and effectively with quantitative analysis. Nonetheless, neither the original DDC approach nor the existing derived methods provide any specific guideline to handle this issue. Motivated by the discussions above, an improved density and distance-based clustering (IDDC) approach is developed in this paper, with the following improvements:
1) The cutoff distance is selected automatically rather than by experience.
2) The local density and the cluster separation distance are normalized.
3) The cluster centers are determined by quantitative rather than qualitative analysis.
4) An outlier factor is defined and analyzed in depth.
The proposed method can solve the aforementioned problems as well as enhance the features for better clustering, especially for noisy industrial data. The final clustering results generated by this new approach were compared with those generated by the K-means approach, a popular and commonly used clustering method. It is well known that K-means has some shortcomings. First, it needs a predefined number of clusters; however, in most cases, the number of clusters cannot be known a priori. Second, it is only suited to finding globular clusters of similar radius because of its strategy of assigning each point to the nearest cluster. For example, when the number of clusters is large, K-means becomes very sensitive to the initialization of the means (X. Chen, 2015). Besides avoiding the above shortcomings, our new approach can evaluate EAM performance and rank the achieved performance levels automatically and rapidly. Usually, the performance data of different companies are collected by conducting a survey with a properly designed questionnaire. Once the data have been collected, manual work is required to analyze the data, classify the achievements or results of each company based on its performance, compare the results with other companies' results, and then rank the level of achievement accordingly. Such processes are time-consuming, labor-intensive and tedious. In contrast, our new approach provides an automatic process, so that once a sufficiently large quantity of data has been collected, the above tedious processes can be performed automatically. Finally, any related SME can use this approach to input the required data and then obtain a ranking of its current EAM performance automatically. Moreover, the surveyed SME can see the gap between its performance and the current most suitable or typical performance of related service sectors, so that, after applying the automatic benchmarking process provided by our new approach, it can understand what is needed to further improve its EAM practice and reach the typical level achieved by associated sectors. Initially, our new approach targets the industrial sector of building operation and maintenance service providers situated in Hong Kong. The scope is limited to the information services and management in EAM supported by these providers. The main reason is to narrow down the coverage to a particular type of service and a limited number of companies, so that the variation will not be large enough to affect the accuracy of the results generated by the new approach. Once it has shown its success and met the aimed target, the new approach can be extended to other types of service and other groups of companies, such as routine operation and maintenance of locomotive systems aimed at mass transit providers, like underground train owners or management companies. This improved approach has been
found to be able to classify new data samples automatically, so the intelligent system can be used to simplify the current evaluation method and improve measurement efficiency. The waste of resources in analyzing and ranking is reduced, and the processes become intelligent. Companies using the new approach are not required to pay expensive consultancy fees for the whole procedure. Moreover, further suggestions from experts and researchers for improving the questionnaire would enhance the chance of subsequently obtaining PAS-55 certification. The rest of this paper is outlined as follows. The fundamental theory of the improved density and distance-based clustering approach is stated in Section 2. The brief content and aims of the industrial survey, the method of analyzing the collected data and the experimental results are reported in Section 3. The effectiveness of our proposed method and comparisons with some randomly selected parameters are discussed in Section 3.2. Finally, the discussion and concluding remarks are drawn in Sections 4 and 5, respectively.

2. The Brief Framework of the Proposed Methodology
The DDC approach is adopted in this study since it has strong robustness and stability compared with the well-known K-means method. To further overcome the insufficiencies of the DDC method and obtain better clustering results, some improvements are made to the corresponding aspects.

2.1 The original density and distance-based clustering approach

Rodriguez established the DDC approach based on the following two reasonable assumptions:
1) The cluster centers must have the highest local density $\rho_i$.
2) The cluster centers have a relatively larger distance $\delta_i$ to the data samples of higher density.
Based on these two assumptions, the density-peak-based clustering approach can be established. For the data set to be clustered, $S = \{x_i\}_{i=1}^{N}$, $x_i \in \mathbb{R}^m$, define the distance between $x_i$ and $x_j$ as

$$d_{ij} = \| x_i - x_j \|_2 . \qquad (1)$$

According to the Gaussian kernel function, the local density $\rho_i$ can be defined as follows:

$$\rho_i = \sum_{j=1,\, j \ne i}^{N} \exp\!\left( -\frac{d_{ij}^2}{d_c^2} \right), \qquad (2)$$

where the cutoff distance $d_c$ is to be determined. The computation of $\delta_i$ is quite simple because it is defined as the nearest distance from data point $i$ to the samples of higher density:

$$\delta_i = \min_{j:\, \rho_j > \rho_i} d_{ij} . \qquad (3)$$
For the data point that has the highest local density, we simply let $\delta_i = \max_j d_{ij}$, so that this point also receives a large $\delta$ value and can be identified as a density peak. From the above definitions of $\rho_i$ and $\delta_i$, it is clear that the method depends strongly on the chosen cutoff distance. By choosing a proper cutoff distance, it is possible to distinguish the clusters using these two indicators $\rho_i$ and $\delta_i$. The two indicators properly characterize the location of the samples in the data set and play a vital role in our method. More specifically, when $\rho_i$ is large and $\delta_i$ is small, the $i$th data point is close to, but is not, a cluster center, since there is a data point in the same cluster (because $\delta_i$ is small) that has a larger $\rho$. When $\rho_i$ and $\delta_i$ are both small, the point lies around the edge of a cluster. When $\rho_i$ is small and $\delta_i$ is large, on the other hand, the point is far away from the rest of the data set, indicating that the point is probably an outlier. Only when both $\rho_i$ and $\delta_i$ are relatively large can the data point be a cluster center.
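To make the two indicators concrete, the following Python sketch computes $\rho_i$ and $\delta_i$ from a data matrix for a given cutoff distance. It is a minimal illustration of Equations (1)-(3); the function name, the use of NumPy and the handling of the highest-density point are our own assumptions, not part of the original description.

```python
import numpy as np

def density_and_separation(X, d_c):
    """Compute the DDC indicators rho_i (Eq. 2) and delta_i (Eq. 3)
    for every sample in X, given a cutoff distance d_c."""
    # Pairwise Euclidean distances d_ij (Eq. 1).
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=2))

    # Gaussian-kernel local density; subtracting 1 removes the j = i term.
    rho = np.exp(-(d / d_c) ** 2).sum(axis=1) - 1.0

    # delta_i: distance to the nearest sample of higher density; the
    # highest-density sample receives the maximum distance instead.
    n = len(X)
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        delta[i] = d[i].max() if higher.size == 0 else d[i, higher].min()
    return rho, delta
```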
2.2 The improved density and distance-based clustering approach
From Equation (2), one can easily observe that $d_c$ has a direct impact on the result of finding the cluster centers. In particular, if $d_c$ is assigned a large value, then $\rho_i \to N$ and the overlapping neighborhood can even include data from other clusters. If $d_c$ is chosen too small, then $\rho_i \to 0$ and every point has a similarly sparse density neighborhood. In both cases, the local density $\rho_i$ loses discriminative power. Consequently, a suitable $d_c$ is crucial for the data clustering process. The original algorithm does not provide any efficient way to select a proper threshold value of $d_c$; that is, a user must arbitrarily select a value of $d_c$ depending on his subjective experience. Considering the purpose of clustering, if all the data samples have the same $\rho_i$, then the uncertainty of the data distribution is the largest, whereas if they are uneven, then the dispersion degree of the data is small and the distribution of the data can be well determined. As a result, we use the variance to represent the dispersion of $\rho_i$. To increase comparability, the local density $\rho_i$ and the separation distance $\delta_i$ are normalized to the scale $[0, 1]$:
$$\bar{\rho}_i = \frac{\rho_i - \min_i \rho_i}{\max_i \rho_i - \min_i \rho_i}, \qquad (4)$$

$$\bar{\delta}_i = \frac{\delta_i - \min_i \delta_i}{\max_i \delta_i - \min_i \delta_i}. \qquad (5)$$

And the variance is defined as

$$\sigma^2 = \frac{1}{N-1} \sum_{i=1}^{N} \left( \bar{\rho}_i - \mu \right)^2, \qquad (6)$$

where $\mu = \frac{1}{N} \sum_{i=1}^{N} \bar{\rho}_i$ is the mean of the normalized local density.

As mentioned above, the value of $d_c$ should be selected where the variance is maximum, i.e.

$$d_c = \arg\max_{\underline{d} \le d_c \le \bar{d}} \sigma^2, \qquad (7)$$

where $\underline{d} = \min_{i \ne j} d_{ij}$ and $\bar{d} = \max_{i \ne j} d_{ij}$.
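Under the assumption that the maximization in Equation (7) is carried out by a simple grid search over candidate cutoff distances (the text only states that $d_c$ is obtained by optimization), the selection rule can be sketched as follows; the candidate count and function names are illustrative.

```python
import numpy as np

def min_max_normalize(v):
    """Min-max scaling to [0, 1], as in Eqs. (4) and (5)."""
    return (v - v.min()) / (v.max() - v.min())

def select_cutoff_distance(X, n_candidates=100):
    """Choose d_c that maximizes the sample variance (Eq. 6) of the
    normalized local density over candidates in [min d_ij, max d_ij]."""
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=2))
    off_diag = d[~np.eye(len(X), dtype=bool)]
    candidates = np.linspace(off_diag.min(), off_diag.max(), n_candidates)

    best_dc, best_var = candidates[0], -np.inf
    for dc in candidates:
        rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0
        var = min_max_normalize(rho).var(ddof=1)  # Eq. (6), sample variance
        if var > best_var:
            best_dc, best_var = dc, var
    return best_dc
```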
To determine the cluster centers, one can use a quantitative rather than a qualitative analysis, defining

$$\gamma_i = \bar{\rho}_i^{\,\alpha}\, \bar{\delta}_i , \qquad (8)$$

where $\alpha > 0$ is the adjustment factor. If $\alpha > 1$, the local density is emphasized, and if $\alpha < 1$, the separation distance is the key indicator.

Once $\gamma_i$ has been calculated, the remaining steps of the algorithm are simple: rank the samples according to $\gamma_i$ from high to low and then select the top ones. A sample with a large $\gamma$ value has a high chance of being a cluster center. The other samples can then be classified into the same cluster as their nearest neighbor of higher local density $\rho_i$, at the nearest distance $\delta_i$.
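A sketch of the center selection and assignment steps follows, using the ranking score of Equation (8) under our reading of the adjustment factor ($\gamma_i = \bar{\rho}_i^{\alpha}\bar{\delta}_i$); the pairwise distance matrix d and the chosen number of clusters K are passed in explicitly, and the function name is our own.

```python
import numpy as np

def cluster_by_gamma(rho, delta, d, K, alpha=1.0):
    """Rank samples by gamma (Eq. 8), take the top-K as cluster centers,
    then assign the remaining points, in decreasing order of density, to
    the cluster of their nearest neighbor of higher density."""
    rho_n = (rho - rho.min()) / (rho.max() - rho.min())
    delta_n = (delta - delta.min()) / (delta.max() - delta.min())
    gamma = (rho_n ** alpha) * delta_n

    centers = np.argsort(gamma)[::-1][:K]        # top-K ranking scores
    labels = np.full(len(rho), -1, dtype=int)
    labels[centers] = np.arange(K)

    for i in np.argsort(rho)[::-1]:              # decreasing density
        if labels[i] == -1:
            higher = np.where(rho > rho[i])[0]
            if higher.size == 0:                 # top-density point, not a center
                nearest = centers[np.argmin(d[i, centers])]
            else:
                nearest = higher[np.argmin(d[i, higher])]
            labels[i] = labels[nearest]
    return centers, labels
```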
Additionally, we can detect outliers in the data set using the following approach. If the local density is very small and the separation distance is very large, i.e. $\bar{\rho}_i \le \rho_o$ and $\bar{\delta}_i \ge \delta_o$, then the related sample can be regarded as an outlier, because hardly any data points surround it. A new variable, the outlier factor $\theta_i$, has been introduced for the quantitative analysis (Equation (9)); it combines $\bar{\rho}_i$ and $\bar{\delta}_i$ through an adjustment factor $\beta > 0$, increasing with the separation distance and decreasing with the local density. The higher the value of $\theta_i$, the more likely the sample is to be an outlier, and such outliers can be removed from the data set.

The procedure of the proposed IDDC algorithm is as follows:

Input: $S = \{x_i\}_{i=1}^{N}$, $x_i \in \mathbb{R}^m$ (data); $K$ (number of clusters); $\alpha$ (adjustment factor).
Output: the clustering result $C = \{c_i\}_{i=1}^{N}$.
Procedure Begin
Step 1: Initialization and preprocessing:
  1.1 Calculate the distances $d_{ij}$;
  1.2 Obtain the cutoff distance $d_c$ by optimization;
  1.3 Calculate the normalized local density $\bar{\rho}_i$;
  1.4 Calculate the normalized separation distance $\bar{\delta}_i$.
Step 2: Determine the cluster centers $M = \{m_i\}_{i=1}^{K}$ and initialize the cluster assignments $C = \{c_i\}_{i=1}^{N}$;
Step 3: Classify the samples that are not cluster centers and obtain the complete assignment $C = \{c_i\}_{i=1}^{N}$;
Step 4: Detect any outliers in the clusters.
Procedure End
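For the outlier detection in Step 4, the threshold rule described above ($\bar{\rho}_i \le \rho_o$ and $\bar{\delta}_i \ge \delta_o$, cf. Table 3) can be coded directly. The exact form of the outlier factor in Equation (9) is not recoverable from the text, so the ranking score below, $\theta_i = \bar{\delta}_i^{\beta}(1 - \bar{\rho}_i)$, is only one plausible choice that grows with the separation distance and shrinks with the local density.

```python
import numpy as np

def detect_outliers(rho_n, delta_n, rho_o=0.05, delta_o=0.95, beta=0.5):
    """Flag samples with very small normalized density and very large
    separation distance as outliers, and rank all samples by an assumed
    outlier factor theta (descending order, cf. Fig. 7)."""
    is_outlier = (rho_n <= rho_o) & (delta_n >= delta_o)
    theta = (delta_n ** beta) * (1.0 - rho_n)    # assumed form of Eq. (9)
    ranking = np.argsort(theta)[::-1]
    return np.where(is_outlier)[0], ranking
```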
3. Data Set for Verifying the Improved Algorithm and the Experimental Results
3.1 Case Background

In Hong Kong (HK), certificates related to EAM have been awarded to a number of public utilities corporations, such as China Light and Power Co. Ltd. (CLP), Mass Transit Railway Corporation (MTRC), and the Hong Kong and China Gas Co. Ltd. (TG). Some building services providers in the operation and maintenance sector, small and medium-sized enterprises (SMEs) and many plants all carry out a substantial amount of EAM activity, but they have not adopted the EAM standards completely. The consultancy fee for hiring a professional company to conduct EAM auditing and to assess whether a company's performance can be certified against the EAM standards is extremely high. Hence, only large public utility service providers, and not building services providers or SMEs, can afford such a consultancy. In order to help these smaller companies and investigate their level of EAM performance, a structured questionnaire was used to survey their performance. The structured questionnaire was designed according to the guidelines of PAS 55. PAS-55 certification is a formal recognition that an organization's integrated life-cycle asset management system is adequate and effective, in line with PAS 55 (BSI, 2008). It is subject to a rigorous evaluation and validation of competency across all the PAS 55 requirements. The certificate provides recognized credibility in good practice and corporate governance, and a robust platform for developing further improvements. In order to investigate the performance of asset management and the application of EAM standards, a sampling survey was conducted. The structured questionnaire was designed according to the PAS-55 guideline, and questionnaires were sent to 40 Operation and Maintenance (OM) departments in Hong Kong. The targeted providers include the public services departments of
the Hong Kong SAR Government, commercial buildings, residential buildings, buildings offering industrial services, and composite buildings in Hong Kong. The design of the questionnaire focuses on the information management of assets and can be separated into three parts, namely the general information of the surveyed companies, the information management standards, and the significance level of the implementation of the standards. There are 60 questions in total. The main part focuses on the guidelines for information management in EAM. The general information includes the category of operations of the respondent, the number of employees, the number of years in business operation, and the total O&M cost of the buildings/plants. The questionnaire then lists the criteria of asset information management that need to be applied. The significance level stands for the importance of applying those standards strictly, or the degree of influence that adopting the standards has on the respondent; because companies and organizations have different natures, the guidelines may not perfectly match their business requirements. Besides filling in the questionnaire, face-to-face interviews were conducted by a researcher, who could explain the purpose and the specific meaning of the questions to the respondents. This improved the accuracy and efficiency of the survey data. In addition, some questions needed to be discussed in depth in order to obtain comprehensive information. After the interviewees finished the questionnaire, the researcher checked whether anything had been missed or overlooked. During the whole process, the interviewees could communicate with the researcher and ask for further explanations of the listed questions. Consequently, integrity and validity checks were completed for all questionnaires.
3.2 The application of the proposed algorithms and the results

In order to investigate the effect of the cutoff distance on the application, a series of different cutoff distances $d_c$ within the interval $[\underline{d}, \bar{d}]$ was selected. The results generated with the different cutoff distances are shown in Fig. 1. Based on the maximum variance of the normalized local density, the desired cutoff distance is $d_c = 5.033$. Since the best cutoff distance is selected by the principle of maximum differentiation of the local density, it amplifies the differences between samples and makes the data samples easier to cluster. Under the current conditions, the $(\bar{\rho}, \bar{\delta})$-decision graph can be drawn as in Fig. 2. From the distribution of the 40 data samples shown in Fig. 2, the 8th sample is a cluster center. However, it is difficult to decide which sample has the greatest potential to be the second candidate center: the 30th sample dominates in separation distance, while the 14th sample has the advantage in local density.
Inspecting the descending order of $\gamma$ makes it easier to identify the possible candidates for cluster centers. With the increase of the adjustment factor $\alpha$ from 0.5 to 2, as shown in Figs. 3 to 5 respectively, the influence of the separation distance gradually weakens. In other words, if $\alpha = 1$, local density and separation distance carry the same significance; if $\alpha > 1$, the importance of the local density is emphasized; and if $\alpha < 1$, the separation distance is the key indicator. Consequently, by inspecting the distribution of the different samples in Figs. 3, 4 and 5, and considering the effect of $\alpha$, one can observe that the 30th sample gradually moves away from being a cluster center, as tabulated in Table 1. From the table, the ranking of the potential candidates for cluster centers can be determined.
For the case of $\alpha = 0.5$, all of the samples are rearranged by the K-means approach and by our IDDC approach, see Fig. 6 (a) and (b). Basically, all 40 samples can be clustered into three clusters or groups, namely clusters 1, 2 and 3.
In Fig. 6, for the case of $\alpha = 0.5$, the clustering results of the K-means and IDDC methods are listed and compared with the reference set. The data inputs for the two methods are essentially the same: the 40 samples, each with 53 attributes. The outputs are the three clusters corresponding to the different performance levels. For the K-means method, the value of k is set to 3 because there are three performance levels, which makes the comparison with the other method straightforward.
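For reference, the K-means baseline used in Table 2 can be reproduced with a standard library call; the file name below is hypothetical and simply stands for the 40 x 53 matrix of survey responses described above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical input: one row per surveyed company, one column per question.
X = np.loadtxt("survey_responses.csv", delimiter=",")

# k = 3, one cluster per performance level, as in the comparison above.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```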
As shown in Table 2, the samples belonging to cluster 1 exhibit good performance in terms of standard adoption (a high percentage of achievement). Note that the percentages range from 75% to 100% according to the definition of the survey scale. This is consistent with performance against requirements, which means that these companies exhibited good to excellent performance on some key processes, in a manner matching the requirements of the information management section. Cluster 2 shows average performance compared with clusters 1 and 3; the degree of standards adoption is between 45% and 74%. Samples in cluster 3 are companies or buildings exhibiting low adoption (20% to 44%). For the samples in this cluster, there exists an apparent gap against the requirements and EAM standards: they either ignore some criteria stipulated by the standards or do not comply with the standards closely. Hence, these companies could face high risks in relation to their O&M activities, and quick actions are needed. Comparing the results of the proposed method with those of the K-means method (Table 2), we found the results from our improved method to be more accurate and reasonable, because the reference results were derived from professional interviews conducted by human experts with more than 30 years of working experience in the building asset and maintenance industry. For all samples, the basic information management stipulations for asset management
and maintenance skills had been covered; all companies or organizations had met over 20% of the criteria. In Table 2, samples 10 and 20, which the K-means method placed in cluster 2, are reassigned to cluster 1 by our improved method. Likewise, samples 9 and 34 are moved from cluster 3 to cluster 2. After the original surveys were reviewed and analyzed with professional researchers, it was determined that these cluster changes were justified. The review followed this procedure. First, the original questionnaire was re-examined and the score for each question was re-checked and analyzed. Second, with respect to the significance level of the survey results, the researcher looked for consistency between performance and requirement for each respondent; that is, it was important to check whether the significance level associated with the adoption of standards matched the asset management performance.
Further, in agreement with intuition, most of the identified outliers are distant ones; very few are surrounded by other samples. Almost all are isolated, and their corresponding local densities are quite small (see Fig. 2). The outliers can be regarded as extreme cases, meaning that they show the best and/or the worst performance according to the asset management standard. The method thus provides a basic benchmark for evaluating asset management performance in Hong Kong. As shown in Table 3, the threshold can be selected on the basis of the outliers. For sample 39, the normalized density is zero, but the distance is between 0.9 and 0.95 (see Fig. 2), which means that its detection is not fully assured; different thresholds could lead to different results, owing to the high sensitivity to the threshold. There are no specific criteria for selecting the expected threshold value. Nonetheless, according to the proposed algorithm, outliers can be determined automatically. The outliers are presented clearly in descending order in Fig. 7. The parameter $\beta$ weights and adjusts the values of the local density and the separation distance; it represents the degree of attention paid to the distance between samples in the same cluster and the distance between samples belonging to different clusters. According to the outlier analysis, sample 38 exhibits the best performance compared with the other samples, while the worst one is sample 39. The value of $\beta$ can be adjusted to fulfil different analysis requirements.

4. Discussions
To evaluate the EAM performance of an O&M-based company and its degree of adoption of the standards, and to simplify the auditing and evaluation procedures, an improved density and distance-based clustering algorithm has been developed and reported here. Our proposed method can easily determine the exact value of the cutoff distance $d_c$, which has a direct impact on the resulting cluster centers. To increase comparability when clustering data, the local density $\rho_i$ and the separation distance $\delta_i$ have been normalized. The cluster centers can be determined by quantitative rather than qualitative analysis. Furthermore, an outlier detection algorithm has been developed to identify the
best or worst exemplars, making the ranking of performance not only more accurate but also simpler. Most importantly, the cluster centers can be identified automatically to evaluate and rank EAM performance, instead of relying on human experts for analysis and evaluation. This research work suffers from several limitations. First, the input of the proposed approach is the distance matrix, hence the input data need to be standardized in some cases. Second, the cluster centers and outliers are determined with respect to the specific data analyzed. Since this research topic is new and most SMEs have so far been unable to receive certification based on an EAM standard, the number of survey respondents we have been able to muster may be deemed insufficient. Future researchers can investigate how performance measures could be used more effectively to drive performance improvement in practice. Another limitation of this work is that, although it has succeeded in building the whole system, the linkages and combinations associated with each method and function are still quite complex. As for the industrial research, the desired system still requires many more companies to join the survey procedures and share their experiences, to help improve and perfect the system while continuously enhancing its accuracy. Clearly, much more research is needed to arrive at the most suitable practices for different companies with different business natures.
5. Conclusions
The advanced method can easily determine the exact value of the cutoff distance $d_c$, which has a direct impact on the resulting cluster centers. To increase comparability when clustering data, the local density $\rho_i$ and the separation distance $\delta_i$ are normalized. The
cluster centers are determined by quantitative rather than qualitative analysis. Furthermore, an outlier detection algorithm based on a new variable has been developed in order to discover outliers more accurately and more simply. The improvements mentioned above apply well to measuring the performance of engineering asset management. A case study in Hong Kong was conducted; the clustering results are more accurate and reasonable, and they also provide a benchmark for further study. This improved approach is able to predict and classify new data samples as soon as future surveys are completed, which can be used to simplify the current evaluation system and improve the efficiency of the measurement procedures. The waste of resources for analyzing and ranking is reduced, and companies are not required to pay expensive consultancy fees for the whole procedure. Experts or researchers can provide comments and further suggestions to questionnaire respondents to support them in obtaining PAS-55 certification according to their situations. For further study, as more respondents or data samples are collected, new cluster centers will be updated, which means that the most suitable model for companies or organizations may be found continuously. The new clustering results will be
updated in a timely manner and confirmed by experts and professionals. At the same time, this improved and simplified method can not only reduce the procedures for measuring performance, but also save cost for most small and medium-sized enterprises seeking an asset management certificate. Moreover, updating the data can provide a complete benchmark for industrial research on engineering asset management. In the future, with the help of this intelligent classification method for evaluating the performance of information management in EAM, more and more companies can join and apply this method to their current EAM practice. First, a web-based survey method will be designed and implemented. Second, a company that is interested in checking its asset management performance will fill in the designed questionnaire and input its opinions via the website. Third, our intelligent method will automatically classify the company's performance into the appropriate class. Fourth, once the result is available, the company will be notified of its current performance and the gap between it and the benchmarked companies. Hence, the company will realize how its performance differs from that of the companies rated as good performers. Moreover, the company will receive an indication of how to improve its performance to match the required standards. With the support of such a web-based surveying and self-evaluation system, more and more SMEs and companies that cannot afford to hire a well-known consultancy to guide them in conducting EAM auditing and evaluation can apply this automatic evaluation system to benchmark their performance against the standards. Additionally, as the number of companies joining this automatic evaluation system continuously increases, the amount of data samples will keep increasing and updating. New cluster centers can also be updated; that is, an identified center may become the most suitable model of EAM practice among companies with a similar nature of business. These new results can be further confirmed by occasionally hiring experts in case the results of the automatic system are found to deviate. Consequently, the accuracy in conducting EAM auditing and evaluating the current level of EAM practice will be substantially enhanced in the long term. The future web system embedded with our automatic evaluation system can not only perform EAM auditing and performance evaluation automatically, but also save significant cost for SMEs compared to hiring an expensive consultancy to obtain the required EAM certificate.

Acknowledgments

The work described in this paper was fully supported by grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. 11201315) and the National Natural Science Foundation of China (No. 61573332, No. 61601431).

References
Amadi-Echendu, J. E., Willett, R., Brown, K., Hope, T., Lee, J., Mathew, J., Vyas, N., & Yang, B. S. (2010). What is engineering asset management? In Definitions, Concepts and Scope of Engineering Asset Management (pp. 3-16). London: Springer.
Binu, D. (2015). Cluster analysis using optimization algorithms with newly designed objective functions. Expert Systems with Applications, 42(14), 5848-5859.
Bouguettaya, A., Yu, Q., Liu, X. M., Zhou, X. M., & Song, A. (2015). Efficient agglomerative hierarchical clustering. Expert Systems with Applications, 42(5), 2785-2797.
BSI, P. (2008). 55-2: Asset Management. Part 2: Guidelines for the Application of PAS 55-1. British Standards Institution.
Carvalho, L. F., Barbon, S., de Souza Mendes, L., & Proença, M. L. (2016). Unsupervised learning clustering and self-organized agents applied to help network management. Expert Systems with Applications, 54, 29-47.
Celebi, M. E., Kingravi, H. A., & Vela, P. A. (2013). A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Systems with Applications, 40(1), 200-210.
Chen, J. Y., & He, H. H. (2015). Research on density-based clustering algorithm for mixed data with determine cluster centers automatically. Acta Automatica Sinica, 41(10), 1798-1813.
Chen, J. Y., & He, H. H. (2016). A fast density-based data stream clustering algorithm with cluster centers self-determined for mixed data. Information Sciences, 345, 271-293.
Chen, X. Q. (2015). A new clustering algorithm based on near neighbor influence. Expert Systems with Applications, 42(21), 7746-7758.
Chen, Y. W., Lai, D. H., Qi, H., Wang, J. L., & Du, J. X. (2016). A new method to estimate ages of facial image for large database. Multimedia Tools and Applications, 75(5), 2877-2895.
Duan, L., Xu, L., Guo, F., Lee, J., & Yan, B. (2007). A local-density based spatial clustering algorithm with noise. Information Systems, 32, 978-986.
Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD 1996) (pp. 226-231). Portland, USA.
Jia, S., Tang, G. H., Zhu, J. S., & Li, Q. Q. (2015). A novel ranking-based clustering approach for hyperspectral band selection. IEEE Transactions on Geoscience and Remote Sensing, 54(1), 88-102.
Khan, S. S., & Ahmad, A. (2003). Computing initial points using density based multiscale data condensation for clustering categorical data. In Proceedings of the 2nd International Conference on Applied Artificial Intelligence (ICAAI 2003), Kolhapur, India.
Khan, S. S., & Ahmad, A. (2013). Cluster center initialization algorithm for K-modes clustering. Expert Systems with Applications, 40(18), 7444-7456.
Khanmohammadi, S., Adibeig, N., & Shanehbandy, S. (2017). An improved overlapping k-means clustering method for medical applications. Expert Systems with Applications, 67, 12-18.
Kriegel, H. P., Kröger, P., Sander, J., & Zimek, A. (2011). Density-based clustering. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(3), 231-240.
Lu, J., Liong, V. E., Zhou, X. Z., & Zhou, J. (2015). Learning compact binary face descriptor for face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(10), 2041-2056.
Lu, J., Wang, G., Deng, W., & Jia, K. (2015). Reconstruction-based metric learning for unconstrained face verification. IEEE Transactions on Information Forensics and Security, 10(1), 79-89.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics (pp. 281-297). Berkeley: University of California Press.
Mehmood, R., Zhang, G. Z., Bie, R. F., Dawood, H., & Ahmad, H. (2016). Clustering by fast search and find of density peaks via heat diffusion. Neurocomputing, 208, 210-217.
Miao, D. D., Qin, X. W., & Wang, W. D. (2015). Anomalous cell detection with kernel density-based local outlier factor. China Communications, 12(9), 64-75.
PAS, B. (2008). 55-1: Asset Management. Part 1: Specification for the Optimized Management of Physical Assets. British Standards Institution.
Rodriguez, A., & Laio, A. (2014). Clustering by fast search and find of density peaks. Science, 344(6191), 1492-1496.
Ros, F., & Guillaume, S. (2016). DENDIS: A new density-based sampling for clustering algorithm. Expert Systems with Applications, 56, 349-359.
Sun, K., Geng, X. R., & Ji, L. Y. (2015). Exemplar component analysis: A fast band selection method for hyperspectral imagery. Geoscience and Remote Sensing Letters, 12(5), 998-1002.
Wang, M. M., Zuo, W. L., & Wang, Y. (2016). An improved density peaks-based clustering method for social circle discovery in social networks. Neurocomputing, 179, 219-227.
Wang, S. L., Wang, D. K., Li, C. Y., & Li, Y. (2015). Comment on "Clustering by fast search and find of density peaks". http://arxiv.org/abs/1501.04267.
Wang, T., Zhang, W., Ye, C., Wei, J., Zhong, H., & Huang, T. (2016). Fd4c: Automatic fault diagnosis framework for web applications in cloud computing. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 46(1), 61-75.
Woodhouse, J. (2014). Briefing: Standards in asset management: PAS 55 to ISO 55000. Infrastructure Asset Management, 1(3), 57-59.
Xu, X. Y., Yan, Z., & Xu, S. L. (2015). Estimating wind speed probability distribution by diffusion-based kernel density method. Electric Power Systems Research, 121, 28-37.
Fig. 1. The change of the variance of the normalized local density with different $d_c$.
Fig. 2. Decision graph of clustering for PAS-55
Fig. 3. The descending order of $\gamma$ with $\alpha = 0.5$.

Fig. 4. The descending order of $\gamma$ with $\alpha = 1$.
Fig. 5. The descending order of $\gamma$ with $\alpha = 2$.
Fig. 6. The clustering of different samples by using (a) the K-means approach and (b) our proposed approach.
Fig. 7. The descending order of the outlier factor with $\beta = 0.5$.
Table 1
The potential candidates to be the centers of clusters for different values of $\alpha$.

$\alpha$          0.5          1            2
cluster centers   8, 30, 14    8, 14, 30    8, 14, 24

Table 2
Clustering results with K-means and our proposed method.

Cluster 1
  K-means: 1, 3, 6, 8, 11, 12, 13, 15, 16, 25, 26, 27, 29, 33, 38, 40
  Proposed method: 1, 3, 6, 8, 10, 11, 12, 13, 15, 16, 20, 25, 26, 27, 29, 33, 38, 40
Cluster 2
  K-means: 2, 4, 7, 10, 14, 18, 19, 20, 21, 23, 24, 31, 32, 35, 36
  Proposed method: 2, 4, 7, 9, 14, 18, 19, 21, 23, 24, 31, 32, 34, 35, 36
Cluster 3
  K-means: 5, 9, 17, 22, 28, 30, 34, 37, 39
  Proposed method: 5, 17, 22, 28, 30, 37, 39
Table 3
The identified outliers when applying different threshold values $(\rho_o, \delta_o)$.

$(\rho_o, \delta_o)$   (0.05, 0.95)   (0.05, 0.90)   (0.10, 0.95)   (0.10, 0.90)
outliers               38             38, 39, 9      38, 11         38, 39, 9, 11