13th Symposium on Automation in Mining, Mineral and Metal Processing Cape Town, South Africa, August 2-4, 2010
Cause&Effect Analysis of Quality Deficiencies at Steel Production using Automatic Data Mining Technologies
H. Peters*, N. Link*
*VDEh-Betriebsforschungsinstitut GmbH, Duesseldorf, Germany (Tel: 0049-211-6707-311; e-mail: harald.peters@bfi.de)
Abstract: The application of statistical methods to investigate data coming from technical processes is state-of-the-art in all steel companies worldwide. In many cases the aim of such an investigation is to find cause&effect relationships between process/plant variables and detected quality deficiencies. Special departments are responsible for these kinds of investigations; they use complex statistical tools and in-house written procedures. The main disadvantage here is that the experience of the people at the production lines cannot be used directly. On the other hand, the mostly used uni-variate and linear statistical techniques are in many cases not sufficient to explain the behaviour of the complex process chain of steel production. For both reasons, automatic data mining technologies which can be handled by plant engineers without knowledge of statistics are under development at many places worldwide. This article presents some approaches to automatic and robust Data Mining which can be used by process and plant engineers.
Keywords: Data processing, data mining, databases, diagnosis, product quality, machine learning, neural networks, decision trees, statistical analysis, steel industry.
1. INTRODUCTION
Data Mining (DM) is usually an interactive technology: a user with domain knowledge about the investigated problem interacts with the computer system and uses the results of the different DM technologies to expand his knowledge about the given task. The result of the investigation therefore depends on the background of the user and the suitability of the applied Data Mining technologies. Unfortunately it is very rare that a user combines domain knowledge in both fields of interest, the steel quality problem and the DM methodology. One solution for this problem could be the availability of powerful and simple-to-handle Data Mining methods which can be applied by plant engineers. In this way it is guaranteed that knowledge about the production process, the behaviour of measurement systems, limitations of data acquisition and special properties of the just generated or treated product is taken into account during the Data Mining process.
1.1 Problem description
The given task is to make DM technologies available which can be applied by people who do not have deep knowledge of the DM methods. Therefore all constraints originating from the DM methods have to be checked automatically. Furthermore the selected DM technologies have to cover all typical challenges of data from industrial processes (see chapter 1.3). Finally the different types of quality deficiencies have to be classified into groups, and for each group suitable grading techniques (i.e. the way to calculate features from the raw data of the quality measurement device) have to be pre-defined. Summarising, as many processing steps as possible have to be automated while at the same time the correctness of the DM results has to be guaranteed.
1.2 Constraints caused by the user
Regarding the possible user (refer to fig. 1), nearly no constraints shall be made. There should be nearly no requirements concerning user skills in Data Mining and related methods. That means that all the tasks listed below have to be performed automatically, without intervention or support by the user:
• decision whether the given problem is a classification or a regression task (a minimal sketch of such a check follows this list),
• selection of suitable pre-processing steps,
• selection of DM methods out of a given pool of techniques,
• parameter adjustment for the selected DM method.
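As an illustration of the first task, the following minimal sketch (not part of the original tool) guesses whether a target column calls for classification or regression; the threshold of 10 distinct integer-valued levels is an assumption chosen purely for illustration.

```python
import numpy as np

def task_type(target, max_classes=10):
    """Guess whether the target calls for classification or regression.

    Heuristic sketch: non-numeric targets, or targets with only a few
    integer-valued levels, are treated as class labels; everything else
    is treated as a continuous quantity to be regressed.
    """
    target = np.asarray(target)
    if not np.issubdtype(target.dtype, np.number):
        return "classification"              # text labels -> classes
    values = np.unique(target)
    if len(values) <= max_classes and np.allclose(values, np.round(values)):
        return "classification"              # few integer levels -> classes
    return "regression"

print(task_type(["ok", "defect", "ok"]))          # classification
print(task_type([0, 1, 0, 1, 1]))                 # classification
print(task_type([2.31, 2.28, 2.35, 2.40]))        # regression
```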
Apart from this, the user has to select the relevant process route (e.g. from vacuum degassing up to hot rolling) and the process variables which shall be taken into account for the investigation. For these tasks the user's domain knowledge is indispensable and cannot be replaced by any kind of automatic procedure. It is important here that the user selects the variables generously, in order to have a chance of also finding "unexpected" results.
1.3 Constraints caused by the industrial process
The origin of the data which have to be investigated cannot be ignored. In the industrial environment the following challenges are typical: non-linear relationships, multi-variate problems, incomplete data sets, unbalanced data samples (i.e. many data sets from defect-free products and only a few from products with quality deficiencies) and, because of the high degree of automation in the steel industry, highly correlated input variables. Additionally it has to be considered that most input information (process variables as well as quality information) can only be measured with limited accuracy. Furthermore it is important to select a suitable grading method for the target variable: if the investigated quality deficiency is e.g. a special kind of surface defect, the grading method will differ from that used for a flatness problem.
2. CURRENT SITUATION IN STEEL COMPANIES
2.1 Availability of data
Because of the highly automated processes in the steel industry and the worldwide tendency to store all measured data in large databases, the availability of data is quite good in many companies. Larger problems are, on the one hand, the quality of these data and, on the other hand, the necessity to "track" the product (and, starting from slab or billet casting, also its length) along the complete manufacturing route. The latter is a precondition for preparing process and quality data in such a way that a powerful evaluation is possible. With regard to the given quality problem, the correct base for product tracking has to be selected (e.g. heat-oriented, or oriented to 100 m length segments of hot rolled coils, etc.) and both process and quality variables have to be aggregated to this base.
2.2 User demands
It has been usual in the steel industry for many years that quality departments or specialised statistics departments perform data analysis to look for cause&effect relationships for severe quality problems. For this purpose specially educated people apply complex statistical tools and construct the data samples "by hand" in a very time-consuming way. Because of the increasing importance of the time needed to find the reason for quality defects, it is mandatory for the steel companies to enlarge the number of people who can perform such analyses. Additionally it is essential to integrate the domain knowledge about the given quality problem as well as possible. The solution here is to develop tools which allow process or quality engineers to apply Data Mining techniques and at the same time to use their own background knowledge about the process technology. Therefore simple-to-use and powerful DM tools are necessary, and as many pre-processing steps as possible have to be performed automatically.
Figure 1: The typical user
3. COMPONENTS OF TECHNICAL SOLUTIONS
In this chapter the different components are discussed which are necessary to finally construct a suitable Data Mining process chain under the constraints described before. These are the construction of the data sample, the automatic pre-processing of the data and methods for the selection of the most important variables. Also techniques to evaluate and compare results between different approaches are presented.
3.1 Construction of data sample
The harsh environment at steel production plants requires robust and unfailing sensors, and the complexity of the processes increases the number of parameters necessary to describe, monitor and control these processes and the related plants. Additionally, the ways to assess product features and quality are manifold because of the different applications steel products are made for. So there exists a plurality of information sources which may be important for a cause&effect analysis.
Therefore one major task in the Data Mining process is the construction of a suitable and reliable data sample. "Suitable" means here that all information sources which may be relevant for the problem of interest are available. This includes measurement data of sensors, manual inputs of staff documenting the behaviour of the related process or the occurrence of events, the documentation of applied changes of the process and the involved plants, as well as data describing the target quality deficiencies. "Reliable" means that the collected information is correct and that the matching between input and output data in time or length can be done with the required degree of accuracy. All this information has to be put into database tables accessible for the analysis tools.
This step of sample construction cannot be done automatically; here the knowledge and experience of the user about the information sources are in demand. The effort necessary for this task depends on the degree of automatic data acquisition realised at the plant and the number of different types of information sources.
The next step of sample construction is the generation of a flat table combining the corresponding inputs and outputs for further analysis. One essential prerequisite is the existence of identity numbers (IDs) to identify single production pieces, or the existence of synchronous time or length measurements to allocate data from different sources
to the same part of the product. This part of the data sample construction depends on the way the data are archived and organised in the databases of the steel plant. A description of the possibilities and solutions for such a material tracking would be extraordinarily complex and extensive and is therefore skipped here.
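A minimal sketch of such a flat-table generation with the pandas library is given below; the table names, column names and the aggregation to one row per piece ID are purely illustrative assumptions, not the schema of any real plant database.

```python
import pandas as pd

# Hypothetical source tables: process measurements per piece ID and
# quality data (number of surface defects) per piece ID.
process = pd.DataFrame({
    "piece_id": [101, 101, 102, 103],
    "furnace_temp": [1190.0, 1205.0, 1180.0, 1210.0],
    "rolling_force": [8.1, 8.3, 7.9, 8.6],
})
quality = pd.DataFrame({
    "piece_id": [101, 102, 103],
    "surface_defects": [0, 3, 1],
})

# Aggregate the process measurements to one row per production piece ...
inputs = process.groupby("piece_id").mean()

# ... and join inputs and outputs on the common ID into one flat table.
flat_table = inputs.join(quality.set_index("piece_id"), how="inner")
print(flat_table)
```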
The pre-processing of the data (described in the next chapter) may happen after the compilation of the single tables or only after the generation of the flat table including all input and output data. The best practice is always to pre-process data as near as possible to their origin. For instance, to increase the reliability of derived features it is better to first eliminate outliers in a signal and then calculate the mean value of a time window, than to calculate the mean and then look for outliers in the mean values.
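The following sketch illustrates this ordering on a single signal window: outliers are removed first (here with a simple 2-sigma rule, an assumed threshold) and only then is the window mean calculated.

```python
import numpy as np

def window_mean_after_outlier_removal(signal, n_sigma=2.0):
    """Remove outliers beyond n_sigma standard deviations, then average."""
    signal = np.asarray(signal, dtype=float)
    centre, spread = signal.mean(), signal.std()
    cleaned = signal[np.abs(signal - centre) <= n_sigma * spread]
    return cleaned.mean()

window = [1490, 1492, 1489, 1650, 1491, 1493]    # 1650 is a sensor spike
print(np.mean(window))                            # mean distorted by the outlier
print(window_mean_after_outlier_removal(window))  # robust window mean
```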
3.2 Automatic pre-processing
Pre-processing of data is an indispensable step before executing any type of data analysis. It serves to maintain and increase data accuracy and reliability. Conventional pre-processing ranges from low-level procedures, e.g. removing duplicate data sets from the sample or treating missing entries in the data sets, to more ambitious ones like treating outliers or detecting redundant information in the data. Because the users interested in cause&effect analysis at steel production are mostly inexperienced in the application of data analysis techniques, the combination of different pre-processing actions in an automatic scheme supports the user's work and reduces the risk of wrongly applied methods.
The aim of automatic pre-processing is to relieve the user of time-consuming work and to ensure an objective and reproducible treatment of the data. The main items of an automatic pre-processing scheme are
• elimination of duplicate data sets,
• treatment of missing observations,
• identification and treatment of outliers,
• elimination of redundant data,
• generation of a data sample respecting the requirements of the target Data Mining method.
For these items there are different requirements and prerequisites which have to be complied with:
• the identification numbers or position information (both called ID) and the output have to be marked before starting the automatic pre-processing,
• a rule has to be defined how to treat data sets with equal IDs (e.g. eliminating the "younger" set of a duplicate, or taking the more completely filled data set),
• in case of time series or length-related data, missing values may be replaced by an interpolation of the existing neighbours (the reason being that physical processes do not vary in steps); otherwise, missing values have to be marked so that they can be identified during the application of the different mining procedures,
• for not normally distributed parameters a transformation should be applied before eliminating outliers via the calculated standard deviation (statistical values like skewness or kurtosis help to classify the distribution),
• for the detection of outliers a limit value based on the standard deviation has to be defined; in case of uniformly distributed parameters, lower and upper quantiles can be used for outlier detection,
• for time- or length-related signals an outlier test should use the first derivative of the signal to identify sudden steps even if the amplitude is within the defined standard deviation limit,
• outliers are eliminated by deleting the value (producing a missing value) or by exchanging it for another value, depending on the signal/parameter type (overall mean/median, interpolation of neighbours, etc.),
• redundant data can be detected by simple linear correlation analysis, but the occurrence of inputs linearly depending on other inputs also has to be checked (rank deficit of the input matrix), and those inputs should be rejected.
In preparation for the following Data Mining analysis, the sample also has to be examined and prepared with regard to some special requirements of data mining methods. E.g. some approaches require normally distributed input data to deliver accurate and reliable results; other methods need a more or less uniform distribution of the output parameter to reach usable results. Therefore, at the end of an automatic pre-processing scheme, the data sample has to be examined concerning the type of distribution of the included input parameters (to identify suitable data mining methods or to reject, e.g., not normally distributed parameters from the analysis) and concerning the distribution of the output variable.
Figure 2: Balancing of an unbalanced data sample
To respect data mining methods sensitive to the distribution of the output parameter, the compiled data sample should be balanced, i.e. the sample has to be processed in such a way that a more or less equal distribution of the selected output is ensured. This is often necessary for quality data, because there are many more observations of high-quality products than of products with quality deficiencies. But the balancing method must ensure that the reduced sample represents the same input space as the original sample. A suitable approach here is to cluster the input space (e.g. k-means, neural networks) and then to select single representatives of each cluster.
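A minimal sketch of this balancing step with scikit-learn follows; setting the number of clusters equal to the number of available defect observations, and keeping the majority sample closest to each cluster centre as its representative, are illustrative choices rather than the procedure of the original tool.

```python
import numpy as np
from sklearn.cluster import KMeans

def balance_by_clustering(X, y, defect_label=1, random_state=0):
    """Reduce the majority (defect-free) class by clustering it and keeping
    one representative per cluster (the sample closest to the centre)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    minority = X[y == defect_label]
    majority = X[y != defect_label]

    km = KMeans(n_clusters=len(minority), n_init=10,
                random_state=random_state).fit(majority)
    reps = [np.argmin(np.linalg.norm(majority - c, axis=1))
            for c in km.cluster_centers_]
    X_bal = np.vstack([minority, majority[reps]])
    y_bal = np.array([defect_label] * len(minority) + [0] * len(reps))
    return X_bal, y_bal

# toy example: 2 defect observations against 20 defect-free observations
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(3, 1, (2, 4)), rng.normal(0, 1, (20, 4))])
y = np.array([1] * 2 + [0] * 20)
X_bal, y_bal = balance_by_clustering(X, y)
print(X_bal.shape, np.bincount(y_bal))   # balanced: 2 defect, 2 defect-free
```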
The described parts of an automatic pre-processing can be used in any case of data analysis.
3.3 Methods to evaluate the importance of variables
The typical situation at the beginning of a data-driven investigation of cause&effect relationships is that plant engineers compile a collection of process and plant variables which could be of interest with regard to the given technical problem. Normally several dozen variables are then candidates for relevant influencing variables. The main task of Data Mining is now to select the most important ones. This can be done by techniques which allow the generation of a "priority list". Depending on whether the given task is a regression or a classification problem, several methods are suitable for this purpose. In the following, some methods for both fields are described briefly.
Multiple Decision Trees
Decision trees are a classification technique from the group of machine learning methods. By the recursive application of axis-parallel splits, the high-dimensional feature space is separated into sub-spaces until each sub-space contains only data sets of one class. The selection of the optimal dimension and position of the next split is done by an information measure, which is very often based on the entropy:

$$I\bigl(P(v_1),\dots,P(v_i),\dots,P(v_k)\bigr) = \sum_{i=1}^{k} -P(v_i)\,\log_2 P(v_i)$$

Here $I$ is the entropy, the $v_i$ are single events and $P(v_i)$ is the probability of event $v_i$. The gain of entropy (information gain) is used, for example, to select the optimal split.
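A small sketch of the entropy measure and of the resulting information gain of a split, written directly from the formula above, could look as follows (the example labels are invented):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """I(P(v1),...,P(vk)) = sum_i -P(vi) * log2 P(vi)"""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, left, right):
    """Entropy reduction achieved by splitting `labels` into two subsets."""
    n = len(labels)
    return entropy(labels) - (len(left) / n * entropy(left)
                              + len(right) / n * entropy(right))

parent = ["defect"] * 4 + ["ok"] * 4
print(entropy(parent))                                    # 1.0
print(information_gain(parent, parent[:4], parent[4:]))   # 1.0 (perfect split)
```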
One feature of decision trees is that the method normally continues until all sub-spaces contain only data sets of one class. This sometimes leads to many leaves which each cover only a very small number of data sets (in the extreme case only one), which is not really useful information. A solution is the application of "pruning" techniques: less important leaves are deleted and replaced by nodes. Finally the method selects the variables which are necessary to build the complete decision tree. For the description of the classification accuracy of the pruned tree by a scalar value, the following "fitness function" $E$ is used:

$$E = \sum_{i=1}^{M} \varepsilon_i\,\omega_i \qquad\text{with}\qquad \sum_{i=1}^{M}\omega_i = 1 \quad\text{and e.g.}\quad \omega_i = \frac{n_{ci}}{n}$$

$$\text{where}\qquad \varepsilon_i = \alpha_i\,\frac{m_{ci}}{n_{ci}} \;-\; \beta_i \sum_{j=1}^{M-1} \frac{m^{*}_{cj}}{n_{ci}}, \qquad \alpha_i,\beta_i \in [0,1],\qquad \alpha_i + \beta_i = 1$$

Because the result of the decision tree depends on the order of the data sets in the data sample, and because the data sample is normally rather too small than too large, a kind of cross-validation can be used to obtain a robust result. Figure 3 illustrates the procedure.
Figure 3: Procedure of multiple decision trees
At the end, the number of "hits" can be counted for each variable. By sorting this hit list in descending order, a priority list is generated which gives the user information about the importance of the variables with regard to the given classification problem.
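A sketch of this multiple-decision-tree idea with scikit-learn is shown below; a standard CART tree stands in for the tree implementation used by the authors (OC1 is named later in the paper), and counting a "hit" whenever a variable is actually used for a split in the tree of a cross-validation fold is one plausible reading of the procedure, not necessarily the exact original implementation. The data set is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)

hits = np.zeros(X.shape[1], dtype=int)
for train_idx, _ in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    tree = DecisionTreeClassifier(max_depth=4, random_state=0)
    tree.fit(X[train_idx], y[train_idx])
    # feature indices that actually appear as split variables in this tree
    used = np.unique(tree.tree_.feature[tree.tree_.feature >= 0])
    hits[used] += 1

# priority list: variables sorted by their number of "hits"
for var in np.argsort(hits)[::-1]:
    print(f"variable {var}: {hits[var]} hits")
```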
Self organising map (SOM)
In Peters et al. (2001) the general idea of this approach for the given purpose has been presented. In the meantime a very simple algorithm for the automatic evaluation of the component planes of the SOM has been tested. Here the weight matrix between one input node of the SOM and all output nodes is arranged as a vector, with the columns of the matrix placed one after the other. Between this vector of the target variable and the corresponding vectors of all input variables a very simple linear correlation coefficient is calculated:

$$r = \frac{\sum\,(X-\bar{X})(Y-\bar{Y})}{\sqrt{\sum\,(X-\bar{X})^{2}\,\sum\,(Y-\bar{Y})^{2}}}$$

If these correlation coefficients are sorted in descending order, a priority list of the variables with regard to the given problem is generated. Because the SOM can handle classification tasks as well as regression problems, this way of generating a priority list can be used in both cases.
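A sketch of this component-plane evaluation, assuming an already trained SOM whose codebook is available as an array of shape (rows, columns, number of variables), is given below. The layout and variable names are illustrative, and taking the absolute value of the correlation (so that strong negative correlations also rank highly) is an added assumption.

```python
import numpy as np

def som_component_ranking(weights, target_index, names):
    """Rank input variables by the linear correlation between their SOM
    component plane and the component plane of the target variable.

    `weights` is the trained SOM codebook, shape (rows, cols, n_variables).
    """
    planes = weights.reshape(-1, weights.shape[-1])   # one column per variable
    target = planes[:, target_index]
    ranking = []
    for i, name in enumerate(names):
        if i == target_index:
            continue
        r = np.corrcoef(planes[:, i], target)[0, 1]
        ranking.append((name, abs(r)))
    return sorted(ranking, key=lambda item: item[1], reverse=True)

# toy codebook of a 6x6 SOM trained on 4 variables (random stand-in data)
rng = np.random.default_rng(0)
codebook = rng.normal(size=(6, 6, 4))
codebook[:, :, 3] = 0.8 * codebook[:, :, 0] + 0.2 * rng.normal(size=(6, 6))

for name, score in som_component_ranking(codebook, target_index=3,
                                          names=["temp", "speed", "force", "defects"]):
    print(f"{name}: |r| = {score:.2f}")
```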
Other techniques
The methods just described are only a very small cutout of the group of Data Mining techniques which are able to evaluate the importance of a variable for a given cause&effect problem. Statistical methods in this field are e.g. discriminant analysis, ANOVA (analysis of variance) or stepwise regression. Although the "categorised histogram" is only a uni-variate technique, in practice it is very useful in many cases. Here a suitable visualisation and the calculation of a similarity value, which allows the importance of a variable to be described, are the tools to analyse the results. For further methods we refer to the Data Mining literature (e.g. Cherkassky and Mulier, 1998; Berthold and Hand, 1999).
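The paper does not specify which similarity value is used for the categorised histogram; the sketch below uses histogram intersection of the class-conditional distributions on shared bins as one possible choice. The variable and the numbers are invented; a low intersection between the "good" and "defect" histograms would suggest that the variable separates the classes.

```python
import numpy as np

def histogram_overlap(values_good, values_defect, bins=20):
    """Histogram intersection of the class-conditional distributions of one
    variable; values near 0 mean good separation, values near 1 mean none."""
    lo = min(np.min(values_good), np.min(values_defect))
    hi = max(np.max(values_good), np.max(values_defect))
    edges = np.linspace(lo, hi, bins + 1)
    h_good, _ = np.histogram(values_good, bins=edges)
    h_def, _ = np.histogram(values_defect, bins=edges)
    h_good = h_good / h_good.sum()
    h_def = h_def / h_def.sum()
    return float(np.minimum(h_good, h_def).sum())

rng = np.random.default_rng(1)
good = rng.normal(1200, 10, 500)    # e.g. a temperature on defect-free coils
defect = rng.normal(1230, 10, 60)   # the same variable on defective coils
print(f"overlap = {histogram_overlap(good, defect):.2f}")  # small -> relevant variable
```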
3.4 Evaluation and comparison of results
Most Data Mining techniques assume a special kind of model behind the data. E.g. discriminant analysis needs Gaussian distributed data and can handle only linear relationships between input and output variables. If these assumptions are not valid, the technique will fail to find cause&effect relationships. Based on the used DM techniques and the corresponding model assumptions, different solutions for the evaluation of results are available. E.g. the F-statistic value is a suitable measure to tell something about the contribution of a single variable to the overall classification result of a discriminant analysis, but it is not suitable to evaluate the results of a decision tree. At the beginning of a Data Mining session it is normally totally unclear whether the given problem is linear or non-linear and whether all necessary input information is available. Furthermore some of the investigated variables are normally distributed, some uniformly, and for some no clear distribution can be found. For this reason usually several techniques are applied and the results are compared, but how can these results be compared? It happens very often that the results are contradictory. Which results are correct and which are not?

Measure values of the kind mentioned above (like the F-statistic value), which give information about the contribution of each variable to the final regression or classification result, are available for many DM technologies. Another possibility to evaluate the results is to use the most important variables as inputs for a model (classification or regression) and then to check the accuracy of the trained model with a suitable validation data set. The problem here is that the methods described above only produce a relative "ranking list" regarding the importance of the variables, but tell nothing about the number of necessary variables and their combination. For this reason the model has to be calculated with a varying number of variables.
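A sketch of this procedure is shown below: the top-k variables of a ranking are used to train a simple classifier, and the accuracy on a held-out validation set is compared for increasing k. The data set, the ranking and the choice of logistic regression are placeholders for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)

# hypothetical priority list produced by one of the ranking methods above
priority = [3, 7, 1, 9, 0, 2, 4, 5, 6, 8]

for k in range(1, len(priority) + 1):
    cols = priority[:k]
    model = LogisticRegression(max_iter=1000).fit(X_train[:, cols], y_train)
    acc = model.score(X_val[:, cols], y_val)
    print(f"top {k:2d} variables -> validation accuracy {acc:.3f}")
```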
4. TECHNICAL REALISATION
Besides the applied DM methods and data handling strategies, the technical realisation is a key point for a successful implementation. Regarding the user guidance and the type of HMI, two different general concepts are available on the market at the moment. The first is the classical menu- and icon-oriented software with random access to all functionalities (typical for most statistics tools); if a number of functions have to be processed in a pre-defined order, the option to record a macro is normally available. The alternative is block-oriented software, which allows the user to generate macros by visual programming. Looking at the target user who should deal with the problem of cause&effect analysis of quality deficiencies in the steel industry (refer to chapter 1.2), none of these possibilities offers a really good solution. That is the reason why the group of the authors has investigated and realised new solutions in recent years.
4.1 Wizard-like user guidance
A well known technique to guide an inexperienced user through complex software is a so-called "wizard", which is very often used for the installation of software. Here the user has to answer some more or less simple questions and also has the possibility to navigate through the different steps by using a "back" and a "continue" button. Exactly this procedure has been transferred to a Data Mining tool. For this it was necessary to structure the DM process into clear steps with understandable questions for the user. The following steps have been used:
• Selection of the database and data table to load,
• Selection of the variables which have to be investigated,
• Presentation of the pre-processing results, with the possibility for the user to remove some of the variables because of these results,
• Presentation of the now available data sample,
• Selection of the aim of the user (see figure below),
• Question whether ranking information is of interest,
• Question whether time-intensive techniques shall be integrated or not,
• Presentation of the DM technique which has been selected, together with some information about the found data characteristics,
• Presentation of the DM results, e.g. a priority list of the variables including a numerical ranking value,
• Selection of the variables which shall now be used for the following modelling process (e.g. the first 5 variables of the ranking list),
• Selection whether only a linear modelling technique shall be used or whether non-linear techniques should also be possible,
• Presentation of the modelling results, with the possibility to store the model for further applications.
Experiences of users from industry show that the described procedure can easily be applied to actual problems. The users can concentrate on the given quality problem and on the application of their background knowledge to evaluate the results. For the more interested user, an additional "details" button in many steps offers the chance to get more information. A further prerequisite for this procedure was to perform the selection of the applied DM technology and the calculation of all its necessary parameters automatically, which was the main challenge during the development.
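The paper does not disclose the actual selection rules; the following much-simplified sketch only illustrates what such an automatic selection step could look like, with decision rules and thresholds that are invented for this example.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def select_dm_method(X, y, allow_slow=True):
    """Pick a DM technique from a small pool based on simple data checks."""
    y = np.asarray(y)
    is_classification = len(np.unique(y)) <= 10

    # crude normality check of the inputs via skewness and excess kurtosis
    roughly_normal = all(abs(skew(col)) < 1.0 and abs(kurtosis(col)) < 1.0
                         for col in np.asarray(X, dtype=float).T)

    if is_classification and roughly_normal:
        return "discriminant analysis"      # linear, assumes Gaussian inputs
    if is_classification:
        return "decision tree"              # model-free, multi-variate
    return "self-organising map" if allow_slow else "stepwise regression"

X = np.random.default_rng(0).normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
print(select_dm_method(X, y))   # typically "discriminant analysis" for these inputs
```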
4.2 Automatic combination of different approaches
In the approach just described, one single DM technology is selected by checking the data structure and evaluating the different wishes of the user. The result is highly dependent on the applied DM method and the implicitly underlying model approach.
To get a more robust and reliable result, it can be useful not to apply only one method but to combine different techniques. For this it is also necessary (as above) that the needed parameters for each technique are selected automatically. Furthermore a suitable presentation and comparison of the results is necessary. The group of the authors has investigated different combinations. One example was the connection of a discriminant analysis (linear, multi-variate), a categorised histogram (model-free, uni-variate), a decision tree (OC1, model-free, multi-variate) and a Self Organising Map (non-linear, multi-variate). All these methods have been applied to a given quality deficiency problem. For each DM method a valuation value has been calculated which was normalised to the range 0 to 1 (0 = variable has no influence, 1 = variable has large influence). For each valuation value a ranking order concerning the input variables is determined. By finally calculating the mean value of all rankings for each variable, an overall ranking can be defined.
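A sketch of this combination is given below: each method's valuation values are normalised to [0, 1], converted to rank positions and averaged into an overall ranking. The method names follow the example above, but all numbers and variable names are invented for illustration.

```python
import numpy as np

variables = ["cooling water", "stopper rod pos.", "casting speed", "mould level"]

# hypothetical valuation values of four methods (higher = more influence)
valuations = {
    "discriminant analysis": [0.9, 0.7, 0.3, 0.1],
    "categorised histogram": [0.8, 0.9, 0.2, 0.3],
    "decision tree (OC1)":   [0.7, 0.6, 0.4, 0.1],
    "SOM":                   [0.9, 0.8, 0.1, 0.2],
}

ranks = []
for values in valuations.values():
    v = np.asarray(values, dtype=float)
    v = (v - v.min()) / (v.max() - v.min())       # normalise to [0, 1]
    order = np.argsort(-v)                        # descending influence
    rank = np.empty_like(order)
    rank[order] = np.arange(1, len(v) + 1)        # rank 1 = most influential
    ranks.append(rank)

mean_rank = np.mean(ranks, axis=0)                # overall ranking
for i in np.argsort(mean_rank):
    print(f"{variables[i]:18s} mean rank {mean_rank[i]:.2f}")
```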
In figure 4 an example for a real case is presented.
Figure 4: Result table for a real example
The two most important variables (here the amount of cooling water at a special position and the stopper rod position) are exactly those which were also found by a much more time-consuming manual DM investigation.
5. CONCLUSIONS AND OUTLOOK
The different investigations and developments have shown that DM methods and software tools for "automatic cause&effect analysis of quality deficiencies", which can be handled not only by DM specialists but also by people who have the domain knowledge, are a suitable way forward. On the one hand the results become more robust compared with a "manual" Data Mining investigation, and on the other hand the group of people who are able to apply these tools can be enlarged drastically. A very important point here is the good integration of the described solution into the overall database technology of the company. The tasks of product and length tracking and the user-friendly availability of all relevant input and output variables are key points for the acceptance of such solutions in industrial reality.
Future developments are necessary in two directions. First of all, more investigations are needed to classify all possible quality deficiencies into suitable groups and to define the optimal pre-processing steps for each of these groups. This is the basis for all further calculations, and mistakes made here can never be repaired in the following DM steps. The second point is the DM procedure itself. For both presented solutions ("wizard-like user guidance" and "automatic combination of different approaches", see chapters 4.1 and 4.2), the selection of the best DM method, the automatic calculation of its necessary parameters based on the characteristics of the data sample and the way to combine several DM techniques are not finally solved. Here future research is necessary.
REFERENCES
Berthold, M. and Hand, D.J. (1999). Intelligent Data Analysis. Springer Verlag, Berlin.
Cherkassky, V. and Mulier, F. (1998). Learning from Data. John Wiley & Sons, Inc., New York.
Peters, H., Link, N. and Holzknecht, N. (1998). Prediction of Product Quality by Classification Algorithms to improve Quality Control Systems. Proceedings of the 1998 Japan-USA Symposium on Flexible Automation, Otsu, Japan, 12-15 July 1998, Volume I, pp. 421-428.
Peters, H. (2000). Application of Computational Intelligence for Quality Control in Steel Production. Proceedings of Toolmet'2000, pp. 1-14, Oulu, Finland, 12-13 April 2000, ISBN 951-42-5595-X.
Peters, H., Link, N. and Heckenthaler, T. (2001). Application of Data Mining Techniques to find correlation between Quality Data and Process Variables. In M. Araki (ed.), 10th IFAC Symposium on Automation in Mining, Mineral and Metal Processing (MMM2001), September 4-6, 2001, Tokyo.