INSPECT: A program system to visualize and interpret chemical data


Software Description

Chemometrics and Intelligent Laboratory Systems, 22 (1994) 147-153 Elsevier Science Publishers B.V., Amsterdam


H. Lohninger

Institute of General Chemistry, Technical University Vienna, Lehargasse 4/152, A-1060 Vienna (Austria)

(Received 14 September 1993)

Abstract

Lohninger, H., 1994. INSPECT: a program system to visualize and interpret chemical data. Chemometrics and Intelligent Laboratory Systems, 22: 147-153.

INSPECT is a DOS-based system for the visualization and interpretation of data. Although running on IBM PCs under DOS, INSPECT features a graphical user interface. Over 250 commands provide the basis for editing, displaying, analyzing and modelling data. All major types of charts can be generated, including on-screen three-dimensional rotation of data. A parser for mathematical formulas allows the processing of the data by entering almost arbitrary formulas. Furthermore, a set of mathematical tools is available, including principal component analysis, multiple linear regression, statistical tests, k-nearest neighbour methods and neural networks.

INTRODUCTION

INSPECT is a program for the visualization and interpretation of data. The system is based on a powerful yet easy-to-use graphical user interface and provides a collection of important methods to edit, visualize and interpret numerical data. These are some of the features of INSPECT:

- mouse-driven graphical user interface
- up to four simultaneous windows to process and display data
- hardcopy routines for all major printers, including PostScript EPS and HPGL
- various methods to import and export data
- creation and display of electronic slide shows
- windows are freely configurable in size and colors
- zoom and pan of data
- data can be labelled by additional class information
- the axes and the titles of the graphs can be designated almost arbitrarily
- sorting and mixing of data
- built-in numerical data editor
- a large number of available chart types: point plots, line plots, spectra, x-y plots, contour lines, principal components, histograms, table survey, etc.
- on-screen three-dimensional rotation of data
- built-in math interpreter
- mathematical/statistical functions: univariate linear and parabolic regression, rank correlation, multiple linear regression, principal component analysis, neural networks, k-nearest neighbour (KNN) modelling, hierarchical clustering
- automatic feature selection: stepwise regression, growing neural networks
- cross-validation of calculated models

0169-7439/94/$07.00 © 1994 - Elsevier Science Publishers B.V. All rights reserved. SSDI 0169-7439(93)E0054-8

CONCEPT

The concept of INSPECT is quite simple. The system works on a central two-dimensional data matrix which can be configured to consist of any number of rows and columns. This data table can be loaded, stored and edited in various ways, and the data can be visualized and analyzed. INSPECT provides a graphical user interface with a maximum of four simultaneously open windows, each of them displaying a particular view of the data. These windows can be adjusted in size, colors and contents. The windows can show a wide variety of charts, which can be zoomed, panned and labelled individually. These charts cover the most commonly used visualization tools, ranging from simple x-y plots to histograms, contour plots and on-screen three-dimensional rotation.

Besides the powerful graphics engine, several mathematical tools provide the basis for changing and analyzing the data. Among these methods are principal component analysis, multiple linear regression, basic statistical tests, a mathematical formula interpreter and neural networks based on radial basis functions.

INSPECT also handles class information. A maximum of 127 classes can be defined, which are indicated by different colors (max. 16) and/or different symbols. The class information can be included in the mathematical calculations.

The idea behind INSPECT is to import data from an external source, edit and process the data, and visualize the internal relations of the data. All this can be done interactively by using a versatile and convenient graphical user interface. Fig. 1 shows the main functions of INSPECT.

Fig. 1. Flow of information within INSPECT. (Diagram labels: import from external ASCII file, dBase III/IV, Lotus 123; export to external ASCII file; graphical data survey; visualisation by variable-object plot, variable-variable plot, contour line plot, histograms, on-screen rotation; scaling, smoothing, arbitrary mathematical transformations, univariate statistics, multiple linear regression, principal component analysis, cluster analysis, neural network, feature selection, cross validation.)

The data model of INSPECT is rather simple and therefore easy to understand and to handle (Fig. 2). INSPECT relies on two data tables which can be configured freely as long as the product of the number of variables and objects does not exceed 8100 (current limit, to be extended in the future). The main data matrix holds the numeric data and the names of the variables and objects. Besides this main data table, an auxiliary vector is maintained which holds optional class information on the objects.

Fig. 2. Basic structure of the data used in INSPECT.

IMPORT AND EXPORT OF DATA

An important feature of any data processing program is how to get data into and out of the program. INSPECT provides several methods to get data from external sources. First, INSPECT provides a universal way to read any ASCII data. Data can be read either from a very simply formatted ASCII table (four header lines followed by the data table) or from any free-form text file (in this case only one variable for all objects can be read at a time). Secondly, INSPECT can read several proprietary formats including dBase and Lotus 1-2-3 files. In this case INSPECT provides some means to select the information needed before importing the data.

Exporting the processed data is as easy as importing it, although INSPECT cannot create proprietary data formats. Besides the basic methods of exporting data, INSPECT creates protocol files for the more elaborate methods of data interpretation.

VISUALIZATION OF THE DATA

INSPECT provides a large set of visualization methods which can be used in four windows in parallel in order to get different views of the data. These windows can be freely configured in size, and overlapping windows are allowed. The following chart types are available:

- a special table survey plot
- plots of any variable against the object number
- plots of all variables of an object
- plots of two variables against each other (x-y plot)
- contour line plots
- histograms
- on-screen three-dimensional rotation
- principal component projection
- principal components loading plot

Each of these charts can be zoomed, panned, labelled, and edited in colors and display mode (point plots, lines, spectra). A hardcopy utility enables the user to make a screen copy of the current display on a wide variety of printers (including PostScript printers and HPGL plotters).

The philosophy of INSPECT is to work on the data using graphical tools as much as possible. To accomplish this, a special survey plot of the data table is available which shows the whole data table at a glance. This survey plot can be used to select variables and objects interactively by just clicking on them.

For many investigations it would be of great help to look at a sequence of ‘slides’ where each slide represents the outcome of a lengthy calculation. Viewing a series of slides displayed consecutively gives good insight into the dynamics of a simulated process. INSPECT allows the creation of such slide shows by producing the slides individually and then replaying them according to a simple script. The distribution disk of INSPECT contains such a slide show, showing the process of adaptation during the learning phase of a neural network trained by backpropagation.

MODELLING AND ANALYSIS OF DATA

INSPECT provides some of the major methods of data analysis and modelling. Although the list of available methods is by no means comprehensive, the implemented procedures give basic support for data analysis, especially when combined with the visualization tools. Further methods are to be implemented in the near future. The methods listed below are not discussed in their theoretical details because the aim of this paper is only to show the possibilities of INSPECT:

- formula interpreter for mathematical formulas
- simple univariate statistical tests
- univariate linear and parabolic regression
- multiple linear regression
- principal component analysis
- hierarchical clustering (dendrograms)
- KNN modelling
- neural networks (based on radial basis functions)
- automatic feature selection using growing neural networks
- stepwise linear regression
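The formula interpreter is the tool used in the application example to apply a logarithm to a column of the data table. The paper does not describe the formula syntax of INSPECT, so the following Python sketch only illustrates the general idea: a restricted arithmetic formula is parsed once and then evaluated for every value of a column. The names `apply_formula` and `ALLOWED_FUNCS`, the placeholder variable `x` and the set of supported functions are assumptions made for this illustration.

```python
import ast
import math
import operator

# Hypothetical whitelist; the real function set of INSPECT is not given in the paper.
ALLOWED_FUNCS = {"ln": math.log, "exp": math.exp, "sqrt": math.sqrt, "abs": abs}
BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}


def _evaluate(node, x):
    """Recursively evaluate a parsed formula for one value x."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body, x)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.Name) and node.id == "x":
        return x
    if isinstance(node, ast.BinOp) and type(node.op) in BIN_OPS:
        left = _evaluate(node.left, x)
        right = _evaluate(node.right, x)
        return BIN_OPS[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_evaluate(node.operand, x)
    if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
            and node.func.id in ALLOWED_FUNCS
            and len(node.args) == 1 and not node.keywords):
        return ALLOWED_FUNCS[node.func.id](_evaluate(node.args[0], x))
    # Anything else (attribute access, unknown names, ...) is rejected.
    raise ValueError("unsupported element in formula")


def apply_formula(formula, column):
    """Apply a formula written in terms of x to every value of a column."""
    tree = ast.parse(formula, mode="eval")
    return [_evaluate(tree, value) for value in column]


# Logarithmic transformation of a made-up column.
transformed = apply_formula("ln(1 + x)", [0.0, 1.0, 9.0])
```

Walking the syntax tree against a whitelist, rather than calling `eval`, keeps the accepted formulas limited to plain arithmetic and the listed functions.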

EXAMPLE OF APPLICATION

A short example will show the application of INSPECT to multivariate data analysis. The data of this example are taken from the literature and deal with the prediction of retention times of peptides (in reversed-phase high-performance liquid chromatography) from their amino acid residues [1]. The authors reported retention times of 104 peptides consisting of 2 to 64 amino acids. They used multiple linear regression (MLR) to predict the retention times. Similar work had been done ten years earlier by Sasagawa et al. [2], who assumed a logarithmic relationship between retention time and the sum of the (estimated) individual retention times of the constituent amino acids:

$$t_R = A \ln\left(1 + B \sum_j D_j n_j\right) + C$$

where $A$, $B$ and $C$ are constants to be determined, $D_j$ is the calculated retention time of the individual amino acid residue $j$, and $n_j$ is the number of residues of the amino acid $j$. Although Chabanet and Yvon [1] follow a somewhat different approach, the data in this example (the retention times, the amino acid sequences and the $D_j$) have been taken from ref. [1].

The peptides used in this study consisted of nineteen different amino acids, which are counted individually for each peptide, giving a data matrix of 104 objects and 21 variables (nineteen amino acid residues plus the retention time plus the sum $\sum_j D_j n_j$). In addition, the total number of amino acids per peptide has been included (variable ‘leng’), thus arriving at a data matrix of 104 objects and 22 variables. Fig. 3 shows the survey plot of the data matrix (top) and the dependence of the retention time on $\sum_j D_j n_j$ (lower right) and on the number of constituent amino acids (lower left).

In order to follow the work of Sasagawa, the logarithm is applied to the weighted sum $\sum_j D_j n_j$ (INSPECT allows the data, or part of it, to be transformed by entering arbitrary mathematical formulas). Next, MLR is used for two experiments: one using ‘leng’ and $\sum_j D_j n_j$ as input variables and another with ‘leng’ and $\ln(1 + \sum_j D_j n_j)$ as inputs. The results are shown in Fig. 4 (left charts).

Another way to build a model for the prediction of retention times is to train a neural network using the same input variables as with MLR. INSPECT provides a specific network model which is based on radial basis functions [3]. This network architecture has the advantage of exhibiting short training times (compared to the standard backpropagation procedure). In this case the logarithmic transformation could be omitted, since artificial neural networks provide for the non-linearities in the model automatically. The results of the training of the network (fifteen hidden neurons, S = 0.04, R = 0.0, see ref. [3] for explanation) are shown in Fig. 4 (right charts) and evidently indicate the superior performance of the neural network.

Of course, no modelling is of much worth (especially when dealing with neural networks) if the results are not cross-validated. INSPECT provides a simple-to-use cross-validation procedure for both multiple linear regression and neural networks. The results of the cross-validation are shown in Table 1. These results show clearly that (1) the model obtained by the neural network comes closest to the optimum estimation of the retention time and that (2) the neural network does not need any pre-processing of the data.

TABLE 1
Results of the cross-validation (leave-two-out). r² indicates the goodness of fit, s is the standard deviation of the residuals.

Model used    r²        s
MLR-Lin       0.5675    4.87
MLR-Log       0.8243    2.97
RBF-Lin       0.9050    2.08
RBF-Log       0.9208    1.90
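The construction of the input variables and the logarithmic model of Sasagawa et al. can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: sequences are written in one-letter amino acid codes, the constants in `EXAMPLE_D` are invented placeholders and not the values of ref. [1], and the function names are hypothetical.

```python
import math
from collections import Counter

# Placeholder residue constants D_j; NOT the values published in ref. [1].
EXAMPLE_D = {"A": 1.0, "G": 0.5, "L": 8.0, "K": -2.0}


def weighted_sum(sequence, d_values):
    """Return sum_j D_j * n_j, where n_j counts residue j in the sequence."""
    counts = Counter(sequence)
    return sum(d_values[residue] * n for residue, n in counts.items())


def sasagawa_retention(sequence, d_values, a, b, c):
    """Evaluate t_R = A * ln(1 + B * sum_j D_j n_j) + C for one peptide."""
    return a * math.log(1.0 + b * weighted_sum(sequence, d_values)) + c


def descriptor_row(sequence, d_values):
    """Build the two regression inputs of the example: 'leng' and the weighted sum."""
    return {"leng": len(sequence), "sum_Dn": weighted_sum(sequence, d_values)}


# One made-up peptide; A, B and C are arbitrary here, not fitted constants.
row = descriptor_row("GALLK", EXAMPLE_D)
t_r = sasagawa_retention("GALLK", EXAMPLE_D, a=10.0, b=1.0, c=2.0)
```

In the example of the paper the constants are not fixed in advance; the regression determines the coefficients from the 104 measured retention times.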

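The two quantities reported in Table 1 can be reproduced in principle with a small leave-two-out loop. The sketch below is not the procedure implemented in INSPECT, whose details are not given here. It assumes that the objects are left out in consecutive, non-overlapping pairs, uses a univariate least-squares line as a stand-in for the MLR and RBF models, takes r² as the squared correlation between actual and predicted values, and works on invented numbers.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_two_out(xs, ys):
    """Predict each consecutive pair of objects from a model fitted without it."""
    predictions = []
    for start in range(0, len(xs), 2):
        held_out = range(start, min(start + 2, len(xs)))
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        slope, intercept = fit_line(train_x, train_y)
        predictions.extend(slope * xs[i] + intercept for i in held_out)
    return predictions


def r_squared(actual, predicted):
    """Square of the Pearson correlation between actual and predicted values."""
    mean_a, mean_p = statistics.fmean(actual), statistics.fmean(predicted)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_a * var_p)


# Made-up numbers for illustration only; not the peptide data of ref. [1].
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
predicted = leave_two_out(x, y)
residuals = [a - p for a, p in zip(y, predicted)]
r2 = r_squared(y, predicted)      # goodness of fit
s = statistics.stdev(residuals)   # standard deviation of the residuals
```

Because every prediction comes from a model that never saw the object in question, r² and s computed this way describe predictive ability rather than the quality of the fit to the training data.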

Fig. 3. Visualization of the data table (survey plot, top window) and the dependence of the retention times on the length of the peptide chain (bottom left) and the sum of Chabanet’s constants (bottom right). The top window shows the graphical representation of the data matrix, the leftmost column indicating the retention time, the second column the length of the peptide, the third column the sum of Chabanet’s constants and all other columns the counts of the nineteen different amino acids.

Fig. 4. Comparison of modelling by multiple linear regression and neural networks. All windows show predicted values versus actual values of retention time. The windows at the left show the results obtained using multiple linear regression (with and without logarithmic pre-transformation); the windows at the right show the results from the neural network. It can be easily seen that the neural network estimates the retention times equally well for both types of pre-processing. The goodness of fit (square of the correlation coefficient) is indicated at the upper right corner of each window.

IMPLEMENTATION, HARDWARE REQUIREMENTS AND AVAILABILITY

INSPECT has been implemented in Pascal (approx. 19000 lines of code), additionally using two Pascal libraries [4,5] which contributed the graphical basis system for the user interface. A manual (approx. 100 pages) is provided as a PostScript file together with example data and the binaries of INSPECT. Currently, INSPECT is restricted to 8100 data points, which means that the product of objects and variables must not exceed this number.

The following summary lists the basic hardware and software which is necessary or strongly recommended for convenient use of INSPECT:

- IBM-compatible PC (at least 80286; 80386 or higher recommended)
- 640 kbyte memory (2 Mbyte recommended)
- XMS manager recommended
- EGA or VGA graphics card (VGA recommended)
- hard disk, at least 4 Mbyte free
- math coprocessor recommended
- DOS 5.0 or higher

University scientists and users of non-profit organizations can get a free copy on request (email to: [email protected]) if they have access to the Internet. Distribution via surface mail is possible but will be charged at US$ 20 in order to compensate for postage and handling. (Please include cash in order to have fast delivery; no money orders, credit cards or bank cheques except Eurocheques will be accepted, since the fees for clearing foreign cheques are very high in Austria and would eat up most of the US$ 20. Eurocheques should be made out in Austrian Schilling.)
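The size restriction stated in this section, together with the data model of Fig. 2 (main data matrix plus auxiliary class vector), can be summarised in a short Python sketch. Only the limits of 8100 data points and 127 classes come from the text; the class `DataTable`, its field names and the convention that label 0 means "no class assigned" are assumptions made for illustration, and INSPECT itself is written in Pascal.

```python
from dataclasses import dataclass, field

MAX_CELLS = 8100    # product of objects and variables (limit stated in the text)
MAX_CLASSES = 127   # maximum number of classes (stated in the text)


@dataclass
class DataTable:
    """Main data matrix plus the auxiliary class vector (hypothetical layout)."""
    variable_names: list
    object_names: list
    values: list                                   # one row of numbers per object
    classes: list = field(default_factory=list)    # optional class label per object

    def __post_init__(self):
        n_obj, n_var = len(self.object_names), len(self.variable_names)
        if n_obj * n_var > MAX_CELLS:
            raise ValueError(f"{n_obj} x {n_var} exceeds {MAX_CELLS} data points")
        if len(self.values) != n_obj or any(len(r) != n_var for r in self.values):
            raise ValueError("need one row per object and one entry per variable")
        if not self.classes:
            self.classes = [0] * n_obj   # assumption: 0 means no class assigned
        if len(self.classes) != n_obj:
            raise ValueError("class vector needs one entry per object")
        if any(not 0 <= c <= MAX_CLASSES for c in self.classes):
            raise ValueError(f"class labels must lie between 0 and {MAX_CLASSES}")


def fits(n_objects, n_variables):
    """Check the size limit without building a table."""
    return n_objects * n_variables <= MAX_CELLS


# The peptide example (104 objects, 22 variables) stays well below the limit.
example_ok = fits(104, 22)
```

With 104 x 22 = 2288 data points, the application example uses less than a third of the available capacity.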

REFERENCES

1 C. Chabanet and M. Yvon, Prediction of peptide retention time in reversed-phase high-performance liquid chromatography, Journal of Chromatography, 599 (1992) 211-255.

2 T. Sasagawa, T. Okuyama and D.C. Teller, Prediction of peptide retention times in reversed-phase high-performance liquid chromatography during linear gradient elution, Journal of Chromatography, 240 (1982) 329-340.

3 H. Lohninger, Evaluation of neural networks based on radial basis functions and their application to the prediction of boiling points from structural parameters, Journal of Chemical Information and Computer Sciences, 33 (1993) 736-744.

4 H. Lohninger, Turbo Pascal 7.0 Toolbox, IWT-Verlag, München, 1993.

5 H. Lohninger, Borland Pascal Graphik Toolbox, IWT-Verlag, München, 1993.

“I downloaded the program ‘INSPECT’ of Hans Lohninger via file transfer protocol and had no problems in running this software. Furthermore I think this piece of software is a good contribution to chemometrics and the field of data interpretation.” Robert Höllering, Organisch-Chemisches Institut, Technische Universität München, Lichtenbergstrasse 4, D-85747 Garching, Germany.