Computer Programs in Biomedicine 18 (1984) 259-264
259
Elsevier CPB 00659
Multidimensional curve fitting program for biological data A r t h u r T. J o h n s o n Agricultural Engineering Department, University of Maryland, College Park, MD 20742, USA
A program written for use with the IBM-PC can be used to find least squares solutions to linearized multidimensional equations. The program is ' user-friendly' by requiring little from the user except to make decisions; most responses can be entered by a single keystroke. Once data are entered by the user, they can be repeatedly manipulated, graphed, and correlated. Many models relating data variables can be tried relatively easily, and best fit results found. Examples using respiratory mechanical data illustrate the ease of model comparisons. Least squares Regression Graph Respiration
Linearization
1. I N T R O D U C T I O N Biological d a t a is often n o n l i n e a r a n d d e p e n d e n t u p o n m u l t i p l e i n p u t p a r a m e t e r s . This d a t a is usually c o n v e r t e d to e q u a t i o n f o r m to assist in its t r a n s m i s s i o n a n d use, a n d derived e q u a t i o n coefficient values m a y be a c c u m u l a t e d to f o r m p o p u l a tion m e a n s a n d ranges. T h e process of d e t e r m i n i n g e q u a t i o n coefficient values usually involves s o m e f o r m of least squares analysis. T h e m a j o r difficulty of this p r o c e d u r e is the t r a n s f o r m a t i o n o f variables to form a d e p e n d e n t variable that is some linear c o m b i n a t i o n of i n d e p e n d e n t inputs. T h e p r o b l e m m a y be s o m e t i m e s a v o i d e d b y using p o l y n o m i a l s to describe the data, b u t this a p p r o a c h is limited in the range of types of curves which m a y be linearized a n d m a y also b e theoretically unsatisfactory. Because the p r o b l e m with n o n l i n e a r biological d a t a analysis is so pervasive, m u l t i d i m e n s i o n a l curve fitting p r o g r a m s are m a n y , a n d this p r o g r a m is n o t c l a i m e d to be u n i q u e in this respect. T h e Scientific Article Number A3798 Contribution Number 6775 of the Maryland Agricultural Experiment Station (Department of Agricultural Engineering). 0010-468X/84/$03.00 © 1984 Elsevier Science Publishers B.V.
originality of the p r o g r a m to be d e s c r i b e d herein lies in its ability to facilitate linearizing the relationship between i n p u t s a n d outputs, a n d in the c o m p l e t e l y user-friendly interactive n a t u r e of the transfer of i n f o r m a t i o n b e t w e e n ~ h u m a n and c o m puter. M o s t o p e r a t i o n s can be c o m p l e t e d b y m e a n s o f a single keystroke, a n d c o m m o n errors are d e t e c t e d b y the p r o g r a m , which also suggests solutions. F r o m the beginning, where d a t a is entered, to the end, where linearized e q u a t i o n s have h a d coefficient values e s t i m a t e d b y a least squares p r o c e d u r e , little is required of the user except to m a k e choices, a n d this can greatly reduce the t e d i u m of curve-fitting. Extensive use is m a d e of graphics. The entire p r o g r a m is p r e s e n t e d in color, with n o r m a l text c o l o r e d light green on a b l a c k b a c k g r o u n d , some specialized text c o l o r e d blue on black, a n d error s t a t e m e n t s i n d i c a t e d with red on black. A t a l m o s t a n y p o i n t in the p r o g r a m , the d a t a can be p l o t t e d a n d c o m p a r e d , if the user wishes, with an e q u a t i o n d i r e c t o r y c o n t a i n i n g 14 e q u a t i o n forms p l o t t e d sequentially on the screen. Choices of variable t r a n s f o r m a t i o n s can be m a d e according to the m a t c h between the form o f the d a t a plot a n d form of the e q u a t i o n plot. A f t e r the least squares equa-
260 tion has been determined, transformed predicted values are plotted against transformed observed values and compared to the line of identity (where both are equal). Original observed data can also be compared against untransformed predicted data. This program is written in BASIC for an IBMPC with color monitor and graphics printer, and operating under DOS 1.1 or DOS 2.0. The program requires about 37 K diskette space and 25 K internal memory. It can therefore be run on a 64 K machine.
2. P R O G R A M D E S C R I P T I O N A greatly simplified flowchart for the program is shown in Fig. 1. After being introduced to the program, the user must enter his data, either from the keyboard or from a disk file in the format required by the program. If data is entered from
the keyboard, an estimate of the number of data entries to be made will be required, but additional data can be entered if the original estimate was in error. The text on the screen indicates this. If data is entered from a disk file, and the user has forgotten the file name, the program helps the user to identify the correct file. Once data is entered, it may be checked for completeness and accuracy, modified, or removed. Additional data can be entered. The data can be printed or stored in a disk file. All these manipulations are handled by the program at the direction of the user. Keystroke or file-naming errors are detected by the program and corrected by the user under program control. A central focus in the program is the variable manipulation point. Here, the user is allowed choices of: 1. Variable creation 2. Variable removal 3. Variable transformation 4. Equation solution 5. Data reconsideration 6. Return to start 7. End 8. Equation directory Variable creation allows new variables to be made from combinations of original variables. This choice is useful, for instance, in producing a polynomial data set from an original data set containing one independent variable. Up to 9 additional variables may be created, with choices of: 1. X 2. X N, N an integer 3. X A, A a non-integer
4. X1/X2 5. (X1) (X2) 6. X1 - X2 7. X 1 x2
Fig. 1. Program structure.
where X, X1, and X2 can be specified by the user to correspond to any of the program variables. Created variables are considered to be independent variables. Variable removal may be important if variables are no longer to be considered in the equation. F o r instance, in the example shown later in this paper, the dependent variable could be considered to be ' p ' or ' p / f " . In the latter case, 'plY" would be
261
formed from the two variables ' p ' and '12', then ' p ' would be removed as the dependent variable. Removal of the dependent variable causes the program to consider the last independent variable to be the new dependent variable. Variable transformation is often required to linearize the data. The choices are: 1. X 2. A X 3. X - A
4. 5. 6. 7. 8.
X N, N an integer X A, A a non-integer exp(X) e x p ( - X) log(X)
9. A x
Each of these can be chosen with a single keystroke, and repeated transformation is possible. The program checks for unallowable transformations, such as taking the logarithm of a negative number, and will not perform the operation if an error condition prevails. The user may choose to reconsider the data, and remove offending data sets. Repeated creation, removal, and transformation of data would certainly tend to be confusing if a computer scratch-pad were not provided to keep an updated version of variable forms. Each time one of these operations is completed, an updated variable list is presented, and this list contains all operations that have been performed on the data. For example, in the pressure-volume example to be shown later, the dependent variable is first shown as ' Y ' , then ' I / ( Y ) ' , ' ( 1 / ( Y ) - 1 ) ' , and, finally,' L O G ( O / ( Y ) - 1))'. The equation directory provides a good teaching and demonstration tool as well as a guide to linearization of the data. Thirteen linearizable and one nonlinearizable equation forms are presented sequentially, showing the linear equation form, plots of the unlinearized equations, asymptotes, and intersects. This directory can provide amusement as a cartoon for technical personnel. Part of the equation directory is a plot of the data being manipulated. The original data can be plotted, the data transformed and replotted. In this way, the user can see the effect of the transformations on his data, and react accordingly. When
the data plots linearly, he is ready to solve the equations. Solution of the equations uses a G a u s s - J o r d a n matrix technique based upon the one shown in Dodes [1]. Double precision manipulations are performed in the matrix operations only. The solution appears on the screen as coefficients and associated transformed variables. Also appearing is the coefficient of determination, r 2. Least squares solution can be obtained for a zero intercept, if theory indicates this to be more correct than a non-zero intercept. A plot of observed vs predicted transformed dependent variables can be selected. Since a perfect fit would indicate one to be identical in value to the other, the line of identity also appears on the graph. A graph of original (nontransformed) dependent variable cs nontransformed predicted dependent variable can also be obtained. Any of these graphs can be reproduced on paper by a graphics printer. Results can be printed, if desired, to eliminate the need to manually copy data from the screen. Included in printed results are the equation, correlation coefficient, observed results, predicted resuits, and residuals. Finally, by indicating that the data is to be reconsidered, the user can easily return to the variable manipulation point and try other possible equation forms for his data. If, as is sometimes necessary, the curve-fitting process involves trialand-error, the trials are made tolerably easy. 3. EXAMPLES OF P R O G R A M USE Two examples of program use on biological data will be given. Both involve curve-fitting of respiratory system data. The first uses 39 pressure-volume data points from the lungs plus chest wall taken from Fig. 3 - 6 of Jacquez [2]. A printed plot of the data appears in Fig. 2. The curve has a pronounced ' S ' shape which is not easily recognized as hnearizable. Using the equation directory, it can be seen that the shape of the linearizable curve most similar to the plot of the data has the form: 1/y = a + exp(cx)
(1)
262 NORMALIZED DRTR PLOT
Y
Y
min =
.550622
.~
...... ' X
Y
Fig. 2. Printed plot of lung volume (y)-intrapleural pressure (x) data obtained from Jacquez (1979) [2].
obsel~ved
Fig. 4. Transformed observed vs predicted dependent variable data using the form of eq. (5).
elled according to: Subjecting the dependent variable data to repeated transformation and solving equations, yields (with y = lung v o l u m e as a fraction of vital capacity (V/VC), and x = pressure, p):
(2)
log( - 1 + l / y ) = 1.01 - 0.177x with r 2 = 0.995
A plot of transformed observed vs predicted results is shown in Fig. 3. A good fit is obtained for all but the two end values. The second example uses isovolume inhalation pressure-flow data appearing in Bouhuys and J o n s o n [3]. This data describing inhalation pressure across the respiratory airways can be m o d -
p =
+
(3)
Rv))
where pressure is again symbolized by ' p ' , volume b y ' V ' , flow by '17', and residual volume by ' R V'. Two independent variables are assumed: 17 and ( V - RV). kt, k2, and k 3 a r e u n k n o w n coefficients for which the values are to be estimated. The p r o g r a m can be used to c o m p a r e an equation of the form:
p l Y ' = k I + k2V + k 3 / ( V - R V )
(4)
with one of the form: p = kill"--I- k 2 ~ / 2 "1- k 3 V / / ( V -
RV)
(5)
Both would appear to be equivalent,but there is a Y
g
y
J
f
rain
Y
max
=
=
1.28 3.82
/. .... /. j-
Y o6se~ved
Y min V max
Y
= =
•
$
-4.539956 4.539956 ~ obse~vedl
Fig. 3. Printed plot of transformed observed d e p e n d e n t variable data from Fig. 2 and values predicted from eq. (2).
Fig. 5. T r a n s f o r m e d observed vs predicted d e p e n d e n t variable data using the form of eq. (4).
263 great difference between them. Using 44 data points, the first linearized equation becomes:
p/17=
RV)
(6)
17/(V- RV)
(7)
1.01 + 0.339 17+ 0 . 9 7 8 / ( V -
with r 2 = 0.915 The second is found to be: p = 0.744 17+ 0.426 172 + 0.679
gram is interactive and user-friendly, and reduces m u c h of the tedium of linearization and equation solving. A listing of the p r o g r a m can be obtained by writing to the author at the Agricultural Engineering Department, University of Maryland, College Park, M D 20742. By sending a 5.25 in. diskette, a copy of the p r o g r a m can be obtained which can be loaded directly into your IBM-PC.
REFERENCES with r 2 = 0.999 Plots of observed vs predicted values clearly show that eq. (7) is almost a perfect fit (Fig. 4), whereas eq. (6) predicts values too low for small values of p/17 (Fig. 5). With this program, almost no extra effort was required to test both these possibilities.
4. C O N C L U S I O N S A general p r o g r a m has been developed for multidimensional curve fitting purposes. The pro-
[1] I.A. Dodes, in: Numerical Analysis for Computer Science, pp. 93-105, (Elsevier/North-Holland, Amsterdam, New York NY, 1978). [2] J.A. Jacquez, in: Respiratory Physiology, p. 45(McGraw Hill, New York NY, 1979). [3] A. Bouhuys and B. Jonson, Alveolar pressure, airflow rate, and lung inflation in man, J. Appl. Physiol. 22 (1967) 1086-1100.