Computer Programs in Biomedicine 5 (1975) 39-45 O North-Holland Publishing Company
A COMPUTER ANALYSIS OF ELECTROPHORESIS
AND ULTRACENTRIFUGATION
PATTERNS E. TRANKLE Institut /'fir Theoretische Physik, Freie Universita't Berlin, 1 Berlin 33, Arnimallee 3, Germany A program DIANA is designed to fit the scanning curves of electrophoresis and ultracentrifugation patterns with 10-15 overlapping peaks by a sum of Gaussian-like distribitions. The parameters of the distributions are adjusted by minimizing x, using a problem oriented minimizing procedure. The number of molecular components as well as starting values of the positions and the widths of the peaks must be choosen as input for each type of pattern. Programs are written in FORTRAN IV. Electrophoresis
Ultracentrifugation patterns
Least-square fit
1. Introduction Paper electrophoresis and centrifugation techniques have been widely used for estimating the molecular components of body fluids. Using less adsorptive media, such as agar gel or cellulose acetate, in electrophoresis, and media with a suitable concentration gradient in ultracentrifugation better resolved molecular fractions are obtained. Most o f the peaks o f the scanning curve have a Gaussian-like shape. The remaining ones usually have a very large width and a non Gaussian-like shape which suggests that they may in fact be a superposition of two or three strongly overlapping molecular components. In the evaluation of a pattern Gaussians are usually hand-drawn to the pronounced maxima of the scanning curve, or a superposition of Gaussians is fitted to the curve by means of an analogue computer. Using a digital computer, however, the evaluation of the pattern is more precise, and the parameter adjustment is reproducible. Moreover, one obtains printed informmation about the goodness of the fit: X (the sum of squares of differences) is a measure of the overall accuracy, whereas the local accuracy can be judged from the remaining difference between the observed scanning curve and the adjusted distribution. In fact, a deviation from the Gaussian has been observed which is not small enough to be neglected in a digital computer analysis. We therefore have modified the Gaussian in the following way [ 1 ] :
fk(x) : ' / F k , exp {--~2(1 +C1 tanh ~ + C 2 tanh 2 xk \ ~3 C3] '
(1) where Xk stands for (x - Sk)/B k, F k is the area parameter, S k the position and B k the width of the distribution for the kind k of molecules. The first correction term in eq. (1) described the asymmetry o f the distribution, whereas the second term represents the symmetrical part of the deviation. This function is Gaussian-like if the exponent is negative, which is the case f o r - I C l l + C 2 > - 1. The deviation parameters C1, C 2, C 3 are assumed to be the same for all peaks. Certainly this assumption can only be correct, if a medium specific effect, such as the molecule medium interaction, causes the main part of the deviation and if the deviation due to the heterogeneity of the molecular fractions is small. As usual the overall distribution function, viz. the pattern itself, is taken to be sum of the single Gaussianlike distribution functions (1). The parameters Fk, S k, Bk, Cj are adjusted by minimizing NDAT, KZ
where Xi, Yi are the (NDAT) digitized coordinates of the scanning curve and K Z is the number o f molecular components. At present some agar gel electrophoresis patterns
40
E. Trankle, A computer analysis of electrophoresis
of the cerebrospinal fluid [2] and 60 ultracentrift,gation patterns of polysomes [3] have been analyzed successfully by means of DIANA (digital analysis).
2. Program description For patterns of some 10-15 overlapping peaks the number of parameters (33-48) is large and some of them are strongly correlated. This gives rise to three problems in the x-minimizing procedure: First, the global minimum of × is often only reached if the starting values of the parameters are not too far away from its values for the global minimum. Secondly, if the recursive minimizing procedure does not take care of the parameter correlations, the speed of approaching the minimum is slow. Thirdly, the convergence of the minimization at the global minimum is sufficiently fast only if the parameters are adjusted simultaneously. These problems may be tackled by using a sophisticated general-purpose program, such a MINUIT [4]. In fact, this routine is an integrated system of three FORTRAN-programs. By a Monte Carlo searching routine a point in the parameter space near the global minimum of × is determined. Starting from this point the minimum is approached by a simplex method. To speed up the convergence of the parameter adjustment near the minimum one changes to a routine based on a variable metric method. Such a general-purpose minimizing program certainly is very useful for the analysis of a few patterns in research studies. For the evaluation of large series of patterns for clinical purpose, however, it is not appropriate since it tends to waste computer execution time as well as core memory. In this case all patterns of a certain series are usually obtained by using the same experimental separation techniques so that the same set of starting values can be used. Very likely the change to a rapidly converging procedure at the minimum is not necessary since in these cases we are not interested in an extremely accurate parameter estimation. Moreover, the velocity of approaching the minimum can be speeded up by making use of special properties of the sum of Gaussian-like distributions. This is very well accounted for by DIANA. In DIANA we use a problem oriented minimizing strategy. The parameter space is divided into sub-
spaces in which × is minimized successively. Because of the correlation between parameters which belong to different subspaces the whole cycle is repeated several times. In the first subspace, × is minimized with respect to all F k. Since the sum of distributions depends linearly of the Fk, this can be done by solving a system of algebraic equations. A subroutine LINEQ 1 [4] is applied which is based on the Gaussian elimination method with partial pivoting. By this "exact" method strong correlations between some of the F k are treated in an optimal way. The area parameters Fk are prevented from becoming negative by a "penalty" function. If some F k has become negative in the last cycle a term F 2 is added to × which causes the parameter F k to become positive again in this cycle. Obviously, starting values for the parameters F k are not needed by this nonrecursive method. For each peak k we choose a subspace to minimize × with respect to S k and B k. By definition the position and width of a Gaussian-hke distribution are only slightly correlated. Therefore we can use a subroutine CHIFIT which is based on a recursive stepping procedure along fixed directions in the parameter space. Starting values, step sizes, upper and lower bounds as well as a tolerance parameter for terminating t.he iteration must be specified. The Gaussian-like function (1) decreases rapidly with I~kl and is usually much smaller than the contribution of the next peak for Ixkl > 7. Therefore the adjustment o f S k and B k cannot be improved by fitting the distribution k to digitized points outside of this region. Cutting out these points in the subspace k reduces the job time considerably. In the last subspace, the deviation parameters C1, C 2 and C 3 are adjusted by means of CHIFIT again. Since C1 and C 2 are the strengths of the asymmetrical and symmetrical deviations, respectively, they are only slightly correlated. Again a "penalty" function is used to guarantee the subsidary condition - ] C l l + C 2 > - 1 . Of course, starting values, etc. must be given. Certainly the Gaussian-like function (1) is a special choice for modifying the Gaussian. In DIANA the modification can be changed by means of the switches NPOW and NPRO: The power of the asymmetrical or the symmetrical deviation term can be increased by two (tanh (x/C3)~ tanh 3 (.~/C3)for NPOW = 3 and
E. Tri:hkle, A computer analysis of electrophoresis
READ : SWITCHES AND
41
INITIAL VALUES
B,S,C
< READ AND NORMALIZE
: IDE, SO, SC ,Y0,X,Y
I WRITE : SWITCHES
AND
J SET :INITIAL
VALUES
INITIAL VALUES
i B,S,
< KZ= 1 COMPUTE
B, S
C, KZ = KA
M I C _< MAC < : MATRIX
H
MIC=
> ~ LINEO, 1
MINIMIZE
I I
FOR ALL F
COMPUTE
MINIMIZE FOR B,S,C
<
CHI
WRITE : MIC
C , CHI
MINIMIZE
<
I CH I Ir -FIT --" ) i l
, J
M~,C
FOR B,S
MINIMIZE FOR
K~IKZ
MIC<_~NSCH
ALL C
I
><
Is~:~Z:~E I M,C
*
KSCH
< )<
1PR°'Lii Ii
MINIMIZE FOR ALL F
"
I
IIMAT,N
I
COMPUTE : CHI I
WRITE: MAC,C , CHI
H ~,'LAR ! S,MPS ~ I
WRITE : B,S,C,CH
,
I
I
NORMALIZE :F, B, S I
COMPUTE : FL I
I
WRITE : FL,DF, B , S
I
WRITE : CORREL. MATRIX
t
CURVE b + - 1
I
B,S,C,CHI
] 2-
IBLOC ~ NBLOC )
) <
I COMPUTE :MEAN
<
PLOT : Y~ DY )
I PR~ = 0
)
STATI Jl--<---ICOMPUTE
MEAN FL,B, S
I
i
J
WRITE : MEAN B,S,C,CHI I
] >
KZ=I
I
~
WRITE : M E A N
FL,B,S
I
I
< KZ:~ 1
Fig. 1. F l o w c h a r t o f the p r o g r a m D I A N A .
':
ISTAT =
0
!
~--,
%
o
•
o
.
i
.
,
o
.
o
o
o
.
o...:
~")
.
•
.
.
.
•
•
i
.
o
•
.
o
...~
i
0
i
LLLLZLLLZLL%'~
i
~
,v
.
o
o
.
.
.
.
.
.
,
.
,
.
~,~,
.
-
.
.
.
.
.
0
.
.
.
.
.
.
.
.
,-,
.
.
.
.
.
.
.
,.,,
o
o
.
.
.
o
o
o
, .
o
.
.
•
0
.
.
i
o
.
.
.
.
o
,
.
,
o
.
.
.
.
.
.
i
.
.
.
,
o
.
.
o
,
.
o
.
i
,
.
,
°
o
o
i
°
°
,
*
o
i
o
o
.
,
°
.
°
o
.
I
°
o
.
, .
°
°
i
.
t
o
°
,
°
°
i
°
°
i
o
-¢
o
•
°
•
o
-
•
°
.
•
•
•
Q
~
,
°
°
•
°
.
o
o
Q
°
0
-
°
.
.
-
~
o
•
o
-
.
°
I
S
,
o
o
t
•
o
o
.
~
o
°
1
•
•
°
o
o
Q
°
1
o
-
°
.
1
.
o
Q
, . . ,
,
•
•
o
1
.
.
Q
°
1
o
.
•
-
1
o
o
.
°
Q
.
,,
o
•
o
°
1
o
~
.
.
°
.
o
1
.
.
Q
,
-
.
~
o
,,
-
.
°
•
o
o
°
o
z
.E
z
c
-~
u~
2
3
Z
~
<
u
u
°
c:
~
o
~ ' ~ f un
~ m II
I1
ru
~
o z z ~ u ,u
rn~
u
co :J m
~J* z
m
~ z
u
8
bJ
E. Trankle, A computer analysis of electrophoresis ~
"
S C A N N I ~ S CURVE 3,~3 I .........
I ......... I
I~o~9 I .........
3~og8 I .........
50,8~ I .........
~ 6 , E9 I .........
÷
+ + +
I
+
-Tg,II
+ I I !
+ + +
I
* + +
-7~,$I I I I I 'I 1 I I I
+ + + + * + +
I Z
+ +
I I
+ +
I I I I
+ * + +
I
-gqo-~
+ I I
* +
I r
+
-~,~I
+ T I I I
+ + ÷ +
-3Jo~T
+ I
+
I I I I I T I I
-3~
+ + + + + + +
I I I I
+ + + +
I I I I
+ + +
I I I I
+ + +
I
+
I
+
I I
+ +
I
"
÷
I
* +
~o~I I I I ,81[ I÷ I
43
R E 4 A I t & I N G D I F F E R E N C E BETWEEN F I T
+ +
1 I I I
~3
++++++
*
I ~l
-BJ.
AND
+ * ÷ + +
Fig. 2b. Ouput of a sample run (KZ ~ 1).
ANO ] U P V E =
82,5~ I .........
98,41 I--
44
E. Trhhkle,A computer analysis electrophoresis
tanh 2 (~/C 3) -~ tanh 4 (2/C3) for NPOW = 4). Moreover tanh (~/C3) (NPRO = 3) can be replaced by another function of the same shape, such as 2/lr tan -1 (n/2 x/C3) for NPRO = 1. Let us now have a look at the structure of DIANA as it is shown by the flowchart (fig. 1): In the case of a pattern with a single molecular fraction the analysis proceeds along the left branch (KZ = 1), whereas for more than one molecular components the main branch (KZ > 1) is taken. Several patterns (NBLOC) are analyzed one after another. Then the mean values and the standard deviations of the adjusted parameters can be calculated (subroutine STATI). The routine LINEQ1 and the matrix inversion routine MATIN1 are taken from the CERN library [4], CHIFIT, the integration routine SIMPS and the plotting routine CURVE2 are available at ZEDAT [5 ]. Furthermore CHU, CHE, CHO (calculation of ×), PILLAR (integrand), PROFIL (Gaussian-like function (1)) and STATI (mean and standard deviation) are special subroutines of DIANA.
3. Sample run We shall now describe a sample run of an ultracentrifugation pattern of polysomes (cf. fig. 2). The data of a pattern comprise: a pattern number (IDE) for identification, three position values (SO, SC, YO) for normalizing the pattern, a sequence of digitized coordinates of the scanning curve (X1, Y1, X2, Y2 "-) and a data bloc end label. After reading such a data bloc the pattern is normalized by a linear transformation in x- and y-direction. In the case of the ultracentrifugation pattern SO is the x-position of the maximum of the monosome peak and SC is that of the 5-aggregate polysome peak. The above transformation maps SO onto 0 and SC onto 1-50. These values are chosen as starting values for S 1 and S 5 . Y0 is the zero of the scanning curve in y-direction. After printing the switches and starting values for documentation the first minimizing cycle starts. At the beginning the area parameters F k are determined by solving a system of KZ linear algebraic equations (H being the corresponding coefficient matrix). The cycle number, the actual values of Cj and x(CHI) are printed for documentation. Then X is minimized with respect to the position S k and width Bk for each component k. Finally the
minimization with respect to the deviation parameters C/is performed which, in fact, takes place only for higher cycles (MIC/> NSCH). This whole minimizing cycle is processed several (MAC) times. At last the Fk are again determined. This time we do not use LINEQ 1 but MATIN 1 which, in addition to F k, computes the correlation coefficient matrix. The positions Sk are normalized such that S 1 = 0 and S 5 = 1-50. Then the areas of the distributions k are evaluated by numerical integration which is necessary, since in the case of the Gaussian-like function (1) the parameters Fk are not exactly the areas. The relative areas (in %), their errors, the widths and the normalized positions are printed. The correlation coefficient matrix of the Fk can be printed, too. Finally the scanning curve and the remaining difference (enhanced by a factor 2) between the scanning curve and the adjusted distribution is plotted. If necessary, the number of molecular components can be changed (KA ~ KE) after a few (KSCH) cycles. Let us choose as another sample run the evaluation of three electrophoresis patterns of albumin (cf. fig. 3). In this case (KZ = 1) the analysis proceeds along the left branch of the flowchart. The position, width and the deviation parameters C/are adjusted simultaneously by means of CHIFIT, the parameter F being calculated in CHU at each call of CHU. The plotting of the curves is skipped (IPR = 0). After the evaluation of the three patterns the means and the standard deviations of the adjusted parameters and of X are calculated and printed (ISTAT :/: 0). For the users convenience and to facilitate modification of DIANA we have used the same symbols for the variables in this description and in the FORTRAN code itself. Moreover, the headlines in the flowchart boxes are repeated as COMMENT in the FORTRANprogram.
4. Hardware and software specifications The program is coded in FRANTRAN IV and is run on a CDC CYBER 7200 computer at the Computer Center (ZEDAT) of the Free University of Berlin, The CYBER is a 60-bit per word computer with 64 k words of core memory locations. The maximum size required for the program is 14 k words. The computer job time required for the first sample run (KZ =/:1)is 190 sec, and 35 sec for the second sample run (KZ = 1).
E. Trahkle, A computer analysis of electrophoresis SWITCHES|
NPRO=3
45
NCOEF=3 NPOW=4 EDSI= ,1~
STARTING VALUES: PARAMETER
VALUE
STEP
30UNr)
~OUNO
POSITIO'I WIDTH Ct C? C~
0,000 3,000 0,000 0,000 1.000
,130 ,IDO .05n ,050 .050
- 5 . OOq 1,~00 - t . OOO -I,000 .ZOO
5.000 10.0q0 1.000 l,O00 5,000
ADJUSTED VALUESI aATTE~;
NO 17 18 t9
POSITION - , 136 .t33 ,018
WIDTH
C1
C?
C3
cHI
~.9 8~ 2,939 x,359
.~39 .275 ,147
- , ?70
1.778 1,493 1,716
.Ia57 ,IXOI ,0701
C3
CH[
MEAN AND STANDARD DEVIATION FOR
MEAN DEVIATION
-.405 - , 520
3 PATTEP~S I
POSITION
WIDTH
C1
C2
,005 ,1~5
3,1 94 .224
,787
-, 3q9 .125
.146
1,663 .150
,I~20 ,0628
Fig. 3. Output of a sample run (KZ = 1).
5. Mode of program availibility The F O R T R A N - c o d e o f D I A N A , including all subroutines, is available o n tape (7 track, 556 or 800 bpi) or as card deck f r o m the author at the Institut ffir Theoretische Physik, Freie Universit~t Berlin, 1 Berlin 33, Arnimallee 3.
References [1] E. Tr~'nkle, Digital computer analysis of electrophoresis and ultracentrifugation patterns by a sum of Gaussianlike distributions, to be published in Z. Naturforsch. 30c (1975).
[2] M. Siegert and H. Siemes, Methodische Untersuchungen zur Agarosegel ~ Mikroelektrophorese des Liquor Cerebrospinalis yon Kindern unter Anwendung eines Analogtechners zur Auswertung der Ergebnisse, F.U. Berlin preprint (1974). [3] I. Dornacher, Untersuchungen zur Auftrennung und Charakterisierung yon Polysomen aus Warmblfiter-Embryonen under fiber den Einfluss einiger embryotoxischer Pharmaka auf due Polysomenprofile, Inaugural-Dissertation des Fachbbreichs Chemie der Freien Universit~t Berlin (1974). [4] CERN, computer program library; [5] ZEDAT, Frei Universit//t Berlin, computer program library.