Pattern Recognition Vol. 15, No. 3, pp. 253-261, 1982. Printed in Great Britain.
0031-3203/82/030253-09 $03.00/0 Pergamon Press Ltd. © 1982 Pattern Recognition Society
CONTRIVEDNESS: THE BOUNDARY BETWEEN PATTERN RECOGNITION AND NUMEROLOGY

R. F. EILBERT and R. A. CHRISTENSEN

Entropy Limited, South Great Road, Lincoln, MA 01773, U.S.A.
(Received 10 April 1980; received for publication 26 February 1981)

Abstract--The irreducible informational loss expended in a pattern search procedure is quantified using the concept of contrived entropy. In multivariate analysis this quantity is of value in distinguishing true patterns from statistical noise and in deciding to what depth a search procedure should be conducted. For a specific partition, the contrived entropy is defined as the partition entropy averaged over all possible permutations of event outcomes. The contrived entropy associated with a search procedure or set of attempted partitions is taken to be the expectation value of the minimized partition entropy for each permutation. The behavior of the contrived entropy is illustrated for a simple univariate case.

Contrivedness   Contrived entropy   Entropy   Information theory   Pattern recognition   Multivariate statistics   Chance correlations

INTRODUCTION

Contrivedness is defined as the tendency of a search procedure to uncover apparent patterns where none exist. It can be quantified using concepts from information theory. A search procedure is regarded as a set of attempted partitions of events in feature space. The entropy content of a partition is computable from the conditional probabilities of the cell occupancy numbers. For a specific partition the contrived entropy is defined as the partition entropy averaged over the ensemble of all outcome scramblings, that is, all permutations of event outcomes. Such scramblings retain all correlations between features, while the correlational structure between features and outcomes is lost. A more relevant quantity is the contrived entropy associated with a search procedure, which is taken as the expectation value of the minimized partition entropy for each permutation. The behavior of the contrived entropy is illustrated for a simple univariate case with 40 binary events. SWAPDP* search procedures, which employ bit extraction (successive bisectioning of intervals), and SWAPDP probability assignments are utilized. Monte Carlo computation of the contrived entropy is performed under various conditions of a priori weighting. Values of contrived entropy are explicitly computed for the cases of the null partition and the complete partition. Contrived entropy results are compared with pattern entropy values from a relatively organized data set. The renormalized entropy, or pattern entropy with contrivedness removed, is examined. The implication for the exemplary case is that one to two fewer bits should be extracted than originally supposed. Future prospects are considered. A first step toward reducing the complexity of the N! term sum is made through the use of hypergeometric weightings. The importance of the distribution law for the contrived entropy is stressed, since knowledge of it would establish confidence limits against the null hypothesis that a particular pattern is merely statistical noise. Under a general class of probability estimators the distinction between multivariate and univariate cases is bridged, so that solution of the contrivedness problem under fairly general circumstances is possible.

MOTIVATIONAL BACKGROUND FOR CONTRIVEDNESS ANALYSIS

Researchers in the fields of pattern recognition, taxonomy, prediction/decision theory and statistical inference are generally aware that particular findings may in reality be attributable to chance distributions. The problem is especially acute for those working with small sample populations. Univariate statistics allows a number of powerful tests to discriminate a real effect from a chance effect. Researchers using multivariate data may be confronted with many thousands of features and often have great difficulty in verifying their findings. In complex systems, such as meteorological or economic models, the number of variates is extremely large. Predictors that perform adequately on one data base often fail when applied to new data. The tendency of a search procedure to uncover apparent patterns where none exist is called "contrivedness". Contrivedness may be quantified using concepts from information theory. An entropy of contrivedness can be specifically associated with a search procedure. Authors in various fields have recognized contrivedness in different guises. As early as 1926 Yule realized
the undermining role played by chance correlations.(1) Classifiers designed by maximizing performance on the complete set of samples are known to be overly optimistic.(2,3) The incidence of spurious patterns in multivariate analysis with large numbers of features is sometimes referred to as the dimensionality problem. Dimensionality considerations have given rise to a paradox: inclusion of additional features beyond a certain number can actually reduce one's ability to make correct classifications on independent test samples.(4,5,6) The capacity of hyperplanes to classify outcomes has been analyzed by Cover.(7) Complete separability between two outcome classes is ensured when the dimensionality exceeds the sample size. For high dimensional spaces, nearly complete separability occurs at just under half the sample size. Emphasis has been placed on the need for "parsimony" in feature selection.(8) Inclusion of "nuisance" parameters, that is, features of low informational relevance, has been shown to degrade a pattern's significance level under statistical tests.(9) Lack of reproducibility in multivariate analysis has been a persistent difficulty.
* StepWise Approximation Pattern Discovery Program, see references 19, 20, and 23.
Fig. 1. Hypothetical bivariate data set showing complete separability using a quadratic discriminant (dashed line). An optimal linear discriminant (solid line) misclassifies two events. Because of the small sample size, it is not clear that the quadratic discriminant is preferable.
among outcomes that are random. This ability to organize random data is an expression of contrivedness. The contrivedness defines the base level of organization for a pattern recognition system. In general, the degree of contrivedness will depend on the system, the pattern search scope, and the data.

QUANTIFYING CONTRIVEDNESS
Information theoretic concepts provide a natural way of quantifying contrivedness. Suppose there are N data events or observations consisting of feature vectors (x_j, j = 1, N) and outcomes (d_j, j = 1, N). Here d_j is assumed quantized into one of K possible outcome classes, (k = 1, K). The components of x_j may be qualitative or quantitative in nature, as may the outcome variable d_j. A pattern is a partition π of feature space dividing it into a number v_π of cells. The cells are indexed by i, (i = 1, v_π) and may represent either continuous or disjoint regions in feature space. Let the probability for a future event having outcome class k, given that it falls in the ith cell of partition π, be p(k|i, π, I0). The estimation of this probability may be quite complex. In nonparametric statistics it is often expressed in terms of the occupation numbers n_ik(π) for all the cells plus any a priori knowledge available, which we denote I0. As a working hypothesis, the probability may be equated to its estimator (though technically it should be equated with the estimator's expectation value)
p(k|i, π, I0) = p(k|i, [n], I0),          (1)

where [n] indicates the ordered set or matrix of occupation numbers, n_ik. The SWAPDP code, for example, uses

p(k|i, [n], I0) = (n_ik + w_k) / (N_i + W),          (2)

where

N_i = Σ_{k=1..K} n_ik = the population of the ith cell,

w_k = an a priori weighting factor, which is proportional to M_k, where

M_k = Σ_{i=1..v_π} n_ik = the total number of outcomes in class k (the proportionality constant is determined a priori by cross-validation of the data), and

W = Σ_{k=1..K} w_k = the total a priori weight.

The partition entropy is defined (in units of ln 2) by

S(π) = - Σ_{i=1..v_π} p_c(i|π, I0) Σ_{k=1..K} p(k|i, π, I0) ln p(k|i, π, I0).          (3)

Here p_c(i|π, I0) is the probability for an event to inhabit the ith cell. In general, its dependence may be complex. We shall assume it depends on the matrix of occupation numbers. Equating the probability with its estimator, we set

p_c(i|π, I0) = p_c(i|[n], I0).          (4)

The SWAPDP code in this instance simply uses the cell occupation frequency

p_c(i|[n], I0) = N_i / N.          (5)

The estimate for the partition entropy thus takes the form

S(π) = - Σ_{i=1..v_π} p_c(i|[n], I0) Σ_{k=1..K} p(k|i, [n], I0) ln p(k|i, [n], I0).          (6)

The contrived entropy of a partition will take a form similar to equation (6) except that the occupation numbers are based on random data. A very simple way of randomizing data without imposing any new structure in feature space is achieved by "scrambling" the outcomes. This requires randomly permuting the event index j for the outcomes d_j while leaving the features x_j unaltered. In this way the correlations between features are retained, while all correlational structure between features and outcomes is lost (assuming that observations are independent). Use of event permutation in this respect is original, although statisticians have commonly used permutation tests to assess significance.(25-28) The permutation may be defined as a linear transformation on outcomes

d'_j = Σ_{m=1..N} P_jm d_m,          (7)

where P is a permutation matrix. Letting the permutations P_α be indexed by α (α = 1, N!), the cell occupation numbers after applying P_α are denoted n^α_ik(π). The contrived entropy S_con can then be defined by the expectation value

S_con(π) = (1/N!) Σ_{α=1..N!} S(P_α, π),          (8)

where

S(P_α, π) = - Σ_{i=1..v_π} p_c(i|[n^α], I0) Σ_{k=1..K} p(k|i, [n^α], I0) ln p(k|i, [n^α], I0).          (9)

For many estimators, as in equation (5), the cell occupation probability p_c is independent of outcome class labels and hence independent of P_α. In practice S_con(π) can be evaluated by Monte Carlo techniques requiring considerably fewer than the full number N! of permutations. One apparent objection to equation (8) is that it includes permutations that leave some event outcomes unaltered (in particular, it includes the identity permutation) and that only permutations that alter all events should be employed. However, use of this restricted set of permutations would create a bias, namely that each data point would have a slightly more than chance probability for its class identity to be altered. Thus it is best to average over the full set of permutations.
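The scrambling average of equations (2), (5), (6), (8) and (9) is straightforward to realize numerically. The following sketch (in Python; the helper names `partition_entropy` and `contrived_entropy` are ours, since the paper's SWAPDP code is not reproduced here) estimates S_con for a fixed partition by Monte Carlo over outcome permutations, using the weighting w_k = M_k W/N employed in the paper's examples:

```python
import math
import random
from collections import Counter

def partition_entropy(cells, outcomes, K, W):
    """Estimated partition entropy, equation (6), in natural-log units.

    cells[j] is the cell index of event j; outcomes[j] is its class (0..K-1).
    Uses the SWAPDP-style estimators (2) and (5) with a priori weights
    w_k = M_k * W / N, the weighting used in the paper's examples.
    """
    N = len(outcomes)
    M = Counter(outcomes)                       # class totals M_k
    w = {k: M[k] * W / N for k in range(K)}     # a priori weights w_k
    n = Counter(zip(cells, outcomes))           # occupation numbers n_ik
    cell_pop = Counter(cells)                   # cell populations N_i
    S = 0.0
    for i, Ni in cell_pop.items():
        pc = Ni / N                             # equation (5)
        for k in range(K):
            p = (n[(i, k)] + w[k]) / (Ni + W)   # equation (2)
            if p > 0.0:
                S -= pc * p * math.log(p)
    return S

def contrived_entropy(cells, outcomes, K, W, n_scramblings=300, rng=None):
    """Monte Carlo estimate of S_con(pi), equation (8): the partition
    entropy averaged over random permutations of the outcomes, with the
    features (and hence the cell assignments) left unaltered."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n_scramblings):
        scrambled = outcomes[:]                 # permute outcomes only
        rng.shuffle(scrambled)
        total += partition_entropy(cells, scrambled, K, W)
    return total / n_scramblings
```

For the complete partition of 40 events into singly occupied cells with two equal classes and W = 1, every scrambling yields the same value, and the routine reproduces the exact result 0.56233 computed later in the paper.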
CONTRIVEDNESS AND SEARCH PROCEDURES
The contrived entropy of a specific partition is not of real importance except as a building block. Contrivedness is more relevantly associated with a search procedure Π, which may be placed in correspondence with some set of attempted partitions. Thus if the search procedure examines L distinct partitions we define Π = {π_l, l = 1, L}. The contrived entropy for the search procedure is then

S_con(Π) = (1/N!) Σ_{α=1..N!} S_min(P_α, Π),          (10)

where

S_min(P_α, Π) = min_l S(P_α, π_l).

S_min(P_α, Π) corresponds to the entropy value that a search procedure on scrambled data would find for its "best" sorting partition. "Best" here is in the sense of maximal extraction of information toward classification. The entropy minimizing partition will in general depend on the particular scrambling. S_con(Π) is the expected value of S_min(P_α, Π) averaged over all possible permutations. Letting S(Π) be the minimum entropy found on the original data set, a renormalized entropy S*(Π) may be defined by subtracting off the contrived component

S*(Π) = S(Π) - S_con(Π).          (11)
This provides a means of solving such questions as whether the linear or quadratic discriminant partition
of Fig. 1 is preferable. The one with smaller S* value has more predictive validity.
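A minimal numerical sketch of this machinery follows (Python; the function names and the greedy, stagewise form of the bisection search are our assumptions, since the SWAPDP implementation is not reproduced in the paper). It searches for minimum-entropy bisections of intervals using equations (2), (5) and (6), estimates S_con(Π) of equation (10) by scrambling outcomes, and returns the renormalized entropy S*(Π) of equation (11):

```python
import math
import random
from collections import Counter

def _cell_term(counts, N, K, w, W):
    """One cell's contribution to the partition entropy, equations (2), (5), (6)."""
    Ni = sum(counts)
    if Ni == 0:
        return 0.0
    s = 0.0
    for k in range(K):
        p = (counts[k] + w[k]) / (Ni + W)
        if p > 0.0:
            s -= p * math.log(p)
    return (Ni / N) * s

def min_search_entropy(classes, K, W, max_stages, force=False):
    """Greedy stagewise bisection of intervals: at each stage every interval
    is cut at its entropy-minimizing point (or left whole when cuts are not
    forced).  Returns the minimum entropy met over all stages, i.e. the
    search procedure's S(Pi).  `classes` is the outcome sequence in feature
    order, coded 0..K-1."""
    N = len(classes)
    M = Counter(classes)
    w = [M[k] * W / N for k in range(K)]

    def total(ivs):
        s = 0.0
        for lo, hi in ivs:
            c = [0] * K
            for j in range(lo, hi):
                c[classes[j]] += 1
            s += _cell_term(c, N, K, w, W)
        return s

    intervals = [(0, N)]
    best_S = total(intervals)
    for _ in range(max_stages):
        new = []
        for lo, hi in intervals:
            counts = [0] * K
            for j in range(lo, hi):
                counts[classes[j]] += 1
            whole = _cell_term(counts, N, K, w, W)
            best = (math.inf, [(lo, hi)]) if force else (whole, [(lo, hi)])
            left = [0] * K
            for cut in range(lo + 1, hi):
                left[classes[cut - 1]] += 1
                right = [a - b for a, b in zip(counts, left)]
                s = (_cell_term(left, N, K, w, W)
                     + _cell_term(right, N, K, w, W))
                if s < best[0]:
                    best = (s, [(lo, cut), (cut, hi)])
            new.extend(best[1])
        intervals = new
        best_S = min(best_S, total(intervals))
    return best_S

def renormalized_entropy(classes, K, W, max_stages, n_scramblings=50, seed=0):
    """S*(Pi) = S(Pi) - S_con(Pi), equations (10) and (11), with S_con
    estimated by Monte Carlo over outcome scramblings."""
    rng = random.Random(seed)
    S = min_search_entropy(classes, K, W, max_stages)
    total = 0.0
    for _ in range(n_scramblings):
        scrambled = list(classes)
        rng.shuffle(scrambled)
        total += min_search_entropy(scrambled, K, W, max_stages)
    return S - total / n_scramblings
```

On a perfectly ordered binary sequence the search finds a zero-entropy cut at W = 0, and S* comes out negative, as the paper expects for orderly data.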
s ' ° " ( I - I N ) = - h=l ~ --M~II+M~W/Nlnl+MkW/NI+w l+W
+ (K-l).
Mk W/N In Mk W / N ] 1+--~ - ~ J '
SOME P R E L I M I N A R Y RESULTS
The behavior of the contrived entropy was investigated for a simple univariate case. SWAPDP equations (2) and (5) were used for probability estimates. The allowable space of partitions was that formed by successive stages of minimum entropy "bisection" of existing intervals. Bisection here is used in the sense of dividing the interval into two parts not necessarily of equal length or population. One program option gives the freedom not to bisect an interval if that whole interval has lower entropy than any of its bisected pairs. Otherwise a cut is forced and partitioning into singly occupied cells is ensured after the order of log 2 N stages. One series of runs was performed with N = 40 events. Two outcome classes of equal population were assumed. A priori weigh ts W were chosen at 0.0, 0.1, 1.0 and 10.0. Statistical uncertainties for the evaluation of Sc,, were reduced by employing 300 Monte Carlo outcome scramblings. Entropy values were averaged over partitions with equal numbers of cells. In this univariate case, the cells correspond to continuous regions in feature space. A search procedure may be regarded as those partitions generating some fixed number of regions. Figure 2a shows the entropy of contrivedness for the case where bisections are forced. The jaggedness in the curves is not caused by statistical uncertainty (which is typically 1~ and rarely exceed 3~) but rather reflects some unusual partitioning sequences. For instance, three regions can only be achieved if the first bisection divides the events into groups of 1 and 39 members. The general trend of the curves is toward increased entropy with increased a priori weight. At low a priori weight the entropy decreases almost monotonically. At an a priori weight of 1.0 or 10.0 a point of diminishing returns is reached at approximately 18 and 7 regions respectively. 
Such behavior is expected since a priori weighting drives lowly populated regions toward the expected outcome probabilities of (1/2,1/2) and therefore maximal entropy. In two cases the contrived entropy is easily calculated explicitly. Since % = M, W/N, for the null partition J~ we have
Sco.(J~) = -
~ Mk"l-Wk, Mk-~Wk k=l
~--~
m N+ W
Mk In Mk h=t
N
N
With two equally populated outcome classes Sco,(~) = - I n 2 = 0.6931472. Likewise, the complete partition I-IN into N singly occupied cells permits direct computation.
Note that this result depends only on the number of outcome classes, their relative frequencies (MdN), and the total a priori weight. For the four cases shown in Fig. 2a the contrived entropy is given below S~o.(l-I,,o; W = O) = - [1 in l+OlnO] = 0.0 S¢o.(I-I,,o ; W = 0.1) = - [2~ In ~21 + ~-~ l n l ~2] = 0.18491 Sco.(1-l,,o;W= 1)
= -
In
+~ln = 0.56233
Sc,,(I-I4o;W=I0, =-[61n6+51n5
1 = 0.68901.
These are in exact agreement with the Monte Carlo results since I-I4o is uniquely determined. Figure 2b illustrates the entropy of contrivedness in the case where bisections are not forced. In general, the entropies are less than those in Figure 2a due to the additional freedom to leave intervals unbisected. When non-vanishing a priori weighting is used, more than 31 regions never resulted. There is no fundamental reason for this limit, but clearly event orderings that lead to higher numbers of regions are rare. For a priori weights of 0.1, 1.0 and 10.0 a minimum entropy of contrivedness occurs when the number of regions is about 24, 19 and 6 respectively. It is interesting to compare the contrived entropy with the entropy that occurs with a reasonably strong pattern in the data. Synthetic data is generated by imposing gaussian noise on a linear ramp with 40 points and selecting outcome classes based on whether the resultant signal exceeds the median. At one choice of noise level the data in Table 1 was produced. Figure 3(a and b) plot the entropy as a function of the number of partition regions where bisections are forced and not forced respectively. With bisections forced, the pattern entropy at first decreases but eventually increases as the number of regions increases. This rule is excepted with zero a priori weight for which zero entropy is maintained at or beyond 33 region. In fact, ten properly chosen regions suffice to reduce the entropy to zero, but the rule of forced bisectioning is not economical in this regard. With bisections not forced, the entropy must decrease monotonically with the number of regions. Maximal information here is extracted with 33, I0, 10 and 4 regions for the a priori weights 0.0, 0.1, 1.0 and 10.0 respectively. In the absence of a correction for
Fig. 2. The contrived entropy for a search procedure classifying 40 binary events vs the number of partition regions using a priori weights of 0.0, 0.1, 1.0 and 10.0: (a) with forced bisections; (b) with bisections not forced.
contrivedness, these numbers of regions would be considered optimal for classifying the synthetic data outcomes. The situation is altered after subtraction of the contrived entropy. Figures 4(a) and (b) show the renormalized entropy plotted against the number of partition regions with bisections forced and not forced, respectively. Note in Fig. 4b that the renormalized entropy is minimized with fewer regions than the original entropy. Specifically, 7, 6, 4 and 2 regions are optimal for the a priori weights 0.0, 0.1, 1.0 and 10.0. Predictive ability is maximized by extracting one to two bits less than the level which minimizes the original entropy. Figure 4a points up a similar result when bisectioning is forced. For both the null partition and the complete partition, S* = 0 is constrained. Orderly data will usually exhibit S* < 0 for all levels of bit extraction. However, S* may be positive, and for random data S* has expectation value 0 with nearly equal probability of attaining positive or negative values.

COMPUTATIONAL CONSIDERATIONS
The amount of computation implied in averaging over N! permutations in equation (8) is excessive. Even the Monte Carlo approach, as taken in the illustrative example, requires many times the computing time of the original search procedure. Closed form solutions or approximations for the contrived entropy would be helpful in speeding its computation under reasonably general conditions. As a first step, the sum with N! terms in equation (8) can be reduced by summing over permutation subsets with fixed occupation numbers,
Fig. 3. The partition entropy for a synthesized pattern in 40 binary events vs the number of partition regions using a priori weights of 0.0, 0.1, 1.0 and 10.0: (a) with forced bisections; (b) with bisections not forced.
Table 1. Synthesized data for 40 binary events possessing a strong trend. An outcome, dj, is by convention assigned a value of either 1 or 2.

 j   dj      j   dj      j   dj
 1   1      14   2      27   1
 2   1      15   1      28   2
 3   1      16   1      29   2
 4   1      17   1      30   2
 5   1      18   1      31   2
 6   1      19   2      32   2
 7   1      20   2      33   2
 8   1      21   1      34   2
 9   1      22   1      35   2
10   1      23   2      36   2
11   1      24   1      37   2
12   2      25   2      38   2
13   2      26   1      39   2
                        40   2
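The construction described for Table 1 can be sketched as follows (Python; the noise level, random seed and exact ramp scaling are not stated in the paper, so the values here are illustrative assumptions):

```python
import random

def make_trend_data(n=40, noise=0.3, seed=1):
    """Generate a synthetic binary-outcome sequence in the manner described
    for Table 1: Gaussian noise imposed on a linear ramp, with the outcome
    class assigned by whether the noisy signal exceeds the median.
    The paper's exact noise level and seed are unknown; these are guesses."""
    rng = random.Random(seed)
    ramp = [j / (n - 1) for j in range(n)]              # linear ramp on [0, 1]
    signal = [x + rng.gauss(0.0, noise) for x in ramp]  # add Gaussian noise
    mid = sorted(signal)[n // 2 - 1 : n // 2 + 1]
    median = sum(mid) / 2.0
    return [2 if s > median else 1 for s in signal]     # outcomes coded 1 or 2
```

By construction the median split yields 20 events of each class, matching the balanced outcome counts of Table 1.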
[n^α]. The weighting is then dictated by the generalized hypergeometric distribution, since the occupation numbers, treated successively, can be regarded as a sampling from the original data without replacement. Equation (8) takes the form

S_con(π) = - Σ_{[n]} [ (Π_{i=1..v_π} N_i!)(Π_{k=1..K} M_k!) / (N! Π_{i=1..v_π} Π_{k=1..K} n_ik!) ]
           × Σ_{i'=1..v_π} Σ_{k'=1..K} p_c(i'|[n], I0) p(k'|i', [n], I0) ln p(k'|i', [n], I0).          (12)

The summation is performed over distinct matrices of occupation numbers n_ik subject to the following K + v_π + 1 constraints:
Fig. 4. The renormalized entropy for a synthesized pattern in 40 binary events vs the number of partition regions using a priori weights of 0.0, 0.1, 1.0 and 10.0: (a) with forced bisections (b) with bisections not forced.
Σ_{i=1..v_π} n_ik = M_k,   (k = 1, K)          (13)

Σ_{k=1..K} n_ik = N_i,   (i = 1, v_π)          (14)

Σ_{i=1..v_π} Σ_{k=1..K} n_ik = N.          (15)

Equation (12) is itself computationally complex, and further reduction is non-trivial. One special case deserving mention is that of binary classes and bisectional partitioning. In this instance only one occupation number needs specification, because constraints (13)-(15) determine the other three. Equation (12) reduces to a form with the customary hypergeometric distribution

S_con(π) = - Σ_n [ C(M2, n) C(N - M2, N1 - n) / C(N, N1) ]
           × Σ_{i'=1..2} Σ_{k'=1..2} p_c(i'|[n], I0) p(k'|i', [n], I0) ln p(k'|i', [n], I0),          (16)

where the occupation matrix is

[n] = | N1 - n              n      |
      | N - N1 - M2 + n     M2 - n |.          (17)
The relatively simple form of equation (16) gives some confidence that S_con(π) may be efficiently computed, at least for partitions obtained by sequential bit extraction.
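For binary classes and a single bisection, equations (16) and (17) can be evaluated directly. The sketch below (Python; the function name is ours) sums the partition entropy of equations (2) and (5) over the hypergeometric distribution of the one free occupation number, replacing the N!-term average of equation (8):

```python
import math
from math import comb

def exact_contrived_entropy_bisection(N, N1, M2, W):
    """Exact S_con for one bisection with binary classes, equation (16):
    average the partition entropy over the hypergeometric distribution of
    n, the number of class-2 events landing in the first cell (of size N1),
    instead of over all N! outcome permutations."""
    M1 = N - M2
    w = [M1 * W / N, M2 * W / N]           # a priori weights w_k = M_k W / N
    S = 0.0
    for n_ in range(max(0, N1 - M1), min(M2, N1) + 1):
        prob = comb(M2, n_) * comb(M1, N1 - n_) / comb(N, N1)
        # occupation matrix, equation (17): rows are cells, columns classes
        cells = [(N1 - n_, n_), (M1 - N1 + n_, M2 - n_)]
        for counts in cells:
            Ni = sum(counts)
            if Ni == 0:
                continue
            pc = Ni / N                              # equation (5)
            for k in (0, 1):
                p = (counts[k] + w[k]) / (Ni + W)    # equation (2)
                if p > 0.0:
                    S -= prob * pc * p * math.log(p)
    return S
```

As a check, the degenerate cut N1 = N leaves one cell holding everything, which reduces to the null partition and gives ln 2 for two equal classes.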
CONCLUSION
The contrived entropy is a useful concept by which an informational loss can be associated with a search procedure. In particular, the entropy of contrivedness bears directly on the question: to what depth should a search procedure be carried? Its effects are in evidence for the illustrative univariate case, for which search optimization occurs with fewer regions than was originally supposed. More generally, a hierarchy of search procedures,
Π1 ⊂ Π2 ⊂ Π3 ⊂ ... ⊂ Πm, which might correspond to power function discriminants or bit extraction levels (bisection, quadrasection, octasection, etc.), can be organized for a specific problem. Instead of using rules of thumb to determine the level of search conducted, an optimal search level may be calculated unambiguously with the concept of contrivedness. The distribution law for the term S_min(P_α, Π) is as important as its expectation value S_con(Π). Knowledge of this distribution would establish confidence limits against the null hypothesis that a particular pattern is merely statistical noise. This question plagues many researchers, and a great deal of multivariate analysis has been unreproducible because the results are not inconsistent with this null hypothesis. A closed form solution for the distribution law seems unlikely. However, for particular classes of partitions, Monte Carlo computation of the distribution is possible. Such results, which involve considerable computing time, could be made available to researchers in tabular form. A certain class of outcome probability estimators, of which equation (2) is an example, has the property that the estimators are independent of the geometrical configuration of the cells in feature space. The estimated value for a specific cell depends only on that cell's occupation numbers and quantities, such as a priori constants, the M_k, K and N, that are independent of external partitioning structure. The probability may then be expressed

p(k|i, π, I0) = p(k; n_i1, ..., n_iK, K, M_1, ..., M_K, C),
where C represents any a priori constants. A similar constraint may also be demanded of the cell occupation probability

p_c(i|π, I0) = p_c(N_i, N, K, M_1, ..., M_K, C).
Under these conditions the distinction between the multivariate and univariate case is bridged, because the topology of feature space is irrelevant. Contrived entropy calculation in the multivariate case is then more complicated only in the operational sense that multivariate search procedures will ordinarily be richer in the variety of partitions examined than univariate ones. All information about a partition relevant for computing S_con(π) is contained in the total cell occupation numbers {N_i, i = 1, v_π}.
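This closing observation can be made concrete: given only {N_i} and {M_k}, occupation matrices can be drawn with the hypergeometric weighting of equation (12) by dealing the pooled outcome labels into the cells without replacement, with no reference to feature-space geometry. A sketch (Python; the function name is ours):

```python
import math
import random
from collections import Counter

def contrived_entropy_from_margins(cell_sizes, class_totals, W,
                                   n_samples=300, seed=0):
    """Monte Carlo S_con using only the cell populations {N_i} and class
    totals {M_k}: deal the pooled outcome labels into the cells at random
    (sampling without replacement, i.e. the multivariate hypergeometric
    weighting of equation (12)) and average the partition entropy
    estimate of equation (6)."""
    rng = random.Random(seed)
    N = sum(cell_sizes)
    K = len(class_totals)
    w = [Mk * W / N for Mk in class_totals]    # a priori weights w_k
    pool = [k for k, Mk in enumerate(class_totals) for _ in range(Mk)]
    total = 0.0
    for _ in range(n_samples):
        rng.shuffle(pool)                      # one random deal of outcomes
        S, start = 0.0, 0
        for Ni in cell_sizes:
            counts = Counter(pool[start:start + Ni])
            start += Ni
            pc = Ni / N                        # equation (5)
            for k in range(K):
                p = (counts[k] + w[k]) / (Ni + W)   # equation (2)
                if p > 0.0:
                    S -= pc * p * math.log(p)
        total += S
    return total / n_samples
```

For the null and complete partitions of the paper's 40-event example this reproduces the exact values ln 2 and 0.56233 at W = 1, since every deal then yields the same occupation matrix.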
REFERENCES

1. G. U. Yule, Why do we sometimes get nonsense correlations between time series? A study in sampling and the nature of time series, J. R. Stat. Soc. 89, 1 (1926).
2. M. Hills, Allocation rules and their error rates, J. R. Stat. Soc. 28, 1-31 (1968).
3. P. A. Lachenbruch and R. M. Mickey, Estimation of error rates in discriminant analysis, Technometrics 10, 1-11 (1968).
4. G. F. Hughes, On the mean accuracy of statistical pattern recognizers, IEEE Trans. Info. Theory IT-14, 55-63 (1968).
5. L. N. Kanal and B. Chandrasekaran, On dimensionality and sample size in statistical pattern classification, Proc. Natn. Electronics Conf. 24, 2-7 (1968); also Pattern Recognition 3, 225-234 (1971).
6. D. H. Foley, Considerations of sample and feature size, IEEE Trans. Info. Theory IT-18, 618-626 (1972).
7. T. M. Cover, Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition, IEEE Trans. Electron. Comput. EC-14, 326-334 (1965).
8. G. E. P. Box and G. M. Jenkins, Time Series Analysis: Forecasting and Control, p. 17. Holden-Day, San Francisco (1976).
9. P. R. Krishnaiah, Simultaneous test procedures under general MANOVA models, in Multivariate Analysis, Vol. II, P. R. Krishnaiah, ed., pp. 121-143. Academic Press, New York (1964).
10. A. P. Dempster, An overview of multivariate data analysis, J. Multivariate Anal. 1, 316-346 (1971).
11. R. E. Davis, Predictability of sea surface temperature and sea level pressure anomalies over the North Pacific Ocean, J. Phys. Oceanogr. 6, 249-266 (1976).
12. S. Kullback, Information Theory and Statistics. Wiley, New York (1959).
13. A. Rescigno and G. A. Maccacaro, The information content of biological classifications, in Symposium on Information Theory, Royal Institution, London, 29 August-2 September 1960, C. Cherry, ed., pp. 437-446. Butterworths, London (1961).
14. P. M. Lewis, The characteristic selection problem in recognition systems, IRE Trans. Info. Theory IT-8, 171-179, February (1962).
15. R. A. Christensen, Induction and the Evolution of Language. Dept. of Physics, University of California, Berkeley, 19 July (1963).
16. R. A. Christensen, Inductive Reasoning and the Evolution of Languages. Dept. of Physics, University of California, Berkeley, December (1964).
17. W. W. Bledsoe, Some results on multicategory pattern recognition, J. Assoc. Comput. Mach. 13, 304-316 (1966).
18. S. Watanabe, P. F. Lambert, C. A. Kulikowski, J. L. Buxton and R. Walker, Evaluation and selection of variables in pattern recognition, in Computers and Information Sciences, Vol. II, J. T. Tou, ed., pp. 91-122. Academic Press, New York (1967).
19. R. A. Christensen, A general approach to pattern discovery, Tech. Report No. 20, University of California Computer Center, 29 June (1967).
20. R. A. Christensen, A pattern discovery program for analyzing qualitative and quantitative data, Behav. Sci. 13(5), 423-424 (1968).
21. R. A. Christensen, Seminar on entropy minimax method of pattern discovery and probability determination, Carnegie-Mellon University, Tech. Report No. 40.3.75, 7 April (1971).
22. R. A. Christensen, Entropy minimax method of pattern discovery and probability determination. Arthur D. Little, Inc., Cambridge, Massachusetts (1972).
23. R. A. Christensen, Entropy minimax, a non-Bayesian approach to probability estimation from empirical data, Proceedings of the 1973 International Conference on Cybernetics and Society, Boston, Mass., 73-CHO 799-7-SMC, pp. 321-325 (1973).
24. R. A. Christensen, Contrived patterns, trying to avoid them without trying too hard, Carnegie-Mellon University Tech. Report No. 40.12.75, June (1975).
25. R. A. Fisher, The Design of Experiments, pp. 44-49. Hafner, New York, 7th ed. (1960).
26. E. J. G. Pitman, Significance tests which may be applied to samples from any populations, Suppl. J. R. Stat. Soc. 4, 119-130 (1937).
27. W. Hoeffding, The large sample power of tests based on permutations of observations, Ann. Math. Stat. 23, 169-193 (1952).
28. G. E. P. Box and S. L. Anderson, Permutation theory in the derivation of robust criteria and the study of departures from assumptions, J. R. Stat. Soc. B 17, 1-34 (1955).
About the Author--RICHARD EILBERT received his B.S. degree in physics from the City College of New York in 1967. Harvard University granted his M.S. in 1969 and Ph.D. in 1975, both in physics. Dr. Eilbert specialized in medical applications of proton beams, including bone mineral measurement and radiography. He began work for Entropy Ltd. in 1976 and is currently senior research physicist there. He has been involved with various projects in multivariate pattern discovery, including drought forecasting and nuclear fuel reliability.
About the Author--RONALD CHRISTENSEN received his B.S. degree in Electrical Engineering from Iowa State University in 1958, his M.S. degree in Mechanical Engineering from the California Institute of Technology in 1959, his J.D. degree in Law from Harvard Law School in 1962 and his Ph.D. degree in Theoretical Physics from the University of California at Berkeley in 1969. He has worked for numerous organizations including the Theoretical Physics Group of the Lawrence Radiation Laboratory, the RAND Corp., IBM Deutschland, G.E., duPont and Arthur D. Little, Inc. He has been on the faculty of the University of Maine and of Carnegie-Mellon University. Dr. Christensen is currently President of Entropy Ltd., a firm specializing in multivariate data analysis.