New Measures of Association for Numerical Variables

Michael SMITHSON and Kenneth KNIBB

Mathematical Social Sciences 11 (1986) 161-182
North-Holland
Behavioural Sciences Department, James Cook University, Queensland 4811, Australia

Communicated by F.W. Roush
Received 30 March 1985
Revised 1 May 1985

Hypothesis-testing in the social sciences has been constrained by the inability of conventional measures of association for numerical variables to model anything beyond a restricted class of one-to-one predictive statements. In order to test 'if-then' or other one-to-many predictions, we require a suitable paradigm for constructing new measures of bivariate association. This paper combines concepts from fuzzy set theory with the proportional reduction of error (PRE) logic that has been used for measures of association in contingency table problems, thereby providing a general framework for custom-designing measures of association. This framework incorporates probabilistic, fuzzy, and distance models for measuring deviation away from a prediction. The paper explores criteria for choosing among several competing measures, the issue of relative predictive stringency, estimation bias and stability, and optimization procedures for fitting a priori predictions to data. Examples are provided from real data sets.

Key words: Bivariate association; fuzzy sets; correlation; prediction; proportional reduction of error.

1. Introduction

In the social sciences there is a largely unnoticed problem of operationalizing common predicates. Perhaps the most outstanding example of this difficulty is the inability to distinguish between 'if-then' and 'iff' propositions for bivariate relationships. While elementary formal logic provides the means to do so for categorical variables, corresponding techniques have not been developed for numerical variables. Commonly available measures of bivariate association (e.g. Pearson's r and eta-squared) are capable of testing only one-to-one predictions for pairs of variables. Fig. 1 demonstrates the translation problem by contrasting the prediction region for the proposition 'if X is high then Y will be high' with that for the proposition 'Y will be high iff X is high', both for dichotomous and numerical X and Y. Clearly, researchers who wish to test 'if-then' or other one-to-many predictions for numerical variables require a suitable class of measures of association.

This paper presents some methods for overcoming this problem which are based in fuzzy set theory and so-called proportional reduction of error logic (PRE from now on). PRE measures of association have been used for some time in contingency



Fig. 1. Prediction regions for 'If X is high, then Y is high' and for 'Y is high iff X is high', each shown for dichotomous and numerical X and Y (axes run from Low to High). [Only the panel titles and axis labels were recoverable from the original figure.]

table problems (Guttman, 1941; Costner, 1965), and their strengths and weaknesses are well known. Hildebrand et al. (1977) provided a general framework for PRE measures in which they permitted weighted error-counting functions for reckoning deviations away from predicted regions. However, they did not extend this framework to numerical variables, nor did they fully develop rationales for assigning weights. The approach here, following suggestions in Smithson (1985), is to develop distance, probabilistic, and fuzzy logical models for error-weight schemes, and criteria for choosing among them for specific research problems. Accordingly, after a brief review of PRE logic and concepts, the different error weight models are laid out in the next section. Subsequently, some necessary modifications are made to the Hildebrand, Laing and Rosenthal (HLR) framework for PRE measures. The sections after that compare different weighting schemes on a variety of criteria, including a Monte Carlo study of estimation bias and related statistical properties. Finally, questions of a priori hypothesis testing versus ex post facto optimization are discussed.

2. Proportional reduction of error framework

The fundamental process behind constructing PRE measures of association involves four steps: (1) specify a rule 'K' for predicting the state of a dependent variable Y when the state of an independent variable X is known; (2) specify another rule 'U' for predicting Y when the state of X is unknown; (3) define what constitutes predictive error and how it is to be counted or measured; and (4) define a measure of bivariate association which has the form (rule U error − rule K error)/rule U error, or (U − K)/U.


Following the HLR framework, the measure defined in Step (4) will be denoted by ∇. Depending on how rules U and K, predictive error, and error accounting have been defined, ∇ takes many forms and includes as special cases a number of familiar measures of association (see Hildebrand et al., 1977, for a review of these).

For variables with a finite number of possible states, ∇ has a simple computing formula. Let Y be indexed by i and have I states, and X be indexed by j and have J states. Let P_i. and P_.j denote the appropriate marginal probabilities for the ith state of Y and the jth state of X, respectively, and let P_ij be the observed cell probability. Then when only the marginals are known, we predict a cell probability of P_i. P_.j, which constitutes our rule U procedure. Now, assuming each cell has been assigned a degree of error w_ij, which equals 0 for cells in the prediction region and is nonzero otherwise, the formula for ∇ becomes:

$$\nabla = 1 - \frac{\sum_i \sum_j w_{ij} P_{ij}}{\sum_i \sum_j w_{ij} P_{i\cdot} P_{\cdot j}}. \qquad (1)$$
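To make the computation concrete, here is a minimal sketch in Python (our illustration; the function name and the toy table are invented, not from the paper) of equation (1). Note that the tables later in the paper report U and K on the raw frequency scale, i.e. multiplied by N; the ratio ∇ is unaffected.

```python
import numpy as np

def pre_measure(P, w):
    """Return (U, K, nabla) for a table of cell counts P and error weights w."""
    P = np.asarray(P, dtype=float)
    P = P / P.sum()                      # convert counts to proportions P_ij
    w = np.asarray(w, dtype=float)
    Pi = P.sum(axis=1, keepdims=True)    # row marginals P_i.
    Pj = P.sum(axis=0, keepdims=True)    # column marginals P_.j
    U = (w * Pi * Pj).sum()              # rule U error: expected under independence
    K = (w * P).sum()                    # rule K error: observed under the prediction
    return U, K, 1.0 - K / U

# Toy 2x2 'if X is high then Y is high' prediction: the only error cell is
# (Y low, X high), so w = 1 there and 0 elsewhere.
counts = [[30, 45],    # Y high
          [40,  5]]    # Y low
weights = [[0, 0],
           [0, 1]]
print(pre_measure(counts, weights))
```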

There are generally two kinds of PRE measures in the mainstream literature. One is essentially ex post facto, in the sense that the prediction rules U and K, and the error-counting method, depend on prior knowledge of the distributions of X and Y, sometimes jointly. For contingency tables, Guttman's lambda is such a measure, since it specifies a prediction according to the modal values of X and Y. For numerical variables, Pearson's r has such an interpretation, as does eta-squared. The rule U prediction for r and eta is the mean of Y. Rule K and error-counting for r also are ex post facto, in so far as they count deviations away from the regression line determined by least-squares, which utilizes prior knowledge of the joint distribution of X and Y. Only the linear form of the relationship is given a priori. Eta is even more ex post facto; the form of the relationship itself is determined by the joint distribution of X and Y. That both measures of association have a PRE interpretation in terms of residual sums of squares is well known.

The second kind of PRE measure involves prediction regions and error weight schemes established independently of any foreknowledge of X and Y. The HLR framework focuses on this kind, and so does the treatment offered here. Obviously, there can be different degrees of ex post facto and a priori measures, and several 'intermediate' cases will be discussed in which the form of a prediction is a priori, but its parameters are permitted to 'flex' with the data and/or the researcher's requirements.

Before leaving this general discussion, one additional concept needs introducing. In Fig. 1 it is evident that one-to-one predictions are more stringent than one-to-many predictions. One might well expect that a one-to-many prediction would yield a higher ∇-value than a one-to-one prediction, and of course it is possible to define trivial prediction regions to maximize ∇. If we are to admit one-to-many predictions


in the social sciences, then we require a measure of prediction 'precision'. The HLR framework provides a definition and measure for precision, namely rule U. The rationale is that any prediction must be compared against a 'maximally precise' benchmark, and Hildebrand et al. point out that the strictest possible prediction is one in which all cells are errors, i.e. w_ij = 1 for all i and j. Hence, the maximum possible value for U is 1. For reasons given in a subsequent section, this definition is inadequate for error weighting schemes in which the w_ij take on values other than 0 and 1.

3. Models for assigning error weights

There are three main reasons for assigning different weights to errors. One is simply that one error may be considered more drastic than another. Probably the most common example is the notion that deviations far away from the regression line should count as greater errors than deviations which are close to the line. A second reason reflects an assignment of conditional probabilities to particular events. Thus, an investigator might predict that P(Y_i | X_j) = q_ij. This is equivalent to assigning w_ij = 1 − q_ij. A third reason is that a given prediction may be 'fuzzy' in its use of graded concepts or fuzzy categories. Homans' (1961) propositions for elementary social laws are good examples, to wit: 'Loss in competition tends to arouse anger' (p. 123). Not only are the variables 'loss in competition' and 'arousal of anger' graded and therefore fuzzy, but the predicate itself also is fuzzy. Intuition suggests that the truth-values of different outcomes in relation to this kind of proposition also ought to be fuzzy, which entails assigning the w_ij values between 0 and 1.

The probabilistic assignment of error weights is thoroughly discussed in the HLR framework, and so will not be dealt with here. Instead, the distance and fuzzy models will be discussed, along with issues arising from them. Under rather general conditions, the distance models may be subsumed as special cases of fuzzy logical models. The distance model is easily summarized in terms of a measure, d_ij, of how far a given pair (X_j, Y_i) is from the predicted region:

$$w_{ij} = d_{ij} = \min_p \, |Y_i - Y_{pj}|^n, \quad \text{for } n \ge 1, \qquad (2)$$

where Y_pj ranges over all Y-values in the predicted region for the jth state of X. In the special case where the prediction takes the form 'if X is high, then Y is high' we have:

$$w_{ij} = \begin{cases} 0, & \text{if } Y_i \text{ is above the diagonal in Fig. 1,} \\ |Y_i - Y_p|^n, & \text{otherwise,} \end{cases} \qquad (3)$$

where Y_p is the lowest-valued Y_i of the pairs (X_j, Y_i) within the predicted region.
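A small sketch of (2)-(3) (our illustration; it assumes the predicted region for the jth state of X can be enumerated as a set of Y-values):

```python
def distance_weight(y_i, predicted_ys, n=1):
    """w_ij = min over the predicted Y-values of |Y_i - Y_pj|^n (0 inside the region)."""
    return min(abs(y_i - yp) ** n for yp in predicted_ys)

# 'If X is high then Y is high' on a 0-5 scale: for X = j the predicted region
# is Y >= j, so errors below the diagonal grow with squared distance (n = 2).
for j in range(6):
    print([distance_weight(y, range(j, 6), n=2) for y in range(6)])
```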


The fuzzy logical model requires considerably more development. Assume that for X and Y there are constructs which are measured by these variables, denoted by C_xk and C_ym, respectively, where k runs from 1 to K and m runs from 1 to M. For each pair (X_j, C_xk) define a mapping t_jk into [0, 1] which reflects the truth-value of the statement 'X_j is C_xk'. Define a similar mapping t_im for 'Y_i is C_ym'. A common special case is where K and M both equal 1, so that the constructs have only one state to be measured by their respective indicator variables (e.g. income being used to measure degree of 'high purchasing power'). Under that condition, the mappings reduce to simple truth-scales t_j and t_i. On the other hand, if we had a predictive statement of the form 'If high education, then high income; otherwise low to moderate income', then it might be necessary to operationalize three truth mappings, one each for 'high', 'low', and 'moderate' income.

We are now in a position to define fuzzy logical implication functions for predictive statements such as 'If X_j is C_xk, then Y_i is C_ym; otherwise Y is (C_y1 or C_y2 or ... or C_ys)'. We discuss first the special case where the 'otherwise' alternative is the complement of C_ym, in which case the prediction reduces to the classical implication: 'If X_j is C_xk, then Y is C_ym', or C_xk ⇒ C_ym.

Two alternative definitions of fuzzy logical implication have been studied (Zadeh, 1975). The first is called the 'Arithmetic Rule' and specifies

$$t(C_{xk} \Rightarrow C_{ym}) = \text{not}(X_j \text{ is } C_{xk}) \text{ or } (Y_i \text{ is } C_{ym}) = (1 - t_{jk}) \text{ or } t_{im}. \qquad (4)$$

The second, the 'Maxmin Rule', stipulates

$$t(C_{xk} \Rightarrow C_{ym}) = [(X_j \text{ is } C_{xk}) \text{ and } (Y_i \text{ is } C_{ym})] \text{ or not}(X_j \text{ is } C_{xk}) = (t_{jk} \text{ and } t_{im}) \text{ or } (1 - t_{jk}). \qquad (5)$$

Any suitable fuzzy logical operators for 'and' and 'or' may be used in conjunction with either of these rules. The fact that there are many such operators, plus the existence of two alternative formulations of implication, creates an embarrassment of riches in assigning error weights on the basis of fuzzy predictive statements. Before proceeding, it is worthwhile briefly surveying the available operators and their relevant properties.

The three most popular operators for fuzzy 'and' and 'or' in the literature are the min-max, product, and bounded sum operators. Their formulas are given below in terms of fuzzy membership values u, where a refers to fuzzy set A and b refers to fuzzy set B:

min-max:
$$u(A \text{ or } B) = \max(u_a, u_b), \quad u(A \text{ and } B) = \min(u_a, u_b); \qquad (6)$$

product:
$$u(A \text{ or } B) = u_a + u_b - u_a u_b, \quad u(A \text{ and } B) = u_a u_b; \qquad (7)$$

bounded sum:
$$u(A \text{ or } B) = \min(1, u_a + u_b), \quad u(A \text{ and } B) = \max(0, u_a + u_b - 1). \qquad (8)$$
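These three operator pairs are one-liners; a sketch for reference:

```python
# The three fuzzy 'or'/'and' operator pairs of (6)-(8); arguments are
# membership values in [0, 1].
ops = {
    "min-max": (lambda a, b: max(a, b),       lambda a, b: min(a, b)),
    "product": (lambda a, b: a + b - a * b,   lambda a, b: a * b),
    "bounded": (lambda a, b: min(1.0, a + b), lambda a, b: max(0.0, a + b - 1.0)),
}
for name, (f_or, f_and) in ops.items():
    print(f"{name}: or={f_or(0.7, 0.5):.2f}, and={f_and(0.7, 0.5):.2f}")
```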

Other operators have been proposed (for a more extensive review, see Trillas et al., 1982), and Smithson (1984) defined a family of operators which incorporates min-max and bounded sum as special cases, and whose extremity (Archimedeanity) varies linearly in one parameter. For present purposes, however, we will focus on the above three operators.

The Arithmetic Rule yields the following formulas for conditional truth functions:

min-max:
$$t(C_{xk} \Rightarrow C_{ym}) = \max(1 - t_{jk}, t_{im}); \qquad (9)$$

product:
$$t(C_{xk} \Rightarrow C_{ym}) = 1 - t_{jk} + t_{jk} t_{im}; \qquad (10)$$

bounded:
$$t(C_{xk} \Rightarrow C_{ym}) = \min(1, 1 - t_{jk} + t_{im}). \qquad (11)$$

The Maxmin Rule yields these functions for the same operators:

min-max:
$$t(C_{xk} \Rightarrow C_{ym}) = \max(1 - t_{jk}, \min(t_{jk}, t_{im})); \qquad (12)$$

product:
$$t(C_{xk} \Rightarrow C_{ym}) = 1 - t_{jk} + t_{jk}^2 t_{im}; \qquad (13)$$

bounded:
$$t(C_{xk} \Rightarrow C_{ym}) = \max(1 - t_{jk}, t_{im}). \qquad (14)$$

In addition to these fuzzy logical functions, there are a number of candidates available from the nonfuzzy literature on continuous-valued logics (Rescher, 1969; Zadeh, 1979). Two of the most commonly used are:

Godelian:
$$t(C_{xk} \Rightarrow C_{ym}) = \begin{cases} 1, & \text{when } t_{im} \ge t_{jk}, \\ t_{im}, & \text{otherwise}; \end{cases} \qquad (15)$$

Ratio:
$$t(C_{xk} \Rightarrow C_{ym}) = \begin{cases} 1, & \text{when } t_{im} \ge t_{jk}, \\ t_{im}/t_{jk}, & \text{otherwise}. \end{cases} \qquad (16)$$

Neither of these logics is interpretable in terms of either the Arithmetic or Maxmin rules. However, the so-called distance models are, for the class of Minkowskian distance functions. Under the condition that the values for X_j and Y_i are monotonically related to t_jk and t_im, respectively, the d_ij in equation (3) are monotonically related to the following 'truth distance' model:

distance:
$$t(C_{xk} \Rightarrow C_{ym}) = \begin{cases} 1, & \text{when } t_{im} \ge t_{jk}, \\ 1 - (t_{jk} - t_{im})^n, & \text{otherwise, for } n \ge 1. \end{cases} \qquad (17)$$

Immediately, for n = 1 the distance model reduces to the bounded sum function for the Arithmetic Rule. Furthermore, for any value of n, it is easy to show that the functions in (17) are equivalent to Arithmetic Rule functions generated by the following pair of fuzzy set aggregation operators:

$$u(A \text{ or } B) = 1 - (\max(0, 1 - u_a - u_b))^n, \quad u(A \text{ and } B) = (\max(0, u_a + u_b - 1))^n. \qquad (18)$$

Both operators fulfill the usual criteria for fuzzy set union and intersection. Thus, this class of distance functions is a special case of the Arithmetic Rule operating on fuzzy set theory.

To convert these truth functions into error functions, we simply use fuzzy negation, t(not A) = 1 − t_A. We shall denote these error functions by w_imjk, to indicate that they are components of the w_ij to be assigned to cell (i, j). Equations (9) and (14) reveal that the truth function for min-max operators under the Arithmetic Rule is identical to that for the bounded sum operators under the Maxmin Rule. Thus, we drop the Maxmin Rule bounded sum function from here on.

Arithmetic Rule:

min-max:
$$w_{imjk} = \min(t_{jk}, 1 - t_{im}); \qquad (19)$$

product:
$$w_{imjk} = t_{jk} - t_{jk} t_{im}; \qquad (20)$$

bounded:
$$w_{imjk} = \max(0, t_{jk} - t_{im}); \qquad (21)$$

distance:
$$w_{imjk} = \begin{cases} 0, & \text{when } t_{im} \ge t_{jk}, \\ (t_{jk} - t_{im})^n, & \text{otherwise, for } n \ge 1. \end{cases} \qquad (22)$$

Maxmin Rule:

min-max:
$$w_{imjk} = \min(t_{jk}, \max(1 - t_{jk}, 1 - t_{im})); \qquad (23)$$

product:
$$w_{imjk} = t_{jk} - t_{jk}^2 t_{im}. \qquad (24)$$

Godelian:
$$w_{imjk} = \begin{cases} 0, & \text{when } t_{im} \ge t_{jk}, \\ 1 - t_{im}, & \text{otherwise}; \end{cases} \qquad (25)$$

Ratio:
$$w_{imjk} = \begin{cases} 0, & \text{when } t_{im} \ge t_{jk}, \\ 1 - t_{im}/t_{jk}, & \text{otherwise}. \end{cases} \qquad (26)$$
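The component error functions (19)-(26) translate directly into code. The following sketch (our construction, assuming a single construct per variable so that each function maps a pair of truth-values to a weight) evaluates each function on a clear counterexample to 'if X then Y':

```python
# tjk and tim are the truth-values of 'X_j is C_xk' and 'Y_i is C_ym'.
error_functions = {
    "AR min-max":      lambda tjk, tim: min(tjk, 1 - tim),                        # (19)
    "AR product":      lambda tjk, tim: tjk - tjk * tim,                          # (20)
    "AR bounded sum":  lambda tjk, tim: max(0.0, tjk - tim),                      # (21)
    "AR distance n=2": lambda tjk, tim: 0.0 if tim >= tjk else (tjk - tim) ** 2,  # (22)
    "MM min-max":      lambda tjk, tim: min(tjk, max(1 - tjk, 1 - tim)),          # (23)
    "MM product":      lambda tjk, tim: tjk - tjk ** 2 * tim,                     # (24)
    "Godelian":        lambda tjk, tim: 0.0 if tim >= tjk else 1 - tim,           # (25)
    "Ratio":           lambda tjk, tim: 0.0 if tim >= tjk else 1 - tim / tjk,     # (26)
}
# X is clearly high (t = 0.9) but Y barely so (t = 0.2): every function
# assigns a substantial error weight, though the magnitudes differ.
for name, f in error_functions.items():
    print(f"{name:15s} w = {f(0.9, 0.2):.3f}")
```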

The way in which these component error functions w_imjk are combined to form the w_ij depends on the form of the compound prediction statement. Obviously, if for X and Y there is only one construct each, then w_imjk = w_ij for all (i, j). If there is more than one construct, then in a wide variety of cases the compound prediction statement will string its component predictions together with 'and' connectives. For example, given a quantitative scale measuring individuals' positions on the political spectrum from left to right-wing, one might predict for Australian political parties that Labour Party supporters will be left-wing, Liberal Party supporters will be moderate, and National Country Party supporters will be right-wing. These three component predictions are connected by 'ands', and therefore their component truth functions also must be combined by fuzzy logical 'ands'. Owing to the distributive property of fuzzy negation, it follows that the component error functions in this case should be combined using fuzzy 'or' operators. Thus, for example, the min-max error functions under the Arithmetic Rule would result in the following w_ij:

$$w_{ij} = \max_{k,m} (w_{imjk}) = \max_{k,m} (\min(t_{jk}, 1 - t_{im})). \qquad (27)$$

We may now turn to the more general case where there is a specific alternative in the prediction. To simplify notation, denote the alternative (C_y1 or C_y2 or ... or C_ys) by C_A. Also, denote the general predictive statement by the shorthand C_xk ⇒ C_ym (C_A). Finally, let t_A denote t(Y_i is C_A). Generalized versions of the Arithmetic and Maxmin rules have been studied in the literature (Zadeh, 1975; Mizumoto, 1982):

Arithmetic:
$$t(C_{xk} \Rightarrow C_{ym}(C_A)) = (\text{not}(X_j \text{ is } C_{xk}) \text{ or } (Y_i \text{ is } C_{ym})) \text{ and } ((X_j \text{ is } C_{xk}) \text{ or } (Y_i \text{ is } C_A)) = ((1 - t_{jk}) \text{ or } t_{im}) \text{ and } (t_{jk} \text{ or } t_A); \qquad (28)$$

Maxmin:
$$t(C_{xk} \Rightarrow C_{ym}(C_A)) = ((X_j \text{ is } C_{xk}) \text{ and } (Y_i \text{ is } C_{ym})) \text{ or } (\text{not}(X_j \text{ is } C_{xk}) \text{ and } (Y_i \text{ is } C_A)) = (t_{jk} \text{ and } t_{im}) \text{ or } ((1 - t_{jk}) \text{ and } t_A). \qquad (29)$$

The Godelian and Ratio rules have been generalized as well. Tanaka et al. (1982) applied the following generalization of Godelian implication:

$$t(C_{xk} \Rightarrow C_{ym}(C_A)) = \begin{cases} t_A, & \text{when } t_{im} \ge t_{jk}, \\ t_{im}, & \text{otherwise}. \end{cases} \qquad (30)$$

Mizumoto (1982), however, provided a more logically satisfying version by replacing the Arithmetic Rule 'or' operators with ⇒:

$$t(C_{xk} \Rightarrow C_{ym}(C_A)) = ((X_j \text{ is } C_{xk}) \Rightarrow (Y_i \text{ is } C_{ym})) \text{ and } (\text{not}(X_j \text{ is } C_{xk}) \Rightarrow (Y_i \text{ is } C_A)) = t_{ijkm} \text{ and } t_{ijkA}, \qquad (31)$$

where

$$t_{ijkm} = \begin{cases} 1, & \text{when } t_{im} \ge t_{jk}, \\ t_{im}, & \text{otherwise}; \end{cases} \quad \text{and} \quad t_{ijkA} = \begin{cases} 1, & \text{when } t_A \ge 1 - t_{jk}, \\ t_A, & \text{otherwise}. \end{cases}$$

Mizumoto leaves the choice of a fuzzy 'and' operator to the user. We may generalize the Ratio Rule in an obvious way using the same scheme.

A common example of a compound prediction with a specified alternative in behavioral scientific research is the so-called 'floor effect' or 'ceiling effect' prediction, in which a relationship between X and Y holds only above (or below) certain boundary values in X and Y. A typical floor effect model posits that for any X_j greater than, say, k, C_xk ⇒ C_ym; otherwise Y_i lies below some upper bound (say m). If the bounds k and m are treated as 'crisp' (unfuzzy) boundaries, then the Arithmetic and Maxmin rules give the same results:

$$t[X_j \ge k \Rightarrow (C_{xk} \Rightarrow C_{ym})(Y \le m)] = \begin{cases} 0, & \text{when } X_j < k \text{ and } Y_i > m, \\ 1, & \text{when } X_j < k \text{ and } Y_i \le m, \\ t(C_{xk} \Rightarrow C_{ym}), & \text{when } X_j \ge k. \end{cases} \qquad (32)$$
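The crisp case (32) translates directly into code; in this sketch (our illustration) t_impl stands for any of the implication truth functions above, applied to the truth-values associated with x and y:

```python
def floor_effect_truth(x, y, k, m, t_impl):
    """Crisp floor-effect prediction, cf. (32): below the floor k the
    prediction is satisfied iff y <= m; at or above k, the ordinary
    implication truth-value applies."""
    if x < k:
        return 1.0 if y <= m else 0.0
    return t_impl(x, y)   # any rule from (9)-(17), composed with truth-scales
```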

Of course, k and m may be treated as fuzzy numbers, with appropriate definitions for t(X_j ≥ k) and t(Y_i ≤ m). Under those conditions, the Maxmin and Arithmetic rules' results differ when X_j ≥ k and Y_i ≤ m:

$$t[X_j \ge k \Rightarrow (C_{xk} \Rightarrow C_{ym})(Y \le m)] = \begin{cases} 0, & \text{when } X_j < k \text{ and } Y_i > m, \\ t(Y_i \le m), & \text{when } X_j < k \text{ and } Y_i \le m, \\ t(X_j \ge k) \text{ and } t(C_{xk} \Rightarrow C_{ym}), & \text{when } X_j \ge k \text{ and } Y_i > m, \end{cases}$$

and, when X_j ≥ k and Y_i ≤ m,

$$t[\cdot] = [(1 - t(X_j \ge k)) \text{ or } t(C_{xk} \Rightarrow C_{ym})] \text{ and } [t(X_j \ge k) \text{ or } t(Y_i \le m)] \quad \text{(Arithmetic Rule), or}$$

$$t[\cdot] = [(1 - t(X_j \ge k)) \text{ and } t(Y_i \le m)] \text{ or } [t(X_j \ge k) \text{ and } t(C_{xk} \Rightarrow C_{ym})] \quad \text{(Maxmin Rule)}. \qquad (33)$$

That these error functions result in quite different w_ij is evident from their formulas. However, it is not immediately evident whether they also produce different results for ∇. A simple example suffices to demonstrate that they do. Table 1 displays some data from a study of attitudes toward political violence (Muller, 1972). These data were analyzed in Hildebrand et al. (1977, pp. 109-110), and since the prediction hypothesis lends itself to quantification it provides an interesting case for comparison with the HLR framework. Muller's hypothesis was: '...(1) APV (approval of political violence) and IPV (intention to engage in political violence) will show a strong degree of fit to a monotonic function, and (2) cases which deviate from direct monotonicity will show higher rank-order on APV than on IPV.'


Table 1
Muller's data on political violence: the cross-classification of Y = APV (approval of political violence, scored 0-5) by X = IPV (intention to engage in political violence, scored 0-5), N = 499. [The individual cell frequencies could not be recovered from the original scan.]

Muller's hypothesis is quite similar to the proposition 'if IPV is high then APV will be high'. Table 2 shows the resulting values for ∇ when each of the error functions in (19)-(26) is applied. Muller's hypothesis requires only one construct for X and one for Y, so w_imjk = w_ij for all (i, j). We have assumed here that t_i and t_j are simple linear transforms of the APV and IPV scales, since both scales have Likert formats.

The reason for the considerably lower ∇-values for the min-max and product error functions lies in the fact that, unlike the others, they assign nonzero w_ij to cells above the diagonal. There seems to be no simple relationship between the nature of a given error function and the value of ∇, since ∇ also depends crucially on the bivariate distribution of X and Y. However, it is possible to express differences between ∇-values for two error functions in terms of differences in their error weights, which is useful when the two functions differ only on a subset of their weights. Denote the error weights for one function by w_ij, and for the other by v_ij. Now, let w_ij − v_ij = e_ij, and define

$$\nabla_e = \frac{\sum_i \sum_j e_{ij} P_{i\cdot} P_{\cdot j} \nabla_{ij}}{\sum_i \sum_j e_{ij} P_{i\cdot} P_{\cdot j}}, \quad \text{where} \quad \nabla_{ij} = 1 - \frac{P_{ij}}{P_{i\cdot} P_{\cdot j}}.$$

Table 2
U, K, and ∇ for error functions

Error function        U        K       ∇
HLR 'unweighted'    102.82   17.00   0.835
AR min-max           78.19   62.80   0.197
AR product           59.21   39.88   0.326
AR bounded sum       33.67    3.60   0.893
AR distance, n = 2   14.51    0.80   0.945
MM min-max           81.97   70.20   0.144
MM product           74.06   60.98   0.177
Godelian Rule        89.69   12.20   0.863
Ratio Rule           79.60    9.60   0.879

Note: AR denotes Arithmetic Rule, and MM denotes Maxmin Rule functions.


Then it is not difficult to show that ∇_w − ∇_v > 0 either when

$$\sum_i \sum_j e_{ij} P_{i\cdot} P_{\cdot j} \nabla_{ij} > 0 \ \text{ and } \ \nabla_e > 0, \quad \text{or when} \quad \sum_i \sum_j e_{ij} P_{i\cdot} P_{\cdot j} < 0 \ \text{ and } \ \nabla_e < 0. \qquad (34)$$

4. Modifications to the HLR framework

The HLR framework is a special case of the perspective developed in the preceding section, in so far as it focuses on error functions which assign only values of 0 or 1. However, the HLR approach to measuring precision requires modification before it can be assimilated into the present approach. The Muller data provide a good discussion point for this topic, since HLR analyzed those data in two ways. In addition to their 'unweighted' function (w_ij = 0 or 1), they applied a squared-error function in which the error weights were defined by w_ij = (APV − IPV)²/25, which is equivalent to (22) for n = 2. Naturally, they obtained ∇ = 0.945. However, they also assessed the 'precision' of this error function by dividing U by N and compared it with the U/N for the unweighted case. They arrived at a precision of 0.206 for the unweighted function and 0.029 for the squared-error function, and concluded that the latter is much less precise than the former.

This conclusion is unwarranted for at least two reasons. First, there is no reason to suppose that sample size should be a benchmark against which to compare error rates from functions whose weights vary between 0 and 1. At best, this benchmark is suited only for the unweighted error function in which each error counts as 1. As HLR themselves state (p. 107), precision is affected by the numeraire adopted for measuring error, and one cannot make comparisons across different error functions without specifying a common numeraire. Their suggestion that ∇ is unaffected by choice of numeraire because it is a ratio variable does not hold, because precision also is a ratio variable. This brings us to the second objection, which is that just as ∇ is unaffected by a multiplicative transformation (although it is affected by adding a constant to the w_ij), so too should precision be unaffected by multiplication of the w_ij by a constant. The decision to divide the squared differences (APV − IPV)² by 25 is arbitrary. If we do not divide by 25, then the value of ∇ still is 0.945, but precision as measured by U/N changes dramatically to 0.726.

A general approach to assessing prediction precision must distinguish between two kinds of comparison: (1) between identical predictions with different error functions, and (2) between rival predictions with the same error function. Obviously, a comparison involving both different predictions and error functions would confound the effects of the two, and so such comparisons are forbidden in the approach provided here.


Both comparisons involve using a 'maximally stringent' prediction as a benchmark, which uses the same error function as the original prediction. This benchmark is a one-to-one prediction, and the relative precision of the original prediction is assessed by taking the ratio of its U to the rule U error rate (denoted by Umax) for its maximally stringent counterpart. The proportional reduction in error due to shifting from a maximally stringent hypothesis to the hypothesis under consideration is (Umax − U)/Umax. Thus, relative precision may be assessed by subtracting this PRE measure from 1, which gives H = U/Umax. The distinction between the two types of comparison hangs on how the benchmark prediction is constructed.

For the first comparison, we wish to establish a benchmark such that any difference between precision for different error functions is due entirely to differences in those functions and unique up to a multiplicative transformation. These criteria are satisfied by an error function constructed on the basis of the complement of the prediction region. That is to say, a complementary set of error weights, w'_ij, are assigned to (i, j) by reversing the prediction and error regions and then applying the error function used in the original prediction. For 'if-then' propositions, this procedure has the effect of constructing the contrapositive of the original proposition. Then, the maximally stringent error weights are defined by m_ij = w_ij + w'_ij. Obviously, the maximally stringent counterpart to any 'if-then' prediction is the corresponding 'iff'.

Table 3 shows the w_ij for selected error functions, and precision (H) values for all functions discussed so far, again using the Muller data. The fact that H for the HLR, distance, Godelian, and Ratio functions is less than 0.5 is due to the skewed marginal distributions in the Muller data. It is easy to verify that for symmetric marginals, the precision of any 'if-then' prediction whose error function assigns w_ij = 0 to points above the diagonal is exactly 0.5, while left- or top-skewed marginals increase H and right- or bottom-skewed marginals decrease H. Note that the precision levels for these functions are much more in line with that for the unweighted HLR function, and in some cases (e.g. those which give nonzero weights above the diagonal) exceed the HLR precision level.

As with ∇, there appears to be no straightforward relationship between H-values and the properties of error functions. If we denote differences between error weights by e_ij, and differences between the complementary weights by e'_ij, it can be shown that when

$$\sum_i \sum_j e_{ij} P_{i\cdot} P_{\cdot j} > 0 \quad \text{or} \quad \sum_i \sum_j e'_{ij} P_{i\cdot} P_{\cdot j} < 0, \quad \text{then } H_w - H_v > 0 \qquad (35)$$

if

$$\sum_i \sum_j e'_{ij} P_{i\cdot} P_{\cdot j} \le \sum_i \sum_j P_{i\cdot} P_{\cdot j}\,[m_{ij} - (e_{ij} + e'_{ij})]. \qquad (36)$$

Likewise, when Σ_i Σ_j e_ij P_i. P_.j < 0, then H_w − H_v > 0 if the inequality in (36) also is reversed.
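A minimal sketch of these precision calculations, assuming the complementary weights w'_ij have already been obtained by applying the same error function to the reversed prediction region; names are illustrative:

```python
import numpy as np

def precision_and_gre(P, w, w_comp):
    """H = U/Umax and GRE = (U - K)/Umax, with m_ij = w_ij + w'_ij."""
    P = np.asarray(P, dtype=float)
    P = P / P.sum()
    w = np.asarray(w, dtype=float)
    w_comp = np.asarray(w_comp, dtype=float)      # complementary weights w'_ij
    Pi = P.sum(axis=1, keepdims=True)
    Pj = P.sum(axis=0, keepdims=True)
    U = (w * Pi * Pj).sum()
    K = (w * P).sum()
    Umax = ((w + w_comp) * Pi * Pj).sum()         # maximally stringent benchmark
    return U / Umax, (U - K) / Umax
```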


Table 3
Umax, H, and GRE for error functions

Error function       Umax     H       GRE
HLR 'unweighted'    378.77   0.271   0.227
AR min-max          183.35   0.426   0.084
AR product          182.27   0.325   0.106
AR bounded sum      138.83   0.243   0.217
AR distance, n = 2   67.73   0.215   0.203
MM min-max          183.37   0.447   0.064
MM product          200.38   0.370   0.065
Godelian Rule       345.06   0.260   0.225
Ratio Rule          337.17   0.236   0.208

w_ij for selected error functions: [the weight matrices shown in the original for the AR min-max, MM min-max, Godelian Rule, and Ratio Rule functions could not be recovered from the scan.]
The second kind of comparison requires that we construct a benchmark prediction which is appropriate for assessing two rival hypotheses using the same error function. To do this, we must find a one-to-one prediction that maximizes rule U error. The algorithm used by HLR for maximizing ∇ (pp. 132-138) may be adapted for this purpose. It amounts to assigning 0 to one cell for each X_j, by choosing that pair (i, j) in the jth column which has the largest P_i. P_.j. The error function used for the original prediction is then used to assign values to the remaining w_ij. This procedure results in the selection of the row with the largest P_i. for the zero-valued weights. The utility of this benchmark is questionable, since its only advantage is that it enables values to be computed for H. But if all the researcher wishes to do is find out which hypothesis is the most precise, then merely comparing their U-values will suffice.

Comparing both ∇- and H-values for rival hypotheses or error functions raises the issue of how these two pieces of information may be combined in such a way that the researcher is able to make a reasoned choice. HLR point out that H∇ is a measure of 'total error reduction', which they call RE. Here, we may define a similar measure which actually indicates the total reduction of error relative to the


benchmark being used (denoted GRE). Its form is GRE = (U − K)/Umax. The utility of GRE depends, of course, on the researcher's substantive or conceptual objectives.

As an example, consider the data in Table 4, from a study of breast cancer mortality as a function of intake of animal fat. The joint distribution certainly suggests that high intake of animal fat is sufficient, but not necessary, to produce a high mortality rate. The corresponding hypothesis is an 'if-then' proposition. Assuming the marginal t_i- and t_j-values are simple multiplicative transformations of their

Table 4
Breast cancer and animal fat intake: Y = age-adjusted breast cancer death-rate (per 100,000) cross-classified against X = per-capita animal fat intake (g/day), with marginal frequencies, the truth-values t_i (1.00, 0.92, 0.85, 0.77, 0.69, 0.62, 0.54, 0.46, 0.38, 0.31, 0.23, 0.15, 0.08) and t_j (0.06, 0.11, 0.17, 0.22, 0.28, 0.33, 0.39, 0.44, 0.50, 0.56, 0.61, 0.67, 0.72, 0.78, 0.83, 0.89), and the resulting bounded sum error weights w_ij. (Adapted from Carroll, 1974.) [The full table could not be recovered from the original scan.]


respective variables, and using the Arithmetic Rule bounded sum error function (also equivalent to the distance function for n = 1), we derive the error weights shown in the lower part of Table 4. With K = 0.0833 and U = 3.405, we obtain ∇ = 0.976. Given that there are only 7 points in the error region, and those points are quite close to the diagonal, a natural question is whether we would lose much in precision if we shifted the boundary minimally so as to bring ∇ to 1.00. The transformation t_j' = (t_j − 0.06)/1.13 is just sufficient to do that. Now K = 0, but U = 1.8557, meaning that this hypothesis is only 0.545 times as precise as the first one. The ratio of their GRE-values is (3.405)(0.976)/(1.8557)(1.0) = 1.791, so the original hypothesis performs considerably better by this measure.

It is noteworthy that a different error function could lead to virtually the opposite choice. Using the unweighted HLR function in which each error counts as 1, for the original hypothesis K = 7 and U = 11.94, so ∇ = 0.414, which is considerably worse than under the bounded sum function. This obviously is due to the proximity of the counterindicative cases to the diagonal. Now, if we shift the boundary in exactly the fashion we did before, then the HLR function yields K = 0, U = 8.96, and therefore a ratio of GREs equal to (11.94)(0.414)/(8.96)(1.0) = 0.551, thereby indicating that the second hypothesis is superior. Clearly, there is a need for criteria by which to select appropriate error functions.

5. Criteria for selecting error functions

At least four kinds of criteria suggest themselves as a basis for selecting among different error functions. First, one may consider whether one function more effectively 'translates' a given hypothesis than another. Secondly, in so far as we are dealing with fuzzy logical functions, one may require that they satisfy certain logical or formal properties. Thirdly, the functions ought to provide good estimators of true ∇ and H for data samples. And fourth, one may wish to maximize GRE, ∇, or H. These groups of criteria will be referred to here as the translation, logical, statistical, and predictive criteria, respectively.

Little can be said in a general way about translation criteria, since often they are unique to the problem at hand. However, the choice of a ∇ measure on the basis of translatability falls naturally into three steps: (1) choosing a general class of prediction regions; (2) selecting an adequate rule U procedure for guessing the dependent variable when the independent variable is unknown; and (3) choosing an error function. If we return to the Muller (1972) hypothesis, for example, its wording leads us to choose the class of 'if-then' predictions. It gives no clues as to a reasonable rule U procedure, but the if-then class has its own default options on that issue. Most importantly, however, Muller's statement provides explicit indications of the kind of 'if-then' error function which most closely models his prediction. It is evident that he wishes errors further away from the diagonal to be more strongly


penalized, and the phrase 'strong degree of fit to a monotonic function' hints that to some extent observations above the diagonal should count as errors too. Within the 'if-then' functions we have reviewed, that eliminates all those which do not penalize observations above the diagonal, leaving the two min-max and the product functions.

Logical criteria mainly involve the absence of paradoxical outcomes, and compatibility with specific features of logical implication. An interesting criterion which straddles the logical and translation groups was proposed by Dubois and Prade (1980, pp. 159-167), when they examined various kinds of fuzzy implication in terms of their compatibility with Piagetian requirements for human reasoning. Piaget found that the development of reasoning in children involves an understanding of the distinctions between and relations among the following four kinds of transformations: (1) identity, (2) negation, (3) reciprocity, and (4) correlativity. If we have elementary propositions p, q, r, ... and a wff f(p, q, r, ...), then these four terms correspond to f, not-f, f(not-p, not-q, not-r, ...), and not-f(not-p, not-q, not-r, ...), respectively. Dubois and Prade showed that while the Arithmetic Rule class of implication functions is compatible with the Piagetian criteria, the Maxmin class, Godelian, and Ratio Rule functions are not. Furthermore, Maydole (1975) is able to generate paradoxes for Godelian logic and also for the standard sequence logic (R-SEQ), which includes the HLR 'unweighted' function as a special case.

Statistical criteria (i.e. minimum variance, unbiasedness, maximum likelihood) and predictive criteria are difficult to assess analytically for the error functions because of the usual problems of intractability. A Monte Carlo study was performed to give performance indications on these two sets of criteria, focusing mainly on bias and sample variance, and on maximum possible values for ∇, H, and GRE. The remainder of this section is devoted to describing this study and presenting its results.

The simulations entailed samples of size 500, and 30 runs on each condition. The conditions consisted of 48 combinations from a cross-classification of three table shapes (5 x 5, 3 x 8 and 8 x 3), two skewness values for the X-marginal (0 and 0.2), two skewness values for the Y-marginal (−0.2 and 0), and four population values for ∇ (0.1, 0.3, 0.5, and 0.9 or the maximum possible ∇). An earlier study conducted by the second author had suggested that asymmetric tables, positive skew in X, and negative skew in Y produce the greatest tendencies towards estimator bias in ∇. In all runs, an 'if X is high then Y is high' prediction region was used, again because the earlier study indicated greater bias for such regions than for an 'iff' proposition. Table 5 displays selected results. Two measures of estimation bias were used:

$$A_\nabla = |\nabla_s - \nabla_p| \qquad (37)$$

and

$$B_\nabla = \log[(1 - \nabla_s)/(1 - \nabla_p)], \qquad (38)$$

where ∇_s denotes the sample estimate and ∇_p the population value. [The printed form of (37) could not be recovered from the scan; the absolute-difference form shown is inferred from the description of A_∇ below as an indicator of bias magnitude but not direction.]
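The authors' simulation code is not reproduced in the paper; the following rough sketch shows one way such a study could be set up, drawing multinomial samples of size 500 from a population table and averaging the two bias measures across runs:

```python
import numpy as np

def bias_measures(P_pop, w, n=500, runs=30, seed=0):
    """Average the A and B bias measures over repeated multinomial samples."""
    rng = np.random.default_rng(seed)
    P_pop = np.asarray(P_pop, dtype=float)
    P_pop = P_pop / P_pop.sum()
    w = np.asarray(w, dtype=float)

    def nabla(P):
        Pi = P.sum(axis=1, keepdims=True)
        Pj = P.sum(axis=0, keepdims=True)
        return 1.0 - (w * P).sum() / (w * Pi * Pj).sum()

    V_pop = nabla(P_pop)
    draws = rng.multinomial(n, P_pop.ravel(), size=runs) / n
    V_s = np.array([nabla(d.reshape(P_pop.shape)) for d in draws])
    A = np.abs(V_s - V_pop).mean()                   # magnitude only, cf. (37)
    B = np.log((1.0 - V_s) / (1.0 - V_pop)).mean()   # signed, cf. (38)
    return A, B
```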

(38)


Table 5
Monte Carlo results

                Bias                       Overall   5 x 5 sym. table       3 x 8 sym. table
Error function  Av       Bv       Std.dev. ∇max      ∇max   H     GREmax    ∇max   H     GREmax
1               0.0075  -0.0052   0.15     0.67      1.00   0.50  0.50      0.50   0.50  0.25
2               0.0077  -0.0024   0.15     0.75      1.00   0.50  0.50      0.62   0.50  0.25
3               0.0156  -0.0092   0.26     0.89      1.00   0.50  0.50      0.83   0.50  0.35
4               0.0092  -0.0038   0.15     0.68      1.00   0.50  0.50      0.36   0.50  0.17
5               0.0090  -0.0060   0.14     0.63      1.00   0.50  0.50      0.23   0.50  0.10
6               0.0015  -0.0007   0.03     0.17      0.19   0.73  0.10      0.16   0.68  0.08
7               0.0047   0.0025   0.06     0.29      0.29   0.47  0.12      0.30   0.48  0.11
8               0.0017   0.0001   0.02     0.14      0.17   0.81  0.19      0.11   0.85  0.12

Note: The error functions are numbered as follows: 1. HLR 'unweighted'; 2. Arithmetic Rule bounded sum; 3. distance, with n = 2; 4. Godelian; 5. Ratio; 6. Arithmetic Rule min-max; 7. Arithmetic Rule product; 8. Maxmin Rule min-max.
The Av measure is a raw indicator of bias magnitude (but not direction). By, on the other hand, is symmetric about zero and indicates direction as well as relative magnitude. Interestingly, both bias measures resulted in their largest absolute values for the distance rule where n = 2, thus indicating a poorer performance by the squared-error function than any of the others. Its sample variance also was highest. Bias among the m i n - m a x and product based functions was small by contrast, as were their variances. A N O V A was performed to assess the relative influence of error function, skewness, and table shape on Av and By. In neither case was the total variance accounted for high (17.5% for Bv and 18.2% for Av). Error function was by far the greatest influence, explaining 10.5% of the variance for By and 12.8% for Av. By contrast, error function alone accounted for 61.2% of the variance in [7 standard deviation, with skewness and table shape bringing this percentage up to 73.5%. In some error functions, population 17 was correlated with bias. Table 6 shows this tendency to be greatest for the HLR, Ratio Rule, and Arithmetic Rule product Table 6 Correlations between bias and population 17 Error function

Pearson's r

1

-0.57

2 3 4 5 6 7 8

0.15 -0.01 -0.13 -0.33 0.26 0.39 0.19

Note: Error functions numbered as in Table 5.


When population ∇ was included as a covariate, the variance in B_∇ explained by error function fell to 7.0%, although for A_∇ it declined only to 11.5%. The bias values in Table 5 have been adjusted for the influence of population ∇. Obviously, it is desirable for an estimator's bias not to be influenced by the magnitude of the parameter being estimated.

The maximum possible value of ∇ attainable for a given table was drastically influenced by error function, as Table 5 shows. The range of maximum ∇-values extended from 0.073 to 0.913 under one set of conditions. Furthermore, although some error functions generally yield higher maximum ∇ than others, the order was not consistent. Clearly, this is a serious consideration. Unlike bias, the combined influences of error function, table shape, and skewness accounted for fully 98.2% of the variance in maximum ∇. As it turns out, when table shape and skewness are held constant, error function accounts for virtually all the variance of maximum ∇. Table 5 also shows that some error functions' maximum ∇ (especially the Godelian and Ratio Rule, but also HLR and bounded sum to some extent) was decreased by table asymmetry. Owing to the higher precision of the min-max and product functions, this decrease is sufficient to bring the maximum GRE values of the Godelian and Ratio functions down to their level.

Apparently no one function dominates the others on both statistical and predictive criteria. The bounded sum function is perhaps the best compromise candidate, although it is difficult to weigh the squared-error distance function's shortcomings in bias against its superiority in ∇ and GRE. Likewise, the low bias results for the min-max and product functions are offset by their very low GRE- and ∇-values, although again whether this constitutes a problem depends in part on how the researcher interprets 'if-then' propositions. Their poor predictive success rates are somewhat compensated for by their relatively high precision. Table 7 summarizes the performances of the various error functions on translation, logical, statistical and predictive criteria. While a comprehensive synthesis of these results is beyond the scope of this paper, this table should indicate the relevant tradeoffs.

6. Optimization, prediction, and best-fit models

In the HLR framework for PRE measures of association, a rather strict distinction is maintained between 'a priori' and 'ex post facto' prediction analysis. Their a priori predictions specify the entire form and parameters of the prediction in advance of any knowledge about the data, while the ex post facto techniques mentioned in their treatment permit the data to influence nearly all aspects of the prediction being tested. As is well known, ordinary least-squares regression occupies a middle ground, in so far as it specifies the form of the prediction as linear, but allows the parameters of the prediction line to be determined by the joint distribution of X and Y. The framework proposed here also takes a middle ground position by permitting


Table 7
Overall error function performance

                Translation/Logical        Statistical (rank-order)       Predictive (rank-order)
Error           Translation  Logical              Low       Bias/∇pop
function        consist.?    (Piagetian?)  Av  Bv std.dev.  corr.          ∇max  H   GREmax
1               Yes          No            4   6  6         8              4     5   2.5
2               Yes          ?             5   3  6         3              2     5   2.5
3               Yes          ?             8   8  8         1              1     5   1
4               No           No            7   5  6         2              3     5   4
5               No           ?             6   7  4         6              5     5   5
6               Yes          ?             2   2  2         5              7     2   8
7               Yes          ?             3   4  3         7              6     8   7
8               No           ?             1   1  1         4              8     1   6

Note: All rank-orderings are scored in the same (positive) direction, so that a lower number indicates better performance. Error functions are numbered in the same manner as in Table 5.

the researcher to specify the general 'shape' of a prediction domain and the type of error function without having to fix parameters.

To begin with, we observe that the HLR method for maximizing GRE (or maximizing ∇ (H) subject to a limiting condition on H (∇)) will not work for problems in which certain points are required to have arbitrary nonzero weights. They use a sequential algorithm for assigning either 0 or 1 to the w_imjk, which breaks down when the w_imjk are allowed to take on values between 0 and 1. Nor does there appear to be a straightforward optimization method which permits all weights to be assigned arbitrary values. Even if such a method did exist, its results could well prove difficult to interpret in any substantive fashion. However, it is not difficult to construct optimization procedures which allow a linear transformation of the independent variable truth-scale, t'_jk = a t_jk + b, and then maximize GRE by choosing values for a and b. Generally, this reduces to a (non)linear programming problem with the usual constraints:

$$a > 0, \quad a \min(t_{jk}) + b \ge 0, \quad \text{and} \quad a \max(t_{jk}) + b \le 1. \qquad (39)$$

One may also wish to impose a restriction such as b ≤ c, for some constant c > 0, to avoid degenerate solutions in which a becomes arbitrarily close to 0. It is easy to show that this region is always convex. Fig. 2 shows a typical feasibility region for the case where K = M = 1. It is worth noting in this connection that the distance, min-max, and bounded sum error functions require only constraints of the form b ≤ c and a ≤ d, where c and d are positive numbers in [0, 1]. Those two constraints limit the mobility

Fig. 2. A typical feasibility region in the (a, b) plane for the case K = M = 1, bounded by a > 0, a min(t_jk) + b ≥ 0, a max(t_jk) + b ≤ 1, and b ≤ c.

of the hypotenuse boundary on the prediction region to a lesser extent than the constraints in (39). Since within a particular error function GRE is proportional to U − K, the optimization problem for, say, finding the best location for an if-then prediction domain reduces to maximizing U − K subject to the constraints given above. In the simplest case, the product function under the Arithmetic Rule, we have a linear function in a and b:

$$U - K = a \sum_i \sum_j P_{i\cdot} P_{\cdot j} \nabla_{ij}\, t_j (1 - t_i) + b \sum_i \sum_j P_{i\cdot} P_{\cdot j} \nabla_{ij} (1 - t_i). \qquad (40)$$
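Because (40) is linear in a and b over a convex feasible region, the maximum lies on the boundary and even a brute-force search suffices. A sketch (our illustration, assuming strictly positive marginals so that ∇_ij is defined for every cell):

```python
import numpy as np

def optimize_ab(P, t_i, t_j, c=0.20, steps=101):
    """Grid-search a, b maximizing (40) subject to (39) and b <= c."""
    P = np.asarray(P, dtype=float)
    P = P / P.sum()
    Pi = P.sum(axis=1, keepdims=True)
    Pj = P.sum(axis=0, keepdims=True)
    V = 1.0 - P / (Pi * Pj)                   # per-cell component, as in (40)
    ti = np.asarray(t_i, dtype=float)[:, None]
    tj = np.asarray(t_j, dtype=float)[None, :]
    # U - K = a*S1 + b*S2 for the Arithmetic Rule product error function.
    S1 = (Pi * Pj * V * tj * (1.0 - ti)).sum()
    S2 = (Pi * Pj * V * (1.0 - ti)).sum()
    best = (-np.inf, None, None)
    for a in np.linspace(0.01, 1.0, steps):   # a > 0
        for b in np.linspace(0.0, c, steps):  # b <= c
            if a * tj.min() + b >= 0 and a * tj.max() + b <= 1:
                val = a * S1 + b * S2
                if val > best[0]:
                    best = (val, a, b)
    return best                               # (U - K, a, b)
```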

Using the cancer data from Table 4, with the constraints a(0.06) + b ≥ 0, a + b ≤ 1, and b ≤ 0.20, we obtain a = 0.80, b = 0.20, U − K = 1.503, and ∇ = 0.816. The other error functions produce nonlinear equations for U − K, of course, and require direct-search methods. The generalization to multiple independent variables is straightforward, and will not be discussed here. An additional generalization, however, permits optimization on a parameter which varies the error function itself, and this deserves further comment. Classes of error functions exist in which the extremity of the 'or' and 'and' operators varies as a function of a continuous-valued parameter (e.g. Yager, 1980). Smithson (1984) defined a simple class in which extremity varies linearly in the parameter q, and which includes the Arithmetic Rule min-max and bounded sum operators as special cases. The resulting error function for this class has the following form:

$$w_{imjk} = \min(t_{jk}, 1 - t_{im}) - (q/2)\left(1 - |t_{jk} - t_{im}| - |1 - (t_{jk} + t_{im})|\right). \qquad (41)$$

Allowing a linear transformation for the t_jk as described above, we obtain a nonlinear optimization problem in a, b, and q, in which the objective function becomes (U − K)/Umax, accounting for the fact that we are moving among different error functions so that Umax changes as a function of q.
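A one-line sketch of the family (41); the sign of the q-term follows the reconstruction above, chosen so that q = 0 recovers the Arithmetic Rule min-max weights and q = 1 the bounded sum weights:

```python
def w_q(tjk, tim, q):
    """One-parameter family (41): q = 0 gives AR min-max, q = 1 bounded sum."""
    return min(tjk, 1 - tim) - (q / 2) * (1 - abs(tjk - tim) - abs(1 - (tjk + tim)))

assert w_q(0.7, 0.5, 0) == min(0.7, 1 - 0.5)                    # min-max weight
assert abs(w_q(0.7, 0.5, 1) - max(0.0, 0.7 - 0.5)) < 1e-12      # bounded sum weight
```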

7. Conclusions

A general framework has been provided for PRE measures of association for numerical variables. This framework permits virtually any prediction domain to be specified and tested, either fully a priori or with parameters determined ex post facto. As was remarked in connection with multivariate models using logical 'and' and 'or' for numerical variables (Smithson, 1984), we suffer an embarrassment of riches in terms of the variety of ways of measuring prediction error. No error function emerges as the most preferred, although the Godelian and Ratio Rule functions seem to perform the most poorly.

Extensions of the bivariate case to multivariate causal analysis (à la multivariate linear regression) await a more advanced treatment than is provided here. While the generalization of the optimality problems discussed in Section 6 is straightforward, error-accounting for the multivariate case is not, and the issues raised thereby must be reserved for a paper devoted to them alone.

References

K.K. Carroll, Experimental evidence of dietary factors and hormone dependent cancers, Cancer Research 35 (1974) 3377.
H.L. Costner, Criteria for measures of association, American Sociological Review 30 (1965) 341-353.
D. Dubois and H. Prade, Fuzzy Sets and Systems: Theory and Applications (Academic Press, New York, 1980).
L. Guttman, An outline of the statistical theory of prediction, in: P. Horst et al., eds., The Prediction of Personal Adjustment (Social Science Research Council, New York, 1941) pp. 213-258.
D.K. Hildebrand, J.D. Laing and H. Rosenthal, Prediction Analysis of Cross Classifications (Wiley, New York, 1977).
G.C. Homans, Social Behavior: Its Elementary Forms (Harcourt Brace Jovanovich, New York, 1961).
R.E. Maydole, Many-valued logic as a basis for set theory, Journal of Philosophical Logic 4 (1975) 269-291.
M. Mizumoto, Fuzzy reasoning with a fuzzy conditional proposition 'if...then...else...', in: R.R. Yager, ed., Fuzzy Set and Possibility Theory: Recent Advances (Pergamon, New York, 1982) pp. 211-223.
E.N. Muller, A test for a partial theory of potential for political violence, American Political Science Review 66 (1972) 928-959.
N. Rescher, Many-valued Logic (McGraw-Hill, New York, 1969).
M. Smithson, Multivariate analysis using 'and' and 'or', Mathematical Social Sciences 7 (1984) 231-251.
M. Smithson, Translatable statistics and verbal hypothesis, Quality and Quantity 19 (1985) 183-209.


H. Tanaka, T. Tsukiyama and K. Asai, A fuzzy system model based on the logical structure, in: R.R. Yager, ed., Fuzzy Set and Possibility Theory: Recent Advances (Pergamon, New York, 1982) pp. 257-274.
E. Trillas, C. Alsina and L. Valverde, Do we need max, min, and 1 - j in fuzzy set theory?, in: R.R. Yager, ed., Fuzzy Set and Possibility Theory: Recent Advances (Pergamon, New York, 1982).
R.R. Yager, On a general class of fuzzy connectives, Fuzzy Sets and Systems 4 (1980) 235-242.
L.A. Zadeh, Calculus of fuzzy restrictions, in: L.A. Zadeh et al., eds., Fuzzy Sets and their Applications to Cognitive and Decision Processes (Academic Press, New York, 1975) pp. 1-39.
L.A. Zadeh, Precisiation of human communication via translation into PRUF, Memo No. UCB/ERL M79/73 (Electronics Research Laboratory, University of California, Berkeley, 1979).