Copyright © IFAC Modeling and Control of Economic Systems, Klagenfurt, Austria, 2001
AN ECONOMETRIC PROBLEM AND A HEURISTIC-SEMIOTIC SOLUTION

Ana Marostica

School of Economics, University of Buenos Aires, Buenos Aires, Argentina C1120AAQ.
Tel: +54 11 4815 6548. E-mail: [email protected]
Abstract: The puzzling two-faced aspect of probabilities (deductive and inductive) is inherited, first by statistics and second by econometrics. This paper presents two econometric inductive inferences: one related to inflation indices, where the passage from sample to population is not correct, and another, based upon the analysis of household survey data from developing countries, where the generalization to the population is accurate. These examples show how a heuristic procedure, such as semiotic data mining, can be considered a solution to the problem of inductive inferences in econometrics. Copyright © 2001 IFAC

Keywords: Probabilities, econometric inferences, semiotic data mining, technique for ambiguous signs
1. INTRODUCTION
This paper is organized as follows: a discussion of the double aspect of the concept of probability, namely deductive (probability of events) and inductive (probability of inductive inferences), is given in Section 2. Section 3 deals with the way econometrics uses certain methods inherited from statistics. A heuristic-semiotic solution for the problems that econometricians have encountered in the process of scientific inquiry is supplied in Section 4. The same section presents the concluding remarks of this paper.
2. PROBABILITIES AND INDUCTION

Probability is an ambiguous term because it has different meanings within the same context. The best way of characterizing probabilities is to distinguish between two types. The first type is the probability of events in propositions; let us call this type probability_e. Probability_e states the ratio of the number of members of a class that have a certain property to the total number of members in that class. This is the calculus of chance, a branch of pure mathematics. This type of mathematical theory is deductive because the form is enough to guarantee its correctness. Among the famous theorems of the calculus of probability_e, there is an interesting equation due to Thomas Bayes (1702-61). Although Bayes' Theorem can be developed rigorously from the basic laws of probability_e, it is one of the most controversial topics of probability.

The second type of probability is the one called "probability of inferences" (i.e. probability_i). When a scholar wants to analyze the probability of inferences, he must remember that, following Charles Peirce (Marostica, 1998), there are three main types of inferences: abduction, deduction and quantitative induction. In scientific research, the most important type of inductive argument is inference from sample to population (e.g. statistical generalization). In that type of inference, probability_i is applied to the problem of estimating a population from a sample (or samples). There, an estimation rule is used in order to relate the two, sample(s) and population. If the population is finite, we can place an upper bound on the error, |P - s|, of the recommended estimate.
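A minimal worked form of this bound (an illustration added here, assuming the recommended estimate s is the sample proportion): if k of the n examined members of a population of size N have the property, so that s = k/n, then the unexamined N - n members can contribute anywhere from 0 to N - n further cases, and therefore

    P \in \Big[ \tfrac{k}{N}, \; \tfrac{k+N-n}{N} \Big], \qquad
    |P - s| \le \max\Big( \tfrac{k}{n} - \tfrac{k}{N}, \; \tfrac{k+N-n}{N} - \tfrac{k}{n} \Big).

The bound itself is purely deductive; it says nothing about which value of P is probable, which is exactly where the inductive step begins.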
In inductive generalizations, there is no certainty that the conclusion is true even though the premise, which expresses information related to the sample set, is true. One is tempted to argue that unexamined instances of the population can be expected to conform to a rule because other supervised cases of the population have been found to conform to it. Since in this type of inductive inference the form is not enough (in contrast to deduction), the grounds for thinking that the unexamined cases in a population will conform to the sample information must be looked into each time. Here a scientist wrongly takes for granted that the data are always uniform.

In inductive inference from sample to population, the type of sample scientists get is very important, because the sample is related to the premise of the inference and this is the only support the conclusion can obtain. The conclusion is, of course, related to the population. The criteria that good samples must meet in order to avoid fallacies are: to be random, large enough, and varied enough. The explanation some econometricians give in cases related to the distributions of samples and populations is curious. Using a statistical theorem such as the Central Limit Theorem (Gujarati, 1995) as a method of investigating limiting distributions in large samples is useful. However, if econometricians believe that the proof of a theorem gives deductive certainty to this topic, they are wrong. When someone is dealing with samples and populations in an inductive inference, a probability distribution gives only estimates of the behavior of the population based on the data examined in a sample. The larger the sample, the better the estimate.

Coming back to the transformation of Bayes' Formula into a special case of confirmation theory, the formula there becomes a rule. This is not probability_e any more; it is now probability_i. A scientist is dealing with an inductive probability where the form does not guarantee the correctness of the conclusion. If he wants to use it as a method of inference, he must face serious problems when the antecedent probabilities are unknown.
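Written out in its standard form (added here for reference), the rule in question is Bayes' Theorem,

    P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)},

read inductively: P(H | E) is the degree of confirmation of a hypothesis H by evidence E. The antecedent (prior) probability P(H), and with it P(E), must be supplied before the rule can be applied, which is precisely the difficulty noted above when those probabilities are unknown.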
3. PROBABILITIES, STATISTICS AND ECONOMETRICS
Econometrics, being the offspring of statistics and economics, is, broadly speaking, the science that investigates statistical methods to analyze economic data and draws economic inferences based upon the analysis of economic phenomena. Although mathematical statistics provides many of the tools used by econometricians, they often need special methods in view of the nature of most economic data, namely, that data are not generated as results of controlled experiments. The data econometricians use are part of the evolving real world. The econometrician depends on data that cannot be controlled directly. Thus, data on consumption, income, investment, saving, prices, etc., which are collected by public and private agencies, are not experimental data. Moreover, such data are likely to contain errors of measurement, and the econometrician must develop special methods to deal with such errors.
A common topic that can be briefly mentioned here is regression to the mean or, simply, regression analysis, another tool that econometricians have taken from statistics, and the main tool used to obtain estimates. In econometrics, there are two concepts in regression analysis that are very important for this paper, namely, the population regression function (PRF) and the sample regression function (SRF). They are both related to inductive inferences. The form of the PRF is an empirical matter. The only way of estimating the PRF is on the basis of sample information. Analogous to the population regression function that underlies the population regression line, the concept of the sample regression function (SRF) can be developed to represent the sample regression line.
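In the simple two-variable case of Gujarati (1995), for instance, the two functions can be written as

    PRF:  E(Y \mid X_i) = \beta_1 + \beta_2 X_i
    SRF:  \hat{Y}_i = \hat{\beta}_1 + \hat{\beta}_2 X_i,

where \beta_1 and \beta_2 are the unknown population parameters and \hat{\beta}_1, \hat{\beta}_2 are their estimates computed from the sample. The passage from the second line back to the first is exactly the inductive step discussed in Section 2.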
Small samples can be the origin of problems in econometric estimations. For example, in an article considering the empirical evidence connecting inflation to its higher-order moments, it is shown that the sample mean-skewness correlation suffers from a small-sample bias (Bryan and Cecchetti, 1999). Sometimes, time-series estimates taken from small samples might be misleading. In addition, the estimators obtained from regression are BLUE (i.e. best linear unbiased estimators), which assumes, sometimes wrongly, that the data are already normally distributed. In small samples, the BLUE estimators may induce a completely wrong conclusion.

However, in unbiased large samples, such as the analysis of household survey data from developing countries like India, Pakistan, South Africa, Thailand, etc., between 1976 and 1990, light was cast on a range of policy issues (Deaton, 2000). Household surveys collect information on who (men or women, young or old) buys what goods and services and how much they spend on them. Deaton makes a good inductive generalization by using survey data to measure welfare, poverty, and the distribution of real income in those countries, based on well-collected data.
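A minimal simulation sketch of the contrast between the two cases (the data below are synthetic, not the Bryan-Cecchetti or Deaton datasets; numpy is assumed to be available):

    import numpy as np

    rng = np.random.default_rng(0)

    # A skewed, decidedly non-normal "population", e.g. household incomes.
    population = rng.lognormal(mean=0.0, sigma=1.0, size=200_000)
    true_mean = population.mean()

    def estimator_spread(sample_size, replications=300):
        """Bias and spread of the sample-mean estimator at a given sample size."""
        estimates = np.array([
            rng.choice(population, size=sample_size, replace=False).mean()
            for _ in range(replications)
        ])
        return estimates.mean() - true_mean, estimates.std()

    for n in (20, 200, 20_000):
        bias, spread = estimator_spread(n)
        print(f"n = {n:6d}   bias = {bias:+.4f}   spread = {spread:.4f}")
    # Small samples scatter widely around the population value; only the large
    # samples support a reliable inductive generalization, and even then the
    # result is an estimate, not a deductive certainty.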
4. HEURISTIC-SEMIOTIC HELP

The author has argued elsewhere (Marostica, 1998) that, in a Peircean approach, the process of scientific inquiry starts with the
observation of a surprising (or anomalous) fact. After that, the scientist builds a model (i.e. a framework) within which he will present the best explanatory hypothesis for that fact (the iconic abductive part where the explanation resembles the data). Then, in that model he tries to obtain all possible predictions related to that explanatory hypothesis (the symbolic deductive part). Finally, he makes an inductive checking of those predictions, and obtains a degree of probability (the indexical inductive part of the process).
Econometricians have several reasons for constructing a model (i.e. a set of econometric equations) related to their research in the scientific world. The first one is simplification. They want to construct a "fruitful" model by selecting, according to their view, the most important variables correlated to the problem they are trying to solve. In econometrics, the model inductively predicts the unknown data of the population based on the selected data of samples. The only checking part in that process is that of hypothesis testing, where they estimate (most of the time using regression analysis) some parameters of the unknown population.

According to the semiotic view, every set of data is a sign (S), which has an object (O) and an interpretant, or interpretation, (I). With Peirce's semiotic trichotomies (Marostica, 1998), scientists can give an exhaustive classification of signs. The advantage of this qualitative approach is that there exists only an initial finite set of possibilities to match with real-world information (or data). In the triadic process of semiosis (O-S-I), someone can induce two dyadic relationships: (S-O), or general denotation, and (S-I), or meaning. Related to meanings, one of the main problems of natural languages is ambiguity. In ambiguity, some scientific terms have more than one meaning in the same context (i.e. the relationship between the sign and its context is not a one-one mapping). The other type of dyadic relation (i.e. S-O) involves, in science, the relationship between patterns and relations as signs (e.g. in a sample) and real patterns and relations in the unknown data (i.e. objects of the population).

The reason why a good abduction correctly explains the anomalous fact is due to the iconic property that the process has. The predictions the scientist performs in the process of scientific inquiry are all valid because prediction is a deductive process. Therefore, they could have true or false conclusions, depending on what type of premises the inferences have. However, the inference is still valid because the form of the process is the guarantee of the validity. Only indexical checking with the data can tell us the degree of confirmation of the explanation given in the abductive part of the process of scientific inquiry. In Figure 1 all the components of this process are summarized from a semiotic point of view.

[Figure 1: a diagram relating the actual fact, iconic abduction, symbolic deduction (prediction), indexical induction, and the scientific world.]
Fig. 1. The Process of scientific inquiry.

The expression "data mining" represents any method of extracting patterns and relations out of raw data. Any such method helps to generate models that represent the structure implicit in a database. Most data-mining methods are based on the utilization of statistical methods, which, accordingly, obtain probabilistic descriptions of the information at hand. Semiotic data mining is an alternative but complementary method to the statistical type. It is based on the application of heuristic methods to the task of extracting patterns and relations from data. In these methods one of the important procedures is to check, first, the meaning of expressions and types of objects (i.e. data) involved.
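As a rough sketch of that first procedure (the catalogue of meanings and the variable names below are hypothetical, introduced only for illustration), a semiotic pre-check in Python might look like this:

    from collections import Counter

    # Hypothetical catalogue of meanings for the terms appearing in a data set.
    MEANINGS = {
        "significance": ["statistical significance", "economic significance"],
        "inflation":    ["CPI inflation rate"],
        "income":       ["household disposable income"],
    }

    def ambiguous_terms(terms):
        """A term is ambiguous if it carries more than one meaning in the same context."""
        return [t for t in terms if len(MEANINGS.get(t, [])) > 1]

    def semiotic_mine(records, terms):
        """Refuse to extract patterns until every term has a single, fixed interpretation."""
        problems = ambiguous_terms(terms)
        if problems:
            raise ValueError(f"Disambiguate before mining: {problems}")
        # Only after the meaning check does ordinary pattern extraction start;
        # a frequency count stands in here for a real mining step.
        return Counter(tuple(record[t] for t in terms) for record in records)

    # semiotic_mine(records, ["inflation", "income"]) runs; adding "significance"
    # to the list of terms raises an error until its meaning is fixed.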
The first thing to do is to determine the exact meaning of the terms involved in the population. For example, econometricians must clarify the meaning of "significance" in the economic context (McCloskey and Ziliak, 1996). This ambiguous term, in statistical usage, means a characteristic of the population from which the sample is drawn, regardless of whether the characteristic is important or not. However, "significant" in economics means magnitudes (or parameters) that are important or scientifically reasonable (i.e. large enough to matter for policy or science). McCloskey and Ziliak (1996) report that 70 percent of the authors of papers in the American Economic Review do not distinguish statistical from economic, policy, or scientific significance. The author of this paper thinks that this difference is important because econometricians are dealing with real data, and a misinterpretation could cost millions of dollars or put 100,000 lives at risk.
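A toy numerical illustration of the distinction (the coefficient and standard error below are invented, not taken from McCloskey and Ziliak):

    # A precisely estimated but economically negligible effect.
    beta_hat = 0.0002    # estimated response of demand to a policy variable
    se_beta  = 0.00004   # very small standard error, e.g. from a huge sample
    t_stat = beta_hat / se_beta
    print(f"t = {t_stat:.1f}")   # t = 5.0: "significant" at any conventional level
    # The coefficient is statistically distinguishable from zero, yet its magnitude
    # is far too small to matter for policy: statistical significance does not
    # establish economic significance.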
Scholars know that all the econometric concepts involved in a specific task are signs that belong to a determinate semiotic tree (Marostica, 1998). Therefore, they can simplify a tree by canceling the branches containing ambiguous terms. For this operation, they may use a modification of the heuristic technique "divide and query" that Shapiro and
Sterling used in PROLOG (Shapiro and Sterling, 1986). Using this technique, scientists must eliminate all the ambiguous signs. An example of this checking for the general n-tree structure is shown in Figure 2. In that figure, the node indices satisfy k < j < i < a (Marostica, 1998).
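A toy version of the pruning step (the tree and labels below are illustrative only; Shapiro and Sterling's original "divide and query" is an interactive diagnosis procedure for PROLOG programs, adapted here very loosely):

    # Prune every branch of an n-ary semiotic tree that is rooted at an ambiguous sign.
    class Node:
        def __init__(self, sign, ambiguous=False, children=None):
            self.sign = sign
            self.ambiguous = ambiguous
            self.children = children or []

    def prune_ambiguous(node):
        """Return a copy of the tree with all ambiguous branches cancelled."""
        if node.ambiguous:
            return None
        kept = (prune_ambiguous(child) for child in node.children)
        return Node(node.sign, False, [c for c in kept if c is not None])

    # Hypothetical labels in the spirit of Figure 2:
    tree = Node("S(a)", children=[
        Node("S((a-1)-2)"),
        Node("S((a-1)-3)", ambiguous=True),   # ambiguous sign: branch to be cancelled
        Node("S((a-1)-k)"),
    ])
    pruned = prune_ambiguous(tree)
    print([child.sign for child in pruned.children])   # ['S((a-1)-2)', 'S((a-1)-k)']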
[Figure 2: an n-ary tree whose nodes S((a-1)-2), S((a-1)-3), ..., S((a-1)-i), ..., S((a-1)-k) are marked either as precise signs or as ambiguous signs; the ambiguous branches are the ones eliminated.]

Fig. 2. Elimination of ambiguous signs.
After econometricians fix the meaning of the terms involved in the population and the sample, they must proceed to determine the relationship between signs and their objects in the sample data. There, the identification of patterns and relations is very important. Econometricians can prove theorems related to those tools. However, if they want to use those proofs to increase the strength of inductive methods, this would be incorrect, because it is a confusion between deduction and induction. In induction, as in inductive econometrics, they do not find deductive certainty; all the better methods are heuristic in nature. Generally, each heuristic applies only to specific situations, and the econometrician needs some way of looking at the problem of how to pass from specific sample data to the corresponding population data, and of determining which heuristic might be helpful. If econometricians find familiar patterns and relations in samples, they can use a batch procedure, instead of an on-line one, in the population part for all the similar cases of patterns and relations they find in the sample part. Therefore, since economic data and meanings evolve, econometricians must look for, in a further step, the new meanings related to the evolved population data found there. Then they must find the patterns and relations in the population, and the process continues.

In conclusion, the proofs of theorems in econometrics have been used as further support for inductive methods when econometricians want to infer some estimate for the data of the population based on some characteristic of samples. All these proofs are useless in the passage from sample to population, because this is an inductive process where the form is not a guarantee of validity. Semiotics, with its heuristic methods, is a better guarantee for this type of inference. When statisticians and econometricians present proofs of theorems in the passage from samples to populations, they assume, wrongly, that deduction and induction are essentially the same and that there is no real separation between these two parts of logic. The integration of semiotic data mining into the scientific process performed by econometricians will be useful in short-run procedures.
ACKNOWLEDGEMENTS

I wish to thank Dr. Daniel Heymann and Dr. Fernando Tohme for comments and bibliographic hints for this paper.
REFERENCES

Bryan, M.F. and S.G. Cecchetti (1999). Inflation and the distribution of price changes. The Review of Economics and Statistics, 81, 188-196.

Deaton, A. (2000). The Analysis of Household Surveys. The Johns Hopkins University Press, Baltimore.

Gujarati, D.N. (1995). Basic Econometrics. McGraw-Hill International Editions, New York.

Marostica, A. (1998). Semiotic trees and classifications for inductive learning systems. In: Semiotics 1998 (C.W. Spinks, Ed.), Chap. 11, pp. 114-127. Peter Lang, New York.

McCloskey, D.N. and S.T. Ziliak (1996). The standard error of regressions. Journal of Economic Literature, 34, 97-114.