Mathematical Social Sciences 6 (1983) 227-246
North-Holland

AN INFORMATION THEORETIC MODEL OF BOUNDED RATIONALITY

Larry A. CHENAULT and Gerald E. FLUECKIGER
Department of Economics, Miami University, Oxford, OH 45056, U.S.A.

Communicated by P.S. Albin
Received 29 August 1983

Bounded rationality forces an entity to divide its responses between those it does immediately (called adaptedness) and those it does with delay (called adaptability). Information theory is used to show how coding is related to the problem of choosing between adaptedness and adaptability.

Key words: Bounded rationality; coding theory; information theory; finite automata; adaptation.
1. Introduction

Adaptation involves matching, fitting or adjusting an entity to its environment so that, in game-theoretic terms, when the environment makes a 'move' the entity makes an appropriate 'response'. Viewed in this way adaptation is closely related to learning, achieving control and survival, since it refers to all those means by which an entity maintains its critical variables (e.g. profits, inventory levels, calorie intake) within tolerable bounds.

This paper makes a distinction between two forms of adaptation: adaptedness and adaptability. Adaptedness with respect to some contingency is said to occur if the entity responds correctly and immediately to the contingency. If the entity's rationality were unbounded, then it could have a response ready for every possible move the environment might make. When every contingency is known and matched with a predetermined response, the entity is said to be completely adapted to its environment. If an entity possesses bounded rationality, then for some contingencies it would not have matching responses in 'inventory'. Adaptability with respect to some contingency is said to occur when an entity responds but only after some delay. The delay is required since the entity must in some sense 'manufacture' the appropriate response. Bounded rationality requires an entity to exhibit some degree of adaptability. Choosing the proper mix of adaptedness (ready responses) and adaptability (learned or synthetic responses) requires a priority scheme, since there is a trade-off between these two modes of adaptation.

The situation just described is very much like what happens in a communication system (see Shannon and Weaver, 1948). The problem of communication is how to
transmit a message from a source through a channel to a destination. Often the message as it appears at the source is 'large' relative to the 'small' channel. Here 'large' means that the message is selected from a large set of alternatives (or alphabet) and 'small' means that the channel can cope with only a small number of alternatives per time period. In other words, the channel has limited or bounded capacity. Large messages can get through small channels if the original message is broken down into a sequence of smaller messages. This breaking down and repackaging of messages is called coding. Since all messages cannot pass over the channel in one step, some priority scheme is required to determine which messages are sent first and which messages are sent after some delay.

An optimal code is one that maximizes the rate at which information passes over a channel. In information theory, as that subject was developed by Shannon, the information content of a message depends on the frequency with which that message is received. Relative frequencies then yield a priority scheme. Shannon chose this priority scheme because it is appropriate for the engineering or technical problem of communication. From this viewpoint, the communication problem is one of maximizing the rate at which information is transmitted with no regard given to its value. The hallmark of an economic problem, however, is that some contingencies are more important than others. A little information may be of more net value than a lot.

This paper uses Shannon's information theory model to examine the implications of bounded rationality in an economic setting. We form this bridge by allowing the value as well as the information content of a message to determine an optimal code.
2. An entity, its environment and its catalogue

A productive entity and its catalogue are the primitive concepts we develop. A productive entity, which may be a worker or a firm, is characterized by the names that appear in its catalogue c = {c_1, ..., c_n}. Each name c_i refers to a product the entity is willing to make or an activity it is willing to undertake. The entity's environment, or customers, Σ, is linked to the entity, e, via the catalogue. The names listed in the catalogue, however, are used by the environment and the entity in different ways. The environment uses the catalogue to compose orders for products. The entity uses the catalogue to invoice the products it actually makes. In real life, a customer writes a set of names on an order form and expects those names - and only those names - to appear on the corresponding invoice. Typically an order is for more than one product. For a catalogue c = {c_1, ..., c_n} let C = {C_0, ..., C_N} be the set of all subsets of c. An order for C_i (a message) sent by the environment is denoted X_i; the entity's perception of X_i is denoted Y_j; and the product actually made, hence invoiced, by the entity is denoted Z_k. If X_i ≠ Y_j,
then an error in perception has occurred, and if Y_j ≠ Z_k, then an error in execution has occurred. In summary,

c = {c_1, ..., c_n} is the entity's catalogue;
C = {C_0, ..., C_N} is the set of all subsets of c;
X = {X_0, ..., X_N}, where X_i is an order for C_i;
Y = {Y_0, ..., Y_N}, where Y_j is the entity's perception of X_i;
Z = {Z_0, ..., Z_N}, where Z_k is the resultant product.
Viewed naively, an entity is a 'black box' with a set of inputs, X, and a set of outputs, Z. Let X* (Z*) denote a sequence of elements from X (Z). Since X* is a 'message' sent by the source Σ to e, and Z* is a message passed on from e, it is possible to view an entity as a communication system. The system works well if the message sent by e matches the message sent to it by Σ; that is, if X_i = Y_j = Z_k. When the set X is 'large' relative to e's capacity, then Σ can send messages faster than the entity. The rate at which messages pass through the channel e must then be slowed down. In such a situation we say that the entity possesses bounded rationality. In the next section we will discuss coding. Briefly, coding is concerned with getting 'large' messages through 'small' channels efficiently.

Ever since Shannon formalized the subject, researchers in many disciplines have tried to apply his theorems on information theory to various systems. Although all kinds of systems engage in communication, Shannon's theorems have seemed to be better fitted to 'pure communication devices' than to 'operational systems'. Our belief is that in order to apply information theory to these other systems, one must isolate the unit within that system that exhibits the properties of a pure communication system. For economic systems, we believe that a productive entity is one such unit. Describing a productive entity - an entity that receives orders (messages) for products and sends out finished products (invoices) - as a communication system is a major objective of Sections 5-9. The purpose of the next two sections is to present the necessary concepts from information theory.
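The bookkeeping of this section is easily made concrete. The following minimal sketch (ours, not from the paper; the names are purely illustrative) represents a catalogue, an order as a subset of catalogue names, and the two kinds of error just distinguished:

    # A sketch of the Section 2 bookkeeping: a catalogue c, an order X_i,
    # the perceived order Y_j, and the invoiced product Z_k.
    catalogue = {"c1", "c2", "c3"}            # c = {c_1, ..., c_n}

    order = frozenset({"c1", "c3"})           # X_i: a subset of the catalogue
    perceived = frozenset({"c1", "c2"})       # Y_j: e's perception of X_i
    invoiced = frozenset({"c1", "c2"})        # Z_k: what is actually made

    if order != perceived:
        print("error in perception:", sorted(order ^ perceived))
    if perceived != invoiced:
        print("error in execution:", sorted(perceived ^ invoiced))
    if order == perceived == invoiced:
        print("the system works well: X_i = Y_j = Z_k")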
3. Variety and coding

This section introduces, in an elementary way, the idea of a code and shows how the concept of variety enters in. In the next section the idea of variety is generalized to the idea of information by introducing probabilities. The purpose of the following example is to intuitively develop the important idea that a communication system conveys information or, as in this simple case, variety. For the system to work, variety must be maintained at every stage. The material manifestation of the messages can be arbitrarily changed (e.g. from lights to cards in the following
example) as long as the variety is maintained. A central problem is to find the amount of variety and reduce it to a number.

Imagine three lights lined up on a board. Here the variety is three, or the alphabet is of size three. If each light can be either on or off independently of the others, then 2^3 = 8 different messages can be represented. The log to the base 2 of the number of messages (log 8 = 3) is a measure of the amount of information in the array of lights. Suppose, however, that the array of three lights is constrained so that only four of the eight possibilities are exhibited. For instance: all on (111); all off (000); first one on (100); and last one on (001). The number of messages in the constrained set is four, so that log 4 = 2.

Suppose that an information source (or an environment) is represented by the array of three lights whose configurations are constrained to four. Next, suppose that the environment wants to send these four messages via an intermediary, called a channel, that is an array of two cards. Each card is black on one side (B) and white on the other (W). The question is whether the messages generated from the three constrained lights can be represented by the cards. The answer is yes, since the constrained variety of the lights, log 4 = 2, is equal to the unconstrained variety of the cards, log 4 = 2. The following code accomplishes the task:

(111) → (BB),    (000) → (WW),    (100) → (BW),    (001) → (WB).

If each configuration of lights or cards represents one step or one time period, then the channel processes the sequences produced by the environment without delay. Coding involves finding the actual amount of variety (or information) in the message generated by the environment (in the above example (111), (000), (100), (001), in contrast to the maximum possible of eight configurations) and then asking what channel, if completely unconstrained, is just sufficient to represent the variety (or information) in the message. Coding then seeks to match up or represent the variety of the environment with variety in the channel. Codes are optimal if the matching is economical in the sense of using as small an alphabet for the channel as is possible.
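The example can be checked mechanically. A minimal sketch (ours) encodes the four constrained light configurations into card pairs and confirms that no variety is lost in the round trip:

    import math

    light_messages = ["111", "000", "100", "001"]   # the constrained source
    print(math.log2(len(light_messages)))           # 2.0 bits of variety

    # The code given in the text, and its inverse for decoding.
    code = {"111": "BB", "000": "WW", "100": "BW", "001": "WB"}
    decode = {card: light for light, card in code.items()}

    message = ["111", "001", "000", "100"]
    signal = [code[m] for m in message]             # ['BB', 'WB', 'WW', 'BW']
    assert [decode[s] for s in signal] == message   # variety is maintained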
4. Information and coding
This section gives a formal definition of information and expands on the idea of an optimal code. The measure of information differs from the measure of variety in that the messages are weighted by their probabilities. Thus, a set of four elements with unequal frequencies could represent the same amount of information as a set with two elements whose frequencies were more nearly equal. When the amount of
information in each set is the same, a code can be devised so that the smaller set can be used to represent the larger set.

Let Σ be an information source that emits symbols from an alphabet X = {X_0, ..., X_N}. Each symbol X_i is an order for products from the catalogue c = {c_1, ..., c_n}. For a given probability distribution p(X) over X, the amount of information identified with the source is given by

H(X) = -Σ_{i=0}^{N} p(X_i) log p(X_i).

H(X) measures the average uncertainty in a sequence of symbols, where -log p(X_i) is the uncertainty of the symbol X_i. H(X) is maximized for a given N when p(X_i) is equal for all i. H(X) also increases with N. Sequences of symbols from X are called messages.

Let e be a channel with an alphabet Y = {Y_0, ..., Y_N}. A central problem in communication theory is the study of how messages originating in Σ can be sent through a channel e and reproduced at some destination. Typically the symbols in Σ's alphabet are physically incompatible with the symbols in e's alphabet. This fact requires that the symbols from Σ be encoded by a transmitter into signals compatible with e. After passing over the channel, signals are decoded by a receiver back into symbols that appear at the destination.

The essentials of the problem are contained in the following example taken from Shannon (1948). Suppose Σ's alphabet has four symbols {X_1, X_2, X_3, X_4} and that p(X_1) = 1/2, p(X_2) = 1/4 and p(X_3) = p(X_4) = 1/8. Then H(X) = 7/4 bits per symbol. Suppose that e's alphabet has two elements or symbols, Y = {Y_0, Y_1}, which can also be thought of as {0, 1}. The question then is this: How can messages from a large alphabet be conveyed by a channel with a smaller alphabet?

The first point to keep in mind is that the channel e does not convey the symbols from Σ but rather the information from Σ. The amount of choices, variety, uncertainty or information exhibited by Σ is given as H(X) = 7/4 bits per symbol. Intuitively this means that it takes on average 7/4 binary digits to identify a symbol from X or that, on average, one binary digit represents 4/7 of a symbol. The second point to remember is that the relative frequencies of the symbols from X are set by the internal constraints of Σ. If every symbol is as likely as any other, then Σ is said to be unconstrained and H(X) is maximized at log(N+1). To the extent that the relative frequencies differ, Σ is said to be redundant. Redundancy reduces the uncertainty or the information. In the extreme H(X) = 0; only one symbol can ever appear and no uncertainty is removed upon receipt of a symbol. In this case no binary digits are required to identify Σ's messages since no information or variety is produced.

The important point here is that it is the statistical structure of the messages generated by Σ that determines H(X). Consequently, this structure is measured by H(X). The whole thrust of coding is to specify how that same amount of structure, organization, variety or information can be represented by a different alphabet of symbols such that the amount of randomness or variation that is unevenly spread among the symbols X_i will be spread as evenly as possible among the symbols from Y. Put differently, we want to preserve the amount of information or variety
contained in a message (sequence of symbols) from Σ but, whereas the uncertainty attached to symbols from Σ varies, we want the uncertainty of each of e's symbols to be as nearly equal as possible. The problem then is to take the total amount of uncertainty of Σ and spread it evenly over as few steps as possible in e.

Imagine the following source-channel pair. Suppose Σ is the source described above and that e can produce sequences of 0's and 1's in any order and with whatever relative frequencies we choose. In other words, e has no constraints. If e produces 0's and 1's with equal frequency, then H(Y) = log 2 = 1 bit per step. Note that H(Y) is thus maximized. One bit per step measures the amount of information that can be passed on by e at each and every step. Or, since a symbol from Y (in this case a 0 or 1) is emitted at each step, one can say that each of e's symbols is worth one bit. H(X) = 7/4 bits per X symbol means that on average a symbol from X represents an amount of information equal to 7/4 bits. What Shannon's noiseless coding theorem establishes is that, since bits of information are common to both alphabets, messages from X can be recast into messages in Y with no loss of information. No loss of information means that the original message can be reconstructed and retrieved out of its coded representative. Since e's bits per step (H(Y) = log 2 = 1) is less than Σ's bits per symbol (H(X) = 7/4), e will take more steps than Σ, but this is just to say that a shorter alphabet, if given more time, can do the job of a longer one. Length, in terms of the number of steps, can be substituted for 'depth'. More steps can be substituted for more variety at each step. We will return to this point below.

Coding is the operation of assigning symbols from X to sequences of symbols from Y so that, on average, the length of the sequences from Y used to represent the sequences from X are as short as possible. In the engineering applications the 'as short as possible' condition means 'as fast as possible' with the presumption that faster is better. The optimal code for the above example is:

X_1 → 0,    X_2 → 10,    X_3 → 110,    X_4 → 111.
Then H(X) = 7/4 = Σ_{i=1}^{4} p(X_i)L_i, where L_i is the length of the code word for symbol X_i. For instance, the typical sequence of eight symbols from X would contain four X_1's, two X_2's and one each of X_3 and X_4. At the heart of coding is the strategy of assigning short code words to frequent symbols and longer code words to less frequent symbols. Thus:

X_1  X_1  X_1  X_1  X_2  X_2  X_3   X_4     eight symbols from X,
0    0    0    0    10   10   110   111     fourteen symbols from Y.
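As a check on the arithmetic of this example, a minimal sketch (ours) computes H(X), the average code length, and the fourteen-digit signal for the typical eight-symbol sequence:

    import math

    p = {"X1": 1/2, "X2": 1/4, "X3": 1/8, "X4": 1/8}
    code = {"X1": "0", "X2": "10", "X3": "110", "X4": "111"}

    H = -sum(q * math.log2(q) for q in p.values())      # entropy of the source
    L = sum(p[s] * len(code[s]) for s in p)             # average code length
    print(H, L)                                         # 1.75 1.75 (= 7/4)

    seq = ["X1"] * 4 + ["X2"] * 2 + ["X3", "X4"]        # the typical sequence
    signal = "".join(code[s] for s in seq)
    print(len(signal), signal.count("0"), signal.count("1"))   # 14 7 7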
On average, (eight symbols from X)/(fourteen symbols from Y) = 4/7 of an X symbol is transmitted at each step. Or 7/4 steps (or 7/4 Y symbols) are required for each X symbol. Note that while the symbols from X occur with different frequencies, those from Y occur with the same frequency: seven zeros and seven ones. This guarantees that H(Y) is maximized, which means that the number of Y symbols used is minimized. Unique decipherability (a matter not taken up here) guarantees that the original symbols from X can be extracted from the signal received from e. Thus, the full amount of information generated at the source is broken down and repackaged into amounts that will just fit the channel. Then the message is reassembled again at the destination.

So far e has been viewed as a 'black box'. To peer inside the 'black box' and discover the workings of the operational system requires some additional concepts. The next four sections are based on Flueckiger (1978).
5. Conceptual foundations of behavior
A goal of the entity is to receive messages from the environment (in the form of orders) and to transmit them (to invoices) as flawlessly as possible. Viewed in this way the environment is an information source - it selects a desired message out of a set of possible messages - and the entity is a communication channel. The capacity of the entity to transmit symbols (from orders to invoices) flawlessly depends on its ability to correctly perceive orders (symbols) and then execute them appropriately.

For a catalogue c = {c_1, ..., c_n}, consider the equivalence relation defined by 'c_i is not perceived to be different from c_j'. Two names from the catalogue are equivalent if the entity does not perceive differences between them. Each way of perceiving the names in the catalogue defines an equivalence relation on c which in turn corresponds to a partitioning of c into disjoint subsets called equivalence classes. Each member of a particular equivalence class is (perceived to be) equivalent to all other members of that class. The finer the partitioning of c, the greater the number of equivalence classes, m, and the smaller the size of each equivalence class. For example, a wine connoisseur who perceives every distinction would put each name in its own equivalence class, whereas a novice may place all names into two equivalence classes - red and white. For n > 1, perception proficiency is measured by the index

π = [(number of perception equivalence classes over c) - 1] / [(number of elements in the catalogue) - 1] = (m - 1)/(n - 1).    (1)
This index ranges from 0, when no distinctions are made, to 1, when every distinction is made.

Execution proficiency is defined over c by the equivalence relation 'c_i is not executed independently of c_j'. Two names are equivalent if the entity does not make the two products (or do the two activities) independently. For example, wool
and mutton are always produced jointly, as are doughnuts and doughnut holes. For n > 1, execution proficiency is measured by the index

ε = [(number of execution equivalence classes over c) - 1] / [(number of elements in the catalogue) - 1].    (2)
This index also ranges from 0, when all products are made (or activities are performed) jointly, to 1, when each product is made (or activity is performed) independently.

In what follows it is assumed that execution proficiency is perfect, that is, ε = 1. This assumption will simplify the exposition without compromising the analysis. Under this assumption the perceived order Y_j is identical to the resultant product Z_j, and all errors are errors of perception. That is, an error occurs if and only if X_i ≠ Y_j. Learning is an improvement in observed behavior that results because perception proficiency, π, has increased. Exactly how this is done is taken up after the idea of behavior has been developed.
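A minimal sketch (ours) of the two proficiency indices, representing a partition as a list of equivalence classes:

    def proficiency(partition, n):
        # Indices (1) and (2): (number of classes - 1)/(n - 1), for n > 1.
        return (len(partition) - 1) / (n - 1)

    catalogue = ["c1", "c2", "c3", "c4"]
    novice = [{"c1", "c2"}, {"c3", "c4"}]        # two classes only
    connoisseur = [{c} for c in catalogue]       # every distinction made

    print(proficiency(novice, len(catalogue)))       # 0.333... (= 1/3)
    print(proficiency(connoisseur, len(catalogue)))  # 1.0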
6. Internal states and behavior

An internal state, or simply a state, of an entity is a partitioning S_a of c. This partition in turn determines the value of the index π. For example, when n = 3, c = {c_1, c_2, c_3} and there are five possible partitions. These are given below, where each group of names separated by semicolons is an equivalence class:

S_0 = {c_1 c_2 c_3}       π = 0,
S_1 = {c_1; c_2 c_3}      π = 1/2,
S_2 = {c_2; c_1 c_3}      π = 1/2,
S_3 = {c_3; c_1 c_2}      π = 1/2,
S_4 = {c_1; c_2; c_3}     π = 1.
Let S = {S_a} be the set of all possible states for e. Each state in S corresponds to a particular pattern of behavior as follows. For a given state S_a and a given order X_i, the entity can only locate a commodity name c_i in X_i up to the equivalence class to which c_i belongs. The perceived order Y_j is a function of the equivalence classes to which the c_i's belong and thus it may differ from the order sent by Σ. Since execution proficiency is assumed to be perfect, Y_j = Z_j. The perceived order Y_j represents the entity's choice of what to make in response to X_i. Mathematically, let λ(S_a, X_i) = Y_j represent this choice. Note that Y_j depends not only on the order X_i, but on the internal state, S_a, as well. If each c_i resides in an equivalence class by itself, then no errors occur and λ(S_a, X_i) = Y_i. However, if there is more than one name in an equivalence class, a mistake is possible and then λ(S_a, X_i) = Y_j, where i ≠ j. Behavior is thought of as a decision rule or mapping λ: S × X → Y.
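A minimal sketch (ours) of the mapping λ, using the 'all-or-nothing' reading of equivalence that Section 11 makes explicit: an order is perceived as the union of the equivalence classes its names touch.

    def behave(state, order):
        # lambda(S_a, X_i): the union of the classes the order touches.
        return frozenset().union(*(cls for cls in state if cls & order))

    S1 = [{"c1"}, {"c2", "c3"}]          # the state S_1 above (pi = 1/2)
    print(sorted(behave(S1, {"c1"})))    # ['c1']        -- no error
    print(sorted(behave(S1, {"c2"})))    # ['c2', 'c3']  -- an error: i != j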
If an error does occur it is assumed that, whenever possible, behavior will be modified so that the same mistake will not be made again. The way the entity modifies its behavior is to 'pass' from one state to another. This learning-by-doing process is described in the next section.

An internal state may be looked at in two different ways. The first way views each state as a different 'stage' in the development of a single entity. In this view, the entity learns and thus its behavior improves (i.e. a different, higher state is realized) but it is, nevertheless, the same entity. Alternatively, each state can be thought of as a distinct kind of entity. As has been shown, each state produces its own pattern of behavior. In this view, learning transforms one kind of entity into a different kind of entity. Which view is taken is not a matter of what is correct but rather what is most useful for the problem at hand. For expository convenience, the first of the above interpretations is used throughout this paper.
7. Learning and state transitions

The central idea that motivates the learning hypothesis introduced here is that one learns from one's mistakes. The learning process, then, requires first that mistakes are identified, and second that behavior is modified so that the same mistake does not occur again. The change in behavior is a result of a refinement in perception proficiency, hence a transition to a new and 'higher' state.

The learning process introduced here may be expressed as an algorithm d that works as follows. When in state S_a an order for X_i is perceived by e as an order for Y_j. If X_i ≠ Y_j, then a mistake has occurred and the set

Y_j - X_i = {c_h | c_h ∈ Y_j and c_h ∉ X_i}    (3)

will be nonempty. That is, something was made but not ordered. The environment will not accept delivery of these 'extra' products. Now that the mistake has been identified, the entity will refine its perception partition by contrasting all of the extant equivalence classes (that is, the equivalence classes as they existed when the order was received) with the fragments that are left over (the set Y_j - X_i). The fragments are intersected with the extant equivalence classes and this results in a new (emergent) partition which is finer than the old. We denote this operation by d. For example, suppose

S_a = {c_1 c_2; c_3 c_4; c_5},    λ(S_a, X_i) = Y_j = {c_1, c_2}    and    X_i = {c_1}.

Then

Y_j - X_i = {c_2},    and    S_b = {c_1; c_2; c_3 c_4; c_5}.
Note that when learning occurs the entity undergoes a state transition. In the above example this was from S_a to S_b. Let δ: S × X → S be the state transition mapping that is the result of applying d. That is, δ(S_a, X_i) = S_b.
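The operation d is mechanical: each extant equivalence class is split by the leftover fragment Y_j - X_i. A minimal sketch (ours) reproducing the example above:

    def refine(state, fragment):
        # Intersect the fragment with each extant class; keep nonempty pieces.
        new_state = []
        for cls in state:
            for piece in (cls - fragment, cls & fragment):
                if piece:
                    new_state.append(piece)
        return new_state

    S_a = [{"c1", "c2"}, {"c3", "c4"}, {"c5"}]
    fragment = {"c1", "c2"} - {"c1"}              # Y_j - X_i = {'c2'}
    print(refine(S_a, fragment))
    # [{'c1'}, {'c2'}, {'c3', 'c4'}, {'c5'}] -- the finer state S_b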
8. A finite automaton

The 'black box' e is a finite automaton A = (S, X, Y, δ, λ), where

S = {S_0, ..., S_r}     is a finite set of states,
X = {X_0, ..., X_N}     is a finite set of inputs (orders),
Y = {Y_0, ..., Y_N}     is a finite set of outputs (products),
δ: S × X → S            is the state transition mapping,
λ: S × X → Y            is the input/output or behavior mapping.
Learning is described by δ while observed behavior is given by λ. A finite automaton A is a complete description of e since an entity is always in one of the states in S. A few general comments and additional concepts are in order.

(1) No learning may result, due to the nature of the state. A state S_a is an equilibrium state if δ(S_a, X_i) = S_a for all i = 0, 1, ..., N. The highest state S_r, when π = 1, is always an equilibrium state, but there may be others.

(2) If the environment is constrained so that certain orders are never sent, then the number of inputs is reduced to something less than N + 1. A diminution in the richness of experience has two effects. First, the state required to flawlessly service the environment is reduced to something below S_r and, second, the opportunities to make mistakes - and therefore the opportunities to learn - are also reduced. Typically environments are constrained. Examining the patterns of these constraints is a major part of a future paper.

(3) Orders X_0 for nothing and X_N for everything can be filled without error no matter what state the entity is in, that is, no matter what level of perception proficiency prevails. Thus, for all a, λ(S_a, X_0) = Y_0 and λ(S_a, X_N) = Y_N.

(4) The behavior possibility set for S_a is given by B_a = {Y_j | λ(S_a, X_i) = Y_j for some i}. Behavior is said to be completely specialized in the case of B_0 = {Y_0, Y_N}, because only one response other than the null response is ever observed. Behavior is said to be completely skillful in the case of B_r = {Y_0, Y_1, ..., Y_N}, that is, when e is in S_r. Only then does e exhibit the full range of possible responses. The idea of skillful behavior captures the entity's ability to react selectively in response to orders from the environment rather than in an all-or-nothing way. The degree of behavior skillfulness is measured by counting the elements in B_a. Improvements in perception proficiency result in more skillful behavior. The above notions of specialization and skillfulness must be sharply distinguished from two related but
different concepts. Namely, if an entity has only one name in its catalogue, then the entity is said to be completely specialized in the sense that it can do only one activity. The entity is said to be skillful to the degree that n is greater than unity.

(5) For each state S_a the set of inputs X can be partitioned into two sets. The first set consists of those orders that are made without error by e when in S_a and is called the error free set for S_a. This set is denoted by

I_a = {X_i | λ(S_a, X_i) = Y_i}.    (4)

Note that X_0 and X_N are in I_a for all a, and I_a = X only if a = r. The second set consists of all those orders that cannot be made exactly by e in state S_a. This set is called the error set for S_a and is denoted by

J_a = {X_i | λ(S_a, X_i) = Y_j and i ≠ j}.    (5)
Any sequence of orders from I_a is said to be computable by, or accepted by, the entity when it is in S_a. Any sequence of inputs that includes any elements from J_a is said to be not computable by, or not accepted by, the entity when it is in S_a. Let S_a denote an equilibrium (or absorbing) state. Then J_a is the equilibrium error set and I_a the equilibrium error free set of inputs. For inputs in J_a, errors persist after learning stops. For S_r, J_r = ∅ and I_r = X, so that every sequence from X is computable by e. Thus computational capacity increases with improvements in perception proficiency.
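The sets B_a, I_a and J_a can be enumerated directly from a state. A minimal sketch (ours), again under all-or-nothing perception, runs over all 2^n orders:

    from itertools import combinations

    def behave(state, order):
        return frozenset().union(*(cls for cls in state if cls & order))

    catalogue = ["c1", "c2", "c3"]
    orders = [frozenset(s) for r in range(len(catalogue) + 1)
              for s in combinations(catalogue, r)]    # all subsets of c

    S_a = [{"c1"}, {"c2", "c3"}]                      # the state with pi = 1/2

    I_a = [X for X in orders if behave(S_a, X) == X]  # error free set (4)
    J_a = [X for X in orders if behave(S_a, X) != X]  # error set (5)
    B_a = {behave(S_a, X) for X in orders}            # behavior possibility set

    print(len(I_a), len(J_a), len(B_a))               # 4 4 4
    # X_0 (nothing) and X_N (everything) always land in I_a, as in point (3).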
9. Rationality

Rationality is defined in terms of computational and/or information processing ability. Maximum or unbounded rationality is identified with perfect perception, or m = n. When perception is perfect no mistakes are made. In terms of the communication model of Section 4, what this means is that e's channel capacity is equal to H(X), so that each symbol X_i is transmitted in one step. When m = n there are no constraints internal to e that slow down the rate at which information flows.

In economic terms unbounded rationality means that the entity delivers orders for products without delay or error. Under these conditions the present value of an entity is determined entirely by the factor and product prices it faces, and these prices are in turn determined by forces arising from outside the 'black box' and presumably beyond its control. In more conventional terminology, the productive entity described above is a 'perfectly informed' price-taker whose role is to adjust quantities.

The most 'primitive' type of firm or organization consists of a single irreducible entity with unbounded rationality. Such a firm produces each of the N + 1 orders without delay or error. It reads in 'one square of order tape' each day and prints out 'one square of invoice tape' each day, and the symbols on each tape match exactly. This type of firm is of minimal scale, or l = 1, since only one order per day is worked on. There is much variation in the day-to-day goings-on within the firm
(entity) but the variation never slows down the rate of output because of the entity's unbounded rationality. For such entities isolation, that is, being in an organization with only one member, is not costly, nor would a larger organization offer gains. Finally, if rationality is unbounded, information theory makes no contribution to understanding the entity's behavior because the capacity constraint (m) is not binding. Bounded rationality occurs when m < n.
10. A simple model of bounded rationality

Previously it was assumed that perception proficiency starts at zero and then, under the operation of d, increases. In this section a different interpretation is pursued. Here it is assumed that e has a fixed number, m, of permanent perception equivalence classes, where m < n. The distinction is analogous to that between computer 'hardware', which can be changed only by costly and time-consuming modifications in the basic wiring or program, and 'software', which exploits the potential of the hardware by breaking down and recombining the basic operations determined by the hardware.

Biology provides other examples. A turtle's shell is literally hardware that produces a permanent response to disturbances that does not require information processing or discretionary activity. If a disturbance persists or is extraordinary, then the turtle collects and uses information to design a response that is less automatic, but better suited to the particular problem. Nervous systems have a priority scheme as well. Some responses are reflexive while others are discretionary. Neurological capacity limitations rule out the possibility that all responses be reflexive. In a complex environment, an inflexible and immediate response to every possible contingency would require impossible amounts of capacity. When a delayed response would jeopardize an organ, a reflexive response has survival value and it is given priority. Workers can do some of their jobs reflexively ('off the top of the head') while other activities require looking something up or thinking something through. Machines too can be set to perform certain operations but must be reset (often by themselves) to perform others. Usually a machine is designed to do a small set of activities and these it does well. It can be altered or adjusted to do other activities, but it does these less well, more slowly or at a higher cost. For an entity that has only m permanent equivalence classes (where m is less than n), cost considerations or laws of nature may be responsible for this constraint.

Suppose, for example, m = 2 and n = 4. The question is: If m = 2, which of the seven possible partitionings is best? Choosing the permanent equivalence classes is here interpreted as selecting an encoding rule. The operation of e, or its automaton characterization A, is essentially the same as was described above. Two equivalent formulations, given in Figs. 1 and 2, will be used to explain the steps. First, the symbol X_i sent by Σ is perceived by e according to its permanent equivalence classes. This is the step that encodes the symbol X_i into the signal which is acted upon by e. Symbolically the symbol X_i is mapped into the perceived order Y_j by the mapping λ. This is λ(S_p, X_i) = Y_j, where S_p denotes the permanent state.
Fig. 1. A communication system.
Fig. 2. Tape-reading automaton. (The observer reads the entity e's tapes: Tape 1 carries orders from Σ; Tape 2, all those bundles of products made, Y_j; Tape 3, just those bundles made that were ordered, Y_i.)
At the second step, the perceived order Y_j is executed by e according to its execution partitioning. Since it is assumed that ε = 1, the bundle of commodities actually made, Z_j, will always match the perceived order Y_j. The third step is a check or comparison of X_i with Y_j performed by the observer. The observer is a delay device (quality control department) with perfect perception that can store X_i for one step. If i = j, so that X_i = Y_j, then an error has not occurred and the observer sends Y_i to the environment. If i ≠ j, so that X_i ≠ Y_j, then an error has occurred and the observer will do two things. First, e's perception partitioning is refined according to the algorithm d. This new but temporary state is denoted δ(S_p, X_i). Next, the observer will resubmit a second X_i to the entity. With the more refined but temporary perception capabilities, no mistakes will occur, so that the goods made will match the goods ordered. That is:

λ(δ(S_p, X_i), X_i) = Y_i.    (6)
Once the order is filled exactly, e's perception partitioning reverts back to the permanent or home state S_p. This flow is illustrated in Fig. 1.

An alternative way of viewing this process is illustrated in Fig. 2. On Tape 1 the environment prints orders for commodities X_i. On Tape 2 the entity prints what was actually made. The observer reads from both of these tapes and does one of two things. Either (1) i = j and thus X_i = Y_j, in which case no error has occurred and the observer will print Y_i on Tape 3; or (2) i ≠ j and thus X_i ≠ Y_j, in which case an error has occurred. The observer will then print Ø on Tape 3 to indicate that X_i is being resubmitted to e. By assumption, the symbol Ø will always be followed by Y_i on Tape 3. This is because the entity always learns in one step. Longer lags would be possible - just have the observer print one Ø per time period spent learning.
Tape 2 contains a sequential record of every bundle of commodities that e makes. Mistakes produce symbols on Tape 2 that are not matched by symbols on Tape 1; that is, commodities made but not ordered. Assume that the objective is to minimize the number of instances of mistakes. In general some mistakes have more serious penalties than others, but for now this complication is ignored. (This is the subject of Section 11.) From an economic point of view mistakes are to be avoided. The question then is: Given a fixed number of permanent equivalence classes m < n, how should the m classes be filled with the n names so that, for long (input) sequences of orders from X*, the number of errors in the output sequence from Y* on Tape 2 is minimized? The central problem, and the approach followed, is identical to that of matching the information of the source H(X) to the channel capacity C. In other words, find the most common order and then fill the m equivalence classes so that that order is filled without error. Then try to minimize errors for the next most frequent order, and so on.

Some examples will help clarify some questions already raised and will also raise further questions. Assume that the environment, or source, Σ sends orders

X_1 = {c_1},    X_2 = {c_2, c_3},    X_3 = {c_1, c_2},    X_4 = {c_4},

with probabilities p(X_1) = 1/2, p(X_2) = 1/4, p(X_3) = 1/8 and p(X_4) = 1/8, so that

-log p(X_1) = -log 1/2 = 1 bit for symbol X_1,
-log p(X_2) = -log 1/4 = 2 bits for symbol X_2,
-log p(X_3) = -log 1/8 = 3 bits for symbol X_3,
-log p(X_4) = -log 1/8 = 3 bits for symbol X_4.

Then H(X) = -Σ_{i=1}^{4} p(X_i) log p(X_i) = (1/2)(1) + (1/4)(2) + (1/8)(3) + (1/8)(3) = 7/4 bits per symbol.

If m = 2, then to minimize the number of steps e takes to transmit sequences of symbols from Tape 1 to Tape 3, the permanent perception equivalence classes should be {c_1; c_2 c_3 c_4}. A typical sequence of length eight would contain orders with the relative frequencies given by their probabilities (see Table 1). The source Σ generates H(X) = 7/4 bits per symbol. The entity's capacity C is such that on average it processes 14 bits/12 steps = 7/6 bits per step. What this means is that, on average, the channel e takes 12 steps to pass on the 14 bits that come from the environment in 8 steps. Thus, C/H(X) = (7/6)/(7/4) = 2/3 of a symbol passes on each step. The capacity of the channel e is C = 7/6 bits per step. This is an information measure of what e does. As m → n, C → H(X). The number (n - m) indicates the degree of constraint present in the channel e. In the example used here, if m = n, then C = 7/4 bits per step. Then e has no perception constraint and coding is a trivial problem. The crux of the coding problem is to match the constraints present in e with the statistical structure exhibited by Σ.
Table 1

Tape 1:  X_1  X_1  X_1  X_1  X_2      X_2      X_3      X_4
Tape 2:  Y_1  Y_1  Y_1  Y_1  Y_5 Y_2  Y_5 Y_2  Y_5 Y_3  Y_5 Y_4
Tape 3:  Y_1  Y_1  Y_1  Y_1  Ø   Y_2  Ø   Y_2  Ø   Y_3  Ø   Y_4

X_1:  1 symbol per step     1 bit per symbol     1 bit per step       4 bits
X_2:  1/2 symbol per step   2 bits per symbol    1 bit per step       4 bits
X_3:  1/2 symbol per step   3 bits per symbol    3/2 bits per step    3 bits
X_4:  1/2 symbol per step   3 bits per symbol    3/2 bits per step    3 bits

Tape 1: H(X) = 7/4 bits per symbol = 7/4 bits per step.
Tape 3: R = 12 steps/8 symbols = 3/2 steps per symbol.
Tape 3: C = 14 bits/12 steps = 7/6 bits per step = H(X)/R.
The amount of statistical structure is measured by H(X) in terms of bits per symbol. Shannon's insight is that if C and H(X) are given, then a code can always be found that will permit a rate of transmission of C/H(X) symbols per step.

If m = 1, then e's perception index is zero, for it has only one equivalence class. In this case e uses two steps for every symbol, so that for the typical sequence of 8 symbols - which contains 14 bits - 16 steps are required. Then C = 14 bits/16 steps and H(X) = 7/4, so C/H(X) = 1/2 symbol per step. Suppose that m = 2 but e is coded badly, as {c_1 c_2 c_3; c_4}. Then, for a typical sequence, 15 steps are required for the 8 symbols; C = 14 bits/15 steps and C/H(X) = 8/15 of a symbol per step. Thus, for m = 2, the best coding allows 2/3 of a symbol per step while the worst allows only 8/15 of a symbol per step.

The interpretation can be restated as follows. The channel e is an automaton A whose state transitions are governed by δ. The number of (perception) equivalence classes, m, determines e's capacity, while how those classes are filled determines the code. Given H(X) and m, there is a best code. The best code (when costs are ignored) is the one that maximizes the bits per step. This also can be looked at as minimizing the number of times there are wasted products, or as minimizing the number of learning experiences.
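The step accounting behind Table 1 and the badly coded variants can be simulated. In the minimal sketch below (ours), an order in the error-free set costs one step and any other order costs two: one Ø learning step, then the corrected delivery, as in the text's one-step learning assumption.

    import math

    def steps(order, classes):
        perceived = frozenset().union(*(c for c in classes if c & order))
        return 1 if perceived == order else 2

    orders = {frozenset({"c1"}): 1/2, frozenset({"c2", "c3"}): 1/4,
              frozenset({"c1", "c2"}): 1/8, frozenset({"c4"}): 1/8}
    H = -sum(p * math.log2(p) for p in orders.values())      # 7/4 bits/symbol

    def capacity(classes):
        # Typical 8-order sequence: frequencies proportional to probabilities.
        total = sum(8 * p * steps(X, classes) for X, p in orders.items())
        return 8 * H / total                                 # bits per step

    best = [{"c1"}, {"c2", "c3", "c4"}]
    worst = [{"c1", "c2", "c3"}, {"c4"}]
    print(capacity(best), capacity(best) / H)     # 1.166..., 0.666... (7/6, 2/3)
    print(capacity(worst), capacity(worst) / H)   # 0.933..., 0.533... (14/15, 8/15)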
11. Two hypotheses about costs
In this section the information theory model of bounded rationality is given an economic interpretation in terms of two hypotheses about costs. Each hypothesis refers to an interpretation of the equivalence relation 'c_i is perceived to be
equivalent to c_j', and the corresponding decision rule λ. One hypothesis is called 'all-or-nothing'; the other 'pick-at-random'.

With the 'all-or-nothing' decision rule an order for any member of an equivalence class is perceived to be an order for the entire class. In other words, all members of an equivalence class are viewed as being parts of a common whole. For example, if S_a = {c_1 c_2; c_3 c_4} and X_i = {c_1, c_3}, then λ(S_a, X_i) = Y_j = {c_1, c_2, c_3, c_4}. In this example, c_2 will be made because c_1 was ordered and both lie in the same equivalence class. Similarly, c_4 will be made because c_3 was ordered.

With the 'pick-at-random' decision rule all members of an equivalence class are perceived to be perfect substitutes for each other. Unlike the all-or-nothing rule, each c_j is perceived to be a distinct product or activity. Using the previous example again, suppose S_a = {c_1 c_2; c_3 c_4} and X_i = {c_1, c_3}. The entity will decide (at random) to produce c_1 or c_2 but not both. The four possible responses to this order are {c_1, c_3}, {c_1, c_4}, {c_2, c_3}, and {c_2, c_4}.

When an entity whose perception is constrained to m permanent equivalence classes adopts either decision rule, it can fill some orders (those in the error-free set) immediately and others only after some delay; that is, only after the entity has taken the time to learn or compute the correct response. The entity is adapted to some contingencies and exhibits adaptability to the rest. The main idea is that if all orders cannot be filled immediately, those for which the cost of delay is large should be filled first while those for which the cost of delay is small should be delayed, ceteris paribus.

The benefit from, or value of, producing a correct response Y_i to the order X_i is assumed to be a decreasing function of the number of time periods (or learning steps) required to reach the correct response. The reduced reward to the delayed, in contrast to the immediate, response arises because the environment values prompt delivery. Costs associated with the learning process could also contribute to the reduced reward.

Both economic hypotheses impose a priority scheme on how the n commodity names should be assigned to the m equivalence classes. Recall that each assignment designates a state S_a or, from another point of view, an entity e_a. Thus, each state (or entity) is identified by a code. The code determines which orders are done with and without delay. For an entity e_a let

I_a = {X_i | X_i can be done without delay};
J_a = {X_j | X_j can only be done with delay};
V(X_i) = the value of an order X_i done without delay; and
V_d(X_i) = the value of an order X_i done with delay.
The penalty for delaying a particular order X_i is given by

D(X_i) = V(X_i) - V_d(X_i) > 0.    (7)
Let p(X_i) denote the probability of receiving an order X_i. The expected value of an entity e_a will be

V(e_a) = Σ_{X_i ∈ I_a} p(X_i) V(X_i) + Σ_{X_i ∈ J_a} p(X_i) V_d(X_i).    (8)

Given a reward schedule (which reflects the environment's tolerance for lateness), a probability distribution of orders (the environment's proportionate preferences), and the entity's parameters, n and m, there will exist an entity (or state) with a maximum expected value.

For a number of reasons, entities that are identical at the start of life may grow up to be different because of the experiences they have. In terms of the above, they end up with different assignments of the n names to m equivalence classes. (The more difficult case of different values for m is ignored for now.) For two entities e_a and e_b let

I_ab = {X_i | X_i ∈ I_a and X_i ∉ I_b}    (9)

and

J_ab = {X_i | X_i ∈ J_a and X_i ∉ J_b}.    (10)

Note that I_ab = I_a - I_b = J_b - J_a = J_ba. I_ab is the set of orders that e_a can do without delay but e_b cannot. This set represents e_a's comparative advantage over e_b. Is there some way for e_a and e_b to both gain from their differences? The answer is 'yes', and the gains can be realized if each gives up isolation (and its identity) in favor of a collective. Isolation for an entity with bounded rationality means that the orders printed on the input tape by the environment appear without error on the output tape but, over a typical sequence, there are delays. Both a and b can gain from their difference if each gives up isolation and becomes a part of a larger scale entity.

That producing in isolation is costly for those who do it is one of the simplest yet most important lessons in economics. However, the conventional economic treatment of 'giving up isolation' means that an entity retains its identity but begins trading with other entities. What is conjectured here is that an entity would stop being a firm unto itself and would become a part, or a member, of a larger scale firm. The important point is that the organization is not simply the means required to achieve the gains; the organization so formed becomes, to use the biologists' term, the relevant 'survival unit' (Wilson, 1978; Conrad, 1972).
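Equation (8) and the comparative-advantage sets translate directly into code. A minimal sketch (ours) follows; the reward numbers V and V_d are hypothetical illustration values, not from the paper.

    p   = {"X1": 1/2, "X2": 1/4, "X3": 1/8, "X4": 1/8}   # order probabilities
    V   = {"X1": 10, "X2": 8, "X3": 6, "X4": 6}          # value without delay
    V_d = {"X1": 7,  "X2": 5, "X3": 5, "X4": 2}          # value with delay

    def expected_value(I):
        # Equation (8), with I the entity's no-delay (error free) set.
        return sum(p[X] * (V[X] if X in I else V_d[X]) for X in p)

    I_a = {"X1", "X2"}            # e_a fills X1 and X2 without delay
    I_b = {"X1", "X4"}            # e_b fills X1 and X4 without delay
    print(expected_value(I_a), expected_value(I_b))   # 7.875 7.625

    print(I_a - I_b, I_b - I_a)   # I_ab = {'X2'}, I_ba = {'X4'}: each
                                  # entity's comparative advantage over the other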
12. The thesis regarding scale

The entities with the best chances of survival (lowest costs) are those which attain complete control in the sense of matching exactly and immediately the 'moves' made by the environment. An irreducible entity with unbounded rationality achieves control in this sense with minimal scale. Entities with bounded rationality can be
combined into a 'large' scale organization that exhibits unbounded rationality. Control requires that the entity's behavior be as complex as that of the environment. When m is fixed and less than n, the requisite complexity and control is achieved through scale. Large scale or, roughly speaking, many entities, will not in itself improve control or reduce cost. The large scale unit must be 'properly designed' so that, although its parts are subject to bounded rationality, the unit exhibits all the properties of unbounded rationality.

How to achieve a proper design of the large scale entity is a problem to be addressed on another occasion in terms of the algebraic structure theory of sequential machines. The purpose of this paper is to build the economic foundations for that later analysis. For now, two ways for entities to be associated will just be suggested. First, entities can work 'side-by-side' (i.e. in 'parallel') so that each has its 'own job' but it is only assigned jobs that it can do without delay or error. Second, entities can work together as 'teams' so that every job is a joint effort. In either case entities of bounded rationality associate so that the resulting organization - a 'survival unit' of larger size - exhibits unbounded rationality and thus achieves perfect control.

The thesis regarding scale is this: control can be achieved by entities of bounded rationality by changing the scale of the organization to which they belong. In this view, scale is the solution to the problem of how to achieve control with entities of bounded rationality. In the conventional view, scale is the source of the control problem. Diseconomies of scale arise because an organization becomes more difficult to control as it gets larger. Bounded rationality sets limits to the size of an organization only if one seeks to achieve control with a single entity of bounded rationality. However, if one seeks to achieve control by combining entities, the size of an organization is limited only by the ingenuity of the design. Design, which may be a conscious act or the result of selection, involves finding the 'proper code', i.e. the proper networking of entities.
13. Concluding remarks

In this paper information theory, as that subject was developed for communication systems by Shannon, is used to show how a productive entity's bounded rationality is related to channel capacity. When rationality is bounded an entity must choose which responses should be 'reflexive' (adaptedness) and which should be 'computed' (adaptability). This is shown to be identical to the problem of coding.

The work reported here is distinctive primarily because it takes information theory on its own terms. This requires that productive entities be viewed as communication systems that receive and send messages. A central insight of information theory is that communication and control are essentially the same problem. By looking at a productive entity as a communication system its behavior can be investigated from
the point of view of achieving control. The major contribution of this paper is that it develops a behavioral interpretation of information theory upon which an economic analysis can be built.
References

M. Conrad, Statistical and hierarchical aspects of biological organization, in: C.H. Waddington, ed., Towards a Theoretical Biology, Papers from a series of Symposia held by the International Union of Biological Sciences (Edinburgh University Press, Edinburgh, 1972).

G.E. Flueckiger, A finite automaton model of behavior and learning, Economic Inquiry 16 (1978) 508-530.

C.E. Shannon and W. Weaver, The Mathematical Theory of Communication (Urbana, Ill., 1948).

E.O. Wilson, The ergonomics of caste in the social insects, American Economic Review 68 (1978) 25-35.