Ecological Modelling, 16 (1982) 275-297
Elsevier Scientific Publishing Company, Amsterdam--Printed in The Netherlands
OPTIMAL STOCHASTIC CONTROL IN NATURAL RESOURCE MANAGEMENT: FRAMEWORK AND EXAMPLES
BYRON K. WILLIAMS
U.S. Fish and Wildlife Service, Migratory Bird and Habitat Research Laboratory, Laurel, MD 20708 (U.S.A.)
(Accepted for publication 15 March 1982)
ABSTRACT

Williams, B.K., 1982. Optimal stochastic control in natural resource management: framework and examples. Ecol. Modelling, 16: 275-297.

A framework is presented for the application of optimal control methods to natural resource problems. An expression of the optimal control problem appropriate for renewable natural resources is given and its application to Markovian systems is presented in some detail. Three general approaches are outlined for determining optimal control of infinite time horizon systems and three examples from the natural resource literature are used for illustration.
INTRODUCTION

Exploitable ecosystems share with other production systems the common goal of rational management. Managers of ecosystems seek to use them in a manner best designed to exploit resources at an optimal or specific level with minimum negative ecological impact. They therefore face the problem of determining optimal utilization subject to certain constraints on the level of that utilization. Three general requirements must be met. First, there must be an index of performance by which various management strategies can be compared and assessed; second, there must be specified a class of management options over which performance is to be measured; and last, system responses to management must be determined. General similarities notwithstanding, natural resource systems differ from artificial systems in some important ways that make their management difficult. Natural resource analysis must contend with large numbers of interacting components, high levels of complexity, various kinds and amounts of environmental heterogeneity, and only partially known system structures
and functions. Physical, thermodynamic and biotic principles operate on these systems in distinct ways requiring different levels of analysis; moreover, management itself is often included within a larger socio-economic-environmental context. As our understanding of ecological processes advances beyond the merely descriptive to a point at which ecological control is feasible, a general framework must include the following essential elements. First, dynamic optimization is necessary. The dynamic nature of ecosystems requires that they be managed in terms of responsive, evolving processes. Ecosystem components exhibit trajectories through which the system is comprehended, and management analysis seeks to describe future time paths as influenced by strategy. Optimal ecosystem management is therefore dynamic optimization. Second, uncertainties must be included in the formulation. Resource managers must operate on the basis of imperfect knowledge about natural resource systems. This uncertainty can take many forms: (i) inadequate knowledge of biological mechanisms; (ii) unknown effects of management on the system; (iii) unpredictability due to a stochastic environment; and (iv) limitations on monitoring capabilities. It is thus necessary to include uncertainties in a formal system representation by means of random inputs. Third, the issues of complexity and ecological interaction must be addressed. Natural resource systems are interacting arrays of biotic and non-biotic components. They have developed homeostatic feedback patterns (Odum, 1969; Patten, 1975) which make their conceptual simplification problematic. Management often involves the perturbation of these feedback loops, the effects of which are unpredictable without a priori or experimental knowledge about them. Complicated structures of many interacting components thus comprise the subject matter of an investigation, and this very size and complexity constitute a major problem for analysis. Each of these characteristics is more or less important depending on the ecosystem and the objectives of the analysis. For example, an ecosystem managed for maintenance of habitats for endangered species would be analyzed differently than the same system managed for timber production. In many cases, however, objectives cannot be unambiguously specified, either because they are not known, cannot be agreed upon, or involve conflicting and non-commensurate goals (Holling, 1974). Aside from other obvious benefits, the determination of optimal management strategies with differing objectives can aid considerably in realizing trade-offs and identifying goals. The number of studies in natural resources using optimal control methods has steadily increased over the last decade. Each of these has advanced a somewhat different formulation of the control problem, and several different solution methods have been utilized. The result is a lack of overall conceptual
unity, and much confusion. In this manuscript I propose a general conceptual model for optimal control problems as they broadly relate to natural resources. In the following section the optimal control problem is presented in the context of natural resource management. This is followed by a description of Markov decision processes, a large and quite general class of models directly applicable to natural resource systems. Then the available algorithms for determining optimal control of such systems are outlined, and three examples from the natural resource literature are used to illustrate possible applications.

THE OPTIMAL CONTROL PROBLEM
To apply optimal control methods a formal specification of the natural resource system is necessary. This must include an objective function, a set of feasible controls and a system model. The objective function is a mathematical expression of controls and system behavior that measures costs or benefits of a particular policy. The control set consists of a class of feasible management strategies over which an optimal assessment is to be made, and the model exhibits dynamic responses to policies. The underlying principle for this approach is that system dynamics can be adequately described by a succession of states and that system states can be characterized by values assigned to a set of state variables. That is, the state variable approach to natural resource analysis is assumed. Changes in the system are characterized by changes in state variables, and system dynamics are assumed to depend upon three factors: (i) the initial condition or starting point for the system; (ii) the management control imposed on it; and (iii) the environmental milieu of the system. Control is usually conceived as a change in the flow rates among system components, or as an alteration of the magnitudes of state variables, either by enrichment or reduction. Different controls influence the direction of system change, and the object of management is to influence behavior optimally according to a well-chosen performance index. This heuristic description of the optimal control problem may be formalized with an expression involving: an objective functional; a set of state, exogenous and control variables; a set of state transition equations; and time. Each of these components is described below. In the formal specification time t can be measured either in continuous or discrete units and is defined over an appropriate interval. The initial time t_0 is given and the terminal time t_f can be finite or infinite. If finite, t_f can be specified or left as a variable to be determined. The set of values which t assumes is the time frame T. System state is specified at any time in the time frame T by a state vector
composed of n state variables. The time sequence of vectors in n-dimensional Euclidean space therefore describes a state trajectory

x = {x(t)} = {x(t) ∈ E^n : t ∈ T}
The trajectory starts in some initial state x(t_0) = x_0, assumed given, and terminates in a terminal state x(t_f) = x_f that may be unspecified. For example, a model for a simple predator-prey system might consist of two state variables, one for the predator and one for the prey. The state trajectory then is of dimension two and the time frame might consist of all points in time greater than some starting time. In general there are restrictions on the trajectory and an admissible trajectory is one that belongs to some restricted set

x ∈ X

A second trajectory is defined by the assignment, at appropriate times in T, of a vector of k control variables. The time sequence of these controls constitutes a control trajectory in k-dimensional space
u = {u(t)} = {u(t) ∈ E^k : t ∈ T}
In the predator-prey example, an appropriate control may be the continuous harvest of varying proportions of the predators, so that a univariate continuous trajectory is defined. There may also be restrictions on controls, thereby defining admissible trajectories by

u ∈ U

At each time t in the time frame a third vector of r driving variables is assigned a value. The time sequence of these variables in r-dimensional space constitutes a driving variable trajectory

z = {z(t)} = {z(t) ∈ E^r : t ∈ T}

This trajectory influences but is not influenced by the state trajectory, nor is it affected by the control trajectory. In the example above, it may consist of fluctuating temperatures that impact the intrinsic growth rates of predators and prey. The state, control and driving variable trajectories interact through a set of state transfer equations which describe the rate of change of state trajectory {x(t)} as a function of state variables, driving variables, controls and time. In continuous time the equations are
ẋ(t) = f(x(t), z(t), u(t), t)

State transitions in discrete time are expressed by
x(t+1) = x(t) + f(x(t), z(t), u(t), t)
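A minimal simulation sketch of such a discrete-time transition equation is given below, using the predator-prey example introduced above, a proportional predator harvest as the control, and a fluctuating temperature anomaly as the driving variable. The functional form of f and all parameter values are hypothetical choices for illustration only; they are not taken from the text.

```python
# Hedged sketch: one possible discrete-time realization of
# x(t+1) = x(t) + f(x(t), z(t), u(t), t) for a two-state predator-prey system.
import numpy as np

rng = np.random.default_rng(0)

def f(x, z, u):
    prey, pred = x
    r = 0.1 + 0.002 * z          # prey growth rate, modulated by temperature anomaly z
    a, b, m = 0.005, 0.002, 0.08 # predation, conversion and predator mortality rates
    d_prey = r * prey - a * prey * pred
    d_pred = b * prey * pred - m * pred - u * pred   # u: harvested proportion of predators
    return np.array([d_prey, d_pred])

x = np.array([50.0, 18.0])       # initial prey and predator densities (illustrative)
trajectory = [x.copy()]
for t in range(50):
    z = rng.normal(0.0, 1.0)     # driving variable: temperature fluctuation
    u = 0.02                     # control: constant 2% predator harvest
    x = np.maximum(x + f(x, z, u), 0.0)   # discrete-time state transition, densities >= 0
    trajectory.append(x.copy())

print(np.array(trajectory)[-5:])
```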
For a predator-prey system, the state transition equations might consist of the Lotka-Volterra equations (Lotka, 1956). Last, the objective functional is a real valued mapping of control sequences, the value of which is to be maximized. It has the form

J(u) = ∫_{t_0}^{t_f} I(x(t), u(t), t) dt

and the analogous form for discrete time is

J(u) = Σ_{t=t_0}^{t_f} I(x(t), u(t), t)
The function I(x(t), u(t), t), often called an optimality index, exhibits the effect on J of state variables and controls as well as time. In our example, the optimality index could be a cost function based on numbers of prey and the cost of predator control at time t. In summary, the optimal control problem is

max_{u ∈ U} J = ∫_{t_0}^{t_f} I(x, u, t) dt

subject to

ẋ = f(x, z, u, t)
x(t_0) = x_0,  x(t_f) = x_f,  x ∈ X,  u ∈ U

This specification results in open loop optimal control trajectories {u*(t)}, wherein the appropriate control at time t is not directly dependent on system state. Intriligator (1971) and Sage and White (1977) provide detailed, comprehensive descriptions of optimal control for such deterministic systems. In many cases, however, random effects must be included, either by exogenous inputs, imprecise observability or uncertainty about the system identification. Then a closed loop optimal control {u*(x(t), t)} may result, in which the control trajectory is a function of current system state at each point in time. For such problems system behavior cannot be predicted with complete certainty, and therefore the value of the objective functional J(u) is known only probabilistically. Since unique values of J(u) are required for optimization, a common approach is to maximize the expected value of this functional. For discrete time the problem becomes

max_{u ∈ U} J = max_{u ∈ U} E[ Σ_{t=t_0}^{t_f} I(x(t), u(t), t) ]

subject to

x(t+1) = x(t) + f(x, z, u, t)
x(t_0) = x_0,  x(t_f) = x_f,  x ∈ X,  u ∈ U

Since there is uncertainty in the system, the transition equations essentially are the recurrence relations of a (usually non-stationary) first order stochastic process (Box and Jenkins, 1976). The problem then is one of analyzing a stochastic decision process, and this specification defines the stochastic optimal control problem. Kushner (1971) and Bertsekas (1976) give good expositions of optimal control for stochastic systems. The remainder of this exposition will focus on problems for which t_f = ∞, T is discrete, U is a finite set and the number of possible states x is finite. Systems thus delineated are characterized as finite state, finite action, infinite discrete time horizon Markov decision processes. They form a class of optimal control processes sufficiently large to be generally applicable to natural resource management, yet sufficiently limited to be analytically tractable. I focus here on infinite time processes because they fit easily into the framework of renewable natural resource problems, wherein protection against long-term resource degradation is necessary. Thus, for example, it is possible by this means to investigate 'steady state' system control in the absence of non-stationary over-exploitation near the end of the time frame. In view of the fact that renewable resource problems often involve large and indefinite time horizons, it is somewhat surprising that this approach is not more widely used.

MARKOV DECISION PROCESSES

Some of the more important results on which Markov decision analysis is based are described in this section. In order to proceed the following notation is established. Let S be the set of all possible states for a system. In the ensuing development this state space is assumed finite with, say, N elements. For each time t in a countable time frame T, the observed state of the system x_t is in the set S. Let A_i, i ∈ S, be the set of all possible management actions available when the system is in state i. It is not necessary that A_i = A_j for i ≠ j. Then A = ∪_{i ∈ S} A_i is the set of all available management activities for the system. A is also assumed finite in this development. Let a policy π be defined by a
mapping

π : S × T → A

That is, π is a rule which associates with a given state i ∈ S at any given time t ∈ T the action π(i, t). If π(i, t) = π(i), i.e. if the policy is time independent, then it is said to be stationary. In order to apply a stationary policy one need know only the system state, and not the particular point in time; a given state has the same action associated with it for every t ∈ T. Let P_ij(π(i, t)) be the probability of system transfer to state j at time t+1, given that it is in state i at time t and action π(i, t) is taken
P_ij(π(i, t)) = Pr[x_{t+1} = j | x_t = i]

Also, let P^n_ij(π) denote the nth order transition probability corresponding to policy π. It is therefore the probability that the system is in state j after n time steps, given that it started in state i

P^n_ij(π) = Pr[x_n = j | x_0 = i]

By repeated application of the Chapman-Kolmogorov law this probability may be expressed as

P^n_ij(π) = Σ_{i_1=1}^{N} ⋯ Σ_{i_{n-1}=1}^{N} P_{i i_1}(π(i, 0)) P_{i_1 i_2}(π(i_1, 1)) ⋯ P_{i_{n-1} j}(π(i_{n-1}, n-1))
Under a stationary policy the Markov decision process becomes a stationary Markov chain. That is, the sequence of states {x_t : x_t ∈ S, t ∈ T} forms a Markov chain defined by the fixed transition probabilities P_ij(π(i)). Furthermore, it is easily seen that for stationary processes the nth order transition matrix is given by the nth power of the matrix P_π, where

P_π = [ P_11(π(1))  ⋯  P_1N(π(1)) ]
      [     ⋮               ⋮     ]
      [ P_N1(π(N))  ⋯  P_NN(π(N)) ]
Objective functionals

Let R_j(π(i, t)) be the return associated with transfer from state i to state j for action π(i, t). Then

R(π(i, t)) = Σ_{j=1}^{N} P_ij(π(i, t)) R_j(π(i, t))
is the average return accruing to action π(i, t) taken at time t with the system in state i. An appropriate optimality index for discrete time processes is based on these single-step returns. The corresponding objective functional is simply the expected value of the sum of discounted returns, when it exists. A formal expression is

V_π(i) = E[ Σ_{t=0}^{∞} α^t R(π(x_t, t)) | x_0 = i ]      (1)

where V_π(i) indicates a dependence of the functional on initial condition and policy. Since returns are assumed bounded for every policy, V_π(i) exists for all discount values α such that 0 < α < 1. However, when α = 1 the sum in eq. 1 can be infinite and therefore its expected value can also be. In this case a different objective functional is required, which is based on the limit of time averaged returns

V_π(i) = lim_{n→∞} (n+1)^{-1} E[ Σ_{t=0}^{n} R(π(x_t, t)) | x_0 = i ]      (2)
Bertsekas (1976) has shown that this undiscounted value also exists for any stationary policy, again because all returns are bounded. For non-stationary policies the limit may be replaced by limit inferior. Finite and infinite problems require different treatments for their analysis. It is assumed henceforth that the discount factor α has been absorbed into the transition probabilities.
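To make the notation concrete, the sketch below constructs the transition matrix P_π and the expected single-step returns R(π(i)) for a small hypothetical Markov decision process under a fixed stationary policy, and then checks the two objective functionals by simulation: the discounted value of eq. 1 (with the discount written explicitly rather than absorbed) and the time-averaged value of eq. 2. All numerical values are illustrative assumptions, not data from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 3, 2                                 # states, actions (illustrative sizes)

# P[a, i, j]: probability of moving from state i to state j under action a
P = np.array([[[0.7, 0.2, 0.1],
               [0.3, 0.5, 0.2],
               [0.1, 0.3, 0.6]],
              [[0.9, 0.1, 0.0],
               [0.2, 0.6, 0.2],
               [0.0, 0.2, 0.8]]])
# R[a, i, j]: return for the transition i -> j under action a
R = np.array([[[4., 3., 1.], [2., 2., 0.], [1., 0., 0.]],
              [[5., 2., 1.], [3., 1., 0.], [2., 1., 0.]]])

pi = np.array([0, 1, 1])                    # stationary policy: take action pi[i] in state i
P_pi = np.array([P[pi[i], i, :] for i in range(N)])             # transition matrix under pi
R_pi = np.array([P[pi[i], i] @ R[pi[i], i] for i in range(N)])  # expected one-step returns

alpha = 0.9                                 # discount factor

def episode(i0, n_steps):
    """Simulate the chain under pi; return discounted sum and time-averaged return."""
    i, disc, total = i0, 0.0, 0.0
    for t in range(n_steps):
        disc += alpha**t * R_pi[i]          # eq. 1 summand
        total += R_pi[i]                    # eq. 2 running total
        i = rng.choice(N, p=P_pi[i])
    return disc, total / n_steps

runs = np.array([episode(0, 1000) for _ in range(300)])
print("estimate of eq. 1, V_pi(0):", runs[:, 0].mean())
print("estimate of eq. 2 (time-averaged return):", runs[:, 1].mean())
```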
Finite processes

The class of finite Markov decision processes includes but is not limited to discounted processes. It is defined by the existence of the expectation in eq. 1. In order to clarify this equation it may be rewritten as

V_π(i) = E[R(π(x_0, 0)) | x_0 = i] + E[R(π(x_1, 1)) | x_0 = i] + E[R(π(x_2, 2)) | x_0 = i] + E[ Σ_{t=3}^{∞} R(π(x_t, t)) | x_0 = i ]

so that the first three terms of the sum can be analyzed (α is assumed to be absorbed and thus is not written). Starting in state x_0 = i at time t = 0, action π(x_0, 0) is taken. This action results in the transition to state x_1 = j at t = 1 with probability P_ij(π(i, 0)). The average value accruing to π(i, 0) is

R(π(x_0, 0)) = Σ_{j=1}^{N} P_{x_0 j}(π(x_0, 0)) R_j(π(x_0, 0))
Fig. 1. Single step transition probabilities and associated returns for a system that begins in state 3. Policy π specifies action π(3, 0) at t = 0. Returns R_2(π(3, 0)) and R_4(π(3, 0)) are associated with the transition to states 2 and 4, respectively.
Figure 1 shows an example of first step transitions and returns. In this case only two transfers are possible from the initial state, and the expected return for step one is simply an average of the returns corresponding to these two transfers. Action π(x_1, 1) is taken at time 1 when the system is in state x_1, which results in an average return of R(π(x_1, 1)). The value of decisions at t = 1, averaged over all states, is therefore

E[R(π(x_1, 1)) | x_0 = i] = Σ_{j=1}^{N} P_ij(π(i, 0)) R(π(j, 1))
This result is illustrated in Fig. 2, in which the system in state 3 at time t = 0 has transferred to one of two states with positive probability at t = 1. The action appropriate to policy π is taken for each state, and corresponding average returns R(π(2, 1)) and R(π(4, 1)) are generated. Each return is multiplied by the probability of being in the associated state, and the discounted average return for time period 1 is calculated by

E[R(π(x_1, 1)) | x_0 = 3] = P_32(π(3, 0)) R(π(2, 1)) + P_34(π(3, 0)) R(π(4, 1))
The decisions at t = 0 and t = 1 leave the system in state x_2 at t = 2 with an appropriate probability. The probability of starting in state x_0 = i at t = 0 and ending at state x_2 = j at t = 2 is P^2_ij(π). Action π(j, 2) then is taken, average return R(π(j, 2)) is generated and the system transfers to a new state with some specified probability. The expected return for decisions at
Fig. 2. (a) Transitions, probabilities and associated returns for a system in one of two states at t = 1. Policy π specifies action π(2, 1) for state 2 and π(4, 1) for state 4. Transitions at t = 1 are shown, along with the associated probabilities and returns. (b) Average returns R(π(2, 1)) and R(π(4, 1)) associated with actions π(2, 1) and π(4, 1), respectively. Average returns are given by the expected value of transition-specific returns from Fig. 2a.
t = 2 therefore is given by

E[R(π(x_2, 2)) | x_0 = i] = Σ_{j=1}^{N} P^2_ij(π) R(π(j, 2))
This result is illustrated in Fig. 3. The system can be in one of four states with positive probability at t = 2. Policy π specifies actions π(j, 2) which result in a set of third stage transition probabilities and returns. Average returns for actions taken at t = 2 are shown in Fig. 4, along with the corresponding probabilities P^2_3j(π). For this example the average value of third stage actions is

E[R(π(x_2, 2)) | x_0 = 3] = Σ_{j=2}^{5} P^2_3j(π) R(π(j, 2))
By induction the discounted average value of decisions taken at time t is

E[R(π(x_t, t)) | x_0 = i] = Σ_{j=1}^{N} P^t_ij(π) R(π(j, t))
Fig. 3. Transition probabilities and associated returns for a system in one of four states at t = 2. Policy π specifies action π(i, 2) for state i at time t = 2. Transitions at t = 2 are shown, along with associated probabilities and returns.
Fig. 4. Probabilities and associated average returns at t = 2. P^2_3j(π) is the probability that the system is in state j at t = 2, and R(π(j, 2)) is the average return accruing to action π(j, 2). See text for additional explanation.
and therefore the objective functional may be written as

V_π(i) = Σ_{t=0}^{∞} Σ_{j=1}^{N} P^t_ij(π) R(π(j, t))      (3)
It is noted that this expression is indexed by i, the initial state of the system. A policy π thus generates N values, one for each of the possible states x_0 ∈ S. Now let V'_π = [V_π(1), ..., V_π(N)] be the vector of values generated by a stationary policy π. Since transition probabilities are time independent for stationary processes, average returns are stationary as well

R(π(j, t)) = Σ_{k=1}^{N} P_jk(π(j)) R_k(π(j)) = R(π(j))

If the vector of average returns for stationary policy π is denoted by

R'_π = [R(π(1)), R(π(2)), ..., R(π(N))]

then the objective functional in eq. 3 may be reduced to

V_π = Σ_{t=0}^{∞} P^t_π R_π      (4)
Optimality for finite valued Markov decision processes

As shown by eq. 3, each policy π generates a vector V_π of values V_π(i) which are dependent on the probability structure. For a given structure, define V(i) by

V(i) = inf_π V_π(i)

Any policy π* which satisfies

V_π*(i) = V(i)

for every state i is said to be an optimal policy (we follow the convention in optimal control of minimizing rather than maximizing. Since one problem transforms easily into the other, no loss of generality results). Derman (1970) demonstrated the existence of optimal policies for finite processes, and showed that they may be chosen to be stationary. This result is crucial, because it reduces the problem of finding an optimal policy to a search among stationary candidates, greatly simplifying the analysis. Stationarity yields transition probabilities P_ij(π(i)), which form the stochastic structure of a stationary Markov process. A recurrence relation can be formed from
eq. 4 under these conditions

V_π = R_π + P_π[R_π + P_π R_π + ...] = R_π + P_π V_π      (5)
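Eq. 5 determines the values of a fixed stationary policy as the solution of a linear system. A minimal numerical sketch of this value determination step follows; the data are illustrative only, and the discount is written explicitly so that I − P_π is invertible.

```python
# Policy evaluation by eq. 5: V = R_pi + P_pi V, i.e. (I - P_pi) V = R_pi.
import numpy as np

alpha = 0.9
P_pi = alpha * np.array([[0.7, 0.3],       # discounted transition matrix under policy pi
                         [0.4, 0.6]])
R_pi = np.array([1.0, 3.0])                # expected single-step returns under pi

V_pi = np.linalg.solve(np.eye(2) - P_pi, R_pi)
print(V_pi)                                # values V_pi(1), V_pi(2)
```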
From eq. 5 the values V_π(i) corresponding to policy π can be derived as the solution of N linear non-homogeneous equations. Every stationary policy π has corresponding to it a recursion relation, including the optimal policy π*. The approach usually taken is to elucidate this policy with an alternative form of the recursion relation. Let R(i, a) be the average single-step return accruing to action a when the system is in state i. It is shown in an appendix that the optimal policy π* satisfies

V_π*(i) = V(i) = min_{a ∈ A_i} [ R(i, a) + Σ_{j=1}^{N} P_ij(a) V(j) ],    i = 1, ..., N      (6)
Computational algorithms for finite processes

Three approaches to the analysis of eq. 6 are generally recognized in the literature: methods of bound improvement, linear programming and methods of policy improvement. The first two focus on determining values V(i) in eq. 6 and then searching the finite action space for a corresponding stationary policy. Policy improvement focuses on the determination of a policy, followed by the generation of corresponding values. An elementary but important theorem for computing is the following: if π is a stationary policy which selects the action π(i) minimizing

R(i, a) + Σ_{j=1}^{N} P_ij(a) V(j)

when the system is in state i, then π is optimal (Ross, 1970). That is, when optimal values V(i) are known a corresponding optimal policy is given by the simple minimization above. Thus we need only find the values V(i) in order to obtain an optimal policy. Their determination is the goal of the linear programming and bound improvement algorithms, as outlined below.
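The extraction of a policy from known values is a one-line minimization; a hedged sketch follows, with illustrative numbers and the paper's minimization convention. The values V below are simply assumed to be available.

```python
# Given optimal values V(i), select pi*(i) = argmin_a [ R(i,a) + sum_j P_ij(a) V(j) ].
import numpy as np

N, K = 3, 2
# Discounted transition probabilities P[a, i, j] and average returns R_bar[a, i] = R(i, a);
# numbers are illustrative only.
P = 0.9 * np.array([[[0.7, 0.2, 0.1],
                     [0.3, 0.5, 0.2],
                     [0.1, 0.3, 0.6]],
                    [[0.9, 0.1, 0.0],
                     [0.2, 0.6, 0.2],
                     [0.0, 0.2, 0.8]]])
R_bar = np.array([[3.0, 1.5, 0.4],
                  [4.2, 2.1, 0.6]])
V = np.array([10.0, 7.5, 3.2])             # assume these values are already known

Q = R_bar + P @ V                           # Q[a, i] = R(i, a) + sum_j P_ij(a) V(j)
policy = Q.argmin(axis=0)                   # minimizing action in each state
print(policy)
```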
Linear programming

Determination of optimal values V(i) with linear programming requires the least theoretical support, though it often is the least efficient method. This is especially so if both state and action spaces contain large numbers of elements. Under the minimization convention of eq. 6, the problem of finding V(i) can be reformulated as: maximize

l(u) = Σ_{i=1}^{N} B_i u_i

subject to

u_i ≤ R(π(i)) + Σ_{j=1}^{N} P_ij(π) u_j,    i ∈ S

where N constraints are included for every possible stationary policy and

Σ_{i=1}^{N} B_i = 1,    B_i > 0 for i = 1, ..., N

That linear programming could be applied to Markov processes was first recognized by d'Epenoux (1963).
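A compact sketch of this linear program follows, using scipy's general-purpose solver and the same illustrative data as above. Since every stationary policy's constraints are implied by one constraint per state-action pair, the sketch enumerates state-action pairs rather than whole policies.

```python
# LP for the optimal values under the minimization convention of eq. 6:
# maximize sum_i B_i u_i subject to u_i <= R(i,a) + sum_j P_ij(a) u_j for all i, a.
import numpy as np
from scipy.optimize import linprog

N, K = 3, 2
P = 0.9 * np.array([[[0.7, 0.2, 0.1],
                     [0.3, 0.5, 0.2],
                     [0.1, 0.3, 0.6]],
                    [[0.9, 0.1, 0.0],
                     [0.2, 0.6, 0.2],
                     [0.0, 0.2, 0.8]]])          # discounted, illustrative
R_bar = np.array([[3.0, 1.5, 0.4],
                  [4.2, 2.1, 0.6]])
B = np.full(N, 1.0 / N)                           # any positive weights summing to 1

# Each constraint row encodes u_i - sum_j P_ij(a) u_j <= R(i, a).
A_ub, b_ub = [], []
for a in range(K):
    for i in range(N):
        row = -P[a, i, :].copy()
        row[i] += 1.0
        A_ub.append(row)
        b_ub.append(R_bar[a, i])

res = linprog(c=-B, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * N)          # maximize B'u by minimizing -B'u
print(res.x)                                      # the values V(1), ..., V(N)
```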
Bound improvement

An iterative procedure for determining V(i) is based on the theory of contraction mappings. We may define the map T_π by

T_π(U) = R_π + P_π U

where π is any stationary policy and U is any N-dimensional vector of values. It can be shown (Ross, 1970) that T_π is a contraction map; that is

|T_π(U) - T_π(V)| < |U - V|

Furthermore, if T is defined by

T(U) = min_π [R_π + P_π U]

where the minimization is over all stationary policies, then T is also a contraction map. It follows that the sequence

{T^i(U) : i = 1, 2, ...} = {T(U), T(T(U)) = T^2(U), ..., T^i(U), ...}

is convergent with

lim_{n→∞} T^n(U) = V

(Ross, 1970). An algorithm based on this result can be constructed to produce successive approximations of V, and when convergence is sufficiently exact the values can be used with eq. 6 to determine the optimal policy π*. This general approach, first developed by Shapley (1953) in a game theoretic context, is called the method of successive approximation or the bound improvement method. Totten (1971) has developed the theoretical constructs by means of which an efficient successive approximation algorithm can be encoded.
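A minimal sketch of such a successive-approximation scheme is given below. Applying T componentwise over actions is equivalent to minimizing over stationary policies; the data and tolerance are illustrative, with the discount absorbed into P so that T is a contraction.

```python
# Successive approximation (bound improvement): iterate V <- T(V) until convergence.
import numpy as np

P = 0.9 * np.array([[[0.7, 0.2, 0.1],
                     [0.3, 0.5, 0.2],
                     [0.1, 0.3, 0.6]],
                    [[0.9, 0.1, 0.0],
                     [0.2, 0.6, 0.2],
                     [0.0, 0.2, 0.8]]])          # illustrative, discounted
R_bar = np.array([[3.0, 1.5, 0.4],
                  [4.2, 2.1, 0.6]])
N = R_bar.shape[1]

V = np.zeros(N)
for _ in range(10000):
    V_new = (R_bar + P @ V).min(axis=0)           # T(V)(i) = min_a [ R(i,a) + sum_j P_ij(a) V(j) ]
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

policy = (R_bar + P @ V).argmin(axis=0)           # policy corresponding to the converged values
print(V, policy)
```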
Policy improvement-value determination

Rather than determining V(i) first and generating π* as a consequence, we may analyze eq. 6 by choosing a policy first and then finding the corresponding values. This latter approach makes use of a theorem from which improved policies can be found: let π be a stationary policy and V_π be the vector of values generated by π in eq. 5. If π_1 is the stationary policy obtained by choosing, for each state i, the action a minimizing

R(i, a) + Σ_{j=1}^{N} P_ij(a) V_π(j)

over a ∈ A_i, then |V_π_1(i) - V(i)| ≤ |V_π(i) - V(i)|. In addition, either V_π_1(i) < V_π(i) for at least one state i or else π is optimal. A concise proof of this theorem may be found in Derman (1970). An algorithm for determining the optimal policy may be constructed easily from this result. We can start with any stationary policy π and generate the values V_π(i) corresponding to it. Minimization of the above expression over A_i produces a new action a_i, and when this is done for each of the states a new policy π' results. π' may then be used to generate another new policy in the same manner, and so on. Successive iterations must produce steadily improving policies until V_π'(i) = V_π(i) for i = 1, ..., N, at which time the optimal policy is found. This technique, known as the policy improvement-value determination method, was first introduced by Howard (1960).
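The alternation between value determination (eq. 5) and policy improvement can be sketched compactly; the example data are again illustrative, with the discount absorbed into P.

```python
# Policy improvement-value determination (Howard, 1960), discounted case.
import numpy as np

N, K = 3, 2
P = 0.9 * np.array([[[0.7, 0.2, 0.1],
                     [0.3, 0.5, 0.2],
                     [0.1, 0.3, 0.6]],
                    [[0.9, 0.1, 0.0],
                     [0.2, 0.6, 0.2],
                     [0.0, 0.2, 0.8]]])          # illustrative
R_bar = np.array([[3.0, 1.5, 0.4],
                  [4.2, 2.1, 0.6]])

pi = np.zeros(N, dtype=int)                       # start from an arbitrary stationary policy
while True:
    # value determination: solve V = R_pi + P_pi V for the current policy
    P_pi = P[pi, np.arange(N), :]
    R_pi = R_bar[pi, np.arange(N)]
    V = np.linalg.solve(np.eye(N) - P_pi, R_pi)
    # policy improvement: choose the minimizing action in each state
    pi_new = (R_bar + P @ V).argmin(axis=0)
    if np.array_equal(pi_new, pi):
        break                                     # no further improvement: pi is optimal
    pi = pi_new

print(pi, V)
```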
Infinite processes

In cases for which eq. 6 has no finite solution, the equation

V_π(i) = lim_{n→∞} (n+1)^{-1} E[ Σ_{t=0}^{n} R(π(x_t, t)) | x_0 = i ]

is used to determine policy values. Bertsekas (1976) demonstrated the existence of an optimal policy for the undiscounted case, and showed that the values V_π(i) corresponding to any stationary policy π satisfy the recurrence relations

V_π + h_π = R_π + P_π h_π      (7)

where the vector h_π is defined by

(I - P_π + P*_π) h_π = (I - P*_π) R_π

with P*_π given by

P*_π = lim_{n→∞} (n+1)^{-1} Σ_{t=0}^{n} P^t_π
In its most general form eq. 7 yields no computation algorithm, since there exist 2N variables (corresponding to the variables V_π(i) and h_π(i), i = 1, ..., N) constrained by N equations. However, Bertsekas (1976) showed that if there is any state to which transfers can be made from every other state in S with non-zero probability, then V_π can be expressed as

V_π = g_π 1

Thus the long-term average value of the Markov decision process is independent of initial state. This independent average value is called system gain and is designated by the parameter g_π. Under this condition the N equations in (7) contain N + 1 unknowns. If one of the values h_π(i), say h_π(1), is set to zero, then we have a set of N equations in N unknowns. Since the long-term average values are given by

V_π = lim_{n→∞} (n+1)^{-1} Σ_{t=0}^{n} P^t_π R_π

when π is stationary, the system gain g_π is not affected by this assignment. An interpretation of the values g_π and h_π can be generated from results by Howard (1960). Using an analysis based on the z-transform, he showed that for ergodic processes the solution to eq. 7 asymptotically satisfies

V_π(n) = n[g_π 1] + h_π

where V_π(n) is the vector of (asymptotic) cumulative returns for policy π after n time steps. Cumulative returns thus are decomposed into a component for average long-term system gain and components specific to the initial state of the system. Hence the values h_π(i) may be thought of as returns due to 'transient' system behavior, whereas the gain g_π corresponds to 'steady-state' behavior. Although the N values in h_π cannot be determined simultaneously, by setting h_π(1) to zero the remaining N - 1 relative values can be found. An algorithm for systems which meet the appropriate recurrence class conditions then can be built around the linear system

g_π 1 + h_π = R_π + P_π h_π
h_π(1) = 0

The solution of this system consists of the system gain and a set of relative transient returns, and the objective of the algorithm is to determine the policy π* for which gain is maximum. It is noted that when g_π = 0 for all stationary policies, this system of equations reduces to the one used for the finite case. For constant long-term system gain, it can be shown (Howard, 1960) that the same policy improvement algorithm used for finite processes also produces convergent policies, gains and transient returns for the infinite case.
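For a fixed stationary policy this linear system is easily solved; a minimal sketch with a small illustrative chain follows. The unknowns are the gain g and the relative values h(2), ..., h(N), with h(1) fixed to zero as described above.

```python
# Solve g*1 + h = R_pi + P_pi h with h(1) = 0 for the gain and relative values.
import numpy as np

P_pi = np.array([[0.7, 0.3],                  # illustrative transition matrix under pi
                 [0.4, 0.6]])
R_pi = np.array([1.0, 3.0])                   # illustrative expected single-step returns
N = len(R_pi)

# Unknown vector w = [g, h(2), ..., h(N)]; equation i reads g + h(i) - sum_j P_ij h(j) = R_pi(i).
A = np.zeros((N, N))
A[:, 0] = 1.0                                 # coefficient of the gain g
H = np.eye(N) - P_pi                          # coefficients of h(1), ..., h(N)
A[:, 1:] = H[:, 1:]                           # drop the h(1) column since h(1) = 0

w = np.linalg.solve(A, R_pi)
g, h = w[0], np.concatenate(([0.0], w[1:]))
print("gain g =", g, "relative values h =", h)
```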
Since no contraction mapping can be based on eq. 2 and the expectation

E[ Σ_{t=0}^{∞} R(π(x_t, t)) | x_0 = i ]

is infinite, no other general solution algorithm is available.

APPLICATIONS TO NATURAL RESOURCE SYSTEMS

The ideas developed above are illustrated in this section with three examples from the natural resource literature. Many applications of dynamic optimization and stochastic modeling have been reported (see Walters and Hilborn (1978) and Williams (1979) for comprehensive reviews). Those included here were chosen to indicate the variety of ecological systems to which optimal stochastic control is applicable, and to illustrate various approaches in problem formulation and solution.
Migratory bird management

Anderson (1975) used optimal stochastic control methods to study the impact of harvest policies on mallard (Anas platyrhynchos) populations. This species is a Nearctic migrant, breeding primarily in the prairie pothole region of the north-central U.S.A. and central Canada. Populations are strongly influenced by the number of ponds available as breeding sites, and they are subjected to considerable hunting pressure during fall migration. Anderson described this situation with a model involving two state variables: the size of the adult breeding population in year t, x_1(t), and the number of ponds present on the central breeding grounds, x_2(t). In addition, a variable y(t) is used to track the number of surviving offspring. Since the latter is completely determined by x_1(t) and x_2(t), it is formally an intermediate variable and need not be included in the state space. The driving variable trajectory z for this system is given by annual rainfall amounts, which influence the number of ponds on the breeding grounds. System control consists of projected harvest amounts during fall migration, as determined by hunting regulations. The state transition equations are expressed by
x_1(t+1) = φ_1 x_1(t) + φ_2 y(t)      (8)
x_2(t+1) = a + b x_2(t) + c z(t)      (9)
y(t) = g(x(t))      (10)
φ_1 = φ_1(D(t), x_1(t), y(t))      (11)
φ_2 = φ_2(D(t), x_1(t), y(t))      (12)
D(t) = u(t) + ε_t      (13)
The coefficient φ_1 is spring to spring survival for adults and φ_2 is fall to spring survival for juveniles. Constants a, b and c are used to express the degree of serial correlation in pond numbers for successive years, and D(t) specifies the number of birds actually harvested in year t. There are two sources of variability in this model, the first of which is uncontrolled variation in total rainfall. The second is in the harvest term, where ε_t measures uncontrolled differences between actual and projected harvests. Both trajectories {z(t)} and {ε_t} are assumed to be generated by Gaussian white noise processes. With an appropriate discretization of the state space, this leads to a set of decision-specific transition probabilities. Single step payoffs for the optimization consist of actual harvest levels for each year. Based on these amounts an objective function is given as the expected value of time-averaged yields

J = lim_{N→∞} E[ (N+1)^{-1} Σ_{t=0}^{N} D(t) ]
The methodology used by Anderson to find optimal harvest policies is a backward iteration procedure commonly applied in terminal-time dynamic programming problems. The number of stages for his analysis was incremented iteratively, until a stationary policy appeared to have been determined. From the resulting policy configuration he was able to determine the marginal influences of weather and population status on optimal harvest levels. By changing eqs. 11 and 12 the hypotheses of compensatory and non-compensatory mortality for mallards were also examined.
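To show the bookkeeping implied by eqs. 8-13, the sketch below advances one annual transition of a mallard-like model with stochastic rainfall and a noisy realized harvest. The recruitment and survival functions and every parameter value are hypothetical stand-ins chosen only for illustration; they are not Anderson's (1975) fitted relationships.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, c = 1.0, 0.7, 0.5                        # pond serial-correlation parameters (illustrative)

def recruitment(x1, x2):
    # hypothetical g(.): young produced increases with pond numbers (eq. 10)
    return 0.8 * x1 * x2 / (x2 + 2.0)

def survival(D, x1, y):
    # hypothetical phi_1, phi_2 declining with the realized kill rate (eqs. 11-12)
    kill_rate = min(D / max(x1 + y, 1e-9), 1.0)
    return 0.65 * (1.0 - kill_rate), 0.45 * (1.0 - kill_rate)

def step(x1, x2, u):
    """One annual transition: (adults, ponds) and projected harvest u -> next year."""
    z = rng.normal(0.0, 0.5)                   # rainfall driving variable (white noise)
    eps = rng.normal(0.0, 0.1)                 # difference between actual and projected harvest
    D = max(u + eps, 0.0)                      # realized harvest (eq. 13)
    y = recruitment(x1, x2)                    # surviving young (eq. 10)
    phi1, phi2 = survival(D, x1, y)
    x1_next = phi1 * x1 + phi2 * y             # adult breeding population (eq. 8)
    x2_next = a + b * x2 + c * z               # ponds on the breeding grounds (eq. 9)
    return x1_next, x2_next, D

x1, x2 = 8.0, 3.0                              # millions of birds and of ponds (illustrative)
for year in range(5):
    x1, x2, D = step(x1, x2, u=1.5)
    print(year, round(x1, 2), round(x2, 2), round(D, 2))
```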
Forest management

Lembersky and Johnson (1975) used optimal stochastic control to examine planting and removal policies for Douglas fir (Pseudotsuga menziesii) forests in Washington. These forests are subjected to a number of cropping and regeneration practices, responses to which can only be determined after extended periods of time. The effects of environmental variability, combined with inevitable limitations on our knowledge of forest dynamics, prohibit the accurate prediction of stand development. Nor is it possible to determine precisely the nature of future sales markets. Lembersky and Johnson modeled this situation with a three dimensional system. The first two state variables are biotic, average dbh (diameter-breast height) and average number of trees per hectare. These were chosen because both stand dynamics and market returns are based on timber volume, which in turn is a function of dbh and tree density. A third state variable specifies the trajectory of market conditions. Driving variables for the system are two general environmental variables, which essentially express the lack of certainty about future stand dynamics. A general form for system transitions is given by

x_1(t+1) = g_1(x_1(t), x_2(t), u(t)) + z_1(t)
x_2(t+1) = g_2(x_1(t), x_2(t), u(t)) + z_2(t)
x_3(t+1) = a + b x_3(t) + ε_t

where x_1(t), x_2(t) and x_3(t) are stand dbh, tree density and market conditions, respectively. Stochastic influences enter with the driving vector z(t) and market variable ε_t, the latter expressing future market uncertainties. Both {z(t)} and {ε_t} are assumed to be white noise processes, and the state space is discretized into 240 states. Control for this problem is handled in a different way from the example above. Whereas Anderson (1975) modeled control as a reduction of state variable levels by certain projected amounts, Lembersky and Johnson considered both enrichment and reduction of states to pre-established levels. Control options include planting, partial harvests and thinning to various densities. Single step payoffs are economic returns v(x(t), u(t)) from timber harvest, minus costs c(x(t), u(t)) of applying the control. They assumed a discounted process with discount factor α = 0.50324, so that the objective function is given by

J = E[ Σ_{t=0}^{∞} α^t (v(x(t), u(t)) - c(x(t), u(t))) ]
Lembersky and Johnson solved this optimal control problem by means of a bound improvement procedure, as outlined above. From the resulting policy configuration they were able to determine several general patterns for optimal management of timber plantations and several general trends were recognized in the corresponding returns.
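The mapping from continuous stand and market variables to a finite state space is central to this kind of application. A hedged sketch of one such mapping follows; the class boundaries below are hypothetical and are not Lembersky and Johnson's actual 240-state scheme.

```python
# Illustrative discretization of (dbh, trees per ha, market index) into a single state index.
import numpy as np

dbh_bins    = np.array([10., 20., 30., 40.])      # cm, illustrative class boundaries
dens_bins   = np.array([250., 500., 1000.])       # trees per hectare, illustrative
market_bins = np.array([0.8, 1.2])                # price index, illustrative

def state_index(dbh, density, market):
    i = np.digitize(dbh, dbh_bins)                # class number for each variable
    j = np.digitize(density, dens_bins)
    k = np.digitize(market, market_bins)
    n_j, n_k = len(dens_bins) + 1, len(market_bins) + 1
    return (i * n_j + j) * n_k + k                # single index into the finite state set

print(state_index(27.0, 620.0, 1.05))
```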
Optimal defoliation of salt desert shrubs

Williams (1979) used optimal stochastic control to investigate the exploitation of some common cool-desert shrubs in the Great Basin of the intermountain U.S.A. Environments in this region are extreme, with hot, dry summers, subfreezing winters and poorly developed, saline soils. Vegetation in the intermountain desert shrublands is often exposed to severe physiological stress because of the infrequent precipitation, extremes in temperatures and high soil salinities. Productivity is low, so that recovery from
perturbation is very gradual. These shrublands are extensively managed, mostly as winter and spring grazinglands for sheep. An additional burden of stress is thus placed on the plants, making them especially susceptible to deterioration from improper management. Williams (1979) sought to determine optimal defoliation strategies for the cool-desert shrub species shadscale (Atriplex confertifolia), big sagebrush (Artemisia tridentata) and winterfat (Ceratoides lanata). A primary production model combining both structural and physiological features was used, and state variables were included for shoots, roots, stems and carbohydrate reserves. Accumulator variables were also included in order to track yearly defoliation amounts of active and inactive shoots. Driving variables for the model consist of soil water and temperature values, chosen because each is the dominant factor limiting metabolism during certain seasons. The model is based on the equations of growth analysis (Blackman, 1968; Evans, 1972), whereby changes in plant biomass are partitioned into various components as a function of the controlling factors. Flows across system boundaries as well as allocations among components are controlled at different points by temperature, soil moisture, defoliation, system state and time. The general form of the model equations is
x_i(t+1) = x_i(t) - Σ_j F_ij(x, z, u, t) + Σ_j F_ji(x, z, u, t)
where F_ij(x, z, u, t) denotes the flow of biomass from compartment i to compartment j. Stochastic effects enter with the environmental variables in z(t). See Fetcher (1977, 1981) and Williams (1979) for a more complete model description. Controls for this study consisted of defoliation patterns in different combinations of intensity, duration and seasonality. Defoliation was modeled as the removal of a specified proportion of appropriate state variables, and optimal control was based on yearly amounts. Defoliation was measured in weighted units of active and inactive biomass. For discounted problems the corresponding objective function is given by discounted sums of defoliation amounts
J = E[ Σ_{T=0}^{∞} α^T W(T) ]

where T is a yearly index and W(T) is the accumulated amount of defoliated biomass in year T. The time-averaged form

J = lim_{N→∞} E[ (N+1)^{-1} Σ_{T=0}^{N} W(T) ]
is used for undiscounted problems.
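The compartment-flow bookkeeping described above can be sketched as follows: biomass moves among compartments according to flows F_ij, defoliation removes a specified proportion of shoot biomass, and W(T) accumulates the yearly removals. The flow functions, the defoliation proportion and all numbers are hypothetical illustrations, not the fitted model of Fetcher (1977, 1981) or Williams (1979).

```python
import numpy as np

labels = ["shoots", "stems", "roots", "reserves"]
x = np.array([50.0, 120.0, 200.0, 30.0])        # g biomass per plant, illustrative

def flows(x, soil_water, temperature):
    # F[i, j] = biomass flow from compartment i to compartment j this time step
    F = np.zeros((4, 4))
    growth = 0.05 * soil_water * max(temperature, 0.0) / 20.0
    F[3, 0] = growth * x[3]                     # reserves -> shoots (illustrative allocation)
    F[0, 1] = 0.02 * x[0]                       # shoots -> stems
    F[0, 2] = 0.03 * x[0]                       # shoots -> roots
    F[2, 3] = 0.01 * x[2]                       # roots -> reserves
    return F

def step(x, soil_water, temperature, defoliation):
    F = flows(x, soil_water, temperature)
    x = x - F.sum(axis=1) + F.sum(axis=0)       # x_i(t+1) = x_i(t) - sum_j F_ij + sum_j F_ji
    removed = defoliation * x[0]                # control: remove a proportion of shoots
    x[0] -= removed
    return x, removed

W = 0.0                                         # accumulated defoliation for the year
for _ in range(10):                             # ten within-year time steps, illustrative
    x, removed = step(x, soil_water=0.6, temperature=15.0, defoliation=0.02)
    W += removed
print(np.round(x, 2), round(W, 2))
```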
Williams used the policy improvement-value determination methodology outlined above to determine optimal defoliation strategies for each of the shrub species. Interspecific differences in policies were attributed to particular morphological and physiological characteristics. He also assessed the influence of discounting and the effects of varying climatic patterns. Finally, a series of analyses was performed to determine the effect of differential weighting of active and inactive biomass in the objective function. Possible management implications were derived from each of these studies.

CONCLUSIONS
A great many problems in natural resource management fit within the framework of Markov decision processes. It is required only that the system of interest be described as Markovian and that the associated objective functional be separable into time-specific components. In particular, these conditions are met for any harvest problem in which future system dynamics depend only on controls, driving variables and the present system state. Most natural resource problems can be made to fit within this framework. For those that cannot, variational or other approaches may prove useful. The great weakness of Markov decision procedures is the "curse of dimensionality" (Bellman, 1957). For finite state problems this means that as the number of possible states in the state space increases beyond a certain range, the optimal decision problem rapidly becomes computationally intractable. This is so essentially because the number of alternative policies one must consider increases directly with the size of the state space and the number of computations required to evaluate each of them increases with its square. One way around this problem is to 'disaggregate' the system into blocks of states which have certain common features and can be analyzed as separate Markov decision processes. The resulting values and policy configurations can then be combined to produce the overall optimal solution. This approach is suggested, for example, by Totten (1971). The alternative is of course to model the system in such a way that its dimensionality is reduced, while its essential behaviors are retained. With the development of very high speed computers these problems are not nearly so limiting as they once were. Both the algorithms and the computer technology are now available to handle systems with thousands of states and large numbers of decisions. For example, Williams (in preparation) has developed a PL/I program encoding a bound improvement procedure that can traverse data structures at the rate of nearly 100 000 bytes s⁻¹. At these rates it is possible to analyze problems with relatively high levels of complexity. We may thus expect in the near future to transcend the restrictions of linearity and static analysis, which necessarily characterized past approaches to optimal natural resource management.
APPENDIX

In this appendix the fundamental recurrence relation (6) is developed. For i ∈ S, let a_0 ∈ A_i be chosen such that

R(i, a_0) + Σ_{j=1}^{N} P_ij(a_0) V(j) = min_{a ∈ A_i} [ R(i, a) + Σ_{j=1}^{N} P_ij(a) V(j) ]

For arbitrary but fixed ε > 0 let π = [a_0, π_1], where π_1 designates the policy structure starting at time t = 1 (after action a_0 has been taken), and π_1 is chosen such that

V_π_1(j) ≤ V(j) + ε,    j ∈ S

Then

V(i) ≤ V_π(i)
     = R(i, a_0) + Σ_{j=1}^{N} P_ij(a_0) V_π_1(j)
     ≤ R(i, a_0) + Σ_{j=1}^{N} P_ij(a_0) [V(j) + ε]
     = R(i, a_0) + Σ_{j=1}^{N} P_ij(a_0) V(j) + ε Σ_{j=1}^{N} P_ij(a_0)
     ≤ min_{a ∈ A_i} [ R(i, a) + Σ_{j=1}^{N} P_ij(a) V(j) ] + ε

Thus, since ε was arbitrary, we have

V(i) ≤ min_{a ∈ A_i} [ R(i, a) + Σ_{j=1}^{N} P_ij(a) V(j) ]

But for any arbitrary policy π, including an optimal policy π*,

V_π(i) ≥ min_{a ∈ A_i} [ R(i, a) + Σ_{j=1}^{N} P_ij(a) V(j) ]

From these last two inequalities it follows that

V(i) = min_{a ∈ A_i} [ R(i, a) + Σ_{j=1}^{N} P_ij(a) V(j) ]
REFERENCES

Anderson, D.R., 1975. Optimal exploitation strategies for an animal population in a Markovian environment: a theory and an example. Ecology, 56: 1281-1297.
Bellman, R., 1957. Dynamic Programming. Princeton University Press, Princeton, NJ, 342 pp.
Bertsekas, D.P., 1976. Dynamic Programming and Stochastic Control. Academic Press, New York, NY, 398 pp.
Blackman, G.E., 1968. The applicability of the concepts of growth analysis to the assessment of productivity. In: F.E. Eckardt (Editor), Functioning of Terrestrial Ecosystems at the Primary Production Level. Proceedings of the Copenhagen Symposium, UNESCO, Paris, 516 pp.
Box, G.E.P. and Jenkins, G.M., 1976. Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco, CA, 743 pp.
D'Epenoux, F., 1963. A probabilistic production and inventory problem. Manage. Sci., 10: 98-108.
Derman, C., 1970. Finite State Markovian Decision Processes. Academic Press, New York, NY, 159 pp.
Evans, G.C., 1972. Quantitative Analysis of Plant Growth. University of California Press, Berkeley and Los Angeles, CA, 205 pp.
Fetcher, N., 1977. Productivity of cold desert shrubs: a simulation model. Ph.D. Diss., Colorado State University, Fort Collins, CO.
Fetcher, N., 1981. Effects of grazing on cold desert shrubs: a simulation model based on relative growth rate. Ecol. Modelling, 13: 49-86.
Holling, C.S., 1974. Modelling and simulation of environmental impact analysis. International Institute for Applied Systems Analysis Research Memorandum RM-74-4, International Institute for Applied Systems Analysis, Laxenburg, Austria, 55 pp.
Howard, R.A., 1960. Dynamic Programming and Markov Processes. Wiley, New York, NY.
Intriligator, M.D., 1971. Mathematical Optimization and Economic Theory. Prentice-Hall, Englewood Cliffs, NJ.
Kushner, H.J., 1971. Introduction to Stochastic Control. Holt, Rinehart and Winston, New York, NY.
Lembersky, M.R. and Johnson, K.N., 1975. Optimal policies for managed stands: an infinite horizon Markov decision process approach. Forest. Sci., 21: 109-122.
Lotka, A.J., 1956. Elements of Mathematical Biology. Dover, New York, NY.
Odum, E.P., 1969. The strategy of ecosystem development. Science, 164: 267-270.
Patten, B.C., 1975. Ecosystem linearization: an evolutionary design problem. Am. Natur., 109: 529-539.
Ross, S.M., 1970. Applied Probability Models with Optimization Applications. Holden-Day, San Francisco, CA.
Sage, A.P. and White, C.C., III, 1977. Optimum Systems Control. Prentice-Hall, Englewood Cliffs, NJ.
Shapley, L.S., 1953. Stochastic games. Proc. Natl. Acad. Sci., 39: 1095-1100.
Totten, J.C., 1971. Computational methods for finite state finite valued Markovian decision problems. Univ. California, Berkeley, Operations Research Center Tech. Rep. 71-9, 101 pp.
Walters, C.J. and Hilborn, R., 1978. Ecological optimization and adaptive management. Annu. Rev. Ecol. Syst., 9: 157-188.
Williams, B.K., 1979. Optimal stochastic control of salt desert shrubs by season and intensity of defoliation. Ph.D. Diss., Colorado State University, Fort Collins, CO, 261 pp.