J. SYSTEMS SOFTWARE 1992; 17:227-232
Using Program Dependence Graphs for Information Flow Control
C. Samuel Hsieh
Computer Science Department,
Vanderbilt University, Nashville, Tennessee
Elizabeth A. Unger
Department of Computer and Information Science, Kansas State University, Manhattan, Kansas
Ramon A. Mata-Toledo
Department of Mathematics
and Computer Science, James Madison University, Harrisonburg, Virginia

Address correspondence to C. Samuel Hsieh, Computer Science Dept., Box 1679, Station B, Vanderbilt University, Nashville, TN 37235.
© Elsevier Science Publishing Co., Inc., 655 Avenue of the Americas, New York, NY 10010. 0164-1212/92/$05.00

An algorithm that uses program dependence graphs for static program analysis to validate secure information flow is presented. The algorithm is able to deal with a wider class of information flow policies than the well-known lattice model of secure information flow. The lattice model requires that the information flow permitted by a flow policy be transitive and unidirectional; the algorithm presented here does not impose these restrictions on a flow policy.

1. INTRODUCTION

Many static program analysis techniques used for software testing, debugging, and parallelization are based on dataflow and control dependences [1, 2]. A statement I is dataflow dependent on a statement J if the computation performed by I can use a value defined by J, while a statement I is control dependent on a control predicate J if whether I should be executed depends on the outcome of evaluating J. For example, consider the program segment

1  x := a + b;
2  if i < 10 then
3    y := x * y;

Statement 3 is dataflow dependent on statement 1 because statement 3 may use the value of x defined by statement 1 in computing a value for y, and statement 3 is control dependent on the control predicate in statement 2 because whether statement 3 should be executed depends on the result of evaluating the predicate i < 10 in statement 2. For ease of textual exposition, we shall refer to a control predicate simply as a statement. Together, dataflow dependence and control dependence are referred to as program dependence. A program dependence graph is a directed graph in which each node is a statement of a program and each arc (x, y) denotes that statement y is dataflow or control dependent on statement x.

The concept of information flow, as defined by Denning [3] for use in computer security, coincides with program dependence. Denning distinguished two kinds of information flow: explicit and implicit. The former corresponds to dataflow dependence and the latter to control dependence. When the above program segment executes, both statements 1 and 2 can influence the value of y: statement 1 by dataflow dependence and statement 2 by control dependence; correspondingly, in Denning's terminology, information can flow explicitly from a and b (via x) to y, and implicitly from i to y.

Static program analysis can be used to validate a program for secure information flow, i.e., to determine whether execution of a program can potentially generate information flow that violates an established flow policy. Denning [3, 4] assumed that a flow policy is a
lattice¹ (C, ≤), where C is a set of security classes and ≤ is a binary relation with certain properties (among them, transitivity and antisymmetry). Each object x accessed by a program is assigned a security class denoted C(x). Information flow from an object x to an object y is admissible if and only if C(x) ≤ C(y). Several static analysis techniques (e.g., [4, 5]) previously developed for validating secure information flow are based on this model.

By the definition of a lattice, the relation ≤ is transitive. Hence, if a flow policy permits information flow from x to y and from y to z, then it must also permit information flow from x to z, because if C(x) ≤ C(y), C(y) ≤ C(z), and ≤ is transitive, then C(x) ≤ C(z) must hold. In other words, the information flow between objects permitted by a flow policy must be transitive. Also, by the definition of a lattice, the relation ≤ is antisymmetric. Hence, if C(x) ≠ C(y) and information flow from x to y is permitted, then information flow from y to x must be forbidden, since C(x) ≤ C(y) and C(y) ≤ C(x) cannot both hold unless C(x) = C(y) holds. That is, information flow between any two objects permitted by a flow policy must be unidirectional unless the objects belong to the same security class.

As Denning pointed out, these restrictions on flow policies are sometimes impractical [6]. For example, consider this simple information flow policy: information flow from a classified file x to an unclassified file z is permitted only if such information flow is made via an authorized person y. In other words, this policy permits information flow from x to y and from y to z, but forbids information flow from x directly to z without passing through y. Since the information flow permitted by this policy is not transitive, this policy cannot be specified in the lattice model. The static analysis algorithm presented in this article makes these restrictions on information flow policies unnecessary.

Our analysis algorithm is based on program dependence graphs. In contrast to previously proposed techniques based on the lattice model, our analysis algorithm can deal with a wider class of flow policies: a flow policy is simply any binary relation over a set of objects; it need be neither transitive nor antisymmetric.

¹ Let C be a set, and ≤ a binary relation over C. (C, ≤) is a lattice if ≤ is reflexive, transitive, and antisymmetric, and every pair of elements in C has a unique least upper bound and a unique greatest lower bound. An element c is an upper bound of a pair of elements a and b if a ≤ c and b ≤ c, and a lower bound if c ≤ a and c ≤ b. An upper (lower) bound c of a and b is a least upper (greatest lower) bound of a and b if there is no upper (lower) bound d of a and b such that d ≠ c and d ≤ c (c ≤ d).
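The difference between a lattice-based policy and an arbitrary binary relation can be checked mechanically. The following Python sketch is only an illustration: the function names, the integer security classes, and the object names are assumptions made for this example and are not taken from the article. It tests transitivity and antisymmetry for a relation given as a set of pairs.

```python
def is_transitive(relation):
    """True if (a, b) and (b, c) in the relation always imply (a, c)."""
    return all((a, d) in relation
               for (a, b) in relation
               for (c, d) in relation
               if b == c)


def is_antisymmetric(relation):
    """True if (a, b) and (b, a) are both present only when a == b."""
    return all(a == b or (b, a) not in relation for (a, b) in relation)


# Policy from the introduction: classified file x, authorized person y,
# unclassified file z. Flow is allowed from x to y and from y to z only.
sanitizer_policy = {("x", "y"), ("y", "z")}

# A policy induced by security classes, as in the lattice model.
# Hypothetical classes: a flow from p to q is allowed iff CLASS[p] <= CLASS[q].
CLASS = {"p": 0, "q": 1, "r": 2}
lattice_policy = {(a, b) for a in CLASS for b in CLASS if CLASS[a] <= CLASS[b]}

print(is_transitive(sanitizer_policy))   # False: (x, z) is missing
print(is_transitive(lattice_policy))     # True
print(is_antisymmetric(lattice_policy))  # True: the classes are all distinct
```

Under these assumptions, the policy induced by security classes passes both checks, while the policy that routes information through y fails the transitivity check, which is why it cannot be expressed in the lattice model.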
2. PROGRAM DEPENDENCE AND INFORMATION FLOW
We first describe a program dependence graph in the light of information flow. Information flows to an object only if the object serves as a destination of data transfer. The following program will be used as an example throughout this article:

    a, b, c, d: integer;
    U1, U2, U3, U4: files;
 1  input a from U1;
 2  input b from U2;
 3  c := a + 5;
 4  d := 0;
 5  if c < 10 then
 6    output b to U,
 7  else while d < 99 do
 8    begin output b to U,;
 9      b := b + 1;
10      input d from U,
    end; (*while*)
11  output d to U4;

As shown above, we identify each statement in the program by a unique number. Only in input, output, and assignment statements can an object serve as a destination of data transfer. These statements are called definitions. All statements except 5 and 7 are definitions. Statements 5 and 7 do not specify explicit destinations of information transfer, but they cause implicit information flow. The files U1, U2, U3, and U4 are objects external to the program, such as secondary storage files, human users interacting with the computer program, or any device through which the program interacts with external physical processes.

For each definition n, we use dest(n) to denote the object that serves as the destination of data transfer in n. For each statement n, we define exp(n) to be the set of objects whose values are explicitly referenced by n. Also, for each statement n, imp(n) denotes the innermost if or while statement within whose scope n appears. In other words, a statement n is in the range of influence of imp(n). Table 1 shows dest(n), exp(n), and imp(n) for each statement in our example program. Constant objects are ignored, since information flow can never occur through constant objects.

To keep track of how information may be propagated from object to object by the program, we need to define the reaching relation between statements in a program. A definition m reaches a statement n if there is a control flow path from m to n such that the statements
along the path (not including m and n) do not redefine dest(m). In our example, statement 1 reaches all statements except itself, since there are control flow paths from it to all other statements and a is never redefined in the program. On the other hand, statement 6 does not reach statement 10 because there is no control flow path from statement 6 to statement 10, and statement 2 does not reach statement 10 because any control flow path from statement 2 to statement 10 must pass statement 9, which redefines dest(2) (i.e., b). The set of definitions reaching a statement n can be determined efficiently. Algorithms to do this can be found in texts on compiler construction (e.g., [7, 8]), and will not be repeated here. We use reach(n) to denote the set of definitions reaching a statement n. Table 2 shows reach(n) for each statement n of the example program. Definitions 8, 9, and 10 all reach themselves because the while loop (7)-(10) can be executed more than once.

Table 1. dest, exp, and imp

The program dependence graph of a program is a directed graph (N, E), where the node set N is the set of statements and a directed arc (m, n) exists in E if and only if either m reaches n and dest(m) is in exp(n), or m is imp(n). The program dependence graph of our example program is shown in Figure 1. The first condition defines a dataflow dependence of n on m since execution of n may use a value generated by m. In other words, information can flow from dest(m), and thereby exp(m), to dest(n) if n is a definition, or to the boolean value based on which a control branching decision is made if n is a while or if-then. On the other hand, the second condition defines a control dependence of n on m, since n may or may not be executed, depending on the boolean value evaluated at m (m must be a while-do or an if-then since m is imp(n)). In other words, the information conveyed by such a boolean value implicitly influences the value of dest(n). That is, information can flow from such a boolean value, and thereby from the values used to derive this boolean value, to dest(n).

Applying the above reasoning inductively, it can be concluded that if a directed path from a definition i to a definition j exists in the program dependence graph, then information may flow from dest(i), and hence from exp(i), to dest(j). This conclusion will be used in the next section when we present an algorithm to derive the information flow relation between objects accessed by a program.

We have discussed construction of program dependence graphs for structured programs. Construction of program dependence graphs for programs with arbitrary control flow graphs can be found in [1]. The basic idea is the use of the immediate forward dominator of a control-branching node in a control flow graph to determine the set of statements that are control dependent on the control predicate. This technique has been presented in several places for information flow [4], program slicing [9], and program dependence graphs [1], and so will not be repeated here. Construction of program dependence graphs for interprocedural analysis can be found in [10].
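The construction described in this section can be sketched in Python for the example program. This is a minimal illustration, not the authors' implementation: the names DEST, EXP, IMP, SUCC, reaching_definitions, and dependence_arcs are invented here, the reaching definitions are computed with a plain fixed-point iteration, and the files used by statements 6, 8, and 10 are assumed (U3, U3, and U2) because their subscripts are not legible in the source.

```python
# Example program of Section 2, one entry per numbered statement.
# ASSUMPTION: the files of statements 6, 8, and 10 are illegible in the
# source; "U3", "U3", and "U2" are placeholders chosen for illustration.
DEST = {1: "a", 2: "b", 3: "c", 4: "d", 6: "U3", 8: "U3",
        9: "b", 10: "d", 11: "U4"}
EXP = {1: {"U1"}, 2: {"U2"}, 3: {"a"}, 4: set(), 5: {"c"}, 6: {"b"},
       7: {"d"}, 8: {"b"}, 9: {"b"}, 10: {"U2"}, 11: {"d"}}
IMP = {6: 5, 7: 5, 8: 7, 9: 7, 10: 7}
# Control flow successors: 5 branches to 6 or 7; 7 enters the loop or exits.
SUCC = {1: [2], 2: [3], 3: [4], 4: [5], 5: [6, 7], 6: [11],
        7: [8, 11], 8: [9], 9: [10], 10: [7], 11: []}


def reaching_definitions(succ, dest):
    """Return reach[n]: the definitions that reach the entry of statement n."""
    pred = {n: [] for n in succ}
    for m, targets in succ.items():
        for n in targets:
            pred[n].append(m)
    reach = {n: set() for n in succ}
    changed = True
    while changed:  # iterate until no set grows any further
        changed = False
        for n in succ:
            incoming = set()
            for m in pred[n]:
                # Definitions leaving m: those reaching m whose destination
                # m does not redefine, plus m itself if m is a definition.
                out = {d for d in reach[m]
                       if m not in dest or dest[d] != dest[m]}
                if m in dest:
                    out.add(m)
                incoming |= out
            if incoming != reach[n]:
                reach[n] = incoming
                changed = True
    return reach


def dependence_arcs(reach, dest, exp, imp):
    """Arc (m, n) exists if m reaches n and dest(m) is in exp(n), or m is imp(n)."""
    arcs = set()
    for n in exp:
        for m in reach[n]:
            if dest[m] in exp[n]:
                arcs.add((m, n))  # dataflow dependence
        if n in imp:
            arcs.add((imp[n], n))  # control dependence
    return arcs


REACH = reaching_definitions(SUCC, DEST)
ARCS = dependence_arcs(REACH, DEST, EXP, IMP)
print(sorted(REACH[10]))
print(sorted(ARCS))
```

With these assumptions the computed reach sets do not depend on the placeholder file names, and the resulting arcs include the chain 1, 3, 5, 7, 10, 11, which combines dataflow arcs (1 to 3, 3 to 5, 10 to 11) with control arcs (5 to 7, 7 to 10).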
3. FLOW POLICY AND FLOW RELATION

Let O be a set of objects between which information flow is to be regulated. O is usually some subset of the variables and external objects accessed by a program. An information flow policy P is a binary relation over O specifying the admissible information flow between objects in O. That is, information flow from x to y (x, y ∈ O) is permitted by a flow policy P if and only if the tuple (x, y) is in P.
Table 2. reach(n) for Each Statement n

n     reach(n)
1     (empty)
2     1
3     1, 2
4     1, 2, 3
5     1, 2, 3, 4
6     1, 2, 3, 4
7     1, 2, 3, 4, 8, 9, 10
8     1, 2, 3, 4, 8, 9, 10
9     1, 2, 3, 4, 8, 9, 10
10    1, 3, 4, 8, 9, 10
11    1, 2, 3, 4, 6, 8, 9, 10
Figure 1. A program dependence graph.
Note that we do not assume any properties of P. For example, P can specify that information flow from x to z is forbidden, while information flow from x to y and from y to z is permitted. That is, P can stipulate that any information flow from x to z should be made via y, and any information flow from x to z that does not pass through y is forbidden. The ability to specify such intransitivity is especially important when information flow between external objects is concerned, as demonstrated by the following example adapted from Denning [6]. Suppose x is a classified file to which only authorized persons have access, and z is an unclassified file open to the public. A reasonable flow policy should prohibit information flow from x directly to z. However, the classified file often needs to be edited for release to the public after sensitive information has been sanitized. Let y be a sanitizer (an authorized person or a computer program) that edits information from x and then transfers the sanitized information to the unclassified file z. Hence, a reasonable flow policy should contain (x, y) and (y, z) but should not contain (x, z). In contrast, such a policy cannot be specified in the lattice model and, therefore, static analysis techniques based on the lattice model cannot be used to determine if execution of a computer program may violate such a flow policy. This is because the lattice model and the analysis techniques based on it require that, as long as information flow from x to y and from y to z is permitted by a flow policy, information flow from x to z must also be permitted by the policy, whether or not such information flow passes through y. Our analysis algorithm, presented below, does not place this restriction on flow policies.
Another restriction imposed by the lattice model is that information flow permitted by a flow policy must be unidirectional: since a lattice is, by definition, antisymmetric, if information flow from an object x to an object y is permitted, then information flow from y to x must be forbidden unless x and y are in the same security class. Our analysis algorithm does not impose this restriction. Using the program dependence graph of a program, our algorithm finds the information flow relation R for a given set of objects O such that a tuple (x, y) in R (where x, y ∈ O) implies that execution of the program can cause information to flow from x to y without passing through any other object in O. Hence, the algorithm distinguishes between information flow from an object x to an object z via another object y and that from x to z without passing y. The relation R found by the algorithm can then be checked against a given flow policy P: if there is a tuple in R but not in P, then execution of the computer program may violate the flow policy P.
Let N be the node set of a program dependence graph (i.e., the set of statements of a program), E be the arc set of the program dependence graph, and O be a set of objects. The following algorithm generates the flow relation R between objects in O:

Input: (N, E) and O
Output: R, initially a null relation
S: a stack, initially empty
visited: array [N] of boolean values

 1  for each object x ∈ O do
 2  begin
 3    set visited[i] := false for all i ∈ N;
 4    for each statement y do
 5      if x ∈ exp(y) then push y into S;
 6    while S is not empty do
 7    begin
 8      pop a statement z from S;
 9      visited[z] := true;
10      if dest(z) ∈ O then
11        add (x, dest(z)) to R
12      else
13        for each arc (z, w) in E do
14          if not visited[w] and w not in S then
15            push w into S
16    end; (*while*)
17  end; (*outermost for*)

Readers familiar with graph traversal will find similarity between the algorithm and depth-first traversal of a graph: the loop body of the outermost for statement (lines 2-17) traverses the program dependence graph from those nodes (i.e., statements) that use (i.e., reference) the object x, with an additional constraint imposed by line 10 on the graph traversal. Lines 4-5 initialize the stack with statements that use x, and lines 6-16 constitute a depth-first traversal of the program dependence graph from these statements. Line 11 puts (x, dest(z)) in R because z is reachable from a statement using x (i.e., there is a path from a statement using x to z in the program dependence graph), hence information may flow from x to dest(z). The main difference between lines 2-17 and an ordinary depth-first traversal is the constraint imposed by line 10: when a node z is visited in the traversal and dest(z) is an object in O (line 10), further traversal from the node z will not be performed (i.e., lines 13-15 will not be executed). This constraint ensures that a tuple (x, z) is added to R if and only if information may flow from x to z without passing any other objects in O.
In other words, the algorithm will not add a tuple (x, z) to R if information flow from x to z must pass through some other object in O.
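The traversal above translates almost line for line into Python. The sketch below is one possible rendering under stated assumptions: the function name flow_relation and the dictionary-based encoding of dest and exp are choices made for this illustration. It is applied to a two-statement program, y := x followed by z := y, whose dependence graph has the single arc (1, 2).

```python
def flow_relation(nodes, arcs, dest, exp, objects):
    """Return the set R of direct flows (x, y) between objects in `objects`.

    nodes: iterable of statement identifiers
    arcs: iterable of (m, n) arcs of the program dependence graph
    dest: dict mapping each definition to its destination object
    exp: dict mapping each statement to the set of objects it references
    """
    succ = {n: [] for n in nodes}
    for m, n in arcs:
        succ[m].append(n)
    relation = set()
    for x in objects:
        visited = set()
        # Lines 4-5: start from every statement that references x.
        stack = [n for n in nodes if x in exp.get(n, set())]
        while stack:
            z = stack.pop()
            visited.add(z)
            if dest.get(z) in objects:
                # Lines 10-11: record the flow and do not traverse past z.
                relation.add((x, dest[z]))
            else:
                for w in succ[z]:
                    if w not in visited and w not in stack:
                        stack.append(w)
    return relation


# Two-statement program:  1  y := x;   2  z := y;
NODES = [1, 2]
ARCS = [(1, 2)]
DEST = {1: "y", 2: "z"}
EXP = {1: {"x"}, 2: {"y"}}

print(sorted(flow_relation(NODES, ARCS, DEST, EXP, {"x", "y", "z"})))
print(sorted(flow_relation(NODES, ARCS, DEST, EXP, {"x", "z"})))
```

When y is one of the regulated objects, the traversal stops at statement 1 and only (x, y) and (y, z) are reported. When y is left out of the object set, the traversal continues through statement 1 and reports (x, z) as a direct flow.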
For example, suppose x, y, and z are in O. Consider the simple program

1  y := x;
2  z := y;

The program dependence graph of this simple program consists of two nodes 1 and 2, and an arc from node 1 to node 2 representing dataflow dependence of statement 2 on statement 1. Using this program dependence graph, our algorithm will add (x, y) and (y, z) to R but will not add (x, z) to R. Potential information flow from x to z, however, is not ignored; it is indicated by the presence of both (x, y) and (y, z) in R, while the absence of (x, z) implies that information flow from x to z without passing through other objects in O is impossible. This distinction between direct flow and transitive flow between objects in O is important because an information flow policy is not necessarily transitive. For example, suppose the information flow policy P consists of (x, y) and (y, z) only. Such a policy forbids any information flow from x to z that does not pass y, since (x, z) is not in the policy. The simple program obviously cannot violate this policy. However, without the constraint imposed by line 10 on the graph traversal, the algorithm would add (x, z) to R, which would incorrectly imply that the simple program can violate P. As mentioned previously, techniques based on the lattice model cannot make this distinction because they assume transitivity of admissible information flow.

Often, a flow policy is concerned only with the external objects accessed by a program, since only through these objects can the effects of information flow be observed. Let us return to the example program given in the previous section. Suppose O is the set of the external objects U1, U2, U3, and U4. The information flow relation R generated by applying the above algorithm to the program dependence graph given in the previous section is depicted in Figure 2 as a directed graph representing each object in O as a node and each tuple (x, y) in R as an arc from node x to node y. To grasp the intuitive meaning of an information flow relation, consider the tuple (U1, U4).
Our algorithm adds this tuple to R because the following conditions all hold: U1 is in exp(1), the path 1, 3, 5, 7, 10, 11 exists in the program dependence graph, U4 is dest(11), and for all statements i along the path except 11, dest(i) is not an object in O. A careful examination of the example program in the previous section reveals that if the value 0 is output to U4 at statement 11, then it can be deduced that the value of a read from U1 by statement 1 is < 5. Hence, the value transferred to U4 can reveal something about the value read from U1.

Figure 2. An information flow relation.

An information flow relation generated by our algorithm for a program can be compared with a given flow policy to determine if executing the program may violate the flow policy; a tuple in the flow relation but not in the flow policy implies that executing the program may violate the flow policy. If a flow policy P is available at the time when a program is being analyzed, a potential improvement of our algorithm is possible: the relation R need not be explicitly generated, since at line 11 of the algorithm we may test whether (x, dest(z)) is in P instead of adding (x, dest(z)) to R.
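The improvement mentioned above, testing each tuple against P during the traversal, can be sketched as follows. This Python fragment is an illustration under stated assumptions, not the authors' code: the function name find_violations is invented, the arcs are those that follow from the definitions of Section 2, the files written by statements 6 and 8 and read by statement 10 are assumed to be U3, U3, and U2 because their subscripts are not legible in the source, and the policy shown is hypothetical.

```python
def find_violations(arcs, dest, exp, objects, policy):
    """Traverse as in the flow-relation algorithm, but test each direct
    flow against `policy` instead of storing the relation R."""
    nodes = sorted(exp)
    succ = {n: [] for n in nodes}
    for m, n in arcs:
        succ[m].append(n)
    violations = set()
    for x in objects:
        visited = set()
        stack = [n for n in nodes if x in exp[n]]
        while stack:
            z = stack.pop()
            visited.add(z)
            if dest.get(z) in objects:
                if (x, dest[z]) not in policy:  # the test made at line 11
                    violations.add((x, dest[z]))
            else:
                for w in succ[z]:
                    if w not in visited and w not in stack:
                        stack.append(w)
    return violations


# Example program of Section 2. ASSUMPTION: the files of statements 6, 8,
# and 10 are placeholders (U3, U3, U2); the source subscripts are illegible.
DEST = {1: "a", 2: "b", 3: "c", 4: "d", 6: "U3", 8: "U3",
        9: "b", 10: "d", 11: "U4"}
EXP = {1: {"U1"}, 2: {"U2"}, 3: {"a"}, 4: set(), 5: {"c"}, 6: {"b"},
       7: {"d"}, 8: {"b"}, 9: {"b"}, 10: {"U2"}, 11: {"d"}}
ARCS = [(1, 3), (3, 5), (2, 6), (5, 6), (4, 7), (10, 7), (5, 7), (2, 8),
        (9, 8), (7, 8), (2, 9), (9, 9), (7, 9), (7, 10), (4, 11), (10, 11)]
OBJECTS = {"U1", "U2", "U3", "U4"}

# Hypothetical policy: every flow found here is allowed except U1 to U4.
POLICY = {("U1", "U3"), ("U2", "U3"), ("U2", "U4")}

print(find_violations(ARCS, DEST, EXP, OBJECTS, POLICY))
```

Under these assumptions the only reported tuple is (U1, U4), the implicit flow along the path 1, 3, 5, 7, 10, 11 discussed above. Returning the set of violating tuples, rather than stopping at the first one, is a design choice of this sketch; it lets an analyst see every potential violation in one pass.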
4. DISCUSSION AND CONCLUSION
We present an algorithm that uses program dependence graphs for static analysis of programs to validate secure information flow. The main conclusion is that our static analysis can deal with a wider class of flow policies than the lattice model of secure information flow. A flow policy that our algorithm deals with can be any binary relation between objects accessed by a program. In contrast, previously proposed static analysis techniques based on the lattice model require that the relation of admissible information flow between objects be transitive and unidirectional. Our algorithm makes these restrictions unnecessary. Static analysis based on program dependence has been applied in several other areas of software engineering, including software testing [11-13], debugging [9, 10, 14-20], maintenance [9, 16-18], code optimization [1], and parallelization [21]. A recent article by Podgurski and Clarke [22] formally characterizes program dependence and examines the implications of dependence analysis in software testing, debugging, and maintenance. The usefulness of the analysis techniques developed for the various applications can potentially be enhanced by further progress in dependence analysis, for example, the development of better techniques for interprocedural analysis and for handling pointer variables and arrays. Like static program analysis, information flow control is an established research area, and abounds in published research. This article has focused on an
efficient algorithm (as opposed to an abstract model for information flow control [3, 26]) for static, compile-time analysis (as opposed to run-time certification [23, 24]) to validate information flow through legitimate and storage channels (as opposed to covert channels [25]). A comprehensive survey of previous work in information flow control can be found in Denning's book on data security and cryptography [6].

REFERENCES
1. J. Ferrante, K. Ottenstein, and J. Warren, The Program Dependence Graph and its Use in Optimization, ACM TOPLAS 9, 319-349 (1987).
2. K. J. Ottenstein and L. M. Ottenstein, The Program Dependence Graph in a Software Development Environment, ACM SIGPLAN Notices 19, 177-184 (1984).
3. D. E. Denning, A Lattice Model of Secure Information Flow, Commun. ACM 19, 236-243 (1976).
4. D. E. Denning and P. J. Denning, Certification of Programs for Secure Information Flow, Commun. ACM 20, 504-513 (1977).
5. M. Mizuno and A. E. Oldehoeft, Information Flow Control in a Distributed Object-Oriented System with Statically Bound Object Variables, in Proceedings of the 10th National Computer Security Conference, 1987, pp. 56-67.
6. D. E. Denning, Cryptography and Data Security, Addison-Wesley, Reading, Massachusetts, 1982.
7. A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools, Addison-Wesley, Reading, Massachusetts, 1986.
8. K. Kennedy, A Survey of Data Flow Analysis Techniques, in Program Flow Analysis (S. S. Muchnick and N. D. Jones, eds.), Prentice-Hall, Englewood Cliffs, New Jersey, 1981.
9. M. Weiser, Program Slicing, IEEE TOSE SE-10, 352-357 (1984).
10. S. Horwitz, T. Reps, and D. Binkley, Interprocedural Slicing Using Dependence Graphs, ACM TOPLAS 12, 26-60 (1990).
11. B. Korel, The Program Dependence Graph in Static Testing, Info. Process. Lett. 24, 103-138 (1987).
12. J. W. Laski and B. Korel, A Data Flow Oriented Program Testing Strategy, IEEE TOSE SE-9, 347-354 (1983).
13. S. Rapps and J. Weyuker, Selecting Software Test Data Using Flow Information, IEEE TOSE SE-11, 367-375 (1985).
14. H. Agrawal and J. R. Horgan, Dynamic Program Slicing, in Proceedings of the ACM SIGPLAN '90 Conference on Programming Languages and Implementation, White Plains, New York, 1990, pp. 246-256.
15. J. F. Bergeretti and B. A. Carre, Information Flow and Dataflow Analysis of While-Programs, ACM TOPLAS 7, 37-61 (1985).
16. L. D. Fosdick and L. J. Osterweil, Data Flow Analysis in Software Reliability, ACM Comput. Surveys 8, 306-330 (1976).
17. J. C. Hwang, M. W. Du, and C. R. Chou, Finding Program Slice for Recursive Procedures, in Proceedings of the IEEE COMPSAC 88, Chicago, 1988, pp. 220-227.
18. C. S. Hsieh, Slice, Chunk, and Dataflow Anomaly as Datalog Rules, J. Systems Software 16, 197-203 (1991).
19. B. Korel and J. Laski, Dynamic Program Slicing, Info. Process. Lett. 29, 155-163 (1988).
20. M. Weiser, Programmers Use Slices When Debugging, Commun. ACM 25, 446-452 (1982).
21. D. A. Padua and M. J. Wolfe, Advanced Compiler Optimizations for Supercomputers, Commun. ACM 29, 1184-1201 (1986).
22. A. Podgurski and L. A. Clarke, A Formal Model of Program Dependences and Its Implications on Software Testing, Debugging, and Maintenance, IEEE TOSE 16, 965-979 (1990).
23. J. S. Fenton, Memoryless Subsystems, Comput. J. 17, 143-147 (1974).
24. A. K. Jones and R. J. Lipton, The Enforcement of Security Policies for Computation, ACM Oper. Syst. Rev. 9, 197-206 (1975).
25. B. W. Lampson, A Note on The Confinement Problem, Commun. ACM 16, 613-615 (1973).
26. J. A. Goguen and J. Meseguer, Security Policies and Security Models, in Proceedings of the IEEE Symposium on Security and Privacy, Oakland, CA, 1982, pp. 11-20.