Ontology Based Approach to Bayesian Student Model Design

Ani Grubišić (a), Slavomir Stankov (a), Ivan Peraić (b)

(a) Faculty of Science, Teslina 12, Split, Croatia
(b) High School "Biograd na moru", Augusta Šenoe 29, 23210 Biograd na moru, Croatia

[email protected], [email protected], [email protected]

Corresponding author: Ani Grubišić, [email protected]
Faculty of Science, Teslina 12, Split, Croatia
Tel: ++385 21 385 133, Fax: ++385 21 384 086
Abstract

A probabilistic student model based on a Bayesian network enables conclusions to be drawn about the state of a student's knowledge, and the further learning and teaching process depends on these conclusions. To implement a Bayesian network in a student model, it is necessary to determine the "a priori" probabilities of the root nodes, as well as the conditional probabilities of all other nodes. In our approach, we enable non-empirical mathematical determination of the conditional probabilities, while the "a priori" probabilities are determined empirically, based on the knowledge test results. The concepts that are believed to have been learned or not learned represent the evidence. Based on the evidence, it is concluded which concepts need to be re-learned and which do not. The study described in this paper examined 15 ontology-based Bayesian student models. In each model, special attention was devoted to defining the "a priori" probabilities, the conditional probabilities and the way the pieces of evidence are set, in order to test the success of student knowledge prediction. Finally, the obtained results are analyzed and guidelines for ontology-based Bayesian student model design are presented.
Keywords: intelligent tutoring systems, e-learning, knowledge modeling, probabilistic algorithms, Bayesian networks, conditional probabilities
1. Introduction

Today, there is a ubiquitous need and desire to improve the quality and availability of various educational systems. Education has become a lifelong process and need, and its quality has become an imperative. It became clear that the latter cannot be achieved without appropriate and effective use of information and communication technology (ICT) in the learning and teaching process. The use of ICT in learning and teaching enabled the concept called e-learning. High-quality implementation of e-learning, in the form of e-learning systems, brings many advantages to the learning and teaching process and enables the desired new, modern and quality education. The introduction of these technologies and innovations into the field of education not only reduces the cost of applying pedagogical theory, but also opens opportunities to explore models from different fields (Millán & Pérez-de-la-Cruz, 2002). One special class of e-learning systems are Intelligent Tutoring Systems (ITSs), which, in contrast to traditional systems that support the learning and teaching process, have the ability to adapt to each student. It is this ability to adapt to each student that allows improvement of the learning and teaching process, because it has been shown that the best approach is one-on-one tutoring (Bloom, 1984). Intelligent tutoring systems are a generation of computer systems intended for the support and enhancement of the learning and teaching process in a selected domain knowledge, thereby respecting the individuality of those who teach and those who are
taught ((Wenger, 1987), (Ohlsson, 1986), (Sleeman & Brown, 1982)). An intelligent tutoring system becomes the student's personal "computer teacher". A computer teacher, on the one side, is always cheerful and shows no negative emotions, while a student, on the other side, has no need to hide ignorance and can communicate freely. Intelligent tutoring systems can adapt the content and the manner of presentation of certain topics to different student abilities. In this sense, knowledge is the key to intelligent behavior, and therefore intelligent tutoring systems have the following kinds of basic knowledge: (i) knowledge that the system has about the domain knowledge (expert module), (ii) teaching principles and methods for applying those principles (teacher module), and (iii) methods and techniques for modeling the student's acquisition of knowledge and skills (student module).

Nowadays, ontology is commonly used to formalize knowledge in ITSs (Berners-Lee, Hendler & Lassila, 2001). An ontology describes a conceptual model of a domain; that is, it represents objects, concepts and other entities that are believed to exist, and the relations among them (Genesereth and Nilsson, 1987, according to (Gruber, 1993)). The main structural elements of the conceptual model are concepts and relations. Consequently, every area of human endeavor can be presented with a set of properly related concepts that correspond to the appropriate domain knowledge. An ontological description of the domain knowledge provides a simple formalization of declarative knowledge using various tools that support working with concepts and relations.

The component of an ITS that represents the student's current state of knowledge and skills is called the student model. The student model is a data structure, and diagnosis is the process that manipulates it. The student model as such represents a key component of an ITS. The design of these two components is called the student modeling problem (VanLehn, 1988). If a student model is "bad" to the extent that it does not even approximately describe the student's characteristics, then all the decisions of other ITS components that are based on this model are of poor quality. Therefore, considerable research is carried out in the field of student modeling.

The design and implementation of intelligent tutoring systems has systematically contributed, and still contributes, to the development of methods and techniques of artificial intelligence (AI). Artificial intelligence, as the area that connects computers and intelligent behavior, emerged in the late 1950s and early 1960s with pioneers such as Alan Turing, Marvin Minsky, John McCarthy and Allen Newell (Urban-Lurain, 1996). AI is essentially oriented towards knowledge representation, natural language understanding and problem solving, all of which are equally important for the development of the intelligent tutoring concept (Beck, Stern & Haugsjaa, 1996).

One of the techniques widely used in different areas of artificial intelligence is Bayesian networks. The idea of Bayesian networks is not new, as they began to be used in the 1980s in the field of expert systems. The field truly expanded in the 1990s, probably due to the increase in computer speed and renewed interest in distributed systems. Large computational complexity is one of the biggest barriers to a wider use of Bayesian networks.
Unlike traditional expert systems, whose main purpose is modeling the experts' knowledge and replacing them in the process of planning, analyzing, learning and decision making, the purpose of a Bayesian network is modeling a particular problem domain. Thus
they become an aid to experts while studying the causes and consequences of the problems they model (Charniak, 1991). It is extremely important to put emphasis on domain modeling as the most important feature of Bayesian networks. Domain modeling refers to collecting and determining all the values necessary for Bayesian network initialization. In particular, it refers to modeling the dependencies between variables. Dependencies are modeled using a network structure and a set of conditional probabilities (Charniak, 1991).

Integration of student models with Bayesian networks in ITSs is one way to facilitate student learning. Specifically, such a model allows making conclusions about the actual student knowledge. Also, it enables a computer tutor to guide the learning and teaching process towards learning only those concepts that the student has not already learned. The aim of this paper is to design a student model based on Bayesian networks and ontologies, and to compare the results of its predictions with the actual student knowledge. In the majority of Bayesian networks, all probabilities are determined empirically, and that presents the biggest problem in their design. Therefore, novel methods for parameter estimation in Bayesian networks are an important research endeavor, given the utility of the Bayesian approach for student modeling. In our approach, we enable non-empirical mathematical determination of the conditional probabilities, while the "a priori" probabilities are determined empirically, based on the knowledge test results.

In the second chapter, attention is paid to the theoretical background underlying Bayesian networks. The third chapter describes fifteen probabilistic student models that differ in the way the conditional probabilities are defined and in the way the pieces of evidence are set. Finally, the obtained results are analyzed and guidelines for ontology-based Bayesian student model design are presented.
2. Application of Bayesian theory in student modeling

One major difficulty that arises in student modeling is uncertainty. An ITS needs to build a student model from small amounts of very uncertain information, because certain information can be obtained only from students' activities in the system. If these activities do not occur, the diagnosis must be carried out on the basis of uncertain information. Moreover, because an ITS bases its decisions on the student model, the uncertainty in the student model contributes to a poorly adaptive learning and teaching process. The student model is built based on observations that the ITS makes about the student. The student model can be viewed as a compression of these observations: raw data are combined, some of them are ignored, and the result is a summary of beliefs about the student. Powerful general theories of decision making have been developed and designed specifically for managing uncertainty. One of them is Bayesian probability theory ((Bayes, 1763), (Cheng & Greiner, 2001), (Mayo, 2001)), which deals with reasoning under uncertainty. Bayesian networks are one of the current approaches to modeling under uncertainty ((Mayo, 2001), (Conati et al., 1997), (VanLehn et al., 1998), (Conati, Gertner & VanLehn, 2002), (Gamboa & Fred, 2002)). This technique combines the strict formalism of probability with a graphical representation and efficient inference mechanisms.
A Bayesian network is a probabilistic graphical model that displays dependencies between nodes (Pearl, 1988). It is a directed acyclic graph in which nodes represent variables and edges represent their interdependence. A node is a parent of a child if there is an edge from the former to the latter. Nodes that have no parents are called roots, and these variables are placed in the Bayesian network first. The roots are not influenced by any node, while they affect their children. When we have also placed all nodes that have no children in the Bayesian network, the structure of the Bayesian network is defined (Korb & Nicholson, 2011). After the structure of the Bayesian network is defined, it is necessary to define the possible values that each node can take and the conditional probabilities of the nodes (Korb & Nicholson, 2011). For nodes without parents, only "a priori" probabilities have to be defined. "A priori" probabilities for all other nodes can be derived from the corresponding conditional probability tables, which are designed based on the "a priori" probabilities of their parents. Therefore, it is superfluous to explicitly specify "a priori" probabilities for nodes that have parents (Korb & Nicholson, 2011). The dimension of a conditional probability table is determined by the parents of the node. In the case of discrete binary variables, for a node with n parents the conditional probability table has 2^n rows. Each row contains one of the 2^n combinations of values T and F for the parents; we denote by t the number of values T in a row. We will use this number t for enabling non-empirical mathematical determination of probabilities. The conditional probability function has two distinguished values: (i) the probability that a student does not know the concept itself although he/she knows all the parents of the node (an unlucky slip), and (ii) the probability that a student knows the concept itself although he/she does not know any of the parents of the node (a lucky guess). In the literature, these values are set to 0.1 (Mayo, 2001). Consequently, the probability of truthful knowing is 1 - 0.1 = 0.9; that is, if all the parents are known, we predict that the concept itself is known with probability 0.9. If the conditional probability tables are known, the Bayesian network can be used for probabilistic inference on the probability of any node in the network. Based on the Bayesian network, the ITS can calculate the probability of all unknown variables based on the known variables (the evidence) (Charniak, 1991).
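As an illustration of the table size and of the role of t, here is a minimal plain-Python sketch; the names are ours, not from the paper's implementation:

```python
from itertools import product

SLIP = 0.1   # P(concept not known | all parents known), the unlucky slip
GUESS = 0.1  # P(concept known | no parent known), the lucky guess

def cpt_rows(n_parents):
    """Enumerate the 2^n rows of a binary conditional probability table.

    Yields (parent_values, t) pairs, where t is the number of parents
    set to T in that row.
    """
    for row in product((True, False), repeat=n_parents):
        yield row, sum(row)

# A node with 3 parents has 2**3 = 8 rows:
for row, t in cpt_rows(3):
    print(row, "t =", t)
```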
3. Ontological Approach to the Bayesian Student Model Design

In this section we describe an approach to Bayesian student model design in an intelligent tutoring system whose expert knowledge is presented in the form of an ontology. We observe only the model, not the diagnosis process itself. For a student who has never learned some domain knowledge, we believe that he/she knows the concepts from the domain knowledge graph with very small probability; that is, we draw conclusions about his/her knowledge without testing it. We consider, in this case, that the student knows a concept from the domain knowledge with probability 0. Likewise, if we have tested a student's knowledge of some domain knowledge and determined with certainty that the student knows all the concepts from that domain knowledge, then we can argue that the student knows all the concepts from that domain knowledge with probability 1.
The problem is how to determine the probabilities between 0 (not knowing) and 1 (knowing). That is why we define an expert Bayesian student model (Mayo, 2001) over the domain knowledge concepts, combined with the overlay model (Carr & Goldstein, 1977). For each student and each domain knowledge concept, we define the probability of the student knowing that concept. When a student model is created, all probabilities are 0. After each question from a test that examines the knowledge about one or more concepts and relations, the probabilities of knowing the concepts involved in those relations change: correct answers increase, while incorrect answers reduce, the probability of knowing the concepts involved. The teacher uses the probabilities from the student model to determine which concepts the student knows with high probability (e.g., more than 0.8 (Bloom, 1976)), so that the system does not bother the student with learning and teaching concepts he/she already knows. In this way, the Bayesian student model serves as a "sieve" that passes on to the learning and teaching process only those concepts that the student does not already know with high probability (a sketch of this filtering is given below). We have developed a methodology for determining the most suitable way of calculating the conditional probabilities, as well as the most suitable way of setting evidence in such an environment. We explain the structure of the domain knowledge ontology and the design of the Bayesian network, and present the results of the applied methodology for selecting the most suitable Bayesian student model. For the purpose of this research, we have used the adaptive e-learning system Adaptive Courseware Tutor (AC-ware Tutor) (Grubišić, 2012). This system has ontological domain knowledge, as well as knowledge tests, which enabled us to get instances of the actual student's knowledge before and after a knowledge test.
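As a small illustration of this "sieve", here is a plain-Python sketch; the threshold 0.8 is the one cited above, while the concept names and probabilities are made up:

```python
MASTERY_THRESHOLD = 0.8  # "knows with high probability" (Bloom, 1976), as above

def concepts_to_teach(knowledge_probabilities):
    """Pass only the concepts the student does NOT already know with
    high probability on to the learning and teaching process."""
    return [concept for concept, p in knowledge_probabilities.items()
            if p <= MASTERY_THRESHOLD]

print(concepts_to_teach({"CPU": 0.95, "Mass memory": 0.30}))  # ['Mass memory']
```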
3.1. Domain Knowledge Ontology

Domain knowledge is presented with concepts and relations between them. As we have to indicate the direction of a relation between concepts, we use the terms child and parent. In order to clearly indicate, for each relation in the ontology, which concepts it connects and what the nature of that relation is, we introduce the following definition (Definition 1):

Definition 1: Let E_CON = {K_1, …, K_n}, n ≥ 0, be a set of concepts, E_REL = {r_1, …, r_m} ∪ {has_subtype, has_instance, has_part, slot, filler}, m ≥ 0, a set of relations, and ∅_E an empty element. Domain knowledge DK is a set of triplets (K_1, r, K_2) defining that the concepts K_1 and K_2 are associated with the relation r. In this way we define that the concept K_1 is the parent of the concept K_2 and that the concept K_2 is the child of the concept K_1.

Since the basic elements of the domain knowledge triplets are concepts and the relations between them, we use graph theory as a mathematical foundation for managing subsets and elements of domain knowledge, as well as for domain knowledge visualization (Gross & Yellen, 1998). Therefore, we define a directed domain knowledge graph to which all the rules from graph theory apply (Definition 2).

Definition 2: For domain knowledge DK we define the directed domain knowledge graph DKG = (V, A), where the set of vertices is V = E_CON and the set of edges
A = {(K_1, K_2) | (K_1, r, K_2) ∈ DK, r ≠ ∅_E, K_1 ≠ K_2} is equal to the set of ordered pairs of those concepts from the domain knowledge that are related.

The set of concept K_x's parents is Parents_Kx = {K ∈ E_CON | (K, r, K_x) ∈ DK, K ≠ K_x, r ∉ {slot, filler, ∅_E}} = {K ∈ V | (K, K_x) ∈ A, K ≠ K_x}. The number p_Kx is equal to the number of elements in the set Parents_Kx and denotes the number of concept K_x's parents. The set of concept K_x's children is Children_Kx = {K ∈ E_CON | (K_x, r, K) ∈ DK, K ≠ K_x, r ∉ {slot, filler, ∅_E}} = {K ∈ V | (K_x, K) ∈ A, K ≠ K_x}. The number c_Kx is equal to the number of elements in the set Children_Kx and denotes the number of concept K_x's children. A vertex of DKG is called a root if it has no parents and has children. A vertex of DKG is called a leaf if it has parents and no children.

These different types of relationships among the concepts of the ontology describe the semantics of the related nodes, but are completely equal when it comes to domain knowledge graph design. The only thing that matters is whether a relation between two nodes exists or not, and what the direction of that relation is.

We define a weight function X_V : V_DKG → [0, 1] on the domain knowledge graph, where X_V(K_x) corresponds to the probability of a student knowing the concept K_x. The values of the function X_V are determined after each knowledge test, and the calculation of its values depends on the question scores and on the given concept's parents and children. In our approach, the values of the function X_V depend on another weight function, defined in Definition 3:

Definition 3: The function X_A : A_DKG → {-1, 0, 1, …, max}, defined for every edge K_xK_y ∈ A_DKG by X_A(K_xK_y) = the score obtained by answering a question that applies to the edge K_xK_y, is a weight function on the set of edges of the domain knowledge graph.
When the student model initializes, all edges in the domain knowledge graph have the weight -1, that is, X_A(K_xK_y) = -1 for every K_xK_y ∈ A_DKG, which means that the knowledge about the relationship between those two concepts has not been tested yet. The function X_A allows assigning weights to those edges that connect the concepts mentioned in a certain question. Each tested edge has a weight between 0 and max, where max is an integer corresponding to the maximum score that can be assigned to a question. Thus, the domain knowledge graph with the weight function X_A becomes an edge-weighted graph whose weighting function values change after each knowledge test. Now, a mathematical definition of the function X_V is given (Definition 4):

Definition 4: The function X_V : V_DKG → [0, 1] defined, for every K_x ∈ V_DKG, over the tested incident edges K_xK_yi ∈ A_DKG with X_A(K_xK_yi) ≠ -1 and K_yiK_x ∈ A_DKG with X_A(K_yiK_x) ≠ -1, by

X_V(K_x) = ( Σ_i X_A(K_xK_yi) + Σ_i X_A(K_yiK_x) ) / (max · n_x),

where n_x is the number of such tested incident edges,
is a weight function on the set of vertices of the domain knowledge graph. The value X_V(K_x) represents the weighted sum of the values of the function X_A on the edges towards the parents of the concept K_x and towards the children of the concept K_x. We believe that the student knows the concept K_x completely if and only if X_V(K_x) = 1, which is true only if all the values of the function X_A on the edges towards its parents and children are max. Then the probability of knowing the concept is the highest, that is, equal to 1.
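To make Definitions 1-4 concrete, here is a minimal sketch in plain Python. The triplets, scores and max value are illustrative, and the normalization inside X_V is our reading of Definition 4 (a sum of tested incident edge scores, scaled so that all-max scores yield 1):

```python
# Domain knowledge DK (Definition 1): (K1, r, K2) means K1 is the parent of K2.
DK = {
    ("Memory", "has_subtype", "Mass memory"),
    ("Mass memory", "has_instance", "Hard disk"),
    ("Mass memory", "has_instance", "Compact disc"),
}

# Edge set A of the directed domain knowledge graph DKG (Definition 2).
A = {(k1, k2) for (k1, r, k2) in DK if k1 != k2}

def parents(kx):
    """Parents of kx: nodes with an edge into kx."""
    return {k1 for (k1, k2) in A if k2 == kx}

def children(kx):
    """Children of kx: nodes with an edge out of kx."""
    return {k2 for (k1, k2) in A if k1 == kx}

MAX_SCORE = 4  # "max": the highest score a question can carry (illustrative)

# X_A (Definition 3): edge weights; -1 marks a relation not tested yet.
X_A = {edge: -1 for edge in A}
X_A[("Memory", "Mass memory")] = 3       # score of an answered question
X_A[("Mass memory", "Hard disk")] = 4

def X_V(kx):
    """Probability of knowing kx (Definition 4, our reading): the scores of
    the tested edges towards kx's parents and children, normalized by
    MAX_SCORE so that all-max scores give 1."""
    tested = [X_A[e] for e in A if kx in e and X_A[e] != -1]
    if not tested:
        return 0.0
    return sum(tested) / (MAX_SCORE * len(tested))

print(parents("Mass memory"))        # {'Memory'}
print(children("Mass memory"))       # {'Hard disk', 'Compact disc'}
print(round(X_V("Mass memory"), 3))  # (3 + 4) / (4 * 2) = 0.875
```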
3.2. Bayesian Network Design

We define a Bayesian network BN over the domain knowledge concepts as a directed acyclic graph whose vertices are random variables K_X that correspond to the nodes of DKG and can take the values T (true, learned) and F (false, not learned), and whose directed edges, corresponding to the edges of DKG, show how the random variables are related. To implement and test the new approach to probabilistic student model design, we defined a Bayesian network with 73 nodes (see Figure 1). These 73 nodes represent 73 concepts from the domain knowledge "Computer as a system" (Grubišić, 2012). It is important to emphasize that there are four root nodes: Computer system, Computer system model, Programming language and Logical gate. In this paper we used the Bayesian network software package GeNIe (Graphical Network Interface), which provides a graphical user interface for simple construction of Bayesian networks (http://genie.sis.pitt.edu).
Figure 1. Bayesian network structure
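GeNIe is used interactively; for readers who prefer a scripted equivalent, here is a minimal sketch with the open-source pgmpy library. This is our assumption (any Bayesian network toolkit would do, and class names vary across pgmpy versions); the two-node fragment and its numbers are purely illustrative:

```python
# pip install pgmpy; in some versions the class is BayesianModel instead.
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# A two-node fragment: Memory -> Mass memory (states: 0 = F, 1 = T).
net = BayesianNetwork([("Memory", "MassMemory")])

# "A priori" probability of the root node (illustrative values).
memory = TabularCPD("Memory", 2, [[0.67],    # P(Memory = F)
                                  [0.33]])   # P(Memory = T)

# Conditional probabilities of the child, one column per parent state.
mass = TabularCPD("MassMemory", 2,
                  [[0.9, 0.1],    # P(F | Memory=F), P(F | Memory=T)
                   [0.1, 0.9]],   # P(T | Memory=F), P(T | Memory=T)
                  evidence=["Memory"], evidence_card=[2])

net.add_cpds(memory, mass)

# Set a piece of evidence (Memory known) and query the child.
print(VariableElimination(net).query(["MassMemory"], evidence={"Memory": 1}))
```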
The adaptive e-learning system AC-ware Tutor has ontological domain knowledge, as well as knowledge tests that enable the realization of the functions X_A and X_V over the domain knowledge graph, as described in the previous section. The usage of AC-ware Tutor enabled us to get instances of the actual student's knowledge before and after a knowledge test. In the previous section we indicated that the function X_V is of paramount importance for determining the "a priori" probabilities of the root nodes, but also for evidence setting. The Bayesian student model contains all the concepts from the domain knowledge, as well as the values of the function X_V for each concept. When the student model initializes, the values of the function X_V are 0 for each concept. These values can change only after a knowledge test is conducted. Since the learning and teaching process consists of multiple learning-testing cycles, the student model has to be changed after each cycle, that is, after each knowledge test. We observe two instances of a particular student model. Student_model_1 is an instance taken at the end of one learning-testing cycle. Student_model_2 is an instance taken at the end of the following learning-testing cycle. These two instances have the same structure, but the values of the function X_V differ for those concepts that were involved in the knowledge test (after a knowledge test, the values of the function X_V change). These two instances are the basis for the analysis presented below. Based on the domain knowledge graph and the values stored in Student_model_1, three different Bayesian networks will be designed (BN1, BN2, BN3) that have equal nodes and edges, but different calculations of the conditional probabilities. These three networks will be tested with five different ways of setting the pieces of evidence (Test1, …, Test5). In total, there will be fifteen different Bayesian student models (Model1, …, Model15). After applying the methodology for selecting the most suitable Bayesian student model, according to its prediction effectiveness, it will be clear which of these models most accurately predicts the student's knowledge, based on a comparison with the actual values stored in Student_model_2.

3.2.1. Calculating the "a priori" probabilities

Every Bayesian network is defined when the "a priori" probabilities of its root nodes and the conditional probability tables of its non-root nodes are defined. In our approach, the "a priori" probabilities of the root concepts are defined based on the values of the weight function X_V. Namely, the "a priori" probability P(K_X) of a root node K_X (corresponding to a root of DKG) is defined as follows: P(K_X = T) = X_V(K_X) is the probability of knowing the concept K_X, and P(K_X = F) = 1 - X_V(K_X) is the probability of not knowing the concept K_X. If X_V(K_X) = 0, then P(K_X = T) = 0.1, because of the possibility of a lucky guess. Table 1 shows a part of the values stored in Student_model_1. From the above formulas, it is obvious that the root nodes have the following "a priori" probabilities: Computer system (T = 0.33, F = 0.67), Computer system model (T = 0.083, F = 0.917), Programming language (T = 0.1, F = 0.9; lucky guess), Logical gate (T = 0.0416, F = 0.9584).
Table 1. Part of the student model instance

K_X                        X_V(K_X)
1.44MB                     0.125
Application software       0.375
Arithmetic operation       0
Arithmetic-logic unit      0.375
Assembler                  0
Basic                      0
C                          0.25
Central unit               0.5
Central processing unit    0.5
Disjunction                0.083
Diskette                   0.25
DOS                        0
Fortran                    0.125
I gate                     0
OR gate                    0.5
Information                0.125
Instruction                0.4375
Interpreter                0
Output unit                0.2917
Language translators       0
Capacity                   0.125
Compact disc               0.25
Compiler                   0
Conjunction                0
Logical operation          0.25
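The "a priori" probabilities quoted above follow directly from X_V; a minimal sketch (the helper name is ours):

```python
LUCKY_GUESS = 0.1  # used when X_V(K_X) = 0, as described above

def root_prior(x_v):
    """'A priori' probability of a root node K_X from its X_V value."""
    p_true = x_v if x_v > 0 else LUCKY_GUESS
    return {"T": p_true, "F": round(1 - p_true, 4)}

print(root_prior(0.33))    # Computer system      -> T = 0.33,   F = 0.67
print(root_prior(0))       # Programming language -> T = 0.1,    F = 0.9
print(root_prior(0.0416))  # Logical gate         -> T = 0.0416, F = 0.9584
```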
3.2.2 Conditional probabilities calculation methods

The most important feature of our approach is the non-empirical mathematical determination of the conditional probabilities. This is very important, as this determination is the bottleneck for a wider use of this complex technique for predicting student's knowledge. Automating this segment simplifies the Bayesian student model design. The conditional probabilities, in our approach, depend only on the domain knowledge ontology, that is, only on the structure of the domain knowledge graph DKG. To specify the Bayesian student model that provides better and more accurate results, the conditional probabilities are calculated in three ways using the structure of the domain knowledge graph DKG. In the first calculation method, the conditional probabilities depend only on the number of parents; in the second method, they depend on the number of parents and children; while in the third method, they depend only on the number of children. Common to all three calculation methods are equal "a priori" probabilities of the root nodes. What makes the difference in these three approaches is the determination of the "weight" of truth (knowing). The probability of truthful knowing, 0.9, is in each approach divided by a different quantifier (the number of parents, the number of children and parents, or the number of children), and this value is the "weight" of truth in the conditional probability tables. The first method for calculating conditional probabilities is a variation of the leaky-AND gate (Conati, Gertner & VanLehn, 2002) and relies on the fact that the fewer parent concepts are known, the lower the probability of the target node is (and so is the belief that the student knows the corresponding concept). The Bayesian network we use for student modeling is derived from the domain knowledge ontology. The ontology includes semantically defined relationships among concepts that can be bidirectional. Since the original Bayesian networks consider only parent nodes for conditional probability calculations, in order to facilitate those bidirectional relations, we
have to fragment the original Bayesian network into a forest of nodes, in order to ignore the non-directed dependencies encoded in the original Bayesian network. In this way we transform serial connections (P → X → C), from a node's parents (P) through the node (X) to the node's children (C), into converging connections (P → X ← C), from the node's parents (P) to the node (X) and from the node's children (C) to the node (X) (Korb & Nicholson, 2011). Therefore, for the second and the third method, the Bayesian network first has to be fragmented, in order to enable conditional probability calculations using child nodes as well. For example, if the domain knowledge ontology includes the triplets (Memory, has_subtype, Mass memory), (Mass memory, has_subtype, Floppy Disk), (Mass memory, has_instance, Hard Disk) and (Mass memory, has_instance, Compact Disc), we would like to see how the fact that the student knows the concepts Floppy Disk, Hard Disk and Compact Disc (child nodes) influences the prediction of knowing the concept Mass memory. Furthermore, we would like to see how the fact that the student knows the previously mentioned concepts combined with the concept Memory (child and parent nodes together) influences the prediction of knowing the concept Mass memory.

3.2.2.1 Conditional probabilities based on the number of node's parents

In the first approach (BN1), the conditional probability table of a non-root node K_X is defined based on the number of its parents, p_Kx. The number and percentage of nodes with a certain number of parents can be seen in Table 2.
Table 2. The structure of Bayesian network 1

Number of parents    Total number of nodes    Percentage of nodes
roots                4                        5.48%
1                    53                       72.60%
2                    12                       16.44%
3                    3                        4.11%
4                    1                        1.37%
In this approach, each value T in the conditional probability table has the "weight" 0.9/p_Kx, so the row "weight" is t·0.9/p_Kx, where t is the number of values T in the row. This row "weight" defines the conditional probability of the non-root concept K_x: P(K_X = T | K_y ∈ Parents_KX, K_y = T ∨ K_y = F) = t·0.9/p_Kx, and in the same way we define P(K_X = F | K_y ∈ Parents_KX, K_y = T ∨ K_y = F) = 1 - t·0.9/p_Kx. When t = 0, the lucky-guess value 0.1 is used (see Figure 2). For example, let us analyze the determination of the conditional probabilities of the node "Mass memory", which has two parents ("Central Unit" and "Memory"). In this case, each T value in the conditional probability table has the "weight" 0.9/2 = 0.45. The conditional probability table of the concept "Mass memory" is given in Figure 2.

Central Unit                                T     T      F      F
Memory                                      T     F      T      F
P(Mass memory=T | Central Unit, Memory)     0.9   0.45   0.45   0.1
P(Mass memory=F | Central Unit, Memory)     0.1   0.55   0.55   0.9
Figure 2. Conditional probabilities based on the number of node’s parents
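A small plain-Python sketch (our own helper name) that reproduces this table from p_Kx alone, keeping the lucky-guess value 0.1 for the all-F row:

```python
from itertools import product

def bn1_cpt(parent_names):
    """BN1 conditional probability table of a non-root node: each parent
    set to T contributes the weight 0.9 / p_Kx; the all-F row keeps the
    lucky-guess value 0.1 (as in Figure 2)."""
    p = len(parent_names)
    table = {}
    for row in product((True, False), repeat=p):
        t = sum(row)                              # number of T values
        p_true = t * 0.9 / p if t > 0 else 0.1
        table[row] = (round(p_true, 2), round(1 - p_true, 2))
    return table

# Reproduces Figure 2 for "Mass memory" with parents Central Unit, Memory:
for row, (p_t, p_f) in bn1_cpt(["Central Unit", "Memory"]).items():
    print(row, "P(T) =", p_t, "P(F) =", p_f)
```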
3.2.2.2 Conditional probabilities based on the number of node's parents and children

In the second approach (BN2), the conditional probability table of a non-root node K_X is defined based on the number of its parents and children, p_Kx + c_Kx. The number and percentage of nodes with a certain number of parents and children can be seen in Table 3.
Table 3. The structure of Bayesian network 2

Number of parents and children    Total number of nodes    Percentage of nodes
roots                             4                        5.48%
1                                 26                       35.62%
2                                 17                       23.29%
3                                 10                       13.69%
4                                 11                       15.07%
5                                 3                        4.11%
6                                 2                        2.74%
In this approach, each value T in the conditional probability table has the "weight" 0.9/(p_Kx + c_Kx), so the row "weight" is t·0.9/(p_Kx + c_Kx). This row "weight" defines the conditional probability of the non-root concept K_x: P(K_X = T | K_y ∈ Parents_KX ∪ Children_KX, K_y = T ∨ K_y = F) = t·0.9/(p_Kx + c_Kx), and in the same way we define P(K_X = F | K_y ∈ Parents_KX ∪ Children_KX, K_y = T ∨ K_y = F) = 1 - t·0.9/(p_Kx + c_Kx). For example, let us analyze the determination of the conditional probabilities of the node "Mass memory", which has two parents ("Central Unit" and "Memory") and three children ("Floppy Disk", "Hard Disk" and "Compact Disc"). In this case, each T value in the conditional probability table has the "weight" 0.9/(2+3) = 0.18. The conditional probability table of the concept "Mass memory" is given in Figure 3. Since the probabilities depend only on the number t of conditioning nodes with value T, the 32 columns of the full table reduce to the following (the lucky-guess value 0.1 is used when t = 0):

t (number of values T among Central Unit,
Memory, Floppy Disk, Hard Disk, Compact Disk)    0     1      2      3      4      5
P(Mass memory=T | Central Unit, Memory,
Floppy Disk, Hard Disk, Compact Disk)            0.1   0.18   0.36   0.54   0.72   0.9
P(Mass memory=F | Central Unit, Memory,
Floppy Disk, Hard Disk, Compact Disk)            0.9   0.82   0.64   0.46   0.28   0.1
Figure 3. Conditional probabilities based on the number of node’s parents and children
3.2.2.3 Conditional probabilities based on the number of node's children

In the third approach (BN3), the conditional probability table of a non-root node K_X is defined based on the number of its children, c_Kx. The number and percentage of nodes with a certain number of children can be seen in Table 4.
Table 4. The structure of Bayesian network 3

Number of children    Total number of nodes    Percentage of nodes
roots                 4                        5.48%
leafs                 31                       42.46%
1                     12                       16.44%
2                     14                       19.18%
3                     9                        12.33%
4                     3                        4.11%
In this approach, each value T in the conditional probability table has the "weight" 0.9/c_Kx, so the row "weight" is t·0.9/c_Kx. This row "weight" defines the conditional probability of the non-root concept K_x: P(K_X = T | K_y ∈ Children_KX, K_y = T ∨ K_y = F) = t·0.9/c_Kx, and in the same way we define P(K_X = F | K_y ∈ Children_KX, K_y = T ∨ K_y = F) = 1 - t·0.9/c_Kx. The only problem in this approach are the nodes that have no children. For those nodes c_Kx is 0, and we cannot calculate the "weight" of truth according to the above formula. Therefore, for such nodes we determine that each value T in the conditional probability table has the "weight" 0.5. For example, let us analyze the determination of the conditional probabilities of the node "Mass memory", which has three children ("Floppy Disk", "Hard Disk" and "Compact Disc"). In this case, each T value in the conditional probability table has the "weight" 0.9/3 = 0.3. The conditional probability table of the concept "Mass memory" is given in Figure 4.

Floppy Disk                                   T     T     T     T     F     F     F     F
Hard Disk                                     T     T     F     F     T     T     F     F
Compact Disk                                  T     F     T     F     T     F     T     F
P(Mass memory=T | Floppy Disk, Hard Disk,
Compact Disk)                                 0.9   0.6   0.6   0.3   0.6   0.3   0.3   0.1
P(Mass memory=F | Floppy Disk, Hard Disk,
Compact Disk)                                 0.1   0.4   0.4   0.7   0.4   0.7   0.7   0.9
Figure 4. Conditional probabilities based on the number of node’s children
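Since all three networks share the same weighting scheme and differ only in the conditioning set, one hedged sketch can cover them; BN3's special 0.5 weight for childless nodes is only noted in the comment, and the names are ours:

```python
from itertools import product

def cpt(n_conditioning):
    """P(K_X = T | ...) for one non-root node, as used in BN1, BN2 and BN3.

    n_conditioning is p_Kx (BN1), p_Kx + c_Kx (BN2) or c_Kx (BN3);
    in BN3, childless nodes get the fixed weight 0.5 instead (not shown).
    """
    w = 0.9 / n_conditioning
    return {row: max(sum(row) * w, 0.1)   # 0.1 keeps the lucky-guess floor
            for row in product((True, False), repeat=n_conditioning)}

# Reproduces Figure 4 (BN3, "Mass memory" with three children):
for row, p_true in cpt(3).items():
    print(row, "P(T) =", round(p_true, 2), "P(F) =", round(1 - p_true, 2))
```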
3.2.3 Setting the pieces of evidence

The importance of the function X_V lies not only in determining the "a priori" probabilities of the root nodes; it is also used for setting the pieces of evidence. We observe five different ways of setting the pieces of evidence (five different values of the function X_V used as thresholds) in order to examine their efficiency and reliability. These threshold values are defined completely heuristically, and the following analysis is done in order to determine which of these heuristic values is best for setting the pieces of evidence. It is important to observe the obtained predictions and compare them with the actual values from the instance of the real student model Student_model_2, the gold standard.

3.2.3.1 Test 1

Let K_x be any node. If X_V(K_x) ≥ 0.9, then we set the evidence on the node K_x to true. Similarly, if 1 - X_V(K_x) ≥ 0.9, then we set the evidence on the node K_x to false.

Example 1: In the instance Student_model_1 there exists the value X_V(Computer system model) = 0.083. It is clear that 1 - X_V(Computer system model) = 0.917, which is greater than 0.9. Therefore, we set the evidence on the node Computer system model to false. In this way, we set false evidence on four nodes (5% of all nodes are pieces of evidence).
3.2.3.2 Test 2

Let K_x be any node. If X_V(K_x) ≥ 0.8, then we set the evidence on the node K_x to true. Similarly, if 1 - X_V(K_x) ≥ 0.8, then we set the evidence on the node K_x to false.

Example 2: In the instance Student_model_1 there exists the value X_V(Fortran) = 0.125. It is clear that 1 - X_V(Fortran) = 0.875, which is greater than 0.8. Therefore, we set the evidence on the node Fortran to false. In this way, we set false evidence on twelve nodes (16% of all nodes are pieces of evidence).

3.2.3.3 Test 3

Let K_x be any node. If X_V(K_x) ≥ 0.75, then we set the evidence on the node K_x to true. Similarly, if 1 - X_V(K_x) ≥ 0.75, then we set the evidence on the node K_x to false.

Example 3: In the instance Student_model_1 there exists the value X_V(Central Unit) = 0.25. It is clear that 1 - X_V(Central Unit) = 0.75, which is equal to 0.75. Therefore, we set the evidence on the node Central Unit to false. In this way, we set true evidence on four nodes and false evidence on seventeen nodes (29% of all nodes are pieces of evidence).

3.2.3.4 Test 4

Let K_x be any node. If X_V(K_x) ≥ 0.65, then we set the evidence on the node K_x to true. Similarly, if 1 - X_V(K_x) ≥ 0.65, then we set the evidence on the node K_x to false.

Example 4: In the instance Student_model_1 there exists the value X_V(Input Unit) = 0.312. It is clear that 1 - X_V(Input Unit) = 0.688, which is greater than 0.65. Therefore, we set the evidence on the node Input Unit to false. In this way, we set true evidence on four nodes and false evidence on nineteen nodes. The results would be the same if we had used the threshold 0.7 (32% of all nodes are pieces of evidence).

3.2.3.5 Test 5

Let K_x be any node. If X_V(K_x) ≥ 0.6, then we set the evidence on the node K_x to true. Similarly, if 1 - X_V(K_x) ≥ 0.6, then we set the evidence on the node K_x to false.

Example 5: In the instance Student_model_1 there exists the value X_V(Application Software) = 0.375. It is clear that 1 - X_V(Application Software) = 0.625, which is greater than 0.6. Therefore, we set the evidence on the node Application Software to false. In this way, we set true evidence on four nodes and false evidence on twenty-six nodes (41% of all nodes are pieces of evidence).
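All five tests follow the same thresholding pattern; a minimal plain-Python sketch, with the concept names and X_V values taken from the examples above:

```python
def set_evidence(student_model, threshold):
    """Tests 1-5: pin K_x to true if X_V(K_x) >= threshold,
    to false if 1 - X_V(K_x) >= threshold, otherwise leave it free."""
    evidence = {}
    for concept, x_v in student_model.items():
        if x_v >= threshold:
            evidence[concept] = True
        elif 1 - x_v >= threshold:
            evidence[concept] = False
    return evidence

model_1 = {"Computer system model": 0.083, "Fortran": 0.125}
print(set_evidence(model_1, 0.9))  # Test 1: {'Computer system model': False}
print(set_evidence(model_1, 0.8))  # Test 2: both concepts pinned to False
```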
By comparing the ways the pieces of evidence are set, the differences are obvious if we observe only the total number of set pieces of evidence. In Test1, only four pieces of evidence are set. The same four pieces of evidence occur in all the other ways of evidence setting. It is logical to assume that there are differences in making predictions between setting only four pieces of evidence in the Bayesian network (Test1) and setting thirty pieces of evidence (Test5). It is also logical to assume that the more pieces of evidence are set, the more accurate the prediction model is. These assumptions will be refuted below: it will be shown that it is essential to set evidence in a quality manner and that the quantity of evidence is not the most important factor for the accuracy of prediction. Moreover, the five mentioned ways of setting evidence will be used on all three Bayesian networks. In this way we test the prediction effectiveness of, in total, 3 × 5 = 15 Bayesian student models (Model1, …, Model15).

3.2.4 Testing the Bayesian student model prediction effectiveness

The student's knowledge after the knowledge test is contained in the instance Student_model_2, which holds the actual student's knowledge. So, based on the actual knowledge, it is known which concepts the student has mastered, and the mentioned 15 models are analyzed to show which of them best predicts this actual student's knowledge. The comparative analysis included only those nodes whose values of the function X_V differ between the student model instances Student_model_1 and Student_model_2. The nodes that are pieces of evidence were excluded from the comparative analysis. For each model, the percentage of overlap with the instance Student_model_2 is given. If the predicted value for a given node and its value of the function X_V differ by less than or equal to 0.1, we have a prediction match. If they differ by more than 0.1 and by less than or equal to 0.2, we have a prediction indication. If they differ by more than 0.2, we have a prediction miss. These boundaries are determined heuristically and have no support in the literature; therefore, they have to be verified in future experiments. The results of the analysis are presented in Tables 5, 6 and 7.
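The match/indication/miss bands just defined reduce to a few lines of code; a minimal sketch (the helper name is ours):

```python
def classify_prediction(predicted, actual_x_v):
    """Match/indication/miss bands used in the comparative analysis."""
    diff = abs(predicted - actual_x_v)
    if diff <= 0.1:
        return "match"
    if diff <= 0.2:
        return "indication"
    return "miss"

print(classify_prediction(0.72, 0.65))  # diff 0.07 -> match
print(classify_prediction(0.72, 0.50))  # diff 0.22 -> miss
```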
Table 5. Results of Bayesian student model prediction testing

Model     Bayesian network   Evidence setting   Number of compared nodes   Match (≤0.1)   Indication (>0.1, ≤0.2)   Miss (>0.2)
Model1    BN1                Test1              41                         36%            32%                       32%
Model2    BN2                Test1              41                         32%            17%                       51%
Model3    BN3                Test1              41                         15%            12%                       73%
Model4    BN1                Test2              33                         12%            27%                       61%
Model5    BN2                Test2              33                         18%            18%                       64%
Model6    BN3                Test2              33                         9%             12%                       71%
Model7    BN1                Test3              23                         22%            17%                       61%
Model8    BN2                Test3              23                         22%            17%                       61%
Model9    BN3                Test3              23                         26%            17%                       57%
Model10   BN1                Test4              21                         9%             19%                       72%
Model11   BN2                Test4              21                         28%            24%                       48%
Model12   BN3                Test4              21                         14%            33%                       52%
Model13   BN1                Test5              14                         14%            28%                       58%
Model14   BN2                Test5              14                         28%            14%                       58%
Model15   BN3                Test5              14                         7%             21%                       72%
Table 6. Average results regardless of evidence setting

Bayesian network   Match (≤0.1)   Indication (>0.1, ≤0.2)   Miss (>0.2)
BN1                19%            25%                       56%
BN2                26%            18%                       56%
BN3                14%            17%                       64%
Table 7. Average results regardless of Bayesian network

Evidence setting   Number of pieces of evidence   Match (≤0.1)   Indication (>0.1, ≤0.2)   Miss (>0.2)
Test1              4                              28%            20%                       52%
Test2              12                             13%            19%                       68%
Test3              21                             23%            17%                       60%
Test4              23                             17%            25%                       58%
Test5              30                             16%            21%                       63%
Observing the results in the mentioned tables, it is not difficult to conclude that the network BN3 has the "worst" results (the highest percentage in the last column of Table 6: 64%). This result can be attributed to setting the conditional probability weight to 0.5 for all nodes without children. When we compare the networks BN1 and BN2, we can conclude that BN1 has better results in Test1 and Test5, while in Test3 they have identical results. BN2 has shown better results in Test2 and Test4. Overall, BN2 has the most matches (the highest percentage in the second column of Table 6: 26%) and, therefore, it can be considered the best. Looking at Table 7 and trying to answer which evidence setting is best for knowledge prediction, it is not hard to see that this is Test1 (the highest percentage in the third column of Table 7: 28%). We conclude that it is essential to set evidence in a quality manner and that the quantity of evidence is not the most important factor for the accuracy of prediction. If we observe the individual results in Table 5, we conclude that the model with the most overlap with the actual student's knowledge is Model1, where the conditional probabilities were determined based on the number of parents (BN1) and evidence was set for nodes whose values of the function X_V were greater than or equal to 0.9 (Test1). This model has the fewest prediction misses and, in relation to the other models, a very high number of prediction matches. Therefore, this model stands out as an appropriate Bayesian student model for predicting student knowledge in ontology-based environments.
4. Conclusion

Intelligent tutoring systems need to build a model based on uncertain information received from students. This information can be interpreted in various ways; therefore, the role of probabilistic models is especially important. Building a model that, given a small amount of high-quality information, makes conclusions about the student's knowledge and adapts to it requires a lot of effort. Bayesian network theory provides the above, but it is particularly important to find the best way to implement Bayesian networks in the student model design process.
The desire to provide a new, modern and quality education requires a lot of research. This paper describes a Bayesian student model as a new way of modeling students in ontology-based intelligent tutoring systems. The development of this model is illustrated through empirical research that included a comparative analysis of fifteen potential models, among which we looked for the one that best predicts the student's knowledge. The most important feature of this model is its non-empirical mathematical determination of the conditional probabilities, while the "a priori" probabilities are determined empirically, based on the knowledge test results. This is very important, as this determination is the bottleneck of using Bayesian networks. Automating this segment will eventually lead to a wider use of this complex technique for predicting student's knowledge, as the conditional probabilities depend only on the structure of the domain knowledge ontology. The basis of this study was to find the best way to design a Bayesian student model. Numerous designs were observed, with special emphasis placed on determining the conditional probabilities and on evidence setting. Although we believed that the most important aspect is the determination of the conditional probabilities, it turned out that the setting of evidence is no less important. The model that, among all tested models, best represents the actual student's knowledge is the one where the conditional probabilities were determined based on the number of parents (BN1) and evidence was set for nodes whose values of the function X_V were greater than or equal to 0.9 (Test1). In the future, we should find out why this model was wrong in 32% of the cases and eliminate these prediction misses. It turned out that a small but well selected number of pieces of evidence enables better prediction of the student's knowledge than many unfounded pieces of evidence. In further studies related to Bayesian student model design, we will conduct broader research on a larger sample of instances of actual student models and see in what percentage of cases the selected Bayesian student model accurately predicts the student's knowledge. Furthermore, we will test this model on different domain knowledge in order to draw conclusions about the model's generality and independence of the domain knowledge. There are several aspects that should be involved in the extension of the presented work: an in-depth sensitivity analysis, and real-time usage of the network, updated as a result of student actions, in order to find out about its accuracy.
Acknowledgements

This paper describes the results of research carried out within the project 177-0361994-1996 "Design and evaluation of intelligent e-learning systems", part of the program 036-1994 "Intelligent Support to Omnipresence of e-Learning Systems", funded by the Ministry of Science, Education and Sports of the Republic of Croatia.
References

[1] Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, pp. 370-418.
[2] Beck, J., Stern, M. & Haugsjaa, E. (1996). Applications of AI in Education. Crossroads, 3(1), pp. 11-15.
[3] Bloom, B. S. (1984). The 2 Sigma Problem: The Search for Methods of Group Instruction as Effective as One-to-One Tutoring. Educational Researcher, 13(6), pp. 4-16.
[4] Bloom, B. S. (1976). Human Characteristics and School Learning. New York: McGraw-Hill Book Company.
[5] Carr, B. & Goldstein, I. P. (1977). Overlays: A theory of modelling for computer-aided instruction. AI Lab Memo 406, Massachusetts Institute of Technology, Cambridge, Massachusetts.
[6] Charniak, E. (1991). Bayesian networks without tears. AI Magazine, 12(4), pp. 50-63.
[7] Cheng, J. & Greiner, R. (2001). Learning Bayesian belief network classifiers: Algorithms and system. Advances in Artificial Intelligence, pp. 141-151.
[8] Conati, C., Gertner, A. & VanLehn, K. (2002). Using Bayesian networks to manage uncertainty in student modeling. User Modeling and User-Adapted Interaction, 12(4), pp. 371-417.
[9] Conati, C., Gertner, A. S., VanLehn, K. & Druzdzel, M. J. (1997). On-line student modeling for coached problem solving using Bayesian networks. User Modeling: Proceedings of the Sixth International Conference, UM97, pp. 231-242.
[10] Gamboa, H. & Fred, A. (2002). Designing intelligent tutoring systems: A Bayesian approach. Enterprise Information Systems III, 1, pp. 452-458.
[11] Gross, J. L. & Yellen, J. (1998). Graph Theory and Its Applications (1st ed.). CRC Press.
[12] Gruber, T. R. (1993). A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2), pp. 199-220.
[13] Grubišić, A. (2012). Adaptive student's knowledge acquisition model in e-learning systems. PhD Thesis, Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia (in Croatian).
[14] Korb, K. B. & Nicholson, A. E. (2011). Bayesian Artificial Intelligence (2nd ed.). Chapman & Hall/CRC Press.
[15] Berners-Lee, T., Hendler, J. & Lassila, O. (2001). The Semantic Web. Scientific American, 284(5), pp. 34-43.
[16] Mayo, M. J. (2001). Bayesian Student Modelling and Decision-theoretic Selection of Tutorial Actions in Intelligent Tutoring Systems. PhD Thesis, University of Canterbury, Christchurch, New Zealand.
[17] Millán, E. & Pérez-de-la-Cruz, J. L. (2002). A Bayesian diagnostic algorithm for student modeling and its evaluation. User Modeling and User-Adapted Interaction, 12(2-3), pp. 281-330.
[18] Ohlsson, S. (1986). Some principles of intelligent tutoring. Instructional Science, 14, pp. 293-326.
[19] Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. San Mateo: Morgan Kaufmann.
[20] Sleeman, D. & Brown, J. S. (1982). Introduction: Intelligent Tutoring Systems: An Overview. In: Sleeman, D. & Brown, J. S. (Eds.), Intelligent Tutoring Systems, pp. 1-11. Academic Press, Burlington, MA.
[21] Urban-Lurain, M. (1996). Intelligent tutoring systems: An historic review in the context of the development of artificial intelligence and educational psychology. Technical Report, Department of Computer Science and Engineering, Michigan State University.
[22] VanLehn, K. (1988). Student Modeling. In: Polson, M. C. & Richardson, J. J. (Eds.), Foundations of Intelligent Tutoring Systems, pp. 55-79. Lawrence Erlbaum Associates.
[23] VanLehn, K., Niu, Z., Siler, S. & Gertner, A. (1998). Student modeling from conventional test data: A Bayesian approach without priors. In: Goettle, B., Halff, H., Redfield, C. & Shute, V. (Eds.), Proceedings of the 5th International Conference on Intelligent Tutoring Systems, Springer-Verlag, pp. 434-443.
[24] Wenger, E. (1987). Artificial Intelligence and Tutoring Systems. Morgan Kaufmann Publishers, Inc., California, USA.
Highlights

1. A probabilistic student model based on a Bayesian network.
2. Non-empirical mathematical determination of conditional probabilities.
3. Guidelines for ontology based Bayesian student model design.
4. Novel methods for parameter estimation in Bayesian networks.
5. An expert Bayesian student model over domain knowledge concepts combined with the overlay model.