Highlights
• BEN is an information fusion tool combining the Bayesian and maximum entropy methods
• BEN encodes various types of information into the Bayesian method as constraints
• Benefits classification when the training data size is small
• Converges to the Bayesian result when the dataset is large
A Bayesian Entropy Network for Fusion of Different Types of Information
Yuhao Wang¹, Yongming Liu¹
¹ Arizona State University, Tempe, AZ, 85281, USA
Abstract: A hybrid method for information fusion combining the maximum entropy (ME) method with the classical Bayesian network is proposed as the Bayesian-Entropy Network (BEN) in this paper. The key benefit of the proposed method is its capability to handle various types of information for classification and updating, such as classical point data, abstracted statistical information, and range data. The detailed derivation of the proposed method is given, with special focus on the formulation of different types of information as constraints embedded in the entropy part; the Bayesian part is used to handle classical point observation data. Next, an adaptive algorithm is proposed to mitigate the impact of wrong information constraints on the final posterior distribution estimation. Following this, several examples are used to demonstrate the proposed methodology and its application to engineering problems. It is shown that the proposed method is a generalized form of the classical Bayesian method and can take advantage of the extra information. This advantage is preferable in many engineering applications, especially when the number of point observations is limited. Conclusions and future work are drawn based on the current study.
Keywords: Bayesian network, maximum entropy, probability, updating, classification
1 Introduction
Statistical machine learning has been extensively developed during the past decades, and the Bayesian Network (BN) is one of the most widely used learning methods [1]. BN is a robust tool for inference due to its ability to model causal relationships and update the posterior distribution with observations (most commonly point observations). BN has been applied in engineering problems to update and infer model parameters through observations [2] and to predict system reliability [3]. The application of Bayesian updating in engineering problems dates back to the 1970s, when the potential of the method was discovered for updating the probability distribution of model parameters and the prediction of system response [4][5][6]. In [7], a framework based on probabilistic logic for system prediction under model uncertainty is proposed, which considers the plausibility of different model classes and updates the uncertainty using the Bayesian method. The Bayesian Network Classifier (BNC) uses the Bayesian network for classification [8]. A BNC learns the conditional distribution (the likelihood) of feature variables given each class from data. A prediction is made for a new instance by choosing the class that achieves the highest posterior probability. The Bayesian method requires the evaluation of the posterior probability as a product of the prior and the likelihood function. The prior
is related to the previous belief about the parameter of concern, and the likelihood is related to the model and the observed data. In practice, other types of information may be accessible, especially for complex engineering systems. For example, in the analysis of aircraft failure, the data required to train an accurate network model may not be available, while an expert engineer from industry may suggest that the aircraft would be in critical condition if not maintained for 300 cycles of operation. Such empirical information from expert opinion or human concepts is not easily encoded into the traditional Bayesian method. Other examples of available information include moment information (mean, variance, etc.) and range information. A Plausible Petri net (PPN) framework introduces expert information as a symbolic place in a Petri network in [9]. The algorithm was applied in a railway problem to present Bayesian updating with data and an expert system as a PPN [10]. There is existing research on introducing additional information into the Bayesian method. [11] presented a way of updating parameter distributions using fuzzy observation data by changing the likelihood function into an integral over the fuzzy range. [12] used a Bayesian prior to control the posteriors by adding constraints. The method minimizes the Kullback-Leibler divergence between the target distribution and the prior under constraints, and the posterior is solved using the expectation maximization algorithm. This provides a way of numerically solving for the target distribution. A posterior regularization framework was presented in [13] for structured, weakly supervised learning; the framework treats data-dependent constraints as information about model posteriors. The work was further developed into a regularized Bayesian inference in [14]. [15] introduced a Bayesian maximum entropy (BME) method that can integrate information from multiple sensors. It was proven to have better accuracy in spatiotemporal problems [16][17] and has been successfully applied in various fields [18]. BME uses
maximum entropy theory to construct the prior distribution based on general knowledge or physical laws [19]. The method is especially popular in geostatistical problems for handling site-specific information [20][21][22]. The Maximum Entropy (ME) method is an alternative tool for updating the posterior distribution by treating given evidence as constraints. It was first introduced in Jaynes' information theory [23][24] and has been widely applied in science and engineering fields. It has been shown that Bayes' theorem is a special case of the ME algorithm [25] where only point data is available. Although there have been debates [26][27] about the inconsistency of the entropy method with classical probability theory, i.e., the Bayesian method, it has been shown that the inconsistency depends on how the information is used as a constraint [28]. The constraint can be used to encode additional information into the Bayesian framework. The ME method incorporating moment constraints has been derived in [28] and successfully applied in a fatigue life prediction scheme via single-parameter updating [29]. The work in [30] provided a method to update parameters using statistical moment information. It builds a separate network to train the likelihood function between the moment data and the model parameters. Although the method can successfully take advantage of moment information interpreted from historical data, the model requires a separate step and an additional dataset for training. In this paper, we propose a method of handling extra information via the combination of the Bayesian and ME methods to achieve information fusion. The extra information is introduced into the Bayesian method in the form of constraints. The proposed method is similar to general Bayesian updating: it has a Bayesian part that deals with point observations as in a classical Bayesian framework, and an entropy part that can encode extra information as a constraint. The entropy part is analytically solvable, so it is easy to implement in any existing Bayesian method. When no extra information is given, the entropy
term drops, and the governing equation yields the classical Bayesian equation. The newly developed method will be called the Bayesian Entropy Network (BEN) from this point on. The BEN is proposed as a generalized information fusion tool. The concept of information fusion in this research refers to the ability of BEN to utilize various types of information. Such information could be abstracted statistical information, physical constraints on the range of a variable, expert knowledge, linguistic data, etc. It is essentially different from sensor information fusion [31] by the application of the Bayesian and maximum entropy methods [32]. Based on the research work in [33][34], the constraint introduced by the entropy term is a hard constraint. In this paper, an adaptive framework is added to the BEN method to mitigate the strong effect of the entropy constraint. The rest of the paper is organized as follows: in the next section, the definition of entropy is reviewed and the derivation of the target posterior given different types of constraints is introduced. Following this, an adaptive framework to mitigate the strong effect of the entropy constraint is proposed. In Section 4, two examples are given to illustrate the BEN method. One demonstration example applies BEN to classification, incorporating moment and range constraints on imaging features; the classification result is compared with a Naïve Bayes classifier. The second example involves an air traffic control model that predicts the occurrence probability of an accident related to human performance. Additional information from various sources is encoded into a Bayesian network model to demonstrate the influence of extra information and to compare with the classical Bayesian method. Section 5 of the paper provides concluding remarks and some suggestions for future research. The main contributions of this paper are: 1) the extension of the analytical solution for the entropy-based constraints to many different types of information
commonly seen in engineering practices; 2) an adaptive framework to mitigate the strong effect of the entropy term; and 3) the concept of using Bayesian Entropy updating in a network format for complex system analysis.
2 Brief review of the maximum entropy method
As presented in [23], the ME method was originally defined for the purpose of assigning probabilities using information as constraints. It was later found that Bayes' rule can be derived from the ME method using observations as a constraint [25]. The method was used to update probabilities with moment constraints in [28]. The entropy is measured between the posterior p and prior q of the joint distribution for θ and x as:

S[p, q] = -\int p(x,\theta)\,\log\frac{p(x,\theta)}{q(x,\theta)}\,dx\,d\theta    (1)
where θ is the parameter, x is the observable variable, and p(x,θ) and q(x,θ) are the posterior and prior joint distributions. The integral range is over the domain of the parameter θ and the variable x. The entropy describes the difference between the two distribution functions. The idea of the ME method is to find the posterior distribution that maximizes the entropy under constraints. In this section, the derivation of the constraint term is given for various types of information.
2.1 Maximizing entropy with point observations
Point observations are the most common type of data seen in engineering problems. They could be measurements, sensor readings, or experimental outcomes. When updating a parameter θ with observable variable x, the traditional Bayesian method updates the belief on the parameter θ with some observed value of x as:
p(\theta \mid x = x') = \frac{p(x = x' \mid \theta)\,p(\theta)}{p(x = x')}    (2)
where x’ represents the observation. p(θ|x=x’) is the posterior distribution of θ given the observation x’. The p(x=x’) in the denominator is acting as a normalizing constant and the equation can be written in the proportional form as:
p(\theta \mid x = x') \propto p(x = x' \mid \theta)\,p(\theta)    (3)
In the ME method, the goal is to maximize the entropy under the observation constraint. In this case, the entropy involves the old and new joint distributions of the parameter θ and the observable variable x, as expressed in Eq. 1. The constraint given the observation data x' is described using a delta function:
p(x) = \int p(x,\theta)\,d\theta = \delta(x - x')    (4)
Another constraint comes from the definition of a probability density function (PDF), i.e., the integral over the domain equals unity. We have the normalization constraint:
\int p(x,\theta)\,dx\,d\theta = 1    (5)
The target posterior distribution should maximize the entropy in Eq. 1 while satisfying the constraints in Eqs. 4 and 5. The target distribution, i.e., the posterior, can be solved using the Lagrange multiplier method and is expressed as:

p(x,\theta) = \frac{q(x,\theta)\,\delta(x - x')}{q_X(x)}    (6)
Integrating over x for the marginal distribution of θ:

p(\theta) = \int p(x,\theta)\,dx = \int \frac{q(x,\theta)\,\delta(x - x')}{q_X(x)}\,dx = q(\theta \mid x = x')    (7)
This is exactly Bayes' theorem, where the posterior distribution is the conditional distribution based on the observed point data x'.
2.2 Maximizing entropy with moment information
The benefit of the ME method is that, theoretically, any information can be encoded if written in the form of a constraint. Statistical moment information can often be derived from historical databases or from data reduction/abstraction. While the traditional Bayesian method may not easily handle this type of information, it can be written in constraint form and used in the ME method. The moment information of the parameter θ can be expressed as
\int p(x,\theta)\,g(\theta)\,dx\,d\theta = G    (8)
Eq. 8 represents the expected value of a function g(θ). When g(θ)=θ, the equation represents the first-order moment; when g(θ)=θ², the second-order moment; and so on. With the normalization and observation constraints as in Eq. 5 and Eq. 4, the posterior can be solved as:

p(\theta) = q(\theta \mid x = x')\,\frac{e^{\beta g(\theta)}}{Z} \propto q(\theta \mid x = x')\,e^{\beta g(\theta)}    (9)
where β is a Lagrange multiplier used in solving the maximization problem and Z is a constant introduced by the normalization constraint. Given the specific form of the prior joint distribution, the β term can be analytically solved; the detailed derivation can be found in [28]. As shown in Eq. 9, the solution from the ME method includes the Bayesian part (the first term on the right-hand side) and an additional entropy part that includes the moment information (the exponential term). If there is no such information, i.e., β=0, the equation recovers the classical Bayesian updating rule.
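As a quick illustration of Eq. 9, the following grid-based sketch (not from the paper; the posterior parameters and the constraint value G are illustrative) solves numerically for the multiplier β so that the tilted posterior satisfies a first-moment constraint:

```python
# A minimal numerical sketch of Eq. 9: the Bayesian posterior q(theta|x') is
# tilted by exp(beta*g(theta)), and beta is solved so that E[g(theta)] = G.
import numpy as np
from scipy.optimize import brentq

theta = np.linspace(0.0, 20.0, 2001)           # grid over the parameter
q = np.exp(-0.5 * ((theta - 8.0) / 2.0) ** 2)  # illustrative posterior q(theta|x')
q /= np.trapz(q, theta)

G = 9.0          # moment constraint: E[theta] = G
g = theta        # g(theta) = theta (first moment)

def tilted_mean(beta):
    w = q * np.exp(beta * g)
    w /= np.trapz(w, theta)
    return np.trapz(w * g, theta)

beta = brentq(lambda b: tilted_mean(b) - G, -5.0, 5.0)  # solve for the multiplier
p = q * np.exp(beta * g)
p /= np.trapz(p, theta)                        # Eq. 9 posterior
print(f"beta = {beta:.4f}, posterior mean = {np.trapz(p * theta, theta):.4f}")
```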
2.3 Maximizing entropy with range constraint
Another type of information commonly seen in engineering problems is range information. For example, an observation of some physical property may lie within a certain range (e.g., a crack length in a bridge is between 5 mm and 6 mm). Another example is that a parameter θ could have a physical constraint such that its value should fall in a certain range. While it would lose some generality to assume a specific bounded prior for the parameter, we instead introduce a range constraint using the entropy method. Assume the parameter θ should be in the range from a to b; the constraint can be expressed as:

\int_a^b \int p(x,\theta)\,dx\,d\theta = 1    (10)
along with the normalization constraint in Eq. 5. The idea of this two-constraint setup is that the target posterior only takes values in the range from a to b for the variable θ, and for θ ∉ (a, b) the probability density is zero:

\int_{\theta \notin (a,b)} \int p(x,\theta)\,dx\,d\theta = 0    (11)
Forming the Lagrange function using these two constraints:

L = S + \alpha \left[ \int_a^b \int p(x,\theta)\,dx\,d\theta - 1 \right] + \gamma \left[ \int\!\!\int p(x,\theta)\,dx\,d\theta - 1 \right]    (12)
where α and γ are Lagrange multipliers. To find the target posterior function, the variation of the Lagrangian needs to be zero, i.e., δL = 0. Thus, the derivative of the Lagrangian with respect to p(x,θ) equals zero, ∂L/∂p = 0. This yields:

-\int\!\!\int \left( \log\frac{p(x,\theta)}{q(x,\theta)} + 1 \right) dx\,d\theta + \alpha \int_a^b \int dx\,d\theta + \gamma \int\!\!\int dx\,d\theta = 0    (13)

The target posterior needs to satisfy this equation, which means that p(x,θ) should satisfy Eq. 13 both when θ ∈ (a, b) and when θ ∉ (a, b). This provides two equations:

-\int_a^b \int \left( \log\frac{p(x,\theta)}{q(x,\theta)} + 1 \right) dx\,d\theta + \alpha \int_a^b \int dx\,d\theta + \gamma \int_a^b \int dx\,d\theta = 0
-\int_{\theta \notin (a,b)} \int \left( \log\frac{p(x,\theta)}{q(x,\theta)} + 1 \right) dx\,d\theta + \gamma \int_{\theta \notin (a,b)} \int dx\,d\theta = 0    (14)
This leads to a piecewise solution for the posterior:

p(x,\theta) = \begin{cases} q(x,\theta)\,e^{\alpha + \gamma - 1}, & \theta \in (a,b) \\ q(x,\theta)\,e^{\gamma - 1}, & \theta \notin (a,b) \end{cases}    (15)
Substituting this result into Eq. 10, we have:

e^{\alpha + \gamma - 1} = \frac{1}{Q_\theta(b) - Q_\theta(a)}    (16)
where Q_θ(·) is the cumulative distribution function (CDF) of the prior distribution for θ. Substituting Eq. 15 back into Eq. 11, we have:

e^{\gamma - 1}\left[ 1 - \big( Q_\theta(b) - Q_\theta(a) \big) \right] = 0    (17)
Since we do not control or make any assumption about the prior distribution, the result in Eq. 17 indicates that the term e^{γ-1} should be infinitely small. This is easy to express in numerical calculations: it can simply be achieved by assigning a large negative number (e.g., −1000) to γ. Hence, by integrating the joint posterior over x, the final solution for the posterior of θ given a range constraint from a to b is:

p(\theta) = \begin{cases} \dfrac{q(\theta)}{Q_\theta(b) - Q_\theta(a)}, & \theta \in (a,b) \\ 0, & \theta \notin (a,b) \end{cases}    (18)
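A minimal numerical check of Eq. 18, assuming a Gaussian prior with illustrative range and parameters (not from the paper):

```python
# Range-constrained posterior (Eq. 18): the prior truncated to (a, b) and
# renormalized by Q(b) - Q(a).
import numpy as np
from scipy.stats import norm

a, b = 1.0, 3.0
theta = np.linspace(-5.0, 8.0, 2601)
q = norm.pdf(theta, loc=2.0, scale=1.5)        # illustrative prior q(theta)

mask = (theta > a) & (theta < b)
p = np.where(mask, q / (norm.cdf(b, 2.0, 1.5) - norm.cdf(a, 2.0, 1.5)), 0.0)

print(f"posterior integrates to {np.trapz(p, theta):.4f}")  # ~1 on the grid
```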
The solution from the entropy method gives a truncated distribution on the specified range. The result may seem intuitive, but the key point is that no assumption is made about the prior distribution: the truncated result is the effect of the encoded constraint.
2.4 Maximizing entropy with a general function as constraint
When modeling a parameter θ and its observable value x in a statistical model (such as a Bayesian network), the two variables can be regarded as jointly distributed. The correlation between these two variables can only describe linear dependence, but in most cases the true relation between two parameters is non-linear. In this part, a method of encoding the underlying physics between two jointly distributed random variables using an entropy constraint is discussed. Assume that a known relation exists between the observable quantity x and the parameter θ, expressed as θ = f(x). This piece of information can be interpreted as: given an observed value of x, the expected value of the correlated parameter is f(x). Or, in other words, the expected outcome given a parameter value θ is f⁻¹(θ), where f⁻¹ is the inverse of the function f. Written in constraint form, it can be expressed as:
\int p(\theta \mid x)\,\theta\,d\theta = f(x)    (19)
The integral is over the domain of the parameter θ. Eq. 19 can be stated in plain language as: the expected value of the parameter θ given the observed value x is equal to f(x). Similar to the previous sections, maximizing the entropy with the constraint in Eq. 19 and the normalization constraint as in Eq. 5, a Lagrange function can be formed:

L = S + \alpha \left[ \int p(\theta \mid x)\,d\theta - 1 \right] + \beta \left[ \int p(\theta \mid x)\,\theta\,d\theta - f(x) \right]    (20)
Note that the object in this case is the conditional probability of θ given x. To maximize the Lagrange function, the derivative with respect to the conditional probability of θ given x is set equal to zero, which gives:

\frac{\partial L}{\partial p} = -\int \left( \log\frac{p(\theta \mid x)}{q(\theta \mid x)} + 1 \right) d\theta + \alpha \int d\theta + \beta \int \theta\,d\theta = 0    (21)
Solving Eq. 21, we obtain the relation between the target likelihood function and the prior likelihood:

p(\theta \mid x) = q(\theta \mid x)\,\exp(\alpha - 1 + \beta\theta)    (22)
Now we assume normality for the distribution function to get an analytical solution. Substituting Eq. 22 back into the constraint in Eq. 19, we obtain the final form of the new distribution function given the constraint:

p(\theta \mid x) = q(\theta \mid x)\,\exp\!\left( \frac{f(x) - \mu}{\sigma^2}\,\theta \right) \exp\!\left( \frac{\mu^2 - f^2(x)}{2\sigma^2} \right)    (23)
Eq. 23 is the final result for the posterior given a general function as a constraint, in which μ and σ are the mean and standard deviation of the conditional distribution q(θ|x), i.e., the old likelihood function.
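A short sketch of Eq. 23 under the stated normality assumption (the values of μ, σ, and f(x) below are illustrative): tilting a Gaussian q(θ|x) by Eq. 23 yields a Gaussian with mean f(x) and unchanged variance, which can be verified numerically:

```python
# Eq. 23 tilt applied to a Gaussian prior conditional q(theta|x) ~ N(mu, sigma^2).
import numpy as np

mu, sigma = 5.0, 1.2        # mean/std of the prior conditional q(theta|x)
f_x = 6.5                   # expected parameter value implied by theta = f(x)

theta = np.linspace(0.0, 12.0, 2401)
q = np.exp(-0.5 * ((theta - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# The two exponential factors of Eq. 23; the second keeps the result normalized.
tilt = np.exp((f_x - mu) / sigma**2 * theta) * np.exp((mu**2 - f_x**2) / (2 * sigma**2))
p = q * tilt

print(f"new mean = {np.trapz(p * theta, theta):.4f} (target {f_x})")
print(f"normalization = {np.trapz(p, theta):.4f}")
```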
3. An adaptive framework for the BEN method
It is known that sequential updates in the Bayesian method always give the same result regardless of the order of the observation data. In [26][27], it has been pointed out that Jaynes' information theory (the entropy method) is inconsistent with probability theory (the Bayesian method). The inconsistency is due to the strong effect of the constraints introduced by the entropy term. This section provides a new point of view on the application of the
entropy method in the Bayesian framework. The point observations are handled by the classical Bayesian rule, and the extra information is encoded by the exponential term from the maximum entropy method. A weighting factor is introduced into the exponential term to mitigate the strong effect of the entropy term.
3.1 The strong effect of the entropy constraint
It has been noted that the entropy term introduces a strong constraint on the target posterior distribution [33][34]. This can result in an undesired, misinformed posterior distribution when the imposed constraint is wrong. Assume a case where the mean μ of a normal distribution is unknown, with the variance of the population distribution set to 15². The prior distribution for the mean follows N(50, 10²). Five samples are drawn from a normal distribution N(30, 5²) as observations from the population, namely (30.62, 37.18, 20.19, 29.01, 23.96). Aside from the five observations, a constraint on the mean of μ is imposed. In the first case, correct information is given, i.e., mean(μ)=30, while the mean constraint in the second case is 70. The result can be seen in Figure 1. The posterior in Figure 1 a) quickly converges to the true value specified by the constraint. But in the second case, when a piece of false information is presented (a mean constraint equal to 70), the posterior greatly deviates from the truth (Figure 1 b)), and subsequent updates have little effect on changing the mean of the posterior. As can be concluded from the comparison, the constraint introduced by the exponential term is a hard constraint: it is so hard that the observational data has little effect on the posterior. Incorporating a constraint in this way can lead to unwanted results. In the next part, we explore a more reasonable way to handle such constraints along with observation data.
Figure 1. Updating using data together with a) correct and b) incorrect information.
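The experiment in Figure 1 can be sketched numerically as follows (an illustrative reconstruction, not the authors' code, assuming the population standard deviation of 15 is known and imposing the moment constraint as the hard tilt of Eq. 9):

```python
# Hard entropy constraint on the mean: Bayesian point-data update on a grid,
# followed by an exponential tilt that forces E[mu] to the constraint value G.
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

obs = np.array([30.62, 37.18, 20.19, 29.01, 23.96])
mu = np.linspace(-20.0, 120.0, 7001)
post = norm.pdf(mu, 50.0, 10.0)                  # prior N(50, 10^2)
for x in obs:
    post *= norm.pdf(x, mu, 15.0)                # Bayesian point-data update
    post /= np.trapz(post, mu)

def constrained(post, G):
    """Tilt the posterior so that E[mu] = G (the hard entropy constraint)."""
    def mean_at(b):
        w = post * np.exp(b * mu)
        return np.trapz(w * mu, mu) / np.trapz(w, mu)
    b = brentq(lambda b: mean_at(b) - G, -2.0, 2.0)
    p = post * np.exp(b * mu)
    return p / np.trapz(p, mu)

for G in (30.0, 70.0):                           # correct vs. wrong information
    p = constrained(post, G)
    print(f"constraint G={G:.0f}: posterior mean = {np.trapz(p * mu, mu):.2f}")
```

The wrong constraint (G=70) pins the posterior mean regardless of the data, which is the hard-constraint behavior the adaptive framework below is designed to mitigate.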
3.2 An adaptive algorithm for the Bayesian Entropy method
The goal of this section is to mitigate the strong constraint imposed by the entropy term. Sometimes an expert's opinion may not be accurate, but it is still a good estimate. Or the abstracted statistics may be outdated and may not reflect new measurement data. The intuition is that when there is little data, the distribution should lean toward the solution with the entropy constraint, and as more observations arrive, the distribution should tend to believe the data. This section introduces a weighting factor into the BEN framework to mitigate the strong effect of the entropy term. The proposed governing equation can be expressed as:
p(\theta \mid x) = q(\theta \mid x)\,\exp\!\big(k\beta\,g(\theta)\big)    (24)
where q(θ|x) is the Bayesian posterior and k is a weighting factor for the constraint, balancing the entropy information and the data. It is defined as:

k = \frac{N}{N + n}    (25)
Here N is called the confidence related to the constraint β, and n is the number of observations available. For example, if a statistic states that the mean value of a variable is 10 and this piece of information is abstracted from 50 historical data points, then the confidence value can be set to N=50. According to Eq. 25, when no other data is available, k equals 1 and the resulting posterior in Eq. 24 is the entropy solution. When the data quantity is overwhelming, k goes to 0 and the resulting posterior is the Bayesian solution. To demonstrate the mixture model, a numerical example is given as follows. Assume the prior distribution for a parameter θ follows a Gamma distribution θ~Γ(α,β). We observe x to infer the distribution of θ. The likelihood of x given θ is an exponential distribution:
p(x \mid \theta) = \theta e^{-\theta x}, \; x \geq 0. The joint distribution of θ and x can be written as:

q(\theta, x) = p(\theta)\,p(x \mid \theta) = \frac{\beta^\alpha \theta^\alpha e^{-(\beta + x)\theta}}{\Gamma(\alpha)}    (26)
A mean value constraint is posed on the parameter θ:

\int_X \int p(\theta, x)\,\theta\,dx\,d\theta = G    (27)
To find a target distribution that maximizes the entropy and satisfies the constraint in Eq. 27 as well as the normalization constraint, we form a Lagrange function. To avoid confusion with the distribution parameters of θ, γ and δ are used as the Lagrange multipliers corresponding to the moment and normalization constraints. Similar to the previous derivations, the solution for the target posterior can be expressed as:

p(\theta, x) \propto \theta^\alpha e^{-(\beta + x)\theta}\,e^{\gamma\theta}    (28)
with an analytical solution for the Lagrange multiplier: γ = β − α/G. Plugging into the adaptive form:

p(\theta, x) \propto \theta^\alpha e^{-(\beta + x)\theta}\,e^{k\gamma\theta}    (29)
Assume that the prior for θ follows Γ(4,1) with mean equal to 4. The mean constraint G=5 has a confidence of N=5. The observations are samples drawn from a Γ(10,10) distribution, i.e., the data come from a narrow distribution with mean equal to 1; in this case, the given constraint is a biased one. The prior distribution is updated with 3, 10, and 20 observations using the classical Bayesian method, the Bayesian Entropy method, and the Bayesian Entropy method with the adaptive framework. Since the point observations are handled in the classical Bayesian manner, the sequence of the updating does not affect the results. As can be seen from the result in Figure 2, due to the misinformed constraint, the posterior from the Bayesian Entropy method deviates from the true solution. With the adaptive framework, when there is little data, the posterior is closer to the entropy result; as the number of observations increases, the posterior converges to the Bayesian result.
Figure 2. The posterior distribution updated with different numbers of observations.
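The example above can be reproduced with a short grid-based sketch (an illustrative reconstruction, using γ = β − α/G as derived for Eq. 28 and the weighting factor of Eq. 25):

```python
# Adaptive BE update of Eq. 29: exponential likelihood with a Gamma(alpha, beta)
# prior, a mean constraint G with confidence N, and weighting k = N/(N + n).
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 4.0, 1.0            # prior Gamma(4, 1), prior mean = 4
G, N = 5.0, 5.0                   # mean constraint and its confidence
gamma = beta - alpha / G          # multiplier so the tilted prior mean is G

data = rng.gamma(shape=10.0, scale=1.0 / 10.0, size=20)  # observations, mean ~ 1

theta = np.linspace(1e-3, 15.0, 3000)
for n in (3, 10, 20):
    x = data[:n]
    k = N / (N + n)
    # Bayesian part: Gamma(alpha + n, beta + sum x); entropy part: exp(k*gamma*theta)
    log_p = (alpha + n - 1) * np.log(theta) - (beta + x.sum()) * theta + k * gamma * theta
    p = np.exp(log_p - log_p.max())
    p /= np.trapz(p, theta)
    print(f"n={n:2d}: k={k:.2f}, posterior mean = {np.trapz(p * theta, theta):.3f}")
```

As n grows, k shrinks and the tilt fades, so the posterior approaches the purely Bayesian result.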
3.3 The Bayesian Entropy Network
Based on the above derivation, the BEN method is formulated based on the posterior calculated from Eq. 24. The key idea is that, in addition to the classical Bayesian method, the BEN method has an extra exponential term in the updating rule that can encode other types of information. Thus, the topology of a BEN is exactly the same as that of a classical Bayesian network; instead of the classical Bayes' rule, the probability is calculated using Eq. 24. Given the specific form of the constraint, the posterior distribution can be analytically solved. The proposed method is easy to implement in any existing Bayesian method. The BEN method can be regarded as a Bayesian network with extra constraints encoded in the updating rule.
4. Demonstration examples
In this section, two demonstration examples are given to illustrate how extra information can affect the behavior of a traditional Bayesian framework. The first example applies BEN to classification. The second example studies BEN for updating the risk assessment of an air traffic control network model.
4.1 BEN for image-based damage classification of pipes
When applying the BEN method to classification, the constraint setup is slightly different from the above derivation. Recall that a Naïve Bayes classifier has the simplest network structure, with one class node and several feature nodes. The features are assumed to be independent of each other, and each feature is directly connected only to the class node (Figure 3).
Figure 3. The network structure for a Naïve Bayes classifier.
The class node C is a discrete node representing the labels. Feature nodes f1 to f4 can be continuous or discrete. Each node contains the marginal distribution for the corresponding variable, and each edge contains the likelihood function. The network needs a set of data to train each probability function. Given a new data instance, classification is done by assigning the class label that achieves the highest posterior probability:

c^* = \arg\max_{i = 1 \ldots m} \; p(c_i) \prod_{j=1}^{n} p(f_j \mid c_i)    (30)
Since the prior information about a certain feature involves the class label (for example, the color of a lemon (class) is yellow (feature)), the constraint is expressed as:

\int_{F_j} p(f_j \mid C = c_i)\,g(f_j)\,df_j = G_i    (31)
where p(f_j|C=c_i) is the likelihood function, f_j represents the j-th feature, and c_i is the i-th class label. Considering the normalization constraint and a first-order moment with g(f_j)=f_j, the posterior likelihood function in relation to the one trained from data, denoted q(f_j|C), is:

p(f_j \mid C) = q(f_j \mid C)\,e^{\beta f_j}    (32)
where β is the Lagrange multiplier. Assuming a Gaussian distribution for the likelihood function, β can be analytically solved. Hence, the final solution for the posterior from BEN is:

p(f_j \mid C) = q(f_j \mid C)\,\exp\!\left( \frac{G_i - \mu}{\sigma^2}\,f_j \right)    (33)
The solution for a range constraint in this case is similar and is not given in detail. The classification rule for BEN as a classifier is then:

c^* = \arg\max_{i = 1 \ldots m} \; p(c_i) \prod_{j=1}^{n} q(f_j \mid c_i)\,e^{\beta f_j}    (34)
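A compact sketch of the BEN classification rule in Eq. 34 follows (the trained likelihood statistics, priors, and constraint below are hypothetical placeholders, not the paper's fitted values; the normalization factor from Eq. 23 is included so the tilted likelihoods remain comparable across classes):

```python
# BEN Naive Bayes: Gaussian likelihoods trained from data are tilted by
# exp(beta*f_j) with beta = (G_i - mu)/sigma^2 (Eq. 33) for constrained pairs.
import numpy as np
from scipy.stats import norm

# Trained (data-driven) likelihoods q(f_j|c_i): {class: {feature: (mu, sigma)}}
trained = {
    "slit":   {"length": (90.0, 25.0), "volume": (26.0, 4.5)},
    "indent": {"length": (70.0, 23.0), "volume": (40.0, 8.5)},
}
priors = {"slit": 0.5, "indent": 0.5}
# Moment constraints from expert knowledge: {(class, feature): G_i}
constraints = {("slit", "length"): 100.0}

def log_posterior(c, x):
    lp = np.log(priors[c])
    for f, v in x.items():
        mu, sd = trained[c][f]
        lp += norm.logpdf(v, mu, sd)
        if (c, f) in constraints:
            G = constraints[(c, f)]
            beta = (G - mu) / sd**2
            # Eq. 33 tilt plus the normalizing term (as in Eq. 23)
            lp += beta * v + (mu**2 - G**2) / (2 * sd**2)
    return lp

x_new = {"length": 102.0, "volume": 27.0}
print(max(priors, key=lambda c: log_posterior(c, x_new)))  # -> "slit"
```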
The following example involves damage classification in plastic pipes. In a pipeline system, the transmission pipelines are usually made of steel and the distribution pipelines are usually made of polymer materials. The pipes work under static pressure, and damage in the pipe accelerates the creep failure process in the pipeline system. Damage types including dents, slits, rock impingement, and squeeze-off are commonly seen in distribution pipelines. In this example, we focus only on the classification of two damage types: slit and indentation. Examples of real damage can be seen in Figure 4: Figure 4 a) shows an indentation damage and Figure 4 b) shows a slit on the inner pipe wall.
Figure 4. Two common damage types in gas pipelines: a) indentation and b) slit.
A hardware device with an endoscope camera and a laser pattern projector was built for imaging inspection of the pipe inner surface. Advanced image algorithms can reconstruct the surface in 3D; similar applications exist in medical research [35]. The idea is to use triangulation to calculate the angle and distance of any point of the laser pattern relative to the camera. The laser pattern scans through the pipe as the camera moves along, and the 3D surface is reconstructed from the laser-patterned image frames. In this task, we classify the damage types based on these reconstructed image data. Due to the laborious process of collecting real pipe imaging data, both simulated data and real data are used for the training and testing of the BEN classifier. The simulation is a Monte Carlo code that generates pipe sections with different damage types and random sizes and locations. White noise was added to the simulation to accommodate the potential noise in the camera sensor. The damage can vary in size and shape, but it can be characterized by geometric features derived from the 3D reconstruction, such as the length and volume of the anomaly. These two features were used to build the classifier. The network structure is shown in Figure 5.
Figure 5. The network topology for the pipe damage classifier.
110 simulated data instances and 90 lab-generated data instances (a total of 200 data instances) are available for training and testing. Overall, the data set can be fitted with the Normal distributions listed in Table 1; these distributions are regarded as the ground truth. From the 200 data instances, a general trend can be found: the length feature for slit is around 100 units, and the volume feature for indentation is always larger than 25. These two pieces of information are assumed to be the knowledge of an experienced pipe engineer and are used as a moment constraint (the length for slit) and a range constraint (the volume for indent) on the two features given the
corresponding class in a BEN classifier (listed in Table 2). The constraints are imposed on the likelihood function as expressed in Eq. 32.

Table 1. The fitted Gaussian distributions for the full data set

            Slit                Indent
Length      N(99.24, 24.13²)    N(70.51, 23.44²)
Volume      N(25.65, 4.36²)     N(40.54, 8.63²)
Table 2. The constraints applied in the BEN classifier

            Slit                        Indent
Length      Mean constraint: μ = 100    —
Volume      —                           Range constraint: > 25
The average accuracy of the classifiers was compared for various training data sizes. For each selected training data size, the training data were randomly selected from the full data set to train and validate the classifier, and the rest were used to test the accuracy. The average accuracy against training data size is plotted in Figure 6.

Figure 6. The average accuracy vs. training data size for the two classification methods (Naïve Bayes and BEN).
It can be concluded from the result that, with the additional information, the average accuracy of the BEN classifier is significantly higher than that of the Naïve Bayes classifier when the training size is small. The accuracies converge as the training size increases: when the training data is given in large quantities, the learned distributions from Naïve Bayes and BEN eventually converge, and the imposed constraints become negligible. Investigating the trained likelihood function shows that, after imposing the constraints, the probability density function (PDF) resembles the true distribution from the full data set more closely (Figure 7). The applied moment constraint shifted the trained distribution to the specified mean value (Figure 7 a)), while the range constraint truncated the trained distribution (Figure 7 b)).
Figure 7. The trained distributions for the likelihood functions a) P(length|slit) and b) P(volume|indent).
It can be concluded that BEN performs better when the training data is small: the BEN classifier can take advantage of the extra information and achieve fast learning compared to the Naïve Bayes classifier. The proposed BEN method provides a new way of encoding constraints into a network model.
4.2 BEN for human-related risk assessment in air traffic control
The following example explores the application of BEN to risk assessment in an air traffic control problem. The risk in air transport may come from various factors such as aircraft conditions (maintenance and design of the aircraft), environmental conditions (weather and terrain), operation and management, etc. [36]. Human error has always been considered a critical influencing factor in air traffic management (ATM) [37]. Much research work, such as [38][39], focuses on causal network models to infer the air traffic risk probability. In order to evaluate the probability of risk at an airport, a network model is built as shown in Figure 8.
Figure 8. Network topology for air traffic control.
The network can be interpreted as follows: the risk probability is related to two factors, the speed of the aircraft and the pilot condition. The speed is affected by the visibility, and the weather can affect both the visibility and the speed of an aircraft. The pilot condition is an evaluation of the pilot's status and serves as a prediction of the pilot's performance; it is related to the experience of the pilot and the amount of rest the pilot gets before the flight. The rest can be understood as the sleeping hours of the pilot. In this network, Risk and Weather are considered discrete nodes and the others are continuous. The marginal distribution for each variable is listed in Table 3. The node Risk can take two possible values, 0 and 1, corresponding to safe and accident respectively; the probability of Risk=1 represents the risk probability. The node Weather takes four possible values, each corresponding to a different weather condition. The continuous nodes are all modeled as Normal distributions with the specified parameters.

Table 3. Parameters and distributions in the air traffic control model
Node name     Distribution type                  μ      σ      Probabilities
Risk          Discrete (2 values: 0, 1)          —      —      [0.9, 0.1]
Speed         Normal                             51     20.5   —
Pilot         Normal                             82     20.5   —
Weather       Discrete (4 values: 0, 1, 2, 3)    —      —      [0.3, 0.3, 0.2, 0.2]
Visibility    Normal                             10     1      —
Experience    Normal                             30     5      —
Rest          Normal                             7      1      —
Each edge in the network means the connected variables are jointly distributed. The relation can be expressed through a likelihood function, which is also a distribution. In this example, the likelihood functions are all assumed to be normal distributions with fixed mean and variance; the distribution information can be found in Table 4. The conditional distributions of Speed given Visibility, Pilot given Experience, and Pilot given Rest simply mean that the two continuous variables are jointly distributed with correlation coefficients ρ = 0.3, 0.7, and 0.5 respectively.

Table 4. The likelihood functions in the air traffic control model
Likelihood                  μ                                    σ²
P(Speed|Weather=1)          55                                   8²
P(Speed|Weather=2)          45                                   8²
P(Speed|Weather=3)          52                                   12²
P(Speed|Weather=4)          53                                   9²
P(Visibility|Weather=1)     8                                    4
P(Visibility|Weather=2)     10                                   4
P(Visibility|Weather=3)     11                                   4
P(Visibility|Weather=4)     12                                   3
P(Speed|Visibility)         51 + 0.3·√20.5·(Visibility − 10)     18.655
P(Pilot|Experience)         82 + 0.7·√0.82·(Experience − 30)     10.455
P(Pilot|Rest)               82 + 0.5·√20.5·(Rest − 7)            15.375
P(Speed|Risk=0)             50                                   25
P(Speed|Risk=1)             60                                   25
P(Pilot|Risk=0)             85                                   25
P(Pilot|Risk=1)             55                                   25
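The conditional entries in Table 4 follow from the standard bivariate normal relations; the sketch below reproduces one row, reading the 20.5 entries as variances (an interpretation, not stated explicitly in the paper, that is consistent with the listed conditional variances, e.g., 20.5 × (1 − 0.5²) = 15.375):

```python
# Conditional normal: mu_{Y|Z=z} = mu_Y + rho*(sigma_Y/sigma_Z)*(z - mu_Z),
# var_{Y|Z} = sigma_Y^2 * (1 - rho^2).
import math

def conditional_normal(mu_y, var_y, mu_z, var_z, rho, z):
    mean = mu_y + rho * math.sqrt(var_y / var_z) * (z - mu_z)
    var = var_y * (1.0 - rho**2)
    return mean, var

# P(Pilot | Rest = 6): Pilot ~ N(82, 20.5), Rest ~ N(7, 1), rho = 0.5
print(conditional_normal(82.0, 20.5, 7.0, 1.0, 0.5, 6.0))  # -> (~79.74, 15.375)
```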
The network is updated in three scenarios. The first scenario is a pure Bayesian update with an observation of Rest=6. This piece of information might come from status-monitoring hardware on the pilot, such as a smart watch, or the pilot may have reported sleeping only 6 hours the day before. The second scenario has two pieces of information: the observation of sleeping hours as in the first scenario, and a constraint saying that the average sleeping hours of the pilot is 8. It can be interpreted as follows: the monitoring hardware may be malfunctioning or unreliable, and according to the historical profile the pilot sleeps 8 hours on average. The historical profile comes from 50 consecutive recordings, so a confidence of N=50 is assigned to this constraint. This is a moment constraint on the variable Rest. In the last scenario, a new relationship between the pilot condition and resting hours is imposed by the entropy method along with the observation. The pilot condition can be expressed as a function of Rest, Pilot = f(Rest). This piece of information could come from new research on the relation between pilot performance and sleeping hours. This constraint is imposed on the likelihood function. The three scenarios are marked on the topology of the network model in Figure 9. The information propagates in the network and updates the risk probability.
Figure 9. The topology of the air traffic control model with the information in the three scenarios.
The first and third scenarios update the Rest node using only the observation, while the second scenario adds the mean constraint. The distribution for Rest is calculated as:

p(\text{Rest}) \propto q(\text{observation}=6 \mid \text{Rest})\,q(\text{Rest})\,\exp(k\beta \cdot \text{Rest})    (35)

where q(Rest) is the prior and q(observation=6|Rest) is a normal likelihood with variance equal to 1. Since only one observation is available at this point, k = N/(N+1). The updated marginal distribution for Rest is shown in Figure 10. With only the observation value (first scenario), the posterior mean shifts and the variance decreases. With the BEN method, the mean of the posterior is shifted toward the value specified by the constraint.
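The second-scenario update can be sketched numerically as follows (illustrative, with β solved on a grid so the tilted posterior mean matches the constraint before applying the weighting factor k):

```python
# Eq. 35: one observation Rest = 6 handled by Bayes, plus a mean constraint
# of 8 with confidence N = 50, weighted by k = N/(N + n).
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

rest = np.linspace(2.0, 12.0, 2001)
prior = norm.pdf(rest, 7.0, 1.0)                 # Rest ~ N(7, 1) from Table 3
bayes = prior * norm.pdf(6.0, rest, 1.0)         # likelihood variance 1
bayes /= np.trapz(bayes, rest)

def mean_at(b):
    w = bayes * np.exp(b * rest)
    return np.trapz(w * rest, rest) / np.trapz(w, rest)

beta = brentq(lambda b: mean_at(b) - 8.0, -10.0, 10.0)  # mean constraint = 8
k = 50.0 / (50.0 + 1.0)                                  # n = 1 observation
p = bayes * np.exp(k * beta * rest)
p /= np.trapz(p, rest)
print(f"posterior mean = {np.trapz(p * rest, rest):.3f}")  # shifted toward 8
```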
Figure 10. The marginal distribution for Rest in the first two scenarios.
The updated distribution for Rest is then used to update the marginal distribution for Pilot, assuming that the likelihood function does not change. The product of the marginal distribution for Rest and the likelihood function gives the joint distribution of Pilot and Rest; the marginal for Pilot is calculated by integrating the joint over Rest. In the third scenario, the new correlation is encoded as expressed in Eq. 23:

p(\text{Pilot}, \text{Rest}) \propto q(\text{Pilot}, \text{Rest})\,\exp\!\left( \frac{f(\text{Rest}) - \mu}{\sigma^2} \cdot \text{Pilot} \right)    (36)
where q is the joint distribution from the Bayesian method and μ and σ are the parameters of the likelihood function. The updated result is shown in Figure 11 a). Since there is no additional information involving Pilot and Risk, the update for Risk is done with the Bayesian method; the result is shown in Figure 11 b).
Figure 11. The marginal distributions for a) the Pilot node and b) the Risk node in the three scenarios.
From the result we can see that the risk probability increases with the observation of Rest=6. This can be understood as the decreased sleeping time reducing the pilot performance, as shown by the orange curve in Figure 11 a); hence, the risk probability increases. But when we impose the constraint that the mean value of Rest is 8, we believe that the pilot has a regular schedule and sleeps 8 hours on average. This in turn enhances the pilot performance and thus reduces the risk. In the third scenario, an observation of Rest=6 is considered at the Rest node, and from Rest to Pilot a constraint is added to alter the likelihood function. According to the result, the pilot performance decreases drastically, which leads to the jump in risk probability: the positive correlation of the constraint function acting on the likelihood function penalizes the pilot performance due to the observed low resting hours, and this is reflected in the increased risk probability. The example demonstrates the ability of the BEN method to take advantage of information from various sources to update the risk probability. The reader should note that this example is only for illustrating that different types of constraints can be added in BEN and does not reflect any real research work.
5. Conclusion
A novel BEN method as a general tool for utilizing various types of information is introduced in this paper. The method combines a classical Bayesian part to handle point data and an exponential part to encode constraints. Several conclusions can be drawn based on the proposed study: 1) The proposed Bayesian Entropy Network is a generalized Bayesian network model and has the same modeling structure; 2) Different types of information can be encoded using the entropy principle with an analytical expression in the exponential term, while point observations are handled by the likelihood function. Various types of information constraints, including moment constraints, range constraints, and general functional constraints, have been developed; 3) In classification, the encoded extra information can enhance performance when the number of point observations is small. When point observations are abundant, both BEN and BN converge to the same results; 4) With the adaptive framework, the posterior from the BEN method believes in the constraint when observations are limited and converges to the classical Bayesian posterior when observations are available in large numbers; 5) In general, the proposed BEN shows its flexibility to handle multiple types of information commonly seen in engineering practice and can serve as a generalized information fusion tool for system reliability analysis.
The BEN method can be easily adapted into existing Bayesian-based computational methods without additional computational cost, since it only adds an exponential term to the Bayesian equation. BEN can be beneficial in industry problems where important extra knowledge exists beyond the observed data. The adaptive framework proposed for the BEN method offers an intuitive and flexible way of handling constraint information, but it still lacks theoretical support for the rationale behind it. Other types of weighting factors may have a similar effect on the behavior of BEN and need further study. The method was demonstrated in examples using Gaussian and Gamma distributions. More general cases, such as mixture models (multi-modal distributions) or non-parametric distributions, require more research. For mixture models, there could exist several underlying mechanisms, and setting a single constraint on a mixture model would be inappropriate; a more complex scheme is required for setting the constraints with respect to each separate mechanism in such engineering systems. Due to the nature of a non-parametric distribution, low-order statistical moment information is not appropriate to serve as the constraints. The analytical solution for non-parametric or mixture-model distributions with constraints could be intractable; a numerical method (such as expectation maximization) may be beneficial in such cases.
Acknowledgments The research reported in this paper was supported by funds from NASA University Leadership Initiative program (Contract No. NNX17AJ86A, Project Officer: Dr. Anupa Bajwa, Principal Investigator: Dr. Yongming Liu). The support is gratefully acknowledged.
References
[1] M. Bayes and M. Price, "An Essay towards Solving a Problem in the Doctrine of Chances. By the Late Rev. Mr. Bayes, F. R. S. Communicated by Mr. Price, in a Letter to John Canton, A. M. F. R. S.," Philos. Trans. R. Soc. London, vol. 53, pp. 370–418, Jan. 1763.
[2] T. Peng, A. Saxena, K. Goebel, Y. Xiang, S. Sankararaman, and Y. Liu, "A novel Bayesian imaging method for probabilistic delamination detection of composite materials," Smart Mater. Struct., vol. 22, no. 12, 2013.
[3] T. Peng, J. He, Y. Liu, A. Saxena, J. Celaya, and K. Goebel, "Integrated fatigue damage diagnosis and prognosis under uncertainties," PHM, pp. 1–11, 2012.
[4] J. N. Yang and W. J. Trapp, "Reliability Analysis of Aircraft Structures under Random Loading and Periodic Inspection," AIAA J., vol. 12, no. 12, pp. 1623–1630, Dec. 1974.
[5] J. L. Beck and L. S. Katafygiotis, "Updating Models and Their Uncertainties. I: Bayesian Statistical Framework," J. Eng. Mech., vol. 124, no. 4, pp. 455–461, Apr. 1998.
[6] D. Kavetski, G. Kuczera, and S. W. Franks, "Bayesian analysis of input uncertainty in hydrological modeling: 2. Application," Water Resour. Res., vol. 42, no. 3, Mar. 2006.
[7] J. L. Beck, "Bayesian system identification based on probability logic," Struct. Control Heal. Monit., vol. 17, no. 7, pp. 825–847, Nov. 2010.
[8] N. Friedman, D. Geiger, and M. Goldszmit, "Bayesian Network Classifiers," Mach. Learn., vol. 29, pp. 131–163, 1997.
[9] M. Chiachío, J. Chiachío, D. Prescott, and J. Andrews, "A new paradigm for uncertain knowledge representation by Plausible Petri nets," Inf. Sci., vol. 453, pp. 323–345, Jul. 2018.
[10] M. Chiachío, J. Chiachío, D. Prescott, and J. Andrews, "Plausible Petri nets as self-adaptive expert systems: A tool for infrastructure asset monitoring," Comput. Civ. Infrastruct. Eng., vol. 34, no. 4, pp. 281–298, 2019.
[11] S. Sankararaman and S. Mahadevan, "Likelihood-based representation of epistemic uncertainty due to sparse point data and/or interval data," Reliab. Eng. Syst. Saf., 2011.
[12] J. Graca, K. Ganchev, and B. Taskar, "Expectation Maximization and Posterior Constraints," Adv. Neural Inf. Process. Syst. 20, pp. 1–8, 2008.
[13] K. Ganchev and J. Gillenwater, "Posterior Regularization for Structured Latent Variable Models," J. Mach. Learn. Res., vol. 11, pp. 2001–2049, 2010.
[14] J. Zhu, N. Chen, and E. P. Xing, "Bayesian Inference with Posterior Regularization and applications to Infinite Latent SVMs," vol. 15, pp. 1799–1847, 2012.
[15] G. Christakos, "A Bayesian/maximum-entropy view to the spatial estimation problem," Math. Geol., vol. 22, no. 7, pp. 763–777, Oct. 1990.
[16] G. Christakos, Integrative problem-solving in a time of decadence. Springer Science & Business Media, 2010.
[17] A. Adam-Poupart, A. Brand, M. Fournier, M. Jerrett, and A. Smargiassi, "Spatiotemporal modeling of ozone levels in Quebec (Canada): a comparison of kriging, land-use regression (LUR), and combined Bayesian maximum entropy-LUR approaches," Environ. Health Perspect., vol. 122, no. 9, pp. 970–976, Sep. 2014.
[18] S. Banerjee, B. Carlin, and A. Gelfand, Hierarchical modeling and analysis for spatial data. CRC Press, 2014.
[19] A. Kolovos, J. Angulo, et al., "Model-driven development of covariances for spatiotemporal environmental health assessment," Springer, vol. 185, no. 1, pp. 815–831, 2013.
[20] S.-J. Lee and E. A. Wentz, "Applying Bayesian Maximum Entropy to extrapolating local-scale water consumption in Maricopa County, Arizona," Water Resour. Res., vol. 44, no. 1, 2008.
[21] K. P. Messier, T. Campbell, P. J. Bradley, and M. L. Serre, "Estimation of Groundwater Radon in North Carolina Using Land Use Regression and Bayesian Maximum Entropy," Environ. Sci. Technol., vol. 49, no. 16, pp. 9817–9825, 2015.
[22] S. Tang, X. Yang, D. Dong, and Z. Li, "Merging daily sea surface temperature data from multiple satellites using a Bayesian maximum entropy method," Front. Earth Sci., vol. 9, no. 4, pp. 722–731, 2015.
[23] E. T. Jaynes, "Information theory and statistical mechanics," Phys. Rev., vol. 106, no. 4, pp. 620–630, 1957.
[24] E. T. Jaynes, "Information Theory and Statistical Mechanics. II," Phys. Rev., vol. 108, no. 2, pp. 171–190, 1957.
[25] A. Caticha and A. Giffin, "Updating probabilities," in AIP Conference Proceedings, 2006.
[26] K. Friedman and A. Shimony, "Jaynes's maximum entropy prescription and probability theory," J. Stat. Phys., vol. 3, no. 4, pp. 381–384, 1971.
[27] A. Shimony, "The status of the principle of maximum entropy," Synthese, vol. 63, no. 1, pp. 35–53, 1985.
[28] A. Giffin and A. Caticha, "Updating probabilities with data and moments," in AIP Conference Proceedings, 2007, vol. 954, pp. 74–84.
[29] X. Guan, R. Jha, and Y. Liu, "Probabilistic fatigue damage prognosis using maximum entropy approach," J. Intell. Manuf., vol. 23, no. 2, pp. 163–171, 2012.
[30] E. VanDerHorn and S. Mahadevan, "Bayesian model updating with summarized statistical and reliability data," Reliab. Eng. Syst. Saf., vol. 172, pp. 12–24, 2018.
[31] Y. V. Shkvarko, "Estimation of wavefield power distribution in the remotely sensed environment: Bayesian maximum entropy approach," IEEE Trans. Signal Process., vol. 50, no. 9, p. 2333, 2002.
[32] B. Fassinut-Mombot and J.-B. Choquel, "A new probabilistic and entropy fusion approach for management of information sources," Inf. Fusion, vol. 5, no. 1, pp. 35–47, Mar. 2004.
[33] Y. Wang, Y. Liu, Z. Sun, and P. Tang, "A Bayesian-Entropy Network for Information Fusion and Reliability Assessment of National Airspace Systems," in PHM Society Conference, 2018, vol. 10, no. 1.
[34] Y. Wang and Y. Liu, "A Novel Bayesian Entropy Network for Probabilistic Damage Detection and Classification," in 2018 AIAA Non-Deterministic Approaches Conference, 2018.
[35] C. Schmalz, F. Forster, A. Schick, and E. Angelopoulou, "An endoscopic 3D scanner based on structured light," Med. Image Anal., vol. 16, no. 5, pp. 1063–1072, 2012.
[36] J. J. H. Liou, G.-H. Tzeng, and H.-C. Chang, "Airline safety measurement using a hybrid model," J. Air Transp. Manag., vol. 13, no. 4, pp. 243–249, Jul. 2007.
[37] M. Rodgers, Human factors impacts in air traffic management. Routledge, 2017.
[38] B. J. M. Ale et al., "Further development of a Causal model for Air Transport Safety (CATS): Building the mathematical heart," Reliab. Eng. Syst. Saf., vol. 94, no. 9, pp. 1433–1441, Sep. 2009.
[39] Y. Liu and K. Goebel, "Information Fusion for National Airspace System Prognostics," PHM Soc. Conf., vol. 10, no. 1, Sep. 2018.
Declaration of interests
☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
☐The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: