Robotics and Computer Integrated Manufacturing 57 (2019) 59–72
Performance guaranteed human-robot collaboration with POMDP supervisory control☆
Xiaobin Zhang, Hai Lin⁎
Department of Electrical Engineering, University of Notre Dame, Notre Dame, IN 46556, USA
Keywords: Human-robot collaboration; Partially observable Markov decision process; Formal methods; Supervisory control

ABSTRACT
Human-Robot Collaboration (HRC) studies how to achieve effective collaborations between human and robots to take advantage of the flexibility from human and the autonomy from robots. Many applications involving HRC, such as joint assembly manufacturing systems and advanced driver assistance systems, need to achieve highlevel tasks in a provably correct manner. These applications motivate the requirements of HRC to have the performance guarantee to assure the task completion and safety of both human and robots. In this paper, a correct-by-design HRC framework is built to enable a performance guaranteed HRC. To model the uncertainties from human, robots and the environment, partially observable Markov decision process (POMDP) is used as the model. Based on the POMDP modeling, a supervisory control framework is applied and designed to be adaptive to modeling uncertainties. To reduce the model checking complexity involved in the supervisor synthesis process, an abstraction method for POMDP is integrated to find a quotient system with a smaller size of state space. Based on the abstraction method, a verification adaptation technique is developed with simulation relation checking algorithms to deal with possible online model changing. If the verification adaptation indicates the necessity to update the supervisor, supervisor adjustment methods are given. Altogether, it leads to a semi-online adaptation approach for system model changing. Examples are given to illustrate this framework.
1. Introduction

Human-Robot Collaboration (HRC) focuses on how to achieve effective collaboration between human and robots. While robots have advantages in handling repeated routine work, the human is more adaptive and flexible to changing factors. These factors may bring uncertainties that cost non-trivial effort for robots to overcome. Therefore, an efficient collaboration between human and robots can take advantage of the flexibility of the human and the autonomy of the robots. As more and more advanced sensing and actuation modules are being developed for robots, HRC applications range from schools (teaching assistant robots) and hospitals to deep-sea and outer-space exploration. This paper focuses on HRC applications where human and robots work in close cooperation side-by-side instead of remote operation [1]. Much existing work in HRC focuses on robot actuation design [2] and low-level motion planning [3]. For high-level planning in HRC, existing work mainly solves the controller design problem through reinforcement learning [4] or optimization, with the collaboration task being converted into reward functions [5,6]. The
design is basically a trial-and-error process, and it is possible that some bugs remain undiscovered even after extensive testing [7]. However, many HRC applications, such as joint assembly manufacturing systems [8,9], advanced driver assistance systems [10], and intelligent service robots [11], need to achieve high-level tasks in a provably correct manner. These applications motivate the requirement that HRC provide performance guarantees at the high level to assure task completion and the safety of both human and robots. Our basic idea for achieving the performance guarantee in HRC is to convert the planning and control problem in HRC into a partially observable Markov decision process (POMDP) supervisory control problem. In human science, probabilistic sequential models are widely used for the sequential analysis of human behavior [12]. Following many experimental results on human behavior recognition and prediction, human models with the Markov property have proved highly accurate in describing human behaviors [13,14]. Among different probabilistic sequential models such as the Markov chain and the hidden Markov model (HMM), POMDP is more general, with partial observability to describe observation errors that come from hidden human behavior and imperfect robot sensing [6,15]. Meanwhile, a POMDP model can be
☆ The financial support from NSF-EECS-1253488 and NSF-CNS-1446288 for this work is gratefully acknowledged.
⁎ Corresponding author. E-mail addresses: [email protected] (X. Zhang), [email protected] (H. Lin).
https://doi.org/10.1016/j.rcim.2018.10.011
Received 14 June 2018; Received in revised form 28 October 2018; Accepted 31 October 2018
0736-5845/© 2018 Elsevier Ltd. All rights reserved.
learned from the training data collected for human and robot behavior in HRC [16]. These advantages of POMDP modeling inspire our choice of POMDP as the model of HRC. With POMDP modeling, HRC high-level tasking is a control problem whose design objective is to find the control strategy for the robot to collaborate effectively with the human. Following our previous work on POMDP supervisory control [17], we apply formal methods with POMDP supervisory control to achieve guaranteed performance in HRC. By modeling HRC planning and control under the supervisory control framework, the high-level tasks can be formally described with Probabilistic Computation Tree Logic (PCTL) [18]. With the philosophy of supervisory control that enables more than one action for the system to execute, the supervisor can be permissive enough to generate backup actions and thus be robust to actuation failures in HRC. While most of the existing work on POMDP planning is based on rewards that describe the design objective with optimization-based approaches (such as point-based value iteration [19], and policy iteration and gradient search with finite state controllers [20]), the formal methods based approach in the POMDP supervisory control framework designs the system controller in a thorough fashion with different computational verification tools to guarantee the high-level performance of the controller [21]. Only in recent years have formal methods been applied to POMDP planning, and these methods restrict the controller types to be of fixed size [22], memoryless [23], or non-permissive. Compared to these approaches, our POMDP supervisory control framework finds a history-dependent and permissive supervisor, which is more general for our control purpose. In the developed POMDP supervisory control framework, supervisory control, machine learning and formal verification methodologies are combined. To automatically synthesize a proper supervisor, model checking algorithms are applied to find counterexamples that witness violations of the design objectives. These counterexamples are then used to iteratively improve the learned supervisor. For our POMDP supervisory control framework, there are two main challenges. The first main challenge is that, while a POMDP model can be learned with statistical methods, the learned POMDP model may contain modeling uncertainties [24]. To deal with modeling uncertainties, in this paper probability intervals are used to describe the transition and observation models. The upper and lower bounds of these probability intervals are then used in the supervisor synthesis process to find a proper supervisor. The second main challenge is that the model checking process for POMDP can introduce high computational complexity. To reduce the size of the state space involved in the model checking process, an abstraction method for POMDP is integrated based on our previous result on a counterexample-guided abstraction refinement (CEGAR) framework for POMDP [25]. If a proper abstract system with a smaller state space is returned, the dimension of the belief state space can be reduced, which further reduces the model checking complexity of POMDP. Based on this abstraction method, an online verification adaptation technique is discussed extensively with simulation relation checking algorithms. If the change of the system model can be compensated by the simulation relation, then the current supervisor can still regulate the system behavior.
Otherwise, an adjustment of the supervisor is necessary. As the supervisor adjustment still relies on model checking and counterexample generation, it may consume non-trivial time, which makes the adaptation process semi-online. For this case, we also discuss future work to improve its performance.
1.1. Related work: POMDP in HRC

For HRC applications, in [6], POMDP is used to model an assistive system designed for cognitively impaired participants. Based on observations of the human users' past behaviors, the planning algorithm selects the optimal policy to help the human achieve the desired goal. In [15], POMDP is proposed as a general model for human-in-the-loop systems to model human and machine. A warning system is designed for automobile systems to assist drivers when they feel drowsy. With the physical positions of human and vehicle as the states and human actions as the observations, a reward-based planning algorithm is used to keep the vehicle driving safely without going off the lane. Besides POMDP modeling, existing work that discusses the system-level design problem in HRC also considers other probabilistic and stochastic models. For example, in [26], the authors use the Intention-Driven Dynamics Model (IDDM), an extension of the Gaussian Process Dynamical Model (GPDM), to model human intentions. Together with their previous results where a Gaussian mixture model is used to describe the robot state changes after certain actions, the robot control policies can be generated through reinforcement learning [4]. In [27], the authors use a Markov decision process (MDP) to model the robot and human and describe uncertainties from different sources with a reward function based control policy design. Similarly, in [5], the authors use a mixed-observability MDP (MOMDP), where both observable states and partially observable states exist, to model human and robot. The control policy is also generated by solving an optimization problem with a reward function learned to describe the human intention and preference using Inverse Reinforcement Learning (IRL). As a summary of different HRC models, POMDP is an extension of MDP and MOMDP because POMDP does not assume full observability for any system states. While IDDM and GPDM can be viewed as POMDPs with a continuous state space, this paper focuses on high-level planning over a discretized state space, which makes POMDP a more suitable model. In consequence, the comprehensiveness of POMDP in representing uncertainties from different sources makes it one of the most general and thus popular models for HRC applications. Therefore, in this paper, POMDP is adopted as the model for HRC.

1.2. Our contributions

As the primary contribution of this paper, performance guaranteed HRC is achieved following a POMDP supervisory control framework. While this framework assumes a known POMDP model for HRC, POMDP model learning and modeling uncertainty description are discussed. Following that, the supervisory control framework is extended to consider POMDP with modeling uncertainties. Meanwhile, a semi-online adaptation of the supervisory control framework is achieved by verification adaptation based on the safe simulation relation and supervisor adjustment with new counterexamples. This paper is an extended version of our preliminary conference paper [28]. Compared to [28], this paper makes the following new contributions. First, a POMDP is adopted as the model for HRC to avoid the parallel composition process involved in the MDP-POMDP modeling. Then the supervisory control framework in [17] is extended to deal with POMDP modeling uncertainties. Meanwhile, an abstraction method for POMDP is integrated to reduce the computational complexity of model checking on POMDP. Extending the abstraction method proposed in [25], verification adaptation techniques are developed to allow adaptation to online model changes.

1.3. Outline of the paper

The rest of this paper is organized as follows. In Section 2, necessary preliminaries for formal methods in POMDP are introduced. Section 3 presents the HRC supervisor design with the POMDP supervisory control framework.
Following that, the adaptation abilities of the supervisory control framework are discussed in Section 4 to deal with online model changes. Section 5 addresses a case study for a driving scenario. Finally, Section 6 summarizes this paper with conclusions.
2. Preliminaries
2.1. POMDP Model

Definition 1. A POMDP is a tuple 𝒫 = {S, s̄, A, T, Z, O}, where

• S is a finite set of states;
• s̄ ∈ S is the initial state;
• A is a finite set of actions;
• T: S × A × S → [0, 1] is a transition function with Σ_{s′∈S} T(s, a, s′) = 1, ∀s ∈ S, a ∈ A;
• Z is a finite set of observations;
• O: S × Z → [0, 1] is an observation function with Σ_{z∈Z} O(s, z) = 1, ∀s ∈ S.

POMDP uses the transition function T to capture randomness in state transitions. Compared to MDP, which assumes a fully observable state space, POMDP assigns a probability distribution over possible observations to each state to capture the partial observability of the states. As for labeled MDP, a labeled POMDP 𝒫 = (S, s̄, A, T, Z, O, L) is a POMDP {S, s̄, A, T, Z, O} with an extra labeling function L: S → 2^AP that assigns a subset of atomic propositions AP to each state s ∈ S to describe design requirements with temporal logics.

The execution of a POMDP can be represented by a path, a non-empty sequence of states and actions of the form π = s₀a₀s₁a₁s₂…, where s₀ = s̄, sᵢ ∈ S, aᵢ ∈ A and T(sᵢ, aᵢ, sᵢ₊₁) > 0 for all i ≥ 0 [18]. While the states cannot be directly observed, the observation sequence given a path π is the unique sequence obs(π) = z₀a₀z₁a₁z₂…, where zᵢ ∈ Z and O(sᵢ, zᵢ) > 0 for all i ≥ 0. This observation sequence embeds a history of a POMDP execution without the initial observation. When the initial state is given, the initial observation can be defined as a special observation Init, and the case where only the initial distribution is known can be converted to the former case [17]. For planning and decision making in POMDP, the observation sequences (histories) are the control inputs. A pure adversary maps each finite history onto an action, while a randomized adversary maps it onto a probability distribution over A. For the finite horizon PCTL introduced later, restricting the set of adversaries to pure strategies does not change the satisfaction relation of the specifications [29]. Note that the adversaries for POMDP must be observation-based in the sense that the same output should be given for paths with the same observation sequence.

2.2. PCTL and PCTL model checking over POMDPs

PCTL [18] is the probabilistic extension of Computation Tree Logic (CTL) [30], where the probabilistic operator P is considered as the quantitative extension of CTL's A (always) and E (exist) operators.

Definition 2. [18] The syntax of PCTL is defined as

• State formula ϕ ::= true | α | ¬ϕ | ϕ ∧ ϕ | P⋈p[ψ],
• Path formula ψ ::= X ϕ | ϕ U≤k ϕ | ϕ U ϕ,

where α ∈ AP, ⋈ ∈ {≤, <, ≥, >}, p ∈ [0, 1] and k ∈ ℕ.

As standard Boolean operators, ¬ stands for "negation" and ∧ for "conjunction". For the temporal operators, X stands for "next", U≤k for "bounded until" and U for "until". The probabilistic operator P gives a probabilistic threshold p for the path formula ψ.

Definition 3. [17] For a labeled POMDP 𝒫 = (S, s̄, A, T, Z, O, L), the satisfaction relation ⊨ for any state s ∈ S is defined inductively as follows:

• s ⊨ true for all s ∈ S;
• s ⊨ α iff α ∈ L(s);
• s ⊨ ¬ϕ iff s ⊭ ϕ;
• s ⊨ ϕ₁ ∧ ϕ₂ iff s ⊨ ϕ₁ and s ⊨ ϕ₂;
• s ⊨ P⋈p[ψ] iff Prν_s({π ∈ Pathν_s | π ⊨ ψ}) ⋈ p, ∀ν ∈ Adv, where Adv is the set of all observation-based adversaries and Pathν_s is the set of paths starting from s under adversary ν;
• π ⊨ X ϕ iff π(1) ⊨ ϕ;
• π ⊨ ϕ₁ U≤k ϕ₂ iff there exists i ≤ k such that π(i) ⊨ ϕ₂ and π(j) ⊨ ϕ₁ for all j < i;
• π ⊨ ϕ₁ U ϕ₂ iff there exists k ≥ 0 such that π ⊨ ϕ₁ U≤k ϕ₂.

Compared to PCTL satisfaction over MDP, POMDP restricts the adversaries to be observation-based. While a detailed discussion of the probability space that underlies PCTL satisfaction over POMDP is out of the scope of this paper, readers may refer to our previous results in [17]. For POMDP model checking, the POMDP can be converted to an MDP with POMDP histories as states. Then MDP model checking algorithms [18] can be applied with software such as PRISM [31] and the Storm model checker [32]. For PCTL specifications like bounded until, the model checking problem over POMDPs can be converted to an equivalent optimal policy computation problem [17]. Then POMDP optimal policy computation techniques can be applied with software such as APPL [33] and POMCP [34].

3. HRC Supervisor design with POMDP supervisory control

3.1. POMDP Modeling in HRC

A POMDP 𝒫 = {S, s̄, A, T, Z, O} can be used as the high-level model in HRC. For HRC, the state space S of the POMDP model contains states that describe human intentions and the statuses of the robot and the environment. The observation space Z gives the observable information from robot sensors for the human, the robot and the surrounding environment. These observations can be, for example, human body angles, hand gestures, robot positions, and whether a door in the environment is open. The action set is generated based on the actuation abilities of the robot, regarding, for example, base motion and object handling with the end-effector. To learn a POMDP model from training data of human and robot behaviors, existing statistical methods can be applied.

3.1.1. POMDP model learning

From a statistical point of view, POMDP is an extension of HMM because both of them have hidden states that must be inferred from a sequence of observations. While POMDP models the decision-making process as well, HMM model learning methods can be directly applied to POMDP learning with more training data generated under different action sequences [35]. For HMM model learning, there are two types of approaches: supervised learning and unsupervised learning. In supervised learning, the underlying hidden state sequence is known for each training sample sequence. Based on maximum-likelihood estimation, the estimates of the transition and observation probabilities are the sample means, i.e., matrices of empirical frequencies of transitions and observations in the training data. The assumption of known hidden state sequences in the training data holds for many scenarios in HRC. For example, in [36], an intelligent wheelchair navigation scenario is considered, and the human hidden states (locations of interest) are detected if the human stays in one location for a certain amount of time. For the human driver modeling discussed in [16], humans are given instructions about their goals during the data collection process, which assigns the training data with known human hidden states.
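To make the supervised estimation step concrete, the sketch below computes a count-based maximum-likelihood estimate of the transition and observation matrices from trajectories whose hidden states are annotated. It is a minimal illustration rather than the exact procedure of the cited works; the trajectory format and the smoothing constant are assumptions.

```python
import numpy as np

def estimate_pomdp(trajectories, n_states, n_actions, n_obs, smoothing=1e-6):
    """Count-based MLE of T(s, a, s') and O(s, z) from state-annotated data.

    Each trajectory is a list of steps (s, a, s_next, z_next), where the hidden
    states s, s_next are assumed to be known (supervised setting).
    """
    T_counts = np.full((n_states, n_actions, n_states), smoothing)
    O_counts = np.full((n_states, n_obs), smoothing)
    for traj in trajectories:
        for (s, a, s_next, z_next) in traj:
            T_counts[s, a, s_next] += 1.0
            O_counts[s_next, z_next] += 1.0
    # Normalize the counts into conditional probability distributions.
    T = T_counts / T_counts.sum(axis=2, keepdims=True)
    O = O_counts / O_counts.sum(axis=1, keepdims=True)
    return T, O

# Toy usage: two states, one action, two observations.
demo = [[(0, 0, 1, 1), (1, 0, 0, 0), (0, 0, 1, 1)]]
T, O = estimate_pomdp(demo, n_states=2, n_actions=1, n_obs=2)
```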
Instead of assuming known hidden state sequences, unsupervised learning estimates the HMM parameters directly from the training data given as a set of observation sequences. The most standard algorithm for unsupervised learning is the Baum-Welch algorithm [37], an expectation-maximization (EM) algorithm for learning HMMs/POMDPs from observation sequences. For POMDP model learning, the Baum-Welch algorithm fixes the number of states in the POMDP and then learns the transition probabilities given the observation sequences. If a new state space that further increases the log-likelihood of the observations is chosen, the Baum-Welch algorithm is executed again on it [35]. For example, in [38], the Baum-Welch algorithm is used to learn a POMDP for a robot navigation and localization application. Besides the Baum-Welch algorithm, there are also many recently developed POMDP learning algorithms that reduce the size of the required training data and the computational complexity [35].
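As a reference point for the unsupervised case, the following is a minimal single-sequence Baum-Welch sketch (no numerical scaling, one discrete observation stream, actions ignored); an action-conditioned POMDP version would keep one transition matrix per action and accumulate the expected counts per executed action. All function and variable names here are illustrative.

```python
import numpy as np

def baum_welch(obs, n_states, n_obs, n_iter=50, seed=0):
    """EM re-estimation of HMM parameters (A, B, pi) from one observation sequence."""
    rng = np.random.default_rng(seed)
    A = rng.dirichlet(np.ones(n_states), size=n_states)   # A[i, j] = P(state j | state i)
    B = rng.dirichlet(np.ones(n_obs), size=n_states)      # B[i, z] = P(obs z | state i)
    pi = np.full(n_states, 1.0 / n_states)
    n_steps = len(obs)
    for _ in range(n_iter):
        # E-step: forward (alpha) and backward (beta) passes.
        alpha = np.zeros((n_steps, n_states))
        beta = np.zeros((n_steps, n_states))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, n_steps):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta[n_steps - 1] = 1.0
        for t in range(n_steps - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)          # P(state at t | observations)
        xi = np.zeros((n_steps - 1, n_states, n_states))
        for t in range(n_steps - 1):
            xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
            xi[t] /= xi[t].sum()                           # P(i -> j at step t | observations)
        # M-step: re-estimate the parameters from expected counts.
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for z in range(n_obs):
            B[:, z] = gamma[np.array(obs) == z].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
    return A, B, pi

A, B, pi = baum_welch([0, 1, 1, 0, 1, 0, 0, 1], n_states=2, n_obs=2)
```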
3.1.2. POMDP modeling uncertainties

From the training data, the POMDP model for HRC can be learned. However, for various reasons (such as limited data or insufficient inference time), the learned POMDP model may contain modeling uncertainties [39]. In these cases, the modeling uncertainties make the learned transition and observation probabilities subject to a certain confidence level or likelihood. In supervised learning, the collected data sequences used for POMDP model learning are associated with the hidden states, so the states can be treated as fully observable. In this way, the learning processes for the transition and observation models in POMDP are separated, which makes POMDP training a standard estimation problem for the multinomial distribution. Therefore, given a certain confidence level, the modeling uncertainties can be described by a confidence interval or likelihood region [40]. In this case, the transition/observation probabilities of the POMDP can be represented by probability intervals with upper and lower bounds, instead of single values. This technique for describing modeling uncertainties has been discussed in [41] for human driver behavior modeling, where the transition probabilities are given as convex intervals. Since supervised learning uses the empirical frequencies as the parameter estimates, the confidence intervals for the POMDP transition/observation models can be generated following the standard confidence interval calculation for the parameter estimation of the multinomial distribution. For a general HMM, the confidence interval can also be calculated [42]. Given an HMM model trained by unsupervised learning, bootstrapping [43] is used to generate bootstrap samples. After enough samples are collected, the bootstrap estimation gives the distributions of the transition and observation probabilities. Based on these distributions, the confidence intervals for the transition/observation models of the learned HMM/POMDP can be calculated. As a summary, for a POMDP learned by either supervised or unsupervised learning, the modeling uncertainties can be described by confidence intervals, which give upper and lower bounds for the transition and observation probabilities in the model. Based on that, extensions of the POMDP model that consider modeling uncertainties are discussed in [24] as POMDPs with imprecise parameters (POMDPIP) and in [44] as bounded-parameter POMDPs.
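As an illustration of the interval description, the snippet below turns empirical transition counts into per-entry confidence intervals using the normal approximation for a multinomial proportion; this is only one of the standard constructions mentioned above, and the count layout and clipping are assumptions.

```python
import numpy as np
from scipy.stats import norm

def transition_intervals(counts, confidence=0.95):
    """Per-entry confidence intervals for rows of a transition count matrix.

    counts[s, s'] is the number of observed s -> s' transitions under one action.
    Returns (lower, upper) arrays of the same shape, clipped to [0, 1].
    """
    z = norm.ppf(0.5 + confidence / 2.0)      # two-sided critical value
    n = counts.sum(axis=1, keepdims=True)     # number of samples per source state
    p_hat = counts / n                        # empirical frequencies
    half_width = z * np.sqrt(p_hat * (1.0 - p_hat) / n)
    return np.clip(p_hat - half_width, 0.0, 1.0), np.clip(p_hat + half_width, 0.0, 1.0)

lo, hi = transition_intervals(np.array([[45.0, 5.0], [10.0, 40.0]]))
```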
3.2. POMDP supervisory control framework

With POMDP modeling, uncertainties from the human, the robots and the environment can be captured. Meanwhile, the modeling uncertainties from the model learning process can be described using confidence intervals. Based on the POMDP modeling, a POMDP supervisory control framework is applied based on our previous work [17]. Meanwhile, we extend the POMDP supervisory control framework to be adaptive to POMDP modeling uncertainties by using probability intervals to represent the transition/observation models and applying the probability bounds in the supervisor synthesis algorithm. In this process, PCTL is used to describe the high-level tasks in HRC. This allows reasoning over temporal logics with probabilistic thresholds on the formulas (such as: with a probability larger than 0.9, the collaboration task should be finished within 5 time steps and the human should be safe in the process). PCTL specifications with finite horizons are considered in this paper because many HRC applications require task completion within finite time. Meanwhile, reasoning over finite horizons can bound the search space with finite memory in POMDP model checking [45], which makes the decision making problem decidable. With finite horizon PCTL as the system specification, a special type of DFA, za-DFA, is proposed in [17] as the supervisor. For convenience, the finite horizon PCTL specifications considered in this paper give upper bounds on the probability of satisfying path formulas, as the lower bound cases can be converted to upper bound cases [46].

3.2.1. za-DFA as the supervisor and L* learning based supervisor synthesis

We want to find a supervisor to regulate the closed-loop behavior of the POMDP to satisfy finite horizon PCTL specifications. Since the observable history of a POMDP execution is a sequence of observations (z) and actions (a), we propose za-DFA as the supervisor for POMDP, where a history of a POMDP execution can be represented as a path in the za-DFA.
Definition 4. [47] A za-DFA 𝒜 = {Q, q̄, Σ, δ, Qm} is a supervisor for POMDP 𝒫 = {S, s̄, A, Z, T, O}, where

• Q is a finite set of states;
• q̄ ∈ Q is the initial state;
• Σ = {σ = ⟨z, a⟩ | z ∈ Z, a ∈ A} is a finite alphabet;
• δ: Q × Σ → Q is a transition function;
• Qm ⊆ Q is a set of accepting states.
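A za-DFA is just a deterministic automaton over observation-action symbols, so a lightweight encoding such as the following can replay an observed history and report whether the supervisor still accepts it; the class and field names are illustrative, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class ZaDFA:
    """Deterministic automaton over (observation, action) symbols."""
    initial: str
    transitions: dict        # (state, (z, a)) -> next state
    accepting: set           # marked states

    def run(self, history):
        """Replay a history [(z0, a0), (z1, a1), ...]; return (accepted, last state)."""
        state = self.initial
        for symbol in history:
            key = (state, symbol)
            if key not in self.transitions:      # undefined move: the history is disabled
                return False, state
            state = self.transitions[key]
        return state in self.accepting, state

# Tiny supervisor: allow action 'a1' after observation 'z1', and 'a2' after 'z2'.
sup = ZaDFA(
    initial="q0",
    transitions={("q0", ("z1", "a1")): "q0", ("q0", ("z2", "a2")): "q0"},
    accepting={"q0"},
)
print(sup.run([("z1", "a1"), ("z2", "a2")]))   # (True, 'q0')
print(sup.run([("z2", "a1")]))                 # (False, 'q0')
```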
Since a DFA is an equivalent representation of a regular language, learning a za-DFA as the supervisor can be solved through learning a regular language. As a classic automata learning algorithm for an unknown regular language, L* learning [48] is used in the supervisor synthesis process for POMDP. In L* learning, membership queries and conjectures are the two types of questions generated by the learning process to build the knowledge base for the supervisor. By answering these questions, the learning process actively builds and refines the guess of a suitable supervisor with a Questions & Answers mechanism. The flowchart of the L* learning based supervisor synthesis algorithm is shown in Fig. 1.

Fig. 1. The flowchart of the L* learning based POMDP supervisor synthesis [17].

Consider a PCTL specification that gives an upper bound on the satisfaction probability (lower bound cases can be converted to upper bound cases [46]). In the preprocessing stage, the maximum and minimum probabilities of satisfying the given PCTL specification are calculated. Based on the comparison with the required probabilistic threshold, the algorithm either terminates for trivial cases or goes to the next stage for L* learning. By using membership queries and
conjectures, L* learning can guarantee the return of a permissive and non-blocking supervisor. To answer a membership query, the probability for an observation-action sequence to satisfy a PCTL path formula is calculated to decide whether or not this single observation-action sequence violates the given specification. This can be solved in linear time. After collecting enough information with membership queries, a closed and consistent knowledge table can be generated by L* to make a guess of the target supervisor and pose a conjecture. To decide whether or not this conjecture can be used as a suitable supervisor, OracleP, OracleB and OracleS check different properties to find possible mistakes made by the conjecture. In this process, OracleP finds the observation-action sequences disabled by the current conjecture but accepted by the optimal control policy through graph searching algorithms on the transition model of the conjecture, which is in PTIME. These sequences are returned as positive counterexamples to make sure the optimal policy will be enabled by the supervisor. OracleB checks whether or not there exists a POMDP history that leads to a situation with no action being enabled. This is achieved with depth-first search, in PTIME, on the transition model of the regulated system. Based on the observation-action sequences leading to such a situation, the knowledge table can be adjusted in L* learning. OracleS applies model checking to check whether or not the satisfaction of the given PCTL specification is guaranteed. If not, counterexample selection algorithms are applied to find evidence of the violation as a set of observation-action paths. Answering conjectures with OracleS is EXPTIME-complete if the POMDP is solved exactly [49], which motivates our research on the POMDP abstraction method discussed in Section 4.1. As these oracles provide counterexamples as feedback information to correct the behavior of the supervisor, we have the following theorem for the L* learning based supervisor synthesis.
Theorem 1. [17] Given a POMDP and a PCTL specification with a finite horizon, the POMDP supervisor synthesis algorithm is sound and complete. If a non-empty supervisor is returned, it is permissive, non-blocking and guarantees the satisfaction of the given specification.

3.2.2. Supervisor synthesis with modeling uncertainties

When the learned POMDP in HRC contains modeling uncertainties, the POMDP supervisor synthesis algorithm can still be applied. Consider a PCTL specification with a constraint on the upper bound of the satisfaction probability. In the preprocessing stage of the POMDP supervisor synthesis algorithm, the upper bounds of the transition and observation probabilities can be used to calculate pmin. If the returned pmin violates the probability threshold given by the task specification, more training data is needed to lower the sample variance, which leads to tighter confidence intervals [50]. After the preprocessing stage, to answer the membership queries, the lower bounds of each transition and observation probability are used to calculate the probabilities associated with the queried observation-action sequences. Using the lower bounds to answer membership queries reduces the possibility of disabling an observation-action sequence due to large modeling uncertainties. As OracleP and OracleB only check the existence of certain paths or the connectivity of the network built from the states and transitions of the regulated system, they are not influenced by the confidence intervals of the transition or observation probabilities in the POMDP model. For OracleS, the upper bounds of the confidence intervals for the transition and observation models are used for model checking and counterexample selection. Based on the upper bounds of the transition and observation probabilities, the model checking in OracleS gives a stricter verification.

Fig. 2. A POMDP 𝒫 with modeling uncertainties.

Table 1. The observation matrix of POMDP 𝒫.

O(s, z)   z1    z2
s0        0.5   0.5
s1        0.9   0.1
s2        0.1   0.9
s3        0.9   0.1
s4        0.1   0.9

Example 1. Consider the POMDP 𝒫 shown in Fig. 2 with the observation matrix shown in Table 1. This POMDP has modeling uncertainties: instead of single values for the transition probabilities, 𝒫 has transition intervals, with the length of each probability interval being 2Δ in its transition model. Here the state s3 is labeled as the failure state: L(s3) = {fail}. The PCTL specification is given as ϕ = P≤0.25[true U≤2 fail], which limits the maximum probability of reaching the failure state within two time steps to 0.25. For Δ = 0.02, using the upper bounds in the preprocessing stage, the minimum probability is pmin = 0.208. Then, by using the lower bounds to answer membership queries and the upper bounds to answer conjectures, the supervisor returned by the L* learning based POMDP supervisor synthesis algorithm is shown in Fig. 3. Based on model checking with the upper bounds of the transition probabilities, the regulated system behavior has a maximum probability of 0.2208 < 0.25 in the worst case of reaching the failure state s3.
Fig. 3. The supervisor returned in Example 1.

As a summary, by using the lower and upper bounds to answer membership queries and conjectures, respectively, the proposed L* learning based POMDP supervisor synthesis algorithm can still be applied to learn a supervisor for a POMDP with modeling uncertainties. The modeling uncertainties are represented by confidence intervals assigned to the probability estimates in the POMDP transition and observation models. In this sense, the L* learning based supervisor synthesis algorithm is adaptive to POMDP modeling uncertainties.
4. Semi-online adaptation for HRC

With the supervisory control framework for POMDP, a za-DFA supervisor can be learned for HRC based on L* learning to guarantee that the collaboration performance formally satisfies the design requirements. However, during the system execution stage, the human may behave differently from his/her model learned from the offline training data. This leads to concerns about the runtime adaptation of the supervisory control framework for HRC. As discussed in [51], the adaptation framework for verifiable software systems with online model changes contains online system monitoring, system model updating, and verification adaptation. Inspired by this framework, Fig. 4 shows the overview of the semi-online adaptation framework for HRC together with the supervisor design process. During the online execution, with actions being selected under the learned supervisor, the online monitoring process collects new observation data from the available sensors for human and robot. With these extra sample data, the model adaptation process can follow online model training algorithms for HMM to update the transition and observation models of the POMDP and generate new confidence intervals accordingly [52]. By re-calculating the confidence intervals of the updated POMDP model, the upper and lower bounds of the transition and observation probabilities may change. These changes may influence the satisfaction of the task specification under the current supervisor. Therefore, the regulated behavior of HRC needs to be re-verified, which poses the requirement of verification adaptation to verify the system performance. To verify the performance of the current supervisor, we first introduce an abstraction method for POMDP developed in our previous work [25]. This abstraction method can be used to reduce the state space size of the POMDP involved in the model checking process. Meanwhile, it provides the foundations for the verification adaptation techniques developed in this section.
Fig. 4. The overview of the semi-online adaptation in the supervisory control framework for HRC.
4.1. POMDP abstraction method

In the L* learning based supervisor synthesis, OracleS applies model checking on POMDP. Due to the partial observability of POMDP, its model checking problem is EXPTIME-complete [49]. Therefore, in the POMDP supervisor synthesis framework, the model checking on POMDP can introduce high computational complexity. This is the main limitation of the POMDP supervisory control framework when the system is given with a large state space and planning horizon. To lower the model checking complexity, in [25] we explore an abstraction method for POMDP that reduces the potential state space size involved in the model checking process and alleviates the curse of dimensionality from the belief state space [53]. The POMDP abstraction method proposed in [25] aims to find an abstract system with a smaller state space that preserves the finite horizon PCTL satisfaction relation. It means that, if the abstract system satisfies a given PCTL specification, then the POMDP will also satisfy it. Furthermore, if the POMDP cannot satisfy a given PCTL specification, the abstract system can be used to find counterexamples as evidence of the violation for the POMDP. To introduce the abstraction of POMDP, the z-labeled 0/1 weighted automaton (0/1-WAz) is defined as an extension of the 0/1 weighted automaton (0/1-WA) with observation labels.
Definition 5. A 0/1-WAz 𝒲z is a tuple {S, s̄, A, T, Z, L, Lz} with

• S: a finite set of states;
• s̄: the initial state;
• A: a finite set of actions;
• T: S × A × S → [0, 1]: a transition function;
• Z: a finite set of observation labels;
• L: S → 2^AP: an atomic proposition labeling function;
• Lz: S → Z: a z-labeling function.

For a POMDP, its corresponding 0/1-WAz can be defined as follows.

Definition 6. A 0/1-WAz 𝒲z = {Xz, x̄z, A, Tz, Zz, Lz} is the corresponding 0/1-WAz for a POMDP 𝒫 = {S, s̄, A, Z, T, O}, where

• Xz = {x | x = [s, z], s ∈ S, z ∈ Z} ∪ {s̄}: a finite set of states;
• x̄z = s̄: the initial state;
• A: a finite set of actions;
• Tz(s̄, a, [s′, z′]) = T(s̄, a, s′)·O(s′, z′);
• Tz([s, z], a, [s′, z′]) = T(s, a, s′)·O(s′, z′);
• Zz = Z ∪ {Init};
• Lz(s̄) = Init and Lz([s, z]) = z, ∀s ∈ S, z ∈ Z.
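Under the construction of Definition 6, the weights of the corresponding 0/1-WAz are simply products of transition and observation probabilities, as in this small sketch (the dictionary-based encoding and names are assumptions):

```python
def corresponding_wa(T, O, init):
    """Build the weights of the 0/1-WAz corresponding to a POMDP.

    T[(s, a, s_next)] and O[(s, z)] are probability dictionaries; states of the
    weighted automaton are the initial state plus pairs (s, z).
    """
    weights = {}
    for (s, a, s_next), t_prob in T.items():
        for (s_obs, z), o_prob in O.items():
            if s_obs != s_next or t_prob * o_prob == 0.0:
                continue
            # Transitions out of the initial POMDP state start from `init` itself.
            if s == init:
                weights[(init, a, (s_next, z))] = t_prob * o_prob
            for (s_prev, z_prev), o_prev in O.items():
                if s_prev == s and o_prev > 0.0:
                    weights[((s, z_prev), a, (s_next, z))] = t_prob * o_prob
    return weights

# Toy POMDP: two states, one action, two observations.
T = {("s0", "a", "s1"): 1.0, ("s1", "a", "s0"): 1.0}
O = {("s0", "z0"): 0.9, ("s0", "z1"): 0.1, ("s1", "z1"): 1.0}
print(corresponding_wa(T, O, init="s0"))
```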
Based on the corresponding 0/1-WAz, the abstract system of the POMDP is then defined as another 0/1-WAz that holds a safe simulation relation with the POMDP's corresponding 0/1-WAz. The safe simulation relation is a binary relation between the state spaces of two 0/1-WAz. It is defined recursively by first defining the simulation relation between two probability distributions. Consider two 0/1-WAz 𝒲1z = {S1, s̄1, A, T1, Z, L1, L1z} and 𝒲2z = {S2, s̄2, A, T2, Z, L2, L2z}. With a ∈ A, s1 ∈ S1 and s2 ∈ S2, let s1 →a μ1 and s2 →a μ2 with μi(s′) = Ti(si, a, s′), ∀s′ ∈ Si, i ∈ {1, 2}. Define Supp(μ) ≔ {s | μ(s) > 0}. Let R ⊆ S1 × S2 be a binary relation between the state spaces of 𝒲1z and 𝒲2z.
Definition 7. A safe simulation relation R holds between two probability distributions, written μ1 ⊑R μ2, if and only if there exists a weight function w: S1 × S2 → [0, 1] with

1. μ1(s1) = Σ_{s2∈S2} w(s1, s2), ∀s1 ∈ S1,
2. μ2(s2) ≥ Σ_{s1∈S1} w(s1, s2), ∀s2 ∈ S2,
3. w(s1, s2) = 0 if L1z(s1) ≠ L2z(s2) or L1(s1) ≠ L2(s2),
4. w(s1, s2) > 0 only if s1Rs2, ∀s1 ∈ S1 and s2 ∈ S2.
Remark. Similar to the strong simulation relation in MDPs, the safe simulation relation tries to split and pair probability masses between two probability distributions, but an extra constraint requires that only states with the same observation label can be safely simulated.
Definition 8. A safe simulation relation R between two 0/1-WAz 𝒲1z and 𝒲2z is defined recursively: for every s1Rs2 and s1 →a μ1, there exists a μ2 with s2 →a μ2 and μ1 ⊑R μ2. For s1 ∈ S1 and s2 ∈ S2, s2 safely simulates s1, denoted s1 ⪯ s2, if and only if there exists a safe simulation relation T such that s1Ts2. 𝒲2z safely simulates 𝒲1z, also denoted 𝒲1z ⪯ 𝒲2z, if and only if s̄1 ⪯ s̄2.

Theorem 2. [25] Consider a finite horizon PCTL formula ϕ. If a 0/1-WAz 𝒲2z can safely simulate the corresponding 0/1-WAz 𝒲1z of a POMDP 𝒫, denoted 𝒲1z ⪯ 𝒲2z, then 𝒫 ⊨obs ϕ if 𝒲2z ⊨ ϕ.
By Theorem 2, the satisfaction relation of finite horizon PCTL specifications is preserved by the safe simulation relation between the POMDP and the abstract 0/1-WAz.

Remark. For a POMDP given with modeling uncertainties, we can use probability intervals to represent the transition or observation model. Then, with the upper bounds representing these intervals, the corresponding 0/1-WAz can be generated accordingly for the POMDP. This is possible because the total probability masses in a 0/1-WAz are not restricted to be smaller than or equal to 1. As a result, Theorem 2 can be applied directly to POMDPs with modeling uncertainties.

Fig. 5. The flowchart for the POMDP abstraction method [25].

To find an abstract system that holds the safe simulation relation and can be used to prove or disprove the satisfaction relation on the POMDP, a CEGAR based framework is given in [25]. The flowchart in Fig. 5 shows the POMDP abstraction finding process based on CEGAR. Given a POMDP and a PCTL specification with a finite horizon as the input, the abstract system is initialized to the coarsest guess, where all possible states of the POMDP are abstracted together to form a state space partition. The abstract system, as a 0/1-WAz, can be generated from this partition under quotient construction rules. In this quotient construction process, the concrete states grouped together are represented by a single abstract state, and the weight between two abstract states is the maximum transition probability from any concrete state in one abstract state to the other abstract state. The abstraction refinement process then uses counterexample finding on the abstract system to check whether or not the abstract system can satisfy the given PCTL specification. If a counterexample is found, a spurious counterexample checking process verifies whether or not the projection of this counterexample onto the POMDP is also a counterexample. If the projection of this counterexample cannot witness the violation of the specification on the POMDP, the counterexample is spurious and was generated from a too coarse abstract system. The refinement process then uses this counterexample to improve the abstraction by finding the over-abstracted state and refining the current state space partition. Otherwise, the counterexample is not spurious, and the evidence of the specification violation on the POMDP has been found. The abstraction refinement process is carried out iteratively until a proper abstract system is returned for POMDP model checking or a real counterexample is found.
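The quotient construction step described above only needs the maximum weight between blocks of the partition; a compact sketch of that rule is given below (the data layout is an assumption):

```python
def quotient_weights(weights, block_of):
    """Quotient a 0/1-WAz under a state partition.

    weights[(x, a, y)] is the concrete weight and block_of[x] the partition block of x.
    The abstract weight between two blocks is the maximum concrete weight between them.
    """
    abstract = {}
    for (x, a, y), w in weights.items():
        key = (block_of[x], a, block_of[y])
        abstract[key] = max(abstract.get(key, 0.0), w)
    return abstract

# Toy usage: two concrete states merged into one abstract block 'B0'.
w = {("x0", "a", "x1"): 0.3, ("x1", "a", "x1"): 0.7}
print(quotient_weights(w, {"x0": "B0", "x1": "B0"}))   # {('B0', 'a', 'B0'): 0.7}
```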
Fig. 6. A lane changing scenario (image source: http://driving.ca).

Fig. 7. The POMDP model for the mental status of the human driver (image sources: www.shutterstock.com, mynewyorkcitylawyer.com and www.dreamstime.com).

Table 2. The observation matrix of POMDP 𝒫.

O(s, z)     zS     zN     zA
sH          1
sN                 1
sU                        1
si: even    0.95   0.05
si: odd     0.05   0.95
Example 2. Consider the lane changing scenario shown in Fig. 6, based on the example discussed in [25]. An autonomous car A tries to change to the lane to its left, where a human-driven car B is. The lane changing behavior may lead to an uncomfortable status of the human driver. To model the mental status of the human driver, a POMDP model 𝒫 is shown in Fig. 7 and Table 2. The state space S consists of n + 3 states (n is an even number and n ≥ 2) with S = {sU, sH, sN} ∪ {s0, s1, …, sn−1}. Here sU stands for the unhappy human mental status, sH stands for the happy human mental status and sN stands for the normal status. The action set is A = {a, b}, where a stands for car A's attempt to change the lane, and b stands for car A driving in its own lane. The lane changing attempt from car A leads to mental status changes of the human driver in B with the transition probabilities shown in Fig. 7. In particular, we have

T(si, a, si+1) = 0.5, ∀i ∈ [0, n−2],
T(sn−1, a, sn−1) = 0.5,
T(si, a, sU) = 0.25, ∀i ∈ [0, n−1],
T(si, a, sH) = 0.25, for i even and i ∈ [0, n−1],
T(si, a, sN) = 0.25, for i odd and i ∈ [0, n−1].

On car A, the equipped sensor can detect the speed patterns of B, which gives the observation set Z = {zA, zS, zN}. Here zA stands for acceleration, zS stands for slowing down and zN stands for normal. The observation function is shown in Table 2. Since we want to avoid an unhappy status of the human driver, the state sU is labeled with fail: L(sU) = {fail}. Given a bounded until specification ϕ = P≤0.45[true U≤n fail], the model checking on 𝒫 checks whether or not the maximum probability of reaching the fail state within n steps is smaller than or equal to 0.45. Following the CEGAR framework, the abstract system shown in Fig. 8 is found after three iterations, as reported in [25]. The model checking on this abstract system gives a real counterexample PathCE as a set of four paths {t0 → t3, t0 → t1 → t3, t0 → t1 → t2 → t3, t0 → t1 → t4 → t3} under the adversary shown in Fig. 9. The realizable probability of this counterexample is larger than the required 0.45. This implies that 𝒫 ⊭obs ϕ, with the underlying observation-action sequences of PathCE witnessing the violation.

Fig. 8. An abstract system for POMDP 𝒫.

Fig. 9. An adversary that witnesses 𝒫 ⊭obs ϕ.

4.2. Verification adaptation

During the offline learning process, CEGAR can be used to iteratively refine an abstract system, and the preservation of the system specification is guaranteed by the safe simulation relation. Therefore, the verification of the regulated system performance is achieved by model checking on the abstract system. During the online execution, the POMDP model may change. This also changes the transition probabilities in the corresponding 0/1-WAz of the updated POMDP model. However, if the safe simulation relation between the corresponding 0/1-WAz and the abstract system can be verified to still hold, then the satisfaction relation of the regulated behavior still holds under the current supervisor. In this case, the adjustment of the supervisor for the new POMDP model can be avoided.

4.2.1. Verification for the safe simulation relation

To verify the safe simulation relation between the corresponding 0/1-WAz of the updated POMDP and the abstract system, an algorithm to check the safe simulation relation is discussed following approaches similar to strong simulation checking [54].

Consider two 0/1-WAz 𝒲1z = {S1, s̄1, A, T1, Z, L1, L1z} and 𝒲2z = {S2, s̄2, A, T2, Z, L2, L2z}. To check the safe simulation between them, a binary relation R ⊆ S1 × S2 with s̄1Rs̄2 is needed as the evidence of the simulation relation. Because the safe simulation relation is defined recursively starting from the simulation relation between two probability distributions, we first give the algorithm to check whether or not μ1 ⊑R μ2, where μi is a probability distribution over Si, i = 1, 2. Following a similar idea to strong simulation relation checking [54], a flow network for (μ1, μ2) can be constructed with respect to R. Here the network 𝒩(μ1, μ2, R) ≔ (V, E, c), where V is a finite set of vertices, E ⊆ V × V is a set of edges, and c: E → ℝ>0 ∪ {∞} is the capacity function. The set of vertices is defined as

V = {↗, ↘} ∪ Supp(μ1) ∪ Supp(μ2),

where ↗ and ↘ stand for the source and sink nodes, respectively. The edge set of 𝒩(μ1, μ2, R) is defined as

E = {(s, t) | (s, t) ∈ R} ∪ {(↗, s)} ∪ {(t, ↘)},

where s ∈ Supp(μ1) and t ∈ Supp(μ2). The capacity function is defined as

• c(↗, s) = μ1(s), ∀s ∈ Supp(μ1);
• c(t, ↘) = μ2(t), ∀t ∈ Supp(μ2);
• c(s, t) = ∞, ∀s ∈ Supp(μ1) and t ∈ Supp(μ2).

Remark. Compared to the flow network constructed for strong simulation checking in [54], the network constructed here does not have an additional dummy state ⊥ to deal with sub-stochastic distributions. This is because the summations of the state transition probabilities under one action are always larger than or equal to 1 for the 0/1-WAz considered in this paper.

On this flow network, a flow function f: V × V → ℝ can be defined with three constraints:

• capacity constraint: f(v, w) ≤ c(v, w), ∀(v, w) ∈ V × V;
• antisymmetry constraint: f(v, w) = −f(w, v), ∀(v, w) ∈ V × V;
• conservation constraint: Σ_{u:(u,v)∈E} f(u, v) = Σ_{u:(v,u)∈E} f(v, u), ∀v ∈ V \ {↗, ↘}.

Proposition 1. Let R ⊆ S1 × S2 be a binary relation between S1 and S2. Let μ1, μ2 be probability distributions over S1 and S2, respectively. Then R is a safe simulation relation for μ1 and μ2 if and only if the maximum flow of the network 𝒩(μ1, μ2, R) is Σs μ1(s).
Proof. Suppose the maximum flow of the network 𝒩(μ1, μ2, R) is Σs μ1(s). Since Σ_{s:(↗,s)∈E} c(↗, s) = Σs μ1(s) and f(↗, s) ≤ c(↗, s), achieving this flow value requires f(↗, s) = c(↗, s) = μ1(s) for every s with (↗, s) ∈ E. By the conservation constraint at s, μ1(s) = f(↗, s) = Σ_{t:(s,t)∈E} f(s, t) = Σ_{t:(s,t)∈R} f(s, t). By the conservation constraint at t, Σ_{s:(s,t)∈R} f(s, t) = f(t, ↘) ≤ c(t, ↘) = μ2(t). Then, following the definition of the safe simulation relation, f(s, t) is a weight function with respect to R, and R is a safe simulation relation for μ1 and μ2. Conversely, if R is a safe simulation relation for μ1 and μ2, there must exist a weight function w. Assign f(s, t) = w(s, t) for every (s, t) ∈ R, f(↗, s) = Σ_{t:(s,t)∈R} w(s, t) for every s ∈ Supp(μ1), and f(t, ↘) = Σ_{s:(s,t)∈R} w(s, t) for every t ∈ Supp(μ2). Since μ1(s) = Σ_{t:(s,t)∈R} w(s, t) = c(↗, s) and μ2(t) ≥ Σ_{s:(s,t)∈R} w(s, t) = f(t, ↘), all three constraints for the flow function hold, which means a valid flow has been constructed. Its value is Σs f(↗, s) = Σs μ1(s), which saturates the capacity out of the source and is therefore the maximum flow of 𝒩(μ1, μ2, R).

With Proposition 1, the equivalence between the safe simulation relation checking for probability distributions and the maximum flow problem is shown. There exist various algorithms in PTIME to solve the maximum flow problem [54]. To check the safe simulation relation between two systems, similar to the algorithm for strong simulation relation checking proposed in [54], the checking process starts from the coarsest guess of R with R = {(s1, s2) | L(s1) = L(s2), Lz(s1) = Lz(s2)}. Then, in each iteration i, Ri+1 is generated from Ri by removing (s1, s2) from Ri if s2 cannot safely simulate s1 under the current Ri. This process continues until convergence is achieved with Ri+1 = Ri. The final Ri is then a safe simulation relation. If (s̄1, s̄2) ∈ Ri, it can be concluded that 𝒲1z ⪯ 𝒲2z. This procedure is summarized in Algorithm 1. In the worst case, only one pair of states is removed from the current Ri, which makes the number of iterations bounded by O(n²m²), where n and m are the numbers of states in 𝒲1z and 𝒲2z. Compared to the strong simulation relation, the safe simulation gives different requirements for the weight functions. However, both strong and safe simulation checking problems are solved by conversion into the maximum flow problem. Therefore, the different checking algorithms for strong simulation studied in [54] are directly applicable to safe simulation checking.

Algorithm 1. Safe simulation relation checking.

  i ← 0
  R1 ← {(s1, s2) | L(s1) = L(s2), Lz(s1) = Lz(s2)}
  do
    i ← i + 1
    Ri+1 ← ∅
    for (s1, s2) ∈ Ri do
      if s2 can safely simulate s1 under Ri (i.e., μ1 ⊑Ri μ2 for every action) then
        Ri+1 ← Ri+1 ∪ {(s1, s2)}
      end
    end
  while Ri+1 ≠ Ri
  return Ri

Example 3. For the POMDP system described in Example 2, the observation error for sodd is 0.05, with O(sodd, zeven) = 0.05. If the upper bound of this observation error is increased to 0.2, it changes the transition probabilities between the states [sodd, z*] and [seven, z*] in the corresponding 0/1-WAz of the new POMDP 𝒫new. For the abstract system of POMDP 𝒫 shown in Fig. 8, the safe simulation between 𝒫new and the abstract system needs to be verified. For POMDP 𝒫, [sodd, zodd]Rt1 and [sodd, zeven]Rt2. For the new POMDP 𝒫new, it can be verified that the successor distributions of [sodd, zodd] and t1 still hold the safe simulation relation, and similarly for [sodd, zeven] and t2. This is because the transition probabilities from t1 to t2 are selected as the maximum transition probabilities between their concrete states [si, zodd] and [si, zeven], which are 0.475 under action a and 0.95 under action b. If the observation error is changed and influences the probabilities of the transitions to [sodd, zodd] and [sodd, zeven], the new transition probabilities (0.32 and 0.2) generated from 𝒫new are still less than the maximum ones chosen for the abstract system. As a result, the safe simulation between 𝒫new and the abstract system still holds, following Algorithm 1.
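Each membership test inside Algorithm 1 reduces, by Proposition 1, to one maximum-flow computation on the network 𝒩(μ1, μ2, R). The sketch below illustrates that distribution-level check using networkx; the tolerance and the data layout are assumptions, and the code is a simplified illustration rather than the paper's implementation.

```python
import networkx as nx

def distribution_safely_simulated(mu1, mu2, R, tol=1e-9):
    """Check mu1 ⊑_R mu2 by testing whether the max flow of N(mu1, mu2, R) is sum(mu1).

    mu1, mu2 map states to probabilities; R is a set of admissible pairs (s, t).
    States of the two systems are assumed to have disjoint names.
    """
    G = nx.DiGraph()
    for s, p in mu1.items():
        if p > 0:
            G.add_edge("SRC", s, capacity=p)
    for t, q in mu2.items():
        if q > 0:
            G.add_edge(t, "SINK", capacity=q)
    for (s, t) in R:
        if mu1.get(s, 0) > 0 and mu2.get(t, 0) > 0:
            G.add_edge(s, t)          # no capacity attribute means an unbounded edge
    if "SRC" not in G or "SINK" not in G:
        return sum(mu1.values()) <= tol
    flow_value, _ = nx.maximum_flow(G, "SRC", "SINK")
    return abs(flow_value - sum(mu1.values())) <= tol

# Toy usage: mass 0.6 on u1 and 0.4 on u2 must be coverable by v1 (0.7) and v2 (0.5).
print(distribution_safely_simulated({"u1": 0.6, "u2": 0.4},
                                    {"v1": 0.7, "v2": 0.5},
                                    {("u1", "v1"), ("u2", "v2")}))
```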
4.3. Adjustment of the supervisor

With the updated system model, the safe simulation relation between the updated POMDP model and the current abstract system can be verified. If the safe simulation relation fails to hold, the current supervisor needs to be adjusted to guarantee the satisfaction of the task specification. Since the supervisor is returned as a proper supervisor from the L* learning during the offline design process, all three oracles for conjecture checking return true, and the supervisor is non-blocking. While changes in the transition/observation probabilities of the POMDP model do not influence the non-blocking property, the model checking result from OracleS may turn false, since the probability of violating the task specification may be enlarged. In this case, new counterexamples are generated by the model checking and counterexample selection for the updated POMDP model. With the counterexample returned from OracleS, a new iteration of L* learning is started, and the observation table in L* is updated to generate new membership queries. From the offline learning, the state space partition of the abstract system can be reused as the initial partition to find a new abstract system for the updated POMDP model. Compared to initialization with the coarsest state space partition, starting from the offline partition takes fewer iterations to return a new converged state space partition following CEGAR. In this way, the supervisor can be updated to adapt to model changes in HRC.
Example 4. Consider the POMDP model in Example 1. For the offline learning, the lengths of the transition intervals are given by Δ = 0.02. With the learned supervisor shown in Fig. 3, the maximum probability of reaching the failure state in the worst case is 0.2208. Assume that, during the online execution or with more training data, the underlying POMDP model changes with new transition interval lengths given by Δ = 0.045. Since the upper bound of every transition probability has been enlarged, the safe simulation relation cannot hold. Therefore, an adjustment of the supervisor is needed. By performing model checking on the regulated system, OracleS finds a maximum probability of 0.27405 > 0.25 of reaching the failure state. Through counterexample selection, OracleS returns ⟨z*, a1⟩⟨z2, a1⟩ and ⟨z*, a1⟩⟨z1, a2⟩ as counterexamples in four iterations. The modified supervisor following the L* learning process is shown in Fig. 10. With a maximum probability of 0.24525 < 0.25 of reaching the failure state, the regulated system behavior satisfies the required PCTL specification again.

Fig. 10. The modified supervisor.

4.4. Discussion

For the semi-online adaptation of the proposed supervisory control framework in HRC, the safe simulation relation is used to verify whether or not the current supervisor can still be used to regulate the updated system. The safe simulation relation checking can be carried out in PTIME by solving a sequence of maximum flow problems. There also exist improved algorithms for strong simulation checking for MDPs, as shown in [54], which can be applied to safe simulation checking. While the existence of a safe simulation relation between the updated system and the abstract system implies no need to adjust the supervisor, if the change in the system model cannot be compensated by the safe simulation relation, new iterations of L* learning are needed. In this case, the main computational complexity comes from OracleS in the L* learning, since the model checking on POMDP is EXPTIME-complete. However, if the updated system model introduces a single observation-action sequence with enough probability mass to violate the specification, the model checking and counterexample selection only need to return a single observation-action sequence with the maximum probability of violation. This can be solved as a shortest path problem in PTIME [46] without finding the most dangerous observation-based adversary through model checking. If, instead of a single sequence, the model checking of OracleS needs to generate a set of paths under an observation-based adversary as the counterexample to show enough accumulated probability of violation, the model checking is EXPTIME-complete. The online adjustment of the supervisor can then take a longer time, which makes our adjustment semi-online. This is the main limitation of the online adaptation process proposed for the POMDP supervisory control framework. As the proposed POMDP supervisory control framework does not require a particular model checking method, newly developed computational model checking tools, such as the Storm model checker [32], can be applied to improve both the offline learning and the online adaptation process. When a modification of the supervisor is needed, the system still requires a control policy. In this case, existing online POMDP solvers, such as POMCP [34], can be used to obtain an optimal control policy before a new supervisor is generated. As future work, one possible approach to deal with online adaptation is to generate a library of supervisors for different abstract systems offline. Then, based on the safe simulation relation checking on these abstract systems, a supervisor can be selected for the online adaptation if its corresponding abstract system safely simulates the updated system. In this way, the adjustment of the supervisor through L* learning can be avoided in certain cases.

Fig. 11. A driver behavior and situation aware assistance scenario in HRC (image sources: http://www.dailymail.co.uk and http://www.toyotaglobal.com).
5. Case study

Consider a driver behavior and situation aware assistance system modified from [15] for HRC applications with advanced driver assistance systems, shown in Fig. 11. The system is designed to keep a car in a single lane while the human driver may be drowsy. With POMDP modeling, the state space S is the cross product of the human internal state space Sh = {Awake, Sleepy} and the vehicle's horizontal position space on the lane Sv = {−1, 0, +1, OFF}, where −1 is the left most position, 0 is the middle, +1 is the right most position, and OFF represents being off the lane. The observation set Z contains sensor detections of human actions Zh = {Eyes open, Eyes closed} and vehicle states Zv = {−1, 0, +1, OFF}. The action set A describes the actuation ability of the vehicle to turn the warning for the human driver on or off: A = {ON, OFF}.
Table 3
The transition probabilities for human internal states.

sh       a     sh'      Prob
Awake    ON    Awake    0.95
Awake    ON    Sleepy   0.05
Awake    OFF   Awake    0.85
Awake    OFF   Sleepy   0.15
Sleepy   ON    Awake    0.5
Sleepy   ON    Sleepy   0.5
Sleepy   OFF   Awake    0.1
Sleepy   OFF   Sleepy   0.9
Table 4
The probabilities to generate transitions for vehicle positions.

Eyes     sv     steer left   keep straight   steer right   do nothing
Open     −1     –            0.1             0.9           –
Open      0     –            1               –             –
Open     +1     0.9          0.1             –             –
Open     OFF    –            1               –             –
Closed    *     –            –               –             1

Fig. 13. The supervisor for the driver behavior and situation aware assistance system given the PCTL specification P≤0.10015[◊≤4 fail]. Here z* stands for any observation; a* stands for any action; z stands for any observation except (Open, −1).
As a summary, the state space S = Sh × Sv = {Awake, Sleepy} × {−1, 0, +1, OFF}, the action set A = {ON, OFF}, and the observation set Z = Zh × Zv = {Eyes open, Eyes closed} × {−1, 0, +1, OFF}. The size of this POMDP is the product of the sizes of the state space, the observation space and the action set, which is 8 × 8 × 2 = 128. The transition function T is determined by the transitions of the human internal states and the vehicle position states. Depending on whether the warning is ON or OFF, the transition probabilities of the human internal state are shown in Table 3. Transitions of the vehicle positions are triggered by human driving decisions, regardless of the warning being ON or OFF. When the human is awake, the eyes have a probability of 0.88 to be open, and the human then chooses to keep straight, steer left, steer right, or do nothing according to a probability distribution. When the human is sleepy, the eyes have a probability of 0.88 to be closed, and the human then does nothing. These probabilities are shown in Table 4. When the human steers left, steers right or keeps straight, the vehicle horizontal position has an offset of −1, +1, or 0, respectively. When the human does nothing, the vehicle has a probability of 0.4 to go straight and a probability of 0.3 each to drift left or right by a one-unit offset. Together with the transition probabilities for the human internal states, the transition function T is well defined. The observation function O is determined by the observation errors on the human internal states and the vehicle position states. For the human internal states, the sensor for human state detection has a sensing error of 0.2, in the sense that when the human is awake (sleepy), the probability of observing Eyes closed (Eyes open) is 0.2. For the vehicle position states, the observation distribution is shown in Table 5. Together with the observation error for the human internal states, the observation function O is well defined. The observation and transition models of the whole system are given in Table 6 and Fig. 14. Since this case study serves an illustrative purpose, the observation and transition models and the probabilities used here are modified from the assistance driving example discussed in [15]. To encode the system task that requires the vehicle to stay inside the lane, the states ⟨*, OFF⟩ are labeled with fail: L(⟨sh, sv⟩) = {fail} if sv = OFF. The control task is given as a PCTL specification P>0.9[□≤4 ¬fail], which requires a probability larger than 0.9 of keeping the vehicle inside the lane over a four-step time horizon. This specification is equivalent to the PCTL formula φ = P≤0.1[◊≤4 fail] = P≤0.1[true U≤4 fail], which requires a probability smaller than or equal to 0.1 of eventually going off the lane within four steps. Starting from the initial state s̄ = ⟨Awake, −1⟩, the offline design process aims to find a za-DFA as the supervisor to guarantee that the regulated system behavior satisfies φ. Given the sizes of the observation set and the action set, the alphabet size is |Σ| = |Z| × |A| = 16.
Table 5
The observation distribution for vehicle position states.

sv \ zv    −1     0      +1     OFF
−1         0.8    0.1    0      0.1
 0         0.1    0.8    0.1    0
+1         0      0.1    0.8    0.1
OFF        0      0      0      1
Table 6
The observation model for the driver behavior and situation aware assistance system.

                 Eyes Open                             Eyes Closed
sh      sv       zv=−1   zv=0    zv=+1   zv=OFF        zv=−1   zv=0    zv=+1   zv=OFF
Awake   −1       0.64    0.08    0       0.08          0.16    0.02    0       0.02
Awake    0       0.08    0.64    0.08    0             0.02    0.16    0.02    0
Awake   +1       0       0.08    0.64    0.08          0       0.02    0.16    0.02
Awake   OFF      0       0       0       0.8           0       0       0       0.2
Sleepy  −1       0.16    0.02    0       0.02          0.64    0.08    0       0.08
Sleepy   0       0.02    0.16    0.02    0             0.08    0.64    0.08    0
Sleepy  +1       0       0.02    0.16    0.02          0       0.08    0.64    0.08
Sleepy  OFF      0       0       0       0.2           0       0       0       0.8
Fig. 12. A supervisor for the driver behavior and situation aware assistance system. Here z* stands for any observation; a* stands for any action; z stands for any observation except (Open, −1).
Fig. 14. The transition model for the driver behavior and situation aware assistance system. The first and second values in the square brackets stand for transition probabilities under ON and OFF, respectively.
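To show how the pieces above fit together, the following sketch (assumed data structures and helper names, not the authors' code) assembles the transition function T from Tables 3 and 4 and the observation function O from Table 5 together with the 0.2 eye-sensor error. The way the eye-state probability 0.88 is combined with the steering distributions is our reading of the text and should be taken as an assumption.

```python
# Table 3: P(next human state | human state, warning action)
T_h = {('Awake', 'ON'):   {'Awake': 0.95, 'Sleepy': 0.05},
       ('Awake', 'OFF'):  {'Awake': 0.85, 'Sleepy': 0.15},
       ('Sleepy', 'ON'):  {'Awake': 0.5,  'Sleepy': 0.5},
       ('Sleepy', 'OFF'): {'Awake': 0.1,  'Sleepy': 0.9}}

# Probability of the actual eye state given the human internal state (from the text).
P_eyes = {'Awake':  {'Open': 0.88, 'Closed': 0.12},
          'Sleepy': {'Open': 0.12, 'Closed': 0.88}}

def drive_dist(eyes, s_v):
    """Table 4: distribution over steering decisions given eye state and position."""
    if eyes == 'Closed':
        return {'nothing': 1.0}
    return {-1: {'right': 0.9, 'straight': 0.1},
             0: {'straight': 1.0},
             1: {'left': 0.9, 'straight': 0.1},
         'OFF': {'straight': 1.0}}[s_v]

# Position offsets: steer left -1, keep straight 0, steer right +1,
# do nothing drifts (0.4 straight, 0.3 left, 0.3 right).
OFFSET = {'left': {-1: 1.0}, 'straight': {0: 1.0}, 'right': {1: 1.0},
          'nothing': {0: 0.4, -1: 0.3, 1: 0.3}}

def move(s_v, off):
    if s_v == 'OFF':
        return 'OFF'
    nxt = s_v + off
    return nxt if -1 <= nxt <= 1 else 'OFF'

def T(s, a):
    """P(s' | s, a) for the joint state s = (s_h, s_v) and warning action a."""
    s_h, s_v = s
    out = {}
    for sh2, p_h in T_h[(s_h, a)].items():
        for eyes, p_e in P_eyes[s_h].items():
            for act, p_d in drive_dist(eyes, s_v).items():
                for off, p_o in OFFSET[act].items():
                    sv2 = move(s_v, off)
                    out[(sh2, sv2)] = out.get((sh2, sv2), 0.0) + p_h * p_e * p_d * p_o
    return out

# Table 5 (vehicle position observations) plus the 0.2 eye-sensor error give O.
O_v = {-1: {-1: 0.8, 0: 0.1, 'OFF': 0.1},
        0: {-1: 0.1, 0: 0.8, 1: 0.1},
        1: {0: 0.1, 1: 0.8, 'OFF': 0.1},
    'OFF': {'OFF': 1.0}}
O_h = {'Awake': {'Open': 0.8, 'Closed': 0.2}, 'Sleepy': {'Open': 0.2, 'Closed': 0.8}}

def O(s):
    s_h, s_v = s
    return {(zh, zv): O_h[s_h][zh] * O_v[s_v][zv]
            for zh in O_h[s_h] for zv in O_v[s_v]}

print(T(('Awake', -1), 'OFF'))          # successor distribution of the initial state
print(sum(O(('Awake', -1)).values()))   # sums to 1 (up to floating point)
```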
In the preprocessing stage of the supervisor synthesis, the control policy that gives the minimum probability of satisfying true U≤4 fail is found; it keeps the warning module on at all times and gives a probability of 0.058. Since 0.058 < 0.1, the L* learning based synthesis algorithm generates membership queries and conjectures to reason about a permissive and non-blocking supervisor. After nine iterations, a proper supervisor is found, as shown in Fig. 12. This supervisor gives a maximum probability of 0.098 to satisfy true U≤4 fail, which is lower than the required probability 0.1. From this supervisor, it can be seen that, if the warning module is turned off at Step 1 and the observation of the system is z = (Open, −1) at Step 2, then the warning module must be turned on at Step 2 to keep the vehicle inside the lane with the required probability. For other histories of the system execution, the
supervisor provides different actions for the system to select from, which shows the permissiveness of the supervisor. If the required probability threshold is relaxed to 0.10015, the returned supervisor changes, as shown in Fig. 13. This supervisor gives a maximum probability of 0.10014 to satisfy true U≤4 fail. This time, the system is allowed to turn off the warning module at Step 2 if the module is turned off at Step 1 and z = (Open, −1) is observed at Step 2. However, if that happens, the warning module must be turned on at Step 3 when z = (Open, −1) is observed again. In this case study, the POMDP itself serves as its own abstraction since its state space is not large. The proposed supervisory control framework is thus shown to be able to find a proper supervisor for this HRC application.
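A run-time reading of the supervisor behavior described above can be sketched as follows; the condition encoded here is inferred from the textual description of Fig. 12 and only illustrates how a za-DFA supervisor filters the enabled warning actions, not the exact learned automaton.

```python
def enabled_actions(history, current_obs):
    """history: list of (observation, action) pairs already executed;
    current_obs: observation received at the current step.
    Returns the set of warning actions the supervisor allows next."""
    # If the warning was OFF at the previous step and the current observation is
    # (Open, -1), the supervisor forces the warning ON; otherwise both actions stay enabled.
    if history and history[-1][1] == 'OFF' and current_obs == ('Open', -1):
        return {'ON'}
    return {'ON', 'OFF'}

# Step 1: warning turned OFF after observing (Open, -1).
history = [(('Open', -1), 'OFF')]
# Step 2: (Open, -1) observed again -> only ON is allowed.
print(enabled_actions(history, ('Open', -1)))   # {'ON'}
print(enabled_actions(history, ('Closed', 0)))  # {'ON', 'OFF'}
```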
6. Conclusions
In this paper, a POMDP supervisory control framework is developed to achieve a high-level performance guarantee in HRC. By using counterexamples that witness violations of the design requirements, the supervisor synthesis process iteratively improves the learned supervisor until the collaboration performance can be verified through model checking. To deal with the POMDP modeling uncertainties, probability intervals are used to describe the transition and observation models, with the model checking in the supervisor synthesis process being solved using either the upper or the lower bounds of the intervals. To reduce the model checking complexity, an abstraction method for POMDPs is integrated to find an abstract system with a smaller state space. The CEGAR framework considers abstraction based on a simulation relation and iteratively refines the abstract system based on counterexamples. Based on the simulation relation checking, verification adaptation and online supervisor adjustment techniques are discussed for the semi-online adaptation purpose.

References

[1] T. Fong, C. Kunz, L.M. Hiatt, M. Bugajska, The human-robot interaction operating system, Proceedings of the 1st ACM SIGCHI/SIGART Conference on Human-Robot Interaction, ACM, 2006, pp. 41–48.
[2] C. Frese, A. Fetzner, C. Frey, Multi-sensor obstacle tracking for safe human-robot interaction, ISR/Robotik 2014; 41st International Symposium on Robotics, VDE, 2014, pp. 1–8.
[3] T. Kruse, A.K. Pandey, R. Alami, A. Kirsch, Human-aware robot navigation: a survey, Rob. Auton. Syst. 61 (12) (2013) 1726–1743.
[4] S. Ikemoto, H.B. Amor, T. Minato, B. Jung, H. Ishiguro, Physical human-robot interaction: mutual learning and adaptation, IEEE Robot. Autom. Mag. 19 (4) (2012) 24–35.
[5] S. Nikolaidis, R. Ramakrishnan, K. Gu, J. Shah, Efficient model learning from joint-action demonstrations for human-robot collaborative tasks, Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction, ACM, 2015, pp. 189–196.
[6] E.M. Jean-Baptiste, P. Rotshtein, M. Russell, POMDP based action planning and human error detection, IFIP International Conference on Artificial Intelligence Applications and Innovations, Springer, 2015, pp. 250–265.
[7] H. Kress-Gazit, G.E. Fainekos, G.J. Pappas, Temporal-logic-based reactive mission and motion planning, IEEE Trans. Robot. 25 (6) (2009) 1370–1381.
[8] U. européenne, Direction générale de la recherche, Factories of the Future: Multi-annual Roadmap for the Contractual PPP Under Horizon 2020, EDC collection, Publications Office of the European Union, 2013. [Online]. Available: http://books.google.com/books?id=ZC0wngEACAAJ
[9] A. Cherubini, R. Passama, A. Crosnier, A. Lasnier, P. Fraisse, Collaborative manufacturing with physical human–robot interaction, Robot. Comput. Integr. Manuf. 40 (2016) 1–13.
[10] J.C. McCall, M.M. Trivedi, Driver behavior and situation aware brake assistance for intelligent vehicles, Proc. IEEE 95 (2) (2007) 374–387.
[11] S.-H. Baeg, J.-H. Park, J. Koh, K.-W. Park, M.-H. Baeg, Building a smart home environment for service robots based on RFID and sensor networks, International Conference on Control, Automation and Systems (ICCAS 2007), IEEE, 2007, pp. 1078–1082.
[12] H. Salamin, A. Vinciarelli, Introduction to sequence analysis for human behavior understanding, Computer Analysis of Human Behavior, Springer, 2011, pp. 21–40.
[13] A. Pentland, A. Liu, Modeling and prediction of human behavior, Neural Comput. 11 (1) (1999) 229–242.
[14] L.J. Mariano, J.C. Poore, D.M. Krum, J.L. Schwartz, W.D. Coskren, E.M. Jones, Modeling strategic use of human computer interfaces with novel hidden Markov models, Front. Psychol. 6 (2015).
[15] C.-P. Lam, S.S. Sastry, A POMDP framework for human-in-the-loop system, 53rd IEEE Conference on Decision and Control (CDC), IEEE, 2014, pp. 6031–6036.
[16] F. Broz, I. Nourbakhsh, R. Simmons, Designing POMDP models of socially situated tasks, RO-MAN 2011, IEEE, 2011, pp. 39–46.
[17] X. Zhang, B. Wu, H. Lin, Supervisor synthesis of POMDP based on automata learning, arXiv preprint arXiv:1703.08262 (2017).
[18] J.J. Rutten, M. Kwiatkowska, G. Norman, D. Parker, Mathematical Techniques for Analyzing Concurrent and Probabilistic Systems, American Mathematical Society, 2004.
[19] J. Pineau, G. Gordon, S. Thrun, Anytime point-based approximations for large POMDPs, J. Artif. Intell. Res. (2006) 335–380.
[20] C. Amato, B. Bonet, S. Zilberstein, Finite-state controllers based on Mealy machines for centralized and decentralized POMDPs, AAAI, 2010.
[21] H. Kress-Gazit, Robot challenges: toward development of verification and synthesis techniques [from the guest editors], IEEE Robot. Autom. Mag. 18 (3) (2011) 22–23.
[22] R. Sharan, Formal methods for control synthesis in partially observed environments: application to autonomous robotic manipulation, California Institute of Technology, 2014, Ph.D. thesis.
[23] K. Chatterjee, M. Chmelik, J. Davies, A symbolic SAT-based algorithm for almost-sure reachability with small strategies in POMDPs, arXiv preprint arXiv:1511.08456 (2015).
[24] H. Itoh, K. Nakamura, Partially observable Markov decision processes with imprecise parameters, Artif. Intell. 171 (8–9) (2007) 453–490.
[25] X. Zhang, B. Wu, H. Lin, Counterexample-guided abstraction refinement for POMDPs, arXiv preprint arXiv:1701.06209 (2017).
[26] Z. Wang, K. Mülling, M.P. Deisenroth, H.B. Amor, D. Vogt, B. Schölkopf, J. Peters, Probabilistic movement modeling for intention inference in human–robot interaction, Int. J. Rob. Res. 32 (7) (2013) 841–858.
[27] P. Lasota, S. Nikolaidis, J. Shah, Developing an adaptive robotic assistant for close proximity human–robot collaboration in space, AIAA Infotech@Aerospace, 2013.
[28] X. Zhang, Y. Zhu, H. Lin, Performance guaranteed human-robot collaboration through correct-by-design, American Control Conference (ACC), American Automatic Control Council (AACC), 2016, pp. 6183–6188.
[29] K. Chatterjee, L. Doyen, H. Gimbert, T.A. Henzinger, Randomness for free, International Symposium on Mathematical Foundations of Computer Science, Springer, 2010, pp. 246–257.
[30] E.M. Clarke, E.A. Emerson, Design and Synthesis of Synchronization Skeletons Using Branching Time Temporal Logic, Springer, 1982.
[31] M. Kwiatkowska, G. Norman, D. Parker, PRISM 4.0: verification of probabilistic real-time systems, Computer Aided Verification, Springer, 2011, pp. 585–591.
[32] C. Dehnert, S. Junges, J.-P. Katoen, M. Volk, A Storm is coming: a modern probabilistic model checker, arXiv preprint arXiv:1702.04311 (2017).
[33] A. Somani, N. Ye, D. Hsu, W.S. Lee, DESPOT: online POMDP planning with regularization, Advances in Neural Information Processing Systems, 2013, pp. 1772–1780.
[34] D. Silver, J. Veness, Monte-Carlo planning in large POMDPs, Advances in Neural Information Processing Systems, 2010, pp. 2164–2172.
[35] G. Shani, Learning and solving partially observable Markov decision processes, Ben-Gurion University of the Negev, 2007, Ph.D. thesis.
[36] T. Taha, J.V. Miró, G. Dissanayake, POMDP-based long-term user intention prediction for wheelchair navigation, IEEE International Conference on Robotics and Automation (ICRA 2008), IEEE, 2008, pp. 3920–3925.
[37] J.A. Bilmes, et al., A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models, Int. Comput. Sci. Inst. 4 (510) (1998) 126.
[38] S. Koenig, R.G. Simmons, Unsupervised learning of probabilistic models for robot navigation, Proceedings of the 1996 IEEE International Conference on Robotics and Automation, vol. 3, IEEE, 1996, pp. 2301–2308.
[39] P.E. Lehner, K.B. Laskey, D. Dubois, An introduction to issues in higher order uncertainty, IEEE Trans. Syst. Man Cybern. 26 (3) (1996) 289–293.
[40] A. Nilim, L. El Ghaoui, Robust control of Markov decision processes with uncertain transition matrices, Oper. Res. 53 (5) (2005) 780–798.
[41] D. Sadigh, K. Driggs-Campbell, A. Puggelli, W. Li, V. Shia, R. Bajcsy, A.L. Sangiovanni-Vincentelli, S.S. Sastry, S.A. Seshia, Data-driven probabilistic modeling and verification of human driver behavior, AAAI Spring Symposium Technical Report, 2014.
[42] I. Visser, M.E. Raijmakers, P. Molenaar, Confidence intervals for hidden Markov model parameters, Br. J. Math. Stat. Psychol. 53 (2) (2000) 317–327.
[43] B. Efron, R.J. Tibshirani, An Introduction to the Bootstrap, CRC Press, 1994.
[44] Y. Ni, Z.-Q. Liu, Bounded-parameter partially observable Markov decision processes, ICAPS, 2008, pp. 240–247.
[45] A. Biere, A. Cimatti, E.M. Clarke, O. Strichman, Y. Zhu, Bounded model checking, Adv. Comput. 58 (2003) 117–148.
[46] T. Han, J.-P. Katoen, D. Berteun, Counterexample generation in probabilistic model checking, IEEE Trans. Softw. Eng. 35 (2) (2009) 241–257.
[47] X. Zhang, B. Wu, H. Lin, Learning based supervisor synthesis of POMDP for PCTL specifications, 54th IEEE Conference on Decision and Control (CDC), IEEE, 2015, pp. 7470–7475.
[48] D. Angluin, Learning regular sets from queries and counterexamples, Inf. Comput. 75 (2) (1987) 87–106.
[49] K. Chatterjee, M. Chmelík, M. Tracol, What is decidable about partially observable Markov decision processes with ω-regular objectives, J. Comput. Syst. Sci. 82 (5) (2016) 878–911.
[50] D.R. Cox, D.V. Hinkley, Theoretical Statistics, CRC Press, 1979.
[51] G. Tamura, N.M. Villegas, H.A. Müller, J.P. Sousa, B. Becker, G. Karsai, S. Mankovskii, M. Pezzè, W. Schäfer, L. Tahvildari, et al., Towards practical runtime verification and validation of self-adaptive software systems, Software Engineering for Self-Adaptive Systems II, Springer, 2013, pp. 108–132.
[52] O. Cappé, Online EM algorithm for hidden Markov models, J. Comput. Graph. Stat. 20 (3) (2011) 728–749.
[53] S.C.W. Ong, S.W. Png, D. Hsu, W.S. Lee, POMDPs for robotic tasks with mixed observability, Robotics: Science and Systems, 2009.
[54] L. Zhang, Decision algorithms for probabilistic simulations, Universität des Saarlandes, Saarbrücken, 2008, Ph.D. thesis.
Xiaobin Zhang received the B.E. degree in Mechanical Engineering from Shanghai Jiao Tong University (SJTU), China, in 2011 and the Ph.D. degree in Electrical Engineering from the University of Notre Dame. His primary area of research is on formal methods for probabilistic systems, decision making under uncertainties, and human-robot collaboration.
Hai Lin is currently an associate professor at the Department of Electrical Engineering, University of Notre Dame, where he received his Ph.D. in 2005. Before returning to his alma mater, he worked as an assistant professor at the National University of Singapore from 2006 to 2011. Dr. Lin's teaching and research interests are in the multidisciplinary study of problems at the intersections of control, communication, computation, machine learning and computational verification. His current research thrust is on cyber-physical systems, multi-robot cooperative tasking, and human-machine collaboration. He has served on several committees and editorial boards, including the IEEE Transactions on Automatic Control. He is currently serving as the Chair of the IEEE CSS Technical Committee on Discrete Event Systems. He served as the Program Chair for IEEE ICCA 2011 and IEEE CIS 2011, and as the Chair of the IEEE Systems, Man and Cybernetics Singapore Chapter for 2009 and 2010. He is a senior member of IEEE and a recipient of the 2013 NSF CAREER award.