Mathematical Biosciences 240 (2012) 12–19
Contents lists available at SciVerse ScienceDirect
Mathematical Biosciences journal homepage: www.elsevier.com/locate/mbs
Monitoring and prediction of an epidemic outbreak using syndromic observations Alex Skvortsov ⇑, Branko Ristic Defence Science and Technology Organisation, 506 Lorimer Street, Melbourne, VIC 3207, Australia
a r t i c l e
i n f o
Article history: Received 4 March 2011 Received in revised form 18 May 2012 Accepted 22 May 2012 Available online 13 June 2012 Keywords: Emerging epidemics Bio-surveillance Syndromic data Epidemiological inference Nonlinear filtering Particle filter
a b s t r a c t The paper presents a method for syndromic surveillance of an epidemic outbreak due to an emerging disease, formulated in the context of stochastic nonlinear filtering. The dynamics of the epidemic is modeled using a stochastic compartmental epidemiological model with inhomogeneous mixing. The syndromic (typically non-medical) observations of the number of infected people (e.g. visits to pharmacies, sale of certain products, absenteeism from work/study, etc.) are assumed available for monitoring and prediction of the epidemic. The state of the epidemic, including the number of infected people and the unknown parameters of the model, are estimated via a particle filter. The numerical results indicate that the proposed framework can provide useful early prediction of the epidemic peak if the uncertainty in prior knowledge of model parameters is not excessive. Crown Copyright Ó 2012 Published by Elsevier Inc. All rights reserved.
1. Introduction Epidemics can impose significant challenges on modern societies, not only by affecting the health of the general population, but also by causing negative trends in the economy (medical treatments, absenteeism from work, missed business opportunities, etc.). The ongoing epidemics of AIDS, tuberculosis and the recent outbreaks of SARS and H1N1 (swine flu) provide some revealing examples. In the absence of an effective cure against many diseases, the best approach to mitigate an epidemic outbreak (malicious or natural) resides in the development of capability for its early detection and for prediction of its further development [46,48,6,21,4]. Such a capability would allow making any countermeasures (quarantine, vaccination, medical treatment) much more effective and less costly [47,48]. Syndromic surveillance is referred to as systematic collection, analysis, and interpretation of public health data for the purpose of early (and often cost effective) detection of an epidemic outbreak, its continuous monitoring, and timely response by public health agencies [46,48]. The rationale for this method rests on a resalable assumption that a spread of an infectious disease is usually associated with the measurable changes in the social behavior. It is important to highlight that although the measurements (data streams) for syndromic surveillance often rely on the medical observations (the visits to medical practitioners, occurrence of particular symptoms, the number of diagnosed cases, etc.) they are not restricted to the pure medical data. Recent studies [8,36,39], have ⇑ Corresponding author. E-mail address:
[email protected] (A. Skvortsov).
demonstrated that ‘‘non-medical’’ sources of syndromic data streams, such as the absenteeism from work/school, the pharmaceutical sales, Internet queries, Twitter messages, and alike, can enable one to draw important conclusions regarding the epidemic state in the community. The ‘Google Flu’ project [14] (flu-related searches in Google), is a well publicised example of this approach. The algorithms for syndromic surveillance and its practical implementation have recently attracted significant attention by scientists and practitioners; there is a vast amount of literature devoted to this topic (for more comprehensive review see [46,48] and Refs. therein). In general, all algorithms applied in this area can be divided into two main groups, the data mining methods and the information fusion (also known as data assimilation) methods. Data mining is primarily concerned with the extraction of patterns from massive amounts of raw data without using dynamic models of the underlying process (i.e. epidemic spread) [14]. Information fusion algorithms, on the contrary, strongly rely on mathematical models: in this case, the dynamic model of an epidemic outbreak and the measurement model of a particular syndromic data stream [13,7,23]. Evidently the accuracy of the information fusion algorithms is significantly determined by the fidelity of the underlying models. This paper presents a study of a recursive information fusion algorithm for the prediction of an epidemic outbreak using syndromic-like observations. The problem is formulated in the Bayesian context of stochastic nonlinear filtering and solved using a particle filter. While this problem formulation and the particle filter implementation have been considered earlier, see [18,33,27,42], this paper introduces several novelties. In order to overcome the limitations of the standard compartmental model of an epidemic
0025-5564/$ - see front matter Crown Copyright Ó 2012 Published by Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.mbs.2012.05.010
13
A. Skvortsov, B. Ristic / Mathematical Biosciences 240 (2012) 12–19
(the well-mixed approximation), in this paper we employ a more flexible alternative which includes an explicit parameter of ‘‘mixing efficiency’’ (the level of social interaction). This model is more appropriate to represent the variety of social interactions in a small community (self-isolation and panic). Furthermore, a more flexible imprecise model of syndromic measurements is adopted in the paper. This model has been validated by the Internet queries data, see [14]. The key feature is that we can formulate and solve the optimal Bayes estimator/predictor even when the measurement model parameters are imprecisely known (which is typically the case with syndromic data). The paper includes numerical results as a proof of concept using synthetic syndromic data, generated according to the model in [14]. From this perspective, the present study is a significant extension of our earlier work [33,42]. 1.1. On a dynamic model of an epidemic outbreak The selection of a population model that can adequately capture a rich variety of possible epidemic scenarios has been always considered a challenging task [1]. In the context of computational efficiency, such a model would need to satisfy two competing criteria. On one hand, the model should be complex enough to represent the rich variety of social interactions that may occur in the community (including possible behavioral changes associated with self-isolation, fear, panic, etc.). On another hand, the candidate model should be simple enough to enable operational framework of continuous data assimilation (as more data become available) and early parameter estimation (at initial stages of an epidemic). The latter significantly constrains the possible number of parameters in the candidate model. From the operational point of view, only three parameters of the epidemic curve are of primary interest for epidemic preparedness: the epidemic peak amplitude, the timing of the peak and the overall duration of the epidemic; many other relevant quarantines (e.g. total number of infected individuals during the epidemic) can be approximately estimated from those three parameters. Therefore there is a requirement for an epidemiological model which is capable to produce the three-parameter epidemic curves. All these considerations provide a rationale for our selection of the modified SIR model (mSIR), as a suitable epidemiological model for our framework. The mSIR model is a modification of the celebrated two-parameter SIR model for the case of ‘inhomogeneous’ mixing (this term is used to describe preferable social interactions within subgroups in the community, see [43,26,34]). From the mathematical point of view this modification is achieved by introduction of an additional parameter, commonly referred to as the mixing inhomogeneity. Hence the mSIR model, with its three parameters to fit, meets the criteria of a candidate model for the bio-surveillance data assimilation. It is noteworthy that such a modification is not specific to the SIR model and can be applied to more complex epidemiological models (see [5] and references therein). Our choice of the mSIR model was motivated by its relevance for the operational context of the problem (short-term prediction for a rapidly emerging epidemic), its attractive simplicity and flexibility as well as its consistent compliance with the high fidelity agent-based simulations [43,26]. A critical aspect of capturing the statistical variation of real epidemic data is an adequate representation of stochastic variation in the model, i.e. moving from a deterministic model to a stochastic model. The consistent introduction of stochastic variation in the epidemic models has been recently a topic of significant attention (see [28] and references therein). In general, the stochastic component may arise due to the changes in the environment (so-called environmental noise) and due to the random variability between individuals in the community (the latter known as demographic noise) [28,45,11]. In our study we consider only the demographic
noise, assuming the environment to be static during the early stages of an epidemic. It has been established that the demographic noise in population models undergoes a special scaling rule [28,11,44], that is rather universal and is even independent of underlaying statistics of social contacts [11]. 2. Problem specification This section describes the models to be used in estimation. To describe the dynamics of an epidemic outbreak we employ the mSIR epidemic model with stochastic fluctuations [44,10,1]. According to this model the population of a community can be subdivided into three interacting groups: susceptible, infectious and recovered individuals. Let the number of susceptible, infectious and recovered be denoted by S, I and R, respectively, so that S þ I þ R ¼ P, where P is the total population size. The dynamic model of an epidemic progression in time can be be expressed by two stochastic differential equations [2] and the ‘‘conservation’’ law for the population as follows:
ds ¼ q þ rq n; dt
ð1Þ
di ¼ q ci rq n þ rc f; dt
ð2Þ
r ¼ 1 s i:
ð3Þ
The explanation of notation is as follows: s ¼ S=P; i ¼ I=P; r ¼ R=P are the normalised compartment sizes; n; f are two uncorrelated white Gaussian noise processes, both with zero mean and unity variance; c is the recovery time, that is a reciprocal of the average infectious period for the disease; the term qði; sÞ describes the social interactions between the groups of population. Following [43,26] we assume qði; sÞ ¼ b i sm , where m P 0 is the ‘mixing’ intensity.1 The case m ¼ 1 corresponds to the traditional ‘well-mixed’ approximation, while for m ¼ 0 we obtain a linear system without any social interactions (self-isolation). The contact rate parameter b is a product b ¼ q c, with q being the basic reproductive number, which represents the average number of people infected by a direct contact with a sick person that would be expected if the population was entirely susceptible. For an emerging epidemic, the epidemic model parameters can be assumed to be partially known as interval values, that is b 2 ½b; b; c 2 ½c; c and m 2 ½m; m. The terms rq rq ðs; iÞ and rc rc ðs; iÞ are introduced to capture the demographic stochasticity discussed above. In our case this noise is due to random variations in the contact rate b and in the recovery time c between individuals in the community. Next we briefly discuss the noise amplitude terms rq ; rc that feature in (1) and (2); for a detailed explanation see [28]. The terms rq ; rc can be established from a well-known scaling law of Gaussian fluctuations generated by the random contact rate qðs; iÞ ¼ b i sm and recovery time c. Thus for a dynamical system Eqs. (1)–(3) consisting of a large number of individuals P we can write [28,44,10,44]:
rq ðs; iÞ
rffiffiffiffiffiffiffiffiffiffiffi b i sm ; P
rc ðs; iÞ
rffiffiffiffi ci : P
ð4Þ
1 More general parametrisation of the interaction term have been proposed recently qði; sÞ ¼ b il sm with l; m P 0 [26,34]. Our choice of qði; sÞ was motivated by its relative simplicity (less parameters) and its excellent agreement with the high fidelity agent-based simulations [43].
14
A. Skvortsov, B. Ristic / Mathematical Biosciences 240 (2012) 12–19
With further assumptions s s0 1; r r 0 0; i i0 1=P, that hold during the initial stages of an epidemic (where s0 ; i0 ; r 0 are the initial values of s; i; r) expressions (4) can be reduced to the following scaling rule for the stochastic components of the model
rq ðs; iÞ
pffiffiffi b ; P
rc ðs; iÞ
pffiffiffi
c
P
:
ð5Þ
Observe from (5) that during the initial stages of an epidemic, the stochastic components obey the similar scaling rule (approximately rq ; rc / 1=P). This scaling rule enables to draw predictive conclusions about the performance of the proposed algorithm in different operational settings (e.g. for different sizes of the community). It is worthwhile emphasizing that although we introduced an approximation rq ; rc const at the initial stage of the epidemic, it does not imply that we neglected the random variability of parameters of b and c between individuals (i.e. population inhomeogenity), since the very nature of demographic noise (i.e. appearance of terms rq ; rc in (1)) is a result of such a variability [28,45,11]. The introduction of Gaussian noise terms in Eqs. (1)–(3) deserves additional comments. Generally speaking there is a nonzero probability that as a result of a random fluctuation the system Eqs. (1)–(3) would produce values for s; i that are outside the domain of their validity [0,1] (which would make the model Eqs. (1)–(3) unusable for estimation). In the theory of stochastic processes the likelihood of such an event is characterized by a special parameter referred to as the mean first–passage time (MFPT) [30]. The MFPT has a meaning of a typical time scale for the event of crossing the boundaries of a stochastic system domain. In the proposed framework we assume that the following condition holds:
s=T 1;
ð6Þ
where T is the MFPT for the model Eqs. (1)–(3) to exit boundaries of the domain ½0; 1 and s is a typical integration interval (see Section 3.1). Rigorous estimation of MFPT for the system Eqs. (1)–(3) is a challenging task that involves application of the advanced analytical methods (the Fokker–Planck equation, the eikonal approximation, various closures for statistical moments, etc.). This topic is still an area of active research and its comprehensive review is outside the scope of the current paper (see [19,28,37] and Refs. therein). In summary, it was found that the MFPT for the system Eqs. (1)–(3) can be estimated with the following expression [19,37]
T=tmax expðPÞ;
ð7Þ 2
where ¼ ð1 1=qÞ =2, parameter q was introduced earlier (q > 1 for emerging epidemic), t max is the time when the epidemic reaches its peak and is given by (23). It is straightforward to see that in the limit for large population (large P and P 1), T is exponentially large in comparison with t max and therefore to s (since t max s). A heuristic logic was introduced in the software implementation in order to keep the solutions of the model Eqs. (1)–(3) within the feasibility domain ½0; 1 and protect the proposed algorithm from the unlikely event of crossing its domain boundaries. The measurement model is introduced next. Following [14] we can assume that a power law relationship holds for odds-ratios of syndromic observations and the number of infected people:
1j zj i / ; 1i 1 zj
ð8Þ
where zj is the observable syndrome with index j ¼ 1; . . . ; N z (normalized by population size P), 1j is the power-law exponent (in general different for different syndromes). Since at the initial stage of epidemic i 1; zj 1, Eq. (8) can be reduced to a simple powerlaw model 1
zj ¼ b j i j þ sj ;
ð9Þ
where bj ¼ const P 0. The noise term sj is added to simulate random nature of measurement (measurements noise) which is assumed to be uncorrelated to other syndromes and noises n and f. Since zj P 0 the noise term sj associated with syndrome j should be modelled with a random variable that provides strictly non-negative realisations (e.g. Poissonian or truncated Gaussian, see [23]). In the current study we adopt a random variable that obeys the log-normal distribution: this means that we set sj ¼ rj gj , where gj is the standard log-normal noise, gj ln N ð0; 1Þ. Parameters bj ; rj ; 1j typically are not known, but with a representative dataset of observations the model (9) can be estimated during the calibration phase (see results of the linear regression fits in [14]). The data fit reported in [8] suggests that 1j is close to unity, but it is not precisely known because of the significant scattering of data points. To cater for this uncertainty we assume that 1j can take any value in an interval, 1 2 ½1; 1 around 1 ¼ 1. Since [14,8,39], do not report any specific values of the fitting parameters, in our numerical studies we will synthesize the syndromic measurements using the heuristic values for bj ; rj . The scaling law for the noise parameter rj in (9) can be deduced by employing the arguments leading to (5) and thus we arrive at the scaling rj / ð1=PÞ1=2 . The latter scaling law in conjunction with the scaling laws (5) provides a consistent way to compare predictive skills of our algorithm for communities with different population size and to draw conclusive decisions about its performance before any operational deployment. The problem is to estimate the (normalised) number of infected i and susceptible s at time t, using syndromic observations zj of (9), collected up to time t. Let x denote the state vector to be estimated; it includes i and s, but also the imprecisely known parameters b; c and m. The formal Bayesian solution is given in the form of the posterior probability density function (PDF) pðxt jz1:t Þ, where xt is the state vector at time t and z1:t denotes all observations up to time t. Based on the posterior pðxt jz1:t Þ, our aim is to predict the progress of the epidemic at some future time t 0 > t, using the dynamic model Eqs. (1)–(3). 3. Optimal Bayesian solution for imprecise likelihood 3.1. Formulation The model Eqs. (1)–(3) is given in continuous time. For the purpose of computer implementation, we require a discrete-time approximation of this model. The state vector is adopted as
x ¼ ½i s b c
m T ;
ð10Þ
T
where denotes the matrix transpose. The state includes the unknown disease parameters, b and c, as well as the unknown population mixing parameter m. The parameters of the observation model (9) are not included in the state vector because their values are assumed known (i.e. estimated during the calibration phase), albeit 1j only imprecisely. Neglecting for the moment the process noise terms, the evolution of the epidemic state can be written according to Eqs. 1 and 2 as x_ ¼ gðxÞ where
gðxÞ ¼ ðbsm cÞi bism
0 0 0
T
:
The nonlinear differential equation governing the evolution of the state cannot be solved in the closed-form. The Euler method provides a simple approximation valid for small integration interval s > 0 : xðt þ sÞ xðtÞ þ sgðxðtÞÞ. The state-evolution in discretetime t k can then be expressed as:
xkþ1 f k ðxk Þ þ wk ;
ð11Þ
15
A. Skvortsov, B. Ristic / Mathematical Biosciences 240 (2012) 12–19
xk ½5
In this notation xk ½i represents the ith component of vector xk . Process noise wk in (11) is assumed to be zero-mean white Gaussian with a diagonal covariance matrix Q of size 5 5. Its first two diagonal components, according to (5), can be expressed as Q ð1; 1Þ ðb þ cÞs2 =P2 and Q ð2; 2Þ bs2 =P 2 . The remaining components are set to zero based on assumption that b; c; m are constant during the epidemic. The optimal Bayes filter is typically presented in two steps, prediction and update. Suppose the posterior PDF at time tk is given by pðxk jz1:k Þ. Then the prediction step computes the PDF predicted to time tm ¼ t k þ s as [17]:
pðxm jz1:k Þ ¼
Z
pðxm jxk Þpðxk jz1:k Þ dxk
ð13Þ
where pðxjx0 Þ is the transitional density. Let N ðy; l; PÞ denote a Gaussian PDF with mean l and covariance P. Then according to (11) the transitional density is given by
pðxjx0 Þ ¼ N ðx; f k ðx0 Þ; Q Þ:
ð14Þ
The prediction step is carried out many times with tiny sampling intervals s until an observation zj;kþ1 becomes available about syndrome j at time t kþ1 . The predicted PDF at tkþ1 is pðxkþ1 jz1:k Þ. In the standard Bayesian estimation framework, this PDF is updated using the measurement zj;kþ1 by multiplication with the measurement likelihood function [17]. According to (9), the likelihood function in this case is ‘ðzj;kþ1 jxkþ1 Þ ¼ ln N ðz; hj ðxkþ1 ; 1Þ; r2j Þ, where hj ðxk ; 1j Þ ¼ bj xk ½11j . Now, the problem is that hj ðxk ; 1j Þ defined is this way is not a function because 1j 2 ½1j ; 1j . An elegant solution to the imprecise measurement transformation is available in the framework of random set theory [22]. In this approach hj ðx; 1j Þ þ sj , where 1j is an interval, defines a random closed set Rj;x on the measurement space of syndromic observations Z, i.e. Rj;x # Z. The probability laws of random sets can be specified via distribution functions of random sets, see for details [16]. The question is how to state that a particular (point) measurement zj;kþ1 2 Z matches the measurement model. Mahler [22] takes the approach that a measurement matches the model if it does not contradict it, i.e., if zj;kþ1 \ Rj;x – ;. This translates to the ~ j;kþ1 jxkþ1 Þ ¼ Prfzj;kþ1 2 Rj;x g, measurement likelihood function ‘ðz referred to as the generalised likelihood. The Bayes update step using measurement zj;kþ1 is then defined as [22]:
pðxkþ1 jz1:kþ1 Þ ¼ R
~ j;kþ1 jxkþ1 Þ pðxkþ1 jz1:k Þ ‘ðz ~ j;kþ1 jxkþ1 Þ pðxkþ1 jz1:k Þ dxkþ1 ‘ðz
:
ð15Þ
~ j;k jxk Þ is according to its definiFor the measurement model (9), ‘ðz tion given by [31]:
~ j;k jxk Þ ¼ /ðzj;k ; Rj;x ; r2 Þ /ðzj;k ; Rj;x ; r2 Þ; ‘ðz j j k k
ð16Þ
where
Rj;x ¼ inf hj ðx; ½1j ; 1j Þ; Rj;x ¼ sup hj ðx; ½1j ; 1j Þ and /ðu; l; PÞ ¼ distribution.
Ru 1
ð17Þ ð18Þ
ln N ðy; l; PÞ dy is the cumulative log-normal
The recursions of the Bayes filter start with the initial PDF (at time tk ¼ 0), denoted pðx0 Þ, which is assumed known. 3.2. Particle filter implementation The proposed Bayesian estimator cannot be solved in the closed form. Instead we develop an approximate solution based on the particle filter (PF) [3,12,32]. The PF approximates the posterior PDF pðxk jz1:k Þ by a weighted random sample, that is:
pðxk jz1:k Þ
L X ðiÞ wk dxðiÞ ðxÞ:
ð19Þ
k
i¼1
ðiÞ
Here dy ðxÞ is the Dirac delta function focused at y; xk is a sample ðiÞ (particle), wk is its weight and L is the particle count. As L is increased, approximation (19) becomes more accurate. The weights are normalised so that:
Z
pðxk jz1:k Þ dxk
L X ðiÞ wk ¼ 1:
ð20Þ
i¼1
The adopted PF implementation is known as the bootstrap PF [15]. The algorithm starts by forming a random sample from the ðiÞ initial PDF pðx0 Þ : x0 pðx0 Þ, for i ¼ 1; . . . ; L. The initial weights ðiÞ are equal, that is w0 ¼ 1=L. The prediction steps consist of propagating particles using the dynamic model (11). This is implemented by drawing samples from the transitional density (14). The prediction steps are carried out until a measurement zj;k becomes available, that is until time t k . Suppose the set of predicted ðiÞ particles at time t k is fxkjk1 ; i ¼ 1; . . . ; Lg. The update step is implemented in two stages. The measurements can arrive asynchronously, and let at time k the measurement arrive from syndrome jk . The unnormalised importance weights of predicted particles are then computed as [32]: ðiÞ ~ ~ ðiÞ w k ¼ ‘ðzjk ;k jxkjk1 Þ
ð21Þ
with generalised likelihood ‘~ specified in (16). This is followed by the weight normalisation, that is ðiÞ
ðiÞ
~k wk ¼ w
L .X ~ ðnÞ w k
ð22Þ
n¼1
for i ¼ 1; . . . ; L.
SIR model experim. data
900 800 700
Number of infected people
where k ¼ t k =s is the discrete-time index and f k ðxk Þ is the transition function given by 3 2 xk ½1 þ sxk ½1 xk ½3xk ½2xk ½5 xk ½4 7 6 7 6 7 6 xk ½2 sxk ½3xk ½1xk ½2xk ½5 7 6 ð12Þ f k ðxk Þ ¼ 6 7: xk ½3 7 6 7 6 5 4 xk ½4
600 500 400 300 200 100 0
0
20
40
60
80
100
120
Time [days] Fig. 1. The solid line, representing the number of infected people over time, was obtained from the agent based simulation population model [40,29]. The dashed line shows the modified SIR model fit
16
A. Skvortsov, B. Ristic / Mathematical Biosciences 240 (2012) 12–19
The second stage is the resampling step. The role of resampling is to eliminate (in the probabilistic manner) the particles with low importance weights and to clone the samples with high importance weights. This is carried out by sampling with replacement, ðiÞ
with the probability of sampling each xkjk1 equal to the normalised importance weight
ðiÞ wk .
The result is a mapping of a random
measure fwk ; xkjk1 gLi¼1 into a new random measure with uniform ðiÞ
ðiÞ
ðiÞ f1L ; xk gLi¼1 .
weights: Several computationally efficient resampling schemes have been reported; we have implemented the systematic resampling algorithm [20], following the pseudo-code given in Table 3.2 of [32]. Table 1 Syndromic measurement model parameters. j
bj
1j
rj
1 2 3 4
0.25 0.27 0.23 0.29
1.07 1.05 1.01 0.98
0.0012 0.0008 0.001 0.0011
4. Numerical results 4.1. Simulation setup The estimation and prediction of an epidemic will be carried out using an experimental data set obtained using a large-scale agent simulation based on the CROWD model developed by DSTO (see examples in [40] and details of its technical architecture in [29]). The following scenario was simulated [40,29]. A small Australian virtual town, population size P ¼ 5000, was built according to the Census data from the Australian Bureau of Statistics. This virtual town has a typical age/gender breakdown with the population grouped in families and households. The census data was used to create typical daily activities within the town (workplaces, schools, etc.). The population then moves around the town according to their typical daily habits and interacts. A source of epidemic is injected in this virtual environment, with population interactions driving the spread of the epidemic in a probabilistic manner [29]. In line with the prediction of the SIR model the disease spreads quickly through the susceptible population, however, as more people become infected, the availability of susceptible people
1500
1500
1000
β
500 0
1000
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0
0.5
1000
0
γ 0.095
0.1
0
0.105 0.11 0.115 0.12 0.125 0.13 0.135 0.14
0.25
0.3
0.35
0.4
0.45
0.5
γ 0.095
0.1
0.105 0.11 0.115 0.12 0.125 0.13 0.135 0.14
600
ν
400
ν
400
200
200
0 0.95
1
1.05
1.1
1.15
1.2
1.25
0 0.95
1.3
(a)
1
1.05
1.1
1.15
1.2
1.25
1.3
(a)
0.35
0.35
0.3
0.3
Proportion of infected (I/P)
Proportion of infected (I/P)
0.2
500
600
0.25 0.2 0.15 0.1 0.05
0
0.15
1000
500
0
β
500
0.25 0.2 0.15 0.1 0.05
20
40
60
80
100
120
TIME [days]
(b) Fig. 2. Estimation/prediction results from the particle filter after processing measurements collected over 25 days of surveillance assuming that 11 2 ½1:04; 1:09, 12 2 ½1:02; 1:07; 13 2 ½0:98; 1:03, and 14 2 ½0:95; 1:00 (the true values of 1 given in Table 1): (a) the histograms of estimated parameters b; c and m (‘‘true values’’ indicated by vertical red lines); (b) Prediction results for a random sample of 100 particles (gray lines); the red line is the experimental curve from Fig. 1.
0
0
20
40
60
80
100
120
TIME [days]
(b) Fig. 3. Estimation/prediction results from the particle filter after processing measurements collected over 25 days of surveillance assuming that all 1j 2 ½0:85; 1:15 (the true values of 1j given in Table 1): (a) the histograms of estimated parameters b; c and m (‘‘true values’’ indicated by vertical red lines); (b) Prediction results for a random sample of 100 particles (gray lines); the red line is the experimental curve from Fig. 1
17
0.3
0.3
0.25
0.25
0.2
0.2
I/P
I/P
A. Skvortsov, B. Ristic / Mathematical Biosciences 240 (2012) 12–19
0.15
0.15
0.1
0.1
0.05
0.05
0
0
20
40
60
80
100
0
120
0
20
40
Time [days]
0.3
0.3
0.25
0.25
0.2
0.2
0.15
0.1
0.05
0.05
20
40
60
80
100
0
120
0
20
40
Time [days]
0.25
0.25
0.2
0.2
I/P
I/P
0.3
0.15
0.1
0.05
0.05
40
60
80
100
120
80
100
120
0.15
0.1
20
60
(d)
0.3
0
120
Time [days]
(c)
0
100
0.15
0.1
0
80
(b)
I/P
I/P
(a)
0
60
Time [days]
80
100
120
Time [days]
(e)
0
0
20
40
60
Time [days]
(f)
Fig. 4. Prediction results after processing measurements collected over: (a) 10; (b) 15; (c) 20; (d) 25; (e) 30; (f) 35 days of surveillance. The thin blue solid lines indicate the 95% confidence region; the red solid line is the experimental curve from Fig. 1. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
drops, making it less likely for those infected to pass the disease on. Thus the epidemic eventually dies out. An example of the output of the agent-based simulations is presented in Fig. 1. The blue line shows the number of people of this town infected by a fictitious disease, reported once per day during
a period of 154 days (only first 120 days shown). The dashed red line represents the deterministic mSIR model fit (using the entire batch of 154 data points, and setting wk ¼ 0 in (11)), with esti^ ¼ 0:2399, c ^ ¼ 0:1066; m ^ ¼ 1:2042. The mated parameters b parameter estimates were obtained using a Monte Carlo impor-
A. Skvortsov, B. Ristic / Mathematical Biosciences 240 (2012) 12–19
4.2. Estimation and prediction results The initial PDF for the state vector was chosen as pðx0 Þ ¼ pði0 Þpðs0 Þpðb0 Þpðc0 Þpðm0 Þ with pði0 Þ ¼ N ½0;1 ði0 ; zj;1 =bj ; r2j Þ; pðs0 Þ ¼ N ½0;1 ðs0 ; 1 zj;1 =bj ; r2j Þ; pðb0 Þ ¼ U ½0:14;0:5 ðb0 Þ; pðc0 Þ ¼ U ½0:09;0:143 ðc0 Þ; pðm0 Þ ¼ U ½0:95;1:3 ðm0 Þ, where N ½a;b ðx; l; PÞ and U ½a;b ðxÞ denote a truncated Gaussian distribution restricted at support ½a; b and a uniform distribution with limits ½a; b, respectively. The number of particles was set to L ¼ 10000. The first set of results examines the influence of imprecision in parameters 1j ; j ¼ 1; . . . ; 4 of the measurement model, as well as the estimation of the unknown parameters of the mSIR model: b; c and m. The algorithm assumed the correct knowledge of bj and rj , but only partial knowledge of 1j parameter. Fig. 2 shows the results obtained for small imprecision in 1j , using 11 2 ½1:04; 1:09; 12 2 ½1:02; 1:07, 13 2 ½0:98; 1:03, and 14 2 ½0:95; 1:00. Fig. 3 corresponds to the case with fairly imprecise knowledge of 1, in particular 1j 2 ½0:85; 1:15, for all j ¼ 1; 2; 3; 4. The prediction was carried out after day 25. Fig. 2. (a) shows the histograms of the particle filter estimated values of b; c and m, after processing all 25 days of measurements (i.e. in total 100 measurements with N z ¼ 4). This figure reveals that the uncertainty in parameter b has been substantially reduced compared with the initial PDF pðb0 Þ ¼ U ½0:14;0:50 ðb0 Þ. The uncertainty in parameter c has been also reduced, although not as dramatically. Finally, the uncertainty in m, has not been reduced at all, indicating that this parameter is difficult to estimate in the early stages of an epidemic (as discussed in Section 4.1, m can be estimated when the measurements are available for the entire duration of the epidemic). While this is unfortunate, it does not appear to be a serious problem since the prior on m in practice is tight (m 1). This is confirmed in Fig. 2. (b) which shows a sample of 100 overlayed predicted epidemic curves (gray lines) based on the state estimate after 25 days. Fig. 2. (b) indicates the prediction performance; the timing of the peak of the epidemic appears to be fairly accurate, while the size of the peak is more uncertain. Most importantly, however, the true epidemic curve (solid red line) appears to be always enveloped by the prediction curves. The case of higher imprecision in 1j parameters is presented Fig. 3. Comparing Figs. 2 and 3 one can make two observations: first, in accordance with our expectations, the estimation and prediction results are more uncertain when knowledge of 1 is imprecise; second, even when the knowledge of 1 is imprecise, the curve
120
(a)
[days]
100 80
max
t
tance sampling scheme referred to as the progressive correction [12, Ch. 14]. Fig. 1 serves to verify that the mSIR model adopted for our framework, although very simple and fast to run, is remarkably accurate in explaining the data obtained from a very complex agent-based simulation. This conclusion is line with the results of the agent-based simulations reported in [43] for much larger (multi-million) communities (Los Angeles, Chicago, and Portland). The syndromic-like measurements are generated synthetically in accordance with model (9), where the ‘‘true’’ (normalised) number of infected people i is replaced by the output of the agent-based simulation at each time step (shown by the blue solid line in Fig. 1). The parameters of the measurement model (9) are given in Table 1. The assumption is that conditionally independent measurements, concerning N z ¼ 4 syndromes, are available asynchronously on a daily basis during the epidemic. The problem was to perform the estimation sequentially as the measurements become available until the ‘cut-off’ day, after which the task is to predict the epidemic curve. The prediction is carried out from the solution of the stochastic mSIR model using the estimated parameters. The predictive performance of the proposed framework was compared (after the ‘cut-off’ day) with the ‘‘true’’ epidemic curve, obtained from the agent-based simulations.
60 40 20 0
5
10
15
20
25
30
35
40
30
35
40
Time [days] 0.4
(b)
0.3
imax
18
0.2 0.1 0
5
10
15
20
25
Time [days] Fig. 5. Prediction of the epidemic peak after processing observations collected over 10,15, 20, 25, 30, 35 days of surveillance. The error bars indicate 95% confidence intervals. The actual values indicated by dotted horizontal lines. (a) Timing of the peak tmax ; (b) intensity of the peak imax .
corresponding to what we consider as the ‘‘true’’ number of infected people (solid red line) is always enveloped by predictions. The second set of numerical results investigates the influence of the cut-off day for prediction. Using the same parameters as above, and assuming that 1j 2 ½0:85; 1:15, the cut-off day was selected to be 10, 15, 20, 25, 30, 35. The epidemic prediction results are shown in Fig. 4. The thin blue solid lines indicate the 95% confidence regions; the red solid line, as before, is the experimental curve from Fig. 1. According to Fig. 4, the prediction can be carried out even at early stages of an epidemic, albeit with high uncertainty. As more observations becomes available (i.e. a postponed cut-off date), the prediction uncertainty is reduced. Finally, we consider estimation of the peak values (the timing and intensity) of the epidemic curve, as a function of the cut-off day for prediction. Using the same parameters as above, with the cut-off days for prediction as 10, 15, 20, 25, 30, 35 days, the results, averaged over 20 runs of the algorithm, are shown in Fig. 5. The timing and intensity of the peak are computed using the approximate formulae from [35]:
tmax
1 1 GðqÞ þ ; log i0 cðq 1Þ c
imax 1 q
1
þq
1
1
logðq Þ;
ð23Þ ð24Þ
where GðqÞ is a known function that is tabulated in [35] (e.g. GðqÞ 0:4 for 1 < q < 2) and c; q and i0 are estimated from the observations available after 10, 15, 20, 25, 30, 35 days. Fig. 5 displays the estimates with 95% confidence intervals, as well as the actual values (dotted lines), extracted from the epidemic curve in Fig. 1. The results indicate that both the timing and intensity peak estimates are reasonably accurate, even using observations over a short period of just 10 days. The estimates obtained for shorter periods of time, however, are characterised by large uncertainty. This uncertainty gradually reduces as more data become available. 5. Conclusions We have developed an algorithm for estimation and prediction of an epidemic outbreak formulated in the context of stochastic nonlinear filtering and using syndromic-like observations. The dynamics of the epidemic is modeled using a compartmental epidemiological model with inhomogeneous mixing, which explicitly includes a parameter responsible for the level of social interac-
A. Skvortsov, B. Ristic / Mathematical Biosciences 240 (2012) 12–19
tions. This model enables simulation of a rich variety of social behavior that may be observed in a small community in response to an epidemic (i.e. fear, self-isolations, panic). The syndromic-like observations (e.g. web queries, Twitter messages, sales of certain products, absenteeism from work/study, etc.) are assumed available for estimation and prediction. The observations are created according to a verified internet queries model. The algorithm, implemented as a particle filter, provides continuous estimation of the state of the epidemic, including the number of infected people and the unknown parameters of the epidemiological model. The numerical results indicate that the proposed framework can provide useful early prediction of the epidemic peak when the uncertainty in prior knowledge of model parameters (disease parameters, social behavior) is not excessive. The proposed analytical framework should be seen only as a conceptual solution for a new bio-surveillance system. A number of practical problems still remains to be studied and resolved before its eventual operational deployment. For instance, there is a potential challenge regarding the quality of input data-streams (e.g. Google queries and Twitter) and their information value. Our future work would encompass a generalisation of the proposed framework to include detection and prediction of other social and behavioral processes that can be described by similar population models (computer worm attacks, propaganda campaigns, complex social engagements, sensor networks, quorum sensing, etc. see [24,38,41]). Acknowledgments We thank R. Gailis, B. Meerson, C. Woodruff and E. Yee for valuable discussions and two anonymous reviewers for their important comments and suggestions, which help to improve the quality of the paper. References [1] R.M. Anderson, C. Fraser, A.C. Ghani, Epidemiology transmission dynamics and control of SARS: the 2002–2003 epidemic, Philos. Trans. R. Soc. B Biol. Sci. 359 (2004) 1091. [2] R.M. Anderson, R.M. May, Population biology of infectious diseases: Part 1, Nature 280 (1979) 361. [3] M.S. Arulampalam, S. Maskell, N. Gordon, T. Clapp, A tutorial on particle filters for non-linear/non-Gaussian Bayesian tracking, IEEE Trans. Signal Process. 50 (2) (2002) 174. [4] C. Fraser at al, Pandemic potential of a strain of influenza A (H1N1): early findings, Science 324 (5934) (2009) 1557. [5] C. Breto, D. He, E.L. Ionides, A.A. King, Time series analysis via mechanistic models, Ann. Appl. Stat. 9 (1) (2009) 319. [6] S. Cauchemez, N.M. Ferguson, Likelihood-based estimation of continuous-time epidemic models from time-series data: Application to measles transmission in London, J. R. Soc. Interface. 5 (25) (2008) 885. [7] B. Cazelles, N.P. Chau, Using the kalman filter and dynamic models to assess the changing HIV/AIDS epidemic, Math. Biosci. 140 (2) (1997) 131. [8] C. Chew, G. Eysenbach, Pandemics in the age of Twitter: content analysis of tweets during the 2009 h1n1 outbreak, PLoS One 5 (11) (2010) e14118. [10] C.E. Dangerfield, J.V. Ross, M.J. Keeling, Integrating stochasticity and network structure into an epidemic model, J.R. Soc. Interface 6 (38) (2009) 761. [11] R.A. Desharnais, R.F. Costantino, J.M. Cushing, S.M. Henson, B. Dennis, A.A. King, Experimental support of the scaling rule for demographic stochasticity, Ecol. Lett. 9 (2006) 537. [12] A. Doucet, J.F.G. de Freitas, N.J. Gordon, editors, Sequential Monte Carlo Methods in Practice, Springer, 2001. [13] C. Egat, F.C. Arrat, C.L. Ajaunie, H. Wackernagel, Early detection and assessment of epidemics by particle filtering, in: Proceedings of the Sixth European Conference on Geostatistics for Environmental Applications (GeoENV VI), Springer, Rhodes, Greece, 2008, p. 23. [14] J. Ginsberg, M.H. Mohebbi, R.S. Patel1, L. Brammer, M.S. Smolinski, L. Brilliant, Detecting influenza epidemics using search engine query data, Nature 457 (2009) 1012.
19
[15] N.J. Gordon, D.J. Salmond, A.F.M. Smith, Novel approach to nonlinear/nonGaussian Bayesian state estimation, IEE Proc. F 140 (2) (1993) 107. [16] R.P.S. Mahler, I.R. Goodman, H.T. Nguyenm, Mathematics of Data Fusion, Kluwer, 1997. [17] A.H. Jazwinski, Stochastic Processes and Filtering Theory, Academic Press, 1970. [18] C. Jégat, F. Carrat, C. Lajaunie, H. Wackernagel, Early detection and assessment of epidemics by particle filtering, in: A. Soares, M.J. Pereira, R. Dimitrakopoulos (Eds.), GeoENV VI – Geostatistics for Environmental Applications, Springer, 2008, p. 23. [19] A. Kamenev, B. Meerson, Predicting extinction rates in stochastic epidemic models, Phys. Rev E 77 (6) (2008) 061107. [20] G. Kitagawa, Monte Carlo filter and smoother for non-Gaussian non-linear state space models, J. Comput. Graph. Stat. 5 (1) (1996) 1. [21] L. Dailey, R.E. Watkins, A.J. Plant, Timeliness of data sources used for influenza surveillance, J. Am. Med. Inform. Assoc. 14 (5) (2007) 177. [22] R. Mahler, Statistical Multisource Multitarget Information Fusion, Artech House, 2007. [23] J. Mandel, J.D. Beezley, L. Cobb, A. Krishnamurthy, Data driven computing by the morphing fast Fourier transform ensemble Kalman filter in epidemic spread simulations, Procedia Computer Science 1 (1) (2010) 1221. [24] Y. Moreno, M. Nekovee, A. Vespignani, Efficiency and reliability of epidemic data dissemination in complex networks, Phys. Rev. E 69 (5) (2004) 055101. [26] A.S. Novozhilov, On the spread of epidemics in a closed heterogeneous population, Math. Biosci. 215 (2) (2008) 177. [27] J.B.S. Ong, M. I-C. Chen, A.R. Cook, H.C. Lee, V.J. Lee, R.T.P. Lin, P.A. Tambyah, L.G. Goh, Real-time epidemic monitoring and forecasting of H1N1-2009 using influenza-like illness from general practice and family doctor clinics in Singapore, PLoS ONE 5 (4) (2010) e10036. [28] O. Ovaskainen, B. Meerson, Stochastic models of population extinction, Trends Ecol. Evol. 25 (11) (2010) 643. [29] R. Connell, P. Dawson, A. Skvortsov, Comparison of an agent-based model of disease propagation with the generalised SIR epidemic model, Technical report, DSTO, 2009. [30] S. Redner, A Guide to First-Passage Processes, Cambridge Univ. Press, 2001. [31] B. Ristic, Bayesian estimation with imprecise likelihoods: random set approach, IEEE Signal Process. Lett. 18 (7) (2011) 395. [32] B. Ristic, S. Arulampalam, N. Gordon, Beyond the Kalman Filter, Artech House, 2004. [33] B. Ristic, A. Skvortsov, M. Morelande, Predicting the progress and the peak of an epidemic, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2009), Taipei, Taiwan, 2009, pp. 513– 516. [34] M. Roy, M. Pascual, On representing network heterogeneities in the incidence rate of simple epidemic models, J. Ecol. Complexity 3 (1) (2006) 80. [35] I. Sazonov, M. Kelbert, M.B. Gravenor, The speed of epidemic waves in a onedimensional lattice of SIR models, J. Math. Model. Nat. Phenom. 3 (4) (2008) 28. [36] N.M. Schuster, M.A. Rogers, L.F. Jr, Using search engine query data to track pharmaceutical utilization: A study of statin, Am. J. Manag. Care 16 (8) (2010) e215. [37] I. B Schwartz, L. Billings, M. Dykman, A. Landsman, Predicting extinction rates in stochastic epidemic models, J. Stat. Mech. (1) (2009) P01005. [38] F. Schweitzer, R. Mach, The epidemics of donations: logistic growth and power-laws, PLoS One 3 (1) (2008) 1458. [39] A. Signorini, A.M. Segre, P.M. Polgreen, The use of Twitter to track levels of disease activity and public concern in the u.s. during the influenza a h1n1 pandemic, PLoS One 6 (5) (2011) e19467. [40] A. Skvortsov, R. Connell, P. Dawson, R. Gailis, Epidemic modelling: Validation of agent-based simulation by using simple mathematical models, in: International Congress on Modelling and Simulation (MODSIM 2007), Christchurch, New Zealand, 2007, pp. 657–662. [41] A. Skvortsov, B. Ristic, Modelling and performance analysis of a network of chemical sensors with dynamic collaboration, Int. J. Distrib. Sensor Netw. (1) (2012) 656231. [42] A. Skvortsov, B. Ristic, C. Woodruff, Predicting an epidemic based on syndromic surveillance, in: Proceedings of the 13th International Conference on Information Fusion, Edinburgh, UK, 2010. [43] P.D. Stroud, S.J. Sydoriak, J.M. Riese, J.P. Smith, S.M. Mniszewski, P.R. Romero, Semi-empirical power-law scaling of new infection rate to model epidemic dynamics with inhomogeneous mixing, Math. Biosci. 203 (2006) 301. [44] O.A. van Herwaarden, J. Grasman, Stochastic epidemics: major outbreaks and the duration of the endemic period, J. Math. Biology 33 (4) (1995) 581. [45] H.P. De Vladar, I. Pen, Determinism noise and spurious estimations in a generalised model of population growth, Phys. A 373 (2007) 477. [46] M. Wagner, A. Moore, R. Aryel, Handbook of Biosurveillance, Elsevier, 2006. [47] J. Walden, E.H. Kaplan, Estimating time and size of bioterror attack, Emerg. Infect. Dis. 10 (7) (2004) 1202–1205. [48] A.G. Wilson, G.D. Wilson, D.H. Olwell, Statistical Methods in Counterterrorism: Game Theory Modeling Syndromic Surveillance and Biometric Authentication, Springer, 2006.