Occupancy detection of residential buildings using smart meter data: A large-scale study

Energy & Buildings 183 (2019) 195–208 Contents lists available at ScienceDirect Energy & Buildings journal homepage: www.elsevier.com/locate/enbuild...



Rouzbeh Razavi a,∗, Amin Gharipour b, Martin Fleury c, Ikpe Justice Akpan a

a Department of Management & Information Systems, Kent State University, USA
b School of Information and Communication Technology, Griffith University, Australia
c School of Computer Science and Electronic Engineering, University of Essex, UK

Article info

Article history:
Received 31 May 2018
Revised 16 September 2018
Accepted 14 November 2018
Available online 19 November 2018

Keywords: Buildings occupancy; Smart meters; Privacy; Data mining

Abstract

Advanced Metering Infrastructures (AMIs) are installed to gather localized and frequently acquired energy consumption data. Despite many potential benefits, the installation of such meters has resulted in growing privacy concerns amongst the public. By analyzing the electricity consumption behavior of more than 5000 households over an 18-month period and deploying a wide array of machine learning methods, this paper examines whether high-frequency meter data are sufficient to predict the home-occupancy status of households not only in the present but also in the future. The authors believe that this is the first study at such a scale on this issue. The study proposes a genetic programming approach for feature engineering when training the models. The results reveal a high predictive power for smart meter data in establishing the present and future occupancy status of households. Also, the analysis of the demographic data suggests that households known to be least concerned with privacy are the ones who are more vulnerable to smart meter privacy implications. © 2018 Elsevier B.V. All rights reserved.

∗ Corresponding author (R. Razavi).
https://doi.org/10.1016/j.enbuild.2018.11.025
0378-7788/© 2018 Elsevier B.V. All rights reserved.

1. Introduction

Nearly 40% of the total energy in the world is consumed in buildings to provide an indoor environment that is comfortable for the occupants [1]. As part of the growing efforts to move towards energy sustainability, more attention is now being paid to making buildings more energy efficient. Reliable information about the occupancy status of buildings plays a crucial role in achieving this objective. For example, occupancy information can be used to optimize the operation of Heating, Ventilation, and Air Conditioning (HVAC) systems by reducing overall energy consumption without compromising the comfort of the occupants. The study in [2] suggests that occupancy-based control of indoor HVAC systems can reduce electricity usage in buildings by up to 40%. Similarly, the authors in [3] demonstrated that adaptive occupancy-based lighting control could reduce the electricity used for lighting by up to 76%. Moreover, occupancy status information can be used for fault detection of electrical appliances in buildings [4].

The recent deployment of Advanced Metering Infrastructures (AMIs) to monitor electrical power consumption over fine-grained time intervals has made it possible to infer building occupancy status from high-resolution meter readings. Inferring building occupancy from meter data in real time for controlling appliances can be quite useful, as it does not require additional sensors (e.g., CO2 measurement or infrared sensors) to be installed. At the same time, however, it poses a risk of burglary, along with leakage of information about the working, dining, and holidaying habits of households, if the meter readings are compromised [5]. Such concerns have resulted in the formation of opposition and advocacy groups, legal cases in superior courts, and new legislation [6]. Consequently, it is vital to understand the extent to which smart meter data can be predictive of the occupancy status of households. If meter data are found to be highly predictive of building occupancy status, this would be promising for adaptive control of appliances but would also imply the need for additional measures to protect households’ privacy.

Various machine learning algorithms have been developed in past studies to detect occupancy patterns from smart meter data [7,8]. The representation of the data in the form of variables, however, has been shown to have a significant impact on the performance of most machine learning methods [9,10]. Smart meter data include large volumes of time-series measurements that are subject to noise and may not be effective when used in raw form [11]. Feature engineering (FE) involves creating derived variables


from raw attributes to enhance the performance or interpretability of machine learning models. Without a proper feature engineering process, machine learning models are likely to underperform, to overfit the training data, and to be unstable over time [9]. Previous studies either have not used any form of feature engineering or have deployed only a few manually crafted features. Manually creating features is tedious and subjective, and the resulting feature set is likely to be sub-optimal. This is especially true considering that the optimal set of variables may vary from one machine learning algorithm to another. As part of this study, a dynamic feature engineering mechanism for occupancy detection from meter data is introduced. The proposed method deploys a Genetic Programming (GP) approach that is agnostic to specific machine learning models and hence can be used alongside existing solutions.

More importantly, the current literature examines the possibility of detecting building occupancy status from meter data only in the past or in real time. We hypothesize that, since most households have regular daily/weekly routines, meter data may not only establish home occupancy at the time of the meter readings but, with sufficient data, may also be predictive of the future occupancy status of households at specific time-slots. If these hypotheses can be validated, the privacy implications of smart meter data would be much more significant. Consequently, some of the key attendant questions addressed by this paper are:

1. Can machine learning models be trained to reveal the occupancy status of a given household from their meter data without requiring their past occupancy information?
2. What length of a household’s smart meter demand data is sufficient for predictive machine learning algorithms to infer future occupancy status accurately?
3. How far in the future can predictions be made with an acceptable level of accuracy?
4. How many “out-of-home” time-slots in a week can be correctly identified by predictive models?
5. Which household segments are more vulnerable to such building occupancy detection attacks?

To answer these questions, this study analyzes the electricity consumption behavior of more than 5000 households over an 18-month period. To ensure that the reported accuracy values were near the upper bound of the information that could be extracted from meter data, the proposed GP-based feature engineering approach was deployed along with a range of machine learning algorithms. A joint optimization framework was used for feature engineering and for tuning the hyper-parameters of the machine learning models.

The remainder of this paper is organized as follows. Section 2 presents a review of the existing research. Section 3 describes the methodology of this study, including an overview of the data used, the GP-based feature engineering approach, and details of the machine learning models and performance metrics. The results are then presented and discussed in Section 4. Finally, Section 5 presents the conclusion and final remarks.

2. Related studies

The potential applications of building occupancy detection to control electric appliances have been explored in a number of previous studies [2–4,12]. In [13], an aggressive duty-cycling of building HVAC systems based on occupancy status is proposed, and practical aspects of implementing the proposed solution are discussed. Similarly, the authors in [14] used a moving-window Markov Chain approach to establish the occupancy status of buildings and used that information for ventilation and

temperature control applications. Moreover, the idea of disconnecting inactive electrical devices, such as televisions, or placing PCs or other IT equipment into low-power standby mode when buildings are vacant was proposed in [15]. Finally, the authors in [16] showed that the accuracy of demand forecasts for buildings could be notably improved when occupancy data are included in the forecast models.

Regarding privacy, the Electronic Frontier Foundation, an international public interest digital rights group, has classified privacy concerns related to smart meters into four interrelated categories: Individuated patterns, which refers to inferring consumers’ behavior patterns by analyzing their electric appliance usage; Real-time surveillance, which is the live monitoring of consumer behavior; Information detritus, which involves sharing consumer data with third parties; and finally, Physical invasion, which involves establishing whether the household is at home, for the purposes of burglary or assault [17].

Smart meters may at some time in the future include device power signature detection, that is, Non-Intrusive Load Monitoring (NILM). NILM has the potential, through consumer data analytics, to give detailed knowledge of a household’s activities, though currently the meter reading frequency reported to the energy provider is probably insufficient [11,18]. For example, in most AMIs, meter readings are relayed at intervals ranging from every hour to every five minutes, which is very far from the sampling rate of 1 to 1.2 kHz needed for accurate NILM [18].

Eavesdropping and interception of smart meter communications can allow hackers to monitor a household’s electricity consumption in real time. This is partly due to the very heterogeneous, complex and large-scale nature of AMIs. In fact, the smart grid can be directly impacted by all vulnerabilities inherent to the communication systems. For example, in the measurement study in [19], the authors showed that the frequency hopping algorithm used in many smart meters could be reverse engineered in a matter of a few days. Disclosure of meter data to third parties is another privacy concern, especially considering the lack of consistent privacy protection laws [20]. Such concerns will indeed be intensified if meter data can be used to predict the future behavior of consumers, which is investigated in this study in the context of building occupancy status.

Building occupancy detection is another important privacy concern related to smart meters. According to the study in [21], occupancy is frequently cited as the most determining factor in burglary target selection. Besides, the occupancy status of a household can reveal other information, such as daily routines or the time and frequency of travel, which could be used for other purposes including targeted advertising [5].

Beyond smart meters, existing approaches for establishing building occupancy use either specialized sensors, such as cameras [22–24], passive infrared (PIR) sensors [25,26], and RFID tags [27], or non-specialized methods, including measuring WiFi signals [26,28,29] or using sensors that measure CO2 [30] or temperature [31].

For establishing occupancy from smart meter data, most previous studies deployed methods that fall into two categories: thresholding and machine learning classification. As its name implies, the first category involves applying thresholds to the total consumption and to the variations of demand over successive meter readings. The underlying intuition is that both the amplitude and the variations of consumption are lower when the household is away. In [32], the authors used night-time meter readings to determine the background load for defining the threshold. However, since the relationship between occupancy and electricity demand may not be straightforward, thresholding does not generalize well [5]. Machine learning classification models, on the other hand, make it possible to detect more complex patterns from historical data.
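The thresholding idea described above can be illustrated with a minimal sketch. It is not the method of [32] or of any cited study: the use of the median of night-time readings as the background load, and the constants `margin` and `delta`, are assumptions chosen only for illustration.

```python
from statistics import median

def threshold_occupancy(readings, night_readings, margin=1.5, delta=0.05):
    """Flag each reading as occupied (True) or vacant (False).

    readings        -- consecutive meter readings (e.g. kWh per time-slot)
    night_readings  -- readings taken at night, used to estimate background load
    margin, delta   -- hypothetical tuning constants, not values from the paper
    """
    if not readings:
        return []
    # Assumption: the background load is the median night-time reading.
    background = median(night_readings)
    level_threshold = margin * background
    occupied = []
    previous = readings[0]
    for value in readings:
        high_level = value > level_threshold            # amplitude test
        high_variation = abs(value - previous) > delta  # variation test
        occupied.append(high_level or high_variation)
        previous = value
    return occupied
```

A slot is treated as occupied when either the demand level or its change from the previous reading exceeds its threshold, which mirrors the intuition that both amplitude and variation are lower when the household is away.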


Table 1. Summary of past studies related to occupancy detection using smart meter data.

Reference  No. households  Feature engineering  Timeline        Algorithm(s)
[32]       2               Manual               Past            Thresholding and Segmentation
[33]       5               Manual               Past            Thresholding, SVM, HMM, KNN
[34]       5               Manual               Past            Thresholding, SVM, KNN
[35]       1               Manual               Present         SVM, KNN
[8]        5               Manual               Past & Present  Unsupervised Algorithms
[36]       1               None                 Past & Present  DT, NN and RF
[37]       1               None                 Past            Association Rules

There are, however, many classification algorithms available. Each algorithm has its advantages and disadvantages, making some algorithms more suitable for a given problem than others. Table 1 provides a summary of past studies related to occupancy detection using smart meter data. As shown, Support Vector Machine (SVM) and K-Nearest Neighbors (KNN) were deployed as supervised classifiers in [33–35]. The results presented in [8] suggest the superiority of Random Forests (RF) over SVM and Hidden Markov Model (HMM) classifiers. The authors in [36] used Decision Trees (DT) and Neural Networks (NN) for establishing occupancy from meter data. The study in [37] deploys an empirical association rules mining approach for establishing occupancy.

Since it is difficult to detect the occupancy status directly from raw electricity consumption, feature engineering can improve the performance of classification models by creating relevant attributes. Previous studies either have not used any form of feature engineering or have deployed only a few manually crafted features. Creating variables manually is tedious and subjective, and hence most likely results in a sub-optimal set of features. Moreover, the possibility of inferring the future occupancy status of residential buildings from meter data has not been addressed by previous studies. Finally, due to the unavailability of large-scale datasets with ground truth occupancy information, existing studies have considered only a limited number of residential buildings in their evaluations. As such, examining how household demographics relate to occupancy detection vulnerabilities has not been possible.

This study introduces a dynamic feature engineering approach based on Genetic Programming [38] for establishing occupancy in residential buildings from smart meter data. The results suggest that this approach notably improves the accuracy of occupancy classification models. Moreover, by analyzing the electricity consumption behavior of more than 5000 households over an 18-month period and deploying an array of machine learning methods, this study further illustrates that it is possible to accurately detect some future ‘away’ time-slots using the past smart meter data of residential buildings. This finding has an important privacy implication for the storage and governance of smart meter data. The study also examines the extent to which households of different demographics are vulnerable to occupancy detection attacks. The next section describes the methodology of this study.

3. Methods

This section starts with an overview of the data sources used in this study and then proceeds to the data pre-processing and feature engineering methods. Finally, the machine learning algorithms and performance measures used throughout the study are described and discussed.

3.1. Datasets

One of the primary datasets used in this study for predicting the occupancy status of households was collected through the Customer Behaviour Trials (CBT), which were administered by the

Commission for Energy Regulation (CER), the regulator for the energy sector in Ireland [39]. The Irish CBT is considered one of the most extensive smart metering trials conducted globally to date [39]. With more than 5000 participating households, the CBT was administered during 2009 and 2010. The dataset includes half-hourly meter data along with detailed socio-demographic attributes of the participating households.

While the CBT data are unique in the sense that they have high temporal resolution, a large sample size, and detailed demographic information on households, they do not include home occupancy information, as this would have required installing occupancy detection sensors in all households. However, to train and evaluate models that predict future home occupancy based on past meter data (which is the objective of this study), the ground truth occupancy status is required. One solution would be to use thresholding [32] or unsupervised machine learning models [8] to establish the occupancy status of the households in the CBT dataset and to use these estimates as a proxy for ground truth occupancy information in further analysis. The problem, however, is that these methods have high error rates: their out-of-sample accuracy was reported to vary from 0.75 to 0.91 [8,32]. Even at the upper bound accuracy of 0.91, the estimates would be subject to a 9% error, which is too high for them to serve as a proxy for ground truth values. Instead, this study deploys a supervised approach, described below, for establishing the occupancy of the households in the CBT data.

The Electricity Consumption and Occupancy (ECO) dataset [40], the Dutch Residential Energy Dataset (DRED) [41], and the Smart∗ dataset [42] are high-resolution datasets that capture households’ meter readings and their occupancy status. Table 2 provides a summary of the datasets. In these datasets, the occupancy information was collected through a combination of tablet computers (i.e., manual labeling), Bluetooth, and passive infrared sensors. However, these trials considered a smaller number of households and, unlike the CBT dataset, did not capture detailed demographic information about the households. To take advantage of the larger sample size and the availability of household demographic information in the CBT dataset, we used the ECO, DRED, and Smart∗ datasets to train an occupancy detection model, which can then be applied to the CBT dataset to infer the occupancy status of the households. This was only possible because, with the aid of the proposed feature engineering approach, the models trained on the ECO, DRED and Smart∗ datasets were shown to produce very accurate estimates and to generalize very well across different datasets. This will be discussed in more detail in the following sections.

Fig. 1 illustrates a high-level overview of the modeling framework of this study. As shown, to ensure compatibility with the CBT dataset, meter readings and occupancy status in the ECO, DRED, and Smart∗ datasets were aggregated to 30-min intervals. At the aggregate level, for each 30-min interval, the occupancy status is defined as a binary term where the building is flagged as “vacant” only if there was no one in the building during the entire 30-min time-slot.
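The aggregation rule above can be sketched as follows. This is a minimal illustration rather than the authors' pre-processing code: the function name `aggregate_to_slots` is hypothetical, and averaging the readings within a slot is an assumption, since the text does not state whether readings were averaged or summed.

```python
def aggregate_to_slots(power, occupied, samples_per_slot):
    """Aggregate high-resolution samples into fixed-length time-slots.

    power            -- per-sample meter readings (numeric)
    occupied         -- per-sample booleans, True if anyone was at home
    samples_per_slot -- e.g. 30 for 1-min data, 1800 for 1-s data
    Returns (slot_means, slot_vacant); an incomplete trailing slot is dropped.
    """
    n_slots = min(len(power), len(occupied)) // samples_per_slot
    slot_means, slot_vacant = [], []
    for i in range(n_slots):
        start, end = i * samples_per_slot, (i + 1) * samples_per_slot
        window = power[start:end]
        # Assumption: the slot reading is the mean of its samples.
        slot_means.append(sum(window) / samples_per_slot)
        # "Vacant" only if nobody was present at any sample in the slot.
        slot_vacant.append(not any(occupied[start:end]))
    return slot_means, slot_vacant
```

With this rule a single occupied sample is enough to mark the whole slot as occupied, so the "vacant" label is conservative.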

Table 2. Summary of smart meter datasets used in this study.

Dataset  No. households  Country      Meter frequency  Duration   Occupancy  Demographics
CBT      5000+           Ireland      30 min           18 months  No         Yes
ECO      5               Switzerland  1 s              8 months   Yes        No
DRED     1               Netherlands  1 min            6 months   Yes        No
Smart∗   2               U.S.         1 min            3 weeks    Yes        No

Fig. 1. An overview of the steps involved in building machine learning models to predict future occupancy of households.

For future occupancy predictions, it would be possible to predict the future occupancy status of the CBT households using a model trained on the ECO, DRED and Smart∗ datasets. However, predicting future occupancy is more demanding than establishing past or present occupancy and requires longer training data than are available in the ECO, DRED and Smart∗ datasets. Consequently, as shown in Fig. 1, the occupancy status of the households in the CBT dataset was established first, and the resulting dataset was then used to train new machine learning models to predict the future occupancy of households.

3.2. Feature engineering

The representation of the variables has been shown to have a significant impact on the performance of most machine learning methods [9]. Feature Engineering (FE), also known as variable engineering, refers to the process of creating additional predictive features (variables) from the existing features in the dataset. These new features can be generated, for example, by combining existing features (e.g., ratios or differences) or by applying various transformations (e.g., a logarithmic transformation) to the raw variables. The problem, however, is that the number of combinations can grow very quickly, and evaluating all candidates may not be possible. To address this issue, this study deploys a Genetic Programming (GP) [38] approach to generate new features when building models to establish the reference occupancy status of households in the CBT dataset. In the context of feature engineering, GP provides an automated framework for creating new features by applying a set of defined primitive functions to the original features. The predictiveness (or the fitness in GP terminology) of each of


Table 3. List of primitive functions and GP parameters.

Type       Name               Description                                           Value
Parameter  Size               No. features in each generation                       100
Parameter  Max functions      Max no. primitive functions in each feature           5
Parameter  Iterations         Maximum number of generations                         2000
Parameter  Reproduction rate  % of features created by reproduction                 10%
Parameter  Crossover rate     % of features created by crossover                    80%
Parameter  Mutation rate      % of features created by mutation                     10%
Function   Select(s, e)       Select meter readings from position s to e
Function   Max(A)             Returns the largest element of A
Function   Min(A)             Returns the smallest element of A
Function   Sum(A)             Returns the sum of the elements of A
Function   Range(A)           Returns the range of the elements of A
Function   Mean(A)            Returns the mean of the elements of A
Function   SD(A)              Returns the standard deviation of the elements of A
Function   SAD(A)             Returns the sum absolute delta of the elements of A
Function   Ratio(a1, a2) ∗    Returns the ratio of a1 to a2

∗ a1 and a2 are scalar numeric input variables which can be individual meter readings or a result of an aggregate function.

Fig. 2. Examples of features generated by the GP-based feature engineering framework: a) Two generated features, b) Mutation applied to one of the features, and c) Crossover product of features.

these newly created features is then quantified. The algorithm then creates a new generation of features by applying operations that are analogous to naturally occurring genetic operations. To do so, a subset of the new features with high fitness values is selected to create the features of the next generation (i.e., survival of the fittest). The new generation of features is produced by copying a selected feature from the previous generation (Reproduction), by randomly combining selected parts of two selected features (Crossover), or by randomly changing one part of a selected feature (Mutation). The intuition is that, by continuing this over generations, good composite features can be discovered by trying different combinations of promising sub-elements (i.e., genes).

Table 3 presents the primitive functions and GP parameters used in this study. Since the objective is to establish occupancy status from meter readings, features that express the amplitude (e.g., mean) or the variations (e.g., standard deviation) of the demand can be particularly predictive. The select(s, e) function is used to choose a window of meter readings starting at s and ending at e, where s and e are the positions of meter readings relative to the current time-slot. For example, select(−10, 0) would choose the current and the past 10 meter readings. Moreover, a fitness proportionate selection process is deployed, where the probability of a feature being selected as a parent is proportional to its fitness. We defined the fitness of each feature as the increase in the Area Under the Receiver Operating Characteristic curve (AUROC) of the model when that feature is added. More details on model evaluation and performance metrics (including AUROC) are provided in the following section.

Fig. 2 shows a few examples of features generated by the GP framework. More specifically, Fig. 2a shows two generated features. The first feature is calculated as the ratio of the standard deviation of the past five meter readings to their mean, and the second feature is simply the sum absolute delta of the four previous meter readings. Fig. 2b shows an example of mutation applied to the second feature, where the primitive function SAD() is replaced by Range(). Finally, Fig. 2c illustrates an example of crossover, where the left arm of the first feature is substituted by the second feature.
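To make the primitives and the example features of Fig. 2 concrete, the following is a minimal sketch, not the authors' implementation. Several details are assumptions: the exact window bounds used for "past five" and "four previous" readings, the use of the population standard deviation, the handling of a zero denominator in the ratio, and the clipping of negative fitness values in the selection step. All function names are hypothetical.

```python
import random
from statistics import mean, pstdev

def select(readings, t, s, e):
    """Readings at positions t+s .. t+e (inclusive), relative to time-slot t.

    Assumes t+s >= 0 so that the slice never wraps around.
    """
    return readings[t + s : t + e + 1]

def sad(a):
    """Sum of absolute deltas between successive elements of a."""
    return sum(abs(cur - prev) for prev, cur in zip(a, a[1:]))

def value_range(a):
    """Range of the elements of a (largest minus smallest)."""
    return max(a) - min(a)

def ratio(a1, a2):
    """Ratio of two scalars; mapping a zero denominator to 0.0 is an assumption."""
    return a1 / a2 if a2 != 0 else 0.0

# The two features described for Fig. 2a (window bounds are assumptions).
def sd_over_mean(readings, t):
    window = select(readings, t, -5, -1)
    return ratio(pstdev(window), mean(window))

def sad_of_previous_four(readings, t):
    return sad(select(readings, t, -4, -1))

# Mutation as described for Fig. 2b: SAD() replaced by Range().
def range_of_previous_four(readings, t):
    return value_range(select(readings, t, -4, -1))

def pick_parent(features, fitness, rng=random):
    """Fitness-proportionate selection of one parent feature.

    Negative fitness values are clipped to zero (an assumption); if no
    feature has positive fitness, the choice is uniform.
    """
    weights = [max(f, 0.0) for f in fitness]
    if sum(weights) == 0:
        return rng.choice(features)
    return rng.choices(features, weights=weights, k=1)[0]
```

In this sketch a feature is just a function of the readings and the current time-slot index; a fuller GP implementation would typically represent features as expression trees so that crossover and mutation can swap sub-trees.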

3.3. Machine learning algorithms and performance measures

Classification is the task of assigning an observation to one of several pre-defined categories using a training set of data containing observations whose category membership is known. A variety of machine learning models exist for classification, each with its advantages and disadvantages. To ensure that the reported prediction accuracies are not restricted by the capability of any particular algorithm, this research examined a wide array of methods, including various flavors of decision tree-based methods, Support Vector Machines (SVM), nearest neighbors algorithms and neural networks. Most of these algorithms are parameterized, and modifying those parameters can significantly influence the outcome of the learning process. In this study, we use an adaptive search algorithm proposed in [43] to optimize the hyper-parameters of each algorithm. Table 4 lists the algorithms used in this study along with their tuning parameters. Whenever a model was evaluated or a performance metric was reported, the results of 15 repeats of 10-fold cross-validation [43] were averaged.

The performance metrics used in this study include accuracy, precision, and the Area Under the Receiver Operating Characteristic curve (AUROC). Accuracy is defined as the fraction of instances correctly classified by the model, while precision is the fraction of true positive instances among the instances identified as positive by the model. Formally, these metrics can be defined as

Table 4. List of machine learning algorithms and their tuning parameters.

Algorithm                           Tuning parameters
Random Forests (RF) [44]            No. of sampled variables, mtry
Support Vector Machines (SVM) [45]  Cost of constraints violation, C
K-Nearest Neighbors (KNN) [46]      No. of neighbors, K
Neural Networks (NN) [47]           Weight decay, decay; No. of hidden layer units, size
Gradient Boosting (GB) [48]         No. of trees, n.trees; Depth, interaction.depth; Shrinkage parameter, shrinkage

Table 5. The accuracy of the best classification model for establishing past home occupancy from meter data when the model is trained on one dataset and is tested on another dataset. Each cell lists the accuracy with feature engineering first (bold in the original) and without feature engineering second. Repeated cross-validation tests are used when a model is trained and tested on the same dataset.

Training dataset  Testing dataset
                  ECO             DRED            Smart∗
ECO               (0.988, 0.910)  (0.969, 0.816)  (0.974, 0.798)
DRED              (0.965, 0.843)  (0.994, 0.924)  (0.980, 0.862)
Smart∗            (0.913, 0.762)  (0.927, 0.825)  (0.956, 0.893)

the following:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

where TP is the number of true positive observations (i.e., positive observations that were correctly identified as positive by the model), TN represents the number of true negative observations, FP is the number of false positive cases (i.e., negative observations that were mistakenly predicted as positive by the classification model) and FN denotes the number of false negative records. AUROC, commonly referred to simply as the AUC, is computed by plotting the true positive rate, defined as the ratio of TP to TP+FN, against the false positive rate, defined as the ratio of FP to FP+TN, and calculating the area under the resulting curve. AUROC is frequently used for comparing models [49], where an AUROC value of 1 represents a perfect classification model.

The use of precision as a performance metric is especially relevant to this study. To investigate the privacy concerns that are linked to the predictive power of smart meter data, models were built that can identify time-slots during which households’ occupants are not at home (i.e., positive cases). The privacy concern will be significant if there is a model that can accurately determine even a small fraction of the time-slots during which the household is absent from home (i.e., high precision).

Finally, once a model is constructed, the importance of the input variables can be quantified by removing each variable from the model in turn and measuring the resulting drop in performance. Following the methodology in [43], the variable importance values in this study were scaled to have a maximum value of 100. In other words, the most significant variable had an importance value of 100, and the values for other variables were scaled accordingly.

4. Results and discussions

This section presents the results of different models for predicting home occupancy from meter readings. It starts with the detection of past and present home occupancy and then proceeds to discuss how future occupancy can be inferred from past meter data. Fig. 3 illustrates a high-level overview of these classification tasks. For the identification of home occupancy in the past and present, classifications are made for a specific time-slot (T0). For predicting future occupancy status, models seek to find n ‘away’ time-slots in a given period in the future, T. As shown, the meter readings are the main input variables to the models, although the models additionally include engineered features as well as some external information such as the time of the day, the day of the week, the month of the year and whether the day was a public holiday.

4.1. Past and present occupancy detection from smart meter data
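As a reference for the figures reported in this section, the metrics defined in Section 3.3 can be sketched in a few lines. This is an illustrative sketch, not the evaluation code used in the study. The AUROC is computed here through the standard rank formulation (the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half), which equals the area under the curve of true positive rate against false positive rate. The function names are hypothetical, and a label of True is assumed to mean ‘away’.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for boolean labels; True means 'away'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)

def precision(y_true, y_pred):
    # Assumes at least one instance was predicted positive.
    tp, _, fp, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp)

def auroc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Assumes both classes are present in y_true.
    """
    pos = [s for t, s in zip(y_true, scores) if t]
    neg = [s for t, s in zip(y_true, scores) if not t]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

The pairwise AUROC is quadratic in the number of samples, which is acceptable for a small illustration; a sort-based computation would be preferable for datasets of the size used in this study.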

We ﬁrst examined the possibility of inferring the occupancy status of the households in the past using their meter data as shown in Fig. 3a. Using the 30-min aggregated ECO, DRED and Smart∗ datasets, Fig. 4 shows the performance of various machine learning models in inferring the past home occupancy status of the households. The results were based on the average of 15 repeats of 10-fold cross validations on the combined dataset. The models were trained using training record consisted of ﬁve hours of past and ﬁve hours of future meter readings, (i.e., T−10 to T10 ) plus additional features generated through the GP-based feature engineering process. For scenarios where feature engineering (FE) was not deployed, a set of pre-deﬁned attributes similar to [8] was used. The results suggest the high predictive power of meter data in establishing home occupancy. More speciﬁcally, with feature engineering (FE), the AUROC values were above 0.96, and the precision was above 0.99 for GB, RF, and SVM. Regarding the accuracy, SVM, with the accuracy of 0.976, outperformed other models. Comparing the accuracy (Fig. 4b) and the precision (Fig. 4c) of the models suggests a higher rate of false positives (i.e., households were ‘away’ but were classiﬁed as ‘at home’) compared to the false negatives (i.e., households were ‘at home’ but were classiﬁed as ‘away’). This is mainly due to the variability of the background load, such as heating and cooling systems, which has caused the models to miss some of the ‘away’ time-slots. Fig. 5 shows an illustrative example of how different performance metrics are improved with the GP-based feature engineering process when using different classiﬁcation algorithms. As shown, many promising features are discovered during the early generations and then the discovery process slows down. The performance improvements are also seen in the form of step increases which happens when new features are discovered. 
For most algorithms, the process converges after almost 1500 iterations. The results presented in Fig. 4 are based on the averaged cross-validation tests on the combined dataset. To examine how well the learning generalizes across households in different datasets, Table 5 shows the overall accuracy of the best model when it is trained on one dataset and tested on another. According to the table, with feature engineering (bold highlighted values), the accuracy values are above 0.965 when the model is trained on the ECO or the DRED datasets. The most likely reason for the drop in accuracy when the model is trained on the Smart∗ dataset is the relatively small size of that dataset (three weeks of training data compared to several months in ECO or DRED). Moreover, there is a notable gap between the reported accuracy measures with and without feature engineering. A closer examination of the high-fitness features generated by the GP algorithm revealed that many of these attributes contained normalized versions of the original meter readings. For example, Ratio(SAD(Select(−1, 1)), Mean(Select(−10, 10))), which is the normalized sum absolute delta of meter readings from T−1 to T1, had

R. Razavi, A. Gharipour and M. Fleury et al. / Energy & Buildings 183 (2019) 195–208


Fig. 3. Predicting home occupancy from meter readings: a) Inferring past occupancy, b) Inferring present occupancy and c) Predicting future occupancy.

Fig. 4. Various performance measures of classiﬁcation models (with and without feature engineering) in establishing past occupancy status of the households: a) AUROC, b) Accuracy and c) Precision when using ﬁve hours of past and ﬁve hours of future meter readings for model training.

Fig. 5. An illustrative example of the effect of the GP feature engineering process on a) AUROC, b) Accuracy and c) Precision.

appeared in 28% of the final set of features generated by the GP algorithm when using SVM. As a result of normalizing, the logic learned by the model is less specific to the absolute values of consumption, and therefore less specific to particular households. Of course, the feature engineering process provides more than simple normalization, as there are also notable differences between the performance of models that were trained and tested on the same datasets via repeated cross-validation tests, as shown in Table 5. Similarly, Fig. 6 shows the performance of different classification models in establishing the present occupancy status of the households (see Fig. 3b). The results are shown for models trained using ten hours of past meter readings (i.e., T−20 to T0). Compared to the results presented in Fig. 4, the performance measures deteriorate. This was expected since, although the total monitoring period was the same in both cases, immediate future meter readings (e.g., T1 or T2) were not available when predicting present occupancy. As shown in Fig. 6, with feature engineering, GB resulted in the highest AUROC of 0.925 (down from 0.970 in Fig. 4a), while in terms of accuracy, SVM outperformed the other models with an accuracy of 0.922 (down from 0.976 in Fig. 4b). The precision values, however, are still very high: GB and SVM both have prediction precisions above 0.98. Fig. 7 illustrates the performance of the best model (with feature engineering) for establishing the occupancy status of the households in the past and at the present when the monitoring


Fig. 6. Various performance measures of classiﬁcation models (with and without feature engineering) in establishing present occupancy status of the households: a) AUROC, b) Accuracy and c) Precision when using ten hours of past meter readings for model training.

Fig. 7. Effect of the monitoring period on the performance of occupancy classiﬁcation models: a) Past occupancy detection, b) Present occupancy detection.

period was varied. In this figure, the vertical axis is a dimensionless quantity in the range zero to one and represents the values of the accuracy, precision and AUROC metrics. Moreover, the monitoring period consisted of an equal number of past and future meter readings when past occupancy status was inferred. As expected, a more extended monitoring period resulted in more accurate predictions, although the improvements were less visible when the monitoring period was increased beyond 16 h. Using 24 h of meter readings, the best model for inferring the past occupancy of the households had an average accuracy of 0.983 and an average precision of 0.997. This was a GB model, which was subsequently used to predict the occupancy status of the households in the CBT dataset for building future occupancy models. This means that our proxy for the ground truth occupancy status of the CBT households is 98.3% accurate on average (with a precision of 0.997), which is acceptable for the purpose of this study. To further ensure that the model developed using the combined ECO, DRED and Smart∗ dataset is appropriate for the households in the CBT dataset (i.e., that the model generalizes well to those households and the performance metrics are similar), formal hypotheses were formulated and tested. Specifically, we tested the claim that the performance metrics (i.e., AUROC, accuracy, and precision) of the CBT present occupancy detection model are within a 2% margin of the performance of the combined ECO, DRED, and Smart∗ dataset model. In other words, the performance of the CBT present occupancy detection model would not be more than 2% (0.02 in absolute value for metrics in the range 0–1) below the performance of the combined ECO, DRED, and Smart∗ dataset model on any of the metrics.
Let $R_C$, $A_C$, and $P_C$ denote the AUROC, the accuracy, and the precision of the present occupancy detection model from the past meter readings for the CBT data, respectively, and let $R_{EDS}$, $A_{EDS}$, and $P_{EDS}$ represent the same metrics for the combined ECO, DRED, and Smart∗ dataset model. The following three hypotheses were defined and tested.

$$H_{a0}\ \text{(Null)}: R_{EDS} - R_C \geq 0.02 \quad \text{versus} \quad H_{a1}\ \text{(Alternative)}: R_{EDS} - R_C < 0.02 \tag{1}$$

$$H_{b0}\ \text{(Null)}: A_{EDS} - A_C \geq 0.02 \quad \text{versus} \quad H_{b1}\ \text{(Alternative)}: A_{EDS} - A_C < 0.02 \tag{2}$$

$$H_{c0}\ \text{(Null)}: P_{EDS} - P_C \geq 0.02 \quad \text{versus} \quad H_{c1}\ \text{(Alternative)}: P_{EDS} - P_C < 0.02 \tag{3}$$

To test these hypotheses, the performance metrics for 100 repeats of 10-fold cross-validation of the models were recorded. Fig. 8 illustrates and compares the empirical distributions of the AUROC (Fig. 8a), accuracy (Fig. 8b), and precision (Fig. 8c) of the models from the two datasets based on the cross-validation tests. The results suggest that the performance metrics are very similar in both cases. To formally test hypotheses 1–3, two-sample t-statistic tests were performed. Table 6 presents the mean and standard deviation (SD) of the metrics for the two datasets as well as the t-scores and p-values of the tests. According to the results, the p-values are very small for all three metrics, suggesting that the null hypothesis can be rejected in favor of the alternative in all three cases. In other words, the model from the combined ECO, DRED, and Smart∗ dataset generalizes well to the CBT households, and performance deviations would not be more than 2% at worst, which is acceptable for the purpose of this study. To better understand the performance of the present occupancy detection model for the CBT households (after establishing their occupancy status by applying the model from the combined ECO, DRED, and Smart∗ datasets), and to gain insight into the limitations of the model, the classification confusion matrix was derived and is presented in Fig. 9, where the occupancy status of a total of one million half-hourly time-slots of the households in the CBT dataset is examined through cross-validation tests. Since the objective of the model was to detect vacant time-slots, when the building was completely unoccupied, positive predictions refer to time-slots in which the building is identified as vacant by the model.
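The margin test behind hypotheses 1–3 can be illustrated with a short sketch. It rests on assumptions that the text does not state explicitly: that each sample holds n = 1000 values (100 repeats of 10 folds), that the variances are not pooled, and that the lower-tail p-value can be approximated with a normal distribution. The function name `margin_t_test` is introduced here for illustration only.

```python
import math


def margin_t_test(mean_ref, sd_ref, mean_new, sd_new, n, margin=0.02):
    """One-sided two-sample test of H0: mean_ref - mean_new >= margin.

    Returns the t-score and a lower-tail p-value. The p-value uses a
    normal approximation (an assumption; reasonable for large n).
    """
    standard_error = math.sqrt(sd_ref ** 2 / n + sd_new ** 2 / n)
    t_score = (mean_ref - mean_new - margin) / standard_error
    p_value = 0.5 * math.erfc(-t_score / math.sqrt(2))  # P(Z <= t_score)
    return t_score, p_value


# AUROC summary statistics as reported in Table 6; n = 1000 is assumed.
t_auroc, p_auroc = margin_t_test(0.9203, 0.0293, 0.9109, 0.0328, n=1000)
```

Under these assumptions the AUROC row gives a t-score of about −7.62, in line with the value reported in Table 6; the accuracy and precision rows come out close to, but not exactly at, the reported scores, which would be consistent with rounding of the published means and standard deviations.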


Fig. 8. Performance of the present occupancy detection model from past meter data when using the CBT dataset compared to using the combined ECO, DRED, and Smart∗ datasets. The empirical distributions of the a) AUROC, b) Accuracy, and c) Precision metrics are illustrated using repeated cross-validation tests.

Table 6
Results of two-sample t-statistic tests for hypotheses 1–3.

Metrics      ECO+DRED+Smart∗       CBT                   t-Score    p-Value
             Mean      SD          Mean      SD
AUROC        0.9203    0.0293      0.9109    0.0328      −7.6216    1.93e−14
Accuracy     0.9227    0.0295      0.9093    0.0377      −4.3247    8.03e−06
Precision    0.9645    0.0245      0.9531    0.0302      −7.0161    1.57e−12

Fig. 9. Confusion matrix for the identification of the present occupancy status of the CBT households from past meter readings using cross-validation tests.

In line with the previous results, the overall performance of the model was high, with a total misclassification rate of 9.1%. The model is shown to be conservative in identifying vacant slots, since only approximately 16.1% of time-slots (160,105 observations out of 1,000,000) are predicted as vacant by the model, while in reality households were away in approximately 24.3% of time-slots (243,001 observations out of 1,000,000). As discussed earlier, the variability of the background load causes the model to miss some of the vacant slots. However, the building was indeed vacant in more than 95% of the time-slots identified as unoccupied by the model (i.e., high precision). Moreover, to examine the effect of the household and building characteristics on the performance of the present occupancy detection model, Table 7 reports the accuracy and precision metrics across various attributes. From the results, it appears that the size of the household, defined as the number of people living in the house on a regular basis, played a significant role. For example, the precision was 0.981 for single occupied homes while it was 0.921 for households with more than three occupants. This is mostly because the consumption variability is higher when more people are in the house, which makes it more difficult to infer occupancy status from the demand pattern. Household age, defined as the age of the head of the household, as well as the employment status, did not show any significant effect. Examination of the house type and the number of bedrooms reveals that occupancy detection was more difficult for larger buildings, where the precision decreased from 0.992 for single bedroom homes to 0.929 for houses with more than four bedrooms. Larger houses are likely to have more electric appliances [1] and subsequently higher demand variability. Finally, the results suggest that occupancy detection was more difficult for houses where the heating and hot water systems used electricity, which leads to an increased background load and makes it more difficult to infer occupancy. To provide a more intuitive understanding of the details of the occupancy detection algorithm, a CART decision tree model for detecting the present occupancy status of households was constructed and is presented in Fig. 10, where the depth of the tree was limited to four levels (excluding the root) for the purpose of presentation. Moreover, to enhance readability and to simplify the representation of the rules, the input arguments of the Select() functions are shown as subscripts and superscripts of the corresponding aggregate functions. For example, Ratio(SAD_{−3}^{0}, Mean_{−18}^{0}), which appears at the root node of the tree, is the abstract representation of Ratio(SAD(Select(−3, 0)), Mean(Select(−18, 0))), that is, the sum absolute delta of the past four meter readings (T−3 to T0) normalized by dividing by the average consumption of the past 19 meter readings (i.e., the past 9.5 hours). Furthermore, the percentages of the vacant and occupied observations that fall into each node of the tree are shown at the top of each node on the left- and right-hand sides, respectively. For example, considering all of the time-slot observations (i.e., the root node), buildings were vacant in 24.3% of time-slots and were occupied in the remaining 75.7%.
At this root node, the decision tree examines whether the ratio of the sum absolute delta of the past four meter readings to the mean of the past 19 meter readings exceeds 0.83. When the buildings are vacant, the relative variations of the demand (and hence the sum absolute delta of subsequent meter readings) are smaller, since the background load mainly drives the demand. If the ratio falls below 0.83, the left branch of the decision tree is selected. In that decision node, the percentage of vacant time-slots increases to 45.1%. At this node, the algorithm examines whether the average consumption over the past two meter readings (Mean_{−1}^{0}) and the standard deviation of the

Table 7
Accuracy and precision of identifying the present occupancy status of households with regard to various household and building characteristics.

Household and building attributes                      Accuracy   Precision
All                                                    0.901      0.953
Household size                 1 person                0.916      0.981
                               2 people                0.908      0.969
                               3 people                0.885      0.921
                               3+ people               0.874      0.917
Household employment status∗   Employed                0.903      0.955
                               Unemployed              0.900      0.951
Household age∗                 ≤ 40                    0.903      0.961
                               > 40                    0.899      0.950
House type                     Apartment               0.931      0.990
                               Terraced house          0.913      0.976
                               Semi-detached house     0.902      0.957
                               Detached house          0.892      0.934
Number of bedrooms             1 bedroom               0.929      0.992
                               2 bedrooms              0.919      0.985
                               3 bedrooms              0.907      0.967
                               4 bedrooms              0.897      0.946
                               4+ bedrooms             0.889      0.929
Building heating source        Electrical              0.884      0.914
                               Non-electrical          0.903      0.958
Water heating source           Electrical              0.897      0.945
                               Non-electrical          0.906      0.964

∗ Note: Household employment status and household age refer to the employment status and age of the head of the household.

Fig. 10. An illustrative example of four level CART decision tree model for establishing the present occupancy status of the households.

demand over the past four readings (SD_{−3}^{0}) are both smaller than 0.34 kWh. The Max() function acts as a logical AND in this case, since the condition is not satisfied if either the mean or the standard deviation exceeds 0.34. Alternatively, if the check at the root node fails (right branch), the model examines the ratio of the standard deviation of the demand during the past four meter readings to the average demand during the past 21 readings. The conditioning and branching process continues until a terminal leaf node is reached. As expected, the purity of the nodes (i.e., the difference between the percentages of vacant and occupied records) increases when moving from the root towards the leaves. Close examination of the decision nodes suggests that the variations of the demand, shown by the SD() and SAD() functions, the amplitude of the demand, represented by Mean() and indirectly by SD(), as well as the time of the day, are important factors for establishing the occupancy status of the households.
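The root-node rule of the decision tree can be written out in a few lines. The sketch below is an illustration under stated assumptions rather than the authors' code: the Python names `select`, `sad`, `mean`, `root_feature` and `takes_left_branch` are introduced here, and only the Select, SAD, Mean and Ratio primitives and the 0.83 threshold come from the text.

```python
def select(readings, t0, lo, hi):
    """Readings from T(lo) to T(hi), inclusive, relative to slot index t0."""
    return readings[t0 + lo : t0 + hi + 1]


def sad(values):
    """Sum absolute delta: total absolute change between consecutive readings."""
    return sum(abs(b - a) for a, b in zip(values, values[1:]))


def mean(values):
    return sum(values) / len(values)


def root_feature(readings, t0):
    """Ratio(SAD(Select(-3, 0)), Mean(Select(-18, 0))).

    Raises ZeroDivisionError if the mean consumption is zero.
    """
    return sad(select(readings, t0, -3, 0)) / mean(select(readings, t0, -18, 0))


def takes_left_branch(readings, t0, threshold=0.83):
    """True when the ratio falls below the threshold (the branch in which
    the share of vacant time-slots rises)."""
    return root_feature(readings, t0) < threshold


flat_load = [0.2] * 19                        # background load only
active_load = [0.2] * 15 + [0.2, 1.0, 0.3, 1.2]  # appliances switching
```

Because the numerator and denominator scale together, multiplying every reading by a constant leaves the feature unchanged, which is one way to see why such normalized features are less specific to the consumption level of a particular household.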

Table 8 lists the top ten variables of the GB model for establishing the present occupancy status of the households. According to the table, the sum absolute delta of the past four meter readings normalized by the average consumption over the past 13 time-slots, Ratio(SAD_{−4}^{0}, Mean_{−12}^{0}), was the most important variable. Moreover, many of the top variables are normalized versions of the raw attributes, which can help the algorithm to generalize better across households where the absolute values of demand may vary significantly. Similar to the CART tree, attributes representing the variations and the amplitude of demand, as well as the time of the day, are shown to be relevant in establishing building occupancy status.

Table 8
Top ten variables in the GB model for predicting the present occupancy status of households.

Variable                                     Relative importance   Comment
Ratio(SAD_{−4}^{0}, Mean_{−12}^{0})          100                   SAD normalized by mean
Ratio(Mean_{−3}^{0}, Mean_{−18}^{0})         82.43                 Normalized mean
Ratio(SD_{−4}^{0}, SD_{−17}^{0})             79.54                 Normalized standard deviation
Ratio(SAD_{−5}^{0}, SD_{−18}^{0})            65.34                 SAD normalized by SD
Sum(SAD_{−2}^{0}, Mean_{−4}^{0})             62.18                 Mix of SAD and mean
Ratio(Mean_{−5}^{0}, Min_{−18}^{−7})         56.01                 Mean normalized by min
Hour                                         50.77                 Time of the day
Max(Range_{−4}^{0}, SAD_{−3}^{0})            48.17                 Logical mix of range and SAD
SD_{−6}^{0}                                  45.87                 Standard deviation
Ratio(Mean_{−4}^{0}, Range_{−12}^{−4})       44.31                 Mean normalized by range

The implications of the results presented in this section are as follows. First, it was shown that meter data are highly revealing of the past and present occupancy status of households. This finding is promising in applications where occupancy information is used to control electric appliances such as HVAC systems. However, the high precision values are particularly concerning from the privacy point of view. Compared to more general performance metrics such as AUROC or accuracy, precision is more directly linked to households’ privacy. For example, privacy is at great risk if the household was indeed away from home in a high percentage of the time-slots predicted as ‘away’. This is true even if the model fails to capture all of the ‘away’ slots. Moreover, the results show that, with appropriate feature engineering, the logic learned by the models generalizes well across different households. This suggests that, to predict the home occupancy of a given household, the model does not necessarily need to be trained using the past meter data and occupancy status of the same household. Instead, the model can be trained using data available from some households (e.g., publicly available datasets) and then be deployed to predict the occupancy of other households. In other words, all it takes to predict the home occupancy of a household is to collect a few hours of their meter readings. For example, using six hours of meter readings, the results suggest that the present occupancy status of a household can be predicted with a precision of 0.945. This implies a high level of privacy risk, especially considering the heterogeneity of the communication technologies used to transmit meter data from consumers to service providers. For example, the study in [50] identifies 17 different smart meter communication standards with varying levels of security provisioning that are currently deployed in Europe. Ongoing efforts to coordinate the development of consistent and compatible standards for information management in the smart grid suggest that more harmonization of the standards is likely in the future.
However, in the meantime, security and privacy should be considered critical non-functional factors when utilities evaluate the choice of technology and standards for their current deployments, especially given the high cost of meter replacements.

4.2. Future occupancy detection from smart meter data

This section investigates the possibility of inferring the future occupancy status of residential buildings using past smart meter data. The privacy implications of compromised smart meter data would be much more significant if demand data were able not only to reveal the occupancy status of households in the past and present (as shown previously) but also to predict a household’s home occupancy status in the future. Needless to say, with accurate future predictions, criminals would have more flexibility for planning ahead. The intuition behind the hypothesis that smart meter data have predictive power to reveal the future home occupancy status of households is that most households have regular daily and weekly routines that repeat over time. This can be confirmed by examining the autocorrelation of households’ home presence status shown in Fig. 11, where daily peaks can be easily detected, and a local weekly peak can be spotted at the 7-day lag. Moreover, the autocorrelation coefficient has been plotted separately for weekdays and weekends. As expected, the coefficient

Fig. 11. Autocorrelation of households’ home presence status.

is smaller for weekends, when households have less regular routines. As discussed in the previous section, the best model trained on the combined ECO, DRED and Smart∗ dataset was applied to the CBT dataset to establish the occupancy status of the households. This was a GB model with a cross-validation accuracy of 0.982 and a precision of 0.997. Subsequently, using the CBT dataset along with the predicted occupancy status of the households, predictive models were constructed that use households’ demand data to identify their future home presence (see Fig. 3c). More specifically, these models seek to identify some ‘away’ slots in a one-week period. The models provide a classification score for every time-slot within the week (i.e., 336 half-hourly slots). The predictions are then rank-ordered, and the time-slots with the highest scores are identified. Using only one week of household meter data for model training, Fig. 12 illustrates the precision of various models in identifying ‘away’ slots in a one-week period. The models’ precision in identifying a varying number of ‘away’ slots is shown in Fig. 12a–c. Fig. 12a compares the performance of the models when the objective was to detect a single ‘away’ slot in a one-week period in the future. The results also show how the performance of the algorithms varies when predictions are made further into the future. Four forecast scenarios were considered: predictions made for the week just after the training data, and then for two weeks, one month and finally three months after the training data. As one might expect, the precision starts to decrease as predictions are made further into the future, owing to the reduced autocorrelation of the home presence of households over time. Moreover, the results confirm the predictive power of smart meter data in predicting future home presence status.
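The ranking step described above (score every slot of the week, rank the slots, keep the n highest) amounts to a precision-at-n measurement. The following sketch shows that calculation under simple assumptions: `precision_at_n` and its arguments are hypothetical names, and the scores would come from whichever classifier is being evaluated.

```python
def precision_at_n(scores, is_away, n):
    """Fraction of the n highest-scored time-slots that were truly 'away'.

    scores:  one classification score per time-slot (higher = more likely away)
    is_away: ground-truth flags for the same slots
    n:       number of 'away' slots the model is asked to identify
    """
    if not 0 < n <= len(scores):
        raise ValueError("n must be between 1 and the number of slots")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(1 for i in ranked[:n] if is_away[i]) / n


# Toy example with four slots; a real week would have 336 half-hourly slots.
scores = [0.9, 0.1, 0.8, 0.4]
is_away = [True, False, False, True]
```

A model that scores slots at random would be expected to reach a precision equal to the households' average 'away' rate, which is the baseline the text compares against.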
For the scenario when predictions are made just a week after the training data, the precision is just below 0.8. One should notice


Fig. 12. Precision of various models in identifying a) one, b) ﬁve and c) ten ‘away’ slot(s) in a week period in the future and when using one week of training data.

Fig. 13. Precision of various models in identifying a) one, b) ﬁve and c) ten ‘away’ slot(s) in a week period in the future and when using one month of training data.

Fig. 14. Precision of various models in identifying a) one, b) ﬁve and c) ten ‘away’ slot(s) in a week period in the future and when using three months of training data.

that the households’ average ‘away’ rate is around 0.19, which is equivalent to the precision of a random model. Comparing different models, GB is shown to outperform others by a small margin. Similarly, Figs. 12b and 12c compare the performance of various models for identifying ﬁve and ten ‘away’ time-slots. Despite the expected decreased precision, the models are still shown to be predictive. Fig. 13 shows similar results except that the training data now consists of one month of household meter data. As shown, the reported precision quantities have improved as a result, and the mean precision of identifying a single ‘away’ slot is approximately 0.85 for GB when predictions are made in the week following the training data. Finally, Fig. 14 represents the results for scenarios when three months of household meter data are available for model training. It is interesting to notice that the improvements

in precision are marginal, suggesting that one month’s worth of a household’s demand data is perhaps sufficient to capture most of their daily routines.

4.3. Precision by household demographics

The sensitivity of consumers to privacy concerns varies across different demographics [51]. As such, it is useful to investigate the privacy vulnerabilities of households by demographic segment. Table 9 presents the statistics of the autocorrelation of home presence and the precision of predictions for households based on their socio-demographic characteristics. The table shows the mean of the autocorrelation coefficient calculated over a seven-day lag period as well as the average daily and weekly peaks. It also presents the maximum precision of the models when predicting a


Table 9
Autocorrelation of households’ home presence status by their socio-demographic characteristics.

Household characteristics        Weekly average   1-day lag   7-day lag   Precision
Employment       Employed        0.101            0.282       0.193       0.892
                 Unemployed      0.086            0.248       0.171       0.813
Social class     Higher class    0.099            0.280       0.194       0.884
                 Lower class     0.086            0.251       0.168       0.802
Household size   1 person        0.107            0.301       0.204       0.915
                 2 people        0.097            0.272       0.181       0.870
                 2+ people       0.083            0.238       0.169       0.798
Age              ≤ 40            0.098            0.281       0.195       0.890
                 > 40            0.087            0.252       0.171       0.809

single ‘away’ slot in the week immediately following the training data, which consisted of three months of past consumption data (see Fig. 14a). As might be expected, the results suggest that the autocorrelation coefficient, and subsequently the prediction precision, are higher for households in which the head of the household is employed, as well as for families of a higher social class. It is interesting to observe that both quantities decrease notably with an increase in the size of the household. This is because the home occupancy status depends on the schedules of all household members, and it evidently becomes less predictable as the number of people in the household increases. Finally, because older households (with members 40 years and above) tend to have larger families (e.g., more teenage children at home) and also include non-working, retired individuals, their home presence pattern is less predictable. The study in [51] suggests that age is a key factor in predicting the privacy sensitivity and the privacy-protective behavior of individuals, although the relationship is shown to be non-linear. From the results, it appears that the age group 46–65 is the most conservative and privacy-protective, while young individuals are the least privacy-sensitive. Education and social class are also shown to be inversely related to privacy sensitivity and protective behaviors. In the context of home occupancy prediction, this implies that the household segments that are most vulnerable to the privacy implications of smart meters (i.e., young, educated professional individuals and couples) are those with the least privacy sensitivity and protective behavior. The results presented in this section confirm a significant predictive power of smart meter data in establishing the future occupancy of households.
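The autocorrelation statistics used in Fig. 11 and Table 9 can be illustrated with a short sketch. The estimator below is the common sample autocorrelation and is an assumption, since the text does not state which estimator was used; the name `autocorr` and the synthetic two-week presence series are introduced here for illustration only.

```python
def autocorr(series, lag):
    """Sample autocorrelation of a series at a given lag (in time-slots)."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must be between 0 and len(series) - 1")
    m = sum(series) / n
    variance = sum((v - m) ** 2 for v in series)
    if variance == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    covariance = sum((series[i] - m) * (series[i + lag] - m) for i in range(n - lag))
    return covariance / variance


SLOTS_PER_DAY = 48  # half-hourly slots

# Synthetic household: at home (1) except between 09:00 and 17:00 every day.
day = [0 if 18 <= slot < 34 else 1 for slot in range(SLOTS_PER_DAY)]
presence = day * 14  # two weeks with a perfectly regular routine

daily_peak = autocorr(presence, SLOTS_PER_DAY)       # 1-day lag
half_day = autocorr(presence, SLOTS_PER_DAY // 2)    # 12-hour lag
```

For this perfectly regular routine the 1-day lag gives a strong positive coefficient while the 12-hour lag is negative; real households are less regular, which is why the coefficients in Table 9 are much smaller.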
This implies that, in addition to securing the communication medium, meter data need to be protected in Meter Data Management Systems (MDMS), especially when storage servers are located in outsourced environments such as public or hybrid cloud environments [52]. Encrypted storage solutions, such as homomorphic encryption [53], are considered promising, although they may come with significant computational overhead. In addition, the implementation of such solutions requires careful consideration. For example, the encryption may be easily broken [54] or be subject to statistical attacks [55] if the blocks of encrypted meter data are small.

5. Conclusion

The introduction of smart meters within an AMI has brought potential advantages for both energy providers and consumers. However, the possibility of high-frequency meter readings raises the question of household privacy, especially in relation to building occupancy detection. This paper introduces a dynamic genetic programming based feature engineering process for detecting the occupancy of residential buildings from their smart meter data. The study shows that, using this method, machine learning classifiers for occupancy detection from meter data generalize very well across different households. This means that a model

can be trained using demand information (i.e., independent variables) and home presence information (i.e., dependent variable) of some households (e.g., publicly available datasets), and then be deployed to predict the home occupancy status of other households. This approach results in a performance much superior to unsupervised methods. Using this approach enabled us to study building occupancy detection from smart meter data at a large scale for the first time. This paper’s results show that real-time smart meter data are a strong predictor of the home presence of households (AUROC ≈ 0.98). However, the privacy implications of compromised smart meter data would be much more significant if demand data were able not only to reveal the occupancy status of households in real time but also to predict the households’ home occupancy in the future. Using half-hourly electricity consumption data from more than 5000 households over 18 months, our results indicate that even with just one week’s worth of households’ demand data, one can identify some ‘away’ slots with high precision. The precision, however, is shown to fall rapidly when predictions are made further into the future. For example, using one week of households’ meter data as predictors, with the GB model, the precision of detecting a single ‘away’ time-slot in the week following the input data is 0.79, while the same quantity is 0.56 when predictions are made for three months later. The results also indicate that increasing the training data from one week to one month improves the precision of the predictions, while the effect of a further increase to three months is marginal. The precision of predictions is expected to improve further if additional information such as weather, local events, and neighborhood information is integrated into the models.
The result also suggests that the accuracy of predictions is highest for the younger working professionals who live in small households (i.e., singles and couples). Previous studies indicate that this segment are least concerned with privacy and privacy-protective behaviors in general. These ﬁndings suggest the importance of securing the communication and storage infrastructure used for smart meter data as well as the need for appropriate and consistent legislation for governance of such data. Conﬂicts of interest statement The authors whose names are listed immediately below certify that they have NO aﬃliations with or involvement in any organization or entity with any ﬁnancial interest (such as honoraria; educational grants; participation in speakers’ bureaus; membership, employment, consultancies, stock ownership, or other equity interest; and expert testimony or patent-licensing arrangements), or nonﬁnancial interest (such as personal or professional relationships, afﬁliations, knowledge or beliefs) in the subject matter or materials discussed in this manuscript. References [1] S. D’Oca, T. Hong, J. Langevin, The human dimensions of energy use in buildings: a review, Renewable Sustain. Energy Rev. 81 (2018) 731–742.

208

R. Razavi, A. Gharipour and M. Fleury et al. / Energy & Buildings 183 (2019) 195–208

[2] F. Wang, Q. Feng, Z. Chen, Q. Zhao, Z. Cheng, J. Zou, Y. Zhang, J. Mai, Y. Li, H. Reeve, Predictive control of indoor environment using occupant number detected by video data and CO2 concentration, Energy Build. 145 (2017) 155–162. [3] T. Leephakpreeda, Adaptive occupancy-based lighting control via grey prediction, Build. Environ. 40 (7) (2005) 881–886. [4] J. Schein, S.T. Bushby, N.S. Castro, J.M. House, A rule-based fault detection method for air handling units, Energy Build. 38 (12) (2006) 1485–1492. [5] D. Chen, S. Kalra, D. Irwin, P. Shenoy, J. Albrecht, Preventing occupancy detection from smart meters, IEEE Trans. Smart Grid 6 (5) (2015) 2426–2434. [6] K. Sangani, You’re being monitored, Eng. Technol. 5 (10) (2010) 28–29. [7] Z. Chen, C. Jiang, L. Xie, Building occupancy estimation and detection: a review, Energy Build. 169 (2018) 260–270. [8] M. Jin, R. Jia, C.J. Spanos, Virtual occupancy sensing: using smart meters to indicate your presence, IEEE Trans. Mob. Comput. 16 (11) (2017) 3264–3277. [9] Y. Bengio, A. Courville, P. Vincent, Representation learning: a review and new perspectives, IEEE Trans. Pattern Anal. Mach. Intell. 35 (8) (2013) 1798–1828. [10] M.K. Masood, C. Jiang, Y.C. Soh, A novel feature selection framework with hybrid feature-scaled extreme learning machine (hfs-elm) for indoor occupancy estimation, Energy Build. 158 (2018) 1139–1151. [11] N. Sadeghianpourhamami, J. Ruyssinck, D. Deschrijver, T. Dhaene, C. Develder, Comprehensive feature selection for appliance classiﬁcation in nilm, Energy Build. 151 (2017) 98–106. [12] J. Lu, T. Sookoor, V. Srinivasan, G. Gao, B. Holben, J. Stankovic, E. Field, K. Whitehouse, The smart thermostat: using occupancy sensors to save energy in homes, in: Proceedings of the 8th ACM Conference on Embedded Networked Sensor Systems, in: SenSys ’10, ACM, New York, NY, USA, 2010, pp. 211–224. [13] Y. Agarwal, B. Balaji, S. Dutta, R.K. Gupta, T. 
Weng, Duty-cycling buildings aggressively: the next frontier in hvac control, in: Proceedings of the 10th ACM/IEEE International Conference on Information Processing in Sensor Networks, 2011, pp. 246–257. [14] V.L. Erickson, A.E. Cerpa, Occupancy based demand response hvac control strategy, in: Proceedings of the 2Nd ACM Workshop on Embedded Sensing Systems for Energy-Eﬃciency in Building, in: BuildSys ’10, ACM, New York, NY, USA, 2010, pp. 7–12. [15] D. Chen, S. Barker, A. Subbaswamy, D. Irwin, P. Shenoy, Non-intrusive occupancy monitoring using smart meters, in: 5th ACM Workshop on Embedded Systems For Energy-Eﬃcient Buildings, 2013, pp. 1–8. [16] G.R. Newsham, B.J. Birt, Building-level occupancy data to improve arima-based electricity use forecasts, in: Proceedings of the 2Nd ACM Workshop on Embedded Sensing Systems for Energy-Eﬃciency in Building, in: BuildSys ’10, ACM, New York, NY, USA, 2010, pp. 13–18. [17] E. Quinn, Smart metering & privacy: existing law and competing policies: a report for the colorado public utilities commission, 2009. [18] N. Sadeghianpourhamami, J. Ruyssinck, D. Deschrijver, T. Dhaene, C. Develder, Comprehensive feature selection for appliance classiﬁcation in NILM, Energy Build. 151 (2017) 98–106. [19] I. Rouf, H. Mustafa, M. Xu, W. Xu, R. Miller, M. Gruteser, Neighborhood watch: security and privacy analysis of automatic meter reading systems, in: Proceedings of the 2012 ACM Conference on Computer and Communications Security, in: CCS ’12, ACM, New York, NY, USA, 2012, pp. 462–473. [20] S. Newell, M. Marabelli, Strategic opportunities (and challenges) of algorithmic decision-making: a call for action on the long-term societal effects of datiﬁcation’, J. Strategic Inform. Syst. 24 (1) (2015) 3–14. [21] C. Nee, A. Meenaghan, Expert decision making in burglars, Br. J. Criminol. 46 (5) (2006) 935–949. [22] J. Zou, Q. Zhao, W. Yang, F. 
Wang, Occupancy detection in the oﬃce by analyzing surveillance videos and its application to building energy conservation, Energy Build. 152 (2017) 385–398. [23] S. Petersen, T.H. Pedersen, K.U. Nielsen, M.D. Knudsen, Establishing an image-based ground truth for validation of sensor data-based room occupancy detection, Energy Build. 130 (2016) 787–793. [24] G. Diraco, A. Leone, P. Siciliano, People occupancy detection and proﬁling with 3d depth sensors for building energy management, Energy Build. 92 (2015) 246–266. [25] C. Duarte, K.V.D. Wymelenberg, C. Rieger, Revealing occupancy patterns in an oﬃce building through the use of occupancy sensor data, Energy Build. 67 (2013) 587–595. [26] Y. Zhao, W. Zeiler, G. Boxem, T. Labeodan, Virtual occupancy sensors for real– time occupancy information in buildings, Build. Environ. 93 (2015) 9–20. [27] M. Wang, X. Wang, G. Zhang, C. Li, Occupancy detection based on spiking neural networks for green building automation systems, in: Proceeding of the 11th World Congress on Intelligent Control and Automation, 2014, pp. 2681–2686. [28] H. Zou, H. Jiang, J. Yang, L. Xie, C. Spanos, Non-intrusive occupancy sensing in commercial buildings, Energy Build. 154 (2017) 633–643. [29] H. Zou, Y. Zhou, H. Jiang, S.-C. Chien, L. Xie, C.J. Spanos, Winlight: a wiﬁ-based occupancy-driven lighting control system for smart building, Energy Build. 158 (2018) 924–938.

[30] L.M. Candanedo, V. Feldheim, D. Deramaix, A methodology based on hidden markov models for occupancy detection and a case study in a low energy residential building, Energy Build. 148 (2017) 327–341. [31] L.M. Candanedo, V. Feldheim, Accurate occupancy detection of an oﬃce room from light, temperature, humidity and CO2 measurements using statistical learning models, Energy Build. 112 (2016) 28–39. [32] D. Chen, S. Barker, A. Subbaswamy, D. Irwin, P. Shenoy, Non-intrusive occupancy monitoring using smart meters, in: Proceedings of the 5th ACM Workshop on Embedded Systems For Energy-Eﬃcient Buildings, in: BuildSys’13, ACM, New York, NY, USA, 2013, pp. 9:1–9:8. [33] W. Kleiminger, C. Beckel, T. Staake, S. Santini, Occupancy detection from electricity consumption data, in: 5th ACM Workshop on Embedded Systems for Energy-Eﬃcient Buildings, 2013, pp. 1–8. [34] W. Kleiminger, C. Beckel, S. Santini, Household occupancy monitoring using electricity meters, in: Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, in: UbiComp ’15, ACM, New York, NY, USA, 2015, pp. 975–986. [35] A. Akbar, M. Nati, F. Carrez, K. Moessner, Contextual occupancy detection for smart oﬃce by pattern recognition of electricity consumption data, in: 2015 IEEE International Conference on Communications (ICC), 2015, pp. 561–566. [36] L. Yang, K. Ting, M.B. Srivastava, Inferring occupancy from opportunistically available sensor data, in: 2014 IEEE International Conference on Pervasive Computing and Communications (PerCom), 2014, pp. 60–68. [37] G. Tang, K. Wu, J. Lei, W. Xiao, The meter tells you are at home! non-intrusive occupancy detection via load curve data, in: 2015 IEEE International Conference on Smart Grid Communications (SmartGridComm), 2015, pp. 897–902. [38] M.G. Smith, L. Bull, Genetic programming with a genetic algorithm for feature construction and selection, Genetic Programm. Evolvable Mach. 6 (3) (2005) 265–281. 
[39] Electricity Smart Metering Customer Behaviour Trials (CBT) Findings Report, Technical Report, Irish Commission for Energy Regulation, 2011. [40] C. Beckel, W. Kleiminger, R. Cicchetti, T. Staake, S. Santini, The ECO data set and the performance of non-intrusive load monitoring algorithms, in: 1st ACM International Conference on Embedded Systems for Energy-Eﬃcient Buildings, ACM, 2014, pp. 80–89. [41] A.S. Uttama Nambi, A. Reyes Lua, V.R. Prasad, Loced: location-aware energy disaggregation framework, in: Proceedings of the 2Nd ACM International Conference on Embedded Systems for Energy-Eﬃcient Built Environments, in: BuildSys ’15, ACM, New York, NY, USA, 2015, pp. 45–54. [42] S. Barker, A. Mishra, D. Irwin, E. Cecchet, P. Shenoy, J. Albrecht, Smart∗ : an open data set and tools for enabling research in sustainable homes, in: Proceedings of the SustKDD workshop on Data Mining Applications in Sustainability, 2012. [43] M. Kuhn, Building predictive models in R using the caret package, J. Stat. Softw. 28 (1) (2008) 1–26. [44] L. Breiman, J. Friedman, C.J. Stone, R.A. Olshen, Classiﬁcation and regression trees, CRC press, 1984. [45] V. Vapnik, The Nature of Statistical Learning Theory, Springer Science & Business Media, 2013. [46] L. Jiang, Z. Cai, D. Wang, S. Jiang, Survey of improving k-nearest-neighbor for classiﬁcation, in: Fourth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 20 07), 1, 20 07, pp. 679–683. [47] J.J. Hopﬁeld, Artiﬁcial neural networks, IEEE Circuits Devices Mag. 4 (5) (1988) 3–10. [48] Y. Freund, R. Schapire, N. Abe, A short introduction to boosting, J. Jpn. Soc. Artif. Intell. 14 (771–780) (1999) 1612. [49] S.B. Taieb, R. Huser, R.J. Hyndman, M.G. Genton, Forecasting uncertainty in electricity smart meter data by boosting additive quantile regression, IEEE Trans. Smart Grid 7 (5) (2016) 2448–2455. [50] S. Erlinghagen, B. Lichtensteiger, J. 
Markard, Smart meter communication standards in europe a comparison, Renewable Sustain. Energy Rev. 43 (2015) 1249–1262. [51] A.B. Serwin, The demographics of privacy - a blueprint for understanding consumer perceptions and behavior, in: J.F. Delaney (Ed.), Social Media 2013: Addressing Corporate Risks, PLI, 2013, pp. 466–490. [52] M.R. Asghar, G. Dán, D. Miorandi, I. Chlamtac, Smart meter data privacy: a survey, IEEE Commun. Surv. Tut. 19 (4) (2017) 2820–2835. [53] X. Li, X. Liang, R. Lu, X. Shen, X. Lin, H. Zhu, Securing smart grid: cyber attacks, countermeasures, and challenges, IEEE Commun. Mag. 50 (8) (2012) 38–45. [54] R.A. Popa, C.M.S. Redﬁeld, N. Zeldovich, H. Balakrishnan, Cryptdb: processing queries on an encrypted database, Commun. ACM 55 (9) (2012) 103–111. [55] A. Ceselli, E. Damiani, S.D.C.D. Vimercati, S. Jajodia, S. Paraboschi, P. Samarati, Modeling and assessing inference exposure in encrypted databases, ACM Trans. Inf. Syst. Secur. 8 (1) (2005) 119–152.
