Journal of Psychosomatic Research 126 (2019) 109822
A validation study of a consumer wearable sleep tracker compared to a portable EEG system in naturalistic conditions
Thomas Svensson a,b,c,1,⁎, Ung-il Chung a,c,d, Shinichi Tokuno e, Mitsuteru Nakamura e, Akiko Kishi Svensson a,b,f,1
a Precision Health, Department of Bioengineering, Graduate School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8655, Japan
b Department of Clinical Sciences, Lund University, Skåne University Hospital, Malmö, Sweden
c School of Health Innovation, Kanagawa University of Human Services Graduate School, Research Gate Building Tonomachi 2-A 2, 3F, 3-25-10 Tonomachi, Kawasaki-ku, Kawasaki-shi, Kanagawa, Japan
d Clinical Biotechnology, Center for Disease Biology and Integrative Medicine, Graduate School of Medicine, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8655, Japan
e Voice Analysis of Pathophysiology, Graduate School of Medicine, The University of Tokyo, Japan
f Department of Diabetes and Metabolic Diseases, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan
ARTICLE INFO
Keywords: Portable EEG; Sleep duration; Sleep measures; Sleep tracking; Validation study; Wearable device
ABSTRACT
Objective: To compare a wearable device, the Fitbit Versa (FV), to a validated portable single-channel EEG system across multiple nights in a naturalistic environment.
Methods: Twenty participants (10 men and 10 women) aged 25–67 years were recruited for the present study. Study duration was 14 days during which participants were asked to wear the FV daily and nightly. The study intended to reproduce free-living conditions; thus, no guidelines for sleep or activity were imposed on the participants. A total of 138 person-nights, equivalent to 76,539 epochs, were used in the validation process. Sleep measures were compared between the FV and portable EEG using Bland-Altman plots, paired t-tests and epoch-by-epoch (EBE) analyses.
Results: The FV showed no significant bias with the EEG for the global sleep measures time in bed (TIB) and total sleep time (TST), and for calculated sleep efficiency (cSE = [TST/TIB] × 100). The FV had 92.1% sensitivity, 54.1% specificity, and 88.5% accuracy with a Cohen's kappa of 0.41, but a prevalence- and bias-adjusted kappa of 0.77. The predictive values for sleep (PVS; positive predictive value) and wakefulness (PVW; negative predictive value) were 95.0% and 42.0%, respectively. The FV showed significant bias compared to the portable EEG for time spent in specific sleep stages, for SE as provided by FV, for sleep onset latency, sleep period time, and wake after sleep onset.
Conclusions: The consumer sleep tracker could be a useful tool for measuring sleep duration in longitudinal epidemiologic naturalistic studies, albeit with some limitations in specificity.
Abbreviations: cSE, Calculated sleep efficiency; EEG, Electroencephalogram; EBE, Epoch-by-epoch; FV, Fitbit Versa; PSG, Polysomnography; PVS, Predictive value for sleep (equivalent to the positive predictive value); PVW, Predictive value for wakefulness (equivalent to the negative predictive value); REM, Rapid eye movement; SE, Sleep efficiency; SOL, Sleep onset latency; SPT, Sleep period time; TIB, Time in bed; TST, Total sleep time; WASO, Wake after sleep onset
⁎ Corresponding author at: Precision Health, Department of Bioengineering, Graduate School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8655, Japan.
E-mail addresses: [email protected] (T. Svensson), [email protected] (U.-i. Chung), [email protected] (S. Tokuno), [email protected] (M. Nakamura), [email protected] (A.K. Svensson).
1 Contributed equally to this study.
https://doi.org/10.1016/j.jpsychores.2019.109822
Received 9 July 2019; Received in revised form 28 August 2019; Accepted 30 August 2019
0022-3999/ © 2019 Elsevier Inc. All rights reserved.

1. Introduction
Insufficient or excessive sleep durations are emerging as important risk factors for a large number of health outcomes. Epidemiologic studies have identified clear associations of short and long sleep durations with the risk of incident diabetes [1], coronary heart disease [2], stroke [3], as well as with increased mortality from all causes [4], cardiovascular disease [5], and cancer [6]. A recognised limitation of existing epidemiologic studies, however, is the reliance on self-reported sleep duration, which tends to overestimate objectively measured sleep duration [7]. Moreover, the majority of epidemiologic studies rely on sleep duration provided at only one time point, which prevents repeated measures analysis.
There are many wearable devices which measure sleep parameters and which could potentially be used in longitudinal sleep studies. In recent years, a number of studies have investigated the sleep parameters of various wearables [8–15] in comparison to overnight polysomnography (PSG), the gold standard for evaluating sleep [16]. The major limitations of PSG are the small number of nights, often a single night, and the artificial environment; it is conducted in sleep laboratories in a completely dark, soundproof and climate-controlled room [17]. Some participants may be disturbed by the environment and measuring equipment during a PSG, leading to a “first night effect” with increased wakefulness and decreased total sleep time (TST) and sleep efficiency (SE) [18]. Moreover, on the day of a PSG, participants are often instructed to refrain from alcohol, caffeinated drinks, hypnotics and strenuous physical activity [17]. The question thus arises whether wearable devices validated only in laboratory settings over a single night could be considered valid for use in longitudinal studies aiming to identify habitual and dynamic living conditions. Indeed, previous studies have called for the study of wearable devices in non-controlled settings over longer periods of time [13]. One possibility to accomplish this is to use actigraphy, which has reasonable validity compared to PSG as the gold standard [19]. The limitations of actigraphy, however, are the comparatively low specificity for identifying wake [20] and the inability to investigate time spent in specific sleep stages. Contrary to PSG, home-based portable EEG systems impose no restrictions on study participants and can be used without professionally trained personnel.
The validation of wearable devices would thus occur in conditions where study participants are unrestricted with regard to their daily and nightly patterns, including factors which influence sleep duration, e.g., work time, physical activity, bedtime, and waketime. If indeed wearable devices are to be considered for use in epidemiologic studies lasting many months or years, the devices also need to be validated in naturalistic and free-living conditions. An additional benefit of using portable EEG systems compared to actigraphs in validation studies is the possibility to investigate both overall and stage-specific sleep measures. Despite the large number of wearable devices on the market, the present study selected the Fitbit Versa (FV) for validation for a number of reasons: i) the company has a large proportion of the market share of wearable sleep and activity trackers; ii) the FV was, at the time of conducting the study, Fitbit's most recent device; iii) the FV had yet to be validated; iv) it is relatively inexpensive; v) it has a battery life of 4–5 days; and vi) it has a designated application programming interface (API) which allows the pulling of participant data. The primary objective of this study was to investigate the agreement of the duration of global sleep parameters and specific sleep stages when comparing the measures of a wearable device, the FV, with a portable single-channel EEG system for one week in naturalistic conditions. The secondary objective was to calculate sensitivity, specificity, accuracy, predictive value for sleep, and predictive value for wakefulness of the wearable device's sleep and wake classification. We hypothesised that the FV would show good agreement with the portable single-channel EEG system and, based on previous studies, have high sensitivity and accuracy but low specificity.
2. Method
2.1. Participants
The participants of the present study were twenty healthy Japanese adults (10 men and 10 women; ages 25–67) recruited from Kanagawa and Tokyo. None of the participants reported any medical condition that was a cause for exclusion. Participant eligibility criteria are summarized in Table 1. All participants were informed about the purpose of the study, that participation was entirely voluntary, and that participation could be discontinued at any point without any disadvantage or imposed penalty for the participants. Participants were compensated for their participation by being given the wearable device. All research was performed in accordance with relevant guidelines/regulations. All participants provided written informed consent and the study was approved by the Ethical Committee of the Department of Bioengineering, Graduate School of Engineering, the University of Tokyo (approval number: KE18-7).
2.2. Study design
The aim of the study was to validate the consumer wearable device in naturalistic conditions. Consequently, no guidelines for sleep or activity were imposed on the participants. Study duration was 14 days, during which participants were asked to wear the FV (the device is described in detail below) continuously around the clock. Participants were provided with an internet-connected Android smartphone to synchronize the wearable device. Participants were asked to use a validated portable single-channel EEG system (SleepScope manufactured by SleepWell Co., Ltd.; the device is described in detail below) every night throughout the study period. The study sample size was determined from estimations of the expected within-subject standard deviation using the following formula:
$$1.96 \times \frac{s_w}{\sqrt{2n(m-1)}},$$
where $s_w$ is the within-subject standard deviation, $n$ is the study sample, and $m$ is the number of measurements per participant. With 7 measurements per participant, a within-subject standard deviation estimated to be within 15% of its true value at a significance level of 5% would require a study sample of n = 20 [21] (for n = 20 and m = 7, 1.96/√(2 × 20 × 6) ≈ 0.127). The 14-day study duration was determined in order to increase the likelihood of complete data for the required 7 nights, equivalent to a total of 140 person-nights used in the validation process. Data selection was based on successful use of the portable single-channel EEG, which was thus considered the rate-limiting step. Initial assessment of completeness and suitability of EEG data was determined by SleepWell, who did not have access to the corresponding FV measures.
The final selection of measures was conducted by TS and AKS based on the following criteria: a) each person had to contribute 7 nights, of which 5 were weeknights and 2 were weekend nights; b) there could be no missing FV data for the selected nights. Final selection was done by visual inspection for any data gap or missing data. No statistical analyses were conducted to compare EEG and FV data at the time of selection. A summary of the selection process is available in Fig. 1, with further details of the selection process provided under the description of the respective devices below.
2.3. Portable single-channel EEG system
The portable single-channel electroencephalogram (EEG) system (SleepScope) is manufactured by SleepWell (SleepWell Co., Ltd., Osaka, Japan). It is a registered medical device (Japanese Medical Device Certification number: 225ADBZX00020000) intended to record sleep staging, and is suitable for home use. The portable single-channel EEG system has shown acceptable agreement with PSG [22]. Participants were provided with a demonstration and written instructions on how to use the portable EEG system. In brief, it consists of a central AAA-battery-powered unit with an SD memory card. The device is manually turned on and off by the participant at bedtime and at wake-up, respectively. The EEG system uses self-adhesive and pregelled electrodes attached to the forehead and mastoid, collecting electrophysiologic signals using bipolar derivation. The electrophysiologic signals are amplified and filtered by a 12-bit analog-to-digital converter and pre-processing unit, with the sampling rate and filter setting being set at 128 Hz and 0.5–64 Hz, respectively. The waveforms recorded by the single-channel EEG system are appropriate for sleep staging (wake (W), R, N1, N2, and N3) as defined by version 2.3 of the American Academy of Sleep Medicine (AASM) Scoring Manual [23]. The EEG data collected for each participant was
Table 1. Participant inclusion criteria.

Inclusion criteria:
  - Adult aged 20–70
  - At least 6 months experience of using a smartphone
  - Highly compliant with study protocol

Exclusion criteria:
  - Long-term absence during the study period, expected travel, or shift work during the study period
  - Health issues determined by the Principal Investigator to interfere with study participation
  - Non-native speakers of Japanese
  - Confirmed or suspected sleep apnoea syndrome
  - Sensitive skin precluding use of wearable device or nightly EEG
  - Women suspected of being pregnant
  - Continuous medication use for the past three months or planned use of hypnotics during the study period

EEG = electroencephalogram.
inspected and analysed by SleepWell. Details of the epoch scoring can be found in the validation study by Yoshida et al. (2015) [22]. For the purpose of the present study, the following outcomes were used: time in bed (TIB; min), total sleep time (TST; min), sleep efficiency as TST/TIB × 100 (SE; %), sleep onset latency (SOL; min), sleep period time (SPT; min), wake after sleep onset (WASO; min), and time spent in N1, N2, N3 and REM sleep (min). TIB, TST, SE, SOL, SPT and WASO were provided as summary measures for every sleep episode (n = 138) in addition to the 30-second epochs. There were 280 EEG sleep episodes available for the 20 participants during the 14-day study period. Based on the initial inspection of raw data by SleepWell, 36 of these sleep episodes were excluded as they were considered unsuitable for analysis. Of the 244 remaining EEG sleep episodes available for analysis, 104 were excluded by TS based on the quality assessment provided by SleepWell of their EEG data. This quality assessment placed each night's EEG sleep data into one of five categories: A) suitable for analysis, B) analysis is possible but participant forgot to turn off the device, C) analysis is possible but data is recorded for < 4 h or the electrodes were taken off during the sleep episode, D) data can be analysed but is not recommended, and E) data cannot be analysed (no data or too much noise). Of the sleep episodes selected for the present study, 134 episodes were of quality A, three episodes were of quality B, and three episodes were of quality C.
2.4. Wearable device
The FV is a consumer wearable device manufactured by Fitbit Inc. It has a dedicated smartphone app which synchronises with the wearable using Bluetooth technology. Detailed information about the FV can be found on the Fitbit website (http://www.fitbit.com). In brief, the FV provides the user with information on activity (distance, step count), sleep (bedtime, waketime, sleep duration, sleep efficiency, and sleep
Fig. 1. Flowchart detailing the inclusion and exclusion of study measures for the portable EEG system and the wearable device (Fitbit Versa), respectively.
staging) as well as heart rate. The FV has two modes of registering sleep information: “classic” and “stages”. In accordance with information from Fitbit [24], “classic” mode provides simplified sleep patterns without information on stages, and occurs if: a) the device is worn too loose, b) the user manually initiates sleep mode, c) the user has slept for < 3 h, or d) the battery is critically low. All participants of this study were given a demonstration as well as written instructions on how to wear the FV in accordance with the manual (i.e., non-dominant hand, “not too tight”, and “a finger's width below the wrist bone”) [25]. All sleep recordings were obtained using the “normal” setting. A unique Fitbit username and password were created for each participant prior to the start of the study to allow for synchronization of the device with the mobile app throughout the study period. Researchers were blinded to the allocation of usernames to study participants. Application programming interface (API) calls were used to pull participant information pertaining to the following variables relevant for the present study: TIB, TST, SE, and time spent in light sleep, deep sleep, REM sleep, and wake, respectively. These variables were provided as summary measures for every sleep episode (n = 138) in addition to 30-second epochs for sleep episodes with information on sleep stages (n = 133). Fitbit does not provide information on all sleep measures. However, based on sleep data obtained through the API, we were able to extract SOL (time awake before first registered sleep), WASO (total time awake – time awake before first registered sleep – time awake after last registered sleep) and SPT (TST + WASO) for each sleep episode used in the final analysis. During the 14-day study period there were 350 sleep episodes collected using the FV (Fig. 1). Of these, 163 sleep episodes were excluded as they did not have corresponding measures with EEG recordings.
A further 18 FV measures without a corresponding EEG measure were excluded as they represented naps and daytime sleep occurrences. Finally, 4 measures were excluded as they represented 2 split nights, i.e., where the participants woke up during the night for a longer period without turning off the portable EEG device. Consequently, there were 138 person-nights of data with information on sleep duration for validation of the FV. Of the 138 person-nights, 18 participants contributed 7 nights each, and two participants contributed 6 nights each. For validation of sleep stages, FV measures representing simplified sleep patterns (i.e., “classic” mode; n = 5) were excluded. Data for sleep stage validation was available for 133 person-nights, where 16 persons contributed 7 nights each, two persons 6 nights each, one person 5 nights and one person 4 nights.
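The extraction of SOL, WASO and SPT described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `derive_sleep_measures`, the label `"W"` for wake, and the assumption that one list of 30-second epoch labels spans the whole time in bed are all hypothetical choices made for illustration.

```python
from typing import Dict, List

EPOCH_MINUTES = 0.5  # each epoch is 30 seconds


def derive_sleep_measures(epochs: List[str]) -> Dict[str, float]:
    """Derive SOL, WASO, SPT and cSE from one episode's epoch labels.

    Assumptions (not from the source): 'W' marks wake, any other label
    marks sleep, and the list covers the whole time in bed.
    """
    asleep = [label != "W" for label in epochs]
    if not any(asleep):
        raise ValueError("episode contains no sleep epochs")

    first_sleep = asleep.index(True)
    last_sleep = len(asleep) - 1 - asleep[::-1].index(True)

    tib = len(epochs) * EPOCH_MINUTES
    tst = sum(asleep) * EPOCH_MINUTES
    # SOL: time awake before the first registered sleep
    sol = first_sleep * EPOCH_MINUTES
    wake_after_last_sleep = (len(asleep) - 1 - last_sleep) * EPOCH_MINUTES
    total_wake = tib - tst
    # WASO: total wake minus wake before first sleep and after last sleep
    waso = total_wake - sol - wake_after_last_sleep
    # SPT: TST + WASO
    spt = tst + waso
    # cSE: [TST / TIB] x 100
    cse = tst / tib * 100
    return {"TIB": tib, "TST": tst, "SOL": sol,
            "WASO": waso, "SPT": spt, "cSE": cse}


if __name__ == "__main__":
    # Toy 4-minute episode: 1 min awake, sleep with one brief awakening,
    # then 30 s awake before getting up.
    print(derive_sleep_measures(["W", "W", "L", "L", "W", "D", "R", "W"]))
```

In this toy episode, wake before the first sleep epoch counts toward SOL and the final wake epoch is excluded from WASO, mirroring the subtraction given in the text.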
2.5. Epoch-by-epoch comparison
Sensitivity (proportion of EEG sleep epochs identified as sleep by FV), specificity (proportion of EEG wake epochs identified as wake by FV), accuracy (proportion of EEG epochs correctly identified, i.e., true sleep and true wake, by FV), predictive value for sleep [26] (PVS; the proportion of epochs identified by FV as sleep that are true sleep), and predictive value for wakefulness (PVW; the proportion of epochs identified by FV as wake that are true wake) were calculated using 30-second epoch-by-epoch (EBE) comparison of wake/sleep states. EBE data was provided by both SleepWell and Fitbit. Epoch data provided by Fitbit was available only for sleep events with recorded sleep stages (i.e., for 133 person-nights). The EEG and FV epochs were coded as sleep if the respective epochs were scored as N1, N2, N3, or R (EEG), or Light, Deep, or REM sleep (Fitbit); epochs were coded as wake if the respective epochs were scored as W. The naturalistic design of this validation study precluded any possibility of participants synchronising the single-channel EEG system and the wearable device at bedtime. Consequently, epoch data obtained through the EEG system was rounded to the nearest half or full minute mark. Overall, 76,539 epochs were used in the 30-second EBE comparisons.
2.6. Statistical analyses
Bland-Altman plots [27] were used to assess measurement agreement between equivalent measurements (i.e., TIB, TST, SE, extracted SOL, extracted SPT, extracted WASO, calculated SE (cSE = [TST/TIB] × 100), minutes spent in light sleep (EEG: N1 + N2), deep sleep (EEG: N3), and REM (EEG: R)) obtained using the portable EEG and the FV. Results for all outcome measures were plotted in Bland-Altman plots with the difference (Y-axis: [EEG – FV]) against the mean (X-axis: [EEG + FV]/2). The bias, i.e., the lack of agreement between the respective measures, was calculated as the mean difference between the EEG and the FV. A negative bias indicates overestimation and a positive bias indicates underestimation by the FV compared to the portable EEG. Limits of agreement were calculated as the mean difference ± 1.96 standard deviations (SD). Paired t-tests were used to compare portable EEG and FV sleep measures. Cohen's kappa was calculated to determine interrater reliability for epoch accuracy. Additionally, as recommended by de Zambotti et al. [28], we calculated the prevalence-adjusted and bias-adjusted kappa, which takes into account the bias index and prevalence index, respectively [29]. All statistical analyses were performed using Stata/MP version 15.1 (StataCorp LLC, College Station, TX, USA). P values were two-tailed and considered significant if p < .05.
2.7. Role of the funding source
The sponsors had no role in study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the paper for publication.
3. Results
The FV did not significantly differ from the portable EEG for the measures TIB and TST (Table 2, Table 3, and Fig. 2). However, compared with the portable EEG, the FV significantly overestimated SE (bias: 7.4%), time spent in deep sleep (36.2 min) and wake (25.4 min), and significantly underestimated time spent in light sleep (20.2 min) and REM sleep (7.8 min) (Fig. 3).
3.1. Calculation of SE, SOL, SPT and WASO using available measures
SOL, SPT and WASO were extracted using available FV data. Additionally, as the FV significantly overestimated SE compared to the portable EEG, we calculated an alternative sleep efficiency (cSE) using the following equation: $\mathrm{cSE} = \frac{\mathrm{TST}}{\mathrm{TIB}} \times 100$. cSE showed no significant bias compared with SE obtained from the portable EEG (Table 2, Table 3, and Fig. 4). On the other hand, compared with the portable EEG, the FV significantly underestimated SOL (13.6 min), while significantly overestimating both SPT (21.6 min) and WASO (14.2 min) (Table 2, Table 3, and Fig. 4).
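The epoch-by-epoch measures defined in Section 2.5, together with Cohen's kappa and the prevalence-adjusted and bias-adjusted kappa (PABAK) named in Section 2.6, can be sketched from a 2 × 2 sleep/wake table. This is an illustrative sketch only, not the Stata code used in the study; the function name `ebe_agreement` and the boolean coding (`True` = sleep) are assumptions. PABAK is computed with the standard two-category form, 2 × observed agreement − 1.

```python
from typing import Dict, Sequence


def ebe_agreement(reference_sleep: Sequence[bool],
                  device_sleep: Sequence[bool]) -> Dict[str, float]:
    """Epoch-by-epoch agreement of a device against a reference.

    True = sleep, False = wake (hypothetical coding). Assumes both
    sleep and wake occur in each series, otherwise a ratio is undefined.
    """
    if len(reference_sleep) != len(device_sleep):
        raise ValueError("series must contain the same number of epochs")

    tp = fn = fp = tn = 0
    for ref, dev in zip(reference_sleep, device_sleep):
        if ref and dev:
            tp += 1      # reference sleep, device sleep
        elif ref:
            fn += 1      # reference sleep, device wake
        elif dev:
            fp += 1      # reference wake, device sleep
        else:
            tn += 1      # reference wake, device wake

    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    # Agreement expected by chance from the marginal totals
    chance = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": accuracy,
        "PVS": tp / (tp + fp),
        "PVW": tn / (tn + fn),
        "kappa": (accuracy - chance) / (1 - chance),
        "PABAK": 2 * accuracy - 1,
    }


if __name__ == "__main__":
    eeg = [True, True, True, True, False, False, True, True, False, True]
    fv = [True, True, True, False, True, False, True, True, False, True]
    for name, value in ebe_agreement(eeg, fv).items():
        print(f"{name}: {value:.3f}")
```

As a consistency check on the reported figures, an accuracy of 88.5% gives 2 × 0.885 − 1 = 0.77, which agrees with the prevalence-adjusted and bias-adjusted kappa reported in the abstract.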
3.2. Epoch-by-epoch comparisons
The Fitbit Versa had a sensitivity of 92.1% (CI: 91.9%–92.3%) and a specificity of 54.1% (CI: 52.9%–55.2%). The PVS was 95.0% (CI: 94.9%–95.1%) and the PVW was 42.0% (CI: 41.2%–42.8%). Accuracy was 88.5% (CI: 88.2%–88.7%) with a Cohen's kappa of 0.41. The bias index was 0.03 whereas the prevalence index was −0.75. The prevalence-adjusted and bias-adjusted kappa was 0.77.
4. Discussion
This study has for the first time demonstrated that there is good agreement between a consumer wearable device and a portable single-channel EEG system for the sleep measures TIB and TST in naturalistic, free-living conditions over multiple nights. We have also found that four additional sleep measures (SOL, SPT, WASO, and cSE) could be
Table 2. Sleep measures of the portable single-channel EEG system and the Fitbit Versa, and their comparison using the paired t-test.

Sleep measure                  | Single-channel EEG           | Fitbit Versa                 | t      | p value
                               | Mean ± SD     | 95% CI       | Mean ± SD     | 95% CI       |        |
TIB (min)                      | 373.0 ± 77.2  | 360.0–386.0  | 382.3 ± 90.8  | 367.0–397.6  | −1.61  | 0.11
TST (min)                      | 321.5 ± 75.2  | 308.8–334.1  | 328.9 ± 78.6  | 315.7–342.1  | −1.67  | 0.10
SE (%)                         | 86.2 ± 8.5    | 84.7–87.6    | 93.6 ± 3.3    | 93.0–94.2    | −11.17 | < 0.001
cSE^a (%)                      | 86.2 ± 8.5    | 84.7–87.6    | 86.1 ± 4.0    | 85.4–86.8    | 0.13   | 0.90
SOL (min)                      | 17.0 ± 21.0   | 13.5–20.5    | 3.4 ± 4.9     | 13.5–20.5    | 7.61   | < 0.001
SPT^b (min)                    | 351.4 ± 77.8  | 338.3–364.5  | 373.1 ± 89.7  | 358.0–388.2  | −3.99  | < 0.001
WASO^c (min)                   | 29.9 ± 22.5   | 26.1–33.7    | 44.2 ± 20.9   | 40.7–47.7    | −6.01  | < 0.001
Time spent in N1 (min)         | 57.5 ± 22.5   | 53.7–61.4    | –             | –            | –      | –
Time spent in N2 (min)         | 169.5 ± 43.0  | 162.1–176.9  | –             | –            | –      | –
Time spent in N1+N2^d (min)    | 227.1 ± 49.3  | 218.6–235.5  | 206.9 ± 52.3  | 197.9–215.8  | 5.35   | < 0.001
Time spent in N3^d (min)       | 19.3 ± 21.1   | 15.7–22.9    | 55.5 ± 24.0   | 51.4–59.6    | −14.6  | < 0.001
Time spent in REM^d (min)      | 78.0 ± 28.5   | 73.1–82.9    | 70.2 ± 28.1   | 65.4–75.0    | 3.23   | 0.002
Time spent in Wake^d (min)     | 29.4 ± 21.8   | 25.7–33.1    | 54.8 ± 20.7   | 51.2–58.3    | −10.8  | < 0.001

Abbreviations: cSE = Calculated sleep efficiency; CI = Confidence interval; REM = Rapid eye movement; SD = Standard deviation; SE = Sleep efficiency; SOL = Sleep onset latency; SPT = Sleep period time; TIB = Time in bed; TST = Total sleep time; WASO = Wake after sleep onset.
a Sleep efficiency calculated manually for Fitbit Versa by using the formula cSE = TST/TIB.
b Sleep period time extracted for Fitbit Versa using information from available data.
c Wake after sleep onset extracted for Fitbit Versa using information from available data.
d Available only for the 133 Fitbit Versa measures with information on sleep stage and their corresponding EEG measures.
derived from the available wearable device data, of which cSE had good agreement with the portable single-channel EEG. Moreover, the wearable device had high sensitivity, high accuracy, high PVS, and a prevalence- and bias-adjusted kappa indicating substantial agreement with the portable EEG system. On the other hand, the FV had low specificity and PVW, and did not show good agreement for time spent in specific sleep stages (i.e., wake, light sleep, deep sleep and REM sleep) or for extracted SOL, SPT and WASO. Objective measures of sleep duration are integral to elucidating the association between sleep metrics and disease and mortality outcomes, and validation of such data over multiple nights in home environments is key for longitudinal studies. However, previous validation studies have reported significant biases of activity trackers compared to both PSG and actigraphy for basic sleep measures [8,10–14,30,31], thereby questioning the use of consumer wearable devices in research settings. Our findings of good agreement between the FV and the single-channel EEG for TIB, TST, and cSE in a home environment over multiple nights differ from the majority of validation studies to date. Our results concur with five previous studies which have shown non-significant bias for either TIB [10], TST [9,15], or SE [9,15,32,33] in a wearable device, albeit for only a single night. The majority of studies to date, however, have predominantly reported significant overestimation by various wearable devices compared to PSG for TST [8,10–14,30] and SE [8,10,11,14], and a significant underestimation of WASO [10–12,14] and TST [30]. Results have been similar when wearable devices have been compared to actigraphic data, with a significant overestimation of both TST [8,31] and SE [8]. Contrary to these results, the Fitbit Charge 2 significantly underestimated TST and SOL, and significantly overestimated WASO when compared to a portable EEG system in free-living conditions over a single night [33]. There are a couple of possible explanations for the discrepant results. First, the present study used a wearable device with automatic sleep tracking, whereas more than half of the previous studies used wearable devices relying on manual sleep tracking activation (e.g., Fitbit Flex, Fitbit One, Jawbone UP, or GENEactive). Manual activation could, as discussed by Mantua et al. [15], introduce bias in either direction with regard to data validity; less motivated participants may
Table 3. Agreement between equivalent portable single-channel EEG and Fitbit Versa sleep measures with the corresponding Bland-Altman plot biases, standard deviations, 95% confidence intervals of the biases, lower and upper agreement limits and the number of sleep measures exceeding the agreement limits.

Sleep measure                 | Bias ± SD      | Bias 95% CI     | Lower limit of agreement | Upper limit of agreement | N measures exceeding the limits of agreement
TIB (min)                     | −9.3 ± 67.7    | −20.7 to 2.1    | −142.1                   | 123.5                    | 10
TST (min)                     | −7.4 ± 51.9    | −16.1 to 1.3    | −109.1                   | 94.3                     | 8
SE (%)                        | −7.4 ± 7.8     | −8.7 to −6.1    | −22.7                    | 7.9                      | 8
cSE^a (%)                     | 0.08 ± 7.8     | −1.2 to 1.4     | −15.2                    | 15.4                     | 10
SOL (min)                     | 13.6 ± 21.0    | 10.1 to 17.2    | −27.6                    | 54.8                     | 8
SPT^b (min)                   | −21.6 ± 63.7   | −32.4 to −10.9  | −146.5                   | 103.2                    | 10
WASO^c (min)                  | −14.2 ± 27.8   | −18.9 to −9.6   | −68.8                    | 40.3                     | 8
Time spent in N1+N2^d (min)   | 20.2 ± 43.5    | 12.7 to 27.7    | −65.1                    | 105.5                    | 12
Time spent in N3^d (min)      | −36.2 ± 28.5   | −41.1 to −31.3  | −92.1                    | 19.7                     | 9
Time spent in REM^d (min)     | 7.8 ± 27.9     | 3.0 to 12.6     | −46.9                    | 62.5                     | 14
Time spent in Wake^d (min)    | −25.4 ± 27.1   | −30.0 to −20.7  | −78.5                    | 27.7                     | 13

Abbreviations: cSE = Calculated sleep efficiency; CI = Confidence interval; REM = Rapid eye movement; SD = Standard deviation; SE = Sleep efficiency; SOL = Sleep onset latency; SPT = Sleep period time; TIB = Time in bed; TST = Total sleep time; WASO = Wake after sleep onset.
a Sleep efficiency calculated manually for Fitbit Versa by using the formula cSE = TST/TIB.
b Sleep period time extracted for Fitbit Versa using information from available data.
c Wake after sleep onset extracted for Fitbit Versa using information from available data.
d Available only for the 133 Fitbit Versa measures with information on sleep stage and their corresponding EEG measures.
Fig. 2. Bland-Altman plots for time in bed (TIB), total sleep time (TST), and sleep efficiency (SE). The Y-axis represents the difference between the portable EEG and the Fitbit Versa (EEG – Fitbit Versa) whereas the X-axis represents the mean value of the two devices ([EEG + Fitbit Versa]/2). The solid line refers to the bias, and the dashed lines refer to the upper and lower limits of agreement calculated as the mean difference ± 1.96 standard deviations.
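The quantities drawn in these Bland-Altman plots (bias, limits of agreement as mean difference ± 1.96 SD, and the number of measures outside the limits) can be sketched in a few lines of Python, along with the paired t statistic used in the comparison. This is a minimal illustration using only the standard library, not the Stata analysis performed in the study; the function name `bland_altman` and the toy input values are hypothetical.

```python
import math
import statistics
from typing import Dict, Sequence


def bland_altman(eeg: Sequence[float], fv: Sequence[float]) -> Dict[str, float]:
    """Bias, limits of agreement and paired t statistic for EEG - FV."""
    if len(eeg) != len(fv) or len(eeg) < 2:
        raise ValueError("need two equally long series with at least 2 pairs")

    # Y-axis of the plot: difference (EEG - FV)
    diffs = [e - f for e, f in zip(eeg, fv)]
    n = len(diffs)
    bias = statistics.mean(diffs)    # mean difference
    sd = statistics.stdev(diffs)     # sample SD of the differences
    lower = bias - 1.96 * sd         # lower limit of agreement
    upper = bias + 1.96 * sd         # upper limit of agreement
    outside = sum(1 for d in diffs if d < lower or d > upper)
    # Paired t statistic: mean difference over its standard error
    t_stat = bias / (sd / math.sqrt(n))
    return {"bias": bias, "sd": sd, "lower": lower, "upper": upper,
            "n_outside": outside, "t": t_stat}


if __name__ == "__main__":
    # Hypothetical TST values (min) for four nights, for illustration only
    eeg_tst = [400.0, 380.0, 420.0, 390.0]
    fv_tst = [410.0, 385.0, 415.0, 400.0]
    print(bland_altman(eeg_tst, fv_tst))
```

A negative bias here means the FV value exceeds the EEG value on average, i.e., overestimation by the FV. Applying the same t formula to the reported TIB bias (−9.3 ± 67.7 over 138 nights) gives −9.3 / (67.7 / √138) ≈ −1.61, in line with the t value listed in Table 2.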
Fig. 3. Bland-Altman plots for time spent in light sleep, deep sleep, REM sleep, and wake. The Y-axis represents the difference between the portable EEG and the Fitbit Versa (EEG – Fitbit Versa) whereas the X-axis represents the mean value of the two devices ([EEG + Fitbit Versa]/2). The solid line refers to the bias, and the dashed lines refer to the upper and lower limits of agreement calculated as the mean difference ± 1.96 standard deviations.
Fig. 4. Bland-Altman plots for calculated sleep efficiency (cSE), extracted sleep onset latency (SOL), extracted sleep period time (SPT), and extracted wake after sleep onset (WASO). The Y-axis represents the difference between the portable EEG and the Fitbit Versa (EEG – Fitbit Versa) whereas the X-axis represents the mean value of the two devices ([EEG + Fitbit Versa]/2). The solid line refers to the bias, and the dashed lines refer to the upper and lower limits of agreement calculated as the mean difference ± 1.96 standard deviations.
have reduced compliance whereas highly motivated participants will diligently record their sleep. Second, the technological advancement over the past couple of years has undoubtedly led to improved sensors and proprietary algorithms in consumer wearables. For example, consumer wearables now utilize cardiac sensors, in addition to accelerometers, for sleep tracking purposes [34]. The most recent studies show a reduced bias between the wearable devices and PSG, which indicates increasing agreement and precision for measuring global sleep parameters. This rapid development is welcome and encouraging, and is further supported by our own results for TST and TIB.
Four global sleep measures (cSE, SOL, SPT, WASO) were in the present study derived from existing Fitbit data. SE, as provided by Fitbit through their API, showed a significant overestimation bias compared to the portable single-channel EEG, which is in agreement with the findings of previous studies [8,11,14]. Given the overestimation of Fitbit-scored SE, we recalculated the measure and found that the larger numerator is due to Fitbit using SPT, not TST, in their calculation of SE (data not shown). The US National Sleep Foundation defines SE as “The proportion of sleep in the episode potentially filled by sleep (i.e., the ratio of TST to TIB)” [35]. Based on this definition, we calculated an alternative SE measure, cSE, which did not show any significant bias compared to the single-channel EEG. This further confirms the good agreement of the underlying measures TIB and TST.
Contrary to cSE, the extracted SPT and WASO showed a significant overestimation bias, whereas SOL showed a significant underestimation bias for the wearable compared to the single-channel EEG. Our results for WASO and SOL are in agreement with a recent study comparing the Fitbit Charge 2 to the same single-channel EEG system used in the present study [33]. The FV's significant underestimation of SOL thus demonstrates how difficult it is for wearables to automatically recognise the time from ‘lights-out’ to sleep onset. However, given the wearable's high sensitivity, a significant underestimation of SOL does not necessarily indicate an inability to recognise the time of sleep onset. The difficulty instead lies in the wearable's inability to recognise the time of ‘lights-out’. At present, the only way to obtain more accurate estimates of SOL would be for users to supply the time of ‘lights-out’ through manual input. This solution is, however, far from ideal as it defeats the purpose of seamless and automatic sleep tracking.
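The difference between the two SE definitions discussed above can be illustrated with a short sketch. The function name `sleep_efficiency` and the durations are hypothetical and are not study data. The text states only that Fitbit uses SPT instead of TST in the numerator; the use of TIB as the denominator of the SPT-based variant is an assumption made here for illustration.

```python
def sleep_efficiency(numerator_min, tib_min):
    """Sleep efficiency in percent: (numerator / TIB) x 100."""
    if tib_min <= 0:
        raise ValueError("time in bed must be positive")
    return 100.0 * numerator_min / tib_min


# Hypothetical night, all durations in minutes (illustrative only)
tib, spt, tst = 480, 460, 420

cse = sleep_efficiency(tst, tib)     # cSE = (TST / TIB) x 100
se_spt = sleep_efficiency(spt, tib)  # assumed SPT-based variant
print(f"cSE={cse:.1f}%, SPT-based SE={se_spt:.1f}%")
```

Because SPT includes wake after sleep onset whereas TST does not, an SPT-based numerator can never be smaller than a TST-based one, which is consistent with the overestimation described above.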
When commencing the present validation study, a main aim was to calculate the sensitivity, specificity, and accuracy of the device. These measures represent, respectively, the ability of the FV to detect sleep, its ability to detect wake, and the proportion of sleep/wake epochs correctly identified by the device. Our calculations show that the FV has high sensitivity (92.1%), accuracy (88.5%), and PVS (95.0%), but a lower specificity (54.1%) and PVW (42.0%). The results are comparable to the previous validation studies of Fitbit devices in adults, which reported high sensitivity (0.96–0.98) [8,13,14,32] and high accuracy (0.87–0.88) [14,32], but low specificity (0.12–0.61) [8,13,14,32]. One major difference, however, lies in the study designs: whereas the previous studies considered single-night PSG in controlled laboratory environments, we conducted a naturalistic study in a home environment over multiple nights with over 76,500 compared epochs. The low specificity of the FV is reflected in the significant overestimation biases for SPT and WASO. Consequently, the low specificity leads to reduced agreement for individuals who have a high number of awakenings and thus a poorer sleep efficiency. This may be a limitation for using the device for clinical purposes. It should, however, be noted that the sensitivity, specificity and accuracy of the wearable device in the present study are comparable to wrist actigraphy [36]. Consequently, the wearable device would still be preferred, despite its low specificity, to subjective questionnaires in longitudinal epidemiologic studies investigating sleep duration.
Compared to the number of studies investigating consumer wearable devices for overall sleep measures, there is a paucity of studies investigating the validity of wearable devices for the time spent in specific sleep stages.
This could be due to the inherent difficulties of classifying sleep stages using a wrist-worn device where the likelihood of artifacts for key variables (i.e., movement and heart rate) used for sleep staging is high. Indeed, this difficulty is confirmed by our own results which demonstrate significant biases between the FV and the portable single-channel EEG for time spent in all sleep stages. To the best of our knowledge, only four research studies have investigated the sleep staging properties of wearable devices. One study [15] reported that two of the three investigated devices significantly overestimated deep sleep, and that all three devices significantly underestimated light sleep. When compared to PSG, the Fitbit Charge 2 significantly overestimated time spent in light sleep and significantly underestimated time spent in deep sleep but did not show any significant bias when estimating time spent in REM sleep [13]. Conversely, when compared to a portable EEG in free-living conditions, the Fitbit Charge 2 significantly underestimated time spent in light sleep and REM sleep, and significantly overestimated time spent in deep sleep [33]. The only positive results, albeit preliminary, are from an abstract of a sleep stage study comparing the Fitbit against a Type III home sleep testing device [34]. This study showed no statistically significant bias in the time spent in wake, light, deep, and REM stages. However, the lack of details in this abstract as to study duration, the number of nocturnal measures, or the Fitbit model used makes it difficult to draw any wider-ranging conclusions, as was also noted by de Zambotti et al. [13]. Nevertheless, the generalization of Fitbit wearables in the aforementioned abstract leads us to believe that the same sensors are used in all Fitbit models, and that the results of our validation study could therefore potentially be applicable to any one of Fitbit's wrist-worn devices.
Future studies need to confirm if this is indeed the case.
There are a few limitations to the present study. First, despite the large number of epochs and person-nights, the sample size of the present study (n = 20) is relatively small. This limits the generalizability of the results to adults with similar activity patterns and without, e.g., sleep disorders. Second, data collection from the portable home EEG system was reliant on the compliance of participants to turn the system on and off at the appropriate times. Poor compliance from study participants would be reflected in bias for the measure TIB. However, the absence of such bias indicates good compliance from study participants. Third, the results of the present study cannot be generalized to wearable devices other than the one we have investigated. Finally, Fitbit, like other companies, does not reveal its algorithms or provide information about major software updates which affect the calculation of its sleep measures. Such updates may affect the definition of sleep measures and subsequently over time affect the results of validation studies. It would indeed be preferable if developers of consumer wearables communicated any major updates to the scientific community to allow for independent validation of the resulting sleep measures.
This study also has a number of strengths. First, it is the first study to compare a wearable device with a portable single-channel EEG across multiple nights in a home environment. Second, it is also, to the best of our knowledge, the largest EEG-based sleep validation study of a wearable device given that we used 76,539 epochs in the EBE comparison. Third, the 138 person-nights are approximately three times the number of person-nights used in the largest PSG validation study in adults to date. Finally, compared to the portable EEG system, the FV registered an additional 70 sleep events over the 14-day study period, mostly due to participants' daytime naps and other sleep occurrences when the EEG had not yet been turned on. This demonstrates a clear benefit of utilizing devices which automatically register sleep events compared to those which require manual activation. Wearable devices validated for automatic sleep detection can thus provide sleep data across a full 24-hour period and put night-time sleep duration in a wider lifestyle-related context.
In conclusion, this study has shown that there is no significant bias between a wearable device and a portable single-channel EEG system for the sleep variables TIB, TST, and calculated SE. Moreover, the FV had high sensitivity, accuracy and PVS for detecting sleep.
Although the Cohen's kappa indicated moderate agreement, the prevalence- and bias-adjusted kappa indicated substantial agreement with the portable EEG system. The consumer sleep tracker could therefore be a useful tool for measuring sleep duration in longitudinal epidemiologic naturalistic studies, albeit with some limitations in specificity.
Funding
This research was supported by the Center of Innovation Program from Japan Science and Technology Agency, JST.
Author contribution statement
TS and AKS were responsible for the conception and design of the study. TS and AKS analysed and interpreted the data. UC, ST, and MN assisted with the interpretation of the data. TS and AKS drafted the manuscript. UC, ST, and MN critically revised the manuscript for important intellectual content. All authors approved the final version of the manuscript.
Data availability statement
We cannot publicly provide individual data due to participant privacy, according to ethical guidelines in Japan. Additionally, the written informed consent we obtained from study participants does not include a provision for publicly sharing data. Qualifying researchers may apply to access a minimal dataset by contacting the Principal Investigator for this study, Dr. Thomas Svensson, Precision Health, Department of Bioengineering, Graduate School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan, at [email protected].
Declaration of Competing Interest
The authors have no competing interests to report.
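
The epoch-by-epoch agreement measures reported in the Discussion (sensitivity, specificity, accuracy, PVS, PVW, Cohen's kappa, and the prevalence- and bias-adjusted kappa) can be illustrated with the following minimal sketch. The function name `agreement_metrics` and the epoch counts are hypothetical and are not study data; sleep is treated as the positive class and the EEG as the reference.

```python
def agreement_metrics(tp, fn, fp, tn):
    """Epoch-by-epoch metrics with sleep as the positive class.

    tp: both devices score sleep; fn: EEG sleep, wearable wake;
    fp: EEG wake, wearable sleep; tn: both devices score wake.
    """
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    # Chance agreement expected from the marginal totals
    p_expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n**2
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": accuracy,
        "pvs": tp / (tp + fp),  # predictive value for sleep
        "pvw": tn / (tn + fn),  # predictive value for wakefulness
        "kappa": (accuracy - p_expected) / (1 - p_expected),
        "pabak": 2 * accuracy - 1,  # two-category PABAK
    }


# Hypothetical counts per 1000 epochs (illustrative only, not study data)
m = agreement_metrics(tp=830, fn=70, fp=45, tn=55)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

For two categories, the prevalence- and bias-adjusted kappa depends only on the observed agreement, so an accuracy of 88.5% gives 2 × 0.885 − 1 = 0.77, in line with the reported value. Cohen's kappa is lower because sleep epochs greatly outnumber wake epochs, which raises the agreement expected by chance.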
Acknowledgements
We thank all staff members at the Center of Innovation, the University of Tokyo for their extensive efforts and help to conduct the study. We would also like to thank all members of Precision Health, the University of Tokyo for their invaluable assistance, in particular Fumiya Tatsuki and Ei Hiruma for their help with protocol writing and data extraction, respectively. Thomas Svensson had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
References
[1] F.P. Cappuccio, L. D’Elia, P. Strazzullo, M.A. Miller, Quantity and quality of sleep and incidence of type 2 diabetes: a systematic review and meta-analysis, Diabetes Care 33 (2) (2010) 414–420.
[2] D. Wang, W. Li, X. Cui, Y. Meng, M. Zhou, L. Xiao, J. Ma, G. Yi, W. Chen, Sleep duration and risk of coronary heart disease: a systematic review and meta-analysis of prospective cohort studies, Int. J. Cardiol. 219 (2016) 231–239.
[3] W. Li, D. Wang, S. Cao, X. Yin, Y. Gong, Y. Gan, Y. Zhou, Z. Lu, Sleep duration and risk of stroke events and stroke mortality: a systematic review and meta-analysis of prospective cohort studies, Int. J. Cardiol. 223 (2016) 870–876.
[4] F.P. Cappuccio, L. D’Elia, P. Strazzullo, M.A. Miller, Sleep duration and all-cause mortality: a systematic review and meta-analysis of prospective studies, Sleep 33 (5) (2010) 585–592.
[5] J. Yin, X. Jin, Z. Shan, S. Li, H. Huang, P. Li, X. Peng, Z. Peng, K. Yu, W. Bao, W. Yang, X. Chen, L. Liu, Relationship of sleep duration with all-cause mortality and cardiovascular events: a systematic review and dose-response meta-analysis of prospective cohort studies, J. Am. Heart Assoc. 6 (9) (2017).
[6] Q.Q. Ma, Q. Yao, L. Lin, G.C. Chen, J.B. Yu, Sleep duration and total cancer mortality: a meta-analysis of prospective studies, Sleep Med. 27–28 (2016) 39–44.
[7] D.S. Lauderdale, K.L. Knutson, L.L. Yan, K. Liu, P.J. Rathouz, Self-reported and measured sleep duration: how similar are they? Epidemiology 19 (6) (2008) 838–845.
[8] H.E. Montgomery-Downs, S.P. Insana, J.A. Bond, Movement toward a novel activity monitoring device, Sleep Breath. 16 (3) (2012) 913–917.
[9] E. Toon, M.J. Davey, S.L. Hollis, G.M. Nixon, R.S. Horne, S.N. Biggs, Comparison of commercial wrist-based and smartphone accelerometers, actigraphy, and PSG in a clinical cohort of children and adolescents, J. Clin. Sleep Med. 12 (3) (2016) 343–350.
[10] M. de Zambotti, F.C. Baker, I.M. Colrain, Validation of sleep-tracking technology compared with polysomnography in adolescents, Sleep 38 (9) (2015) 1461–1468.
[11] M. de Zambotti, F.C. Baker, A.R. Willoughby, J.G. Godino, D. Wing, K. Patrick, I.M. Colrain, Measures of sleep and cardiac functioning during sleep using a multisensory commercially-available wristband in adolescents, Physiol. Behav. 158 (2016) 143–149.
[12] M. de Zambotti, S. Claudatos, S. Inkelis, I.M. Colrain, F.C. Baker, Evaluation of a consumer fitness-tracking device to assess sleep in adults, Chronobiol. Int. 32 (7) (2015) 1024–1028.
[13] M. de Zambotti, A. Goldstone, S. Claudatos, I.M. Colrain, F.C. Baker, A validation study of Fitbit Charge 2 compared with polysomnography in adults, Chronobiol. Int. 35 (4) (2018) 465–476.
[14] J.D. Cook, M.L. Prairie, D.T. Plante, Utility of the Fitbit Flex to evaluate sleep in major depressive disorder: a comparison against polysomnography and wrist-worn actigraphy, J. Affect. Disord. 217 (2017) 299–305.
[15] J. Mantua, N. Gravel, R.M. Spencer, Reliability of sleep measures from four personal health monitoring devices compared to research-based Actigraphy and polysomnography, Sensors (Basel) 16 (5) (2016).
[16] D.B. Kirsch, Sleep Medicine in Neurology, John Wiley & Sons Inc, Chichester, West Sussex, 2014.
[17] K.E. Bloch, Polysomnography: a systematic review, Technol. Health Care 5 (4) (1997) 285–305.
[18] M. Toussaint, R. Luthringer, N. Schaltenbrand, G. Carelli, E. Lainey, A. Jacqmin, A. Muzet, J.P. Macher, First-night effect in normal subjects and psychiatric inpatients, Sleep 18 (6) (1995) 463–469.
[19] A. Sadeh, The role and validity of actigraphy in sleep medicine: an update, Sleep Med. Rev. 15 (4) (2011) 259–267.
[20] L. de Souza, A.A. Benedito-Silva, M.L. Pires, D. Poyares, S. Tufik, H.M. Calil, Further validation of actigraphy for sleep studies, Sleep 26 (1) (2003) 81–85.
[21] J.M. Bland, How Can I Decide the Sample Size for a Repeatability Study? http://www-users.york.ac.uk/~mb55/meas/sizerep.htm (2010).
[22] M. Yoshida, K. Kashiwagi, H. Kadotani, K. Yamamoto, S. Koike, M. Matsuo, N. Yamada, M. Okawa, Y. Urade, Validation of a portable single-channel EEG monitoring system, Journal of Oral and Sleep Medicine 1 (2) (2015) 140–147.
[23] C. Iber, S. Ancoli-Israel, A. Chesson, S.F. Quan, for the American Academy of Sleep Medicine, The AASM Manual for the Scoring of Sleep and Associated Events: Rules, Terminology and Technical Specifications, 1st ed., American Academy of Sleep Medicine, Westchester, IL, 2007.
[24] Fitbit Inc, Why don't I See Sleep Stages Today? 2018. https://help.fitbit.com/articles/en_US/Help_article/2163/?q=sleep+stages&l=en_US&fs=Search&pn=1#ClassicSleep.
[25] Fitbit Inc, Fitbit Versa User Manual Version 1.5, https://staticcs.fitbit.com/content/assets/help/manuals/manual_versa_en_US.pdf.
[26] C.P. Pollak, W.W. Tryon, H. Nagaraja, R. Dzwonczyk, How accurately does wrist actigraphy identify the states of sleep and wakefulness? Sleep 24 (8) (2001) 957–965.
[27] J.M. Bland, D.G. Altman, Statistical methods for assessing agreement between two methods of clinical measurement, Lancet 1 (8476) (1986) 307–310.
[28] M. de Zambotti, N. Cellini, A. Goldstone, I.M. Colrain, F.C. Baker, Wearable sleep technology in clinical and research settings, Med. Sci. Sports Exerc. 51 (7) (2019) 1538–1557.
[29] T. Byrt, J. Bishop, J.B. Carlin, Bias, prevalence and kappa, J. Clin. Epidemiol. 46 (5) (1993) 423–429.
[30] M.E. Rosenberger, M.P. Buman, W.L. Haskell, M.V. McConnell, L.L. Carstensen, Twenty-four hours of sleep, sedentary behavior, and physical activity with nine wearable devices, Med. Sci. Sports Exerc. 48 (3) (2016) 457–465.
[31] H.A. Lee, H.J. Lee, J.H. Moon, T. Lee, M.G. Kim, H. In, C.H. Cho, L. Kim, Comparison of wearable activity tracker with Actigraphy for sleep evaluation and circadian rest-activity rhythm measurement in healthy young adults, Psychiatry Investig. 14 (2) (2017) 179–185.
[32] S.G. Kang, J.M. Kang, K.P. Ko, S.C. Park, S. Mariani, J. Weng, Validity of a commercial wearable sleep tracker in adult insomnia disorder patients and good sleepers, J. Psychosom. Res. 97 (2017) 38–44.
[33] Z. Liang, M.A. Chapa Martell, Validity of consumer activity wristbands and wearable EEG for measuring overall sleep parameters and sleep structure in free-living conditions, Journal of Healthcare Informatics Research 2 (1) (2018) 152–178.
[34] Z. Beattie, Y. Oyang, A. Statan, A. Ghoreyshi, A. Pantelopoulos, A. Russell, C. Heneghan, Estimation of sleep stages in a healthy adult population from optical plethysmography and accelerometer signals, Physiol. Meas. 38 (11) (2017) 1968–1979.
[35] National Sleep Foundation, Sleeptionary - Definitions of Common Sleep Terms, https://www.sleepfoundation.org/sleeptionary.
[36] M. Marino, Y. Li, M.N. Rueschman, J.W. Winkelman, J.M. Ellenbogen, J.M. Solet, H. Dulin, L.F. Berkman, O.M. Buxton, Measuring sleep: accuracy, sensitivity, and specificity of wrist actigraphy compared to polysomnography, Sleep 36 (11) (2013) 1747–1755.