Bayesian decision support for coding occupational injury data

Journal of Safety Research 57 (2016) 71–82

Gaurav Nanda a, Kathleen M. Grattan b, MyDzung T. Chu b, Letitia K. Davis b, Mark R. Lehto a,⁎,1

a School of Industrial Engineering, Purdue University, 315 N. Grant Street, West Lafayette, IN 47907-2023, USA
b Massachusetts Department of Public Health, 250 Washington Street, 4th Floor, Boston, MA 02108, USA

Article history: Received 1 July 2015; Received in revised form 10 December 2015; Accepted 2 March 2016; Available online 15 March 2016

Keywords: Bayesian models; Narrative analysis; Occupational injury; Text classification; Decision support system

Abstract

Introduction: Studies on autocoding injury data have found that machine learning algorithms perform well for categories that occur frequently but often struggle with rare categories. Therefore, manual coding, although resource-intensive, cannot be eliminated. We propose a Bayesian decision support system to autocode a large portion of the data, filter cases for manual review, and assist human coders by presenting them with the top k prediction choices and a confusion matrix of predictions from Bayesian models.

Method: We studied the prediction performance of Single-Word (SW) and Two-Word-Sequence (TW) Naïve Bayes models on a sample of data from the 2011 Survey of Occupational Injury and Illness (SOII). We used the agreement in prediction results of the SW and TW models, and various prediction strength thresholds, for autocoding and filtering cases for manual review. We also studied the sensitivity of the top k predictions of the SW model, TW model, and SW–TW combination, and then compared the accuracy of the manually assigned codes in the SOII data with that of the proposed system.

Results: The accuracy of the proposed system, assuming well-trained coders reviewing a subset of only 26% of cases flagged for review, was estimated to be comparable (86.5%) to the accuracy of the original coding of the data set (range: 73%–86.8%). Overall, the TW model had higher sensitivity than the SW model, and the accuracy of the prediction results increased when the two models agreed and for higher prediction strength thresholds. The sensitivity of the top five predictions was 93%.

Conclusions: The proposed system seems promising for coding injury data as it offers comparable accuracy and less manual coding.

Practical Applications: Accurate and timely coded occupational injury data is useful for surveillance as well as prevention activities that aim to make workplaces safer.

© 2016 National Safety Council and Elsevier Ltd. All rights reserved.

⁎ Corresponding author at: Industrial Engineering Discovery-to-Delivery Center, School of Industrial Engineering, Purdue University, West Lafayette, IN, USA. E-mail addresses: [email protected] (G. Nanda), [email protected] (K.M. Grattan), [email protected] (M.R. Lehto).
1 Present/Permanent Address: School of Industrial Engineering, Purdue University, 315 N. Grant Street, West Lafayette, IN 47907-2023, USA.

http://dx.doi.org/10.1016/j.jsr.2016.03.001
0022-4375/© 2016 National Safety Council and Elsevier Ltd. All rights reserved.

1. Introduction

Occupational safety and health research and surveillance are essential for the prevention and control of injuries, illnesses, and hazards that occur in the workplace. A systematic analysis of workplace injuries can provide insights about where hazards exist and what interventions might be effective at preventing workplace injuries, illnesses, and fatalities. The Survey of Occupational Injury and Illness (SOII), conducted by the Bureau of Labor Statistics (BLS), is the largest occupational injury survey in the United States; it covers nearly all private sector industries as well as state and local governments, and provides detailed information on workplace injuries and illnesses (Occupational Safety and Health Statistics Program, 2014; U.S. Department of Labor, 2005). SOII data are coded using the Occupational Injuries and Illnesses Classification System (OIICS) coding scheme, which offers a simple, yet detailed, hierarchical structure for coding of injury data (Bondy, Lipscomb, Guarini, & Glazner, 2005; Northwood, Sygnatur, & Windau, 2012; U.S. Department of Labor, 2012). SOII 2011 was the first year in which the updated OIICS version 2.01 was used for coding the data.

The process of coding occupational injury data is currently performed manually by human coders: a time- and resource-consuming task. A promising alternative to manual coding is offered by Bayesian machine learning algorithms such as Naïve Bayes and Fuzzy Bayes, which can learn from a training set of narratives previously coded by experts and predict cause-of-injury codes based on the injury narrative, with a probability that reflects the confidence of the prediction (Lehto, Marucci-Wellman, & Corns, 2009; Noorinaeini & Lehto, 2006; Taylor, Lacovara, Smith, Pandian, & Lehto, 2014; Wellman, Lehto, Sorock, & Smith, 2004). Other machine learning algorithms such as support vector machines (SVM) and logistic regression have also been found to yield good classification performance for injury classification (Bertke et al., 2016; Chen, Vallmuur, & Nayak, 2015; Measure, 2014). Most machine learning algorithms perform well for binary text classification tasks, but performance declines dramatically as the number of predicted classes increases, partly because there are very few cases for rare categories (Rizzo, Montesi, Fabbri, & Marchesini, 2015; Vallmuur, 2015). As shown by Rizzo et al., the macro-averaged F1 score (a commonly used measure of accuracy for classification tasks) of an SVM model declines from about 0.88 to 0.75 as the number of classes increases from 10 to 45.

Classification of SOII data is a challenging task for any machine learning method because there are about 45 two-digit OIICS event or exposure codes (which are the class labels for this study, and are also referred to as “categories” in the paper), and the distribution of data is heavily skewed, with most injury cases falling under a small number of categories. The small number of training cases for rare categories makes it difficult for a machine learning model to learn and make predictions with good accuracy. Moreover, other factors such as misspellings, abbreviations, and synonyms in the narratives adversely affect prediction performance. Hence, although using machine learning algorithms to predict event codes is an efficient method for coding injury data with good accuracy, human review cannot be eliminated (Wellman et al., 2004). Semi-automated methods have been suggested as an alternative strategy to: (a) reduce the amount of manual coding required without sacrificing accuracy, and (b) allow expert coders to focus on complicated narratives, thus resulting in a more efficient utilization of their time and resources (Corns, Marucci, & Lehto, 2007). Agreement in prediction results from Fuzzy Bayes and Naïve Bayes models has been used as a strategy for autocoding the agreement cases and filtering the disagreement cases for manual review (Marucci-Wellman, Lehto, & Corns, 2011). In addition to using agreement between different models, prediction performance thresholds have also been explored in a recent study to filter cases that require manual review (Marucci-Wellman, Lehto, & Corns, 2015).

The present study proposes a Bayesian decision support system that autocodes a large portion of cases with high accuracy, filters cases for manual review, and assists human coders by providing them the top five choices for possible event codes. Such a semi-automated top k approach has been found to be helpful for human coders in similar text classification tasks with a large number of categories, such as assigning International Classification of Diseases (ICD) codes to short medical text (Rizzo et al., 2015). A similar decision support system developed using Bayesian models has been found to be very effective for predicting the print defect category based on a customer's narrative of the issue, with the actual defect being present in the top five predictions 95% of the time (Leman & Lehto, 2003). Therefore, it seems reasonable to use Bayesian models for developing a decision support system that can help human coders in coding occupational injury data. The detailed methodology used in the study is described in the next section.

2. Methods

As mentioned earlier, many machine learning methods have been proposed. Among these methods, the simple Naïve Bayes model is well suited for classification of short textual data, especially when the classes have fewer training cases (Marucci-Wellman et al., 2015; Wang & Manning, 2012). SOII data have many categories, and many of these categories have very few training cases; hence, the use of the Naïve Bayes model seems appropriate. Previous studies on autocoding injury narratives have used Naïve and Fuzzy Bayes models with single words as well as multiple-word sequences and combinations (Wellman et al., 2004; Zhu & Lehto, 1999).
Multiple-word combinations and sequences as predictors are likely to have better prediction accuracy than any single word for certain categories. For example, the two-word sequence ‘fell off’ is a more accurate predictor of the category ‘fall from elevation’ than the separate words ‘fell’ and ‘off.’ Use of multiple-word combinations and sequences in Naïve Bayes and Fuzzy Bayes classifiers has been found to yield better results than single words for autocoding injury narratives (Corns et al., 2007; Noorinaeini & Lehto, 2006). The Fuzzy Bayes model in particular works better when combinations of two or more words are used (Noorinaeini & Lehto, 2006). However, using multiple-word combinations and sequences becomes computationally intensive for large datasets such as SOII because the number of two or more

word combinations and sequences becomes very large. Hence, we used two Naïve Bayes models for building the proposed decision support system: (a) Single-Word (SW) and (b) Two-Word-Sequence (TW), which are discussed below. We used the Textminer (TM) software to implement these models; it has also been used in previous studies related to autocoding injury data (Corns et al., 2007).

2.1. Naive Bayes model

Each injury narrative can be considered as a vector of j words present in that narrative, that is, n = {n1, n2, ..., nj}. Assuming there are i possible event codes that can be assigned, the set of event codes can be represented as the vector E = {E1, E2, ..., Ei}. The Naïve Bayes model assumes conditional independence of the words in a narrative given the event code, which implies that the probability of a word being present in the narrative depends only on the event code considered and is independent of the remaining terms in the narrative. Using the conditional independence assumption, the probability of assigning a particular event code Ei to the narrative n can be calculated using the expression:

$$P(E_i \mid n) = \prod_j \frac{P(n_j \mid E_i)\, P(E_i)}{P(n_j)}$$

where P(Ei | n) is the probability of event code category Ei given the set of words n in the narrative, P(nj | Ei) is the probability of word nj given category Ei, P(Ei) is the prior probability of category i, and P(nj) is the probability of word nj in the entire keyword list. In application, P(nj | Ei), P(Ei), and P(nj) are all normally estimated on the basis of their frequency of occurrence in the training set. The probability of an event category is calculated by multiplying the likelihood ratios for each word in the narrative and the prior probability. Even though the conditional independence assumption of the Naive Bayes model is violated in practice, it yields high accuracy for text classification tasks (Wellman et al., 2004). P(nj | Ei) in this study was estimated using word counts and category frequencies in the training set as shown below:

$$P(n_j \mid E_i) = \frac{\mathrm{count}(n_j \mid E_i) + \alpha \cdot \mathrm{count}(n_j)}{\mathrm{count}(E_i) + \alpha \cdot N}$$

where count(nj | Ei) = number of times word nj occurs in category Ei, count(nj) = number of times word nj occurs overall, count(Ei) = number of times category Ei occurs, N = number of training narratives, and α = smoothing constant. The smoothing constant reduces the weight given to the evidence provided by each term and also avoids P(nj | Ei) being improperly set to zero in case a word is not present in the training examples of a particular category (Lehto et al., 2009; Taylor et al., 2014; Wellman et al., 2004). We chose the value of α as 0.05, which corresponds to a small level of smoothing.

2.2. Agreement of two models and prediction thresholds

Considering the agreement of prediction results of more than one Bayesian model has been shown to be a good strategy to improve the classification accuracy of occupational injury data (Marucci-Wellman et al., 2011, 2015). For example, classification accuracy was found to improve for the task of assigning ICD codes based on inpatient discharge summaries by combining prediction output from three different text classification models, namely, K-nearest neighbors, relevance feedback, and Bayesian independence (Larkey & Croft, 1996). The present work considers the agreement in prediction results from the SW and TW models and analyzes the prediction performance when the models agree. Furthermore, we examined the effect on prediction performance of applying a minimum threshold to the prediction probability output by the Naïve Bayes model.
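To make the two models and the agreement check concrete, a minimal sketch is given below (Python). This is our illustration under the equations above, not the Textminer implementation used in the study: the class `NaiveBayesCoder`, the tokenizer, and the inputs `train_texts`/`train_codes` are assumed names, and the normalized score returned by `predict` is only one plausible way to express the prediction strength discussed later.

```python
from collections import Counter, defaultdict

ALPHA = 0.05  # smoothing constant used in the study

def tokens(text, two_word=False):
    """Unigrams for the SW model; adjacent two-word sequences for the TW model."""
    words = text.lower().split()
    return list(zip(words, words[1:])) if two_word else words

class NaiveBayesCoder:
    def __init__(self, two_word=False):
        self.two_word = two_word

    def fit(self, narratives, codes):
        self.n_train = len(narratives)
        self.cat_count = Counter(codes)             # count(Ei)
        self.term_count = Counter()                 # count(nj)
        self.term_cat_count = defaultdict(Counter)  # count(nj | Ei)
        for text, code in zip(narratives, codes):
            for t in tokens(text, self.two_word):
                self.term_count[t] += 1
                self.term_cat_count[code][t] += 1
        return self

    def scores(self, narrative):
        """Unnormalized P(Ei | n): prior times the likelihood ratio of each term.
        (For long narratives, log-space accumulation would be preferable.)"""
        out = {}
        for code, n_cat in self.cat_count.items():
            score = n_cat / self.n_train            # prior P(Ei)
            for t in tokens(narrative, self.two_word):
                if not self.term_count[t]:
                    continue                        # term unseen in training: no evidence
                p_t_cat = ((self.term_cat_count[code][t] + ALPHA * self.term_count[t])
                           / (n_cat + ALPHA * self.n_train))  # smoothed P(nj | Ei)
                p_t = self.term_count[t] / self.n_train       # frequency-based P(nj)
                score *= p_t_cat / p_t
            out[code] = score
        return out

    def predict(self, narrative):
        s = self.scores(narrative)
        best = max(s, key=s.get)
        total = sum(s.values())
        return best, (s[best] / total if total else 0.0)  # code and prediction strength

# Agreement check (Section 2.2): autocode when the two models predict the same
# event code; otherwise filter the case for manual review.
sw = NaiveBayesCoder(two_word=False).fit(train_texts, train_codes)
tw = NaiveBayesCoder(two_word=True).fit(train_texts, train_codes)
code_sw, _ = sw.predict("employee fell off ladder onto concrete floor")
code_tw, strength_tw = tw.predict("employee fell off ladder onto concrete floor")
send_to_autocoding = (code_sw == code_tw)
```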


2.3. Data preprocessing

For this study, we used a random sample of 50,000 lost-workday cases from the 2011 SOII data for the U.S., provided by BLS and coded according to OIICS version 2.01. The injury narrative field was minimally cleaned, and no effort was made to correct grammatical errors and misspellings. Out of the 50,000 cases, 40,000 were randomly selected as the training set for the Naïve Bayes models, of which three cases with missing event codes were excluded. The remaining 10,000 cases constituted the prediction set. The distributions of categories in the training and prediction sets were similar.

2.4. Performance measures

We used sensitivity, positive predictive value (PPV), and F1 score as the measures to evaluate the prediction performance of the different approaches used in this study. These measures have been widely used in previous studies to examine the performance of autocoding injury data (Lehto et al., 2009; Marucci-Wellman et al., 2011). Sensitivity, also known as “recall,” is defined as

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$

PPV, also known as “precision,” is defined as

$$\mathrm{PPV} = \frac{TP}{TP + FP}$$

where TP = number of true positives, FN = number of false negatives, and FP = number of false positives. The F1 score considers both sensitivity and PPV, and is defined as

$$F_1 = 2 \times \frac{\mathrm{Sensitivity} \times \mathrm{PPV}}{\mathrm{Sensitivity} + \mathrm{PPV}}$$

The BLS-assigned codes were considered as the gold standard (true class), and the predicted codes were compared with the BLS-assigned codes to calculate the sensitivity and PPV values for each category.
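As an illustration of how these measures can be computed from coded output, a short sketch follows (Python; the function and variable names are our own, not from the study's software). It tallies TP, FN, and FP per category by comparing predicted codes against the gold standard codes.

```python
from collections import Counter

def category_metrics(gold_codes, predicted_codes):
    """Per-category sensitivity (recall), PPV (precision), and F1 score."""
    tp, fn, fp = Counter(), Counter(), Counter()
    for gold, pred in zip(gold_codes, predicted_codes):
        if gold == pred:
            tp[gold] += 1
        else:
            fn[gold] += 1  # a true case of `gold` was missed
            fp[pred] += 1  # `pred` was assigned where it was not the true code
    results = {}
    for cat in set(gold_codes) | set(predicted_codes):
        sen = tp[cat] / (tp[cat] + fn[cat]) if (tp[cat] + fn[cat]) else 0.0
        ppv = tp[cat] / (tp[cat] + fp[cat]) if (tp[cat] + fp[cat]) else 0.0
        f1 = 2 * sen * ppv / (sen + ppv) if (sen + ppv) else 0.0
        results[cat] = {"sensitivity": sen, "ppv": ppv, "f1": f1}
    return results
```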


2.5. Confusion matrix

In addition to sensitivity and PPV, classification results are evaluated using a two-dimensional confusion matrix, which is often used for displaying multiclass prediction results (Witten & Frank, 2005). A confusion matrix consists of a row and a column for each class. Each cell in the matrix represents the number of cases in the prediction set for which the true class is the row label and the predicted class is the column label. In terms of representation, good prediction results would mean large numbers as diagonal elements and very few small numbers as off-diagonal elements. Confusion matrices can reveal trends and patterns in the classification results and have also been used to create visualization tools for machine learning (Talbot, Lee, Kapoor, & Tan, 2009). Given the structure of the confusion matrix, we propose that it could be used to assist the human coders in identifying the categories that often get misclassified by the machine learning algorithm and might require further attention.

2.6. Top k predictions

The Naïve Bayes model outputs the prediction probability of each class for each case in the prediction dataset. We ranked the classes based on prediction probabilities to obtain a list of the top k prediction choices output by the SW model, TW model, and SW–TW combination. For the SW–TW combination, the prediction results from the SW and TW models were combined together and then ranked based on prediction probabilities to get the top k prediction choices. The sensitivities for different values of k (1, 2, 3, 4, and 5) were calculated by examining whether the BLS-assigned event codes were present in the list of top k prediction choices, as sketched below. We propose that providing such a list of the top k prediction choices could help human coders in manually assigning event codes to particular categories.
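A sketch of the top k ranking and its sensitivity follows (Python). The way the SW and TW outputs are merged in `combined_scores` is one plausible reading of the description above (pooling both models' codes and keeping each code's higher probability), not a documented detail of the study's software.

```python
def top_k(scores, k=5):
    """The k event codes with the highest predicted probability; scores: {code: prob}."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def top_k_sensitivity(cases, k=5):
    """Share of cases whose gold standard code appears among the top k choices.
    cases: list of (gold_code, scores) pairs."""
    hits = sum(gold in top_k(scores, k) for gold, scores in cases)
    return hits / len(cases)

def combined_scores(sw_scores, tw_scores):
    """Assumed SW-TW combination: pool both models' codes, keep the higher score."""
    return {c: max(sw_scores.get(c, 0.0), tw_scores.get(c, 0.0))
            for c in set(sw_scores) | set(tw_scores)}
```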

2.7. Evaluating BLS-assigned and Textminer-predicted codes

During model development, we regarded the BLS-assigned event codes as the gold standard. However, it is well known that the codes assigned by human coders to accident narratives vary between coders. The inter-rater agreement often varies, and the agreement between coders is sometimes relatively lower for some categories (Marucci-Wellman et al., 2015). In order to study the consistency in manual coding, a sample of 871 cases was taken from the prediction dataset and coded independently by two coders at the Occupational Health Surveillance Program (OHSP) at the Massachusetts Department of Public Health. This sample consisted of two groups: (a) the discordant group (754 cases), where the predictions from the SW and TW models agreed but did not match the BLS-assigned codes, and (b) the concordant group (117 cases), where the event codes predicted by the SW and TW models agreed with the BLS-assigned codes. For both the concordant and discordant groups, the event codes predicted by the SW and TW models agreed; these are referred to as TM-predicted (Textminer-predicted) codes henceforth.

The two independent OHSP coders were blinded to the BLS-assigned codes, the TM-predicted codes, and the code assigned by the other OHSP coder. The coders were provided all the information from the SOII data needed to assign the event code. They assigned 4-digit event codes to each case in the sample independently. Once the coding process was complete, the cases where the two coders disagreed at the 4-digit level were identified, and both coders discussed these cases to resolve the codes. Cases for which no agreement could be reached were referred to the Boston Regional Office of BLS, where three experienced staffers reviewed the cases and assigned event codes to them. The codes assigned by the OHSP coders were considered as the gold standard for this sample of 871 cases. The BLS-assigned codes and TM-predicted codes were compared with the codes assigned by the OHSP coders to calculate the sensitivity and PPV for the discordant and concordant groups separately.

3. Results and discussion

In this section, the individual prediction performance of the SW and TW models is first discussed. We then consider prediction performance when the TW and SW models agree, before addressing the use of prediction probability thresholds to improve accuracy. The sensitivity of the top k prediction choices from the SW model, the TW model, and the SW–TW combined model is then presented for k = (1, 2, 3, 4, 5). Finally, we estimate the accuracy of the BLS-assigned codes. Further analysis indicated that using the proposed Bayesian decision support system is likely to result in accuracy comparable to the BLS-assigned codes while requiring less than half of the cases to be manually coded.

3.1. SW and TW model predictions

The overall sensitivity of the prediction set for the SW model was 66%, and for the TW model it was 69%. The sensitivity, PPV, and F1 score of the SW and TW models on the prediction set for each event code category are presented in Table 1. As shown in Table 1, each measure of accuracy varies between categories; there seems to be large variation in the measures of accuracy for smaller categories (with fewer cases), while there is some tendency for larger categories (with more cases) to show better accuracy. This is also illustrated in Fig. 1(a) and (b), where the F1 score of each category is plotted against the frequency (number of cases) of the category in Fig. 1(a) for the SW model and in Fig. 1(b) for the TW model.


Table 1. Category-wise sensitivity (Sen), PPV, and F1 score for SW and TW models. Column ‘N’ is the number of cases in the prediction set for that event code.

| Event code | Event code description | N | SW Sen | SW PPV | SW F1 | TW Sen | TW PPV | TW F1 |
|---|---|---|---|---|---|---|---|---|
| 10 | Violence and other injuries by persons or animals, unspecified | 2 | 0.00 | — | 0.00 | 0.00 | — | 0.00 |
| 11 | Intentional injury by person | 214 | 0.74 | 0.54 | 0.62 | 0.64 | 0.55 | 0.59 |
| 12 | Injury by person—unintentional or intent unknown | 237 | 0.46 | 0.37 | 0.41 | 0.38 | 0.36 | 0.37 |
| 13 | Animal- and insect-related incidents | 64 | 0.86 | 0.82 | 0.84 | 0.70 | 0.83 | 0.76 |
| 20 | Transportation incident, unspecified | 2 | 0.00 | — | 0.00 | 0.00 | — | 0.00 |
| 21 | Aircraft incidents | 1 | 0.00 | — | 0.00 | 0.00 | — | 0.00 |
| 22 | Rail vehicle incidents | 0 | 0.00 | — | 0.00 | 0.00 | — | 0.00 |
| 23 | Animal and other non-motorized vehicle transportation incidents | 8 | 0.13 | 1.00 | 0.23 | 0.00 | — | 0.00 |
| 24 | Pedestrian vehicle incident | 54 | 0.28 | 0.48 | 0.35 | 0.50 | 0.42 | 0.46 |
| 25 | Water vehicle incidents | 3 | 0.00 | 0.00 | 0.00 | 0.00 | — | 0.00 |
| 26 | Roadway incidents involving motorized land vehicle | 287 | 0.96 | 0.70 | 0.81 | 0.94 | 0.79 | 0.86 |
| 27 | Non-roadway incidents involving motorized land vehicles | 68 | 0.28 | 0.17 | 0.21 | 0.26 | 0.22 | 0.24 |
| 29 | Transportation incident, n.e.c. | 0 | 0.00 | — | 0.00 | 0.00 | — | 0.00 |
| 31 | Fires | 7 | 0.00 | — | 0.00 | 0.00 | 0.00 | 0.00 |
| 32 | Explosions | 4 | 0.25 | 1.00 | 0.40 | 0.25 | 1.00 | 0.40 |
| 40 | Fall, slip, trip, unspecified | 41 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 41 | Slip or trip without fall | 404 | 0.18 | 0.43 | 0.25 | 0.31 | 0.45 | 0.37 |
| 42 | Falls on same level | 1701 | 0.78 | 0.71 | 0.74 | 0.82 | 0.73 | 0.77 |
| 43 | Falls to lower level | 392 | 0.73 | 0.39 | 0.51 | 0.74 | 0.45 | 0.56 |
| 44 | Jumps to lower level | 16 | 0.06 | 0.17 | 0.09 | 0.25 | 0.29 | 0.27 |
| 45 | Fall or jump curtailed by personal fall arrest system | 2 | 0.00 | — | 0.00 | 0.00 | — | 0.00 |
| 49 | Fall, slip, trip, n.e.c. | 1 | 0.00 | — | 0.00 | 0.00 | — | 0.00 |
| 50 | Exposure to harmful substances or environments, unspecified | 8 | 0.00 | — | 0.00 | 0.00 | — | 0.00 |
| 51 | Exposure to electricity | 17 | 0.82 | 1.00 | 0.90 | 0.59 | 0.83 | 0.69 |
| 52 | Exposure to radiation and noise | 9 | 0.22 | 0.50 | 0.31 | 0.44 | 0.80 | 0.57 |
| 53 | Exposure to temperature extremes | 158 | 0.95 | 0.76 | 0.84 | 0.84 | 0.78 | 0.81 |
| 54 | Exposure to air and water pressure change | 0 | 0.00 | — | 0.00 | 0.00 | — | 0.00 |
| 55 | Exposure to other harmful substances | 182 | 0.91 | 0.70 | 0.79 | 0.71 | 0.77 | 0.74 |
| 56 | Exposure to oxygen deficiency, n.e.c. | 1 | 0.00 | — | 0.00 | 0.00 | — | 0.00 |
| 57 | Exposure to traumatic or stressful event, n.e.c. | 21 | 0.57 | 0.48 | 0.52 | 0.38 | 0.73 | 0.50 |
| 59 | Exposure to harmful substances or environments, n.e.c. | 4 | 0.00 | — | 0.00 | 0.00 | 0.00 | 0.00 |
| 60 | Contact with objects and equipment, unspecified | 36 | 0.03 | 1.00 | 0.06 | 0.00 | — | 0.00 |
| 61 | Needlestick without exposure to harmful substance | 6 | 0.33 | 0.67 | 0.44 | 0.33 | 1.00 | 0.50 |
| 62 | Struck by object or equipment, unspecified | 1171 | 0.65 | 0.67 | 0.66 | 0.68 | 0.74 | 0.71 |
| 63 | Struck against object or equipment | 494 | 0.28 | 0.53 | 0.37 | 0.39 | 0.60 | 0.47 |
| 64 | Caught in or compressed by equipment or objects | 369 | 0.78 | 0.45 | 0.57 | 0.72 | 0.54 | 0.62 |
| 65 | Struck, caught, or crushed in collapsing structure, equipment, or material | 1 | 0.00 | — | 0.00 | 0.00 | — | 0.00 |
| 66 | Rubbed or abraded by friction or pressure | 59 | 0.51 | 0.51 | 0.51 | 0.68 | 0.44 | 0.53 |
| 67 | Rubbed, abraded, or jarred by vibration | 8 | 0.00 | — | 0.00 | 0.00 | — | 0.00 |
| 69 | Contact with objects and equipment, n.e.c. | 8 | 0.00 | — | 0.00 | 0.00 | — | 0.00 |
| 70 | Overexertion and bodily reaction | 88 | 0.01 | 0.06 | 0.02 | 0.01 | 0.05 | 0.02 |
| 71 | Overexertion involving outside sources | 2620 | 0.80 | 0.87 | 0.83 | 0.84 | 0.84 | 0.84 |
| 72 | Repetitive motions involving microtasks | 365 | 0.83 | 0.67 | 0.74 | 0.82 | 0.69 | 0.75 |
| 73 | Other exertions or bodily reactions | 745 | 0.45 | 0.63 | 0.53 | 0.52 | 0.66 | 0.58 |
| 74 | Bodily conditions, n.e.c. | 22 | 0.09 | 0.40 | 0.15 | 0.23 | 0.33 | 0.27 |
| 78 | Multiple types of overexertion and bodily reactions | 10 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 79 | Overexertion and bodily reaction and exertion, n.e.c. | 18 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 99 | Non-classifiable | 72 | 0.08 | 0.20 | 0.11 | 0.19 | 0.28 | 0.23 |

Note that, as shown in Table 1, the categories vary greatly in frequency (number of cases), ranging from 1 to 2620. Fig. 1(a) and (b) show that the high-frequency categories have relatively higher F1 scores, and most of the low-frequency categories have inferior F1 scores, except a few categories such as 13 (animal related), 51 (electricity), and 53 (temperature extremes), which have high F1 scores. These categories (13, 51, and 53) are very distinct in their definition and might have very specific words in the narrative that are strong predictors, thereby resulting in high F1 scores even with few cases. The sensitivity and PPV, along with the percentage of cases in the prediction dataset, for the 10 highest-frequency categories are presented graphically in Fig. 2 for the SW model and in Fig. 3 for the TW model. As shown in Figs. 2 and 3, both the SW and TW models performed particularly well for the following categories: 26 (roadway MVA), 42 (fall same level), 71 (overexertion—outside sources), and 72 (repetitive motion), with relatively good sensitivity and PPV values. The TW model performed better than the SW model for categories 41, 42, 43, 62, 63, and 73, with both sensitivity and PPV being higher for the TW model. For categories 26 and 64, the TW model had lower sensitivity but higher PPV as compared to the SW model.

For some categories, such as 43 (fall lower level) and 64 (caught in object), the sensitivity is relatively high but the PPV is low for both SW and TW, which indicates a higher number of false positive cases. This might be happening because these categories are closely related to other, more frequent categories. For example, categories 43 (fall lower level) and 41 (slip trip without fall) each have about 4% of cases and are closely related to category 42 (fall same level), which is a relatively large category with 12% of cases. Because of the considerably higher number of cases of category 42 as compared to categories 41 and 43, the latter often get misclassified as 42 because they have almost the same words in the narrative. Of the 404 cases of category 41 in the prediction set, misclassification as category 42 occurred in 210 instances for the SW model and in 192 instances for the TW model. This effect is also illustrated in Fig. 4, where the percentages of predictions of the true category, related categories, and non-related categories are presented for the SW and TW models for all high-frequency categories except 26 (since 26 is not closely related to any other high-frequency category). Related categories are those where the cause of injury is similar, such as falls (41, 42, and 43), struck (62, 63, and 64), and overexertion (71, 72, and 73).


Fig. 1. (a) Plot of category F1 score vs. number of cases in category for predictions of the SW model. (b) Plot of category F1 score vs. number of cases in category for predictions of the TW model.

Categories 64 (caught in object), 62 (struck by), and 63 (struck against) are closely related and often get misclassified together, as shown in Fig. 4. The TW model yielded higher PPV and sensitivity for categories 62 and 63 as compared to the SW model. Although these improvements were modest, the differentiating two-word sequences may explain to some extent why the TW model performed better. For example, the TW model's consideration of word sequences such as ‘struck against’ or ‘bumped against’ as one term in training set narratives precludes them being considered as the separate single words ‘struck’ and ‘against.’ This is important because the word sequences ‘struck against’ and ‘bumped against’ are very strong predictors of category 63, thus helping the TW model to keep category 63 (with 5% of cases) from being misclassified as the bigger related category 62 (with 12% of cases). Out of the 494 cases of category 63 in the prediction set, it was misclassified as category 62 in 139 cases by the SW model and in 91 cases by the TW model. The high percentage of predictions of the true category and closely related categories by the Naïve Bayes models demonstrates that the models are not making unreasonable mistakes. A challenging aspect is distinguishing two closely related categories, which the model can handle if there are enough training cases available for all closely related categories so that it can learn the fine differences between them. However, many closely related categories do not have enough training cases for the model to learn from, which reinforces the point that human review and manual coding are particularly important for smaller categories.

3.2. Agreement in prediction results of SW and TW models: semi-automated combined strategies

The prediction results from both the SW and TW models were examined to identify the cases where the two models predicted the same event code (also referred to as the ‘agreement approach’). Out of the 10,000 prediction cases, the predictions agreed for 7366 cases, of which 5859 were predicted correctly. The PPV for the 7366 cases where the two models agreed was much higher (80%) than the PPV of the SW (66%) and TW (69%) models. Further analysis showed that the PPV and sensitivity for most of the individual categories were higher when the predictions from the SW and TW models agreed, as compared to the individual SW and TW models. The cases where the predictions from the SW model and TW model did not agree can be filtered for manual review by expert coders (this strategy is referred to as ‘semi-automated combined strategy 1’ henceforth). These disagreement cases are probably more challenging to code, since the two models predicted different categories for the same narrative. This might be explained by various reasons, such as the narrative being complex or ambiguous, or the true category being closely related to another category. We estimated the overall sensitivity of ‘semi-automated combined strategy 1’ as follows. Assuming the manual coding by experts to be

Fig. 2. Percentage of cases, PPV, and sensitivity of SW model for the ten highest frequency categories.


Fig. 3. Percentage of cases, PPV, and sensitivity of TW model for the ten highest frequency categories.

perfect, the disagreement cases were assigned the original event codes as the predicted codes, and the overall sensitivity was calculated to be 85%. The overall sensitivity of predictions using this strategy was considerably higher than that of the SW (66%) or TW (69%) models, and it required only a small portion of cases (26%) to be manually coded. The number of prediction cases, sensitivity, and PPV of individual categories for ‘semi-automated combined strategy 1’ are shown in the form of a confusion matrix in Fig. 5, where (a) each row label represents the true category, which is the BLS-assigned code in our case; (b) each column label represents the predicted category; (c) diagonal entries depict the correct predictions; (d) the off-diagonal entries show the misclassifications; (e) the elements in the last column represent the sensitivity of individual categories; and (f) the elements in the last row represent the PPV of individual categories. The confusion matrix can help identify which categories frequently get misclassified together and can be a useful tool for human coders; a minimal sketch of how such a matrix can be tallied is shown below. Typically, these categories are closely related to each other and have similar injury narratives. For example, category 31 (fire) had 7 cases in the prediction dataset and was misclassified 6 times as category 53 (temperature extremes). It is worth mentioning that sensitivity and PPV for most categories (particularly rare categories) are significantly better for ‘semi-automated combined strategy 1’ as compared to the SW or TW models.
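The sketch below (Python) tallies and prints such a matrix; the layout mirrors Fig. 5, with rows as BLS-assigned codes and columns as predicted codes, but the function name and text layout are our own illustration.

```python
from collections import Counter

def print_confusion_matrix(gold_codes, predicted_codes):
    """Tally (true, predicted) pairs and print rows = true class, columns = predicted."""
    cells = Counter(zip(gold_codes, predicted_codes))
    cats = sorted(set(gold_codes) | set(predicted_codes))
    print("true\\pred " + " ".join(f"{c:>5}" for c in cats))
    for g in cats:
        print(f"{g:>9} " + " ".join(f"{cells[(g, p)]:>5}" for p in cats))
```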

In addition to requiring agreement between the prediction results of the SW and TW models, we also evaluated the effect of applying a minimum prediction strength threshold to the TW model. The approach was to autocode only those cases in the prediction dataset where (a) the two models agreed, and (b) the minimum threshold condition was satisfied; a sketch of this filtering rule follows below. Intuitively, there should be a tradeoff between sensitivity and the portion of cases being autocoded, depending on the prediction strength threshold selected. We observed this tradeoff, as illustrated in Fig. 6, where the Y-axis represents (in percentage terms) the sensitivity of autocoded cases and the portion of cases autocoded, using the ‘agreement approach’ plus a minimum prediction threshold applied to the TW model. When a prediction strength threshold was not applied (i.e., just the ‘agreement approach’), 73.7% of cases in the prediction set were autocoded with a PPV of 79.5%, and 26.3% of cases were filtered for manual coding. At a medium prediction strength threshold of 0.5, 71.8% of cases were autocoded with a PPV of 80.2%. For a very high prediction strength threshold of 0.95, only 55.3% of cases were autocoded, but with a considerably higher PPV of 86.1%. The cases which did not meet the above conditions of agreement between models and minimum prediction strength threshold were filtered to be coded manually by expert coders. This strategy is referred to as ‘semi-automated combined strategy 2’ henceforth. Assuming that the manual coding will be perfect, the cases filtered for manual review were assigned the original BLS codes as the predicted codes, and the sensitivity was measured. For a high prediction strength threshold of 0.95, the overall sensitivity of ‘semi-automated combined strategy 2’ was 92%, which is considerably higher than the overall sensitivity of ‘semi-automated combined strategy 1’ (84.9%) but comes at the cost of manually coding about 18% more cases. The sensitivity and PPV of individual categories in the form of a confusion matrix for ‘semi-automated combined strategy 2’ are available in Appendix 1.

Fig. 4. Percentage of predictions of true category, related categories, and non-related categories.

Fig. 5. Confusion matrix for ‘semi-automated combined strategy 1’.
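The routing rule behind ‘semi-automated combined strategy 2’ can be sketched as below (Python; the tuple representation of a case is an assumption for illustration). Strategy 1 is the special case with the threshold set to zero.

```python
def route_cases(cases, threshold=0.95):
    """Split cases into autocoded and manual-review sets.

    cases: iterable of (sw_code, tw_code, tw_strength) tuples, where
    tw_strength is the TW model's prediction probability for its top choice.
    """
    autocoded, manual = [], []
    for sw_code, tw_code, tw_strength in cases:
        if sw_code == tw_code and tw_strength >= threshold:
            autocoded.append(tw_code)          # models agree and prediction is strong
        else:
            manual.append((sw_code, tw_code))  # filtered for expert review
    return autocoded, manual
```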

3.3. Top k predictions

The overall sensitivity of the top k predictions for the SW model, the TW model, and the SW–TW combined model for different values of k (1, 2, 3, 4, 5) is presented in Table 2 and Fig. 7. The sensitivity values were calculated by counting those cases in the prediction set as

‘correct predictions’ where the BLS-assigned event code was present in the top k predictions. As shown in Table 2, the overall sensitivity increases as the value of k increases for all the models. This effect is larger for smaller values of k; for example, for the SW–TW combined model, there is a steep increase in sensitivity of 14% between the top 1 and top 2 predictions, but the increase between the top 4 and top 5 predictions is only about 2%. This pattern is represented more clearly in Fig. 7. Comparing the performance of the individual models, we can see that the SW–TW combined model performs consistently better than the SW and TW models.

Fig. 6. Sensitivity and percentage of autocoded cases for different prediction strength thresholds.


Table 2. Overall sensitivity of top k predictions for SW, TW, and SW–TW combined models.

| Top k | SW (%) | TW (%) | SW–TW combined (%) |
|---|---|---|---|
| Top 1 | 66.14 | 69.05 | 70.24 |
| Top 2 | 81.70 | 82.82 | 84.16 |
| Top 3 | 87.46 | 87.6 | 89.14 |
| Top 4 | 90.61 | 90.19 | 91.48 |
| Top 5 | 92.37 | 91.41 | 93.3 |

The results show similar levels of performance for the SW and TW models. The TW model performs slightly better than the SW model for smaller values of k, but the SW model performs better for higher values of k. Given the relatively high sensitivity of the top 5 choices, it is reasonable to expect that manual coding performance would improve if the human coders were presented with the list of top 5 options. As mentioned in the Introduction, this approach has been found to be useful for similar tasks in other domains, such as manually assigning ICD codes based on medical notes. The high sensitivity (92.4%) of the top 5 predictions for the SW model indicates that even a simple and computationally efficient SW Naïve Bayes model can be used to present informative top 5 choices to human coders to improve accuracy.

Fig. 7. Overall sensitivity of top k predictions of SW, TW, and SW–TW combined models for different values of k.

3.4. Coding accuracy

In order to examine the consistency of the BLS-assigned and computer-predicted codes, two independent OHSP coders manually coded a sample of 871 cases from the prediction set, which had 752 cases from the discordant group (where the SW and TW models agreed with each other but not with the BLS-assigned codes) and 117 cases from the concordant group (where the SW and TW model predictions agreed with the BLS-assigned codes). For the concordant group, the BLS-assigned codes agreed with the OHSP manually assigned ‘gold standard’ codes for 111 out of 117 cases (i.e., the overall agreement was 94.8%). The agreement percentages for individual categories are presented in Table 3. Such high levels of agreement were expected, as the codes predicted by the SW and TW models agreed with the BLS-assigned codes, indicating that the cases in the concordant group were not very challenging to code.

Results for the discordant group are presented in Table 4, revealing very similar levels of consistency for the BLS-assigned codes and the TM-predicted codes (when the SW and TW model predictions agree). The overall sensitivity was 39.6% for the TM-predicted codes and 42.3% for the BLS-assigned codes. The discordant group cases are probably more challenging to code, since different approaches predicted different categories for the same narrative. Hence, the overall low sensitivity of the BLS-assigned and TM-predicted codes was not surprising. In the discordant group, nine categories had 25 or more cases (12, 41, 42, 43, 62, 63, 64, 71, and 73) and accounted for 77% of the cases. As shown in Table 4, sensitivity and PPV for most of these categories were less than 0.5 for both BLS-assigned and TM-predicted codes. Several findings for the TM-predicted codes suggested potential systematic coding errors/issues for some categories, such as the fall-related categories (41, 42, 43), struck-by (62), struck-against (63), and caught-in (64). TM-predicted codes did well in capturing category 43 (falls to a lower level), with a sensitivity of 73.7%, but there were many false positives (PPV 28.9%). In turn, TM-predicted codes did poorly in capturing category 41 (slips and trips without falls), with a sensitivity of only 14.5%, but a high percentage of the cases predicted as this code were confirmed as true cases (PPV 72.7%). These juxtapositions of performance measures for closely related codes suggest some miscoding within fall-related categories. The same point applies to the struck-related categories 62, 63, and 64. BLS-assigned codes tended to be inconsistent (both sensitivity and PPV below 30%) for category 12 (injury by person—unintentional or intent unknown) and also for category 64 (caught in). Consistency for category 43 (falls to a lower level) was also low, with a sensitivity of 21.1% and a PPV of 42.1%. For most of the rare categories (≤25 cases), BLS-assigned codes had better consistency than TM-predicted codes, but these findings are based on small numbers.

3.4.1. Accuracy of BLS-assigned codes

Based on the above results, we estimated the accuracy of the BLS-assigned codes for this dataset as follows. We divided the prediction set into three parts: concordant (where SW = TW = BLS), discordant (where SW = TW ≠ BLS), and neither-concordant-nor-discordant (where SW ≠ TW). In our prediction dataset of 10,000 cases, the SW and TW models agree for 7366 cases, among which 5859 cases belong to the concordant group and 1507 cases to the discordant group. The remaining 2634 cases belong to the neither-concordant-nor-discordant group. As shown in Table 3 for the concordant group, P(BLS = True | Concordant) = 94.8%. For the discordant group, P(BLS = True | Discordant) = 42.3% and P(TM = True | Discordant) = 39.6%, as shown in Table 4. Using these results, P(BLS = True | Concordant or Discordant) can be estimated by taking a weighted average over the 5859 concordant and 1507 discordant cases. That is,

$$P(\mathrm{BLS{=}True} \mid \mathrm{Concordant\ or\ Discordant}) = \frac{5859 \times 94.8\% + 1507 \times 42.3\%}{7366} = 84.06\%$$

Then, making an optimistic assumption that P(BLS = True | Neither Concordant nor Discordant) = P(BLS = True | Concordant) = 94.8% for the 2634 cases where the SW and TW models did not agree, and taking a weighted average, the overall P(BLS = True) can be estimated as

$$P(\mathrm{BLS{=}True}) = \frac{7366 \times 84.06\% + 2634 \times 94.8\%}{10{,}000} = 86.88\%$$

This optimistic estimation can be treated as the upper bound of accuracy of BLS-assigned codes.


Table 3. Agreement between concordant case event codes and gold standard OHSP manually assigned event codes.

| Event code | Event description | Total number of cases | Number of agreements | % Agree |
|---|---|---|---|---|
| 11 | Intentional injury by person | 1 | 0 | 0% |
| 12 | Injury by person—unintentional or intent unknown | 2 | 2 | 100% |
| 13 | Animal- and insect-related incidents | 2 | 2 | 100% |
| 26 | Roadway incidents involving motorized land vehicle | 3 | 3 | 100% |
| 41 | Slip or trip without fall | 2 | 1 | 50% |
| 42 | Falls on same level | 27 | 27 | 100% |
| 43 | Falls to lower level | 2 | 2 | 100% |
| 51 | Exposure to electricity | 1 | 1 | 100% |
| 53 | Exposure to temperature extremes | 3 | 3 | 100% |
| 55 | Exposure to other harmful substances | 2 | 2 | 100% |
| 62 | Struck by object or equipment, unspecified | 13 | 12 | 92.3% |
| 63 | Struck against object or equipment | 2 | 2 | 100% |
| 64 | Caught in or compressed by equipment/objects | 4 | 3 | 75% |
| 66 | Rubbed or abraded by friction or pressure | 1 | 1 | 100% |
| 70 | Overexertion and bodily reaction, unspecified | 1 | 0 | 0% |
| 71 | Overexertion involving outside sources | 41 | 40 | 97.5% |
| 72 | Repetitive motions involving micro-tasks | 6 | 6 | 100% |
| 73 | Other exertions or bodily reactions | 4 | 4 | 100% |
| Total | | 117 | 111 | 94.8% |

The cases where the prediction results from the SW and TW models do not agree will most likely not be as straightforward as the concordant group. Making a pessimistic assumption that they are as challenging as the discordant group, P(BLS = True | Neither Concordant nor Discordant) = P(BLS = True | Discordant) = 42.3%, and taking a weighted average, the overall P(BLS = True) can be estimated as

$$P(\mathrm{BLS{=}True}) = \frac{7366 \times 84.06\% + 2634 \times 42.3\%}{10{,}000} = 73.06\%$$

This pessimistic estimation can be treated as the lower bound of accuracy of the BLS-assigned codes. From the above calculations, the estimated accuracy of the BLS-assigned codes lies in the range 73%–86.88%.
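For transparency, the weighted-average arithmetic behind these bounds can be reproduced with a few lines (Python; the quantities are exactly those reported above).

```python
# Accuracy bounds for BLS-assigned codes (Section 3.4.1).
# All probabilities are expressed in percent.
n_conc, n_disc, n_rest = 5859, 1507, 2634      # concordant, discordant, SW != TW
p_conc, p_disc = 94.8, 42.3                    # P(BLS=True | ...) from Tables 3 and 4

p_agree = (n_conc * p_conc + n_disc * p_disc) / (n_conc + n_disc)   # ~84.06
upper = ((n_conc + n_disc) * p_agree + n_rest * p_conc) / 10_000    # ~86.88, optimistic
lower = ((n_conc + n_disc) * p_agree + n_rest * p_disc) / 10_000    # ~73.06, pessimistic
print(f"{p_agree:.2f} {upper:.2f} {lower:.2f}")
```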

Table 4. Sensitivity and PPV (%) for TM-predicted and BLS-assigned 2-digit OIICS event codes, using OHSP manually assigned codes as the gold standard. Column ‘N’ is the number of discordant-sample cases with that gold standard event code.

| Event code | Description | N | TM Sensitivity | TM PPV | BLS Sensitivity | BLS PPV |
|---|---|---|---|---|---|---|
| 11 | Intentional injury by person | 18 | 77.8 | 58.3 | 22.2 | 33.3 |
| 12 | Injury by person—unintentional or intent unknown | 25 | 64 | 42.1 | 28 | 20 |
| 13 | Animal- and insect-related incidents | 5 | 40 | 100 | 60 | 100 |
| 24 | Pedestrian vehicle incident | 10 | 30 | 60 | 70 | 100 |
| 26 | Roadway incidents involving motorized land vehicle | 10 | 90 | 40.9 | 10 | 100 |
| 27 | Non-roadway incidents involving motorized land vehicles | 7 | 28.6 | 25 | 57.1 | 30.8 |
| 31 | Fires | 1 | 0 | — | 100 | 100 |
| 40 | Fall, slip, trip, unspecified | 11 | 0 | — | 9.1 | 6.3 |
| 41 | Slip or trip without fall | 110 | 14.5 | 72.7 | 61.8 | 59.6 |
| 42 | Falls on same level | 99 | 54.5 | 33.8 | 33.3 | 51.6 |
| 43 | Falls to lower level | 38 | 73.7 | 28.9 | 21.1 | 42.1 |
| 44 | Jumps to lower level | 5 | 20 | 100 | 80 | 57.1 |
| 45 | Fall or jump curtailed by personal fall arrest system | 1 | 0 | — | 100 | 100 |
| 51 | Exposure to electricity | 2 | 0 | — | 100 | 100 |
| 52 | Exposure to radiation and noise | 2 | 0 | — | 100 | 100 |
| 53 | Exposure to temperature extremes | 2 | 50 | 11.1 | 50 | 33.3 |
| 55 | Exposure to other harmful substances | 11 | 63.6 | 63.6 | 27.3 | 50 |
| 57 | Exposure to traumatic or stressful event, n.e.c. | 1 | 100 | 100 | 0 | 0 |
| 60 | Contact with objects and equipment, unspecified | 7 | 0 | — | 28.6 | 16.7 |
| 62 | Struck by object or equipment, unspecified | 57 | 35.1 | 28.2 | 54.4 | 41.3 |
| 63 | Struck against object or equipment | 51 | 11.8 | 60 | 66.7 | 43 |
| 64 | Caught in or compressed by equipment or objects | 25 | 72 | 29.5 | 20 | 26.3 |
| 66 | Rubbed or abraded by friction or pressure | 13 | 84.6 | 91.7 | 15.4 | 66.7 |
| 67 | Rubbed, abraded, or jarred by vibration | 2 | 0 | — | 0 | 0 |
| 70 | Overexertion and bodily reaction | 22 | 0 | — | 9.1 | 8.3 |
| 71 | Overexertion involving outside sources | 113 | 53.1 | 49.2 | 37.2 | 58.3 |
| 72 | Repetitive motions involving microtasks | 18 | 66.7 | 35.3 | 22.2 | 21.1 |
| 73 | Other exertions or bodily reactions | 60 | 28.3 | 42.5 | 67.1 | 37 |
| 74 | Bodily conditions, n.e.c. | 6 | 0 | — | 50 | 60 |
| 78 | Multiple types of overexertion and bodily reactions | 11 | 0 | — | 0 | 0 |
| 99 | Nonclassifiable | 9 | 0 | 0 | 66.7 | 35.3 |
| Total | | 752 | 39.6 | — | 42.3 | — |


3.4.2. Accuracy of semi-automated coding

If ‘semi-automated combined strategy 1’ is used, the proposed Bayesian decision support system will autocode the cases where the predictions from the SW and TW models agree, and the disagreement cases will be coded manually by expert human coders. We have assumed that these coders are well-trained experts and would also be helped by a list of the top 5 prediction choices and the confusion matrix. The accuracy of this strategy can be estimated as follows. As shown in Table 3, P(TM = True | Concordant) = 94.8%, and P(TM = True | Discordant) = 39.6%, as presented in Table 4. Then, taking the weighted average over the concordant (5859) and discordant (1507) cases, and assuming that the manual coding done by the expert human coders (on the remaining 2634 cases) would be perfect, P(TM = True) can be estimated as

$$P(\mathrm{TM{=}True}) = \frac{5859 \times 94.8\% + 1507 \times 39.6\% + 2634 \times 100\%}{10{,}000} = 87.85\%$$

which is about 1% better than the estimated upper bound of accuracy (86.88%) of the BLS-assigned codes and involves a significantly smaller amount of manual coding (26% of cases). Instead of assuming that the manually assigned codes will be 100% correct, even if we assume their accuracy to be 94.8%, which is P(BLS = True | Concordant), the overall sensitivity can be estimated as

$$P(\mathrm{TM{=}True}) = \frac{5859 \times 94.8\% + 1507 \times 39.6\% + 2634 \times 94.8\%}{10{,}000} = 86.48\%$$

which is almost the same as the estimated upper bound of accuracy of the BLS-assigned codes (86.88%), with the expert coders having to code only 26% of the cases. ‘Semi-automated combined strategy 2’ can also be used by selecting different levels of prediction strength thresholds, but it will be difficult to estimate the overall accuracy for that approach, since the manual coding exercise was carried out only for the concordant and discordant groups. However, it is intuitive that a higher prediction strength threshold would filter more cases for manual review, and assuming the accuracy of expert coders to be at least 94.8%, the overall accuracy would be slightly more than that of ‘semi-automated combined strategy 1’. The estimated accuracy values suggest that a Bayesian decision support system can be used for coding occupational injury data with good accuracy while requiring a reduced amount of manual coding.

which is almost the same as the estimated upper bound of accuracy of BLS-assigned codes (86.8%) with the expert coders having to code only 26% cases. The ‘semi-automated combined strategy 2’ can also be used by selecting different levels of prediction strength thresholds, but it will be difficult to estimate the overall accuracy for that approach since the manual coding exercise was carried out only for concordant and discordant groups. However, it is intuitive that a higher prediction strength threshold would filter more cases for manual review, and assuming the accuracy of expert coders to be at least 94.8%, the overall accuracy would be slightly more than that of ‘semi-automated combined strategy 1’. The estimated accuracy values suggest that a Bayesian decision support system can be used for coding occupational injury data with good accuracy while requiring reduced amount of manual coding. 4. Conclusions Results from this study show that a semi-automated Bayesian decision support system that autocodes a large portion of the data with good accuracy and leaves a small portion of cases to be coded by a few well-trained expert coders may yield comparable accuracy to an entirely manual coding system for coding occupational injury data. Hence, use of such a Bayesian decision support system seems promising for coding event information based on injury narratives from large occupational injury databases. The SW and TW Naïve Bayes models used for building the decision support system yielded reasonably good prediction performance, with the TW model being slightly better. The PPV of prediction cases where the SW and TW models agreed was higher than individual models and was similar to levels reported in other studies applying Naïve Bayes models for coding injury narratives in workers' compensation records. Use of an additional prediction strength threshold along with agreement in prediction results of SW and TW model exhibited a tradeoff between the accuracy of autocoding and the proportion of cases being autocoded,

but even with a high threshold of 0.95, more than half of the cases (55.3%) could be autocoded and, assuming perfect manual coding of the filtered cases, an overall sensitivity of 92% can be achieved. Among the event code categories, rare categories that are closely related to other, larger categories remain a challenge to autocode, as these categories are consistently misclassified by different models with high prediction strengths. A possible approach to handling rare categories could be to use a confusion matrix as a filtering tool to examine the cases that can potentially be misclassified by autocoding.

We also examined the consistency of BLS-assigned and TM-predicted event codes by comparing them with codes assigned by two independent OHSP coders and observed some discrepancies. The high level of agreement for the concordant group, while based on small numbers, was reassuring. The higher inconsistency in coding of the discordant group for both the TM-predicted and BLS-assigned codes provided insight into some specific coding challenges to be addressed in both manual coding and autocoding.

This study is also important because it demonstrates the feasibility of autocoding SOII data according to the new version 2.01 of the OIICS coding scheme, which has about 45 two-digit event codes, more than in most of the previous studies in this area. One of the strengths of this study was the large size of the training data set compared to those used in prior studies of coding occupational injury data. Notably, even with this larger training set, there were very few training cases for rare event codes.

The effect of incorporating coded information on the nature of injury and body part into the model(s) should also be explored. Future work should also be directed toward using other machine learning models, such as support vector machines and logistic regression, in combination with Bayesian methods. Each of these models is based on different underlying mathematical principles, and looking at the agreement between their prediction results may lead to improved performance for individual categories as well as overall PPV.

5. Practical applications

Occupational injury survey data are often used for surveillance and other purposes by various groups, such as government agencies, policymakers, safety standards writers, insurance companies, and manufacturers of safety equipment. Use of the proposed Bayesian decision support system may result in faster and comparably accurate coding of occupational injury data as compared to manual coding, and it will also provide assistance to manual coders. Accurately and timely coded occupational injury data can help in quickly identifying the prevalent causes of injuries at the workplace, and thus in planning remedial action or revising safety standards. Early implementation of revised safety measures by organizations can prevent occupational injuries and save lives.

Acknowledgements

This study was conducted at the Massachusetts Department of Public Health Occupational Health Surveillance Program (OHSP) in collaboration with Purdue University. It was funded through a cooperative agreement with the Bureau of Labor Statistics: OS 24725-13-75-J-25. The authors would like to express their gratitude to Sangwoo Tak, ScD, MPH, and James R. Laing, B.S., from the Massachusetts Department of Public Health for their contributions to this study.


Appendix 1. Confusion matrix for ‘semi-automated combined strategy 2’ (autocoding of cases where (a) prediction results of SW and TW models agree, and (b) prediction strength of TW model > 0.95; manual coding of remaining cases). [Rows: BLS-assigned event codes; columns: predicted event codes.]

References

Bertke, S. J., Meyers, A. R., Wurzelbacher, S. J., Measure, A., Lampl, M. P., & Robins, D. (2016). Comparison of methods for auto-coding causation of injury narratives. Accident Analysis & Prevention, 88, 117–123. http://dx.doi.org/10.1016/j.aap.2015.12.006

Bondy, J., Lipscomb, H., Guarini, K., & Glazner, J. E. (2005). Methods for using narrative text from injury reports to identify factors contributing to construction injury. American Journal of Industrial Medicine, 48(5), 373–380. http://dx.doi.org/10.1002/ajim.20228

Chen, L., Vallmuur, K., & Nayak, R. (2015). Injury narrative text classification using factorization model. BMC Medical Informatics and Decision Making, 15(Suppl. 1), S5. http://dx.doi.org/10.1186/1472-6947-15-S1-S5

Corns, H. L., Marucci, H. R., & Lehto, M. R. (2007). Development of an approach for optimizing the accuracy of classifying claims narratives using a machine learning tool (Textminer[4]). In M. J. Smith & G. Salvendy (Eds.) (pp. 411–416). Berlin Heidelberg: Springer. Retrieved from http://link.springer.com/chapter/10.1007/978-3540-73345-4_47

Larkey, L. S., & Croft, W. B. (1996). Combining classifiers in text categorization. New York, NY, USA: ACM, 289–297. http://dx.doi.org/10.1145/243199.243276

Lehto, M., Marucci-Wellman, H., & Corns, H. (2009). Bayesian methods: A useful tool for classifying injury narratives into cause groups. Injury Prevention, 15(4), 259–265. http://dx.doi.org/10.1136/ip.2008.021337

Leman, S., & Lehto, M. R. (2003). Interactive decision support system to predict print quality. Ergonomics, 46(1–3), 52–67. http://dx.doi.org/10.1080/00140130303531

Marucci-Wellman, H., Lehto, M., & Corns, H. (2011). A combined Fuzzy and Naive Bayesian strategy can be used to assign event codes to injury narratives. Injury Prevention, 17(6), 407–414. http://dx.doi.org/10.1136/ip.2010.030593

Marucci-Wellman, H. R., Lehto, M. R., & Corns, H. L. (2015). A practical tool for public health surveillance: Semi-automated coding of short injury narratives from large administrative databases using Naïve Bayes algorithms. Accident Analysis and Prevention, 84, 165–176. http://dx.doi.org/10.1016/j.aap.2015.06.014

Measure, A. C. (2014). Automated coding of worker injury narratives (Joint Statistical Meetings 2014, Government Statistics Section). Boston, MA, USA: U.S. Bureau of Labor Statistics. Retrieved from http://www.bls.gov/osmr/pdf/st140040.pdf

Noorinaeini, A., & Lehto, M. R. (2006). Hybrid singular value decomposition: A model of human text classification. International Journal of Human Factors Modelling and Simulation, 1(1), 95–118. Retrieved from http://inderscience.metapress.com/content/4JDTJNQ7EJDNGUD1

Northwood, J. M., Sygnatur, E. F., & Windau, J. A. (2012). Updated BLS occupational injury and illness classification system. Monthly Labor Review, 19. Retrieved from http://wwwn.cdc.gov/wisards/oiics/Doc/UpdatedOIICSNorthwoodetal2012.pdf

Occupational Safety and Health Statistics Program (2014, January). Retrieved March 14, 2014, from http://www.mass.gov/lwd/labor-standards/occupational-safety-andhealth-statistics-program/

Rizzo, S. G., Montesi, D., Fabbri, A., & Marchesini, G. (2015). ICD code retrieval: Novel approach for assisted disease classification. In N. Ashish & J.-L. Ambite (Eds.) (pp. 147–161). Springer International Publishing. Retrieved from http://link.springer.com/chapter/10.1007/978-3-319-21843-4_12

Talbot, J., Lee, B., Kapoor, A., & Tan, D. S. (2009). EnsembleMatrix: Interactive visualization to support machine learning with multiple classifiers. New York, NY, USA: ACM, 1283–1292. http://dx.doi.org/10.1145/1518701.1518895

Taylor, J. A., Lacovara, A. V., Smith, G. S., Pandian, R., & Lehto, M. (2014). Near-miss narratives from the fire service: A Bayesian analysis. Accident Analysis & Prevention, 62, 119–129. http://dx.doi.org/10.1016/j.aap.2013.09.012

U.S. Department of Labor (2005). OSHA recordkeeping handbook. Retrieved from https://www.wisconsin.edu/workers-compensation/download/frequently_used_guidance/OSHA1904Recordkeepingpub3245rev.pdf

U.S. Department of Labor (2012). Bureau of Labor Statistics, Occupational injury and illness classification manual, version 2.01. Retrieved from http://wwwn.cdc.gov/wisards/oiics/Doc/OIICS Manual 2012 v201.pdf

Vallmuur, K. (2015). Machine learning approaches to analysing textual injury surveillance data: A systematic review. Accident Analysis & Prevention, 79, 41–49. http://dx.doi.org/10.1016/j.aap.2015.03.018

Wang, S., & Manning, C. D. (2012). Baselines and bigrams: Simple, good sentiment and topic classification. Stroudsburg, PA, USA: Association for Computational Linguistics, 90–94. Retrieved from http://dl.acm.org/citation.cfm?id=2390665.2390688

Wellman, H. M., Lehto, M. R., Sorock, G. S., & Smith, G. S. (2004). Computerized coding of injury narrative data from the National Health Interview Survey. Accident Analysis & Prevention, 36(2), 165–171. http://dx.doi.org/10.1016/S0001-4575(02)00146-X

Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). Morgan Kaufmann. Retrieved from https://books.google.com/books?id=QTnOcZJzlUoC

Zhu, W., & Lehto, M. R. (1999). Decision support for indexing and retrieval of information in hypertext systems. International Journal of Human Computer Interaction, 11(4), 349–371. http://dx.doi.org/10.1207/S15327590IJHC1104_5


Gaurav Nanda is a Ph.D. candidate in the School of Industrial Engineering at Purdue University. His research focuses on using machine learning methods and natural language processing techniques, along with domain knowledge-based rules, to improve the autocoding accuracy of injury data for public health and occupational injury surveys.

Kathleen Grattan, MPH is an applied epidemiologist with the Massachusetts Department of Public Health (MDPH) in the Occupational Health Surveillance Program and has coordinated a wide range of surveillance and research projects that involve the management, analysis, and summary of large administrative and survey datasets, including hospital data, workers' compensation data, and data from the Survey of Occupational Injuries and Illness (SOII).

MyDzung Chu, MSPH is an epidemiologist with the Massachusetts Department of Public Health (MDPH) in the Occupational Health Surveillance and the Health Survey Programs. She started at MDPH as a CDC/CSTE Applied Epidemiology Fellow and has led geographic analyses of work-related injuries using ACS data and analysis of workers' compensation data for local public sector workers, and she also coordinates the state's Youth Health Survey.

Letitia Davis, ScD, EdM is the Director of the Occupational Health Surveillance Program, Massachusetts Department of Public Health (MDPH), and has worked over many years to develop state-based surveillance systems for work-related injuries, illnesses, and hazards. She has overseen the formation of a comprehensive surveillance system for fatal occupational injuries, the Massachusetts Sharps Injury Surveillance System, a surveillance system for work-related asthma, the Massachusetts Occupational Lead Registry, and a model surveillance system for work-related injuries to young workers.

Mark Lehto, Ph.D., is a Professor at the School of Industrial Engineering and the Director of the Industrial Engineering Discovery-to-Delivery Center at Purdue University. His research interests include text mining, safety engineering, decision support systems, and human factors. He has taught and developed several different undergraduate and graduate courses within the School of Industrial Engineering, including classes on Safety Engineering, Engineering Economics, Industrial Ergonomics, and Work Design.