Procedia Computer Science 156 (2019) 150–157
8th International Young Scientist Conference on Computational Science
Diagnoses Detection in Short Snippets of Narrative Medical Texts

Aleksei Dudchenko a,b,*, Matthias Ganzinger b, Georgy Kopanitsa c

a Tomsk Polytechnic University, Tomsk, Russia
b Institute of Medical Biometry and Informatics, Heidelberg University, Heidelberg, Germany
c ITMO University, Saint-Petersburg, Russia

* Corresponding author. Tel.: +49-6221-56-5143; fax: +49-6221-56-4997.
E-mail address: [email protected]
Abstract

Data extraction from narrative medical texts is a significant task to enable secondary use of medical data. Supervised learning algorithms show good results in natural language processing (NLP) tasks. We have developed an NLP framework based on supervised machine learning for entity extraction from medical texts. The framework is language independent and entity independent as long as an appropriately labeled dataset is given. The framework is based on vector representation of words and a neural network as a classifier. We have trained and evaluated the framework on two different text corpora: diagnoses paragraphs written in German and medical records written in Russian. The neural network hyperparameters were adjusted for every dataset to get better results. Finally, classification accuracy, standard deviation, and standard error were calculated for both network models using 10-fold cross-validation. The obtained accuracy is 97.64% for Russian texts and 96.81% for German ones.

© 2019 The Authors. Published by Elsevier Ltd.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the 8th International Young Scientist Conference on Computational Science.

Keywords: natural language processing; medical records
1. Introduction

Electronic medical records (EMR) often provide fields for free narrative text along with fields for structured and coded information. As a consequence, 80% of all healthcare data is unstructured, including free texts [1]. Thus, the extraction of data from medical texts is a significant task to enable secondary use, including analysis of existing huge
amounts of information stored in medical texts and to support the treatment of current patients by decision support systems.

Ford et al. [2] report on groups of approaches to extract information from medical records. They describe keyword searches, rule-based algorithms, and machine learning (ML) methods. Only 6% of the reviewed papers applied ML. At the same time, natural language processing (NLP) based on ML has outperformed other approaches in different fields, from sentiment analysis [3–5] and text classification [6,7] to image annotation [8] and machine translation [9].

The aim of our work is to develop an NLP framework based on supervised machine learning for entity extraction from medical narrative text. The framework must be language independent and entity independent as long as an appropriately labeled dataset is given.

2. Methods

2.1. Collection of a labeled dataset

We have two text corpora: an archive of medical records written in Russian and diagnoses paragraphs from clinical reports written in German.

To collect a labeled dataset, we processed an archive of 100 de-identified medical records from a public hospital in Saint-Petersburg, Russia. All records are written in Russian, were created between 2015 and 2017, and contain patient demographics, diagnoses, complaints, laboratory and instrumental test results, and treatments. We searched for an entity occurring in the text and added the corresponding snippet of text together with the label of the entity to the dataset. We limited the desired entities to a list of five disorders and five findings from the SNOMED classification (Table 1). The dataset is a comma-separated file (.csv) with the following structure: every line represents one labeled example in the form "entity, snippet of text containing the entity". An example is given in Figure 1.
Figure 1 – An example of the dataset structure
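For illustration, a minimal Python sketch of reading such a file is given below; the file name and the example rows shown in the comment are hypothetical stand-ins for the original Russian snippets, while the entity labels are taken from Table 1.

```python
import csv

# Hypothetical file in the "entity, snippet" structure of Figure 1, e.g.:
#   Headache (finding),"complains of recurrent headache in the morning"
#   Fatigue (finding),"marked fatigue and weakness over the last two weeks"
labels, snippets = [], []
with open("russian_dataset.csv", newline="", encoding="utf-8") as f:
    for entity, snippet in csv.reader(f):
        labels.append(entity)
        snippets.append(snippet)
```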
We also used an annotated text corpus of main diagnosis paragraphs extracted from clinical reports of a research database from Heidelberg University Hospital [10]. The corpus comprises 737 instances in total and four different diagnoses. The corpus contains information not only about the diagnosis itself, but also about details such as the date of initial diagnosis, the class of immunoglobulin, staging, and creatinine level. Moreover, the same instance might include up to three diagnoses. We selected only two items for every instance: the main diagnosis paragraph itself and the diagnosis mentioned first. We then converted the corpus to the same structure as described above for the Russian dataset; the whole paragraphs act as snippets and the diagnoses act as entities. A sketch of this conversion is given below.
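The conversion step could look roughly as follows; the file and column names are hypothetical, since the corpus of [10] is only assumed here to be available as a table with one row per instance.

```python
import pandas as pd

# Hypothetical input: one row per clinical report, with the main diagnosis
# paragraph and the first-mentioned diagnosis as separate columns.
corpus = pd.read_csv("heidelberg_diagnosis_paragraphs.csv")

# Keep only the two items used per instance and write them in the same
# "entity, snippet" structure as the Russian dataset.
converted = corpus[["first_diagnosis", "main_diagnosis_paragraph"]].dropna()
converted.to_csv("german_dataset.csv", index=False, header=False)
```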
2.2. Construction of a Framework for NLP

The framework has been constructed using Python and the open source library Keras [11,12]. The parts of the NLP framework are described in the following paragraphs.

Tokenization and integer vector representation of words and samples. Tokenization is the process of segmenting text into tokens, which in our case are words. We also deleted all non-letter symbols. We did not perform lemmatization and kept the original forms of words, including misspellings, acronyms, and abbreviations. We coded every unique word as an integer number.

Equalization of the sample length. The collected samples have the same length in terms of the number of symbols, but not in terms of the number of words. Since the next steps deal with words and their representations, we had to align the number of tokens per sample. Therefore, we took the size of the largest sample as the basis. All other samples were padded with zeroes at the end.

Word embedding. To obtain word embeddings we used the Keras embedding layer.

One-hot vector representation of labels. All labels from the dataset were encoded as one-hot vectors. A one-hot vector is a vector with a single component set to '1' and all the other components set to '0'. The position of the '1' value corresponds to the ordinal number of the label. For example, for the first label, the '1' value is located in the first position and '0' values in the remaining positions. The length of the vectors corresponds to the total number of labels in the dataset.

Sample classification. After the samples were embedded and the labels encoded, we applied a multi-layer perceptron (MLP) for the classification. Every encoded label represents a class (i.e., an extracted entity) and every sample refers to one class depending on which entity it contains (a short code sketch of these preprocessing steps is given below).

2.3. The model training and evaluation

We trained the model with the Adam optimizer [13], since it outperforms other optimization algorithms in many cases [14] and at the same time offers benefits such as computational efficiency and low memory requirements [13]. As a loss function, or optimization score function, we applied the 'categorical_crossentropy' loss function built into Keras. This function, along with the softmax activation function of the output layer, is used for multiclass classification tasks. The value of the softmax activation function ranges between 0 and 1 and the sum of all values is 1. This allows the outputs of the network to be used as a probability distribution. The difference between the obtained and desired distributions is then quantified by a cross-entropy function.

The number of training epochs is an important hyperparameter of any network training process. Too few epochs lead to underfitting, while too many cause overfitting. Both overfitting and underfitting are easily seen on a training plot. We plotted the training process for 300 epochs to see when the network performance stops increasing. To plot the training process, we created a new model with the same hyperparameters (type and number of layers and neurons, optimizer, and loss function), but we randomly split the dataset into a training part (70%) and a test part (30%). A validation dataset was derived as a subset of 30% of the training data.

K-fold cross-validation is a common way to verify a model. When using this technique, the dataset is repeatedly split into k folds. The model is trained on k−1 folds and evaluated on the held-out fold. This process is repeated for each fold [15].
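Summarizing the preprocessing steps of Section 2.2, a minimal sketch using the Keras text utilities might look as follows; the variable names are illustrative, and the filter string only approximates the removal of non-letter symbols.

```python
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

# snippets: list of text samples, labels: list of entity names (see Figure 1)
tokenizer = Tokenizer(filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n0123456789')
tokenizer.fit_on_texts(snippets)                   # every unique word -> integer
sequences = tokenizer.texts_to_sequences(snippets)

max_len = max(len(s) for s in sequences)           # size of the largest sample
X = pad_sequences(sequences, maxlen=max_len, padding="post")   # zeroes at the end

label_index = {name: i for i, name in enumerate(sorted(set(labels)))}
y = to_categorical([label_index[name] for name in labels])     # one-hot labels
```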
We used ten folds and calculated the accuracy according to formula (1):

CVAcc = \frac{1}{k} \sum_{j=1}^{k} Acc_j                                    (1)
Every time we compiled a neural network, the initial trainable parameters were set randomly. Consequently, different results might be obtained after training the model on the same data. Thus, to get a more accurate estimation of the model, we trained and evaluated the model with 10-fold cross-validation ten times and calculated the grand mean of those estimations (2). In fact, we trained the network 100 times: 10 times we reinitialized the trainable parameters, and for every initialization we trained the network with 10-fold cross-validation.

GMAcc = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{k_i} \sum_{j=1}^{k_i} Acc_{j,i}    (2)
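The procedure behind formulas (1) and (2) can be sketched as below, assuming a build_model() function that returns a freshly compiled network (such a function is sketched after Table 2) and preprocessed arrays X and y as above; metrics=['accuracy'] is assumed at compile time so that evaluate() returns the accuracy.

```python
import numpy as np
from sklearn.model_selection import KFold

def repeated_cv_accuracy(build_model, X, y, n_repeats=10, n_folds=10, epochs=80):
    """Grand mean (2) of the per-repeat k-fold cross-validation accuracies (1)."""
    cv_accs = []
    for _ in range(n_repeats):                        # re-initialize trainable parameters
        fold_accs = []
        for train_idx, test_idx in KFold(n_splits=n_folds, shuffle=True).split(X):
            model = build_model()                     # fresh, randomly initialized network
            model.fit(X[train_idx], y[train_idx], epochs=epochs, verbose=0)
            _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
            fold_accs.append(acc)
        cv_accs.append(np.mean(fold_accs))            # CVAcc_i, formula (1)
    return np.mean(cv_accs), cv_accs                  # GMAcc, formula (2)
```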
We also calculated the standard deviation and the standard error according to (3) and (4):

SD = \sqrt{\frac{\sum_{i=1}^{n} (CVAcc_i - GMAcc)^2}{n}}                    (3)

SE = \frac{SD}{\sqrt{n}}                                                    (4)

3. Results
3.1. Collection of a labeled dataset

The dataset comprises 378 instances for 10 entities. Five entities each were derived from the diagnosis sections and the complaints sections of the medical records. All entities have a corresponding SNOMED code. Table 1 shows how many instances for each entity have been collected in the dataset.

Table 1 – Number of entities in the datasets
#    Entities to extract                                 SNOMED code            Samples
Russian dataset
1    Benign essential hypertension (disorder)            1201005                75
2    Atherosclerosis of aorta (disorder) +
     Atherosclerosis of coronary artery (disorder)       81817003 + 443502000   46
3    Chronic ischemic heart disease (disorder)           413838009              39
4    Diabetes mellitus type 2 (disorder)                 44054006               71
5    Venous varices (disorder)                           128060009              21
6    Fatigue (finding)                                   84229001               52
7    Headache (finding)                                  25064002               34
8    Lightheadedness (finding)                           386705008              13
9    Body mass index 30+ - obesity (finding)             162864005              14
10   Edema of lower extremity (finding)                  102572006              13
     Total samples in the Russian dataset                                       378
German dataset
1    Multiple myeloma (MM)                               109989006              401
2    Monoclonal gammopathy of undetermined
     significance (MGUS)                                 277577000              238
3    Smoldering MM                                       413587002              69
4    Solitary plasmacytoma of bone                       426336007              11
     Total samples in the German dataset                                        719
For the second dataset, after removing all missing values we obtained 719 instances. The numbers of instances for every diagnosis are given in Table 1.

3.2. Construction of a framework for NLP

After tokenization we obtained a vocabulary of 371 words for the Russian dataset and 600 words for the German one. The longest sample has 9 words for the first and 57 words for the second dataset. All other samples were padded to these lengths. At this step we have samples in the form of vectors with integers as representations of words and zeroes at the end, so that every vector has a length of 9 or 57, respectively. This is the preprocessing necessary for the network.

The network is made up of four layers: an embedding layer, a flatten layer, and two dense layers. The first, embedding layer has the size of the vocabulary, which for our samples equals 371 and 600, respectively. The output of the embedding layer is a 2-dimensional vector of size 9×8 or 57×8. Here, 9 and 57 are the lengths of the samples and 8 is the size we have set for the new embedding of every word in the samples. To process the obtained 2-dimensional representations through the dense (fully connected) layers and make predictions, we flatten the vectors into one dimension of 72 (9×8) or 456 (57×8) nodes. In fact, the dense layers in our network form a multilayer perceptron (MLP). We implemented the first dense layer with 25 neurons and a ReLU activation function, and the output layer according to the number of classes: 10 neurons for the Russian and 4 neurons for the German dataset. As mentioned in the Methods section, the output neurons have a softmax activation function. In total, the networks have 5053 and 16329 trainable parameters, respectively. Summaries of the networks are given in Table 2.

Table 2 – The networks structures
Dataset   Layer                        Input/output shape   Param #
Russian   Embedding                    (371)/(9, 8)         2968
          Flatten                      (9, 8)/(72)          0
          Dense                        (72)/(25)            1825
          Dense                        (25)/(10)            260
          Total trainable parameters                        5053
German    Embedding                    (600)/(57, 8)        4800
          Flatten                      (57, 8)/(456)        0
          Dense                        (456)/(25)           11425
          Dense                        (25)/(4)             104
          Total trainable parameters                        16329
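A minimal Keras sketch that reproduces the Russian-dataset network in Table 2 (5053 trainable parameters) is shown below; for the German dataset the vocabulary size, sample length, and number of classes would be 600, 57, and 4, respectively. The layer sizes are taken from Table 2, and everything else follows the description in Sections 2.2 and 2.3.

```python
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

def build_model(vocab_size=371, max_len=9, embedding_dim=8, n_classes=10):
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim,
                        input_length=max_len))         # 371*8 = 2968 parameters
    model.add(Flatten())                               # (9, 8) -> (72)
    model.add(Dense(25, activation="relu"))            # 72*25 + 25 = 1825 parameters
    model.add(Dense(n_classes, activation="softmax"))  # 25*10 + 10 = 260 parameters
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

build_model().summary()   # 5053 trainable parameters in total
```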
3.3. The model training and evaluation

We plotted the training process of both networks over 300 epochs (Figure 2) and found that 80 epochs are enough for training on the Russian dataset, since the accuracy on the test subset does not increase anymore. For the German dataset, 35 epochs are sufficient.
Figure 2 – Plots of the network accuracy on train and test datasets. The Russian dataset is on the left and the German one on the right.
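The curves in Figure 2 can be reproduced roughly as sketched below, assuming the build_model() function from the previous section and a prepared 70/30 train/test split (X_train, y_train, X_test, y_test); depending on the Keras version, the history keys may be 'acc' and 'val_acc' instead of 'accuracy' and 'val_accuracy'.

```python
import matplotlib.pyplot as plt

model = build_model()
history = model.fit(X_train, y_train, epochs=300,
                    validation_data=(X_test, y_test), verbose=0)

plt.plot(history.history["accuracy"], label="train")       # 'acc' in older Keras
plt.plot(history.history["val_accuracy"], label="test")    # 'val_acc' in older Keras
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```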
Table 3 provides the obtained values of grand mean accuracy, standard deviation, and standard error according to formulas (2), (3), and (4). The values are given for both the training and the test subsets.

Table 3 – Scores

Score                 Russian   German
GM Accuracy (train)   99.32%    99.02%
GM Accuracy (test)    97.64%    96.81%
SD (train)            0.07      0.07
SD (test)             0.75      0.35
SE (train)            0.02      0.02
SE (test)             0.23      0.11
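The scores in Table 3 follow directly from the ten per-repeat cross-validation accuracies; a short sketch based on the repeated_cv_accuracy() helper sketched in Section 2.3 is given below.

```python
import numpy as np

gm_acc, cv_accs = repeated_cv_accuracy(build_model, X, y)   # formulas (1) and (2)
cv_accs = np.asarray(cv_accs)

sd = np.sqrt(np.mean((cv_accs - gm_acc) ** 2))   # formula (3)
se = sd / np.sqrt(len(cv_accs))                  # formula (4)

print(f"GM Accuracy: {gm_acc:.2%}, SD: {sd:.4f}, SE: {se:.4f}")
```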
4. Discussion

We have shown that our framework can be efficiently applied to different languages, as long as an appropriately labeled dataset is given. The differences in the networks' hyperparameters and training parameters are caused by distinctions in the datasets. For example, the German dataset has an unlimited sample length, whereas when collecting the Russian dataset the length of a sample was limited to 50 symbols. As a consequence, the maximum number of words is 9, against 57 words in the German dataset. This results in a different number of neurons on the input layer of the network and, consequently, a different number of trainable parameters. Another significant distinction is the vocabulary size: there are 371 unique words in the first dataset and 600 in the second. This also causes changes in the networks. Nonetheless, despite those differences, we managed to only slightly modify the network model and adjust the hyperparameters to get high accuracy for both languages.

Regarding the framework's applicability to real-world conditions, the German dataset provides full paragraphs from every record and all possible diagnoses. The only necessary preprocessing is parsing the desired paragraph from a new medical record. In the case of our Russian dataset, the samples are not full paragraphs, but snippets of text with a fixed
size and not all possible diagnoses are provided. This means that to implement the framework for medical records like the ones we used to collect our dataset, several preprocessing steps are necessary: parsing the sections of the record and cutting them into parts of a certain size. Some parts contain desired entities and some parts do not contain any entities for extraction at all.

We applied an MLP, but there are other types of neural networks which have shown good results in similar tasks. Convolutional neural networks (CNN) [16,17] were designed for image data classification, but have also been successfully applied in NLP [5,6,18]. Recurrent neural networks (RNN) [19] were developed to tackle the drawback of feedforward neural networks (like the MLP we use) that only a fixed number of words can be taken as input; an RNN can work with sequences of words of any length [20]. Long short-term memory (LSTM) [21] is a type of RNN and shows high performance on NLP tasks [22]. A good comparative study of several types of neural networks for NLP has been done by Yin et al. [23]. The results of the studies mentioned above suggest modifying our framework by adding such specific layers to the network architecture, evaluating it, and comparing the results in future work.

We used accuracy as the metric to evaluate the classification performance of the network. Accuracy is defined as the ratio of correct classifications to the total number of classifications made. It does not tell us which samples were misclassified, or whether there is a correlation between misclassified samples and classes, and so on. This is especially important when we have an unbalanced dataset like the German one (11 samples for solitary plasmacytoma of bone and 401 for multiple myeloma). To get a more informative assessment, we are going to use scores based on the confusion matrix, such as the F1 score and AUC-ROC. A precise analysis of every misclassified sample might also shed light on the framework's weaknesses.

Future development of our approach includes extending the entity list. This will increase the number of diagnoses available for extraction. The biggest obstacle so far is the lack of labeled datasets. Development of a tool to enable automated collection of training datasets with samples from real medical records is an important step for the proposed approach. We are going to continue creating such a tool, which will be based on dictionaries of possible expressions of diagnoses.

5. Conclusions

We proposed a framework based on a neural network model as a classifier for entity extraction from medical narrative text regardless of language. The framework can be used for structuring medical records to enable intelligent analysis of data, interoperability, and more efficient data storage. We tested the framework on two datasets in different languages, adjusted the model according to the dataset properties, and the models achieved accuracies of 97.64% and 96.81%.

Acknowledgements

This work was financially supported by the government of the Russian Federation through the ITMO fellowship and professorship program. The study was supported by the Russian Foundation for Basic Research, project 18-3720002.

References

[1] S.R. Prashant Dhamdhere, Jeremiah Harmsen, Raaghav Hebbar, Srinath Mandalapu, Ashish Mehra, ELPP 2016: Big Data for Healthcare, 2016. http://scet.berkeley.edu/wp-content/uploads/Big-Data-for-Healthcare-Report-ELPP-2016.pdf (accessed September 6, 2017).
[2] E. Ford, J.A. Carroll, H.E. Smith, D. Scott, J.A. Cassell, Extracting information from the text of electronic medical records to improve case detection: a systematic review, J. Am. Med. Informatics Assoc. 23 (2016) 1007–1015. doi:10.1093/jamia/ocv180.
[3] K.S. Tai, R. Socher, C.D. Manning, Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks, (2015). http://arxiv.org/abs/1503.00075 (accessed November 13, 2018).
[4] B. Pang, L. Lee, S. Vaithyanathan, Thumbs up?: sentiment classification using machine learning techniques, in: Proc. ACL-02 Conf. Empir. Methods Nat. Lang. Process. - EMNLP '02, Association for Computational Linguistics, Morristown, NJ, USA, 2002: pp. 79–86. doi:10.3115/1118693.1118704.
[5] Y. Kim, Convolutional Neural Networks for Sentence Classification, (2014). http://arxiv.org/abs/1408.5882 (accessed February 25, 2019).
[6] X. Zhang, J. Zhao, Y. LeCun, Character-level Convolutional Networks for Text Classification, (2015). https://arxiv.org/pdf/1509.01626.pdf (accessed January 9, 2018).
[7] A. Tripathy, A. Agrawal, S.K. Rath, Classification of sentiment reviews using n-gram machine learning approach, Expert Syst. Appl. 57 (2016) 117–126. doi:10.1016/J.ESWA.2016.03.028.
[8] G. Litjens, T. Kooi, B.E. Bejnordi, A.A.A. Setio, F. Ciompi, M. Ghafoorian, J.A.W.M. van der Laak, B. van Ginneken, C.I. Sánchez, A survey on deep learning in medical image analysis, Med. Image Anal. 42 (2017) 60–88. doi:10.1016/J.MEDIA.2017.07.005.
[9] D. Bahdanau, K. Cho, Y. Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, (2014). http://arxiv.org/abs/1409.0473 (accessed November 14, 2018).
[10] M. Löpprich, F. Krauss, M. Ganzinger, K. Senghas, S. Riezler, P. Knaup, Automated Classification of Selected Data Elements from Free-text Diagnostic Reports for Clinical Research, Methods Inf. Med. 55 (2016) 373–380. doi:10.3414/ME15-02-0019.
[11] F. Chollet, Keras, (2015). https://github.com/keras-team/keras (accessed February 20, 2018).
[12] Keras Documentation, (n.d.). https://keras.io/ (accessed November 7, 2018).
[13] D.P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, (2014). https://arxiv.org/abs/1412.6980 (accessed November 12, 2018).
[14] S. Ruder, An overview of gradient descent optimization algorithms, (2016). doi:10.1111/j.0006-341X.1999.00591.x.
[15] R. Kohavi, A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection, International Joint Conference on Artificial Intelligence (IJCAI), 1995. http://robotics.stanford.edu/~ronnyk (accessed November 23, 2018).
[16] I. Goodfellow, Y. Bengio, A. Courville, Deep learning, n.d.
[17] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, L.D. Jackel, Backpropagation Applied to Handwritten Zip Code Recognition, Neural Comput. 1 (1989) 541–551. doi:10.1162/neco.1989.1.4.541.
[18] C. Nogueira dos Santos, M. Gatti, Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts, (n.d.) 69–78. http://anthology.aclweb.org/C/C14/C14-1008.pdf (accessed January 10, 2018).
[19] W. De Mulder, S. Bethard, M.-F. Moens, A survey on the application of recurrent neural networks to statistical language modeling, Comput. Speech Lang. 30 (2015) 61–98. doi:10.1016/J.CSL.2014.09.005.
[20] M. Auli, M. Galley, C. Quirk, G. Zweig, Joint language and translation modeling with recurrent neural networks, (2013). https://www.microsoft.com/en-us/research/publication/joint-language-and-translation-modeling-with-recurrent-neural-networks/ (accessed November 16, 2018).
[21] S. Hochreiter, J. Schmidhuber, Long Short-Term Memory, Neural Comput. 9 (1997) 1735–1780. doi:10.1162/neco.1997.9.8.1735.
[22] C. Zhou, C. Sun, Z. Liu, F.C.M. Lau, A C-LSTM Neural Network for Text Classification, (2015). http://arxiv.org/abs/1511.08630 (accessed November 16, 2018).
[23] W. Yin, K. Kann, M. Yu, H. Schütze, Comparative Study of CNN and RNN for Natural Language Processing, (2017). doi:10.14569/IJACSA.2017.080657.