Chapter 1
Speech Processing in Healthcare: Can We Integrate?
K.C. Santosh
Department of Computer Science, The University of South Dakota, Vermillion, SD, United States
Speech recognition, also known as voice recognition, refers to the translation of speech into words in a machine-readable format [1–3]. Speech processing draws on several areas, for example, signal processing, pattern recognition, and machine learning [3]. Driven by uses ranging from improved customer service and hospital care to crime fighting, the global speech recognition market has grown from $104.4 billion in 2016 to an estimated $184.9 billion in 2021 (source: https://www.news-medical.net/whitepaper/20170821/Speech-Recognition-in-Healthcare-a-SignificantImprovement-or-Severe-Headache.aspx). This is not a new trend; speech-to-text conversion, for instance, has long been widely used in cases in which different languages are needed, and the opposite, text-to-speech, holds true as well [4, 5]. For example, can we process or reuse speech data recorded during a telephone conversation a few years earlier, in which a client claimed that fraud had occurred on his or her credit card? Yes, this is possible. Besides other sources of data, speech can be taken as an authentic component for describing an event or scene, and the emotions within it can be analyzed [6–8].

Examples exist showing how speech analysis can be integrated into healthcare. In Fig. 1.1, a complete automated healthcare scenario is depicted, which can be summarized as follows: a patient visits a clinical center (hospital), where he/she is X-rayed, provides sensor-based data (external and internal), and receives (handwritten and machine-printed) prescription(s) and report(s) from the specialist. Throughout these events, the patient and other staff (including the specialists) go through different levels of conversation; if these conversations are recorded, they can be integrated with signal processing, pattern recognition, image processing, and machine learning.
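As a concrete illustration of the speech-to-text conversion mentioned above, the following minimal sketch transcribes an archived telephone recording. It assumes the open-source Python SpeechRecognition package and a hypothetical file name "client_call.wav"; neither is prescribed by this chapter.

```python
# Minimal speech-to-text sketch (assumes: pip install SpeechRecognition).
# "client_call.wav" is a hypothetical recording of an archived phone call.
import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.AudioFile("client_call.wav") as source:
    audio = recognizer.record(source)  # read the whole file into memory

try:
    # Off-the-shelf recognizer; any engine supported by the package could be used.
    text = recognizer.recognize_google(audio)
    print("Transcript:", text)
except sr.UnknownValueError:
    print("Speech was unintelligible.")
except sr.RequestError as err:
    print("Recognition service unavailable:", err)
```

Once a transcript exists in machine-readable form, it can be indexed, searched for keywords (e.g., a fraud claim), or passed on to emotion analysis, as discussed above.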
In the aforementioned healthcare project, for instance, it would be convenient to combine speech and signal processing tools and techniques with image analysis-based tools and techniques [9–12].
FIG. 1.1 Smart healthcare and the use of speech processing: can we get more information? (The figure connects: AI and deep learning; speech processing; data analysis, signal processing, and pattern recognition; image processing and pattern recognition; consistency checking; convolutional neural networks; visualization; and decision making with no bias.)
More specifically, it is important to note that doctors can form a preliminary judgment about the presence of tuberculosis, for instance, based on verbal communication (answers to questions such as "Do you sleep well?" and "How are your eating habits?") before they start the X-ray screening procedure. If this is the case, speech processing can contribute to the complete project outlined earlier by supplying complementary information on which further processing can be performed. It is important to note, for instance, that speech and voice cues (before and after the doctor's visit) can help one understand the patient's willingness to continue with treatment. Speech and voice can convey emotions over time, and pain can often be inferred from speech/voice level. We can also automatically check trends in how doctors and other staff members behave toward patients. Can speech be a component in checking consistency against the other sources of data shown in Fig. 1.1? The use of a (proposed) convolutional neural network, as in Fig. 1.1, illustrates that a machine, unlike a human expert, can make a decision without the bias possible in human choices. Visualization can also show how different sources of data are connected. We consider artificial intelligence (AI) and machine learning (ML) tools for automation, since data have to be collected over time.
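To make the idea of combining speech cues with chest X-ray screening more concrete, the sketch below joins a small speech-feature branch and a small convolutional image branch into a single classifier. It is only an illustrative sketch under assumed inputs (13 averaged MFCCs per patient and 128 × 128 grayscale X-rays) using the Keras API; the chapter does not prescribe this architecture.

```python
# Illustrative multimodal fusion: speech features + chest X-ray image.
import tensorflow as tf
from tensorflow.keras import layers, Model

# Assumed inputs: 13 MFCCs averaged per utterance, 128x128 grayscale X-ray.
speech_in = layers.Input(shape=(13,), name="speech_features")
image_in = layers.Input(shape=(128, 128, 1), name="xray_image")

# Small convolutional branch for the image.
x = layers.Conv2D(16, 3, activation="relu")(image_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(32, 3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)

# Dense branch for the speech features.
s = layers.Dense(32, activation="relu")(speech_in)

# Fuse both sources and predict an abnormality score.
merged = layers.Concatenate()([x, s])
merged = layers.Dense(64, activation="relu")(merged)
out = layers.Dense(1, activation="sigmoid", name="abnormality")(merged)

model = Model(inputs=[speech_in, image_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

In such a fused model, the network weighs speech-derived and image-derived evidence jointly, which is one possible way to realize the consistency checking and unbiased decision making sketched in Fig. 1.1.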
Analyzing big data is extremely important because manual analysis at that scale is impractical: humans are more error-prone, and human analysis is costlier. As mentioned earlier, local languages other than English can be considered in healthcare. In one work [13], the authors reported the use of the Tamil language, in addition to English, to estimate heartbeat from speech/voice. A few more works can be cited showing how local and regional languages, such as German [14], Malay [15], and Slovenian [16], have helped speech technology progress. In another context [17], emergency medical care often depends on how quickly and accurately field medical personnel can access a patient's background information and document their assessment and treatment of the patient. What if we could deploy automated speech/voice recognition tools in the field? It is clear that research scientists should come up with precise tools that people can trust, and analyzing speech/voice in the presence of background music or other noise is an important part of that; other works can be referenced for more detailed information [18–21]. As data change and grow, machine learning can help automate the data retrieval and recording system. The use of extreme learning machine-based voice activity detection is prominent in the field [22]. Going further, for example to real-time speech/voice recognition/classification, active learning should be considered, since learning over time has been found to be vital [23]. In general, Fig. 1.1 shows how important speech is as a component alongside other sources of data, such as sensor-based data and image data (X-rays and/or reports that are handwritten or machine printed).
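As a rough illustration of the extreme learning machine (ELM) idea behind the voice activity detection work cited above [22], the sketch below trains an ELM on precomputed per-frame feature vectors (e.g., line spectral frequencies) with 0/1 speech labels. The feature extraction step, the hidden-layer size, and the variable names are assumptions made here for illustration, not the published method.

```python
# Minimal extreme learning machine (ELM) sketch for frame-level voice activity
# detection. X: (n_frames, n_features) precomputed features; y: 0/1 labels.
import numpy as np

def elm_train(X, y, n_hidden=100, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights (never trained)
    b = rng.normal(size=n_hidden)                # random biases
    H = np.tanh(X @ W + b)                       # hidden-layer activations
    beta = np.linalg.pinv(H) @ y                 # output weights via least squares
    return W, b, beta

def elm_predict(X, W, b, beta, threshold=0.5):
    H = np.tanh(X @ W + b)
    return (H @ beta >= threshold).astype(int)   # 1 = speech frame, 0 = non-speech

# Toy usage with random stand-in features (replace with real LSF features).
X_train = np.random.rand(500, 10)
y_train = np.random.randint(0, 2, 500)
W, b, beta = elm_train(X_train, y_train)
labels = elm_predict(np.random.rand(20, 10), W, b, beta)
```

The appeal of the ELM in such a setting is speed: only the output weights are solved for, so the detector can be retrained cheaply as new recordings arrive.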
References
[1] J. Flanagan, L. Rabiner (Eds.), Speech Synthesis, Dowden, Hutchinson & Ross, Inc., Pennsylvania, 1973.
[2] J. Flanagan, Speech Analysis, Synthesis, and Perception, Springer-Verlag, Berlin-Heidelberg-New York, 1972.
[3] C. Xian-Yi, P. Yan, Review of modern speech synthesis, in: W. Hu (Ed.), Electronics and Signal Processing. Lecture Notes in Electrical Engineering, vol. 97, Springer, Berlin, Heidelberg, 2011.
[4] A. Iida, N. Campbell, Speech database design for a concatenative text-to-speech synthesis system for individuals with communication disorders, Int. J. Speech Technol. 6 (4) (2003) 379–392.
[5] R. Bossemeyer, M. Hardzinski, Talking call waiting: an application of text-to-speech, Int. J. Speech Technol. 4 (1) (2001) 7–17.
[6] K. Sailunaz, M. Dhaliwal, J. Rokne, R. Alhajj, Emotion detection from text and speech: a survey, Soc. Netw. Anal. Min. 8 (28) (2018).
[7] A. Revathi, C. Jeyalakshmi, Emotions recognition: different sets of features and models, Int. J. Speech Technol. (2018).
[8] M. Swain, A. Routray, P. Kabisatpathy, Databases, features and classifiers for speech emotion recognition: a review, Int. J. Speech Technol. 21 (1) (2018) 93–120.
[9] K.C. Santosh, S. Antani, Automated chest X-ray screening: can lung region symmetry help detect pulmonary abnormalities? IEEE Trans. Med. Imaging 37 (5) (2018) 1168–1177.
[10] S. Vajda, A. Karargyris, S. Jaeger, K.C. Santosh, S. Candemir, Z. Xue, S. Antani, G. Thoma, Feature selection for automatic tuberculosis screening in frontal chest radiographs, J. Med. Syst. 42 (8) (2018) 146.
[11] A. Karargyris, J. Siegelman, D. Tzortzis, S. Jaeger, S. Candemir, Z. Xue, K.C. Santosh, S. Vajda, S.K. Antani, L.R. Folio, G.R. Thoma, Combination of texture and shape features to detect tuberculosis in digital chest X-rays, Int. J. Comput. Assist. Radiol. Surg. 11 (1) (2016) 99–106.
[12] K.C. Santosh, S. Vajda, S.K. Antani, G.R. Thoma, Edge map analysis in chest X-rays for automatic abnormality screening, Int. J. Comput. Assist. Radiol. Surg. 11 (9) (2016) 1637–1646.
[13] A. Milton, K.A. Monsely, Tamil and English speech database for heartbeat estimation, Int. J. Speech Technol. (2018).
[14] M. Schröder, J. Trouvain, The German text-to-speech synthesis system MARY: a tool for research, development and teaching, Int. J. Speech Technol. 6 (4) (2003) 365–377.
[15] Y.A. El-Imam, Z.M. Don, Text-to-speech conversion of standard Malay, Int. J. Speech Technol. 3 (2) (2000) 129–146.
[16] T. Šef, M. Gams, SPEAKER (GOVOREC): a complete Slovenian text-to-speech system, Int. J. Speech Technol. 6 (3) (2003) 277–287.
[17] T.G. Holzman, Speech-audio interface for medical information management in field environments, Int. J. Speech Technol. 4 (3–4) (2001) 209–226.
[18] B.K. Khonglah, S.R.M. Prasanna, Clean speech/speech with background music classification using HNGD spectrum, Int. J. Speech Technol. 20 (4) (2017) 1023–1036.
[19] B.K. Khonglah, S.R.M. Prasanna, Speech/music classification using speech-specific features, Digital Signal Process. 48 (2016) 71–83.
[20] E. Scheirer, M. Slaney, Construction and evaluation of a robust multifeature speech/music discriminator, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 2, 1997, pp. 1331–1334.
[21] T. Zhang, C.J. Kuo, Audio content analysis for online audiovisual data segmentation and classification, IEEE Trans. Speech Audio Process. 9 (4) (2001) 441–457.
[22] H. Mukherjee, S.M. Obaidullah, K.C. Santosh, S. Phadikar, K. Roy, Line spectral frequency-based features and extreme learning machine for voice activity detection from audio signal, Int. J. Speech Technol. (2018).
[23] M.-R. Bouguelia, S. Nowaczyk, K.C. Santosh, A. Verikas, Agreeing to disagree: active learning with noisy labels without crowdsourcing, Int. J. Mach. Learn. Cybern. (2018).