Optik - International Journal for Light and Electron Optics 204 (2020) 164225
Optik journal homepage: www.elsevier.com/locate/ijleo
Original research article
Use of FT-IR spectroscopy combined with SVM as a screening tool to identify invasive ductal carcinoma in breast cancer
Jie Liu a,1, Hong Cheng b,1, Xiaoyi Lv c,a,*, Zhaoxia Zhang d,**, Xiangxiang Zheng e, Guohua Wu e, Jun Tang f, Xiaorong Ma g, Xiaxia Yue f
a College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
b The Affiliated Tumor Hospital of Xinjiang Medical University, Urumqi 830000, China
c School of Software, Xinjiang University, Urumqi 830046, China
d The First Affiliated Hospital of Xinjiang Medical University, Urumqi 830000, China
e School of Electronic Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China
f Physics and Chemistry Detecting Center, Xinjiang University, Urumqi 830046, China
g School of Physics Science and Technology, Xinjiang University, Urumqi 830046, China
ARTICLE INFO
Keywords: Breast cancer; Invasive ductal carcinoma; Fourier transform infrared; Support vector machine
ABSTRACT
This study proposes a rapid, noninvasive method for screening invasive ductal carcinoma (IDC) and noninvasive ductal carcinoma (non-IDC) using serum Fourier transform infrared (FT-IR) spectroscopy combined with multivariate statistical methods. Serum samples from 114 healthy patients, 74 IDC patients, and 41 non-IDC patients were examined. Tentative assignments of the FT-IR peaks in the measured serum spectra suggested specific biomolecular changes between the groups. Principal component analysis was used for feature extraction to reduce the spectral dimension and improve the speed of the diagnostic model. Support vector machine models with linear, polynomial, and radial basis function kernels were built on the extracted features. The polynomial kernel achieved the best results, with an accuracy of 95.7 % and a sensitivity and specificity of 91.7 % and 100 %, respectively. The results indicate that serum FT-IR spectroscopy combined with multivariate statistical analysis has considerable potential for screening IDC in breast cancer. This technology can be used to develop a portable, rapid screening device for discriminating between healthy patients and those with IDC or non-IDC.
* Corresponding author at: School of Software, Xinjiang University, Urumqi 830046, China.
** Corresponding author.
E-mail addresses: [email protected] (X. Lv), [email protected] (Z. Zhang).
1 These authors contributed equally to this work and should be considered co-first authors.
https://doi.org/10.1016/j.ijleo.2020.164225
Received 28 November 2019; Received in revised form 30 December 2019; Accepted 13 January 2020
0030-4026/ © 2020 Elsevier GmbH. All rights reserved.
1. Introduction
Breast cancer is a malignant tumor that seriously affects women’s physical and mental health on a global scale. This type of cancer starts from breast cells [1]. It is difficult to detect breast cancer in its early stages, before the tumor grows to an appreciable size. According to the International Agency for Research on Cancer’s “Global Cancer Statistics for 2018,” breast cancer ranked first in women’s cancer incidence and mortality in the world in 2018, accounting for 24.2 % of cases and 15.0 % of deaths, respectively. The breast contains different types of tissue, and cancer can occur in most of them. Because treatment and prognosis differ among cancer types, the clinical treatment plan should be determined in combination with the pathological type and histological grade [2]. According to the introduction of the “Specifications for the diagnosis and treatment of common malignant tumors in
China” [3], the most common classification types in China are noninvasive cancer, early invasive cancer, and invasive cancer. Invasive carcinoma is mainly classified into invasive lobular carcinoma and invasive ductal carcinoma (IDC). IDC accounts for 65 %–80 % of breast cancers and 80 %–90 % of invasive breast cancers. Therefore, it is necessary to develop a method both for early screening of breast cancer and for screening of IDC.
At present, screening methods for breast cancer include breast self-examination, clinical breast examination, mammography, magnetic resonance imaging, and ultrasound [4]. While routine self-examination allows women to detect breast masses early, it does not reduce breast cancer mortality, and it is less effective than conventional mammography for screening early breast cancer. Mammography is noninvasive, causes relatively little pain, and offers high resolution. However, although it uses low doses of radiation, radiation exposure has cumulative effects on cancer risk, and the risk of cancer increases when radiologic screening is started at a young age [5]. Breast magnetic resonance imaging and ultrasound imaging are safe modalities for screening abnormal changes, but they are still limited in terms of detection limits, time, and cost [6]. Furthermore, the analysis and diagnosis of mammograms depend mainly on the professional skills and subjective experience of the doctor. Therefore, there is a need to develop a rapid, effective, and economical screening tool for breast cancer.
Fourier-transform infrared (FT-IR) spectroscopy is a fast spectral scanning method. Furthermore, it is a highly reproducible analytical technique for studying the structure and conformation of biomolecules such as proteins and peptides [7], nucleic acids [8], carbohydrates, and lipids [9–11].
Because the infrared spectrum of an organic compound has many absorption peaks, a single spectrum provides information such as the position, shape, and intensity of the peaks; it is therefore called “the fingerprint of organic compounds.” Accordingly, several research groups have used infrared spectroscopy to study cancers [11–15].
In this study, we use the principal component analysis (PCA)-support vector machine (SVM) algorithm with different kernel functions to analyze FT-IR spectral data and distinguish healthy patients, patients with IDC, and patients with noninvasive ductal carcinoma (non-IDC). The purpose of this study is to investigate the feasibility of using serum FT-IR spectroscopy to screen patients with IDC in breast cancer. Such screening would help not only to select treatments and prognostic measures quickly and accurately for different patients, but also to diagnose breast cancer early. To the best of our knowledge, this is the first study that uses FT-IR spectroscopy for the screening of IDC in breast cancer.
2. Materials and methods
2.1. Experimental materials
All serum samples were collected from the Affiliated Tumor Hospital of Xinjiang Medical University. Approval from the ethics committee was obtained to study the human blood serum samples. A total of 229 serum samples were used: 114 from healthy patients, who served as the control group, and 115 from breast cancer patients, comprising 74 IDC patients and 41 non-IDC patients (including mucinous carcinoma, ductal carcinoma in situ, invasive lobular carcinoma, fibroadenomas, and other breast cancers). Patient-specific information is shown in Table 1. All serum samples were stored in a refrigerator at −20 °C before the test. During the test, the serum samples were thawed naturally at a constant room temperature of 22 °C. Subsequently, 2 μL of each sample was pipetted onto the infrared spectrometer.
After drying naturally for 10 min, an infrared test was performed.
2.2. Experimental instrumentation
The infrared spectra were acquired with a VERTEX 70 infrared spectrometer (Bruker, Germany), with air as the background, a scanning range of 700–4000 cm−1, a resolution of 8 cm−1, and 32 scans. All samples were measured once. Before each measurement, the background data were measured using the OPUS 65 software on a Windows XP system, and then the sample data were measured. After the test was completed, the Fourier transform was performed automatically in the software to obtain the FT-IR spectral data. MATLAB R2016a was used for processing and statistical analysis of the FT-IR spectral data, and all figures were drawn using Origin 2018.

Table 1 Sample clinical information form.
Sample                                          Number  Age    Mean age  Tumor size (cm)   Number of lymph node metastases
Non-IDC
  Mucinous carcinoma                            3       49–62  53.33     1.0×1.0–1.5×1.5   0
  Poorly differentiated cancer                  2       44–45  44.5      2.0×1.0–4.0×1.0   1
  Ductal carcinoma in situ                      19      27–58  45.32     0.3×0.4–3.0×2.0   8
  Intraductal papillary carcinoma               6       34–56  45.17     0.5×0.8–2.0×2.0   3
  Fibroadenomas                                 2       33–40  36.5      1.0×0.9–1.0×1.0   0
  Invasive lobular carcinoma                    1       59     59        1.0×0.8           0
  Breast tumor                                  4       39–65  51.75     3.0×2.0–3.1×3.0   2
  Mammary glandular carcinoma                   1       40     40        3.0×3.0           1
  Medullary carcinoma of the breast             1       57     57        1.5×1.0           0
  Metaplastic cancer                            1       56     56        3.0×2.5           0
  Mammary gland adenoid invasive breast cancer  1       56     56        1.0×1.0           0
  Total                                         41      27–65  47.05     0.3×0.4–7.0×4.5   52
IDC                                             74      30–78  48.09     0.5×0.8–7.0×4.5   37
Healthy                                         114     32–75  51.17     –                 –

2.3. Data analysis and processing
The raw spectral data obtained from the serum samples are the Fourier-transformed data after subtraction of the background measured at acquisition time. We normalize each raw FT-IR spectrum to reduce the effect of differences in spectral intensity between samples. The pre-processed FT-IR spectral data contain 855 variables in the 700–4000 cm−1 band. To improve the speed and accuracy of the diagnostic model, it is necessary to reduce the dimensionality of the high-dimensional spectral data. PCA can reduce the number of variables [16], and unsupervised PCA has been widely used for the dimensionality reduction of spectral data. Although it can effectively extract data features, its recognition ability is low. Therefore, to classify the FT-IR spectral data, SVM is applied to the PCs extracted using PCA. The SVM was first proposed by Vapnik [17]. Owing to its kernel functions, an SVM can separate nonlinear data that cannot be divided well by a plane [18]. Commonly used kernel functions are the linear, polynomial, and radial basis function (RBF) kernels. By selecting an appropriate kernel function, satisfactory classification results can be obtained. The accuracy, sensitivity, and specificity of the models with different kernel functions are compared [19].
3. Results
3.1. FT-IR spectral analysis
Fig.
1 shows the mean FT-IR spectra ± 1 standard deviation of serum from healthy patients, IDC patients, and non-IDC patients over the range 700–4000 cm−1. The shaded area represents one standard deviation. The three solid lines (a), (b), and (c) in Fig. 1 show that the characteristic serum spectra of healthy patients, IDC patients, and non-IDC patients are very similar in shape and intensity; hence, a data analysis algorithm is required to distinguish among the three groups [18]. Fig. 2 shows the three pairwise difference spectra. We applied the independent-samples Kruskal-Wallis test to the difference spectra and obtained p < 0.05, so the differences are statistically significant. In conclusion, there are differences in the serum spectra of healthy patients, IDC patients, and non-IDC patients, but few of them are evident on visual inspection. Therefore, we use statistical methods to distinguish the three groups.
Fig. 1. (a) Mean FT-IR spectrum of serum ± 1 standard deviation of 114 healthy patients. (b) Mean FT-IR spectrum of serum ± 1 standard deviation of 74 IDC patients. (c) Mean FT-IR spectrum of serum ± 1 standard deviation of 41 non-IDC patients.
Fig. 2. (a)-(b) Serum spectral differences between healthy patients and IDC patients. (a)-(c) Serum spectral differences between healthy patients and non-IDC patients. (b)-(c) Serum spectral differences between IDC patients and non-IDC patients.
3.2. PCA feature extraction
PCA is used to reduce the feature dimension of the input variables and extract key information from the spectrum. First, the first three PCs (PC1, PC2, and PC3), which have the highest contribution rates, are selected for analysis. Fig. 3(a)-(c) show two-dimensional scatter plots of PC1/PC2, PC1/PC3, and PC2/PC3, respectively. After PCA dimensionality reduction, non-IDC patients are separated to some degree from healthy and IDC patients in Fig. 3(a) and (c), whereas IDC patients and healthy patients overlap more; in Fig. 3(b), there is some overlap among all three groups. To derive the diagnostic sensitivity and specificity, SVMs with different kernel functions are used for diagnostic comparison.
In general, the accuracy of the model increases with the number of PCs used [20]. Fig. 4 shows the cumulative variance contribution rate for different numbers of PCs (from PC1 to PC10) and the accuracy curves of the SVMs with the three kernel functions. The number of PCs can be selected according to the criterion that 90 % of the total variance be retained [20]. When the number of PCs is 7, the cumulative variance contribution rate is 98.4 % and the model accuracy is the highest; hence, the first seven PCs were selected to build the classification models (PCA-RBF-SVM, PCA-linear-SVM, and PCA-polynomial-SVM).
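The two selection rules mentioned above (retain a target share of the variance, or take the number of PCs that gives the highest accuracy) can be sketched as follows. This is only an illustration: the function names are hypothetical, and the per-PC variance ratios and accuracies are made-up placeholders, since the text reports only that seven PCs retain 98.4 % of the variance. It does not reproduce the MATLAB workflow used in the study.

```python
from itertools import accumulate


def n_components_for_variance(explained_ratio, threshold=0.90):
    """Smallest number of leading PCs whose cumulative variance ratio reaches `threshold`."""
    for k, cumulative in enumerate(accumulate(explained_ratio), start=1):
        if cumulative >= threshold:
            return k
    return len(explained_ratio)  # threshold never reached: keep every PC


def n_components_for_accuracy(accuracy_by_k):
    """Smallest k with the highest accuracy; accuracy_by_k[i] is the accuracy with i + 1 PCs."""
    return accuracy_by_k.index(max(accuracy_by_k)) + 1


# Placeholder variance ratios for PC1..PC10 (illustrative values, NOT measured data).
# They are chosen only so that the first seven sum to 0.984, the one figure the text gives.
example_ratios = [0.62, 0.17, 0.08, 0.05, 0.03, 0.02, 0.014, 0.006, 0.004, 0.003]

# Placeholder accuracies for 1..10 PCs (illustrative values, NOT measured data).
example_accuracy = [0.60, 0.70, 0.78, 0.84, 0.88, 0.92, 0.95, 0.94, 0.93, 0.93]

print(n_components_for_variance(example_ratios, threshold=0.90))
print(n_components_for_accuracy(example_accuracy))
```

With placeholder inputs like these, the variance rule alone can stop at fewer PCs than the accuracy rule, which is one way to read the choice of seven PCs rather than the minimum needed for 90 %.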
3.3. Model evaluation
For the evaluation, the 229 FT-IR spectra after PCA feature extraction were used, and 159 of them were randomly selected as the training set: 80 from healthy patients, 51 from IDC patients, and 28 from non-IDC patients. The remaining 70 spectra, from 34 healthy patients, 23 IDC patients, and 13 non-IDC patients, were used as the test set to verify the diagnostic accuracy of the SVMs with different kernel functions. The Classification Learner app in MATLAB R2016a was used to select the different kernel functions for comparative analysis. Table 2 shows the sensitivity, specificity, and diagnostic accuracy of the three kernel functions. The diagnostic performance is best with the polynomial kernel: the sensitivity is 91.7 %, the specificity is 100 %, and the total accuracy is 95.7 %. To compare the classification performance of the SVMs with the three kernel functions further, Table 3 provides the confusion matrices for the linear, polynomial, and RBF kernel models. Each column of a matrix represents the predicted class, and each row represents the histopathological reference class [21]. As shown in Table 3, under the polynomial kernel, healthy patients were classified with an accuracy of 100 %, and two IDC cases and one non-IDC case were misclassified. The linear kernel gives the worst classification among the three kernel functions, and the polynomial kernel gives the best.
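As a check on how Tables 2 and 3 relate, the sketch below recomputes sensitivity, specificity, and accuracy from the confusion-matrix counts read from Table 3 (rows are reference classes, columns are predicted classes). The metric definitions are an assumption, since the text does not state them: sensitivity is taken as the share of cancer samples (IDC and non-IDC) assigned to their own class, and specificity as the share of healthy samples predicted healthy. Under that assumption the values in Table 2 are reproduced.

```python
CLASSES = ("IDC", "Non-IDC", "Healthy")

# Counts read from Table 3. Rows: reference class; columns: predicted class,
# both in the order given by CLASSES.
CONFUSION = {
    "Linear":     [[12, 0, 11], [0, 13, 0], [4, 0, 30]],
    "Polynomial": [[21, 0, 2], [0, 12, 1], [0, 0, 34]],
    "RBF":        [[21, 0, 2], [0, 11, 2], [1, 0, 33]],
}


def summarize(matrix, healthy=CLASSES.index("Healthy")):
    """Return sensitivity, specificity, and accuracy in percent (assumed definitions)."""
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n_classes))
    cancer_rows = [i for i in range(n_classes) if i != healthy]
    cancer_total = sum(sum(matrix[i]) for i in cancer_rows)
    cancer_correct = sum(matrix[i][i] for i in cancer_rows)
    return {
        "sensitivity": 100 * cancer_correct / cancer_total,
        "specificity": 100 * matrix[healthy][healthy] / sum(matrix[healthy]),
        "accuracy": 100 * correct / total,
    }


for kernel, matrix in CONFUSION.items():
    metrics = summarize(matrix)
    print(f"{kernel:<10} " + "  ".join(f"{name}={value:.1f}" for name, value in metrics.items()))
```

Because none of the three models confuses IDC with non-IDC, counting any cancer prediction as a true positive would give the same sensitivities here.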
4. Discussion
Here, we report the use of serum infrared spectroscopy combined with multivariate statistical methods to screen breast cancer patients. By analyzing the mean spectra of IDC, non-IDC, and healthy patients, we observed specific differences among the infrared spectra of the three groups of serum. Differences in the contents of specific substances may change the peak intensity and spectral shape of the serum infrared spectra of the three groups. Owing to these changes, serum infrared spectroscopy can be combined with statistical methods for screening and diagnosing breast cancer patients. The main spectral changes between breast cancer patients and healthy patients are recorded in the C–H stretching vibration, C–O ribose, ribose skeleton, and P–O vibration regions [22]. Table 4 lists tentative assignments for the dominant spectral bands according to the relevant literature [15,22–29].
From Figs. 1 and 2, we can observe a significant difference among healthy patients, IDC patients, and non-IDC patients in the range of 1539–1650 cm−1, where the serum infrared intensity of healthy patients is the largest; hence, we tentatively conclude that breast cancer patients and healthy people differ in the amide II and amide I bands. Fig. 2 shows that the difference spectra (a)-(b) and (a)-(c) are more pronounced than the difference spectrum (b)-(c); because IDC and non-IDC patients are both breast cancer patients, the difference between them is smaller. The spectra of the healthy, IDC, and non-IDC samples also differ outside the 1539–1650 cm−1 band: the difference between IDC and non-IDC is more obvious in the 3560–3880 cm−1 band, where the serum infrared intensity of IDC patients is larger. Therefore, we think that the occurrence of IDC may be caused by the biochemical substances that absorb at 3560–3880 cm−1. At present, we are unable to clarify the mechanism by which these substances act on breast cancer, and we will study it further in follow-up work.
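A three-group comparison within a single band, such as 1539–1650 cm−1, could be checked with the Kruskal-Wallis test named in Section 3.1. The sketch below is a plain-Python illustration under stated assumptions: the function name is hypothetical, the input is one summary value per sample (for example, the mean absorbance in the band), the intensities are toy numbers rather than measurements, and the p-value uses the chi-square approximation, which is rough for small groups. How the test was actually applied to the spectra is not described in the text.

```python
import math
from collections import Counter


def kruskal_wallis_3(group_a, group_b, group_c):
    """Kruskal-Wallis H statistic and approximate p-value for exactly three groups.

    Ties receive average ranks and H is divided by the standard tie-correction factor.
    Assumes the pooled values are not all identical.
    """
    groups = [list(group_a), list(group_b), list(group_c)]
    pooled = [value for group in groups for value in group]
    n = len(pooled)

    # Average 1-based rank of each distinct value in the pooled sample.
    counts = Counter(pooled)
    rank_of = {}
    seen = 0
    for value in sorted(counts):
        ties = counts[value]
        rank_of[value] = seen + (ties + 1) / 2
        seen += ties

    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)
    h /= 1 - sum(t**3 - t for t in counts.values()) / (n**3 - n)

    # With three groups, H is compared to a chi-square distribution with 2 degrees
    # of freedom, whose survival function is exp(-x / 2).
    p_value = math.exp(-h / 2)
    return h, p_value


# Toy band intensities (arbitrary units), purely illustrative.
healthy = [0.91, 0.88, 0.93, 0.90]
idc = [0.80, 0.83, 0.79, 0.82]
non_idc = [0.84, 0.86, 0.81, 0.85]

h_stat, p = kruskal_wallis_3(healthy, idc, non_idc)
print(f"H = {h_stat:.3f}, p = {p:.4f}")
```

A small p-value only indicates that at least one group differs in that band; it does not say which pair differs, which is what the pairwise difference spectra in Fig. 2 address.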
Fig. 3. Two-dimensional scatter plot of the first three PCs; (a) is a scatter plot of three types of samples under PC1 and PC2, (b) is a scatter plot of three types of samples under PC1 and PC3, and (c) is a scatter plot of three types of samples under PC2 and PC3.
Fig. 4. PC cumulative variance contribution rate and diagnostic accuracy of different kernel functions under different PC numbers.

Table 2 Performance comparison of linear, polynomial, and RBF kernels.
Kernel      Sensitivity (%)  Specificity (%)  Accuracy (%)
Linear      69.4             88.2             78.6
Polynomial  91.7             100              95.7
RBF         88.9             97.1             92.9

Table 3 Confusion matrix of linear, polynomial, and RBF kernels. Rows: histopathological reference class; columns: predicted class.
Kernel      Reference class  Predicted IDC  Predicted Non-IDC  Predicted Healthy
Linear      IDC              12             0                  11
            Non-IDC          0              13                 0
            Healthy          4              0                  30
Polynomial  IDC              21             0                  2
            Non-IDC          0              12                 1
            Healthy          0              0                  34
RBF         IDC              21             0                  2
            Non-IDC          0              11                 2
            Healthy          1              0                  33

Table 4 Peak positions and assignments of the major infrared bands of human serum according to some references.
Wavenumber (cm−1)  Assignment
855                δ(P–O)
910                δ(ribose)
930                δ(C–C–N)
1035               δ(C–O)
1100               νas(PO2−) of nucleic acid
1550               δ(N–H) of amide II band
1650               amide I
1744               ν(C=O) of lipids
2853               νs(CH2)
2925               νs(CH3)
3561               ν(O–H)
3611               ν(O–H) and ν(N–H)
ν: stretching vibration; δ: bending vibration; νs: symmetric stretch; νas: asymmetric stretch.

5. Conclusion
In this study, we analyzed serum from 74 patients with IDC, 41 patients with non-IDC, and 114 healthy patients using FT-IR spectroscopy. The results show that the spectra obtained from the serum samples of healthy, IDC, and non-IDC patients are very similar in shape and intensity but exhibit subtle differences, which indicates that patients with IDC can be preliminarily screened among breast cancer patients through powerful data analysis algorithms. To improve the efficiency of the SVM diagnostic model and verify its accuracy under different kernel functions, we normalized the original spectral data, used PCA for feature extraction to improve the speed of the diagnostic model, and compared the diagnostic results of SVMs with linear, polynomial, and RBF kernels. The results show that the SVM with the polynomial kernel is the most suitable for these experimental data. Finally, we plan to collect more samples from patients to verify the reliability of this model. This rapid, noninvasive, and low-cost detection technology is important for the screening and prevention of early breast cancer.
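For readers less familiar with the three kernels compared in this study, the sketch below writes them in their usual textbook form for two feature vectors (here, vectors of PC scores). The parameter names and default values (`degree`, `gamma`, `coef0`) are assumptions for illustration; the kernel hyperparameters used in the study are not reported in the text.

```python
import math


def linear_kernel(x, y):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """(gamma * <x, y> + coef0) ** degree; the default values are illustrative assumptions."""
    return (gamma * linear_kernel(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=1.0):
    """exp(-gamma * ||x - y||^2); equals 1 when x == y and decays with distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two hypothetical samples described by their first two PC scores.
u = [1.0, 2.0]
v = [3.0, 4.0]
print(linear_kernel(u, v), polynomial_kernel(u, v), rbf_kernel(u, v, gamma=0.1))
```

The linear kernel can only produce flat decision boundaries in the PC space, while the polynomial and RBF kernels allow curved ones, which is consistent with the linear model performing worst on groups that overlap in the PC scatter plots.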
Declaration of Competing Interest
The authors declare that there are no conflicts of interest related to this article.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (NSFC) (No. 61765014); Reserve Talents Project of National High-level Personnel of Special Support Program (QN2016YX0324); Urumqi Science and Technology Project (No. P161310002 and Y161010025); and Reserve Talents Project of National High-level Personnel of Special Support Program (Xinjiang [2014]22).
References
[1] C.K. Osborne, J. Shou, S. Massarweh, et al., Crosstalk between estrogen receptor and growth factor receptor pathways as a cause for endocrine therapy resistance in breast cancer, Clin. Cancer Res. 11 (2) (2005) 865.
[2] A. Sanguinetti, A. Polistena, R. Lucchini, et al., Male breast cancer, clinical presentation, diagnosis and treatment: twenty years of experience in our Breast Unit, Int. J. Surg. Case Rep. 20 (2016) 8–11.
[3] Ministry of Health of the People’s Republic of China, Diagnosis and Treatment of Common Malignant Tumors in China, Beijing Medical University, China Union Medical University, United Press, Beijing, 1991.
[4] B. Lauby-Secretan, C. Scoccianti, D. Loomis, et al., Breast-cancer screening - viewpoint of the IARC working group, N. Engl. J. Med. 372 (24) (2015) 2353–2358.
[5] https://www.msdmanuals.com/professional/gynecology-and-obstetrics/breast-disorders/breast-cancer.
[6] M. Bilal, M. Bilal, S. Tabassum, et al., Optical screening of female breast cancer from whole blood using Raman spectroscopy, Appl. Spectrosc. 71 (5) (2016) 1004–1013.
[7] H. Pi, The conformational analysis of peptides using Fourier transform IR spectroscopy, Biopolymers 4 (37) (1995).
[8] G.I. Dovbeshko, N.Y. Gridina, E.B. Kruglova, et al., FTIR spectroscopy studies of nucleic acid damage, Talanta 53 (1) (2000) 233–246.
[9] M. Tsuboi, Application of infrared spectroscopy to structure studies of nucleic acids, Appl. Spectrosc. Rev. 3 (1) (1970) 45–90.
[10] M. Kačuráková, R.H. Wilson, Developments in mid-infrared FT-IR spectroscopy of selected carbohydrates, Cheminform 44 (4) (2001) 291–303.
[11] F. Elmi, A.F. Movaghar, M.M. Elmi, et al., Application of FT-IR spectroscopy on breast cancer serum analysis, Spectrochim. Acta A. Mol. Biomol. Spectrosc. 187 (2017) 87–91.
[12] L.V. Bel’skaya, Use of IR spectroscopy in cancer diagnosis. A review, J. Appl. Spectrosc. 86 (2) (2019) 187–205.
[13] S. Kar, D.R. Katti, K.S. Katti, Fourier transform infrared spectroscopy based spectral biomarkers of metastasized breast cancer progression, Spectrochim. Acta A. Mol. Biomol. Spectrosc. 208 (2019) 85–96.
[14] S. Rehman, Z. Movasaghi, J.A. Darr, et al., Fourier transform infrared spectroscopic analysis of breast cancer tissues; identifying differences between normal breast, invasive ductal carcinoma, and ductal carcinoma in situ of the breast, Appl. Spectrosc. Rev. 45 (5) (2010) 355–368.
[15] J. Backhaus, R. Mueller, N. Formanski, et al., Diagnosis of breast cancer with infrared spectroscopy from serum samples, Vib. Spectrosc. 52 (2) (2010) 173–177.
[16] W. Liu, Z. Sun, J. Chen, C. Jing, J. Spectrosc. 2016 (2016) 1–6.
[17] V.N. Vapnik, The Nature of Statistical Learning Theory, Wiley, New York, 1998.
[18] X. Zheng, G. Lv, Y. Zhang, et al., Rapid and non-invasive screening of high renin hypertension using Raman spectroscopy and different classification algorithms, Spectrochim. Acta A. Mol. Biomol. Spectrosc. 215 (2019) 244–248.
[19] N.C. Dingari, I. Barman, A. Saha, et al., Development and comparative assessment of Raman spectroscopic classification algorithms for lesion discrimination in stereotactic breast biopsies with microcalcifications, J. Biophotonics 6 (4) (2013) 371–381.
[20] X. Li, T. Yang, S. Li, et al., Different classification algorithms and serum surface enhanced Raman spectroscopy for noninvasive discrimination of gastric diseases, J. Raman Spectrosc. 47 (8) (2016) 917–925.
[21] R. Gautam, S. Vanga, A. Madan, et al., Raman spectroscopic studies on screening of myopathies, Anal. Chem. 87 (4) (2015) 2187–2194.
[22] Y. Manavbasi, E. Suleymanoglu, Arch. Pharm. Res. 30 (2007) 1027.
[23] R. Davis, L. Mauer, Fourier transform infrared (FT-IR) spectroscopy: a rapid tool for detection and analysis of foodborne pathogenic bacteria, Curr. Res. Technol. Educ. Top. Appl. Microbiol. Microbial Biotechnol. 2 (2010) 1582–1594.
[24] M. Jackson, H. Mantsch, D. Chapman, Infrared Spectroscopy of Biomolecules, Wiley-Liss, 1996, pp. 314–316.
[25] C. Koch, M. Brandstetter, P. Wechselberger, et al., Ultrasound-enhanced attenuated total reflection mid-infrared spectroscopy in-line probe: acquisition of cell spectra in a bioreactor, Anal. Chem. 87 (2015) 2314–2320.
[26] H. Schulz, M. Baranska, Identification and quantification of valuable plant substances by IR and Raman spectroscopy, Vib. Spectrosc. 43 (1) (2007) 13–25.
[27] G.I. Dovbeshko, N.Y. Gridina, E.B. Kruglova, et al., FTIR spectroscopy studies of nucleic acid damage, Talanta 53 (1) (2000) 233–246.
[28] J.G. Wu, Y.Z. Xu, C.W. Sun, et al., Distinguishing malignant from normal oral tissues using FTIR fiber-optic techniques, Biopolymers 62 (4) (2010) 185–192.
[29] C. Chen, G. Du, D. Tong, et al., Exploration research on the fusion of multimodal spectrum technology to improve performance of rapid diagnosis scheme for thyroid dysfunction, J. Biophotonics (2019) e201900099-e.