Diagnosis of colorectal cancer by near-infrared optical fiber spectroscopy and random forest


Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy 135 (2015) 185–191


Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy journal homepage: www.elsevier.com/locate/saa

Hui Chen a, Zan Lin b, Hegang Wu c, Li Wang b, Tong Wu b, Chao Tan b,d,⇑

a Hospital, Yibin University, Yibin, Sichuan 644007, China
b Department of Chemistry and Chemical Engineering and Key Lab of Process Analysis and Control of Sichuan Universities, Yibin University, Yibin, Sichuan 644007, China
c The First People’s Hospital of Yibin, Yibin, Sichuan 644000, China
d Computational Physics Key Laboratory of Sichuan Province, Yibin University, Yibin, Sichuan 644007, China

Highlights

• Major spectral features of colon tissues were captured.
• Random forest was used for constructing diagnostic models.
• Such a simulation procedure was fast and convenient.

Article info

Article history: Received 21 April 2014; Received in revised form 16 June 2014; Accepted 2 July 2014; Available online 10 July 2014

Keywords: Chemometrics; Spectrometry; Biodiagnostics; Random forest

Abstract

Near-infrared (NIR) spectroscopy is noninvasive, fast, relatively inexpensive, and free of ionizing radiation. Differences in NIR signals can reflect many physiological changes, which are in turn associated with factors such as vascularization, cellularity, oxygen consumption, or remodeling. NIR spectral differences between colorectal cancer and healthy tissues were investigated. A Fourier transform NIR spectroscopy instrument equipped with a fiber-optic probe was used to mimic in situ clinical measurements. A total of 186 spectra were collected and then preprocessed by standard normal variate (SNV) to remove unwanted background variance. All specimens and spots used for spectral collection were confirmed by staining and examination by an experienced pathologist so as to ensure that they were representative of the pathology. Principal component analysis (PCA) was used to uncover possible clustering. Several methods, including random forest (RF), partial least squares-discriminant analysis (PLSDA), K-nearest neighbor (KNN) and classification and regression tree (CART), were used to extract spectral features and to construct the diagnostic models. The comparison reveals that, although no obvious difference in misclassification ratio (MCR) was observed between these models, RF is preferable since it is quicker, more convenient and insensitive to over-fitting. The results indicate that NIR spectroscopy coupled with an RF model can serve as a potential tool for discriminating colorectal cancer tissues from normal ones. © 2014 Elsevier B.V. All rights reserved.

⇑ Corresponding author at: Department of Chemistry and Chemical Engineering and Key Lab of Process Analysis and Control of Sichuan Universities, Yibin University, Yibin, Sichuan 644007, China. Tel./fax: +86 831 3551080. E-mail address: [email protected] (C. Tan).
http://dx.doi.org/10.1016/j.saa.2014.07.005
1386-1425/© 2014 Elsevier B.V. All rights reserved.

Introduction

Cancer is a disease characterized by uncontrollable growth and differentiation of cells and is one of the principal causes of death in the world [1–3]. Colorectal cancer is a type of cancer caused by uncontrolled cell growth in the colon, rectum, or appendix. Despite the progress in diagnostic techniques, the vast majority of colorectal cancers, more than 90%, are already advanced or metastasized by the time they are diagnosed. Hence, there is an urgent need to develop accurate, fast, convenient, and inexpensive diagnostic methods that detect the malignancy at an earlier stage and so increase the probability of survival [4]. Although host genes, bacterial virulence and environmental factors have been implicated in the oncogenic process, the underlying molecular mechanism is still poorly understood [5].

Several diagnostic methods are currently available for colorectal cancer [6]. However, with conventional screening methods such as white-light endoscopy, it is difficult to detect early neoplasia or subtle lesions. Fecal occult blood testing (FOBT) coupled with subsequent colonoscopy is a popular choice for early detection of colorectal cancer. Biopsy followed by pathological assessment remains the gold standard, but it involves a complex procedure composed of fixation, dehydration, embedding, slicing and staining. In short, conventional diagnostic methods are time-consuming and labor-intensive, require experienced experts, and depend strongly on the experts’ ability and subjective judgment [7].

In recent years, optical spectroscopic methods have been investigated extensively for cancer and precancer diagnosis and evaluation [8–10]. Among these methods, near-infrared (NIR) spectroscopy has shown great potential since it can provide rich information on the molecular composition and structure of biological tissues. It has been used in research on cancers of the stomach [11], lung [12], breast [13], cervix [14] and prostate [15], among others. In addition, NIR-related methods are reagent-free and can rapidly detect changes in cells and tissues at the molecular level, particularly during carcinogenesis.
It is known that biological tissues usually comprise DNA/RNA, proteins, carbohydrates, lipids and water as the main constituents; all of these contribute meaningfully to the NIR absorption profile and therefore provide the informative basis for diagnosing cancer [16]. Although the NIR region is defined as encompassing the spectral range of 780–2500 nm, it is convenient to subdivide it further into short and long NIR subregions. The short NIR region mainly reflects the signal of heme proteins and cytochromes, and provides rich information about tissue blood flow as well as oxygen saturation and consumption. The long NIR region is associated with the combinations and overtones of hydrogen-containing groups and thus captures valuable information on the chemical composition of tissues. Cancerous tissues differ from normal ones in composition and histology. Therefore, any alteration in the composition of the tissue can be reflected in the NIR spectrum and used for diagnostic purposes. Several research groups have confirmed the advantages of NIR spectroscopy for studies of malignancy in both animal and human tissues.

However, despite these merits, the NIR spectrum is an overlapped, broad and weak signal without distinct signatures of individual components [17,18]. In this situation, it is necessary to apply appropriate modeling methods to extract the subtle but valuable information for clinical application. The modeling methods are of great importance and directly determine whether NIR-based applications can succeed. More recently, the so-called ‘‘ensemble’’ strategy has attracted growing attention in various fields [19–21]. The main advantage of an ensemble is that it can increase the accuracy and robustness of the predictor through the cooperation of many individual predictors. Random forest (RF) [22], a relatively new ensemble-based modeling technique, has attracted increasing interest among researchers. It combines Breiman’s bagging idea with the random selection of input variables. RF has many attractive features, including a small number of tunable parameters, automatic calculation of generalization errors and variable importance, automatic handling of missing data, and insensitivity to over-fitting.

In the present work, the qualitative NIR spectral differences between colorectal cancer and healthy tissues in surgically resected specimens were investigated. A Fourier transform NIR spectroscopy instrument equipped with a fiber-optic probe was used to mimic in situ clinical measurements. A total of 186 spectra were collected and preprocessed by standard normal variate (SNV) to remove unwanted background variance. All the specimens and spots used for spectral collection were confirmed by an experienced pathologist so as to ensure that they were representative of the pathology. Principal component analysis (PCA) was used to discover possible clustering. Several methods, including RF, partial least squares-discriminant analysis (PLSDA), K-nearest neighbor (KNN) and classification and regression tree (CART), were used to extract spectral features and to construct the diagnostic models. The comparison reveals that, although no obvious difference in misclassification ratio (MCR) was observed between these models, RF is preferable since it is quicker, more convenient and insensitive to over-fitting. The results indicate that NIR spectroscopy coupled with an RF model can serve as a potential tool for discriminating colorectal cancer tissues from normal ones.

Theory and methods

Random forest

Random forest (RF) is a promising algorithm for building classifiers and was first introduced by Breiman [22]. RF exhibits some attractive features such as a small number of tunable parameters, automatic calculation of generalization errors, automatic handling of missing values, scale invariance and strong resistance to overfitting. It is an ensemble of unpruned decision trees built by injecting randomness both into the selection of samples for the training set and into the selection of variables for the best split. At first, bagging (bootstrap aggregating) was applied to decision trees to add randomness to the selection of training samples. Dissimilar trees can thus be generated by re-sampling with replacement, which reduces the sensitivity of the constructed trees to the composition of the training set. Amit and Geman extended Breiman’s concept by adding randomness to the selection of the best variables for splitting at each node [23]. Furthermore, Dietterich proposed the concept of random split selection, in which trees grow with a random subset of the best K variables at each node [24]. These ideas were combined and evolved into the RF algorithm. As an ensemble of models, RF uses majority voting for classification tasks and averaging for regression to make a final prediction. In the ensemble framework, there are two necessary and sufficient conditions for ensuring that the ensemble is superior to its members: the member models must be better than random guessing and must be appropriately diverse. Models can be considered diverse only if their errors on unseen data are uncorrelated. The RF algorithm proceeds as follows:

(1) From the training set of n samples, draw a bootstrap sample, i.e., sample randomly with replacement.
(2) For the bootstrap sample, grow a tree with the following modification: at each node, choose the best split among a randomly selected subset of the variables rather than among all variables. The tree is grown to the maximum size without pruning.
(3) Repeat the above steps until ntree (an appropriately large number) CART models are grown. In other words, each tree corresponds to a particular bootstrap sample and a total of ntree bootstrap samples are drawn from the training dataset.
(4) Predict the class membership/labels of new samples by majority vote of the predictions from all ntree outputs.
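The four steps above can be illustrated with a minimal, standard-library sketch. It is not the authors' MATLAB implementation: the function names, the toy two-class data and the midpoint threshold search are all assumptions made for illustration, and the parameter names ntree and mtry simply follow the notation used in this section.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y, features):
    """Best (variable, threshold) among the candidate variables, or None."""
    best, best_score = None, gini(y)
    for j in features:
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0  # midpoint between neighbouring values
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (j, t), score
    return best


def grow_tree(X, y, mtry, rng):
    """Step (2): unpruned tree; each node tests only mtry random variables."""
    if len(set(y)) == 1:
        return y[0]  # pure leaf
    split = best_split(X, y, rng.sample(range(len(X[0])), mtry))
    if split is None:  # no candidate split reduces the impurity
        return Counter(y).most_common(1)[0][0]
    j, t = split
    left = [i for i, row in enumerate(X) if row[j] <= t]
    right = [i for i, row in enumerate(X) if row[j] > t]
    return (j, t,
            grow_tree([X[i] for i in left], [y[i] for i in left], mtry, rng),
            grow_tree([X[i] for i in right], [y[i] for i in right], mtry, rng))


def predict_tree(tree, row):
    """Follow the splits from the root down to a leaf label."""
    while isinstance(tree, tuple):  # class labels must not be tuples
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree


def random_forest(X, y, ntree, mtry, seed=0):
    """Steps (1) and (3): one bootstrap sample and one tree per iteration."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(ntree):
        idx = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], mtry, rng))
    return forest


def predict_forest(forest, row):
    """Step (4): majority vote over all trees."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated classes described by 4 variables.
toy_rng = random.Random(1)
X = ([[toy_rng.gauss(0.0, 0.3) for _ in range(4)] for _ in range(20)]
     + [[toy_rng.gauss(3.0, 0.3) for _ in range(4)] for _ in range(20)])
y = [1] * 20 + [2] * 20
forest = random_forest(X, y, ntree=13, mtry=2)
```

Calling predict_forest(forest, row) on a new row returns the label chosen by most trees. The exhaustive threshold search is kept for readability; it would be slow on spectra with more than a thousand variables.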


RF is considered to be user-friendly in the sense that it has only two parameters: the number of variables in the random subset at each node (mtry) and the number of trees in the forest (ntree). It is usually not very sensitive to their values. If the best split at each node is chosen among all variables, the RF algorithm is the same as bagging. The tree-growing algorithm used in RF is CART. The samples not included in a bootstrap set, i.e., the out-of-bag samples, are not used in constructing the corresponding tree and therefore serve as a test set for calculating an unbiased estimate of the prediction accuracy. When a new sample needs to be classified, it is passed down each of the ntree classification trees starting from the root and is assigned the label of the leaf it reaches; the decisions of all trees are then combined by majority voting. On average, each sample of the original dataset is out-of-bag in about one-third of the ntree tree-building iterations; in other words, each sample is classified by about one-third of the trees. The proportion of misclassifications over all out-of-bag samples is called the out-of-bag (OOB) error. For a more detailed description of how random forest works, the reader is referred to [25–27].

Classification and regression tree

Classification and regression tree (CART) is a popular type of decision tree, which can construct a statistical model from a given training dataset for a classification or regression task. Given a training set with n samples, consisting of m predictor/input variables and a response variable, the CART algorithm recursively partitions the input space to produce a tree prediction. Starting with the entire input space, CART seeks a binary partition that increases the response purity in the subspaces formed by the partition. The partition is expressed as a hyperplane perpendicular to one of the coordinate axes corresponding to the input variables. The purity of the resulting subspaces relies on the homogeneity of the response classes. Widely used measures of homogeneity are the Gini impurity for classification and the mean squared error for regression. The process of binary partitioning is repeated in each subspace until response homogeneity is achieved in the subspace. The prediction for a particular subspace is the majority vote for classification and the average for regression.

Partial least squares-discriminant analysis

Partial least squares-discriminant analysis (PLSDA) is a classic classification method that combines the properties of partial least squares regression (PLSR) with the discrimination power of a classification technique [28]. For classification, PLSDA is based on the PLS regression algorithm, which searches for latent variables (LVs) with a maximum covariance with the response variables. The main advantage of PLSDA is that the relevant sources of data variability are modeled by the so-called LVs, which are linear combinations of the original variables. It is thus convenient for visualizing and understanding different data patterns through LV scores and loadings. Loadings are the coefficients of the variables in the linear combinations and can be interpreted as the influence of each variable on each LV, while scores act as the coordinates of samples in the LV projection hyperspace. To make a class assignment, the probability that a sample belongs to a specific class is calculated, and the sample is assigned to the class with the highest probability.

K-nearest neighbor

The K-nearest neighbor (KNN) method is an intuitive method used widely for classification [29]. It is a simple distance-based learning approach whereby an unknown sample is classified according to the majority vote of its K nearest neighbors in the training set. Nearness can be measured by any appropriate distance metric, but the Euclidean distance is typical. The main advantage of KNN is that no explicit training step is required. The selection of the K value is important, since a larger K can reduce the effect of noise but makes the boundaries between classes less distinct. The standard KNN method works as follows:

(1) Calculate the distances between an unknown sample and all the samples in the training set.
(2) Select the K samples from the training set most similar to the unknown sample, according to the calculated distances.
(3) Assign the unknown sample to the class to which the majority of the K samples belong.

The parameter K is selected by optimization through the classification of a test set of samples or by cross-validation.
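The three KNN steps can be written down almost verbatim. The sketch below is only an illustration under stated assumptions: the function name knn_predict and the small two-variable training set are hypothetical, and the Euclidean distance is used because the text names it as the typical choice.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, sample, k):
    """Classify one sample by majority vote of its k nearest training samples."""
    # (1) Euclidean distance from the unknown sample to every training sample.
    distances = [(math.dist(row, sample), label)
                 for row, label in zip(train_X, train_y)]
    # (2) Keep the k training samples closest to the unknown sample.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # (3) Majority vote among the labels of those k neighbours.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy training set: class 1 near (0, 0), class 2 near (1, 1).
train_X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
           [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
train_y = [1, 1, 1, 2, 2, 2]

label_a = knn_predict(train_X, train_y, [0.15, 0.15], k=3)
label_b = knn_predict(train_X, train_y, [0.95, 1.00], k=3)
```

With two classes, an odd k avoids tied votes; for an even k the sketch falls back on the order in which labels were first counted, which is an arbitrary choice.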

Experimental

Sample collection

Colorectal cancer samples were collected from 20 randomly selected patients (eight female and twelve male) who underwent partial colorectal resection for tumor removal at the First People’s Hospital of Yibin and the Affiliated Hospital of North Sichuan Medical College. A total of 186 spectra, 9–12 from each patient according to the available sample size, were collected. The region 5–10 cm from the cancer edge was considered to be normal; for each patient, several spots located progressively from normal to cancerous sites were used to collect spectra. Each position was confirmed by pathologic results. The sample thickness was the natural thickness of the tissue. Each tissue specimen was much larger than the area of the probe, so that enough cancerous and normal regions were available for spectral collection. The mean age of the patients was 54 years, with the oldest 71 years and the youngest 31 years. The study was approved by the local ethics committee and informed consent for use of the samples was obtained from all patients. All spectra were divided into two subsets of equal size: the training set and the test set. Each subset was composed of 39 cancerous spectra and 54 normal spectra. For classification purposes, each spectrum was assigned a class label (1 for cancer and 2 for normal/control).

NIR measurements

A Fourier transform NIR spectrometer (Thermofisher, USA) equipped with an indium gallium arsenide (InGaAs) detector and a fiber-optic probe (SabIR) was used for spectral collection. The probe enables remote non-destructive sampling. The measurements were performed in diffuse reflectance mode. The outer and light spot diameters of the fiber-optic probe were 20 mm and 3 mm, respectively, so the sampling area was approximately 7.0 mm². The spectrometer was controlled by the accompanying software package, Result 3.0. The sample tissue was kept in a plastic plane with the mucosa surface upwards, and the fiber-optic probe was attached to a clamp. The tip of the probe was brought into contact with the position of interest on the mucosa surface so as to mimic in situ clinical measurements. Each spectrum was taken as the average of 32 successive scans to increase the signal-to-noise ratio. The measured spectra covered the wavenumber range 4000–10,000 cm⁻¹. The spectral resolution was 4 cm⁻¹ and each spectrum contained 1537 variables. The spectra were recorded as log (1/R), where R is the sample diffuse reflectance.
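As a quick check of the stated sampling area, and assuming that the light spot is a circle of diameter d = 3 mm (the shape is not stated explicitly in the text), the area follows directly:

```latex
A = \pi \left(\frac{d}{2}\right)^{2} = \pi \,(1.5\ \mathrm{mm})^{2} \approx 7.07\ \mathrm{mm}^{2}
```

This agrees with the quoted value of approximately 7.0 mm².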


Preprocessing and software

Due to the large variation in the spectra of different samples, it was difficult to extract useful information from the original spectra. Therefore, spectral preprocessing steps were necessary before modeling. Several methods, namely standard normal variate (SNV), mean-centering, multiplicative scatter correction (MSC), first derivative and second derivative, were tried. SNV improved the results when combined with the subsequent modeling. Thus, all spectra were preprocessed by SNV. All preprocessing and other computations were performed in MATLAB 7.0 for Windows.
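SNV treats each spectrum on its own: the spectrum's mean is subtracted and the result is divided by the spectrum's standard deviation. The following sketch is not the authors' MATLAB code; the function name snv and the four-point example spectra are assumptions made for illustration, and the sample standard deviation (n − 1) is used, which is the common convention.

```python
import statistics


def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum by its own
    mean and standard deviation."""
    mean = statistics.fmean(spectrum)
    std = statistics.stdev(spectrum)  # sample standard deviation (n - 1)
    return [(value - mean) / std for value in spectrum]


# Hypothetical absorbance values; the second spectrum is the first one with
# an additive offset and a multiplicative scaling applied.
raw = [0.2, 0.4, 0.9, 0.5]
shifted = [2.0 * value + 0.3 for value in raw]

snv_raw = snv(raw)
snv_shifted = snv(shifted)
```

The two transformed spectra coincide, which shows why SNV removes baseline offsets and multiplicative scaling between spectra while preserving their shape.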

Results and discussion

Analysis of spectral features

Fig. 1 shows the original spectra of specimens from a patient with colorectal cancer. It can be noted in Fig. 1 that the NIR spectral peaks are broad and heavily overlapped, so that it is almost impossible to perform a quantitative analysis of the tissue. Although the NIR spectrum does not exhibit absorbance peaks as specific as those of mid-infrared (MIR) spectroscopy, it still includes useful information on the biochemical composition of the tissue. Fig. 2 shows the mean spectra of the cancer and normal groups preprocessed by standard normal variate (SNV). Each NIR spectrum is a mixture of the response profiles of various components such as water, lipids, proteins and carbohydrates, all of which are necessary parts of colon tissues. Distinct differences in the spectra were observed in three intervals: the CH-, NH- and OH-combination (4500–5000 cm⁻¹), CH-stretching first overtone (5500–6000 cm⁻¹) and CH-stretching second overtone (7500–9000 cm⁻¹) regions, which illustrate the major abnormalities. The most significant spectral difference appears to occur in the 5500–6000 cm⁻¹ region. Generally, the CH stretching vibrations include: CH3 antisymmetric stretching, which belongs to proteins and lipids; CH2 antisymmetric stretching, which belongs mainly to lipids with low signals from proteins; CH3 symmetric stretching, which belongs mainly to proteins with low signals from lipids; and CH2 symmetric stretching, which belongs mainly to lipids with low signals from proteins. The NIR spectrum can also be partitioned into short-wave and long-wave intervals. The former provides information on tissue blood flow, oxygen saturation and consumption, and the redox status of the enzymes, while the latter contains information on the biochemical composition of the tissue.

Fig. 1. The original spectra of specimens from a patient with colorectal cancer.

Fig. 2. The preprocessed mean spectra of colorectal cancer and normal groups by standard normal variate (SNV).

Spectra from the cancerous tissues differ in some regions from those of normal tissues, which reflects changes in the levels of various biochemicals due to the disease. For example, it has been reported that the carbohydrate level is generally reduced in cancerous tissues in comparison with normal tissues. The phosphate content of normal tissues is higher than that of malignant ones, while the RNA/DNA ratio of malignant tissues is higher than that of normal ones by at least 70%. Since colon tissue undergoes a long multistep process from its normal status to the advanced stage of the disease, it is possible to use the NIR spectrum to predict the stage of cancer. Furthermore, Fig. 3 gives a group of representative spectra related to specimens at different cancerous stages. There is a process of gradual deterioration across the corresponding sites, which provides the possibility of cancer staging. The relatively high variation between the spectra could be due to the different tumor stages of the tissues, as well as to differences in thickness, which affects the penetration depth of NIR photons.

Principal component analysis

Principal component analysis (PCA) was used to examine the possible clustering of samples and to indicate the extent to which the NIR spectrum can differentiate cancerous and normal tissues. The first four principal components (PCs) account for about 97% of total

Fig. 3. A group of representative spectra related to specimens of different cancerous stages.


Fig. 4. Loading vector plots of the first three principal components (PCs) corresponding to the dataset.

Fig. 6. The influence of the number of trees in random forest classifiers on the misclassification ratio (MCR).

variance in the spectra. PC1, PC2, PC3 and PC4 explain 73%, 17%, 6% and 2% of the total variance, respectively. Furthermore, the loadings of the first three PCs were plotted, as shown in Fig. 4. Loading vectors can be seen as the bridge between the original variable space and the PC space. Each PC is a linear combination of all the original variables, that is, of all the wavenumbers. The coefficients of the combination are called loadings, and each wavenumber corresponds to one loading. The higher the loading, the greater the influence of the corresponding wavenumber in explaining the data variance. As can be noted in Fig. 4, the range of 5000–6000 cm⁻¹ is the most important because of the higher loadings in this region. Fig. 5 gives the scatter plot of the first three PCs of all spectra. The variance explained by each principal component is also indicated in parentheses. As shown in Fig. 5, although there are two main groups related to cancerous and normal samples, respectively, they overlap considerably. Often, an obvious overlap means that the data structure or relationship is complex and nonlinear, and therefore that a linear classifier may not be suitable. Thus, great emphasis should be placed on constructing a nonlinear classifier/model capable of enhancing the discrimination between malignant and normal tissues. This requires chemometric methods.
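To make the terms loading, score and explained variance concrete, the sketch below extracts only the first principal component of a tiny dataset. It is an illustration under stated assumptions, not the computation used in the paper: power iteration on the covariance matrix is swapped in here as a simple standard-library method, and the function names and the five two-variable samples are hypothetical.

```python
import math


def first_pc(X, iterations=200):
    """Return (loadings, explained variance fraction, scores) for PC1."""
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    Xc = [[row[j] - means[j] for j in range(m)] for row in X]  # mean-centre
    cov = [[sum(r[a] * r[b] for r in Xc) / (n - 1) for b in range(m)]
           for a in range(m)]
    v = [1.0 / math.sqrt(m)] * m  # starting vector of unit length
    for _ in range(iterations):  # power iteration on the covariance matrix
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance along v (largest eigenvalue) relative to the total variance.
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(m))
                     for a in range(m))
    total = sum(cov[a][a] for a in range(m))
    # Scores: coordinates of the centred samples along the loading vector.
    scores = [sum(r[j] * v[j] for j in range(m)) for r in Xc]
    return v, eigenvalue / total, scores


# Hypothetical data: the second variable is roughly twice the first.
X = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0], [5.0, 10.0]]
loadings, explained, scores = first_pc(X)
```

Here the loading of the second variable is about twice that of the first, and PC1 explains nearly all the variance, mirroring how a large loading marks an influential wavenumber.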

Construction of diagnostic models

Four kinds of classification algorithms, i.e., PLSDA, KNN, CART and RF, were selected to construct models for diagnosing colon cancers. For PLSDA modeling, the optimal number of factors was found by cross-validation of the training set. In KNN modeling, the choice of K was optimized by calculating the predictive ability with different K values, and the optimal K proved to be 9. When building RF classifiers, two parameters need to be optimized. One is the number of trees (ntree), i.e., the number of bootstrap samples. The other is the number of split variables at each node (mtry), which was set to the square root of the total number of variables, i.e., 39. Although some reports have argued that an RF model does not overfit when a large number of trees is included, the number of trees was optimized. Fig. 6 shows the influence of the number of trees in RF classifiers on the misclassification ratio (MCR). As can be seen in Fig. 6, on the training set, either a single tree or an ensemble of 13 trees achieved the optimal result with an MCR of 2.1%. However, using only one tree is undesirable because it not only reduces the random forest to a single tree but also leads to a relatively high MCR on the test set. With a further increase in the number of trees, the MCR of both the training and test sets becomes larger. Thus, it is appropriate to set ntree to 13. To provide further insight into RF modeling, the so-called variable importance, which indicates the most valuable variables for constructing the RF model, was examined [30]. Such a measure

Fig. 5. Scatter plot of the first three principal components (PC1–PC3) of all spectra. The variance explained by each principal component is indicated in parentheses.

Fig. 7. The first derivative of mean spectra of cancerous and normal tissues (a) and variable importance (Varimp) in the wavenumber range of 4000–10,000 cm⁻¹ for random forest modeling (b).


Table 1. Confusion matrix of four kinds of classifiers for both the training and test sets.

Model   Class   Training set               Test set
                1      2      MCR (%)      1      2      MCR (%)
PLSDA   1       38     1      2.6          34     5      13
        2       4      50     8.0          2      52     3.8
KNN     1       37     2      5.1          34     5      13
        2       0      54     0.0          2      52     3.8
CART    1       37     2      5.1          36     4      10
        2       0      54     0.0          0      54     0.0
RF      1       37     2      5.1          35     4      10
        2       0      54     0.0          0      54     0.0

Note: PLSDA, KNN, CART and RF represent partial least squares-discriminant analysis, K-nearest neighbor, classification and regression tree and random forest classifiers. The ‘‘1’’ and ‘‘2’’ denote cancerous and normal, respectively. MCR represents the misclassification ratio; the MCR values in the first and second rows of each classifier refer to the cancerous and normal classes and thus relate to sensitivity and specificity, respectively.
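The link between the tabulated counts and sensitivity or specificity can be checked with a few lines of arithmetic. The sketch below uses the RF test-set counts from Table 1 and assumes that each table row is a true class and each column a predicted class, which is consistent with the row totals of 39 cancerous and 54 normal spectra; the function name is hypothetical.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# RF, test set (Table 1): 35 cancer spectra classified correctly, 4 missed;
# 54 normal spectra classified correctly, 0 labelled as cancer.
sens, spec = sensitivity_specificity(tp=35, fn=4, tn=54, fp=0)

# Misclassification ratio of the cancer row, in percent.
mcr_cancer = 100.0 * (1.0 - sens)
```

Under this reading the cancer-row MCR is the complement of the sensitivity: 4 of 39 cancer spectra are missed, about 10%, matching the tabulated value.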

was calculated in the course of training as follows. As each tree is grown, it is used to make predictions for the OOB data. At the same time, each variable/wavenumber in the OOB data is randomly permuted, one at a time, and each such modified dataset is also predicted by the tree. At the end of the training process, the margins for each sample are calculated based on the OOB prediction and on the OOB predictions with each variable permuted. Let AM be the average margin based on the OOB prediction and AM_j the average margin based on the OOB prediction with the jth variable permuted. The measure of importance for the jth variable is then simply AM − AM_j. In effect, the variable importance of a specific wavenumber is the difference in predictive error before and after the values of that variable are randomly permuted. A larger difference in error means that the variable plays an important role in classification. Fig. 7 shows the first derivative of the mean spectra of cancerous and normal tissues as well as the variable importance (Varimp) in the wavenumber range of 4000–10,000 cm⁻¹ for RF modeling. As the bar plot shows, the region of 4250–4350 cm⁻¹ contains rich information since its variable importance is substantially high, which matches well the positions of the major absorption bands in the first derivative or original spectra. The variables at 5353 cm⁻¹ and 5761 cm⁻¹ are also important and correspond exactly to two spectral peaks in the 5000–5500 cm⁻¹ and 5500–6000 cm⁻¹ regions, respectively. That only a single variable shows high importance in each of these regions may be ascribed to the fact that the signals of the variables located there are strongly correlated, so that a single variable can represent the information distributed over its neighboring variables. Hence, the calculation of variable importance provides a novel way of identifying the variables that are important for creating the final diagnostic model.
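The idea of permuting one variable at a time can be sketched as follows. This is a simplification and not the procedure used in the paper: it measures the increase in error rate rather than the change in average margin, it works on one given dataset rather than on the OOB samples of each tree, and the function names, the toy data and the threshold classifier are hypothetical.

```python
import random


def error_rate(predict, X, y):
    """Fraction of samples whose predicted label differs from the true one."""
    return sum(predict(row) != label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, repeats=5, seed=0):
    """Mean increase in error rate when each variable is permuted in turn."""
    rng = random.Random(seed)
    baseline = error_rate(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        increase = 0.0
        for _ in range(repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between variable j and y
            permuted = [row[:j] + [value] + row[j + 1:]
                        for row, value in zip(X, column)]
            increase += error_rate(predict, permuted, y) - baseline
        importances.append(increase / repeats)
    return importances


def toy_predict(row):
    """Stand-in classifier that looks only at variable 0."""
    return 1 if row[0] < 0.5 else 2


# Hypothetical data: variable 0 determines the class, variable 1 is unrelated.
X = [[0.0, 0.1 * i] for i in range(10)] + [[1.0, 0.1 * i] for i in range(10)]
y = [1] * 10 + [2] * 10

importance = permutation_importance(toy_predict, X, y)
```

Permuting variable 1 cannot change the predictions of this classifier, so its importance is exactly zero, whereas permuting variable 0 raises the error rate and yields a positive importance.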
Table 1 summarizes the confusion matrices of the four kinds of classifiers for both the training and the test sets. The ‘‘1’’ and ‘‘2’’ denote cancerous and normal, respectively. The MCR values in the first and second rows for each classifier refer to the cancerous and normal classes and thus relate to sensitivity and specificity, respectively, with a lower MCR corresponding to a higher sensitivity or specificity. Sensitivity is defined as the ratio of true positive cases (cancer) to the sum of true positive (TP) and false negative (FN) cases: TP/(TP + FN). Specificity is defined as the ratio of true negative cases (normal) to the sum of true negative (TN) and false positive (FP) cases: TN/(TN + FP). Overall, due to the nonlinearity and complexity of the data structure in the spectra, the PLSDA model performs worst, with the highest MCR and the lowest specificity. The KNN, CART and RF models have the same performance on the training set, but on the test set the CART and RF models exhibit lower MCR values, that is, higher sensitivity and specificity. Furthermore, the CART and RF models achieve identical sensitivity and specificity on both the training set and the test set. Fig. 8 gives the prediction performance of the final RF classifier on both the training and the test sets. On the test set, only four cancer spectra and no normal spectra were misclassified. One may suspect that the

Fig. 8. The prediction performance of the final random forest (RF) classifier on both the training and test sets.

computational complexity of constructing the RF model with 13 trees is probably 13 times that of a single tree (CART). This is not the case. The RF algorithm is relatively efficient, especially when the number of variables is very large. Compared with growing a single decision tree (CART), the efficiency of RF comes from two differences. First, in the usual tree-growing algorithm, all variables are tested for their splitting ability at each node, while RF tests only a small number of variables. Second, to construct a single decision tree, it is necessary to use cross-validation to prune the tree so as to control model complexity for optimal prediction, which can incur a large computational cost.

Conclusions

The generation and progression of cancer manifest themselves at the molecular level before morphologic changes take place, and such changes cannot be discovered by conventional techniques, even pathologic examination. Fourier transform NIR spectroscopy, on the other hand, is a powerful tool for detecting changes at the molecular level and can rapidly capture small changes in molecular composition and structure. With the help of a chemometric model, it is capable of predicting the generation and progression of malignancy. The present work confirms the feasibility of combining NIR spectroscopy and random forest for rapid and automatic discrimination between cancerous and normal colon tissues. However, only a very

H. Chen et al. / Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy 135 (2015) 185–191

limited number of samples were collected in this study; considerable further work is needed before clinical application.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (21375118), the Applied Basic Research Programs of Science and Technology Department of Sichuan Province of China (2013JY0101), the Scientific Research Foundation of Sichuan Provincial Education Department of China (12ZA201 and 13ZB0300), the Yibin Municipal Innovation Foundation (2013GY018), and the Innovative Research and Teaching Team Program of Yibin University (Cx201104).
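
As a supplementary illustration of the sensitivity and specificity definitions used in the discussion of Table 1, the following minimal Python sketch computes both quantities from the four confusion-matrix counts. It also computes the MCR under the assumption that it is the fraction of all spectra that are misclassified, (FN + FP)/(TP + FN + TN + FP), which the text does not state explicitly. The class name `ConfusionCounts`, the function names, and the example counts are hypothetical and are not taken from Table 1.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Two-class counts with cancer as the positive class (hypothetical helper)."""
    tp: int  # cancer spectra predicted as cancer
    fn: int  # cancer spectra predicted as normal
    tn: int  # normal spectra predicted as normal
    fp: int  # normal spectra predicted as cancer


def sensitivity(c: ConfusionCounts) -> float:
    """TP / (TP + FN): fraction of cancer spectra correctly identified."""
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    """TN / (TN + FP): fraction of normal spectra correctly identified."""
    return c.tn / (c.tn + c.fp)


def misclassified_ratio(c: ConfusionCounts) -> float:
    """Assumed MCR definition: misclassified spectra over all spectra."""
    total = c.tp + c.fn + c.tn + c.fp
    return (c.fn + c.fp) / total


# Illustrative counts only; these are NOT the values reported in Table 1.
example = ConfusionCounts(tp=45, fn=5, tn=40, fp=0)

if __name__ == "__main__":
    print("sensitivity:", sensitivity(example))
    print("specificity:", specificity(example))
    print("MCR:", misclassified_ratio(example))
```

With counts of this form, a classifier that misses some cancer spectra but never mislabels a normal one shows reduced sensitivity while its specificity stays at 1, which is the pattern described for the final RF classifier.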