Selecting relevant Fourier transform infrared spectroscopy wavenumbers for clustering authentic and counterfeit drug samples



SCIJUS-00440; No of Pages 6 Science and Justice xxx (2014) xxx–xxx

Contents lists available at ScienceDirect

Science and Justice journal homepage: www.elsevier.com/locate/scijus

Michel J. Anzanello a,⁎, Flavio S. Fogliatto b, Rafael S. Ortiz c, Renata Limberger d, Kristiane Mariotti d

a Department of Industrial Engineering, Federal University of Rio Grande do Sul, Av. Osvaldo Aranha, 99-5° andar, Porto Alegre, RS, Brazil
b CRETIES - Centro de Referência em Avaliação de Tecnologias e Insumos Estratégicos em Saúde, Av. Osvaldo Aranha, 99-6o andar, Porto Alegre, RS 90035-190, Brazil
c Rio Grande do Sul Technical and Scientifical Division, Brazilian Federal Police, Avenida Ipiranga 1365, 90160-093 Porto Alegre, RS, Brazil
d Department of Pharmacy, Universidade Federal do Rio Grande do Sul, Av. Ipiranga, 2752, 90610-000 Porto Alegre, RS, Brazil

Article info

Article history: Received 17 February 2014; Received in revised form 28 April 2014; Accepted 30 April 2014; Available online xxxx

Keywords: Wavenumber selection; Counterfeit medicines; Clustering; Principal components analysis; Fourier transform infrared spectroscopy

Abstract

This paper proposes a novel method for selecting subsets of wavenumbers from attenuated total reflectance Fourier transform infrared (ATR-FTIR) spectroscopy that improve the clustering of medicine samples into two groups: authentic or fraudulent. To that end, we apply principal components analysis (PCA) to ATR-FTIR data and derive two variable importance indices from the PCA parameters. Next, an iterative variable (i.e., wavenumber) elimination procedure is carried out, with samples clustered through the k-means and Fuzzy C-means techniques; clustering performance is assessed by the Silhouette Index (SI). The performance of the proposed method is compared with that of a greedy variable selection method, the “leave one variable out at a time” approach, in terms of clustering quality, percentage of retained variables, and computational time. When applied to Viagra ATR-FTIR data, our propositions increased the average SI from 0.5307 to 0.8603 using 0.61% of the original 661 wavenumbers; for Cialis ATR-FTIR data, clustering quality increased from 0.7548 to 0.8681 when 1.21% of the original wavenumbers were retained. The wavenumbers retained for Cialis, located in the 1091–1046 cm−1 region, cover absorption bands of lactose, which is typically regarded as a key substance for discriminating between authentic and counterfeit samples. © 2014 Forensic Science Society. Published by Elsevier Ireland Ltd. All rights reserved.

⁎ Corresponding author. Tel.: +55 51 33084423; fax: +55 51 33084007.

http://dx.doi.org/10.1016/j.scijus.2014.04.005 1355-0306/© 2014 Forensic Science Society. Published by Elsevier Ireland Ltd. All rights reserved.

Please cite this article as: M.J. Anzanello, et al., Selecting relevant Fourier transform infrared spectroscopy wavenumbers for clustering authentic and counterfeit drug samples, Sci. Justice (2014), http://dx.doi.org/10.1016/j.scijus.2014.04.005

1. Introduction

Counterfeiting of phosphodiesterase type 5 (PDE-5) inhibitors for the treatment of erectile dysfunction has grown significantly, owing to the ease of purchasing counterfeit versions from fraudulent websites [1,2] and to falsifiers' straightforward access to the technologies needed to forge original medications [3]. Counterfeit PDE-5 inhibitors are not produced under proper manufacturing conditions or in proper pharmaceutical dosage forms, and pose serious risks to public health. Sildenafil (SLD), tadalafil (TAD), and vardenafil are among the PDE-5 inhibitors most frequently seized, owing to their great market success [4], high commercial cost, and the embarrassment associated with the pathology.

Ortiz et al. [5] applied ultra performance liquid chromatography (UPLC) to the unauthentic samples analyzed in Section 3 of this paper; their results suggested the presence of active pharmaceutical ingredients (API) other than those specified on the package (TAD and SLD). Additionally, high concentrations of TAD and SLD were detected in unauthentic samples when compared to commercial products. Finally, some excipients expected to be found in authentic Viagra (microcrystalline cellulose, calcium phosphate dibasic, croscarmellose sodium, and magnesium stearate) and in authentic Cialis (croscarmellose sodium, hydroxypropyl cellulose, hypromellose, iron oxide, lactose monohydrate, magnesium stearate, microcrystalline cellulose, sodium laurilsulfate, triacetin, and titanium dioxide) were not properly identified.

Samples of suspect PDE-5 inhibitors are routinely sent to the Brazilian Federal Police (FP) for forensic analysis; 80% of reports issued by the FP from January 2007 to September 2010 included unauthentic samples of Cialis and Viagra [6]. Tadalafil and sildenafil are successfully identified by a large variety of analytical techniques, including tablet images [7], physical control of tablets [8], inorganic profile by X-ray fluorescence spectrometry (XRF) [6], organic profile by electrospray ionization mass spectrometry (ESI-MS) [9], infrared spectroscopy (ATR-FTIR) profile [10], Raman spectroscopy [11,12], and nuclear magnetic resonance spectroscopy (1H, 13C and 15N NMR) [13]. Despite the useful information they provide, these techniques also generate a large number of irrelevant, noisy and correlated variables that tend to reduce the performance of the multivariate analysis techniques usually applied to such data. It is therefore crucial to select the most important variables generated by those analytical techniques for further analysis.

Variable selection approaches are tailored to identify a reduced subset of relevant variables (wavenumbers, in the case of ATR-FTIR) capable of improving the precision of grouping (clustering) techniques. In forensic and analytical scenarios, the accurate assignment of samples to authentic and unauthentic groups is of utmost importance, since the groups may reveal common features of seized samples and thereby help investigative forces to interrupt counterfeiting operations [14]. Variable selection procedures are of particular interest when dealing with spectroscopic data, since the number of variables is typically larger than the number of samples available. Focusing on a particular subset of variables (or a specific wavenumber region) from the original data makes it easier to understand and interpret the origin of seized samples, enabling the connection of information from different seizures. In addition, variable selection leads to more cost-effective analyses [15,16].

This paper proposes a method for selecting the most relevant subsets of ATR-FTIR wavenumbers for clustering medicine samples into two groups: authentic or fraudulent. The method first applies a multivariate technique, principal component analysis (PCA), to ATR-FTIR data, and two variable importance indices are derived from the parameters provided by PCA. Those indices are used to guide a backward variable elimination: after removing each variable, the samples are assigned to one of two groups using two non-hierarchical clustering techniques (k-means and Fuzzy C-means); grouping performance is assessed by the Silhouette Index (SI). The subset of wavenumbers yielding the maximum average SI is chosen as the recommended subset for fraud identification. We compare the performance of the proposed indices with that of a greedy variable selection method, the “leave one variable out at a time” approach, in terms of grouping quality, percentage of retained variables, and computational time. In addition, we assess the proposed method when clustering is performed on PCA scores instead of the original variables.
In that way, this paper intends to contribute to the chemical and forensic fields by highlighting the benefits of wavenumber selection in the exploratory analysis of seized samples, as recently proposed by Anzanello et al. [17] for sample classification purposes. The propositions in this paper differ from Anzanello et al. [17] in three fundamental points. First, the present manuscript relies on unsupervised multivariate techniques, i.e., clustering, while the framework of Anzanello et al. [17] uses supervised techniques to accomplish sample classification. When using clustering techniques, one aims at finding similar structures in the data that enable the assignment of observations (samples) to groups with similar features; the analyst does not know beforehand how these samples will be grouped. After grouping is done, it is typical to investigate the reasons why samples were assigned to their final groups. In this paper, we decided to form two groups in order to assess the ability of clustering techniques to split authentic and unauthentic samples into proper groups. In Anzanello et al. [17], a supervised technique (k-nearest neighbor, KNN) is used to classify samples into two classes; for that purpose, it is mandatory to know in advance the class each sample belongs to in order to train the classification tool. Second, we here use the Silhouette Index, a clustering performance measure that assesses whether a sample was properly assigned to its final group. In Anzanello et al. [17], more traditional performance measures are used, including accuracy, specificity and sensitivity. Third, this paper modifies the wavenumber importance index suggested in Anzanello et al. [17]; although the two indices proposed here resemble the index of Anzanello et al. [17], we emphasize that small modifications in the structure of the indices lead to significant changes in how wavenumber importance is measured.
When applied to Viagra ATR-FTIR data, the recommended course of analysis increased the average SI from 0.5307 to 0.8603 using 0.61% of the original 661 variables (wavenumbers). Similar results were obtained using Cialis ATR-FTIR data: clustering quality increased from 0.7548 to 0.8681 when 1.21% of the original variables were retained in the procedure.

2. Materials and multivariate techniques

2.1. ATR-FTIR analyses

The ATR-FTIR spectra were obtained using a Nicolet 380 FTIR Spectrometer (Nicolet Instrument Co., Madison, Wisconsin, USA) equipped with a DTGS (deuterated triglycine sulfate) detector and a Smart Orbit single reflection diamond ATR sampling device. We analyzed 25 authentic Viagra tablets and 28 authentic Cialis tablets, as follows: 6 authentic Viagra® tablets containing 50 mg of SLD supplied by Pfizer Ltda Laboratories; 8 authentic Cialis® tablets containing 20 mg of TAD supplied by Eli Lilly do Brasil Ltda Laboratories; 20 authentic Cialis® tablets (TAD, 20 mg) from 8 distinct batches, and 19 authentic Viagra® tablets (SLD, 50 mg) from 6 distinct batches, purchased in local pharmacies. The counterfeit samples consisted of 104 tablets seized by the Brazilian Federal Police.

Genuine and counterfeit tablets were crushed in a porcelain mortar, and the resulting powder was tested in the ATR-FTIR device; no additional handling of samples was performed. When the tablet coating consisted of a film, the fragments of this film were removed after crushing. For samples presenting no film coating, the coating became part of the sample in the homogenized powder. Twenty-five milligrams (25 mg) of each sample were positioned on the ATR crystal, and transmittance was converted to absorbance. Each mixture was sampled in triplicate; further discussion regarding variation in the spectra provided by authentic and unauthentic samples can be found in [10]. The same pressure from the ATR anvil was used in all measurements. Each spectrum comprises 16 co-added scans measured at a spectral resolution of 4 cm−1 in the 4000–525 cm−1 range. Spectral data were acquired with EZ OMNIC software, version 7.2a (Nicolet Instrument Co.). The ATR crystal was cleaned with acetone after each measurement. An hourly background spectrum was obtained against air with a clean and dry ATR element using the same instrumental conditions as for the samples. No spectral pretreatments, such as baseline correction or normalization, were employed.

Fig. 1 presents representative spectra of genuine and counterfeit Viagra and Cialis samples. The most important peaks for TAD are associated with C-O bonds in the 1700 cm−1 band, and with C-C bonds associated with the ketone group in the 1280 and 1172 cm−1 bands. As for SLD, the 1676 cm−1 peak is correlated with C-N stretching (1690–1640 cm−1); N-H bonding appears at 1647 cm−1; C-N bonds in the O-C-N functional group absorb at 1400 cm−1, accounting for the 1402 cm−1 absorbance; and the aryl C-N bonds are responsible for the 1269 cm−1 peak. At 1048 cm−1, 909 cm−1 and 890 cm−1, there are characteristic infrared absorption peaks for lactose (an excipient of authentic Cialis).

2.2. Multivariate techniques

PCA is a technique for dimensional reduction of datasets that linearly combines the original variables (wavenumbers), generating the so-called principal components. Data reduction occurs when only a subset of components is deemed significant for representing the original data. Consider a matrix X comprised of N samples (in the rows) described by L variables (in the columns); sample i is represented by a vector $x_i = (x_{i1}, x_{i2}, \dots, x_{iL})$. PCA constructs L independent linear combinations, $t_{ic} = w_{1c}x_{i1} + w_{2c}x_{i2} + \dots + w_{Lc}x_{iL}$, of the variables [18]; a subgroup of C combinations is usually sufficient to explain most of the variability in the original data, such that $C < L$. According to Rencher [18], the number of components to be retained may be defined based on the amount of explained variance. In addition, the weight associated with variable l, $w_{lc}$, is determined such that the variance of the components is maximized; the amount of variance explained by each retained component is represented by $\lambda_c$.
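For reference, the PCA quantities used in the remainder of the paper can be written compactly as below. This is the standard textbook formulation rather than a transcription from the original: it assumes that each wavenumber column is mean-centred before the decomposition (the text only states that no spectral pretreatment was applied), and the symbols $S$ (sample covariance matrix), $\ell_c$ (its c-th eigenvalue) and $\bar{x}_l$ (column mean) are introduced here for illustration only.

```latex
S = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^{\top},
\qquad
S\, w_c = \ell_c\, w_c,
\qquad
w_c^{\top} w_c = 1,

t_{ic} = \sum_{l=1}^{L} w_{lc}\,(x_{il} - \bar{x}_l),
\qquad
\lambda_c = \frac{\ell_c}{\sum_{k=1}^{L} \ell_k},
\qquad c = 1, \dots, C .
```

Under these assumptions, $w_{lc}$ is the l-th entry of the c-th eigenvector and $\lambda_c$ is the fraction of total variance explained by component c, which is how both symbols are used in the importance indices of Section 2.3.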
Fig. 1. Representative spectra (absorbance versus wavenumber, cm−1, plotted from 1800 to 600 cm−1) of genuine Viagra (in green), counterfeit Viagra (in blue), genuine Cialis (in purple), and counterfeit Cialis (in red). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Data clustering is an unsupervised multivariate technique that assigns observations (samples) to classes (clusters) so that observations in the same cluster are as similar as possible, and items in different clusters are as dissimilar as possible [19,20]. The first clustering technique we test is the k-means algorithm [21], which inserts observations into the cluster with the nearest centroid by minimizing the sum of the Euclidean distances between observations and centroids [22]. The number of clusters, k, is user-defined. The second clustering technique is the Fuzzy C-means (FCM); unlike k-means, FCM gives each observation a “membership grade” that measures how much that observation belongs to each cluster, rather than assigning it entirely to a specific cluster. That grade is inversely related to the distance from observation i to the clusters' centroids around that observation [23]; observation i is inserted into the cluster with the highest probability of owning that observation (i.e., the cluster with the highest “membership grade”). Additional details on FCM are available in Ahmed et al. [23], and Nock and Nielsen [24].

The quality of the clustering results provided by either k-means or FCM may be assessed by the Silhouette Index (SI), as in Eq. (1), which measures how similar an observation is to the observations in its own cluster, compared to observations in other clusters [20,25,26]. In Eq. (1), a(i) is the average distance from the i-th observation to all others in its cluster, and b(i) is the average distance from the i-th observation to all observations assigned to the nearest neighboring cluster. An SI value ranging from −1 to +1 is calculated for each observation; the closer to +1, the more distant the observation is from those in neighboring clusters, meaning a proper grouping procedure. According to Kaufman and Rousseeuw [20], the global quality of a clustering procedure can be assessed by averaging the SIs over all clustered observations.

$$SI_i = \frac{b(i) - a(i)}{\max\{b(i),\, a(i)\}} \qquad (1)$$
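To make Eq. (1) concrete, the following is a minimal sketch of the per-sample Silhouette Index in plain Python. It is not the authors' Matlab implementation; the function names, the toy one-dimensional data, and the conventions of returning 0 for a singleton cluster or for coincident points are assumptions made for illustration.

```python
import math

def euclid(p, q):
    """Euclidean distance between two equally long sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def silhouette_values(points, labels):
    """Return SI_i of Eq. (1) for every sample; needs at least two clusters."""
    clusters = sorted(set(labels))
    values = []
    for i, p in enumerate(points):
        # a(i): average distance to the other members of the sample's own cluster
        own = [euclid(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            values.append(0.0)  # assumed convention for a singleton cluster
            continue
        a = sum(own) / len(own)
        # b(i): average distance to the members of the nearest other cluster
        b = min(
            sum(euclid(p, q) for j, q in enumerate(points) if labels[j] == k)
            / labels.count(k)
            for k in clusters if k != labels[i]
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values

# Toy data (hypothetical): two well-separated groups on a line.
points = [[0.0], [1.0], [10.0], [11.0]]
good = silhouette_values(points, [0, 0, 1, 1])   # sensible grouping
bad = silhouette_values(points, [0, 1, 0, 1])    # deliberately wrong grouping
print(sum(good) / len(good), sum(bad) / len(bad))
```

With the sensible grouping every SI is close to +1, whereas the deliberately wrong grouping yields negative values, which is the pattern the paper later uses to flag misplaced samples in the silhouette graphs.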

2.3. Proposed method

There are two operational steps to select the most relevant ATR-FTIR wavenumbers for clustering samples into authentic or counterfeit groups: (i) apply PCA to the data and generate variable importance indices, and (ii) group samples using a clustering technique and eliminate irrelevant wavenumbers using a backward procedure. These steps are now detailed.

In the first step, we apply PCA to the dataset; the outputs of interest are the percentage of variance, $\lambda_c$, explained by each retained component c (c = 1,…,C), and the components' weights $w_{lc}$. We then derive two variable importance indices, based on $\lambda_c$ and $w_{lc}$, to guide the removal of variables. The first index, h(1), relies on $w_{lc}$ as in Eq. (2). Since variables with high absolute weights are deemed relevant in explaining variability in the principal components, we propose using the weights $w_{lc}$ to order variables according to their ability to explain variance in the original data. The second variable importance index we propose is h(2), in which weights are calibrated by the percentage of variance explained by each retained component; see Eq. (3). In our proposition, variables with high $w_{lc}$ associated with components with large $\lambda_c$ are preferred, suggesting that variables with higher $h_l$ enable better sample grouping [17].

$$h_l^{(1)} = \sum_{c=1}^{C} |w_{lc}|, \qquad l = 1, \dots, L \qquad (2)$$

$$h_l^{(2)} = \sum_{c=1}^{C} \lambda_c\, |w_{lc}|, \qquad l = 1, \dots, L \qquad (3)$$
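As an illustration of Eqs. (2) and (3), the sketch below computes both indices from a weight matrix and a vector of explained-variance shares. The function name and the toy numbers are hypothetical (the weights are not taken from the paper's data and are not normalized as real PCA weights would be); the point is only to show that the two indices can rank the same variables differently.

```python
def importance_indices(W, lam):
    """W[l][c]: weight of variable l on component c; lam[c]: variance share of component c."""
    h1 = [sum(abs(w) for w in row) for row in W]                        # Eq. (2)
    h2 = [sum(lc * abs(w) for lc, w in zip(lam, row)) for row in W]     # Eq. (3)
    return h1, h2

# Toy example (hypothetical values): 3 variables, 2 retained components.
W = [[0.8, 0.1],
     [0.1, 0.9],
     [-0.6, 0.4]]
lam = [0.7, 0.2]

h1, h2 = importance_indices(W, lam)
first_removed_h1 = h1.index(min(h1))   # variable with the smallest h(1)
first_removed_h2 = h2.index(min(h2))   # variable with the smallest h(2)
print(h1, h2, first_removed_h1, first_removed_h2)
```

In this toy case h(1) would remove variable 0 first, while h(2) would remove variable 1 first, because variable 1 loads mainly on the component that explains less variance. This is consistent with the remark in the Introduction that small changes in the structure of the index can change how wavenumber importance is measured.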

In the second step, assign each sample to one of two groups (authentic or unauthentic) by applying a clustering technique with the complete set of L wavenumbers as clustering variables, and assess the quality of the resulting clusters through the Silhouette Index (SI). Since there is an SI value associated with each sample i, we recommend averaging SI over all clustered samples to evaluate the quality of the clustering technique [20]. Next, remove the variable with the smallest $h_l$, perform a new grouping based on the remaining L − 1 variables, and recalculate the average SI. This iterative procedure is repeated until a single variable is left. The progression of the variable removal procedure may be viewed in a graph with the average SI on the vertical axis and the percentage of retained variables on the horizontal axis. We look for the subset of variables yielding the maximum average SI in that graph. We test both variable importance indices, h(1) and h(2), and both clustering techniques, k-means and FCM, in the aforementioned procedure.

As part of our propositions, we also test a “leave one variable out at a time” approach in order to identify a reduced wavenumber subset, which may then be compared with the one obtained by the method proposed above. To that end, one variable (wavenumber) at a time is left out of the clustering procedure, and the average SI is computed for each instance. Once all variables have been omitted in turn, the variable whose omission yields the maximum average SI is eliminated, as it is the one that contributes the least to assigning samples to groups; i.e., SI is increased when that wavenumber is left out of the clustering wavenumber subset [25]. The iterative process is then repeated for the remaining variables,


h(1)), and wavenumber retention (0.61% for h(2) and 0.91% for h(1)). This difference is due to the percentage of variance (λ) explained by each retained component, which is included in h(2); it rescales the PCA weights and improves the sequencing for wavenumber removal. Although clustering on PCA scores t reduced the processing time compared to clustering on the original data x, it did not significantly affect clustering precision or the percentage of retained wavenumbers when using indices h(1) and h(2). In addition, the “leave one out at a time” technique shows a clear disadvantage in terms of clustering precision and percentage of retained variables, especially when PCA scores are used as clustering variables: it led to smaller SIs, yielded unstable percentages of retained wavenumbers, and required significantly more computational time than the propositions based on h(2) and h(1).

In light of the aforementioned results, we recommend combining the k-means clustering technique (due to its simpler mathematical fundamentals compared to FCM) with the h(2) index and the original data x as input for selecting the most relevant wavenumbers for clustering procedures. This combination increased the average SI for the Viagra data from 0.5307 using all wavenumbers to 0.8603 using only 0.61% of the original 661 wavenumbers. The retained wavenumbers lie in the 1012–995 cm−1 region, where the excipients microcrystalline cellulose and croscarmellose exhibit strong intensity bands. Figs. 3 and 4 depict the SI graphs when clustering is carried out on all wavenumbers and on the recommended subset of clustering wavenumbers, respectively; cluster 1 is comprised of counterfeit samples, and cluster 2 of authentic ones. Note that each sample is represented by a horizontal bar; the closer the SI for each sample is to 1, the better that sample is allocated to its final cluster. There is a clear improvement in grouping quality when using the selected wavenumbers: clustering on all 661 wavenumbers yields a significant number of samples inappropriately inserted in cluster 2, identified by negative SI values, while no sample misplacement is observed when using the recommended subset (all SIs are positive).

We retained three principal components explaining 79% of total variance for the Cialis data based on the Scree Graph [18]; Table 2 depicts the average SI, percentage of retained wavenumbers and processing time. Wavenumber importance indices h(1) and h(2) perform almost identically in terms of clustering quality, and quite similarly regarding the percentage of wavenumbers retained and processing time. As with the Viagra data, clustering on PCA scores using the “leave one out at a time” approach yields a significantly smaller SI and a larger percentage of retained wavenumbers. Also corroborating the results from the Viagra data, the two clustering techniques performed similarly in terms of grouping precision and percentage of retained wavenumbers. In light of these results, we also recommend using k-means on input data x combined with the h(2) index for the Cialis ATR-FTIR spectra. This combination of grouping technique, wavenumber importance index, and input data increases the clustering quality from 0.7548 to 0.8681 when using 1.21% of the original wavenumbers.

Our results give evidence that discrimination between authentic and counterfeit samples relies heavily on information from lactose absorption bands. Commercial tablets are complex powder mixtures comprised of active ingredients and adjuvants, the latter usually in larger amounts. As we already remarked in a previous work on the subject (Ortiz et al. [10], p. 282): “In pharmaceutical powder mixtures, the discrimination obtained via PCA is related not only to drug presence in the

Fig. 2. Average SI profile as variables are removed using k-means on input data x and wavenumber importance index h(2) in the Viagra dataset (vertical axis: average Silhouette Index; horizontal axis: number of retained wavenumbers).

and the average SI is again evaluated after each variable is omitted. We repeat this procedure until there is a single variable left. The subset of variables yielding the maximum average SI is recommended.

Finally, we also test the two approaches above (variable selection guided by variable importance indices, and the “leave one out at a time” approach) on the PCA scores. We will refer to clustering on the original ATR-FTIR wavenumbers as “input data x”, and to clustering on the PCA scores as “input data t”. The goal here is to assess whether PCA scores lead to better clustering results than the original wavenumbers, as claimed in several studies in the forensic field; e.g., [10,27].

3. Results

The proposed method was applied to 28 samples of authentic Cialis, 25 samples of authentic Viagra, and 104 counterfeit samples sent to the Brazilian FP for forensic analysis. All computational experiments were performed in Matlab 7.8 on a 2.4 GHz computer.

Three principal components explaining 85% of total variance were retained for the Viagra dataset based on the Scree Graph [18]. Fig. 2 depicts the average SI profile as variables were removed using the k-means clustering technique on input data x and wavenumber importance index h(2). To analyze the results in Fig. 2, recall that variable removal takes place backwards. The removal of the first 300 wavenumbers (right to left on the horizontal axis) does not significantly affect the average SI, which sits around SI = 0.53. After that point, wavenumber removal systematically improves clustering quality until it reaches the maximum SI = 0.8603 when 8 wavenumbers are retained (see Table 1). Similar profiles were obtained for other combinations of importance index and clustering technique, as presented in Table 1. Both k-means and FCM clustering techniques yielded similar results regarding grouping precision (assessed through the SI) and percentage of retained wavenumbers when using the proposed wavenumber importance indices.
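The index-guided backward elimination of Section 2.3, whose output is the kind of profile shown in Fig. 2, can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' Matlab code: `kmeans2` is a bare-bones two-cluster Lloyd iteration with a deterministic seeding chosen here for reproducibility, `mean_silhouette` averages Eq. (1) for the two-cluster case, and the data matrix `X` and index values `h` are hypothetical.

```python
import math

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans2(points, n_iter=50):
    """Two-cluster Lloyd iterations, seeded with the first point and the point farthest from it."""
    c0 = points[0]
    c1 = max(points, key=lambda p: euclid(p, c0))
    centroids = [list(c0), list(c1)]
    labels = None
    for _ in range(n_iter):
        new = [0 if euclid(p, centroids[0]) <= euclid(p, centroids[1]) else 1
               for p in points]
        if new == labels:
            break  # assignments stopped changing
        labels = new
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def mean_silhouette(points, labels):
    """Average of Eq. (1) over all samples (two clusters); -1.0 if only one cluster was formed."""
    if len(set(labels)) < 2:
        return -1.0
    total = 0.0
    for i, p in enumerate(points):
        own = [euclid(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        other = [euclid(p, q) for j, q in enumerate(points) if labels[j] != labels[i]]
        if not own:
            continue  # singleton cluster contributes 0 (assumed convention)
        a, b = sum(own) / len(own), sum(other) / len(other)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def backward_elimination(X, h):
    """Drop the variable with the smallest index value at each step; keep the best subset."""
    keep = sorted(range(len(h)), key=lambda l: h[l], reverse=True)  # most important first
    best_si, best_subset = -1.0, None
    while keep:
        sub = [[row[l] for l in keep] for row in X]
        si = mean_silhouette(sub, kmeans2(sub))
        if si > best_si:
            best_si, best_subset = si, sorted(keep)
        keep.pop()  # removes the remaining variable with the smallest h
    return best_si, best_subset

# Toy data (hypothetical): variable 0 separates the two groups, variables 1 and 2 are noise.
X = [[0.0, 5.0, 1.0], [0.2, 1.0, 4.0], [0.1, 3.0, 2.0],
     [10.0, 4.0, 3.0], [10.1, 2.0, 1.0], [9.9, 0.0, 5.0]]
h = [0.9, 0.2, 0.1]  # hypothetical importance index values
best_si, best_subset = backward_elimination(X, h)
print(best_si, best_subset)
```

On the toy data, removing the two noisy variables raises the average SI, so the single informative variable is returned as the recommended subset. Note that this guided procedure needs one clustering run per retained-subset size (L runs in total), whereas a “leave one out at a time” search that tries every remaining variable at every stage needs on the order of L²/2 runs; this is one plausible reading of the large gap in processing times reported in Table 1.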
There is a slight advantage of h(2) compared to h(1) in terms of clustering accuracy (SI = 0.8568 for h(2) and SI = 0.8152 for

Table 1. Average SI profile, percentage of retained variables and processing time for all combinations of clustering techniques, input data and wavenumber importance indices for the Viagra dataset. LOO = leave one out at a time.

| Clustering technique | Input data | Average SI: h(1) | Average SI: h(2) | Average SI: LOO | Retained wavenumbers (%): h(1) | Retained wavenumbers (%): h(2) | Retained wavenumbers (%): LOO | Processing time (min): h(1) | Processing time (min): h(2) | Processing time (min): LOO |
| k-Means | x | 0.8183 | 0.8603 | 0.8599 | 0.91% | 0.61% | 0.30% | 1.17 | 1.15 | 1279.68 |
| k-Means | t | 0.8183 | 0.8603 | 0.7212 | 0.91% | 0.61% | 89.71% | 0.78 | 0.78 | 738.87 |
| Fuzzy C-means | x | 0.8120 | 0.8532 | 0.8347 | 0.91% | 0.61% | 0.45% | 1.73 | 1.73 | 1354.70 |
| Fuzzy C-means | t | 0.8120 | 0.8532 | 0.6387 | 0.91% | 0.61% | 73.15% | 1.12 | 1.09 | 831.25 |


1083 and 1092 cm−1, i.e., within the retained regions. According to Ortiz et al. [5], lactose is present in a large percentage in authentic Cialis samples, but not in forged ones. It is thus natural that this substance plays a key role in the classification process. Finally, one must bear in mind that these results are valid only for the set of samples analyzed, which is rather small. For further insights into the potential practical forensic impacts of the analyses proposed here, we direct the reader to the work of Ortiz et al. [10], where the same datasets are investigated, although without addressing the wavenumber selection problem that is our focus here.

Fig. 3. Silhouette index graph using all wavenumbers (horizontal axis: silhouette value, from −0.2 to 1; vertical axis: cluster, 1 or 2).

4. Conclusion

ATR-FTIR spectrum analysis, widely known as a useful technique for detecting fraudulent medicines, usually relies on a large number of wavenumbers, which tends to decrease the performance of multivariate techniques. In this paper, a novel method for selecting the most relevant subsets of wavenumbers for clustering samples into two classes, authentic or fraudulent, is suggested. The proposed method applies principal component analysis (PCA) to ATR-FTIR data. PCA parameters are used to develop two wavenumber importance indices aimed at guiding a backward variable elimination; after each variable is eliminated, samples are assigned to one of two classes using the k-means and Fuzzy C-means clustering techniques, and the performance is evaluated by the Silhouette Index (SI). The proposed method is compared with the “leave one variable out at a time” greedy approach, and with clustering performed on PCA scores instead of the original variables.

When applied to Viagra ATR-FTIR data, the recommended combination of clustering technique (k-means), wavenumber importance index (h(2)), and input data (x) increased the average SI from 0.5307 to 0.8603 using only 0.61% of the original wavenumbers. Similar results were obtained for the Cialis data, where retaining 1.21% of the original wavenumbers increased the clustering quality from 0.7548 to 0.8681. Results also provided evidence that discrimination between authentic and counterfeit samples relies heavily on the information from lactose absorption bands.

Future developments include replacing the unsupervised techniques with supervised ones, including the support vector machine data mining technique. Another promising approach would use the parameters yielded by partial least squares (PLS) regression to give rise to more robust wavenumber importance indices; such indices could be integrated into the method presented in this paper.

Fig. 4. Silhouette index graph using the recommended wavenumber subset (horizontal axis: silhouette value, from 0 to 1; vertical axis: cluster, 1 or 2).

samples, but also to the various technological adjuvants present in commercial tablets. This occurs because the FTIR spectrum of commercial tablets corresponds to a mixture, i.e., drug + adjuvants. As the adjuvants are generally in higher quantity, e.g., each Cialis® tablet contains 20 mg of TAD and 245 mg of lactose monohydrate, the resulting spectrum is very similar to that of the pure adjuvant.”

In the Viagra dataset, the retained wavenumbers were in the 1012–995 cm−1 region, while in the Cialis dataset the retained wavenumbers were in the 1091–1046 cm−1 region. The most significant absorption bands of the pharmacologically active ingredients in those samples are not in those regions. The IR spectrum of TAD can be characterized by absorption peaks at 1673 cm−1 (C-O amide), 1644 cm−1 (C-C aromatic), cm−1 (C-N stretch), and 745 cm−1 (benzene); Vyas et al. [28]. SLD is characterized by absorption peaks at 1698 cm−1 (C-O carbonyl group), and 939 cm−1 (C-H aromatic out-of-plane deformation), in addition to peaks at 1171, 754, 618, 587, and 554 cm−1; Issa et al. [29]. Lactose, on the other hand, can be characterized by absorption peaks at 1070,

References

[1] P.-Y. Sacré, E. Deconinck, T. de Beer, P. Courselle, R. Vancauwenberghe, P. Chiap, J. Crommen, J.O. de Beer, Impurity fingerprints for the identification of counterfeit medicines—a feasibility study, J. Pharm. Biomed. Anal. 53 (2010) 445–453.
[2] A.L. Rodomonte, M.C. Gaudiano, E. Antoniella, D. Lucente, V. Crusco, M. Bartolomei, P. Bertocchi, L. Manna, L. Valvo, F. Alhaique, N. Muleri, Counterfeit drugs detection by measurement of tablets and secondary packaging colour, J. Pharm. Biomed. Anal. 53 (2010) 215–220.
[3] F.M. Fernandez, D. Hostetler, K. Powell, H. Kaur, M. Green, D.C. Mildenhall, P.N. Newton, Poor quality drugs: grand challenges in high throughput detection, countrywide sampling, and forensics in developing countries, Analyst (2010) 3073–3082.
[4] U. Holzgrabe, M. Malet-Martino, Analytical challenges in drug counterfeiting and falsification—the NMR approach, J. Pharm. Biomed. Anal. 55 (2011) 679–687.

Table 2. Average SI profile, percentage of retained variables and processing time for all combinations of clustering techniques, input data and wavenumber importance indices for the Cialis dataset.

| Clustering technique | Input data | Average SI, h(1) | Average SI, h(2) | Average SI, leave one out at a time | Retained wavenumbers (%), h(1) | Retained wavenumbers (%), h(2) | Retained wavenumbers (%), leave one out at a time | Processing time (min), h(1) | Processing time (min), h(2) | Processing time (min), leave one out at a time |
|---|---|---|---|---|---|---|---|---|---|---|
| k-Means | x | 0.8681 | 0.8681 | 0.8437 | 1.21% | 1.21% | 0.30% | 5.90 | 6.40 | 4197.33 |
| k-Means | t | 0.8682 | 0.8682 | 0.5566 | 1.06% | 1.06% | 51.24% | 1.92 | 1.92 | 1589.52 |
| Fuzzy C-means | x | 0.8622 | 0.8621 | 0.8437 | 1.82% | 1.36% | 0.60% | 7.48 | 7.87 | 4197.33 |
| Fuzzy C-means | t | 0.8622 | 0.8621 | 0.5566 | 1.54% | 1.25% | 36.57% | 2.25 | 2.29 | 1589.52 |
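To put the Table 2 figures in perspective, the short Python calculation below converts a retained percentage into a wavenumber count and compares processing times. It is only a sketch under stated assumptions: the total of 661 wavenumbers is the figure quoted in the abstract for the Viagra data and is assumed here to hold for the Cialis spectra as well, the numbers are read as the k-means / x entries of Table 2, and the variable names are illustrative.

```python
# Assumption: 661 wavenumbers per spectrum (quoted in the abstract for Viagra).
total_wavenumbers = 661

# Values read from Table 2 for k-means clustering on the original variables (x).
retained_pct_h2 = 1.21          # retained wavenumbers (%), index h(2)
time_h2_min = 6.40              # processing time (min), index h(2)
time_leave_one_out_min = 4197.33  # processing time (min), leave one out at a time

# Approximate number of wavenumbers kept by the h(2)-guided elimination.
retained_count = round(total_wavenumbers * retained_pct_h2 / 100)

# How many times longer the greedy baseline ran than the h(2)-guided procedure.
time_ratio = time_leave_one_out_min / time_h2_min

print(f"retained wavenumbers: about {retained_count} of {total_wavenumbers}")
print(f"leave-one-out time / h(2) time: {time_ratio:.0f}x")
```

Under these assumptions, the index-guided procedure keeps only a handful of wavenumbers while running in a small fraction of the time needed by the greedy baseline.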

Please cite this article as: M.J. Anzanello, et al., Selecting relevant Fourier transform infrared spectroscopy wavenumbers for clustering authentic and counterfeit drug samples, Sci. Justice (2014), http://dx.doi.org/10.1016/j.scijus.2014.04.005


[5] R. Ortiz, K. Mariotti, M. Holzschuh, W. Romão, R. Limberger, P. Mayorga, Profiling counterfeit Cialis, Viagra and analogs by UPLC–MS, Forensic Sci. Int. 229 (2013) 13–20.
[6] R.S. Ortiz, K.C. Mariotti, N.V. Schwab, G.P. Sabin, W.F.C. Rocha, E.V.R. Castro, R.P. Limberger, P. Mayorga, M.I.M.S. Bueno, W. Romão, Fingerprinting of sildenafil citrate and tadalafil tablets in pharmaceutical formulations via X-ray fluorescence spectrometry (XRF), J. Pharm. Biomed. Anal. 58 (2012) 7–11.
[7] C.R. Jung, R.S. Ortiz, R. Limberger, P. Mayorga, A new methodology for detection of counterfeit coated tablets by image processing and statistical analysis: application to Viagra® and Cialis® tablets, Forensic Sci. Int. 216 (2012) 92–96.
[8] R.S. Ortiz, K.C. Mariotti, R.P. Limberger, P. Mayorga, Physical profile of counterfeit tablets Viagra and Cialis, Braz. J. Pharm. Sci. 48 (2012) 1–9.
[9] R.S. Ortiz, K.C. Mariotti, W. Romão, M.N. Eberlin, R.P. Limberger, P. Mayorga, Chemical fingerprinting of counterfeits of Viagra and Cialis tablets and analogues via electrospray ionization mass spectrometry, Am. J. Anal. Chem. 2 (2011) 919–928.
[10] R.S. Ortiz, K.C. Mariotti, Bruna Fank, R.P. Limberger, M. Anzanello, P. Mayorga, Counterfeits Cialis and Viagra fingerprinting by ATR-FTIR spectroscopy with chemometry: can the same pharmaceutical powder mixture be used to falsify two medicines? Forensic Sci. Int. 226 (2013) 282–289.
[11] M. de Veij, A. Deneckere, P. Vandenabeele, D. de Kaste, L. Moens, Detection of counterfeit Viagra® with Raman spectroscopy, J. Pharm. Biomed. Anal. 46 (2008) 303–309.
[12] S. Trefi, C. Routaboul, S. Hamieh, V. Gilard, M. Malet-Martino, R. Martino, Analysis of illegally manufactured formulations of tadalafil (Cialis) by 1H NMR, 2D DOSY 1H NMR and Raman spectroscopy, J. Pharm. Biomed. Anal. 47 (2008) 103–113.
[13] I. Wawer, M. Pisklak, Z. Chilmonczyk, 1H, 13C, 15N NMR analysis of sildenafil base and citrate (Viagra) in solution, solid state and pharmaceutical dosage forms, J. Pharm. Biomed. Anal. 38 (2005) 865–870.
[14] M. Lopes, J. Wolff, J. Bioucas-Dias, M. Figueiredo, Determination of the composition of counterfeit Heptodin image tablets by near infrared chemical imaging and classical least squares estimation, Anal. Chim. Acta. 641 (2009) 46–51.
[15] Z. Xiaobo, Z. Jiewen, M.J.W. Povey, M. Holmes, M. Hanpin, Variables selection methods in near-infrared spectroscopy, Anal. Chim. Acta. 667 (2010) 14–32.
[16] R.M. Balabin, S.V. Smirnov, Variable selection in near-infrared spectroscopy: benchmarking of feature selection methods on biodiesel data, Anal. Chim. Acta. 692 (2011) 63–72.

[17] M. Anzanello, R. Ortiz, R. Limberger, P. Mayorga, A multivariate-based wavenumber selection method for classifying medicines into authentic or counterfeit classes, J. Pharm. Biomed. Anal. 83 (2013) 209–214.
[18] A. Rencher, Methods of Multivariate Analysis, Wiley Interscience, New York, 1995.
[19] J. Jobson, Applied Multivariate Data Analysis, V. II: Categorical and Multivariate Methods, Springer-Verlag, New York, 1992.
[20] L. Kaufman, P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley Interscience, New Jersey, 2005.
[21] A. Jain, R. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, 1988.
[22] H. Taboada, D. Coit, Data clustering of solutions for multiple objective system reliability optimization problems, Qual. Technol. Quant. Manag. J. 4 (2) (2007) 191–210 (3).
[23] M. Ahmed, S. Yamany, N. Mohamed, A., T. Moriarty, A modified Fuzzy C-means algorithm for bias field estimation and segmentation of MRI data, IEEE Trans. Med. Imaging 21 (2002) 193–199.
[24] R. Nock, F. Nielsen, On weighting clustering, IEEE Trans. Pattern Anal. 28 (8) (2006) 1–13.
[25] M. Anzanello, F. Fogliatto, Selecting the best clustering variables for grouping mass-customized products involving workers learning, Int. J. Prod. Econ. 130 (2011) 268–276.
[26] M. Lopes, J. Wolff, Investigation into classification/sourcing of suspect counterfeit Heptodin tablets by near infrared chemical imaging, Anal. Chim. Acta. 633 (2009) 149–155.
[27] K. Dégardin, Y. Roggo, F. Been, P. Margot, Detection and chemical profiling of medicine counterfeits by Raman spectroscopy and chemometrics, Anal. Chim. Acta. 705 (1–2) (2011) 334–341.
[28] V. Vyas, P. Sancheti, P. Karekar, M. Shah, Y. Pore, Physicochemical characterization of solid dispersion systems of tadalafil with poloxamer 407, Acta Pharma. 59 (2009) 453–461.
[29] Y.M. Issa, W.F. El-Hawary, A.F.A. Youssef, A.R. Senosy, Synthesis and structural study of the ion-associates of sildenafil citrate with chromotropic acid azo dyes, Eur. Chem. Bull. 1 (6) (2012) 205–209.
