Machine learning method using position-specific mutation based classification outperforms one hot coding for disease severity prediction in haemophilia ‘A’


Genomics 112 (2020) 5122–5128
Journal homepage: www.elsevier.com/locate/ygeno
Original Article



Vikalp Kumar Singh (a), Neha Shree Maurya (b), Ashutosh Mani (b,*), Rama Shankar Yadav (a)

(a) Department of Computer Science and Engineering, Motilal Nehru National Institute of Technology Allahabad, UP 211004, India
(b) Department of Biotechnology, Motilal Nehru National Institute of Technology Allahabad, UP 211004, India

ARTICLE INFO

Keywords: Machine learning; Mutation; Position-specific mutation; One-hot encoding; Haemophilia; Factor VIII

ABSTRACT

Haemophilia is an X-linked genetic disorder whose most common forms, A and B, are caused by the absence or deficiency of protein factors VIII and IX, respectively. The severity of the disease depends on the mutation. Available machine learning (ML) methods that predict mutational severity with traditional encoding approaches generally have high time complexity and compromised accuracy. In this study, a Haemophilia ‘A’ patient mutation dataset containing 7784 mutations was processed with the proposed Position-Specific Mutation (PSM) technique and with One-Hot Encoding (OHE) to predict disease severity. The datasets produced by the PSM and OHE methods were analyzed and used to train various ML algorithms to classify the mutation severity level. Surprisingly, PSM outperformed OHE in both time efficiency and accuracy, with training and prediction time improvements of approximately 91 to 98% and 80 to 99%, respectively. The severity prediction accuracy also improved when PSM was used with different ML algorithms.

1. Introduction

Haemophilia is a rare X-linked genetic disorder that impairs blood clotting. Because the haemophilia gene was passed down from Queen Victoria of England, it is also known as the “Royal Disease” [1]. The two most common forms, ‘A’ and ‘B’, are caused by the absence or low level of protein factors VIII and IX, respectively [2]. The factor VIII (FVIII) gene is located at position Xq28 on chromosome X and is more than 180 kb long. It consists of 26 exons encoding a precursor protein of 2351 amino acids (AA).

Since the blood does not clot properly at the site of injury in haemophilic patients, an injury can cause heavy bleeding and may lead to death in extreme cases. The risk levels of the disease are classified as mild, moderate and severe [2,3]. The native FVIII protein consists of six domains, A1-A2-B-A3-C1-C2, and is cleaved by thrombin into a heavy chain and a light chain [4]. The European Association for Haemophilia and Allied Disorders (EAHAD) public repository maintains records of single-gene variants for Haemophilia ‘A’, ‘B’ and other bleeding disorders [5].

Machine learning classifiers used to classify Single Nucleotide Polymorphism (SNP) data suffer from high time complexity in training and prediction, owing to the high dimensionality of the data and to encoding techniques that are inefficient in terms of execution speed. The traditional One-Hot Encoding (OHE) approach generates high-dimensional data, which makes prediction slow. In contrast, the proposed Position-Specific Mutation (PSM) method generates low-dimensional data while still predicting severity efficiently.

Agajanian et al. integrated random forest classifiers and deep convolutional neural networks (CNNs) to predict cancer driver mutations in genomic datasets, and used OHE to explore the feasibility of CNNs for classifying mutation data [6]. Lovino et al. used deep learning to screen for oncogenic gene fusions in humans. They converted protein sequences into data matrices with an OHE strategy, without introducing bias into the classifier, and then used CNNs to classify the gene fusion transcripts as oncogenic or non-oncogenic [7].

To compare the speed and efficiency of severity prediction on the SNP dataset, we pre-processed the dataset with the PSM and OHE approaches before analysis. During pre-processing, non-numerical values are encoded with existing techniques such as one-hot encoding, label encoding and dictionary mapping. Label encoding is useful when a feature has very little variation, while one-hot encoding is preferred for better prediction on features with high variation. Four machine learning algorithms, namely K-Nearest Neighbors (KNN), AdaBoost, Support

Corresponding author. E-mail address: [email protected] (A. Mani).

https://doi.org/10.1016/j.ygeno.2020.09.020 Received 23 June 2020; Received in revised form 10 August 2020; Accepted 8 September 2020 Available online 11 September 2020 0888-7543/ © 2020 Elsevier Inc. All rights reserved.


Vector Machine (SVM) and Random Forest (RF), were implemented on the processed dataset. These algorithms were trained to predict the disease severity for a particular type of mutation with higher accuracy and lower time complexity.

Table 1
Representation of some attributes of the EAHAD mutation dataset for Haemophilia ‘A’.

| Case ID | Type | Effect | cDNA | Sequence context | AA (HGVS) | AA (Legacy) | Domain | Location in gene | Mutation | ProteinChange | Severity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7393 | Point | Missense | 5564 | GCT GAT | 1855 | 1836 | A3 | Exon 16 | c.5564C>A | p.Ala1855Asp | Moderate |
| 7394 | Point | Missense | 5573 | TCT TTT | 1858 | 1839 | A3 | Exon 16 | c.5573C>T | p.Ser1858Phe | Severe |
| 1290 | Insertion | Inframe | 1503 | | 501 | 482 | A2 | Exon 10 | c.1503_1504insATCATCGAT | p.Asp501_Val502insIleIleAsp | Mild |
| 2892 | Point | Missense | 5101 | GAG AAG | 1701 | 1682 | a3 | Exon 14 | c.5101G>A | p.Glu1701Lys | Mild |
| 9209 | Duplication | Frameshift | 6963 | | 2323 | 2304 | C2 | Exon 26 | c.6963_696 dupGACT | p.Arg2323Aspfs*63 | Severe |
| 1316 | Point | Silent | 1569 | CTG CTT | 523 | 504 | A2 | Exon 11 | c.1569G>T | p.Leu523Leu | Mild |
| 916 | Point | Silent | 1086 | GCG GCA | 362 | 343 | a1 | Exon 8 | c.1086G>A | p.Ala362Ala | Severe |
| 4963 | Indel | Inframe | 1564 | ATT TAT | 522 | 503 | A2 | Exon 11 | c.1564_156 delATinsTA | p.Ile522Tyr | Severe |
| 8324 | Deletion | Inframe | 6923 | | 2308 | 2289 | C2 | Exon 26 | c.6923_6925delCCT | p.Ser2308del | Mild |
| 8325 | Deletion | Frameshift | 6953 | | 2318 | 2299 | C2 | Exon 26 | c.6953delC | p.Pro2318Hisfs*4 | Severe |
| 850 | Point | Nonsense | 1003 | CAA TAA | 335 | 316 | A1 | Exon 7 | c.1003C>T | p.Gln335* | Severe |

2. Methodology

2.1. Dataset collection

The mutation dataset was taken from the public repository of EAHAD (http://f8-db.eahad.org/), a multidisciplinary association of healthcare professionals who provide care for individuals with haemophilia and other bleeding disorders. The raw Haemophilia ‘A’ patient dataset has a total of 7784 multi-valued records, as shown in Table 1. (See Fig. 1.)

2.2. Data pre-processing

The EAHAD dataset of Haemophilia ‘A’ described above was subjected to feature selection and data filtering to remove attributes of lower importance and noisy data. This pre-processing helped to avoid redundancy and to clean the dataset. To predict the severity of the disease, irrelevant fields such as comments, references, publishing lab and date added, and redundant features such as sequence context, were dropped. To apply ML to the Haemophilia ‘A’ dataset, the finally selected features were label encoded, and the ‘ProteinChange’ and ‘Mutation’ features were encoded with both the OHE [8–14] and PSM approaches.

2.3. Statistical behaviour analysis

In machine learning, dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables through feature selection and feature extraction. These two approaches are normally used for dimensionality reduction on a pre-processed dataset, but in this paper we focus on the raw mutation dataset of haemophilia, which contains “protein change” and “mutation” fields that are encoded by the one-hot encoding technique and the

Fig. 1. Flowchart of the present study to predict the disease severity using OHE and PSM by applying ML classifiers. OHE: One-Hot Encoding, PSM: Position-Specific Mutation, ML: Machine Learning, EAHAD: European Association for Haemophilia and Allied Disorders.


Fig. 2. (a) Distribution of mutation types (Point, Deletion, Duplicate, Insertion and Indel) in the dataset of Haemophilia ‘A’. (b) Distribution of different effects of point mutation (Missense, Nonsense and Silent).

Table 2
Position based frequency analysis of mutation.

| Mutation | Frequency of mutation (dataset) | Mutation position | Mild cases | Mutation position | Moderate cases | Mutation position | Severe cases |
|---|---|---|---|---|---|---|---|
| Arg➔Cys | 601 (380 mild + 113 moderate + 108 severe) | 612 | 207 | 1708 | 26 | 301 | 27 |
| | | 2178 | 71 | 550 | 20 | 2182 | 26 |
| | | 550 | 62 | 2323 | 17 | 2323 | 21 |
| Arg➔His | 490 (211 mild + 192 moderate + 87 severe) | 2169 | 77 | 2169 | 81 | 2182 | 39 |
| | | 550 | 55 | 1800 | 44 | 301 | 21 |
| | | 1708 | 21 | 391 | 27 | 1800 | 10 |
| Arg➔Gln | 239 (155 mild + 42 moderate + 42 severe) | 1985 | 85 | 2228 | 16 | 2228 | 36 |
| | | 2326 | 30 | 2326 | 15 | 1960 | 2 |
| | | 1960 | 22 | 1960 | 11 | 1715 | 1 |
| Arg➔stp* | 429 (1 mild + 22 moderate + 406 severe) | 509 | 1 | 1960 | 4 | 2228 | 50 |
| | | 0 | 0 | 2135 | 4 | 2166 | 49 |
| | | 0 | 0 | 602 | 4 | 355 | 41 |

* stp refers to stop codon.

Fig. 3. Classification accuracy of disease severity prediction using KNN, AdaBoost, SVM and Random Forest for the OHE and PSM based approaches on the pre-processed dataset. The highest classification accuracies are obtained with the Random Forest and SVM classifiers on the PSM based dataset, at 73.96% and 73.86%, respectively.

proposed position-specific mutation (PSM) approach. To find the most frequent mutations responsible for increased disease severity, the dataset was analyzed. The Human Genome Variation Society (HGVS) protein change format “p.wildresidue_position_newresidue” (for example p.Lys76Asn) and the mutation format beginning with “c.” for a coding DNA sequence (for example c.41A>C) were used [15]. Missense mutations are the most common in Haemophilia ‘A’ patients [16,17].

After filtering the dataset, the ‘ProteinChange’ and ‘Mutation’ features were encoded using two approaches: OHE and PSM.

OHE is widely used in machine learning to represent categorical variables as binary vectors. First, the categorical values are mapped to integer values. Then each integer value is represented as a binary vector in which every element is zero except the one at the index of the integer, which is set to 1. Encoding with this approach creates ‘n-1’ columns for ‘n’ variations in a feature, here the ‘ProteinChange’ and ‘Mutation’ features. In other words, OHE transforms a categorical feature into a format that works better with a classification algorithm. OHE is applied at the pre-processing stage of the dataset. When a feature has a large number of variations, OHE creates high-dimensional data: the representation size grows with the corpus, which requires extensive computation, and the encoding carries no contextual or semantic information. The limitation here is the execution speed of the encoding techniques used in machine learning analysis.

The proposed PSM approach is based on statistical analysis of the ‘ProteinChange’ feature. It considers not only the mutation but also its position, which plays an important role in the functioning of factor VIII and determines the severity of the disease. The ‘ProteinChange’ column (for example p.Lys76Asn) is divided into three individual columns, namely wild residue, position of mutation and new residue, for predicting the disease severity.
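The contrast between the two encodings can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `psm_split`, `mutation_split` and `one_hot_width` are hypothetical, the regular expressions only cover simple substitutions such as p.Lys76Asn, p.Gln335* and c.41A>C, and the sample values are taken from the examples in the text and Table 1. Plain one-hot encoding produces one indicator column per distinct value (n-1 if one reference category is dropped), whereas the split always yields three columns.

```python
import re

# Assumption: only simple substitutions are handled; insertions, deletions
# and frameshifts would need additional patterns.
PROTEIN_RE = re.compile(r"^p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2}|\*)$")
MUTATION_RE = re.compile(r"^c\.(\d+)([ACGT])>([ACGT])$")


def psm_split(protein_change):
    """Split e.g. 'p.Lys76Asn' into (wild residue, position, new residue)."""
    match = PROTEIN_RE.match(protein_change)
    if match is None:
        raise ValueError(f"unsupported ProteinChange: {protein_change!r}")
    wild, position, new = match.groups()
    if new == "*":          # stop codon is written as 'stp'
        new = "stp"
    return wild, int(position), new


def mutation_split(mutation):
    """Split e.g. 'c.41A>C' into (actual nucleotide, position, mutant nucleotide)."""
    match = MUTATION_RE.match(mutation)
    if match is None:
        raise ValueError(f"unsupported Mutation: {mutation!r}")
    position, actual, mutant = match.groups()
    return actual, int(position), mutant


def one_hot_width(values):
    """Number of indicator columns plain one-hot encoding would create."""
    return len(set(values))


changes = ["p.Lys76Asn", "p.Ala1855Asp", "p.Ser1858Phe", "p.Gln335*"]
psm_rows = [psm_split(change) for change in changes]

print(psm_rows)
print(one_hot_width(changes), "one-hot columns vs", len(psm_rows[0]), "PSM columns")
```

With only four distinct values the one-hot width is already larger than the fixed PSM width, and the gap widens as more distinct mutations are added.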


Fig. 4. Precision of different ML classifiers (KNN, AdaBoost, SVM, Random Forest) for the Haemophilia ‘A’ dataset with the OHE and PSM based approaches for (a) mild cases, (b) moderate cases and (c) severe cases.

Fig. 5. Recall of different ML classifiers (KNN, AdaBoost, SVM, Random Forest) for the Haemophilia ‘A’ dataset with the OHE and PSM based approaches for (a) mild cases, (b) moderate cases and (c) severe cases.
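Per-class precision and recall of the kind plotted in Figs. 4 and 5 can be computed as in the following sketch. The function name `per_class_precision_recall` is hypothetical and the label lists are made-up toy values for illustration only; they are not taken from the EAHAD dataset.

```python
def per_class_precision_recall(y_true, y_pred, labels):
    """Return {label: (precision, recall)} for each severity class."""
    scores = {}
    for label in labels:
        true_positive = sum(
            1 for t, p in zip(y_true, y_pred) if t == label and p == label
        )
        predicted = sum(1 for p in y_pred if p == label)  # classified as label
        actual = sum(1 for t in y_true if t == label)     # truly label
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        scores[label] = (precision, recall)
    return scores


# Toy example (assumed values, not real results).
y_true = ["mild", "mild", "moderate", "severe", "severe", "severe"]
y_pred = ["mild", "moderate", "moderate", "severe", "severe", "mild"]

scores = per_class_precision_recall(y_true, y_pred, ["mild", "moderate", "severe"])
for label, (precision, recall) in scores.items():
    print(f"{label}: precision={precision:.2f}, recall={recall:.2f}")
```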

Fig. 6. (a) Training time taken by the different ML classifiers on the OHE and PSM based Haemophilia ‘A’ datasets. (b) Prediction time taken by the ML classifiers.

The steps involved in the PSM approach for the ‘ProteinChange’ feature are as follows.

Step 1. Select the ‘ProteinChange’ feature (for example p.Lys76Asn).
Step 2. Remove ‘p.’ from the feature (Lys76Asn) and store the result in the ‘prochange’ column.
Step 3. Replace ‘*’ (stop codon in nonsense mutations) with ‘stp’ in the dataset (3-letter code).
Step 4. Separate the selected feature into three distinct columns (Lys|76|Asn).

    df["prochange"] = new[1]           # part of the feature after 'p.' (Step 2)
    x = list(df["prochange"])          # store values in 'x'
    arr, arr1, arr2 = [], [], []       # declare lists
    for i in range(len(x)):            # loop over all values stored in 'x'
        l = len(x[i])
        arr.append(x[i][3:l-3])        # extract position value
        arr1.append(x[i][0:3])         # extract wild residue
        arr2.append(x[i][l-3:l])       # extract new residue
    df["pos"] = arr                    # store position value in 'pos' column
    df["wrf"] = arr1                   # store wild residue in 'wrf' column
    df["new"] = arr2                   # store new residue in 'new' column

Step 5. Convert the 3-letter code (wild/new) into the 1-letter protein code (K|76|N).
Step 6. Perform label encoding on both the wild residue and the new residue of the protein.

The steps involved in the PSM approach for the ‘Mutation’ feature are as follows.

Step 1. Select the ‘Mutation’ feature (for example c.1A>G).
Step 2. Remove ‘c.’ from the feature (1A>G) and store the result in a new column.
Step 3. Separate the selected feature into three distinct columns (A|1|G).

    df["mutant"] = new[1]              # part of the feature after 'c.' (Step 2)
    x = list(df["mutant"])             # store values in 'x'
    arr, arr1, arr2 = [], [], []       # declare lists
    for i in range(len(x)):            # loop over all values stored in 'x'
        l = len(x[i])
        arr.append(x[i][l-3])          # extract actual nucleotide
        arr1.append(x[i][0:l-3])       # extract position value
        arr2.append(x[i][l-1:l])       # extract mutant nucleotide
    dml["f"] = arr                     # store actual nucleotide in 'f' column
    dml["p"] = arr1                    # store position value in 'p' column
    dml["l"] = arr2                    # store mutant nucleotide in 'l' column

Step 4. Perform label encoding on both the actual nucleotide and the mutant nucleotide of the mutation.

2.4. Machine learning analysis

In this work, the classification accuracy, training time, prediction time, recall and precision [18–20] obtained on the OHE and PSM based datasets were used to evaluate the performance of four ML classifiers: KNN [21–23], AdaBoost [24–26], SVM [24,27–29] and Random Forest [30–33]. The processed Haemophilia ‘A’ patient dataset had 6286 data samples and was divided into training and testing datasets. The parameters used to compare the efficiency of the OHE and PSM approaches are as follows:

• Classification Accuracy: The percentage of correctly classified samples.
• Training Time: The total time taken to train the machine learning classifier, measured in seconds.
• Prediction Time: The total time taken for prediction by the machine learning classifier, measured in seconds.
• Recall: The fraction of samples of a particular class X that are correctly classified as belonging to class X. It is equivalent to the True Positive Rate (TPR).
• Precision: The fraction of samples that truly belong to class X among all those classified as class X.
• Training Time Improvement: The difference between the OHE and PSM training times relative to the OHE training time. Multiplying this ratio by 100 expresses it as a percentage change in training time (T.T.), where x is the PSM time and x_reference is the OHE time.
  Actual Change(x, x_reference) = x_reference − x
  % Change in T.T. = (Actual Change / x_reference) × 100 = ((OHE T.T. − PSM T.T.) / OHE T.T.) × 100
• Prediction Time Improvement: The difference between the OHE and PSM prediction times relative to the OHE prediction time. Multiplying this ratio by 100 expresses it as a percentage change in prediction time (P.T.), where x is the PSM time and x_reference is the OHE time.
  Actual Change(x, x_reference) = x_reference − x
  % Change in P.T. = (Actual Change / x_reference) × 100 = ((OHE P.T. − PSM P.T.) / OHE P.T.) × 100

Table 3
Training time comparison in ML approaches.

| Approach | KNN | AdaBoost | SVM | Random Forest |
|---|---|---|---|---|
| OHE training time (in sec) | 2.433 | 158.399 | 114.34 | 117.49 |
| PSM training time (in sec) | 0.0362 | 5.2450 | 4.543 | 3.72 |
| Training time improvement (in %) | 98.51% | 96.68% | 91.75% | 96.83% |

Table 4
Prediction time comparison in ML approaches.

| Approach | KNN | AdaBoost | SVM | Random Forest |
|---|---|---|---|---|
| OHE prediction time (in sec) | 76.671 | 20.52 | 47.43 | 2.39 |
| PSM prediction time (in sec) | 0.2755 | 0.9569 | 1.2839 | 0.46 |
| Prediction time improvement (in %) | 99.64% | 95.33% | 97.29% | 80.75% |
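The time-improvement measure defined in Section 2.4 can be sketched as follows. The helper names `timed` and `percent_improvement` are hypothetical; `timed` uses the standard-library `time.perf_counter` and is only one possible way to measure elapsed time, since the paper does not state how the timings were taken. The sample inputs are the KNN and Random Forest training times and the KNN prediction times as reported in Tables 3 and 4.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def percent_improvement(ohe_time, psm_time):
    """((OHE time - PSM time) / OHE time) * 100, as defined in Section 2.4."""
    if ohe_time <= 0:
        raise ValueError("OHE time must be positive")
    return (ohe_time - psm_time) / ohe_time * 100.0


# Reported timings in seconds (Tables 3 and 4).
knn_training = percent_improvement(2.433, 0.0362)
rf_training = percent_improvement(117.49, 3.72)
knn_prediction = percent_improvement(76.671, 0.2755)

print(f"KNN training time improvement: {knn_training:.2f}%")
print(f"Random Forest training time improvement: {rf_training:.2f}%")
print(f"KNN prediction time improvement: {knn_prediction:.2f}%")

# Example of timing an arbitrary call; in practice func would be the
# classifier's training or prediction routine.
total, elapsed = timed(sum, range(1000))
```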

3. Results

3.1. Dataset overview

The EAHAD dataset of Haemophilia ‘A’ patients suggests that point mutations, deletions, duplications, indels and insertions are responsible for the disease, in descending order of frequency, as depicted in Fig. 2(a). Fig. 2(b) shows that most point mutations are missense mutations. A point mutation can cause three types of effects: silent, nonsense and missense. In this dataset, missense, nonsense and silent effects account for 83.5%, 14.4% and 2.1% of point mutations, respectively, as shown in Fig. 2(b). These findings validate earlier reports that missense mutations are the most common disease-causing changes in Haemophilia ‘A’ patients [16,17].

Table 5
Space complexity comparison for the PSM and OHE approaches.

| Factor VIII raw dataset | Dataset after cleaning | OHE approach on mutation and protein change feature | PSM approach on mutation and protein change feature | Improvement in terms of space |
|---|---|---|---|---|
| 7784*17 | 6286*11 | 6286*3115 | 6286*15 | |
| Memory space (in MB) | | 74.8 | 0.64 | 99.15% less space occupied in PSM approach |

3.2. Pre-processed dataset

The dataset was pre-processed to select only point mutations for further analysis. Out of 7784 samples, 6286 were selected, with 11 columns, to predict the disease severity.

3.3. Position specific mutation analysis

The four most frequent ‘ProteinChange’ counts, in terms of amino acid mutation from wild residue to new residue, are shown in Table 2. The results indicate that the ‘ProteinChange’ Arg➔Cys occurred most often, 601 times (380 mild, 113 moderate and 108 severe), while the Arg➔His mutation occurred 490 times in total (211 mild, 192 moderate and 87 severe) in the EAHAD dataset of Haemophilia ‘A’ patients. The results suggest that the mutation alone does not determine the severity of the disease and that other factors may be involved. To assess the role of the position at which the mutation occurred, the top three positions were sorted by mutation frequency for each of the mild, moderate and severe cases. The analysis shows that the position of the ProteinChange (mutation) in the isoform of Factor VIII may be the critical factor in determining the severity of the disease. The traditional OHE approach applied to the ‘ProteinChange’ feature gives (n-1) columns, where n is the number of variations in that feature, whereas the proposed PSM approach breaks the ‘ProteinChange’ feature into a fixed three columns regardless of the number of variations present in that feature.

3.4. Machine learning analysis

Popular machine learning algorithms were implemented to analyze the mutation data of Haemophilia ‘A’ patients on several parameters. These machine learning classifiers were used to predict the effect of a mutation on the intensity of the disease, i.e. mild, moderate or severe. The classification accuracy, precision and recall of the implemented ML algorithms were analyzed for both the OHE and PSM approaches, as shown in Figs. 3–5, respectively. The Random Forest and SVM classifiers give better precision and recall values for most of the PSM based approaches in comparison to OHE. The training and prediction time taken by each ML classifier with the OHE and PSM based approaches is shown in Fig. 6. PSM reduced the training time of the ML classifiers by approximately 91% to 98% as compared to OHE for the Haemophilia ‘A’ dataset, as shown in Table 3. The improvement in prediction time is shown in Table 4. Although KNN has the highest prediction time improvement, it shows lower accuracy than the other methods. Although Random Forest has the smallest prediction time improvement, its prediction time is lower than that of most of the other classifiers. The results obtained so far show that the PSM approach gives better performance than OHE in terms of classification accuracy, training time and prediction time.

4. Discussion

Haemophilia ‘A’ is a genetic disorder caused by mutations in the factor VIII gene; depending on the mutation, the disease is classified as severe, moderate or mild. The Haemophilia ‘A’ dataset available from the European Association for Haemophilia and Allied Disorders (EAHAD) was used to predict disease severity using different ML approaches.

Before evaluating the efficiency of the ML algorithms, the dataset was pre-processed and filtered to remove redundant and noisy data. The dimension of the dataset before pre-processing (7784*17) was reduced by filtering to 6286*11. The data was then processed with two encoding methods: the traditional OHE and the proposed PSM. The PSM method generates a much smaller, fixed amount of data than OHE, which gives PSM an advantage, as shown in Table 5. Space complexity was reduced by almost 99.15% with the PSM approach as compared to OHE. PSM analyzes the relationship between the amino acid mutation, its frequency and its position, which gives better insight into the dataset.

Haemophilia attracts attention for gene therapy because of its monogenic inheritance, where even a minimal increase in clotting factor activity can significantly improve quality of life. The statistical analysis in this study provides headway for selecting appropriate cases for gene therapy: the most frequent mutation at a particular position may be a potential candidate for gene therapy. The PSM approach supports this statistically by taking the mutation position into account.

The ML classifiers KNN, AdaBoost, SVM and Random Forest were applied to the data obtained from both approaches to predict disease severity. Training time, prediction time, precision and recall were calculated for both approaches. The results indicate that in most cases PSM performed better than OHE, and both prediction and training time improved with the PSM approach. The analysis revealed that not only the mutation type but also the position of the mutation plays a significant role in determining disease severity. It is evident from the results that, for biological mutation data, dimensionality and the time spent in training and prediction are crucial when the dataset is large.

5. Conclusion

The statistical analysis indicates that the severity of Haemophilia ‘A’ depends not only on the mutation but also on the position of the mutation. From the available dataset, we developed a new PSM based prediction method that uses statistical analysis to find the relationship between a mutation and its position. This analysis shows that PSM is more efficient in terms of accuracy, time and space complexity; however, severity prediction still requires more attention to improve accuracy. The PSM approach can be applied to many other large mutation datasets for quick and efficient analysis, and will potentially be useful for lethality or severity prediction in other mutation-based diseases too.

Declaration of Competing Interest

All the authors declare that they have no conflict of interest.

Acknowledgements

All the authors are thankful to MNNIT Allahabad for providing necessary facilities. NM is thankful to MNNITA for a PhD fellowship. VKS is thankful to QIP for facilitating masters studies.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.ygeno.2020.09.020.
PSM processed data is available at: https://github.com/ashutomani/PSM-processed
OHE processed data is available at: https://github.com/ashutomani/OHE_processed


References

[1] G.I. Ingram, The history of haemophilia, J. Clin. Pathol. 29 (6) (1976) 469.
[2] P.H. Bolton-Maggs, K.J. Pasi, Haemophilias A and B, Lancet 361 (9371) (2003) 1801–1809.
[3] G.2. White, Factor VIII and Factor IX subcommittee. Definitions in hemophilia. Recommendation of the scientific subcommittee on factor VIII and factor IX of the scientific and standardization committee of the International Society on Thrombosis and Haemostasis, Thromb. Haemost. 85 (2001) 560.
[4] J. Gitschier, W.I. Wood, T.M. Goralka, K.L. Wion, E.Y. Chen, D.H. Eaton, G.A. Vehar, D.J. Capon, R.M. Lawn, Characterization of the human factor VIII gene, Nature 312 (5992) (1984) 326–330.
[5] J.H. McVey, P.M. Rallapalli, G. Kemball-Cook, D.J. Hampshire, M. Giansily-Blaizot, K. Gomez, S.J. Perkins, C.A. Ludlam, The European association for haemophilia and allied disorders (EAHAD) coagulation factor variant databases: important resources for haemostasis clinicians and researchers, Haemophilia 26 (2) (2020) 306–313.
[6] S. Agajanian, O. Oluyemi, G.M. Verkhivker, Integration of random forest classifiers and deep convolutional neural networks for classification and biomolecular modeling of cancer driver mutations, Front. Mol. Biosci. 6 (2019) 44.
[7] M. Lovino, G. Urgese, E. Macii, S. Di Cataldo, E. Ficarra, A deep learning approach to the screening of oncogenic gene fusions in humans, Int. J. Mol. Sci. 20 (7) (2019) 1645.
[8] H. Alkharusi, Categorical variables in regression analysis: a comparison of dummy and effect coding, Int. J. Educ. 4 (2) (2012) 202.
[9] K.J. Berry, P.W. Mielke Jr., H.K. Iyer, Factorial designs and dummy coding, Percept. Mot. Skills 87 (3) (1998) 919–927.
[10] M.J. Davis, Contrast coding in multiple regression analysis: strengths, weaknesses, and utility of popular coding structures, J. Data Sci. 8 (1) (2010) 61–73.
[11] J. Cohen, P. Cohen, S.G. West, L.S. Aiken, Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, Routledge, 2013.
[12] J.L. Myers, A. Well, R.F. Lorch, Research Design and Statistical Analysis, Routledge, 2010.
[13] K.E. O’Grady, D.R. Medoff, Categorical variables in multiple regression: some cautions, Multivar. Behav. Res. 23 (2) (1988) 243–260.
[14] B.B. Brown, I. Altman, Territoriality, defensible space and residential burglary: an environmental analysis, J. Environ. Psychol. 3 (3) (1983) 203–220.
[15] J.T.D. Dunnen, S.E. Antonarakis, Mutation nomenclature extensions and suggestions to describe complex mutations: a discussion, Hum. Mutat. 15 (1) (2000) 7–12.
[16] Y. Bromberg, G. Yachdav, B. Rost, SNAP predicts effect of mutations on protein function, Bioinformatics 24 (20) (2008) 2397–2398.
[17] Z. Guo, L. Yang, X. Qin, X. Liu, Y. Zhang, Spectrum of molecular defects in 216 Chinese families with hemophilia A: identification of noninversion mutation hot spots and 42 novel mutations, Clin. Appl. Thromb. Hemost. 24 (1) (2018) 70–78.
[18] T.T. Nguyen, G. Armitage, A survey of techniques for internet traffic classification using machine learning, IEEE Commun. Surv. Tutor. 10 (4) (2008) 56–76.
[19] K. Singh, S. Agrawal, Internet traffic classification using RBF Neural Network, International Conference on Communication and Computing Technologies (ICCCT2011), 2011, February, pp. 39–43 Jalandhar, India.
[20] K. Singh, S. Agrawal, Comparative analysis of five machine learning algorithms for IP traffic classification, 2011 International Conference on Emerging Trends in Networks and Computer Communications (ETNCC), IEEE, 2011, April, pp. 33–38.
[21] E. Marchiori, Class dependent feature weighting and k-nearest neighbor classification, IAPR International Conference on Pattern Recognition in Bioinformatics, Springer, Berlin, Heidelberg, 2013, June, pp. 69–78.
[22] P. Cunningham, S.J. Delany, K-Nearest Neighbour Classifiers, (2020) arXiv preprint arXiv:2004.04523.
[23] Y.L. Cai, D. Ji, D. Cai, A KNN research paper classification method based on shared nearest neighbor, NTCIR, 2010, June, pp. 336–340.
[24] X.L. Zhang, F. Ren, Study on combinability of SVM and AdaBoost algorithm [J], App. Res. Comput. 1 (2009).
[25] H.X. Jia, Y. Zhang, Fast Adaboost training algorithm by dynamic weight trimming, Chin. J. Comput. 32 (2009) 336–341.
[26] P. Wu, H. Zhao, Some analysis and research of the AdaBoost algorithm, International Conference on Intelligent Computing and Information Science, Springer, Berlin, Heidelberg, 2011, January, pp. 1–5.
[27] Y. Yang, J. Li, Y. Yang, The research of the fast SVM classifier method, 2015 12th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP), IEEE, 2015, December, pp. 121–124.
[28] M. Song, S. Rajasekaran, A greedy algorithm for gene selection based on SVM and correlation, Int. J. Bioinforma. Res. Appl. 6 (3) (2010) 296–307.
[29] Y.J. Lee, O.L. Mangasarian, SSVM: a smooth support vector machine for classification, Comput. Optim. Appl. 20 (1) (2001) 5–22.
[30] L. Breiman, Bagging predictors, Mach. Learn. 24 (2) (1996) 123–140.
[31] L. Breiman, Random forests, Mach. Learn. 45 (1) (2001) 5–32.
[32] L. Breiman, J. Friedman, C.J. Stone, R.A. Olshen, Classification and Regression Trees, CRC press, 1984.
[33] A. Sarica, A. Cerasa, A. Quattrone, Random Forest algorithm for the classification of neuroimaging data in Alzheimer’s disease: a systematic review, Front. Aging Neurosci. 9 (2017) 329.