The prediction of good physicians for prospective diagnosis using data mining


Informatics in Medicine Unlocked 12 (2018) 120–127

Contents lists available at ScienceDirect

Informatics in Medicine Unlocked journal homepage: www.elsevier.com/locate/imu


Nfongourain Mougnutou Rémy a,∗, Tekinzang Tedondjio Martial a, Tayou Djamegni Clémentin b

a Department of Mathematics and Computer Science, FS, University of Ngaoundere, Ngaoundere, Cameroon
b Department of Mathematics and Computer Science, University of Dschang, Dschang, Cameroon

ARTICLE INFO

Keywords: Data mining; Open data; Logistic regression; Multidisciplinary diagnosis

ABSTRACT

This work provides a predictive model for selecting the most appropriate health care practitioners, particularly physicians, to diagnose a patient. In the context of a multidisciplinary diagnosis, this paper provides a data mining model to identify the specialist physicians who can participate in such a diagnosis and thus reduce the risk of errors. First, the model identifies the specialists who can diagnose a patient. Second, the model uses the calculated probabilities to rank the specialist physicians capable of making a good diagnosis. This ranking can be used to construct a group of specialists who can participate in the multidisciplinary diagnosis. A sample of 58177 patients (52% women) consulted by 11 different medical specialists was extracted from the SPARCS database. The work is based on the analysis of open health data, specifically, data on diseases that keep patients in a stable condition. The result of the data mining is a multinomial logistic regression model. The 10-fold cross-validation results indicate that the model provides good predictive capability for the selected data, with an average accuracy, sensitivity, specificity, and precision of 80%, 79%, 97.3%, and 82.8%, respectively. Our results show that a patient's characteristics influence the selection of a physician. In conclusion, we assert that all selected specialists are able to diagnose the patient and that some specialists have a greater ability to diagnose the disease than others.

∗ Corresponding author. E-mail addresses: [email protected] (N.M. Rémy), [email protected] (T.T. Martial), [email protected] (T.D. Clémentin).

https://doi.org/10.1016/j.imu.2018.07.005
Received 2 March 2018; Received in revised form 13 July 2018; Accepted 29 July 2018; Available online 2 August 2018
2352-9148/© 2018 Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/BY-NC-ND/4.0/).

1. Introduction

Medicine has changed with the rapid progress of science and the new approaches adopted by physicians, resulting in modern medicine. Despite these advances, diagnostic errors persist in medicine [27]. To reduce the risk of an incorrect diagnosis by a single physician, a multidisciplinary approach to diagnosis can be considered [7]. In a multidisciplinary diagnosis, multiple actors from different fields collaborate to provide a single diagnosis. The group work that this requires not only adds value but also facilitates the identification and analysis of the causes of the patient's problem. However, an expert's opinion may be considered relevant only in his or her areas of expertise. In other words, one physician's opinion about a patient's problem may be more welcome than the opinion of another physician. This raises the problem of choosing the physicians who participate in a multidisciplinary diagnosis, as illustrated in Fig. 1.

Fig. 1. The problem of selecting physicians.

The objective of this work is to construct a model that describes the relationship between the prospect of a good diagnosis by a physician and a patient's profile. To achieve this objective, an analysis of health data was conducted, and a statistical representation of the physician-patient relationship was developed. This paper presents a model for selecting medical experts based on several characteristics of a patient. The selected specialists can thus participate in a multidisciplinary diagnosis. To the best of our knowledge, this is the first study of its kind on physician prediction based on a patient's profile using data mining.

2. Background

2.1. Data mining

Data mining is the art of finding information or knowledge in a large amount of data. Like statistics, data mining is becoming increasingly common in companies and organizations that want to extract relevant information from their databases for their own needs [31]. Data mining technology is applied in an increasing variety of fields. Data mining tasks can, in general, be classified as tasks of description and prediction [30], [24]. To understand the discovery goal, it is vital to understand the difference between descriptive and predictive tasks. In descriptive data mining, the goal is to produce a descriptive approximation or model of the process that generates the data. The goal of prediction is to find a model to estimate the values of future cases.

Medical data mining is an essential component of clinical decision support systems. Data mining can extract hidden information from data of the medical domain and exploit it as patterns for clinical diagnosis [14], [33]. As in [14] and [33], most of the work in the literature focuses on patients' diseases and disregards physicians. The aim of this work is to propose a model that gives the probability that a patient's profile suits each physician category, in other words, to predict the physicians who can provide a good diagnosis based on the patient's characteristics.

2.2. Predictive data mining

In most data mining applications, a target variable on which to learn is necessary [25]. A predictive model can be understood as learning from data. In this context, we also need to know the value of the target variable for a set of examples (i.e., patient records).

2.3. Algorithms

Access to electronic health records (EHRs) opens new possibilities for medical data mining. Many different supervised machine learning algorithms can be used for analysing datasets. Some of the data mining techniques that are successfully used in healthcare today are decision trees (DTs), artificial neural networks (ANNs) and logistic regression.

DTs are one of the most powerful and popular tools for extracting information, and they have several advantages [4]. A considerable asset of a DT is that it is a highly interpretable model that represents a set of rules. Other machine learning algorithms, such as the support vector machine (SVM) [19], may provide better accuracy but build less interpretable models.

Artificial neural networks [16] are derived from the analysis of how the human brain processes information. They represent knowledge as a network of units, or neurons, modelled on those of the brain. ANNs have been successfully used in applications in clinical medicine, such as diagnosis in medical images [8]. The method has been tested on several problems and compared with several existing methods, and it obtained performance comparable to that of SVM. However, compared to SVM, artificial neural networks have much longer execution times and do not explain their results.

Logistic regression (LR) is one of the most widely used methods for statistical modelling of binary response variables. LR predicts the probability of the target variable, denoted by p. The target variable can have a value of 1 (success, with probability p) or a value of 0 (failure, with probability 1 − p). LR has been used extensively in the medical and social sciences [20]. Multinomial logistic regression (MLR) extends binomial logistic regression to outcomes with more than two categories [15]. An advantage of multinomial logistic regression is that it does not assume a linear relationship between the dependent variable and each independent variable. MLR is used in situations in which there is no ordering of the K values of the dependent variable, which is the case in our study. In this work, the dependent variable is nominal and consists of more than two categories. We focus on multinomial regression to estimate the probability of selecting each of the categories and the effect of the independent variables on the outcome.

2.4. Open data

The world of data is becoming increasingly competitive every day, as observed in terms of volume, variety and value. Open data adds richness and a new dimension to data warehouses and analysis to unlock new forms of innovation [3]. Sharing and opening up data makes essential data available online and improves the analyses of many decision-makers, and thus their ability to make more informed decisions in various sectors, including medicine [26]. This means the creation of large sets of reference data shared by all stakeholders and the encouragement of the development of several high value-added services. Open data means that these data are available for access, exploitation and reuse by any interested party (companies, scientists, etc.).

This work was performed using medical information from the Health Statewide Planning and Research Cooperative System (SPARCS) database. SPARCS is a database of patient characteristics, diagnoses, treatments and services; this work uses the records of patients whose lesional and/or functional status is considered to be stable (e.g., angina). The French Clinical Classification of Emergency Patients (CCMU) [32], commonly used for care prioritization, defines five levels:

• Level 1: Clinical condition considered to be stable. Simple clinical examination. No complementary diagnostic or therapeutic procedure.
• Level 2: Level 1 and decision to perform an additional diagnostic procedure (e.g., blood test).
• Level 3: Clinical condition may worsen without any life-threatening prognosis.
• Level 4: Life-threatening risk without starting immediate resuscitation procedures.
• Level 5: Vital prognosis engaged, involving the start of resuscitation procedures.

In this work, all selected diseases have one thing in common: they keep the patient's clinical condition and/or functional prognosis stable.

3. Design of the predictive model

3.1. Data understanding and data preparation

3.1.1. Patient problem

Clarifying the patient's problem involves taking the patient's history. The collection of a patient's demographic data is the starting point of a patient's history. The next step is the development of the patient's profile and the patient's chief complaint [29]. Demographic data and the patient's profile are important because they provide a representation of the patient and his or her condition, including age and gender. This information can also help to identify other medical problems. A patient's profile is a summary of the patient's characteristics and of the problems that led to the patient's current condition. In this paper, we want to identify suitable physicians who can work together to solve a patient's problem. The process of selecting physicians is illustrated in Fig. 2. The core of the process is the predictive model, which must be able to predict which physicians can provide a good diagnosis based on a patient's profile.

Fig. 2. The process of selecting physicians to participate in multidisciplinary diagnosis.

3.1.2. Preprocessing

The SPARCS 2014 database used in this work has more than 1,000,000 observations and 39 variables in its raw state, including many missing, redundant and irrelevant values. All variables not correlated with the objectives of the study were removed. No grouping was done according to the interval or scales of measurement of the input variables, because this grouping had already been done, as is the case for the patient's age range. To solve the problem of missing values, we used the simple and direct approach of completely eliminating the entries that have missing values. The data on which the work was carried out concern more than 50,000 patients. This information includes the patient's profile, the diagnosis and the service that provided the care.

A total of 17 diseases were selected. These diseases, such as migraine, asthma, allergic reaction, malnutrition, abdominal herniation, and polio, have no prior links to one another. However, all of them belong to level 1 or level 2 of the French Clinical Classification of Emergency Patients (CCMU). Eleven types of physicians corresponded to the diagnoses of these diseases, namely general practitioner, otolaryngologist, pulmonologist, paediatrician, rheumatologist, gastroenterologist, urologist, cardiologist, dermatologist, neurologist, and allergist.

The dataset was divided into two parts: a learning set and a test set. The learning set was used for estimation and validation and represents approximately 90% of the total dataset. The test set was used for evaluation and represents 10% of the dataset. The variable Y with K unordered modalities must be predicted from p explanatory variables $\{X_1, X_2, \ldots, X_p\}$. The n independent observations are noted:

$$\mathrm{Obs} = \begin{pmatrix} x_{1,1} & \cdots & x_{p,1} & y_1 \\ \vdots & & \vdots & \vdots \\ x_{1,n} & \cdots & x_{p,n} & y_n \end{pmatrix} \tag{1}$$

3.1.3. Preselection of variables

After preprocessing, each patient record contained nine features. To characterize a patient's profile, these nine candidate variables were preselected:

• Patient's age range. This variable describes the age range of the patient at the time of diagnosis. There are 5 ranges: 0 to 17 years old, 18 to 29 years old, 30 to 49 years old, 50 to 69 years old and 70 years old and above.
• Gender. This variable indicates the sex of the patient, male or female. To simplify, we use the Sex variable instead of the gender variable.
• Code ICD-09. This variable specifies the disease codes defined according to the International Classification of Diseases ICD-9 [12].
• Symptom. A set of four variables that describe the manifestation of a disease. Each variable Symptom1, Symptom2, Symptom3 and Symptom4 contains a set of clinical signs.
• Severity of the disease. This variable describes the severity of the disease, whether minor or moderate.
• Mortality risk. This variable indicates the risk of mortality associated with the disease. This study focuses on diseases with a low risk of mortality, that is, minor or moderate mortality.

Only the significant variables among the nine candidate variables should be retained [2]. In addition, we want to reduce the number of explanatory variables retained, to make the model easier to interpret and more robust. The ICD-09 variable describes the diagnosis associated with the patient's illness. However, we attempt to predict the appropriate specialist who will be able to conduct the diagnosis, which makes the inclusion of the ICD-09 variable in the model incoherent. Similarly, mortality risk and disease severity can only be determined by a physician. In the end, the only candidate variables included in the model are: Age, Sex, Symptom1, Symptom2, Symptom3 and Symptom4.

3.2. Prediction model

The goal is to predict which physicians are likely to provide an appropriate diagnosis for a patient based on the patient's profile. The objective is to predict the values taken by the variable $Y = \{y_1, y_2, \ldots, y_K\}$, where K is the number of physicians who can offer an appropriate diagnosis. The patient's profile includes p descriptors $\{X_1, X_2, \ldots, X_p\}$. For each patient, the predicted variable Y, that is, the medical specialist or physician, can be modelled by Ref. [9]:

$$Y_k = \beta_0^{(k)} + \beta_1^{(k)} X_1 + \cdots + \beta_p^{(k)} X_p \tag{2}$$

where $\beta_0^{(k)}, \beta_1^{(k)}, \ldots, \beta_p^{(k)}$ are the parameters to be determined.

3.2.1. The assumptions

In medical research, data mining begins with a hypothesis [5]. In modern medicine, many symptoms are generally observed for a disease [18]. For reasons of generality and simplicity, only the four main symptoms were selected. For the development of the model, the following assumptions are made:

HM1. A patient suffers from only one disease at a time.
HM2. A set of four symptoms is sufficient to describe the manifestation of a disease.

3.3. Variables of the model

We want to test the role of all candidate variables with respect to the variable to be predicted. To verify the relevance of the candidate variables, significance tests of the independent variables were carried out. The objective was to study the links between the explanatory variables and the studied variable in order to eliminate variables that are not very discriminating [1]. The following hypotheses were considered:

$H_0$: $\beta_1^{(k)} = \beta_2^{(k)} = \cdots = \beta_p^{(k)} = 0$ for all $k \in \{1, \ldots, K\}$.

$H_1$: $\exists\, j \in \{1, \ldots, p\}$ such that $\beta_j^{(k)} \neq 0$, that is, there is at least one nonzero coefficient.

This test is a likelihood ratio test: it compares the likelihood of the complete model with that of the model under $H_0$. The results of the significance tests of the explanatory variables are presented in Table 1. The p-value can be used to quantify the statistical significance of the evidence and is used in the context of null hypothesis testing. Let x be an observed instance; the p-value is denoted $p = P(x \mid H_0)$.

Table 1
Results of the significance tests of the explanatory variables: values of the chi-squared test and their p-values.

| Ind. var. $X_p$ | LR Chisq | Df | P-value | Significance |
|---|---|---|---|---|
| Age | 24756.3 | 40 | <2e−16 | *** |
| Sex | 163.9 | 10 | <2e−16 | *** |
| Symptom1 | −1.5 | 150 | 1 | |
| Symptom2 | −1.5 | 160 | 1 | |
| Symptom3 | −2 | 160 | 1 | |
| Symptom4 | −2.8 | 150 | 1 | |

The p-value obtained for the Age and Sex variables is below the 0.05 threshold. We can conclude that these variables are discriminating; hence, these variables are included in the model. With regard to the Symptom1, Symptom2, Symptom3 and Symptom4 variables, the value obtained does not allow us to make a presumption about their discriminating power. However, these variables are those that characterize a patient's profile (according to hypothesis HM2); thus, they must also be included in the model. In conclusion, hypothesis $H_0$ is rejected, so all the candidate variables are included in the model.

3.4. The parameters of the model

The goal is to develop an appropriate predictor using the selected subset of data. The number of independent patient observations is n = 58177. The goal is to predict the physicians who can provide a correct diagnosis for a patient based on Age, Sex, and Symptoms. In a way, this is a question of highlighting the existence of an underlying functional link between the physicians that can provide a good diagnosis and a set of characteristics specific to a patient, that is, the patient's profile. This problem can be expressed by:

$$\mathrm{Physician} = f(\mathrm{Age}, \mathrm{Sex}, \mathrm{Symptom1}, \mathrm{Symptom2}, \mathrm{Symptom3}, \mathrm{Symptom4}) \tag{3}$$

Let $(y_1, \ldots, y_K)$ be the K modalities taken by Y, the physician variable to predict. For all $k \in \{1, \ldots, K\}$, based on the dataset, we want to estimate the unknown probability for every $i \in \{1, \ldots, n\}$. We have:

$$p_k(x_i) = P\left(\{Y_i = y_k\} \mid \{(X_1, \ldots, X_p) = x_i\}\right), \quad x_i = (x_{1,i}, \ldots, x_{p,i}) \tag{4}$$

The multinomial logistic regression model can be written as:

$$\mathrm{logit}(p_k(x)) = \log\left(\frac{p_k(x)}{p_1(x)}\right) = \beta_0^{(k)} + \beta_1^{(k)} x_1 + \cdots + \beta_p^{(k)} x_p, \quad x = (x_1, \ldots, x_p) \tag{5}$$

where $\beta_0^{(k)}, \beta_1^{(k)}, \ldots, \beta_p^{(k)}$ are unknown real coefficients. Therefore, for all $k \in \{2, \ldots, K\}$, we have:

$$p_k(x) = \frac{\exp\left(\beta_0^{(k)} + \beta_1^{(k)} x_1 + \cdots + \beta_p^{(k)} x_p\right)}{1 + \sum_{l=2}^{K} \exp\left(\beta_0^{(l)} + \beta_1^{(l)} x_1 + \cdots + \beta_p^{(l)} x_p\right)} \tag{6}$$

We estimate a multinomial logistic model to predict the probability of each of the 11 types of outcomes. Because there are 11 different categories of physicians, the model estimates 10 parameters for each explanatory variable. Odds ratios (ORs) were also calculated for the interpretation of these coefficients. An odds ratio can be used in several ways; for example, the odds ratio can measure the effect of the input variable [11]. Let $x_a$ and $x_b$ be two categories of the variable $X_1$. The odds ratio is:

$$OR(x_a, x_b) = \frac{\mathrm{odds}(x_a)}{\mathrm{odds}(x_b)} = \frac{P(x_a) / (1 - P(x_a))}{P(x_b) / (1 - P(x_b))} \tag{7}$$

with $P(x_a) = P(\{Y = 1\} \mid \{X_1 = x_a, \ldots, X_p\})$. We use the odds ratio to measure the effect of a patient's characteristics on the choice of physician.

4. Results

4.1. Prediction results

The following results are presented according to the recommendations of the STARD 2015 statement [21]. Among the 58177 patients retained for our study, 22.62% had an age within the interval [0; 17], 7.92% had an age within the interval [18; 29], 18.5% had an age within the interval [30; 49], 30.41% had an age within the interval [50; 69] and 20.55% had an age within the interval [70; Older]. Additionally, 52% of the patients were women, 48% were men, and all suffered from a disease that kept them in a stable state. For any patient x, the profile can be defined by six characteristics. The model can be written as:

$$\mathrm{logit}(p_k(x)) = \beta_0^{(k)} + \beta_1^{(k)} \mathrm{Age} + \beta_2^{(k)} \mathrm{Sex} + \beta_3^{(k)} \mathrm{Symptom1} + \beta_4^{(k)} \mathrm{Symptom2} + \beta_5^{(k)} \mathrm{Symptom3} + \beta_6^{(k)} \mathrm{Symptom4} \tag{8}$$

The maximum likelihood estimates of the β parameters were obtained using data on all significant independent variables [13] and the R statistical software. Multinomial logistic regression works by choosing one class as the reference category. The physician category Allergist was used as the reference category in our model. The results for the other categories are understood with respect to the reference category. The coefficients of the model are given in Table 2, where Sx denotes the explanatory variable Symptomx. The list of symptoms used in this work is shown in Table 3.

Consider the patients presented in Table 4. Equations (4) and (6) are the basis for the calculation of probabilities in this study. The predicted probabilities $p_k(x)$ for different values of $x_i$ can be used to interpret the effect of the independent variables $X_p$ on the probability of being in category k. Thus, for a patient x, the specialist with the highest probability is the final prediction, that is, the specialist most capable of providing a good diagnosis. The estimated probabilities for the patients considered are presented in Table 5. The underlined values in Table 5 denote a difference between the specialist predicted by the model and the actual specialist that diagnosed the patient. Based on the calculated probabilities, an order can be established between physicians. Physicians with the lowest probabilities are less suitable for diagnosing the patient. Conversely, the physician with the highest probability is, according to the model, best placed to diagnose the patient.

We applied the multinomial logistic regression model to the dataset. The final model obtained had $\chi^2 = 10.82$ with P-value = 2.2e−16. The p-value of the final model was below the significance level of 0.05, thereby supporting the existence of a relationship between patient characteristics (six parameters) and physician speciality (dependent variable).

The odds ratios (ORs) for the independent variables can be interpreted as follows. When the odds ratio is greater than 1, the likelihood of being in one category rather than in the reference group is higher. When the odds ratio is less than 1, the likelihood of being in one category rather than in the reference group is lower. For example, OR(S1 = Coughing) = 7.19 for Y = Pulmonologist means that a patient with coughing as S1 has approximately seven times higher odds of being diagnosed by a pulmonologist rather than by the reference specialist. Other ORs also have high values and similar interpretations, for example, OR(S1 = Coughing) = 5.26 for Y = General practitioner. Another interpretation of the OR is possible. For $Y = y_k$ with $k \in \{2, \ldots, 11\}$, OR(S3 = Itching) < 1. These odds ratios mean that any patient with itching as S3 is most likely to be diagnosed by an allergist.

4.2. Statistical analysis

To avoid overfitting and to ensure the statistical validity of the results, the experimental evaluation was conducted using 10-fold cross-validation [6]. Many different parameters have been used to estimate performance indices [23], and they are reported for the model. These performance indices are presented in Table 6. The reported results include the overall accuracy (number of correct predictions divided by the total number of test instances), the accuracy for each medical specialist class (number of correct predictions divided by the number of instances in that class), and the Matthews correlation coefficient (MCC) for each medical specialist class. If there is no relationship between the predicted specialist and the actual specialist, the MCC should be very low. By contrast, the MCC value increases as the strength of the relationship between the predicted specialist and the actual specialist increases.


Table 2
The parameters $\beta_p^{(k)}$ of the multinomial logistic regression model (rows: predictor levels; columns: physician categories $Y_k$).

a) Parameters $\beta_0^{(k)}$, $\beta_1^{(k)}$, $\beta_2^{(k)}$ and $\beta_3^{(k)}$

| Variable | Cardio. | Dermato. | Gene. | Gastro. | Neuro. | Oto. | Paedia. | Pulmo. | Rheumato. | Uro. |
|---|---|---|---|---|---|---|---|---|---|---|
| Intercept | −1.271 | −0.651 | 3.871 | 0.190 | −0.953 | −0.169 | −2.531 | 2.080 | −0.487 | −1.376 |
| Age = [18; 29] | 3.752 | −0.105 | 1.520 | 1.228 | 1.062 | 1.631 | −8.328 | 2.533 | 1.523 | 1.525 |
| Age = [30; 49] | −2.312 | −0.179 | −2.740 | −1.926 | −0.809 | −2.224 | 7.201 | −1.020 | −0.056 | −0.876 |
| Age = [50; 69] | 0.721 | −0.192 | 1.856 | 0.896 | 0.324 | 1.173 | −3.420 | −0.256 | 0.497 | 0.175 |
| Age = [70; Older] | −0.112 | −0.226 | −0.473 | −0.355 | −0.148 | −0.079 | 1.538 | 0.325 | −0.127 | −1.172 |
| sex = M | −0.138 | −0.225 | −0.566 | −0.608 | −0.344 | −0.670 | −0.041 | −0.163 | 0.025 | 1.394 |
| S1 = 1 | −0.169 | −0.133 | 3.108 | −0.027 | −0.109 | −0.146 | −1.292 | −0.641 | −0.163 | −0.218 |
| S1 = 16 | −0.026 | −0.042 | −0.418 | −0.058 | −0.017 | −0.066 | −0.001 | −0.368 | 1.157 | −0.080 |
| S1 = 4 | −0.105 | −0.206 | −0.240 | −0.133 | −0.089 | −0.092 | −1.174 | 2.714 | −0.130 | −0.119 |

a) (continued)

| Variable | Cardio. | Dermato. | Gene. | Gastro. | Neuro. | Oto. | Paedia. | Pulmo. | Rheumato. | Uro. |
|---|---|---|---|---|---|---|---|---|---|---|
| S1 = 5 | −0.596 | −0.462 | 1.661 | −0.529 | −0.423 | −0.581 | 2.048 | 1.973 | −0.554 | −0.631 |
| S1 = 10 | −0.131 | −0.075 | −0.920 | −0.382 | −0.083 | −0.145 | 0.542 | −1.216 | −0.118 | 2.704 |
| S1 = 15 | −0.143 | −0.945 | −1.223 | −0.664 | −0.673 | −0.514 | −2.804 | −1.047 | −0.773 | −0.678 |
| S1 = 3 | −0.318 | −0.243 | −0.285 | −0.273 | −0.324 | −0.296 | −0.593 | 3.249 | −0.311 | −0.318 |
| S1 = 9 | 2.617 | −0.276 | −1.040 | 2.596 | −0.281 | −0.433 | −0.997 | −0.398 | −0.344 | −0.448 |
| S1 = 13 | −0.057 | −0.087 | 0.704 | −0.080 | −0.030 | −0.126 | 0.488 | −0.231 | −0.095 | −0.221 |
| S1 = 14 | −0.121 | 0.004 | −1.373 | −0.362 | 2.377 | −0.097 | 0.467 | −0.512 | −0.090 | −0.087 |
| S1 = 12 | −0.630 | −0.203 | 1.399 | −0.862 | −0.106 | −0.134 | 1.211 | −0.505 | −0.194 | 0.104 |
| S1 = 6 | −0.048 | −0.059 | −0.642 | −0.071 | −0.063 | −0.120 | 0.012 | −1.016 | 2.312 | −0.103 |
| S1 = 2 | −0.437 | −0.288 | 1.923 | −0.276 | −0.121 | 3.463 | −2.236 | −0.309 | −0.313 | −0.202 |
| S1 = 8 | −0.064 | −0.062 | −0.483 | −0.116 | −0.084 | −0.065 | −0.794 | 1.820 | −0.050 | −0.043 |
| S1 = 11 | −0.111 | 2.839 | −0.327 | −0.937 | 0.006 | −0.035 | −0.794 | −0.327 | −0.063 | −0.092 |

b) Parameters $\beta_4^{(k)}$ and $\beta_5^{(k)}$

| Variable | Cardio. | Dermato. | Gene. | Gastro. | Neuro. | Oto. | Paedia. | Pulmo. | Rheumato. | Uro. |
|---|---|---|---|---|---|---|---|---|---|---|
| S2 = 30 | −0.057 | −0.087 | 0.704 | −0.080 | −0.030 | −0.126 | 0.488 | −0.231 | −0.095 | −0.221 |
| S2 = 25 | 2.617 | −0.276 | −1.040 | 2.596 | −0.281 | −0.433 | −0.997 | −0.398 | −0.344 | −0.448 |
| S2 = 33 | −0.026 | −0.042 | −0.418 | −0.058 | −0.017 | −0.066 | −0.001 | −0.368 | 1.157 | −0.080 |
| S2 = 26 | −0.131 | −0.075 | −0.920 | −0.382 | −0.083 | −0.145 | 0.542 | −1.216 | −0.118 | 2.704 |
| S2 = 19 | −0.318 | −0.243 | −0.285 | −0.273 | −0.324 | −0.296 | −0.593 | 3.249 | −0.311 | −0.318 |
| S2 = 32 | −0.143 | −0.945 | −1.223 | −0.664 | −0.673 | −0.514 | −2.804 | −1.047 | −0.773 | −0.678 |
| S2 = 24 | −0.064 | −0.062 | −0.483 | −0.116 | −0.084 | −0.065 | −0.794 | 1.820 | −0.050 | −0.043 |
| S2 = 29 | −0.630 | −0.203 | 1.399 | −0.862 | −0.106 | −0.134 | 1.211 | −0.505 | −0.194 | 0.104 |
| S2 = 31 | −0.121 | 0.004 | −1.373 | −0.362 | 2.377 | −0.097 | 0.467 | −0.512 | −0.090 | −0.087 |
| S2 = 27 | −0.111 | 2.839 | −0.327 | −0.937 | 0.006 | −0.035 | −0.794 | −0.327 | −0.063 | −0.092 |
| S2 = 23 | −0.446 | −0.276 | −1.517 | 3.190 | −0.234 | −0.449 | 2.249 | −0.302 | −0.359 | −0.581 |
| S2 = 20 | −0.105 | −0.206 | −0.240 | −0.133 | −0.089 | −0.092 | −1.174 | 2.714 | −0.130 | −0.119 |

b) (continued)

| Variable | Cardio. | Dermato. | Gene. | Gastro. | Neuro. | Oto. | Paedia. | Pulmo. | Rheumato. | Uro. |
|---|---|---|---|---|---|---|---|---|---|---|
| S2 = 17 | −0.169 | −0.133 | 3.108 | −0.027 | −0.109 | −0.146 | −1.292 | −0.641 | −0.163 | −0.218 |
| S2 = 18 | −0.437 | −0.288 | 1.923 | −0.276 | −0.121 | 3.463 | −2.236 | −0.309 | −0.313 | −0.202 |
| S2 = 22 | −0.048 | −0.059 | −0.642 | −0.071 | −0.063 | −0.120 | 0.012 | −1.016 | 2.312 | −0.103 |
| S2 = 21 | −0.596 | −0.462 | 1.661 | −0.529 | −0.423 | −0.581 | 2.048 | 1.973 | −0.554 | −0.631 |
| S3 = 30 | −0.057 | −0.087 | 0.704 | −0.080 | −0.030 | −0.126 | 0.488 | −0.231 | −0.095 | −0.221 |
| S3 = 48 | −0.108 | −0.135 | 2.463 | −0.108 | −0.108 | −0.107 | −1.285 | −0.242 | −0.108 | −0.107 |
| S3 = 50 | −0.382 | −0.305 | −0.768 | −0.388 | −0.408 | −0.361 | −1.387 | 5.069 | −0.361 | −0.360 |
| S3 = 35 | −0.131 | −0.075 | −0.920 | −0.382 | −0.083 | −0.145 | 0.542 | −1.216 | −0.118 | 2.704 |
| S3 = 54 | −0.111 | 2.839 | −0.327 | −0.937 | 0.006 | −0.035 | −0.794 | −0.327 | −0.063 | −0.092 |
| S3 = 52 | 2.617 | −0.276 | −1.040 | 2.596 | −0.281 | −0.433 | −0.997 | −0.398 | −0.344 | −0.448 |
| S3 = 3 | −0.488 | −0.138 | 3.544 | −0.829 | −0.697 | −0.334 | 1.137 | −0.804 | −0.399 | −0.364 |
| S3 = 55 | −0.143 | −0.945 | −1.223 | −0.664 | −0.673 | −0.514 | −2.804 | −1.047 | −0.773 | −0.678 |

c) Parameters $\beta_5^{(k)}$ and $\beta_6^{(k)}$

| Variable | Cardio. | Dermato. | Gene. | Gastro. | Neuro. | Oto. | Paedia. | Pulmo. | Rheumato. | Uro. |
|---|---|---|---|---|---|---|---|---|---|---|
| S3 = 23 | −0.630 | −0.203 | 1.399 | −0.862 | −0.106 | −0.134 | 1.211 | −0.505 | −0.194 | 0.104 |
| S3 = 56 | −0.026 | −0.042 | −0.418 | −0.058 | −0.017 | −0.066 | −0.001 | −0.368 | 1.157 | −0.080 |
| S3 = 51 | −0.048 | −0.059 | −0.642 | −0.071 | −0.063 | −0.120 | 0.012 | −1.016 | 2.312 | −0.103 |
| S3 = 8 | −0.596 | −0.462 | 1.661 | −0.529 | −0.423 | −0.581 | 2.048 | 1.973 | −0.554 | −0.631 |
| S3 = 53 | −0.061 | 0.002 | 0.645 | 0.082 | −0.001 | −0.040 | −0.007 | −0.399 | −0.055 | −0.111 |
| S3 = 18 | −0.121 | 0.004 | −1.373 | −0.362 | 2.377 | −0.097 | 0.467 | −0.512 | −0.090 | −0.087 |
| S3 = 44 | −0.446 | −0.276 | −1.517 | 3.190 | −0.234 | −0.449 | 2.249 | −0.302 | −0.359 | −0.581 |
| S3 = 21 | −0.105 | −0.206 | −0.240 | −0.133 | −0.089 | −0.092 | −1.174 | 2.714 | −0.130 | −0.119 |
| S4 = 40 | −0.446 | −0.276 | −1.517 | 3.190 | −0.234 | −0.449 | 2.249 | −0.302 | −0.359 | −0.581 |
| S4 = 35 | −0.437 | −0.288 | 1.923 | −0.276 | −0.121 | 3.463 | −2.236 | −0.309 | −0.313 | −0.202 |
| S4 = 36 | −0.318 | −0.243 | −0.285 | −0.273 | −0.324 | −0.296 | −0.593 | 3.249 | −0.311 | −0.318 |
| S4 = 41 | −0.064 | −0.062 | −0.483 | −0.116 | −0.084 | −0.065 | −0.794 | 1.820 | −0.050 | −0.043 |

c) (continued)

| Variable | Cardio. | Dermato. | Gene. | Gastro. | Neuro. | Oto. | Paedia. | Pulmo. | Rheumato. | Uro. |
|---|---|---|---|---|---|---|---|---|---|---|
| S4 = 23 | −0.131 | −0.075 | −0.920 | −0.382 | −0.083 | −0.145 | 0.542 | −1.216 | −0.118 | 2.704 |
| S4 = 42 | 2.617 | −0.276 | −1.040 | 2.596 | −0.281 | −0.433 | −0.997 | −0.398 | −0.344 | −0.448 |
| S4 = 34 | −0.108 | −0.135 | 2.463 | −0.108 | −0.108 | −0.107 | −1.285 | −0.242 | −0.108 | −0.107 |
| S4 = 43 | −0.061 | 0.002 | 0.645 | 0.082 | −0.001 | −0.040 | −0.007 | −0.399 | −0.055 | −0.111 |
| S4 = 17 | −0.111 | 2.839 | −0.327 | −0.937 | 0.006 | −0.035 | −0.794 | −0.327 | −0.063 | −0.092 |
| S4 = 46 | −0.143 | −0.945 | −1.223 | −0.664 | −0.673 | −0.514 | −2.804 | −1.047 | −0.773 | −0.678 |
| S4 = 47 | −0.026 | −0.042 | −0.418 | −0.058 | −0.017 | −0.066 | −0.001 | −0.368 | 1.157 | −0.080 |
| S4 = 37 | −0.105 | −0.206 | −0.240 | −0.133 | −0.089 | −0.092 | −1.174 | 2.714 | −0.130 | −0.119 |
| S4 = 44 | −0.608 | −0.134 | 2.171 | −1.191 | 1.680 | −0.431 | 1.604 | −1.316 | −0.489 | −0.451 |
| S4 = 39 | −0.048 | −0.059 | −0.642 | −0.071 | −0.063 | −0.120 | 0.012 | −1.016 | 2.312 | −0.103 |
| S4 = 45 | −0.687 | −0.290 | 2.103 | −0.942 | −0.137 | −0.260 | 1.698 | −0.736 | −0.289 | −0.116 |
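As an illustration of how the coefficients of Table 2 turn into probabilities through Eq. (6), the following minimal Python sketch uses only the intercepts and the S1 = 5 (coughing) coefficients copied from Table 2. It assumes that every other dummy variable is zero, so it is not a complete patient profile, and the function names `class_probabilities` and `rank_specialists` are hypothetical helpers introduced for this example. The last line checks that exponentiating a coefficient gives the corresponding odds ratio against the reference category.

```python
import math

# Reference category of the model: its linear predictor is fixed at 0.
REFERENCE = "Allergist"

# Intercepts and S1 = 5 (coughing) coefficients copied from Table 2.
INTERCEPT = {
    "Cardio.": -1.271, "Dermato.": -0.651, "Gene.": 3.871, "Gastro.": 0.190,
    "Neuro.": -0.953, "Oto.": -0.169, "Paedia.": -2.531, "Pulmo.": 2.080,
    "Rheumato.": -0.487, "Uro.": -1.376,
}
S1_COUGHING = {
    "Cardio.": -0.596, "Dermato.": -0.462, "Gene.": 1.661, "Gastro.": -0.529,
    "Neuro.": -0.423, "Oto.": -0.581, "Paedia.": 2.048, "Pulmo.": 1.973,
    "Rheumato.": -0.554, "Uro.": -0.631,
}


def class_probabilities(linear_predictors):
    """Eq. (6): exp(eta_k) / (1 + sum of exp(eta_l)) for the non-reference
    categories, and 1 / (1 + sum of exp(eta_l)) for the reference category."""
    denominator = 1.0 + sum(math.exp(eta) for eta in linear_predictors.values())
    probabilities = {k: math.exp(eta) / denominator
                     for k, eta in linear_predictors.items()}
    probabilities[REFERENCE] = 1.0 / denominator
    return probabilities


def rank_specialists(probabilities):
    """Order the specialists from the most to the least probable."""
    return sorted(probabilities, key=probabilities.get, reverse=True)


# Simplifying assumption: only the intercept and the coughing term are active.
eta = {k: INTERCEPT[k] + S1_COUGHING[k] for k in INTERCEPT}
probs = class_probabilities(eta)
ranking = rank_specialists(probs)

# Odds ratio of S1 = coughing for the pulmonologist versus the reference.
odds_ratio_pulmo = math.exp(S1_COUGHING["Pulmo."])

print(ranking[:3])
print(round(odds_ratio_pulmo, 2))
```

Under this simplification the ranking places the general practitioner first and the pulmonologist second, and the exponentiated coefficient 1.973 reproduces the odds ratio of 7.19 quoted in the text.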

Table 3
The list of symptoms and codes associated with the diseases used in this work.

| Symptoms | No. | Symptoms | No. | Symptoms | No. | Symptoms | No. |
|---|---|---|---|---|---|---|---|
| Abscess | 1 | Digestive problems | 15 | Frontal headache | 29 | Skin boils | 43 |
| Runny nose | 2 | Big toe pain | 16 | Amnesia | 30 | Vomiting | 44 |
| Fever | 3 | Skin rashes | 17 | Headache | 31 | Weakness | 45 |
| Cold-like symptoms | 4 | Sore throat | 18 | Diarrhoea | 32 | Sneezing | 46 |
| Coughing | 5 | Chills | 19 | Big toe swelling | 33 | Swollen ball of foot | 47 |
| Pain | 6 | Painful cough | 20 | Poor skin healing | 34 | Athlete's foot | 48 |
| Abdominal pain | 7 | Wheezing | 21 | Cough | 35 | Aching muscles | 49 |
| Shortness of breath | 8 | Swelling | 22 | Dry cough | 36 | Chest pain | 50 |
| Heart attack | 9 | Nausea | 23 | Throat pain | 37 | Redness | 51 |
| Decreased urine | 10 | Dyspnoea | 24 | Chest tightness | 38 | Esophagitis | 52 |
| Skin problems | 11 | Angina | 25 | Warmth | 39 | Skin ulcers | 53 |
| One-sided headache | 12 | Breathlessness | 26 | Constipation | 40 | Eczema | 54 |
| Loss of appetite | 13 | Hives | 27 | Fatigue | 41 | Itching | 55 |
| Mild fever | 14 | Abdominal swelling | 28 | Panic attack | 42 | Pain in ball of foot | 56 |

Table 4 The Characteristics of Some Patients Suffering from Diseases that Keep them in a Stable Condition.

| Patient No. | Age | Sex | Symptom1 | Symptom2 | Symptom3 | Symptom4 |
|---|---|---|---|---|---|---|
| 1 | [0; 17] | M | Digestive problems | Diarrhoea | Itching | Sneezing |
| 2 | [50; 69] | M | Heart attack | Angina | Esophagitis | Panic attack |
| 3 | [30; 49] | F | Skin problems | Hives | Eczema | Skin rashes |
| 4 | [50; 69] | F | One-sided headache | Frontal headache | Nausea | Weakness |
| 5 | [0; 17] | F | Abscess | Skin rashes | Athlete's foot | Poor skin healing |
| 6 | [18; 29] | M | Abdominal pain | Nausea | Vomiting | Constipation |
| 7 | [30; 49] | F | Mild fever | Headache | Sore throat | Vomiting |
| 8 | [50; 69] | F | Coughing | Wheezing | Shortness of breath | Chest tightness |
| 9 | [30; 49] | M | Fever | Chills | Chest pain | Dry cough |
| 10 | [30; 49] | F | Runny nose | Sore throat | Aching muscles | Cough |
| 11 | [0; 17] | M | Abdominal pain | Nausea | Vomiting | Constipation |
| 12 | [70; Older] | F | Pain | Swelling | Redness | Warmth |
| 13 | [50; 69] | M | Decreased urine | Breathlessness | Cough | Nausea |

The likelihood ratio test found that the final model deviated significantly from the null model (deviance = 21445.18, p-value < 0.005). The model showed good discrimination (Table 7), with an overall accuracy of 80% (95% CI: 73.82%–85.45%); that is, the multinomial model predicted the physician correctly in 80% of cases. To evaluate the behaviour of the model globally, we computed the area under the ROC curve (AUC = 0.88). The proposed model also showed excellent agreement (Kappa = 0.77) and a high correlation between the predicted and observed classes of physicians (MCC = 0.78).
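As a rough illustration of the agreement statistic quoted above, the following sketch computes Cohen's kappa from a multiclass confusion matrix. The function name `cohen_kappa` and the 3 × 3 toy matrix are hypothetical and are not taken from the study's data; the sketch only shows how a value such as Kappa = 0.77 could be obtained from observed and predicted classes.

```python
def cohen_kappa(confusion):
    """Cohen's kappa for a square confusion matrix.

    Rows are observed classes and columns are predicted classes.
    """
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    # Observed agreement: share of cases on the diagonal.
    observed = sum(confusion[i][i] for i in range(k)) / n
    # Agreement expected by chance from the row and column totals.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical toy data: 3 classes, 30 cases, 24 on the diagonal.
toy_confusion = [
    [8, 1, 1],
    [1, 8, 1],
    [1, 1, 8],
]
toy_kappa = cohen_kappa(toy_confusion)
```

Here the observed agreement is 0.8 and the chance agreement is 1/3, so kappa corrects the raw accuracy downwards for agreement that would occur by chance.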

Table 5 The Estimated Probabilities of each Specialist for the Perspective of a Good Diagnosis Using the Proposed Model.

| Patient No. | True Spec. | P_Aller. | P_Cardio. | P_Dermato. | P_Gene. | P_Gastro. | P_Neuro. | P_Oto. | P_Paedia. | P_Pulmo. | P_Rheumato. | P_Uro. | Prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Aller. | 0.711 | 0.5 | 0.502 | 0.502 | 0.501 | 0.501 | 0.5 | 0.507 | 0.503 | 0.502 | 0.503 | Aller. |
| 2 | Cardio. | 0.5 | 0.621 | 0.5 | 0.5 | 0.623 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | Gastro. |
| 3 | Dermato. | 0.5 | 0.5 | 0.73 | 0.501 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | Dermato. |
| 4 | Gene. | 0.5 | 0.5 | 0.5 | 0.731 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | Gene. |
| 5 | Gene. | 0.5 | 0.5 | 0.5 | 0.731 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | Gene. |
| 6 | Gastro. | 0.5 | 0.5 | 0.55 | 0.55 | 0.73 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | Gastro. |
| 7 | Neuro. | 0.5 | 0.5 | 0.5 | 0.501 | 0.5 | 0.729 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | Neuro. |
| 8 | Pulmo. | 0.5 | 0.5 | 0.5 | 0.624 | 0.5 | 0.5 | 0.5 | 0.5 | 0.619 | 0.5 | 0.5 | Gene. |
| 9 | Pulmo. | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.731 | 0.5 | 0.5 | Pulmo. |
| 10 | Oto. | 0.5 | 0.5 | 0.5 | 0.588 | 0.5 | 0.5 | 0.655 | 0.5 | 0.5 | 0.5 | 0.5 | Oto. |
| 11 | Paedia. | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.73 | 0.5 | 0.5 | 0.5 | Paedia. |
| 12 | Rheumato. | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.73 | 0.503 | Rheumato. |
| 13 | Uro. | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.731 | Uro. |
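The Prediction column of Table 5 is consistent with selecting the specialist with the highest estimated probability, and the remaining probabilities give the ranking. The sketch below illustrates this with the rows for patients 2 and 8 copied from Table 5. The function name `rank_specialists` is hypothetical, and breaking ties by column order is an assumption of this sketch, not something stated in the paper.

```python
SPECIALISTS = [
    "Aller.", "Cardio.", "Dermato.", "Gene.", "Gastro.", "Neuro.",
    "Oto.", "Paedia.", "Pulmo.", "Rheumato.", "Uro.",
]

# Estimated probabilities copied from Table 5 (patients 2 and 8).
PATIENT_2 = [0.5, 0.621, 0.5, 0.5, 0.623, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
PATIENT_8 = [0.5, 0.5, 0.5, 0.624, 0.5, 0.5, 0.5, 0.5, 0.619, 0.5, 0.5]


def rank_specialists(probabilities, names=SPECIALISTS):
    """Return (specialist, probability) pairs, highest probability first.

    sorted() is stable, so tied specialists keep their column order
    (an assumption made for this sketch).
    """
    pairs = zip(names, probabilities)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


ranking_2 = rank_specialists(PATIENT_2)
ranking_8 = rank_specialists(PATIENT_8)
predicted_2 = ranking_2[0][0]  # top-ranked specialist for patient 2
predicted_8 = ranking_8[0][0]  # top-ranked specialist for patient 8
```

For both patients the true specialist is ranked second by a margin of at most 0.005, which shows why a ranked list can be more informative than a single predicted class.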


Table 6 The Model Performance Measures Used in our Study.

$$Se_k = Rl_k = \frac{TP_k}{TP_k + FN_k}$$

$$Sp_k = \frac{TN_k}{TN_k + FP_k}$$

$$Acc_k = \frac{TP_k + TN_k}{TP_k + TN_k + FP_k + FN_k}$$

$$Pr_k = \frac{TP_k}{TP_k + FP_k}$$

$$AUC_k = \frac{Se_k + Sp_k}{2}$$

$$MCC_k = \frac{TP_k \times TN_k - FP_k \times FN_k}{\sqrt{(TP_k + FP_k) \times (TP_k + FN_k) \times (TN_k + FP_k) \times (TN_k + FN_k)}}$$

TP: true positive; FP: false positive; TN: true negative; FN: false negative; Se: sensitivity; Sp: specificity; Rl: recall; Acc: accuracy; MCC: Matthews correlation coefficient; Pr: precision; AUC: area under the receiver operating characteristic (ROC) curve.

Table 7 The results of the proposed prediction model.

| Perf. measures | Aller. | Cardio. | Dermato. | Gene. | Gastro. | Neuro. | Oto. | Paedia. | Pulmo. | Rheumato. | Uro. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.86 | 0.77 | 0.93 | 0.82 | 0.92 | 0.94 | 0.89 | 0.90 | 0.87 | 0.89 | 0.92 |
| Sensitivity | 0.75 | 0.54 | 0.86 | 0.74 | 0.90 | 0.88 | 0.78 | 0.82 | 0.77 | 0.80 | 0.85 |
| Specificity | 0.98 | 0.99 | 0.99 | 0.91 | 0.93 | 0.99 | 0.99 | 0.98 | 0.98 | 0.99 | 0.98 |
| MCC | 0.75 | 0.65 | 0.88 | 0.63 | 0.77 | 0.89 | 0.83 | 0.80 | 0.78 | 0.79 | 0.78 |
| Pr | 0.81 | 0.85 | 0.92 | 0.67 | 0.73 | 0.94 | 0.93 | 0.82 | 0.85 | 0.80 | 0.75 |
| AUC | 0.85 | 0.917 | 0.904 | 0.883 | 0.893 | 0.901 | 0.906 | 0.805 | 0.827 | 0.906 | 0.913 |
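The per-class measures defined in Table 6 can be computed from one-vs-rest counts for each specialist class k. The following is a minimal sketch: the function names `one_vs_rest_counts` and `class_metrics` and all numbers are hypothetical, chosen only to illustrate the formulas, and the values in Table 7 are not reproduced here.

```python
import math


def one_vs_rest_counts(confusion, k):
    """Return (TP, FP, TN, FN) for class k.

    Rows of `confusion` are observed classes, columns are predicted classes.
    """
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                 # observed k, predicted other
    fp = sum(row[k] for row in confusion) - tp  # predicted k, observed other
    tn = total - tp - fn - fp
    return tp, fp, tn, fn


def class_metrics(tp, fp, tn, fn):
    """Compute the Table 6 measures for one class from its four counts."""
    se = tp / (tp + fn)                    # sensitivity (recall)
    sp = tn / (tn + fp)                    # specificity
    pr = tp / (tp + fp)                    # precision
    acc = (tp + tn) / (tp + tn + fp + fn)  # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Returning 0 for an empty margin is a convention assumed in this sketch.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Se": se, "Sp": sp, "Pr": pr, "Acc": acc,
            "AUC": (se + sp) / 2, "MCC": mcc}


# Hypothetical counts for a single class.
example = class_metrics(tp=40, fp=10, tn=45, fn=5)
```

Averaging each measure over the 11 classes would give summary figures of the kind quoted for the model as a whole.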

5. Discussion

For some diseases, it is difficult for a single physician to provide an accurate diagnosis. In such situations, it is desirable that the diagnosis be made by a multidisciplinary team of physicians [22], [28]. Rather than each developing an inaccurate diagnosis alone, physicians should work together to improve diagnostic accuracy. In most of the literature addressing multidisciplinary diagnosis for a given disease, a team of specialists is put in place and does not change. This is the case in Ref. [28], which focused on idiopathic pulmonary fibrosis; in that work, the team was composed of pneumologists, radiologists and pathologists. However, a specialist or group of specialists may be unavailable [10], [17]. The team must therefore be flexible, so that a member can be replaced by the best possible alternative depending on the availability of specialists.

In this work, we examine the possibility of forming a team of physicians using a probabilistic model. As this paper shows, data mining techniques can be used to develop models that predict which physicians can provide a good diagnosis. An interesting observation from the previous results is that the values of some independent variables influence the selection of a physician. This selection is not unique, so several physicians of different specialities can be chosen based on the probabilities estimated by the regression model. We have ∀ k, P(Yk) ≥ 0.5, which suggests that all specialists are able to diagnose the patient. However, P(Y1) > P(Y2) and P(Y1) > P(Y3) mean that specialist Y1 is more likely to diagnose the disease than are Y2 and Y3. Moreover, if a specialist or group of specialists is unavailable, they can be replaced by another with a similar probability.

The main limitation of this study is that it does not take the complete patient history into account when developing patient profiles. To solve patient problems, physicians sometimes trace the medical history and the history of the complaint, the patient's current pain, the results of the various explorations already made, and even the treatments undertaken by other physicians. The medical history and the history of the complaint can be crucial to selecting the appropriate medical specialist for a successful diagnosis. We must also recognize that the symptoms retained in this work to describe the manifestation of a disease constitute a limitation, in at least two respects. The first is size: only four clinical signs are retained per disease, which is relatively small. The second is the order in which symptoms appear. The proposed model is influenced by the order of symptoms, whereas in routine clinical practice physicians do not necessarily take this order into account when making a diagnosis. Finally, given the number of diseases (17) and doctors (11), our study has a relatively small sample of patients whose demography is mostly American. This sample is not sufficiently diverse to represent patients presenting to other hospitals.
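The idea discussed in this section of replacing an unavailable specialist with the next best option can be sketched as a selection over the estimated probabilities. The probabilities below are those of patient 6 in Table 5 for four of the specialities; the function name `form_team`, the team size and the availability sets are hypothetical choices made for illustration, not part of the proposed model.

```python
def form_team(probabilities, available, size):
    """Pick the `size` highest-probability specialists among those available.

    `probabilities` maps specialist name to estimated probability and
    `available` is the set of specialists who can currently take part.
    Ties keep the insertion order of the dict (an assumption of this sketch).
    """
    candidates = [
        (name, p) for name, p in probabilities.items() if name in available
    ]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in candidates[:size]]


# Patient 6 in Table 5, restricted to four specialities for brevity.
patient_6 = {"Cardio.": 0.5, "Dermato.": 0.55, "Gene.": 0.55, "Gastro.": 0.73}

everyone = set(patient_6)
team_all = form_team(patient_6, everyone, size=2)

# If the gastroenterologist is unavailable, the next-ranked specialists step in.
team_without_gastro = form_team(patient_6, everyone - {"Gastro."}, size=2)
```

The team composition changes only as far as availability requires, which keeps the group as close as possible to the ranking produced by the model.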

6. Conclusion

Probability models are powerful tools for prediction in the medical field. They have not been widely used to make predictions about individual physicians, but they can be used in this way. The data mining technique adopted in this study is multinomial logistic regression. Based on our experimental results, we believe that choosing an appropriate physician is possible with an appropriate predictive model. In this paper, we estimated the model's accuracy, tested its goodness of fit and estimated its precision. In addition, we performed cross-validation of the model, which provided a more efficient estimate of model accuracy than the commonly used split-sample validation. The accuracy of the multinomial logistic regression model was 80%, and the model proved to be appropriate for the dataset. The results show that all medical specialists are able to diagnose the selected diseases; however, some specialists are probably better suited than others. Through the estimated probabilities, the model offers a ranking of medical specialists who can offer a good diagnosis. This ranking is a good basis for forming a flexible group of specialists who can participate in a multidisciplinary diagnosis.

Acknowledgments

This work is dedicated to the memory of our mentor, Paul Mougnutou, who died suddenly in 2011. The authors wish to thank the other members of FASEG and SJD at Ngaoundere and Douala for their help throughout the course of this work; in particular, J. G. Ndjock Ndangwa and N. Gorba Gnowa provided many useful discussions. We thank Lilian Mupan and Grace Tabah for their help with English grammatical structure and phraseology.

References

[1] Rice John A. Mathematical statistics and data analysis. China Machine Press, Beijing; 2003.
[2] Aminot I, Damon MN. Régression logistique: intérêt dans l'analyse de données relatives aux pratiques médicales. Rev Med Assur Mal 2002;33:134–43.
[3] Molloy Jennifer C. The open knowledge foundation: open data means better science. PLoS Biol 2011;9:e1001195.
[4] Carlos Ordonez. Comparing association rules and decision trees for disease prediction. Proceedings of the international workshop on Healthcare information and knowledge management. ACM; 2006. p. 17–24.
[5] Cut Fiarni. Design of personalized asthma management system with data mining methods. Proceeding of the Electrical Engineering Computer Science and Informatics, vol. 1. 2014. p. 120–3.
[6] Thorpe Kevin E. How to construct regression models for observational studies (and how not to do it!). Can J Anesth 2017;64:461–70.
[7] Flaherty Kevin R, King Jr. Idiopathic interstitial pneumonia: what is the effect of a multidisciplinary approach to diagnosis? Am J Respir Crit Care Med 2004;170:904–10.
[8] Baxt William G. Application of artificial neural networks to clinical medicine. Lancet 1995;346:1135–8.
[9] Gillaizeau F, Grabar S. Modèles de régression multiple. Sang Thromb Vaiss 2011;23:360–70.
[10] Göthlin Jan H, Geitung Jonn T. Waiting for the doctor: the economic impact of the unavailability of radiologists. Acad Radiol 1996;3:S51–2.
[11] Katz Mitchell H. Multivariable analysis: a practical guide for clinicians and public health researchers. Cambridge University Press; 2011.
[12] Organization World Health. ICD-9-CM: international classification of diseases, 9th revision: clinical modification. PMIC (Practice Management Information Corporation); 1998.
[13] Hirotugu Akaike. Information theory and an extension of the maximum likelihood principle. Springer; 1998.
[14] Ioannis Kavakiotis, Olga Tsave, et al. Machine learning and data mining methods in diabetes research. Comput Struct Biotechnol J 2017;15:104–16.
[15] Hox Joop J, Mirjam Moerbeek, et al. Multilevel analysis: techniques and applications. Routledge; 2017.
[16] Javed K, Wei Jun S, Ringner M. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med 2001;7:673–9.
[17] Jostein L, Jan U, Yngve R, Hjortdahl Per. Patient allocations in general practice in case of patients' preferences for gender of doctor and their unavailability. BMC Res Notes 2011;4:112.
[18] Kurt Kroenke, Spitzer Robert L, BW WJ. Physical symptoms in primary care: predictors of psychiatric disorders and functional impairment. Arch Fam Med 1994;3:774.
[19] Wang Lipo. Support vector machines: theory and applications. Springer Science and Business Media; 2005.
[20] Lu J, Q Z. The elements of statistical learning: data mining, inference, and prediction. J Roy Stat Soc 2010;173:693–4.
[21] Bossuyt Patrick M, Reitsma Johannes B, Bruns David E, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ 2015;351:h5527.
[22] Scarpa M, Almássy Z, et al. Mucopolysaccharidosis type II: European recommendations for the diagnosis and multidisciplinary management of a rare disease. Orphanet J Rare Dis 2011;6:72.
[23] Marina Sokolova, Guy Lapalme. A systematic analysis of performance measures for classification tasks. Inf Process Manag 2009;45:427–37.
[24] Mehmed Kantardzic. Data mining: concepts, models, methods, and algorithms. John Wiley and Sons; 2011.
[25] Nir Y, Jens G, Qian-fei W. Prediction of phenotype information from genotype data. Commun Inf Syst 2010;2:99–114.
[26] Kostkova P, Brewer H, et al. Who owns the data? Open data for healthcare. Frontiers in Public Health 2016;4(7).
[27] Éric Galam. L'erreur médicale. Rev Du Prat - Med Gen 2003;626:1231–4.
[28] Tomassetti S, Wells Athol U, et al. Bronchoscopic lung cryobiopsy increases diagnostic confidence in the multidisciplinary diagnosis of idiopathic pulmonary fibrosis. Am J Respir Crit Care Med 2016;193:745–52.
[29] Shelledy, Peters Jay I. Respiratory care: patient assessment and care plan development. Jones and Bartlett Publishers; 2014.
[30] Tufféry Stéphane. Data mining and statistics for decision making. Wiley, Chichester; 2011.
[31] Usama F, Piatetsky-Shapiro G. From data mining to knowledge discovery in databases. AI Mag 1996;17:37.
[32] Fourestié V, Roussignol D. Clinical classification of emergency patients: definition and reproducibility. Classification clinique des malades des urgences. Définition et reproductibilité. Réanimation Urgences 1994;3:573–8.
[33] Vikas Chaurasia. Early prediction of heart diseases using data mining techniques. Caribb J Sci Tech 2013;1:208–17.