Chinese Astronomy and Astrophysics 43 (2019) 539–548
Study of Star/Galaxy Classification Based on the XGBoost Algorithm†

LI Chao1,2   ZHANG Wen-hui3,4   LIN Ji-ming1

1 College of Information and Communication Engineering, Guilin University of Electronic Technology, Guilin 541004
2 Key Laboratory of Cognitive Radio and Information Processing, Ministry of Education, Guilin University of Electronic Technology, Guilin 541004
3 Guangxi Cooperative Innovation Center of Cloud Computing and Big Data, Guilin University of Electronic Technology, Guilin 541004
4 Guangxi Colleges and Universities Key Laboratory of Cloud Computing and Complex Systems, Guilin University of Electronic Technology, Guilin 541004
Abstract  Machine learning has achieved great success in many areas. Boosting algorithms adapt well to a wide range of scenarios with high accuracy and have played an important role in many fields, but their application in astronomy is still rare. To address the low classification accuracy on the faint star/galaxy source set of the Sloan Digital Sky Survey (SDSS), this paper introduces a recent achievement of machine learning, eXtreme Gradient Boosting (XGBoost). The complete photometric data set is obtained from SDSS-DR7 and divided into a bright source set and a faint source set according to magnitude. First, ten-fold cross-validation is applied to the bright and faint source sets respectively, and the XGBoost algorithm is used to build star/galaxy classification models. Then, grid search and related methods are used to tune the XGBoost parameters. Finally, taking the galaxy classification accuracy and other indicators as criteria, the classification results are analyzed by comparison with the models of the function tree (FT), adaptive boosting (Adaboost), random forest (RF), gradient boosting decision tree (GBDT), stacked denoising autoencoders (SDAE), and deep belief network (DBN). The experimental results show that XGBoost improves the galaxy classification accuracy on the faint source set by nearly 10% compared with the function tree algorithm, and improves the classification accuracy for the sources with the faintest magnitudes in that set by nearly 5%. Compared with the other traditional machine learning algorithms and deep neural networks, XGBoost also shows improvements of different degrees.

Key words  stars: fundamental parameters—galaxies: fundamental parameters—techniques: photometric—methods: data analysis

† Supported by the project of the Guangxi Cloud Computing and Big Data Collaborative Innovation Center and the Key Laboratory of Cloud Computing and Complex Systems of Guangxi Colleges (No. 1716)
Received 2019–01–07; revised version 2019–01–11
A translation of Acta Astronomica Sinica Vol. 60, No. 2, pp. 16.1–16.10, 2019
[email protected]
[email protected]

0275-1062/19/$ - see front matter © 2019 Elsevier B.V. All rights reserved.
doi: 10.1016/j.chinastron.2019.11.005

1. INTRODUCTION
In recent years, with the development of space science and technology and the successive implementation of large-scale survey projects, the amount of astronomical data has grown exponentially, with data volumes measured in terabytes and even petabytes; astronomy has evidently entered an unprecedented stage, i.e., the era of big data, massive information, and whole-waveband observation[1]. Faced with such huge and complicated astronomical data, performing highly efficient and accurate data analysis becomes extremely important. Star/galaxy classification has always been a fundamental task of astronomical data analysis, and the earliest studies on this subject can be traced back to the 18th century[2]. The original star/galaxy classification methods, based on morphology, heuristic division, and the like, were widely used in earlier times. With the continuous development of machine learning, more and more studies on star/galaxy classification algorithms have been carried out. For example, by removing outlier data and adopting an automatic clustering method, Yan et al.[3] performed star/galaxy classification on the photometric data of SDSS-DR6 (Sloan Digital Sky Survey Data Release 6); the results showed that the automatic clustering method has a rather high efficiency. Vasconcellos et al.[4] used about 13 different decision tree algorithms to perform star/galaxy classification on the photometric data of SDSS-DR7; the results showed that the function tree algorithm is superior to the other decision tree algorithms in star/galaxy classification. Sevilla-Noarbe et al.[5] studied the application of boosted decision trees to star/galaxy classification based on the feature data set given by the photometric image catalogue of SDSS-DR9; the experimental results showed that the classification performance of the boosted decision tree is superior to the photometric type classifier provided with the SDSS data set. Kim et al.[6] proposed a deep convolutional network framework, applied it to astronomical image data for star/galaxy classification, and obtained very good results. By comparing the classification performance of the deep belief network (DBN), neural network (NN), and support vector machine (SVM) algorithms on SDSS data, Li et al.[7] analyzed whether these three automatic classification methods are usable. Liu et al.[8] proposed a kind of nonparametric regression and combined it with Adaboost (adaptive boosting) for the
MK classification of stellar spectra, classifying stars by spectral type and luminosity type while recognizing the subtypes of each spectral type. In the context of ensemble learning, Xan et al.[9] explored star/galaxy classification in astronomy and gave a reasonable explanation. Although many excellent algorithms have been studied and used, they all have shortcomings, such as weak generalization ability: the classification accuracy is very high on the bright source set but low on the faint source set, and this problem has never been solved effectively. Up to now, the application of the XGBoost (eXtreme Gradient Boosting) algorithm to astronomical data mining, and in particular to star/galaxy classification, has rarely been seen. Based on this fact, this paper studies a star/galaxy classification algorithm based on XGBoost, applies the XGBoost method to the photometric data of SDSS-DR7 for the first time, and compares the classification results of XGBoost with those of the function tree (FT), Adaboost, random forest (RF), gradient boosting decision tree (GBDT), stacked denoising autoencoders (SDAE), DBN, and other models, to verify the application value of the XGBoost method in astronomical research.
2. SLOAN DIGITAL SKY SURVEY
Up to now, many sky survey projects have come into use around the world, and among them the SDSS is considered the most successful and influential. The photometric system of the SDSS measures celestial objects in the five bands u, g, r, i, and z. The photometric data used in this paper refer only to the r band. Within the photometric data, the subset that possesses both identified spectroscopic parameters and photometric parameters makes up only a very small part of the total; the large remaining part possesses photometric parameters only. This means that the XGBoost star/galaxy classification model proposed in this paper may provide an effective way to classify accurately those celestial objects without identified spectroscopic parameters.
3. BOOSTING ALGORITHM
The boosting algorithm is based on the idea that, for an arbitrarily complicated task, the final judgement obtained by properly synthesizing the judgements of many experts is better than the individual judgement of any single expert among them; in practice, this resembles the saying that "two heads are better than one". Boosting is a frequently used statistical learning method with very wide application and very good effect. In classification problems, it first learns multiple classifiers by iteratively updating the weights of the training samples, and then forms a linear combination of these classifiers to enhance the overall classification ability.
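As a minimal illustration of this committee-of-experts idea (not the algorithm evaluated later in this paper), the following Python sketch uses scikit-learn's AdaBoostClassifier, which reweights the training samples after each round and linearly combines weak decision stumps; the synthetic data are hypothetical stand-ins.

```python
# A committee of weak "experts": AdaBoost reweights the training samples
# after each round and linearly combines weak learners (decision stumps,
# the library default) into a classifier stronger than any single stump.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=12, random_state=0)

stump = DecisionTreeClassifier(max_depth=1)                 # one weak expert
committee = AdaBoostClassifier(n_estimators=100, random_state=0)

print("single stump:", stump.fit(X, y).score(X, y))
print("committee   :", committee.fit(X, y).score(X, y))
```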
3.1 GBDT Principle
The GBDT algorithm[10] is essentially a boosting algorithm with decision trees as the basis functions. The GBDT model can be expressed as an additive model of decision trees:

f_M(x) = \sum_{m=1}^{M} T(x; \theta_m) ,    (1)
where x denotes a sample, T(x; \theta_m) is a decision tree, \theta_m is the parameter of that tree, and M is the number of decision trees. GBDT uses the forward stagewise algorithm. First, the initial boosted tree is set to f_0(x) = 0; then, following the forward stagewise algorithm, the model at the m-th step is

f_m(x) = f_{m-1}(x) + T(x; \theta_m) ,    (2)

where f_{m-1}(x) is the current model. Finally, the parameter \theta_m of the next decision tree is determined by empirical risk minimization:

\hat{\theta}_m = \arg\min_{\theta_m} \sum_{i=1}^{N} L(y_i, f_{m-1}(x_i) + T(x_i; \theta_m)) ,    (3)
where, in Eq. (3), y_i is the true label of the i-th sample and N is the number of samples. When L is taken to be the squared-error loss,

L(y, f(x)) = (y - f(x))^2 ,    (4)

where y is the true label, the loss function becomes

L(y, f_{m-1}(x) + T(x; \theta_m)) = [y - f_{m-1}(x) - T(x; \theta_m)]^2 ,    (5)

so each new tree simply fits the residual y - f_{m-1}(x) of the current model.
For classification problems, the GBDT algorithm restricts the basis classifiers to classification trees. Even though the relationship between input and output in the training data may be complicated, the intrinsic character of the decision tree model ensures that a linear combination of decision trees can fit the training data well and yield the model parameters.
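A minimal sketch of Eqs. (1)–(5) under the squared-error loss: each new tree fits the residual y - f_{m-1}(x) of the current model. The synthetic data and the shallow scikit-learn regression trees used as basis functions are illustrative assumptions, not the paper's setup.

```python
# Forward stagewise additive modelling (Eqs. 1-5): with squared-error
# loss, the m-th tree T(x; theta_m) fits the residual of the current
# model f_{m-1}. Synthetic 1-D regression data for illustration.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(500)

M, trees = 50, []
f = np.zeros_like(y)                      # f_0(x) = 0
for m in range(M):
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, y - f)                    # fit the current residual
    trees.append(tree)
    f += tree.predict(X)                  # f_m = f_{m-1} + T(x; theta_m)

print("training MSE:", np.mean((y - f) ** 2))
```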
3.2 XGBoost Principle
XGBoost[11] is also a boosting algorithm. Unlike traditional GBDT, which uses only first-order derivative information in the optimization, XGBoost improves the optimization considerably: it performs a second-order Taylor expansion of the loss function, so that, while the first-order derivative information is retained, second-order derivative information is also added, which makes the model converge more quickly. Moreover, in order to control the complexity of the model, a regularization term is added to the loss function of XGBoost to prevent overfitting. The derivation of the XGBoost algorithm proceeds as follows. Assume that D = {(x_i, y_i)} (|D| = n, x_i \in \mathbb{R}^d, y_i \in \mathbb{R}) is a data set of n samples, each with d features, and x_i is the i-th sample. The tree ensemble model predicts the final result using K additive functions (K being the number of trees):

\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i) , \quad f_k \in \mathcal{F} ,    (6)
in which \mathcal{F} = {f(x) = w_{q(x)}} (q: \mathbb{R}^d \to {1, \dots, T}, w \in \mathbb{R}^T) is the function space of decision trees: q maps a sample in \mathbb{R}^d to the index of the corresponding leaf, T is the number of leaf nodes, and \mathbb{R}^T is the space of the leaf weights w. The functional relation between the sample x_i and the predicted value \hat{y}_i is denoted by \phi; w_{q(x)} maps each sample to a leaf value, i.e., the value of f(x); f_k is the model of the k-th tree, and each f_k corresponds to an independent tree structure q and leaf weights w. In order to learn the set of functions used in the model, the regularized objective function is defined as

L(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k) , \quad \Omega(f) = \gamma T + \frac{1}{2} \lambda \|w\|^2 ,    (7)
where l is a differentiable convex loss function that measures the difference between the predicted value \hat{y}_i and the true value y_i, \Omega is the penalty term on model complexity, \gamma is the regularization parameter on the number of leaves (restraining nodes from splitting further), and \lambda is the regularization parameter on the leaf weights. The objective is to minimize the loss function

L^{(t)} = \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t) ,    (8)

in which L^{(t)} is the objective function of the t-th tree; \hat{y}_i^{(t-1)} is the sum of the outputs of the previous t-1 trees, i.e., the prediction after t-1 trees; f_t is the model of the t-th tree, f_t(x_i) is its output, and the sum \hat{y}_i^{(t-1)} + f_t(x_i) forms the updated prediction. Define g_i and h_i as

g_i = \frac{\partial l(y_i, \hat{y}_i^{(t-1)})}{\partial \hat{y}_i^{(t-1)}} ,    (9)

h_i = \frac{\partial^2 l(y_i, \hat{y}_i^{(t-1)})}{\partial (\hat{y}_i^{(t-1)})^2} .    (10)
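As a concrete instance of Eqs. (9) and (10): for the binary logistic loss (a common choice for two-class problems such as star/galaxy; the paper does not fix l explicitly), with p = \sigma(\hat{y}^{(t-1)}) the derivatives take the closed forms g_i = p_i - y_i and h_i = p_i(1 - p_i), as in the sketch below.

```python
# Illustrative g_i and h_i (Eqs. 9-10) for the binary logistic loss,
# an assumed loss choice for star/galaxy labels (0/1). With
# p = sigmoid(yhat_prev): g = p - y and h = p * (1 - p).
import numpy as np

def grad_hess_logistic(y_true, yhat_prev):
    p = 1.0 / (1.0 + np.exp(-yhat_prev))  # sigmoid of the current margin
    g = p - y_true                        # first derivative  (Eq. 9)
    h = p * (1.0 - p)                     # second derivative (Eq. 10)
    return g, h

g, h = grad_hess_logistic(np.array([0, 1, 1]), np.array([0.2, -0.4, 1.3]))
print(g, h)
```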
Expanding the loss function at \hat{y}_i^{(t-1)} by the Taylor formula, we have

L^{(t)} \approx \sum_{i=1}^{n} \left[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) .    (11)
Removing the constant term, the loss function at the t-th iteration becomes

\tilde{L}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) .    (12)
Defining I_j = {i | q(x_i) = j} as the set of samples assigned to leaf node j, from Eq. (12) we obtain

\tilde{L}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
              = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T ,    (13)
where w_j is the weight of leaf node j. For a fixed tree structure q(x), the optimal weight w_j^* of leaf node j can be calculated as

w_j^* = - \frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda} ,    (14)
and substituting w_j^* into the objective function gives

\tilde{L}^{(t)}(q) = - \frac{1}{2} \sum_{j=1}^{T} \frac{\left( \sum_{i \in I_j} g_i \right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T .    (15)
Eq. (15) can be taken as an indicator of the quality of a tree structure, and used to score a tree structure q. Even so, it is practically impossible to enumerate all possible tree structures q; hence, a greedy algorithm is used to add branches iteratively at each leaf node. Assuming that I_L and I_R are the left and right leaf node sets after a split, i.e., I = I_L \cup I_R, the loss reduction of the split is

L_{split} = \frac{1}{2} \left[ \frac{\left( \sum_{i \in I_L} g_i \right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left( \sum_{i \in I_R} g_i \right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left( \sum_{i \in I} g_i \right)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma .    (16)
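The following numpy sketch evaluates Eqs. (14)–(16) for a single candidate split; the gradient, Hessian, and partition arrays are hypothetical stand-ins for the per-sample quantities defined above.

```python
# Sketch of Eqs. (14)-(16): optimal leaf weight and the gain of one
# candidate split, given per-sample gradients g and Hessians h.
import numpy as np

def leaf_weight(g, h, lam):
    # Eq. (14): w* = -sum(g) / (sum(h) + lambda)
    return -g.sum() / (h.sum() + lam)

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    # Eq. (16): structure-score improvement of splitting I into I_L, I_R
    def score(gs, hs):                      # one term of Eq. (15)
        return gs.sum() ** 2 / (hs.sum() + lam)
    gL, hL = g[left_mask], h[left_mask]
    gR, hR = g[~left_mask], h[~left_mask]
    return 0.5 * (score(gL, hL) + score(gR, hR) - score(g, h)) - gamma

g = np.array([-0.8, -0.6, 0.4, 0.7])         # hypothetical gradients
h = np.array([0.16, 0.24, 0.24, 0.21])       # hypothetical Hessians
mask = np.array([True, True, False, False])  # candidate left/right partition
print(split_gain(g, h, mask), leaf_weight(g[mask], h[mask], lam=1.0))
```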
4. EXPERIMENTAL TEST
4.1 Introduction of the Data Set
In order to compare with the existing algorithms on an equal footing, the star/galaxy data set adopted in this study is extracted from the SDSS database using an SQL (Structured Query
Language) query, consistent with Reference [4]. The data features are shown in Table 1.

Table 1 The features for the SDSS-DR7 star/galaxy classification

Variable          Attribute
psfMag            PSF (point-spread function) magnitude
fiberMag          Fiber magnitude
petroMag          Petrosian magnitude
modelMag          Model magnitude
petroRad          Petrosian radius
petroR50          Radius carrying 50% of the Petrosian flux
petroR90          Radius carrying 90% of the Petrosian flux
lnLStar           Likelihood, PSF model
lnLExp            Likelihood, exponential model
lnLDeV            Likelihood, de Vaucouleurs model
mRrCc, mE1, mE2   Adaptive moments
specClass         Spectroscopic classification
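As a hedged sketch of such an extraction: the paper does not reproduce the query of Reference [4], so the table and column names below (PhotoObjAll, SpecObj, the _r suffix for r-band quantities, and the DR7 specClass codes) are a reconstruction from the public SDSS DR7 schema and should be verified against the CasJobs interface before use.

```python
# Hedged sketch of extracting the Table 1 features from SDSS DR7.
# The actual query of Reference [4] is not given in the paper; the
# names below follow the public DR7 schema (r-band columns carry an
# _r suffix) and would be submitted through the CasJobs web interface.
import pandas as pd

SQL = """
SELECT p.psfMag_r, p.fiberMag_r, p.petroMag_r, p.modelMag_r,
       p.petroRad_r, p.petroR50_r, p.petroR90_r,
       p.lnLStar_r, p.lnLExp_r, p.lnLDeV_r,
       p.mRrCc_r, p.mE1_r, p.mE2_r,
       s.specClass
FROM   PhotoObjAll AS p
JOIN   SpecObj     AS s ON s.bestObjID = p.objID
WHERE  s.specClass IN (1, 2)   -- 1 = star, 2 = galaxy in the DR7 schema
"""

# After running the query in CasJobs and exporting the result as CSV
# (the file name is a placeholder):
data = pd.read_csv("sdss_dr7_star_galaxy.csv")
```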
4.2 Experimental Analysis
4.2.1 Feature importance test
From the simulation experiment on the data features, the importance of each feature is shown in Fig. 1, in which the F score is a parameter indicating feature importance.
Fig. 1 Feature importance
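Fig. 1 itself is not reproducible here; the sketch below shows how such a ranking can be generated with the xgboost package, whose plot_importance counts how often each feature is used for a split and labels that count "F score". The tiny synthetic model stands in for the trained classifier of Section 4.2.2.

```python
# Sketch of producing a Fig. 1 style importance plot. The synthetic
# data stand in for the Table 1 features; xgboost's plot_importance
# counts the splits per feature and labels the x-axis "F score".
import matplotlib.pyplot as plt
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)

model = xgb.XGBClassifier(n_estimators=50).fit(X, y)
xgb.plot_importance(model)   # horizontal bars ranked by F score
plt.tight_layout()
plt.savefig("feature_importance.png")
```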
4.2.2 Optimization of the XGBoost model
XGBoost adopts the exact greedy algorithm for split finding; its procedure is given below. A grid search is used to optimize the parameters of the XGBoost algorithm: with a tree depth of 6 and a learning rate of 0.01, the model converges and attains its optimum at 710 iterations, and the trained model is used in the experiments.

Algorithm 1: Exact Greedy Algorithm for Split Finding
Input: I, instance set of the current node
Input: d, feature dimension
score ← 0
G ← Σ_{i∈I} g_i,  H ← Σ_{i∈I} h_i
for k = 1 to d do
    G_L ← 0, H_L ← 0
    for j in sorted(I, by x_jk) do
        G_L ← G_L + g_j,  H_L ← H_L + h_j
        G_R ← G − G_L,  H_R ← H − H_L
        score ← max(score, G_L^2/(H_L + λ) + G_R^2/(H_R + λ) − G^2/(H + λ))
    end
end
Output: split with the maximum score
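A hedged sketch of this tuning step: the paper reports only the selected operating point (tree depth 6, learning rate 0.01, 710 rounds), so the candidate grids below are illustrative, not the authors' actual search ranges, and the synthetic data stand in for the SDSS feature set.

```python
# Grid search over XGBoost hyperparameters (Sect. 4.2.2). The candidate
# ranges are illustrative; the paper gives only the selected values
# (max_depth 6, learning_rate 0.01, 710 boosting rounds).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Stand-in data; in the paper these are the Table 1 features and labels.
X_train, y_train = make_classification(n_samples=2000, n_features=13,
                                       random_state=0)

param_grid = {
    "max_depth": [4, 6, 8],
    "learning_rate": [0.005, 0.01, 0.05],
    "n_estimators": [300, 710, 1000],
}
search = GridSearchCV(XGBClassifier(), param_grid, cv=10, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)
```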
4.2.3 Experimental method and model comparison
In order to evaluate the performance of the XGBoost model in star/galaxy classification, ten-fold cross-validation is used (the complete data set is divided into 10 equal parts; each part in turn serves as the test set while the remaining 9 parts form the training set), and the model is compared with the FT, RF, GBDT, and Adaboost of Reference [4], as well as with the newer DBN and SDAE algorithms; the detailed comparison results are shown in Table 2. Likewise, to ensure that the comparison of classification results is meaningful, the same indicator of classification performance as in Reference [4] is adopted, namely the correctness rate of galaxy classification (CP), defined as

CP(v) = 100 \times \frac{N_{gal-gal}(v)_{\delta v}}{N^{tot}_{galaxy}(v)_{\delta v}} ,    (17)

in which N_{gal-gal}(v)_{\delta v} is the number of samples with magnitudes in the interval (v - \delta v/2, v + \delta v/2) that are correctly classified as galaxies, and N^{tot}_{galaxy}(v)_{\delta v} is the total number of galaxy samples with magnitudes in that interval. The data set is divided into intervals according to the value of modelMag (model magnitude).
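A small sketch of Eq. (17) as a per-interval computation; the magnitude, label, and prediction arrays are hypothetical stand-ins (label 1 denotes a galaxy).

```python
# Sketch of Eq. (17): galaxy correctness rate CP within one magnitude
# interval [lo, hi). Arrays are hypothetical stand-ins (1 = galaxy).
import numpy as np

def cp(modelmag, y_true, y_pred, lo, hi):
    is_gal = (modelmag >= lo) & (modelmag < hi) & (y_true == 1)
    total = is_gal.sum()                       # N_galaxy^tot in (lo, hi)
    correct = (is_gal & (y_pred == 1)).sum()   # N_gal-gal  in (lo, hi)
    return 100.0 * correct / total if total else np.nan

mag  = np.array([15.2, 18.7, 19.5, 20.8, 20.9])
y    = np.array([1, 1, 1, 1, 0])
yhat = np.array([1, 1, 0, 1, 1])
print(cp(mag, y, yhat, 19, 21))   # -> 50.0 (one of two faint galaxies correct)
```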
Here, the bright magnitude interval (14–19), the faint magnitude interval (19–21), and the faintest magnitude interval (20.5–21) denote the subsets of the data whose magnitudes correspond to those modelMag values.

Table 2 The accuracy of SDSS-DR7 galaxy classification

Method     CP(14–19)/%   CP(19–21)/%   CP(20.5–21)/%
XGBoost    99.87         95.72         79.48
GBDT       99.84         95.74         77.64
Adaboost   99.89         95.80         77.56
RF         99.88         95.44         75.71
SDAE       99.87         95.70         73.08
DBN        99.60         96.01         74.45
FT         99.64         84.98         74.04
From Table 2, obtained from the simulation experiment, we find that the galaxy classification accuracy of XGBoost is superior to that of the FT; in particular, in the faint magnitude interval the accuracy of XGBoost is nearly 10% higher than that of the FT. Compared with the more advanced DBN[12], SDAE, RF, Adaboost, GBDT, and so on, the galaxy classification accuracy in the faintest magnitude interval (modelMag 20.5–21) is also improved, by 2%–5%. It can be seen that the XGBoost model possesses a stronger generalization ability, and its performance in star/galaxy classification is better than that of the other algorithms. In addition, this paper uses about 0.88 million samples with modelMag values of 14–19 to test the training efficiency of XGBoost, GBDT, and Adaboost on the bright magnitude data set. The bright sources are used because the faint and faintest magnitude data sets contain few samples, so the differences between the compared results would not be evident. The results of the efficiency test are shown in Table 3.

Table 3 The training time of the models

Method     Time(14–19)/h
XGBoost    1.44
GBDT       15.58
Adaboost   28.05
The training times of the other models are not tested, because their galaxy classification accuracies are much lower than those of the above three models. The experimental results show that, for the same data set, the time consumed by XGBoost in model training is much less than that of GBDT and Adaboost. Compared with GBDT, XGBoost uses second-order information and can therefore converge more quickly on the training set. Hence, XGBoost is not only superior to the other models in accuracy, but also much more efficient than GBDT and Adaboost.
5. SUMMARY AND PROSPECT
Using the photometric data set of SDSS-DR7 and ten-fold cross-validation, this paper has studied star/galaxy classification based on the XGBoost algorithm. The model is continuously optimized using the commonly used methods of empirical parameter tuning, grid search, and so on, and, on the indicator of galaxy classification accuracy, it is compared with the FT, Adaboost, RF, GBDT, DBN, and other models. The experimental results show that the classification performance of the optimized XGBoost model on the star/galaxy data set is much better than that of the other models; at the same time, in model training XGBoost is much more efficient than GBDT and Adaboost. Therefore, in both accuracy and efficiency, the XGBoost model undoubtedly possesses an evident advantage. Although the accuracy for faint star/galaxy sources still needs to be improved, it is believed that, as the study of the XGBoost algorithm in astronomical data mining gradually deepens, the relevant astronomical fields will develop rapidly.

References
1  Zhang Y. H., Zhao Y. H., e-Science Technology and Application, 2011, 2, 13
2  Messier C., Connoissance des Temps for 1784, 1781, 227
3  Yan T. S., Zhang Y. H., Zhao Y. H., et al., Science China G, 2009, 39, 1794
4  Vasconcellos E. C., De Carvalho R. R., Gal R. R., et al., AJ, 2010, 141, 189
5  Sevilla-Noarbe I., Etayo-Sotos P., A&C, 2015, 11, 64
6  Kim E. J., Brunner R. J., MNRAS, 2017, 464, 4463
7  Li J. F., Wang Y. L., Hu S., et al., Spectroscopy and Spectral Analysis, 2016, 36, 3261
8  Liu R., Qiao X. J., Zhang J. N., et al., Spectroscopy and Spectral Analysis, 2017, 37, 1553
9  Xan M. A., Ben H., David B., MNRAS, 2018, 481, 4194
10 Friedman J. H., The Annals of Statistics, 2001, 29, 1189
11 Chen T., Guestrin C., ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, 785
12 Hinton G. E., Osindero S., Teh Y. W., Neural Computation, 2006, 18, 1527