Chinese Astronomy and Astrophysics 43 (2019) 539–548
Study of Star/Galaxy Classification Based on the XGBoost Algorithm†

LI Chao1,2   ZHANG Wen-hui3,4   LIN Ji-ming1

1 College of Information and Communication Engineering, Guilin University of Electronic Technology, Guilin 541004
2 Key Laboratory of Cognitive Radio and Information Processing, Ministry of Education, Guilin University of Electronic Technology, Guilin 541004
3 Guangxi Cooperative Innovation Center of Cloud Computing and Big Data, Guilin University of Electronic Technology, Guilin 541004
4 Guangxi Colleges and Universities Key Laboratory of Cloud Computing and Complex Systems, Guilin University of Electronic Technology, Guilin 541004
Abstract  Machine learning has achieved great success in many areas. Boosting algorithms adapt well to a wide range of scenarios with high accuracy and have played an important role in many fields, but their application in astronomy is still rare. To address the low classification accuracy on the faint star/galaxy source set of the Sloan Digital Sky Survey (SDSS), this paper introduces a recent achievement of machine learning, eXtreme Gradient Boosting (XGBoost). The complete photometric data set is obtained from SDSS-DR7 and divided into a bright source set and a faint source set according to magnitude. First, ten-fold cross-validation is applied to the bright and faint source sets respectively, and the XGBoost algorithm is used to build star/galaxy classification models. Then, grid search and related methods are used to tune the XGBoost parameters. Finally, taking the galaxy classification accuracy and other indicators as criteria, the classification results are analyzed by comparison with the models of the function tree (FT), adaptive boosting (Adaboost), random forest (RF), gradient boosting decision tree (GBDT), stacked denoising autoencoders (SDAE), and deep belief network (DBN). The experimental results show that XGBoost improves the galaxy classification accuracy on the faint source set by nearly 10% compared with the function tree algorithm, and improves the classification accuracy for the sources with the faintest magnitudes in that set by nearly 5%. Compared with the other traditional machine learning algorithms and deep neural networks, XGBoost also shows improvements of different degrees.

Key words  stars: fundamental parameters—galaxies: fundamental parameters—techniques: photometric—methods: data analysis

† Supported by the project of the Guangxi Cloud Computing and Big Data Collaborative Innovation Center and the Key Laboratory of Cloud Computing and Complex Systems of Guangxi Colleges (No. 1716)
Received 2019–01–07; revised version 2019–01–11
A translation of Acta Astronomica Sinica Vol. 60, No. 2, pp. 16.1–16.10, 2019
[email protected]
[email protected]

0275-1062/19/$ - see front matter © 2019 Elsevier B.V. All rights reserved.
doi: 10.1016/j.chinastron.2019.11.005

1. INTRODUCTION
In recent years, with the development of space science and technology and the successive implementation of large-scale survey projects, the amount of astronomical data has grown exponentially, with data volumes measured in terabytes and even petabytes; astronomy has evidently entered an unprecedented stage, i.e., the era of big data, massive information, and whole-waveband observation[1]. Faced with such huge and complicated astronomical data, performing highly efficient and accurate data analysis becomes extremely important. Star/galaxy classification has always been a fundamental task of astronomical data analysis, and the earliest studies on this subject can be traced back to the 18th century[2]. The original star/galaxy classification methods, based on morphology, heuristic division, and the like, were widely used in earlier times. With the continuous development of machine learning, more and more studies on star/galaxy classification algorithms have been carried out. For example, by removing outlier data and adopting an automatic clustering method, Yan et al.[3] performed star/galaxy classification on the photometric data of SDSS-DR6 (Sloan Digital Sky Survey Data Release 6); the results showed that the automatic clustering method has a rather high efficiency. Vasconcellos et al.[4] used about 13 different decision tree algorithms to perform star/galaxy classification on the photometric data of SDSS-DR7; the results showed that the function tree algorithm is superior to the other decision tree algorithms in star/galaxy classification. Sevilla-Noarbe et al.[5] studied the application of boosted decision trees to star/galaxy classification based on the feature data set given by the photometric image catalogue of SDSS-DR9; the experimental results showed that the classification performance of the boosted decision tree is superior to the photometric type classifier provided with the SDSS data set. Kim et al.[6] proposed a deep convolutional network framework, applied it to astronomical image data for star/galaxy classification, and obtained very good results. By comparing the classification performance of the deep belief network (DBN), neural network (NN), and support vector machine (SVM) algorithms on SDSS data, Li et al.[7] analyzed whether these three automatic classification methods are usable. Liu et al.[8] proposed a kind of nonparametric regression and combined it with Adaboost (adaptive boosting) for the
MK classification of stellar spectra, classifying stars by spectral type and luminosity type while recognizing the subtypes of each spectral type. In the context of ensemble learning, Xan et al.[9] explored star/galaxy classification in astronomy and gave a reasonable explanation. Although many excellent algorithms have been studied and used, they all have shortcomings, such as weak generalization ability: the classification accuracy is very high on the bright source set but low on the faint source set, and this problem has never been solved effectively. Up to now, the application of the XGBoost (eXtreme Gradient Boosting) algorithm to astronomical data mining, and in particular to star/galaxy classification, has rarely been seen. Based on this fact, this paper studies a star/galaxy classification algorithm based on XGBoost, applies the XGBoost method to the photometric data of SDSS-DR7 for the first time, and compares the classification results of XGBoost with those of the function tree (FT), Adaboost, random forest (RF), gradient boosting decision tree (GBDT), stacked denoising autoencoders (SDAE), DBN, and other models, to verify the application value of the XGBoost method in astronomical research.
2. SLOAN DIGITAL SKY SURVEY
Up to now, many sky survey projects have come into use around the world, and among them the SDSS is considered the most successful and influential. The photometric system of the SDSS measures celestial objects in the five bands u, g, r, i, and z. The photometric data used in this paper refer only to the r band. Within the photometric data, the subset that possesses both identified spectroscopic parameters and photometric parameters makes up only a very small part of the total; the large remaining part possesses photometric parameters only. This means that the XGBoost star/galaxy classification model proposed in this paper may provide an effective way to classify accurately those celestial objects without identified spectroscopic parameters.
3. BOOSTING ALGORITHM
The boosting algorithm is based on the idea that, for an arbitrarily complicated task, the final judgement obtained by properly synthesizing the judgements of many experts is better than the individual judgement of any single expert among them; in practice, this resembles the saying that "two heads are better than one". Boosting is a frequently used statistical learning method with very wide application and very good effect. In classification problems, it first learns multiple classifiers by iteratively updating the weights of the training samples, and then forms a linear combination of these classifiers to enhance the overall classification ability.
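As a minimal illustration of this committee-of-experts idea (not the algorithm evaluated later in this paper), the following Python sketch uses scikit-learn's AdaBoostClassifier, which reweights the training samples after each round and linearly combines weak decision stumps; the synthetic data are hypothetical stand-ins.

```python
# A committee of weak "experts": AdaBoost reweights the training samples
# after each round and linearly combines weak learners (decision stumps,
# the library default) into a classifier stronger than any single stump.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=12, random_state=0)

stump = DecisionTreeClassifier(max_depth=1)                 # one weak expert
committee = AdaBoostClassifier(n_estimators=100, random_state=0)

print("single stump:", stump.fit(X, y).score(X, y))
print("committee   :", committee.fit(X, y).score(X, y))
```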
3.1 GBDT Principle
The GBDT algorithm[10] is essentially a boosting algorithm with decision trees as the basis functions. The GBDT model can be expressed as an additive model of decision trees:

f_M(x) = \sum_{m=1}^{M} T(x; \theta_m) ,    (1)
where x denotes a sample, T(x; \theta_m) is a decision tree, \theta_m is the parameter of that tree, and M is the number of decision trees. GBDT uses the forward stagewise algorithm. First, the initial boosted tree is set to f_0(x) = 0; then, following the forward stagewise algorithm, the model at the m-th step is

f_m(x) = f_{m-1}(x) + T(x; \theta_m) ,    (2)

where f_{m-1}(x) is the current model. Finally, the parameter \theta_m of the next decision tree is determined by empirical risk minimization:

\hat{\theta}_m = \arg\min_{\theta_m} \sum_{i=1}^{N} L(y_i, f_{m-1}(x_i) + T(x_i; \theta_m)) ,    (3)
where, in Eq. (3), y_i is the true label of the i-th sample and N is the number of samples. When L is taken to be the squared-error loss,

L(y, f(x)) = (y - f(x))^2 ,    (4)

where y is the true label, the loss function becomes

L(y, f_{m-1}(x) + T(x; \theta_m)) = [y - f_{m-1}(x) - T(x; \theta_m)]^2 ,    (5)

so each new tree simply fits the residual y - f_{m-1}(x) of the current model.
For classification problems, the GBDT algorithm restricts the basis classifiers to classification trees. Even though the relationship between input and output in the training data may be complicated, the intrinsic character of the decision tree model ensures that a linear combination of decision trees can fit the training data well and yield the model parameters.
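A minimal sketch of Eqs. (1)–(5) under the squared-error loss: each new tree fits the residual y - f_{m-1}(x) of the current model. The synthetic data and the shallow scikit-learn regression trees used as basis functions are illustrative assumptions, not the paper's setup.

```python
# Forward stagewise additive modelling (Eqs. 1-5): with squared-error
# loss, the m-th tree T(x; theta_m) fits the residual of the current
# model f_{m-1}. Synthetic 1-D regression data for illustration.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(500)

M, trees = 50, []
f = np.zeros_like(y)                      # f_0(x) = 0
for m in range(M):
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, y - f)                    # fit the current residual
    trees.append(tree)
    f += tree.predict(X)                  # f_m = f_{m-1} + T(x; theta_m)

print("training MSE:", np.mean((y - f) ** 2))
```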
3.2 XGBoost Principle
XGBoost[11] is also a boosting algorithm. Unlike traditional GBDT, which uses only first-order derivative information in the optimization, XGBoost improves the optimization considerably: it performs a second-order Taylor expansion of the loss function, so that, while the first-order derivative information is retained, second-order derivative information is also added, which makes the model converge more quickly. Moreover, in order to control the complexity of the model, a regularization term is added to the loss function of XGBoost to prevent overfitting. The derivation of the XGBoost algorithm proceeds as follows. Assume that D = {(x_i, y_i)} (|D| = n, x_i \in \mathbb{R}^d, y_i \in \mathbb{R}) is a data set of n samples, each with d features, and x_i is the i-th sample. The tree ensemble model predicts the final result using K additive functions (K being the number of trees):

\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i) , \quad f_k \in \mathcal{F} ,    (6)
in which \mathcal{F} = {f(x) = w_{q(x)}} (q: \mathbb{R}^d \to {1, \dots, T}, w \in \mathbb{R}^T) is the function space of decision trees: q maps a sample in \mathbb{R}^d to the index of the corresponding leaf, T is the number of leaf nodes, and \mathbb{R}^T is the space of the leaf weights w. The functional relation between the sample x_i and the predicted value \hat{y}_i is denoted by \phi; w_{q(x)} maps each sample to a leaf value, i.e., the value of f(x); f_k is the model of the k-th tree, and each f_k corresponds to an independent tree structure q and leaf weights w. In order to learn the set of functions used in the model, the regularized objective function is defined as

L(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k) , \quad \Omega(f) = \gamma T + \frac{1}{2} \lambda \|w\|^2 ,    (7)
where l is a differentiable convex loss function that measures the difference between the predicted value \hat{y}_i and the true value y_i, \Omega is the penalty term on model complexity, \gamma is the regularization parameter on the number of leaves (restraining nodes from splitting further), and \lambda is the regularization parameter on the leaf weights. The objective is to minimize the loss function

L^{(t)} = \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t) ,    (8)

in which L^{(t)} is the objective function of the t-th tree; \hat{y}_i^{(t-1)} is the sum of the outputs of the previous t-1 trees, i.e., the prediction after t-1 trees; f_t is the model of the t-th tree, f_t(x_i) is its output, and the sum \hat{y}_i^{(t-1)} + f_t(x_i) forms the updated prediction. Define g_i and h_i as

g_i = \frac{\partial l(y_i, \hat{y}_i^{(t-1)})}{\partial \hat{y}_i^{(t-1)}} ,    (9)

h_i = \frac{\partial^2 l(y_i, \hat{y}_i^{(t-1)})}{\partial (\hat{y}_i^{(t-1)})^2} .    (10)
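As a concrete instance of Eqs. (9) and (10): for the binary logistic loss (a common choice for two-class problems such as star/galaxy; the paper does not fix l explicitly), with p = \sigma(\hat{y}^{(t-1)}) the derivatives take the closed forms g_i = p_i - y_i and h_i = p_i(1 - p_i), as in the sketch below.

```python
# Illustrative g_i and h_i (Eqs. 9-10) for the binary logistic loss,
# an assumed loss choice for star/galaxy labels (0/1). With
# p = sigmoid(yhat_prev): g = p - y and h = p * (1 - p).
import numpy as np

def grad_hess_logistic(y_true, yhat_prev):
    p = 1.0 / (1.0 + np.exp(-yhat_prev))  # sigmoid of the current margin
    g = p - y_true                        # first derivative  (Eq. 9)
    h = p * (1.0 - p)                     # second derivative (Eq. 10)
    return g, h

g, h = grad_hess_logistic(np.array([0, 1, 1]), np.array([0.2, -0.4, 1.3]))
print(g, h)
```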
Expanding the loss function at \hat{y}_i^{(t-1)} by the Taylor formula, we have

L^{(t)} \approx \sum_{i=1}^{n} \left[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) .    (11)
Removing the constant term, the loss function at the t-th iteration becomes

\tilde{L}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) .    (12)
Defining I_j = {i | q(x_i) = j} as the set of samples assigned to leaf node j, from Eq. (12) we obtain

\tilde{L}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
              = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T ,    (13)
where w_j is the weight of leaf node j. For a fixed tree structure q(x), the optimal weight w_j^* of leaf node j can be calculated as

w_j^* = - \frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda} ,    (14)
and substituting w_j^* into the objective function gives

\tilde{L}^{(t)}(q) = - \frac{1}{2} \sum_{j=1}^{T} \frac{\left( \sum_{i \in I_j} g_i \right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T .    (15)
Eq. (15) can be taken as an indicator of the quality of a tree structure, and used to score a tree structure q. Even so, it is practically impossible to enumerate all possible tree structures q; hence, a greedy algorithm is used to add branches iteratively at each leaf node. Assuming that I_L and I_R are the left and right leaf node sets after a split, i.e., I = I_L \cup I_R, the loss reduction of the split is

L_{split} = \frac{1}{2} \left[ \frac{\left( \sum_{i \in I_L} g_i \right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left( \sum_{i \in I_R} g_i \right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left( \sum_{i \in I} g_i \right)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma .    (16)
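The following numpy sketch evaluates Eqs. (14)–(16) for a single candidate split; the gradient, Hessian, and partition arrays are hypothetical stand-ins for the per-sample quantities defined above.

```python
# Sketch of Eqs. (14)-(16): optimal leaf weight and the gain of one
# candidate split, given per-sample gradients g and Hessians h.
import numpy as np

def leaf_weight(g, h, lam):
    # Eq. (14): w* = -sum(g) / (sum(h) + lambda)
    return -g.sum() / (h.sum() + lam)

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    # Eq. (16): structure-score improvement of splitting I into I_L, I_R
    def score(gs, hs):                      # one term of Eq. (15)
        return gs.sum() ** 2 / (hs.sum() + lam)
    gL, hL = g[left_mask], h[left_mask]
    gR, hR = g[~left_mask], h[~left_mask]
    return 0.5 * (score(gL, hL) + score(gR, hR) - score(g, h)) - gamma

g = np.array([-0.8, -0.6, 0.4, 0.7])         # hypothetical gradients
h = np.array([0.16, 0.24, 0.24, 0.21])       # hypothetical Hessians
mask = np.array([True, True, False, False])  # candidate left/right partition
print(split_gain(g, h, mask), leaf_weight(g[mask], h[mask], lam=1.0))
```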
4. EXPERIMENTAL TEST
4.1 Introduction of the Data Set
In order to compare with the existing algorithms on an equal footing, the star/galaxy data set adopted in this study is extracted from the SDSS database using an SQL (Structured Query
Language) query, consistent with Reference [4]. The data features are shown in Table 1.

Table 1 The features for the SDSS-DR7 star/galaxy classification

Variable          Attribute
psfMag            PSF (point-spread function) magnitude
fiberMag          Fiber magnitude
petroMag          Petrosian magnitude
modelMag          Model magnitude
petroRad          Petrosian radius
petroR50          Radius carrying 50% of the Petrosian flux
petroR90          Radius carrying 90% of the Petrosian flux
lnLStar           Likelihood, PSF model
lnLExp            Likelihood, exponential model
lnLDeV            Likelihood, de Vaucouleurs model
mRrCc, mE1, mE2   Adaptive moments
specClass         Spectroscopic classification
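As a hedged sketch of such an extraction: the paper does not reproduce the query of Reference [4], so the table and column names below (PhotoObjAll, SpecObj, the _r suffix for r-band quantities, and the DR7 specClass codes) are a reconstruction from the public SDSS DR7 schema and should be verified against the CasJobs interface before use.

```python
# Hedged sketch of extracting the Table 1 features from SDSS DR7.
# The actual query of Reference [4] is not given in the paper; the
# names below follow the public DR7 schema (r-band columns carry an
# _r suffix) and would be submitted through the CasJobs web interface.
import pandas as pd

SQL = """
SELECT p.psfMag_r, p.fiberMag_r, p.petroMag_r, p.modelMag_r,
       p.petroRad_r, p.petroR50_r, p.petroR90_r,
       p.lnLStar_r, p.lnLExp_r, p.lnLDeV_r,
       p.mRrCc_r, p.mE1_r, p.mE2_r,
       s.specClass
FROM   PhotoObjAll AS p
JOIN   SpecObj     AS s ON s.bestObjID = p.objID
WHERE  s.specClass IN (1, 2)   -- 1 = star, 2 = galaxy in the DR7 schema
"""

# After running the query in CasJobs and exporting the result as CSV
# (the file name is a placeholder):
data = pd.read_csv("sdss_dr7_star_galaxy.csv")
```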
4.2 Experimental Analysis
4.2.1 Feature importance test
From the simulation experiment on the data features, the importance of each feature is shown in Fig. 1, in which the F score is a parameter indicating feature importance.
Fig. 1 Feature importance
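Fig. 1 itself is not reproducible here; the sketch below shows how such a ranking can be generated with the xgboost package, whose plot_importance counts how often each feature is used for a split and labels that count "F score". The tiny synthetic model stands in for the trained classifier of Section 4.2.2.

```python
# Sketch of producing a Fig. 1 style importance plot. The synthetic
# data stand in for the Table 1 features; xgboost's plot_importance
# counts the splits per feature and labels the x-axis "F score".
import matplotlib.pyplot as plt
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)

model = xgb.XGBClassifier(n_estimators=50).fit(X, y)
xgb.plot_importance(model)   # horizontal bars ranked by F score
plt.tight_layout()
plt.savefig("feature_importance.png")
```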
4.2.2 Optimization of the XGBoost model
XGBoost adopts the exact greedy algorithm for split finding; its procedure is given below. A grid search is used to optimize the parameters of the XGBoost algorithm: with a tree depth of 6 and a learning rate of 0.01, the model converges and attains its optimum at 710 iterations, and the trained model is used in the experiments.

Algorithm 1: Exact Greedy Algorithm for Split Finding
Input: I, instance set of the current node
Input: d, feature dimension
score ← 0
G ← Σ_{i∈I} g_i,  H ← Σ_{i∈I} h_i
for k = 1 to d do
    G_L ← 0, H_L ← 0
    for j in sorted(I, by x_jk) do
        G_L ← G_L + g_j,  H_L ← H_L + h_j
        G_R ← G − G_L,  H_R ← H − H_L
        score ← max(score, G_L^2/(H_L + λ) + G_R^2/(H_R + λ) − G^2/(H + λ))
    end
end
Output: split with the maximum score
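A hedged sketch of this tuning step: the paper reports only the selected operating point (tree depth 6, learning rate 0.01, 710 rounds), so the candidate grids below are illustrative, not the authors' actual search ranges, and the synthetic data stand in for the SDSS feature set.

```python
# Grid search over XGBoost hyperparameters (Sect. 4.2.2). The candidate
# ranges are illustrative; the paper gives only the selected values
# (max_depth 6, learning_rate 0.01, 710 boosting rounds).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Stand-in data; in the paper these are the Table 1 features and labels.
X_train, y_train = make_classification(n_samples=2000, n_features=13,
                                       random_state=0)

param_grid = {
    "max_depth": [4, 6, 8],
    "learning_rate": [0.005, 0.01, 0.05],
    "n_estimators": [300, 710, 1000],
}
search = GridSearchCV(XGBClassifier(), param_grid, cv=10, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)
```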
4.2.3 Experimental method and model comparison
In order to evaluate the performance of the XGBoost model in star/galaxy classification, ten-fold cross-validation is used (the complete data set is divided into 10 equal parts; each part in turn serves as the test set while the remaining 9 parts form the training set), and the model is compared with the FT, RF, GBDT, and Adaboost of Reference [4], as well as with the newer DBN and SDAE algorithms; the detailed comparison results are shown in Table 2. Likewise, to ensure that the comparison of classification results is meaningful, the same indicator of classification performance as in Reference [4] is adopted, namely the correctness rate of galaxy classification (CP), defined as

CP(v) = 100 \times \frac{N_{gal-gal}(v)_{\delta v}}{N^{tot}_{galaxy}(v)_{\delta v}} ,    (17)

in which N_{gal-gal}(v)_{\delta v} is the number of samples with magnitudes in the interval (v - \delta v/2, v + \delta v/2) that are correctly classified as galaxies, and N^{tot}_{galaxy}(v)_{\delta v} is the total number of galaxy samples with magnitudes in that interval. The data set is divided into intervals according to the value of modelMag (model magnitude).
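A small sketch of Eq. (17) as a per-interval computation; the magnitude, label, and prediction arrays are hypothetical stand-ins (label 1 denotes a galaxy).

```python
# Sketch of Eq. (17): galaxy correctness rate CP within one magnitude
# interval [lo, hi). Arrays are hypothetical stand-ins (1 = galaxy).
import numpy as np

def cp(modelmag, y_true, y_pred, lo, hi):
    is_gal = (modelmag >= lo) & (modelmag < hi) & (y_true == 1)
    total = is_gal.sum()                       # N_galaxy^tot in (lo, hi)
    correct = (is_gal & (y_pred == 1)).sum()   # N_gal-gal  in (lo, hi)
    return 100.0 * correct / total if total else np.nan

mag  = np.array([15.2, 18.7, 19.5, 20.8, 20.9])
y    = np.array([1, 1, 1, 1, 0])
yhat = np.array([1, 1, 0, 1, 1])
print(cp(mag, y, yhat, 19, 21))   # -> 50.0 (one of two faint galaxies correct)
```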
Here, the bright magnitude interval (14–19), the faint magnitude interval (19–21), and the faintest magnitude interval (20.5–21) denote the subsets of the data whose magnitudes correspond to those modelMag values.

Table 2 The accuracy of SDSS-DR7 galaxy classification

Method     CP(14–19)/%   CP(19–21)/%   CP(20.5–21)/%
XGBoost    99.87         95.72         79.48
GBDT       99.84         95.74         77.64
Adaboost   99.89         95.80         77.56
RF         99.88         95.44         75.71
SDAE       99.87         95.70         73.08
DBN        99.60         96.01         74.45
FT         99.64         84.98         74.04
From Table 2, obtained from the simulation experiment, we find that the galaxy classification accuracy of XGBoost is superior to that of the FT; in particular, in the faint magnitude interval the accuracy of XGBoost is nearly 10% higher than that of the FT. Compared with the more advanced DBN[12], SDAE, RF, Adaboost, GBDT, and so on, the galaxy classification accuracy in the faintest magnitude interval (modelMag 20.5–21) is also improved, by 2%–5%. It can be seen that the XGBoost model possesses a stronger generalization ability, and its performance in star/galaxy classification is better than that of the other algorithms. In addition, this paper uses about 0.88 million samples with modelMag values of 14–19 to test the training efficiency of XGBoost, GBDT, and Adaboost on the bright magnitude data set. The bright sources are used because the faint and faintest magnitude data sets contain few samples, so the differences between the compared results would not be evident. The results of the efficiency test are shown in Table 3.

Table 3 The training time of the models

Method     Time(14–19)/h
XGBoost    1.44
GBDT       15.58
Adaboost   28.05
The training times of the other models are not tested, because their galaxy classification accuracies are much lower than those of the above three models. The experimental results show that, for the same data set, the time consumed by XGBoost in model training is much less than that of GBDT and Adaboost. Compared with GBDT, XGBoost uses second-order information and can therefore converge more quickly on the training set. Hence, XGBoost is not only superior to the other models in accuracy, but also much more efficient than GBDT and Adaboost.
5. SUMMARY AND PROSPECT
Using the photometric data set of SDSS-DR7 and ten-fold cross-validation, this paper has studied star/galaxy classification based on the XGBoost algorithm. The model is continuously optimized using the commonly used methods of empirical parameter tuning, grid search, and so on, and, on the indicator of galaxy classification accuracy, it is compared with the FT, Adaboost, RF, GBDT, DBN, and other models. The experimental results show that the classification performance of the optimized XGBoost model on the star/galaxy data set is much better than that of the other models; at the same time, in model training XGBoost is much more efficient than GBDT and Adaboost. Therefore, in both accuracy and efficiency, the XGBoost model undoubtedly possesses an evident advantage. Although the accuracy for faint star/galaxy sources still needs to be improved, it is believed that, as the study of the XGBoost algorithm in astronomical data mining gradually deepens, the relevant astronomical fields will develop rapidly.

References
1  Zhang Y. H., Zhao Y. H., e-Science Technology and Application, 2011, 2, 13
2  Messier C., Connoissance des Temps for 1784, 1781, 227
3  Yan T. S., Zhang Y. H., Zhao Y. H., et al., Science China G, 2009, 39, 1794
4  Vasconcellos E. C., De Carvalho R. R., Gal R. R., et al., AJ, 2010, 141, 189
5  Sevilla-Noarbe I., Etayo-Sotos P., A&C, 2015, 11, 64
6  Kim E. J., Brunner R. J., MNRAS, 2017, 464, 4463
7  Li J. F., Wang Y. L., Hu S., et al., Spectroscopy and Spectral Analysis, 2016, 36, 3261
8  Liu R., Qiao X. J., Zhang J. N., et al., Spectroscopy and Spectral Analysis, 2017, 37, 1553
9  Xan M. A., Ben H., David B., MNRAS, 2018, 481, 4194
10 Friedman J. H., The Annals of Statistics, 2001, 29, 1189
11 Chen T., Guestrin C., ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, 785
12 Hinton G. E., Osindero S., Teh Y. W., Neural Computation, 2006, 18, 1527