Maize haploid recognition study based on nuclear magnetic resonance spectrum and manifold learning


Computers and Electronics in Agriculture 170 (2020) 105219

Contents lists available at ScienceDirect

Computers and Electronics in Agriculture journal homepage: www.elsevier.com/locate/compag

Maize haploid recognition study based on nuclear magnetic resonance spectrum and manifold learning

Wenzhang Ge a,1, Jinlong Li b,1, Yaqian Wang a, Xiaoning Yu a, Dong An a,c,⁎, Shaojiang Chen b,⁎

a College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China
b China Agricultural University National Maize Improvement Center, Beijing 100193, China
c Beijing Engineering Research Center for Internet of Things Technology in Agriculture, Beijing 100083, China

ARTICLE INFO

Keywords: Maize; Haploid identification; Nuclear magnetic resonance; Manifold learning; Multi-manifold learning

ABSTRACT

Haploid breeding is an important maize breeding technology, and a nondestructive, rapid and accurate method for identifying haploid kernels is the basis for developing it. The maize haploid recognition methods commonly adopted at present are near-infrared spectroscopy (NIRS), machine vision and nuclear magnetic resonance (NMR) oil measurement. In this paper, NMR spectrum analysis based on pattern recognition was used for haploid recognition. On the one hand, it can improve recognition efficiency; on the other hand, it overcomes the limitation of the NMR oil measurement method, which cannot be applied to maize kernels produced by a conventional inducer. The NMR spectrum is high-dimensional data, and manifold learning can preserve the nonlinear structural properties of the data while reducing dimensionality and can extract easily identifiable features from these structures. Most manifold learning algorithms used at present map data of different categories onto the same low-dimensional embedded manifold. To better preserve the essential structures of different categories of data, a new multi-manifold recognition framework is proposed in this paper for haploid recognition. The new framework uses a manifold learning algorithm to extract features from the NMR spectra of haploids and diploids separately, establishing two low-dimensional manifold representations; a new sample is mapped onto each of the two low-dimensional manifolds and then discriminated by distance measurement. Because the point-to-manifold distance is difficult to calculate directly, the low-dimensional manifold structure is expressed by manifold covering, and the point-to-manifold distance is obtained as the distance from the sample point to the covering geometry. Maize kernels generated by a high-oil induction system and by a conventional induction system were used in the experiments. First, the feasibility of NMR spectrum analysis based on pattern recognition for haploid identification was analyzed; then experiments were carried out with the single-manifold and multi-manifold identification frameworks; finally, the stability of the multi-manifold identification framework was discussed. Experimental results indicate that the recognition rate can reach 98.33% for maize kernels induced by the high-oil inducer and 90% for maize kernels induced by the conventional inducer, which shows that NMR spectra combined with a manifold learning algorithm are feasible for haploid recognition. Meanwhile, the proposed multi-manifold recognition framework achieved better results than the single-manifold recognition framework, raising the recognition rate by about 5%.

⁎ Corresponding authors at: College of Information and Electrical Engineering, China Agricultural University, 17 Tsinghua East Road, Beijing 100083, China (D. An). China Agricultural University National Maize Improvement Center, Beijing 100193, China (S. Chen).
E-mail addresses: [email protected] (D. An), [email protected] (S. Chen).
1 These authors contributed equally to this work.
https://doi.org/10.1016/j.compag.2020.105219
Received 14 February 2019; Received in revised form 24 July 2019; Accepted 8 January 2020
0168-1699/ © 2020 Elsevier B.V. All rights reserved.

1. Introduction

Haploid breeding has become one of the major maize breeding techniques. It helps maize breeding overcome problems such as long cycles, high cost and low efficiency, and it is very effective for developing new varieties (Weber, 2014). The primary condition for implementing this technology is to obtain a sufficient quantity of maize haploid kernels. The natural occurrence rate of maize haploids is only 0.05–0.1%, and even with artificial induction by a high-frequency haploid inducer the induction rate is only 8–15% (Cai et al., 2008; Chalyk and Rotarenco, 2001; Chen and Song, 2003; Dang et al., 2012; Liu and Song, 2000; Prigge et al., 2011; Rober et al., 2005). Therefore, one of the key problems in realizing high-throughput commercialization of haploid breeding lies in developing an effective haploid recognition system (Dwivedi et al., 2015).

The haploid recognition methods most extensively applied at present are near-infrared spectroscopy (NIRS), machine vision and NMR quantitative analysis. NIRS is rapid and nondestructive, and micro-NIR spectrometers scan quickly at low cost, which makes them useful for automatically selecting haploid maize kernels from hybrid kernels (Qin et al., 2016; Li et al., 2018; Lin et al., 2018). However, the NIR spectra of maize haploid kernels are easily affected by many factors, such as light, temperature, humidity, NIR intensity and the collecting instrument (Zhou et al., 2007). The machine vision method is based on the Navajo marker (Nanda and Chase, 1966), which produces different color features in the embryos of haploid and diploid kernels. Li et al. designed a haploid screening system based on machine vision; the success rate of obtaining pictures containing the embryo surface reached 90%, and the correct haploid recognition rate of the system was 95% (Li et al., 2016). However, this genetic marker method still has certain limitations. First, when the induced female parent carries dominant pigment-inhibiting genes, the color marker gene cannot be expressed; second, the genetic expression effects of different hybridized material combinations differ considerably (Zhang et al., 2013; Li et al., 2016). Chen and Song proposed using the oil xenia effect for haploid recognition, whereby haploid and diploid kernels induced by a high-oil inducer differ significantly in oil content (Chen and Song, 2003). Haploids can then be separated by measuring kernel oil content with an NMR spectrometer. An automatic haploid screening system based on NMR quantitative analysis has been developed, which achieves a recognition speed of 4 s per kernel with accuracy reaching 94% (Wang et al., 2016).

A pattern recognition method based on the low-field NMR spectrum of maize kernels was used in this paper. First, with this method it is unnecessary to calculate oil content, which removes the weighing step and improves automatic recognition efficiency; second, it is not necessary to prepare a calibration curve periodically to ensure the accuracy of oil content measurement, which reduces the operating difficulty; finally, as the method does not rely entirely on differences in oil content for haploid recognition, it is applicable to maize kernels generated by a conventional inducer, which account for the majority of maize varieties.

Low-field NMR technology has been widely applied to quality detection of agricultural products in recent years. Santos et al. used low-field 1H NMR to detect synthetic emulsions adulterated into milk at different volume ratios, performed multivariate data processing and T2 univariate processing, and established two classification models to control and classify milk quality (Santos et al., 2016). Ribeiro et al. used low-field 1H NMR to measure the longitudinal relaxation time T1 and transverse relaxation time T2 of honeys adulterated with 0–100% high-fructose corn syrup. After bi-exponential fitting of the results, they found that the differences among honeys of different adulteration ratios in pH, color, water content, water activity and ash content were reflected in T2, indicating that low-field NMR technology can differentiate pure honey from honey adulterated with high-fructose corn syrup (Ribeiro et al., 2014). These studies provide a theoretical foundation for this paper.

The NMR spectrum is high-dimensional data. A pattern recognition method needs to extract as much effective information as possible to realize accurate classification, so an effective feature extraction and dimensionality reduction method is an important link in the identification process. Traditional linear dimension reduction methods assume that the data have a global linear structure; representative methods are principal component analysis (PCA) and linear discriminant analysis (LDA). However, it has been found in practice that many high-dimensional data are distributed on a low-dimensional nonlinear structure embedded in the high-dimensional linear space. Traditional linear dimensionality reduction methods cannot effectively preserve this nonlinearity. Therefore, nonlinear dimensionality reduction methods such as the kernel method and manifold learning have been developed. The kernel method derives from the development and application of Support Vector Machine theory. It maps the original data into a higher-dimensional feature space through a nonlinear mapping and processes the mapped data with a linear learning algorithm in the new feature space. The main problems of the kernel method are its large computational cost and the dependence of the dimension reduction effect on the choice of kernel function, which usually has to be determined by experience (Huang, 2018). Drawing on the concept of a topological manifold, a manifold learning algorithm assumes that the high-dimensional observations are sampled from a latent low-dimensional manifold. The assumed manifold is learned through an explicit or implicit mapping, and the original data are projected from the ambient observation space into a low-dimensional embedding space in which certain global or local geometric attributes and internal structures of the original data are preserved (Huang and Liu, 2007). Owing to their nonlinear character and structure-preserving mapping, manifold learning algorithms have achieved favorable results in many applications, for instance facial expression image analysis, data visualization, image retrieval and anomaly detection, and they have become an important means of dimension reduction in many high-dimensional data analysis processes. Manifold learning was used in this study for dimension reduction, and its performance in processing the NMR spectra of maize kernels is discussed.

In addition, most manifold learning algorithms used at present map data of different categories onto the same low-dimensional embedding manifold. However, data of different categories have different features, and the assumption that these data lie on different manifold structures seems more reasonable (Hettiarachchi and Peters, 2015). A new multi-manifold framework is therefore proposed in this paper for the recognition of maize haploid kernels. The new framework performs recognition by establishing a low-dimensional manifold for each category and using distance to characterize similarity. To sum up, experiments were carried out on maize kernels generated by a high-oil induction system and by a conventional induction system, and two questions were mainly addressed: the feasibility of the pattern recognition method that combines NMR spectra with manifold learning dimension reduction for maize haploid recognition, and the effect of the proposed multi-manifold learning framework on recognition.

2. Materials and methods

2.1. Experimental samples

The experimental samples were divided into two parts, both provided by the National Maize Improvement Center of China Agricultural University. The experimental materials were generated by using an inducer carrying the R1-nj gene marker as the male parent to induce common hybrids; diploid kernels show a purple marker at the embryo, while haploid kernels are colorless at the embryo because of parthenogenesis. In part one, the high-oil inducer CHIO3 (oil content: 8.72%) was used as the male parent to induce two common hybrids, Zhengdan 958 (Zheng 58 × Chang 7-2) and Nongda 616 (C228 × C1116), and the resulting kernels were taken as experimental objects. In part two, the conventional inducer CAU5 (oil content: 4%) was used as the male parent to induce two common hybrids, Zhengdan 958 and Yudan 112 (L217 × L119A), and the resulting kernels were taken as experimental objects. For convenience of description, kernels generated by the above hybrid combinations are still named after the varieties of the female parent materials. To differentiate kernels generated by inducing Zhengdan 958 with the high-oil inducer from those generated by inducing it with the conventional inducer, the letter H (High) or C (Conventional) is appended to the variety name. There were four groups of cross combinations; for each group, 100 haploid kernels and 100 diploid kernels were selected as experimental objects, and all samples were numbered in order. Fig. 1 shows the appearance of the four groups of maize kernels.

The kernel oil content distributions of the two parts of samples used in the experiment (the oil content of each single kernel was measured with the NMR spectrometer) are shown in Fig. 2. The oil contents of haploids and diploids generated by the high-oil inducer clearly differ in overall distribution, but partial overlap exists within the interval 4%–5%. When the oil content measurement method is used for haploid screening, the oil content threshold must be set below the overlapping region, so a certain quantity of haploid kernels is lost. The oil contents of haploids and diploids generated by the conventional inducer are mostly mixed within the interval 2%–3% and cannot be effectively differentiated by the oil content measurement method.

Fig. 1. Partial experimental samples: (a) maize kernels generated by high-oil inducer; (b) maize kernels generated by conventional inducer.

Fig. 2. The distribution of oil content of haploid and diploid kernels: (a) Zhengdan 958H; (b) Zhengdan 958C.

2.2. NMR spectral acquisition

An MRI analyzer (model NMI20-015V-I) produced by Suzhou Niumag Analytical Instrument Co., Ltd. was used in this experiment. This instrument integrates relaxation analysis and MRI and can perform qualitative and quantitative analysis of substance contents such as water and oil. The field strength of its permanent magnet is 0.5 ± 0.05 T, which belongs to the low-field range. The radio-frequency pulse sequence was set to Q-CPMG to measure the transverse relaxation time T2. The main manually set parameters were as follows: number of signal sampling points TD = 120,006, number of repeated scans NS = 16, waiting time between repeated scans TW = 1000 ms, number of echoes NECH = 2000 and echo time TE = 0.6 ms. During measurement, each maize kernel was placed in a glass test tube, which was then placed in the NMR spectrometer; the measured data were not influenced by the embryo orientation of the kernel. Because the magnetic field is slightly influenced by temperature, which shifts the center frequency, an alternating acquisition scheme was adopted to reduce the influence of parameter drift on the measurement results: the spectral signal of a diploid kernel was acquired after that of a haploid kernel.

2.3. Single-manifold based classification framework

The key of a manifold learning algorithm lies in finding the low-dimensional embedding manifold of high-dimensional data and constructing the nonlinear mapping from the high-dimensional space to the low-dimensional space so as to realize dimension reduction. There are several classical nonlinear manifold learning algorithms for finding the low-dimensional embedding manifold, such as locally linear embedding (LLE), isometric mapping (Isomap) and Laplacian eigenmaps (LE). LLE is a local-relation-preserving algorithm. It assumes that the local neighborhood relations in the high-dimensional embedding space are the same as those in the corresponding local neighborhood of the intrinsic low-dimensional space: any point in the high-dimensional space can be reconstructed by a linear combination of its neighbor points, and this linear structure remains unchanged in the low-dimensional space (Roweis and Saul, 2000). The neighbor weights of each point are invariant to translation, rotation and scaling, so latent features of the object can be extracted and essential structures can be found. The dimension reduction process of LLE is described briefly here to give a more intuitive understanding of the nonlinear dimension reduction characteristics of manifold learning. Assume that $X = \{x_1, x_2, \dots, x_n\}$, $x_i \in \mathbb{R}^D$, in the high-dimensional space yields through LLE the low-dimensional embedding output $Y = \{y_1, y_2, \dots, y_n\}$, $y_i \in \mathbb{R}^d$, with $d \ll D$.

(1) Find the $k$ nearest neighbors of each sample point $x_i$.

(2) Calculate the weight matrix $W$ for the linear reconstruction of $x_i$ from the sample points in its neighborhood. $W$ is obtained by minimizing the objective function

$$\varepsilon(W) = \arg\min \sum_{i=1}^{n} \Big\| x_i - \sum_{j=1}^{k} w_{ij} x_j \Big\|^2 \quad \text{s.t.} \quad \sum_{j=1}^{k} w_{ij} = 1$$

where $w_{ij}$ is the linear reconstruction weight of $x_i$ by the $j$th point $x_j$. When $x_j$ is not one of the $k$ nearest neighbors of $x_i$, $w_{ij} = 0$.

(3) LLE keeps $w_{ij}$ unchanged in the low-dimensional space, so the coordinate $y_i$ corresponding to $x_i$ in the low-dimensional space is obtained by minimizing the objective function

$$\varepsilon(Y) = \arg\min \sum_{i=1}^{n} \Big\| y_i - \sum_{j=1}^{k} w_{ij} y_j \Big\|^2$$

Through algebraic transformation, the above equation can be rewritten as

$$\min \operatorname{tr}(Y M Y^T) \quad \text{s.t.} \quad Y Y^T = I$$

where $M = (I - W)^T (I - W)$. Solving by eigenvalue decomposition, the matrix consisting of the eigenvectors corresponding to the $d$ smallest nonzero eigenvalues of $M$ is $Y^T$.

All classical manifold learning algorithms are directed at training samples: they map the training samples from the high-dimensional space into a low-dimensional space through an implicit mapping and directly obtain the corresponding low-dimensional coordinates. For test samples, however, there is no uniform projection matrix, so the dimension reduction result of test data cannot be obtained directly and the generalization ability is poor. He et al. assume a low-dimensional coordinate $y_i = \alpha^T x_i$ (He et al., 2005). Fusing this linear transformation into LLE step (3) yields Neighborhood Preserving Embedding (NPE), whose result is a projection matrix. $\alpha$ is obtained by minimizing the objective function

$$\varepsilon(\alpha) = \arg\min \sum_{i=1}^{n} \Big\| y_i - \sum_{j=1}^{k} w_{ij} y_j \Big\|^2 = \arg\min \sum_{i=1}^{n} \Big\| \alpha^T x_i - \sum_{j=1}^{k} w_{ij} \alpha^T x_j \Big\|^2$$

Through algebraic transformation, the above equation can be rewritten as

$$\alpha = \arg\min \alpha^T X M X^T \alpha \quad \text{s.t.} \quad \alpha^T X X^T \alpha = 1$$

where $M = (I - W)^T (I - W)$. Because $\alpha$ lives in the $D$-dimensional input space, the solution comes from the generalized eigenvalue problem $X M X^T \alpha = \lambda X X^T \alpha$: the eigenvectors corresponding to the $d$ smallest eigenvalues form the matrix $A^T = [\alpha_1, \alpha_2, \dots, \alpha_d]$. Matrix $A$ is a projection matrix that maps test samples from the high-dimensional space into the low-dimensional space. NPE, as a linear approximation of LLE, retains the nonlinear dimension reduction characteristics of LLE, so it effectively preserves the local structure between samples, and it can also map test samples into a manifold structure that reflects their essential features. In summary, the combination of NPE dimension reduction and K-nearest-neighbor (KNN) classification was used as the single-manifold classification framework for haploid recognition in this paper, and the effect of this framework on NMR spectra is discussed.

2.4. Multi-manifold based classification framework

The manifold learning algorithm in the previous section maps data of different categories onto a single low-dimensional embedding manifold. However, actual data points may lie on different manifolds, which affects the description of the real spatial distribution of the data. A new multi-manifold framework is therefore proposed for haploid recognition. The details of the multi-manifold classification framework are introduced in this section in two parts: the first is the establishment of multiple low-dimensional manifolds and the classification method; the second is the distance measurement from a test sample to a manifold. Distance measurement is a common method for similarity calculation, and its accuracy directly affects the recognition performance of the new framework. The idea of manifold covering in bionic pattern recognition (BPR) (Wang, 2002) is borrowed to formulate a method for calculating the distance from a data point to a manifold, which is added to the multi-manifold classification framework.

2.4.1. Multi-manifold dimension reduction and classification

To establish the multiple low-dimensional manifolds, training samples with known category labels are grouped by category, and the manifold learning algorithm NPE is used to map the data of each category onto its own low-dimensional embedding manifold. Establishing a low-dimensional manifold for each category effectively preserves the essential features of that category. In the classification process, a test sample is mapped onto each low-dimensional manifold established in the previous step to obtain multiple low-dimensional representations. Each manifold corresponds to one category. For a sample belonging to a specific manifold, it is assumed that its low-dimensional representation on this manifold is closer to the manifold itself than its representations on the manifolds of other categories are to those manifolds; in other words, it conforms better to the feature structure of this category. Under this assumption, the distance from each low-dimensional representation of the test sample to the corresponding low-dimensional manifold is calculated, and the category of the nearest manifold is taken as the category of the test sample.
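The decision rule of Section 2.4.1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' MATLAB implementation: the function names, the toy identity projection matrices and the toy base points are all hypothetical, and the manifold skeleton is assumed to be the chain of segments joining consecutive base points (the paper only says the base points are connected "according to a certain structure and sequence"). The point-to-segment distance follows the piecewise formula that Section 2.4.2 defines for a covering unit.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def segment_sq_dist(x, x1, x2):
    """Squared distance from point x to the covering unit (segment) x1-x2."""
    length = math.sqrt(sq_dist(x1, x2))
    if length == 0.0:
        return sq_dist(x, x1)  # degenerate segment: fall back to a point
    # q is the signed length of the projection of (x - x1) onto the segment.
    q = sum((xi - ai) * (bi - ai) for xi, ai, bi in zip(x, x1, x2)) / length
    if q < 0:
        return sq_dist(x, x1)
    if q > length:
        return sq_dist(x, x2)
    return sq_dist(x, x1) - q * q


def project(A, x):
    """Apply a d x D projection matrix A (list of rows) to a D-vector x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def manifold_sq_dist(y, skeleton):
    """Smallest squared distance from y to any segment of the skeleton.

    Assumption: the skeleton has at least two base points, joined in order.
    """
    return min(segment_sq_dist(y, p, q) for p, q in zip(skeleton, skeleton[1:]))


def classify(x, models):
    """models maps label -> (A, skeleton); return the label of the nearest manifold."""
    return min(
        models,
        key=lambda label: manifold_sq_dist(
            project(models[label][0], x), models[label][1]
        ),
    )


# Hypothetical toy data: identity projections and two parallel skeletons.
identity = [[1.0, 0.0], [0.0, 1.0]]
toy_models = {
    "haploid": (identity, [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]),
    "diploid": (identity, [[0.0, 3.0], [1.0, 3.0], [2.0, 3.0]]),
}
print(classify([1.0, 0.5], toy_models))
```

In a real setting each class would carry its own NPE projection matrix learned from that class's training spectra, so the same test spectrum is projected differently before each distance is measured.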


Fig. 3. The system framework of identification model of haploid maize kernel: (a). single-manifold framework; (b). multi-manifold framework.

2.4.2. Distance measurement

A manifold is a topological structure, and calculating the distance from a test sample point to a manifold is a difficult problem. No uniform formula or definition has yet been given for the distance from a point to a manifold or for the distance between manifolds. To address this problem, the idea of bionic pattern recognition was borrowed to formulate the calculation of the distance from a point to a manifold. In bionic pattern recognition, the point set consisting of the continuous mapping "images" of all sample points of one category of object in the feature space is a closed subspace which, depending on the actual object, takes the form of manifolds of different dimensionalities. Bionic pattern recognition mainly studies how to better cover high-dimensional manifolds in the topological space. It holds that objects obey a "homologous continuity law": between two different samples of the same object category there is at least one gradual change process, and all samples in this gradual change process still belong to the category. The homologous continuity law is described mathematically as follows. In the n-dimensional feature space $\mathbb{R}^n$, let $A$ be the set of all samples of the same category. If $x, y \in A$, then for any $\varepsilon > 0$ there must exist a set $B$ satisfying the following condition:

$$B = \{x_1, x_2, \dots, x_k \mid x_1 = x,\ x_k = y,\ k \in \mathbb{N},\ \rho(x_m, x_{m+1}) < \varepsilon,\ 1 \le m \le k - 1\} \subset A$$

where $\rho(x_m, x_{m+1})$ is the distance between samples $x_m$ and $x_{m+1}$. With this law, a continuous optimal cover of the sample distribution of one category can be realized in the feature space. In general, the feature space $\mathbb{R}^n$ is a high-dimensional space with $n \ge 3$, and the sample distribution subspace of one object category is quite complicated in such a space. In actual algorithm design, the category subspace is decomposed, from the geometric perspective of high-dimensional images, into multiple closed simple geometric bodies. The union of these simple geometries approximates the original category subspace, so that a cover of complicated geometries can be realized (Wang and Wang, 2002). Therefore, the distance from a test sample to the covering manifold formed by the samples of the training set is computed by first calculating the distance to each single simple geometry; the distance from the point to the covering manifold is the smallest of these distances. The concrete distance calculation process is as follows:

(1) Select base points from the training sample set of each category;
(2) Connect the base points according to a certain structure and sequence so as to construct the model skeleton;
(3) Calculate the squared distance $d^2(x, \overline{x_1 x_2})$ from the sample point $x$ to each covering unit $\overline{x_1 x_2}$ of one category, using the formula

$$d^2(x, \overline{x_1 x_2}) = \begin{cases} \|x - x_1\|^2, & q(x, x_1, x_2) < 0 \\ \|x - x_2\|^2, & q(x, x_1, x_2) > \|x_1 - x_2\| \\ \|x - x_1\|^2 - q^2(x, x_1, x_2), & \text{otherwise} \end{cases}$$

where $q(x, x_1, x_2) = (x - x_1) \cdot \dfrac{x_2 - x_1}{\|x_2 - x_1\|}$.

Step (1) above uses the Kennard-Stone (K-S) algorithm to select the base points. K-S is a widely used and effective sample set selection method, and the sample set it selects on the basis of Euclidean distances between samples is more uniformly distributed and representative (Guo et al., 2016). The main process of the K-S algorithm is as follows:

Step 1: Calculate the Euclidean distance between every pair of samples. The two samples with the largest Euclidean distance are chosen as the first and second samples of the base point set.
Step 2: For every remaining sample, calculate its Euclidean distances to the already selected samples and record the minimum. Once this has been done for all remaining samples, the sample with the largest minimum distance is chosen as the next sample in the base point set.
Step 3: Repeat Step 2 until the base point set reaches the set number of samples.

To sum up, the flowcharts of the two classification frameworks for maize haploids used in this paper are shown in Fig. 3. Dimension reduction and classification model establishment were both implemented in MATLAB R2016a.

3. Results and discussion

3.1. NMR spectrum analysis

The NMR spectra of Zhengdan 958H, generated by the high-oil inducer, and Zhengdan 958C, generated by the conventional inducer, acquired in the experiment are shown in Fig. 4a and c, respectively. The x-coordinate represents relaxation time and the y-coordinate signal intensity; the spectral signal appears as a decay curve. According to Fig. 4a and b, the NMR spectra of haploids and diploids generated by the high-oil inducer clearly differ in overall distribution, and this difference provides a basis for using pattern recognition on NMR spectra to recognize haploids generated by the high-oil inducer. However, the two categories are quite close at their border, which may be related to the partial overlap in oil content between them, and this may affect classification to some degree. As seen in Fig. 4c and d, the signal intensity of haploids generated by the conventional inducer spans a broad range, whereas the diploid spectral signals are quite concentrated and almost completely overlap the spectral signals of some haploids, which makes classification very difficult.

Fig. 4. The NMR spectra of samples and partial enlarged detail: (a) Zhengdan 958H; (b) partial enlarged detail of Zhengdan 958H; (c) Zhengdan 958C; (d) partial enlarged detail of Zhengdan 958C.

To further examine the feasibility of using NMR spectra for haploid recognition, principal component analysis was carried out on two experimental maize groups, and the first two principal components were used to draw the score plot shown in Fig. 5, where the x-coordinate and y-coordinate are the first and second principal components (PC1 and PC2). The cumulative variance contribution rate of the two principal components in Fig. 5a is 96.04%, of which PC1 accounts for 95.88% and PC2 for 0.16%. Most of the information in the raw spectral data is contained in the first two principal components. According to the sample point distribution in the PCA feature space, haploids and diploids generated by the high-oil inducer are distinguishable on the whole, with only partial overlap. The cumulative variance contribution rate of the two principal components in Fig. 5b is 84.62%, of which PC1 accounts for 83.73% and PC2 for 0.89%. The first principal component includes most of the original information of the NMR spectrum, but its contribution rate is still low compared with that of the first principal component of the kernels induced by the high-oil inducer, and from the second principal component onward no single principal component contributes as much as 1%, so more principal components are needed to provide enough of the original information. According to the sample point distribution in the PCA feature space, haploids and diploids generated by the conventional inducer are heavily mixed and very difficult to distinguish.
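The variance contribution rates quoted above are each principal component's eigenvalue divided by the sum of all eigenvalues, and the cumulative rate is their running sum (for Fig. 5a, 95.88% + 0.16% = 96.04%). A minimal sketch of that bookkeeping follows; the function names and the eigenvalues in the example are hypothetical and are not taken from the paper's data.

```python
def contribution_rates(eigenvalues):
    """Variance contribution rate of each principal component, largest first."""
    total = sum(eigenvalues)
    return [ev / total for ev in sorted(eigenvalues, reverse=True)]


def cumulative_rate(rates, k):
    """Cumulative variance contribution rate of the first k components."""
    return sum(rates[:k])


def components_needed(rates, target):
    """Smallest number of leading components whose cumulative rate reaches target."""
    running = 0.0
    for count, rate in enumerate(rates, start=1):
        running += rate
        if running >= target:
            return count
    return len(rates)


# Hypothetical covariance eigenvalues, for illustration only.
rates = contribution_rates([0.5, 6.0, 1.0, 0.5])
print(rates, cumulative_rate(rates, 2), components_needed(rates, 0.9))
```

A helper such as `components_needed` makes the remark about the conventional-inducer group concrete: when PC1 carries less of the variance and the remaining components are individually small, more components are required to reach the same cumulative rate.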


Fig. 5. Two-dimension map of the sample distribution: (a). Zhengdan 958H; (b). Zhengdan 958C.

3.2. Experimental results of the high-oil-induced kernels

The two maize groups generated by the high-oil inducer were used in this experiment to explore whether the pattern recognition method combining NMR spectra with a manifold learning algorithm could effectively recognize haploids, and to compare the recognition performance of the single-manifold and multi-manifold frameworks. NMR spectra of 100 haploid kernels and 100 diploid kernels from each high-oil-induced kernel group were selected as experimental samples. Because a reasonable division of the sample set is important for recognition accuracy, the K-S algorithm was used to split the samples. For the single-manifold framework, the training set and test set were divided in a 7:3 ratio: the training set included 140 NMR spectra (70 haploids and 70 diploids) and the test set included 60 spectra (30 haploids and 30 diploids). Dimension reduction of all data in the training set was carried out using the NPE algorithm to obtain the mapping from the high-dimensional space to the low-dimensional manifold. Data in the test set were then projected into the low-dimensional space according to this mapping to obtain their low-dimensional expression. The spectral features of Zhengdan 958H extracted by the single-manifold framework were projected into a three-dimensional space (Fig. 6). Recognition rates of the test sets for the high-oil-induced kernel groups are listed in Table 1.

Table 1 Classification accuracy of spectral data of two high-oil-induced kernel groups.

| Group | Framework | Haploid (%) | Diploid (%) | Average (%) |
|---|---|---|---|---|
| Nongda 616 | Single-manifold | 96.67 | 93.33 | 95.00 |
| Nongda 616 | Multi-manifold | 96.67 | 96.67 | 96.67 |
| Zhengdan 958H | Single-manifold | 100 | 96.67 | 98.33 |
| Zhengdan 958H | Multi-manifold | 100 | 96.67 | 98.33 |

According to Fig. 6, the training set in the first three dimensions presents an approximately linear cluster state in the low-dimensional space. The form of the training set is closely related to the selected number of neighbors. Although such clustering hinders visual inspection of the internal data structure, mapping all data of the same category onto one point is more favorable from the standpoint of classification. Partial overlap exists in the training set, but the test set shows favorable separability. The model parameters, namely the number of nearest neighbors k and the dimension of the low-dimensional embedding d, were obtained through cross validation. The recognition rate of the Zhengdan 958H test set reached 98.33% when k = 13 and d = 16.

Fig. 6. Samples distribution by single-manifold framework of Zhengdan 958H.

In the multi-manifold framework, samples were likewise divided into training and test sets in a 7:3 ratio: the training set included 140 NMR spectra (70 haploids and 70 diploids) and the test set included 60 spectra (30 haploids and 30 diploids). Dimension reduction of the 70 haploid training spectra was carried out using the NPE algorithm to obtain one mapping from the high-dimensional space to a low-dimensional manifold; dimension reduction of the 70 diploid training spectra was carried out using the NPE algorithm to obtain a second mapping from the high-dimensional space to another low-dimensional manifold. Although each manifold was trained on fewer samples than in the single-manifold framework, prior knowledge of the homologous continuity law was introduced into the multi-manifold framework, so the effective information was no longer restricted by the number of training samples. Data in the test set were projected into the low-dimensional space according to each of the two mappings. The distance from the low-dimensional expression of each test sample to each manifold was then calculated, and the test sample was assigned to the category of the closest manifold. The experiment is shown in Fig. 7. According to the figure, when haploid and diploid test samples were mapped onto the manifold established from diploid training samples, the haploid test samples lay farther from the diploid manifold than the diploid test samples did. Similarly, when haploid and diploid test samples were mapped onto the manifold established from haploid training samples, the diploid test samples lay farther from the haploid manifold than the haploid test samples did, verifying that distance-based similarity measurement was feasible. At the same time, test samples belonging to a manifold filled in, to some degree, the regions where the training samples had low manifold density, which on the one hand reflects the advantage of manifold learning in preserving local linear relations in the low-dimensional space and on the other hand supports the assumption of homologous continuity. The recognition rate on the test set reached 98.33% when k = 12 and d = 16.

Owing to the difference between varieties, the recognition rate of the Nongda 616 test set under the single-manifold framework was lower than that of Zhengdan 958H by about 3.3%. Under the multi-manifold framework, the recognition rate of the Nongda 616 test set improved slightly, by about 1.7%, compared with the single-manifold framework. The experimental results of the two recognition frameworks showed that the pattern recognition method combining NMR spectra with manifold learning dimension reduction could effectively recognize haploids, and that the two frameworks had nearly the same recognition performance on the high-oil-induced kernel groups.
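As an illustration of the nearest-manifold decision rule used in the multi-manifold framework, the following Python sketch assigns a test sample to the class whose manifold coverage is closest. It is a simplified sketch under stated assumptions, not the implementation used in this study (which was carried out in MATLAB): the two NPE projections are assumed to have been computed already, the coverage of each manifold is approximated by line segments joining consecutive base points, and all function and variable names are hypothetical.

```python
import math


def point_to_segment(p, a, b):
    """Euclidean distance from point p to the line segment from a to b."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    denom = sum(c * c for c in ab)
    # Clamp the projection parameter so the closest point stays on the segment.
    t = 0.0 if denom == 0 else max(0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / denom))
    closest = [ai + t * c for ai, c in zip(a, ab)]
    return math.dist(p, closest)


def point_to_manifold(p, base_points):
    """Distance from p to a coverage made of segments between consecutive base points.

    Assumption: this segment-based coverage is a simplification chosen for the
    sketch; the coverage geometry actually used may differ.
    """
    if len(base_points) == 1:
        return math.dist(p, base_points[0])
    return min(point_to_segment(p, a, b) for a, b in zip(base_points, base_points[1:]))


def classify(embedded, manifolds):
    """Assign a sample to the class with the closest manifold.

    embedded[label]  : the sample's coordinates after projection with that
                       class's own mapping (assumed precomputed).
    manifolds[label] : ordered base points of that class in the same space.
    """
    distances = {label: point_to_manifold(embedded[label], pts) for label, pts in manifolds.items()}
    return min(distances, key=distances.get), distances


# Toy two-dimensional data, invented for illustration only.
manifolds = {
    "haploid": [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    "diploid": [(0.0, 3.0), (1.0, 3.0), (2.0, 3.0)],
}
sample = {"haploid": (1.5, 0.5), "diploid": (1.5, 0.5)}
label, distances = classify(sample, manifolds)
print(label, distances)
```

In this toy case the sample lies 0.5 from the haploid coverage and 2.5 from the diploid coverage, so it is labelled haploid. In a real application the two entries of `sample` would generally differ, because each class uses its own projection.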

3.3. Experimental results of the conventionally induced kernels

NMR spectra of 100 haploids and 100 diploids from each maize kernel group induced by the conventional inducer were selected as experimental samples. The data division method and experimental process were the same as in the previous section. Experimental results are shown in Figs. 8 and 9 and Table 2. As shown in Fig. 9, the haploid and diploid training samples were seriously mixed on the low-dimensional manifolds, while classification of the test samples depended heavily on the model established from the training samples, which seriously affected the recognition result for the test samples. Yudan 112 had a recognition rate below 90% under both the single-manifold and the multi-manifold frameworks, while Zhengdan 958C reached a recognition rate of 90% under the multi-manifold framework when k = 16 and d = 35. According to Table 2, the multi-manifold framework had a high recognition rate for haploids, and both frameworks tended to misclassify diploids as haploids. We believe that oil content, as an important feature, is likely to affect the recognition results: haploid oil content spans a wider range, so a diploid is easily misidentified as a haploid. In addition, the single-manifold framework mainly learned low-dimensional features shared by the two categories, whereas the multi-manifold framework learned low-dimensional features of haploids and diploids separately, so it could learn the characteristics unique to haploids, which reflects the advantage of the multi-manifold method. To sum up, the multi-manifold framework performed better when data of different categories were highly similar, and the pattern recognition method combining NMR spectra with a manifold learning algorithm had a certain recognition ability on the conventionally induced kernel groups.
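The per-class and average recognition rates reported for the test sets can be reproduced from counts of correctly recognized kernels. The Python sketch below is illustrative only: the function name is hypothetical, and the counts in the example (29 of 30 haploids and 24 of 30 diploids) are inferred from the reported rates of 96.67% and 80.00% for Yudan 112 under the multi-manifold framework together with the 30-sample test sets, not taken from raw data.

```python
def recognition_rates(true_labels, predicted_labels):
    """Return per-class recognition rates and their unweighted mean, in percent."""
    totals, correct = {}, {}
    for truth, pred in zip(true_labels, predicted_labels):
        totals[truth] = totals.get(truth, 0) + 1
        if truth == pred:
            correct[truth] = correct.get(truth, 0) + 1
    rates = {label: 100.0 * correct.get(label, 0) / n for label, n in totals.items()}
    # The average is the mean of the class rates; with equal class sizes it
    # equals the overall accuracy.
    rates["average"] = sum(rates.values()) / len(rates)
    return {name: round(value, 2) for name, value in rates.items()}


# Inferred example: 1 haploid and 6 diploids misclassified out of 30 each.
truth = ["haploid"] * 30 + ["diploid"] * 30
pred = ["haploid"] * 29 + ["diploid"] + ["diploid"] * 24 + ["haploid"] * 6
print(recognition_rates(truth, pred))
# {'haploid': 96.67, 'diploid': 80.0, 'average': 88.33}
```

Under this reading, the gap between the haploid and diploid rates corresponds to six diploid kernels labelled as haploid against one haploid kernel labelled as diploid.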

Fig. 7. Samples distribution by multi-manifold framework of Zhengdan 958H.

Fig. 8. Samples distribution by single-manifold framework of Zhengdan 958C.

Fig. 9. Samples distribution by multi-manifold framework of Zhengdan 958C.

Table 2 Classification accuracy of spectral data of two conventionally induced kernel groups.

| Group | Framework | Haploid (%) | Diploid (%) | Average (%) |
|---|---|---|---|---|
| Yudan 112 | Single-manifold | 86.67 | 83.33 | 85.00 |
| Yudan 112 | Multi-manifold | 96.67 | 80.00 | 88.33 |
| Zhengdan 958C | Single-manifold | 86.67 | 83.33 | 85.00 |
| Zhengdan 958C | Multi-manifold | 93.33 | 86.67 | 90.00 |

3.4. Parameter problem

The influence of parameter selection in the proposed multi-manifold learning framework on the haploid recognition result is discussed in this section. Few parameters are involved in the manifold learning method: only the dimensionality of the low-dimensional space and the number of neighbors need to be selected. As a general rule, the manifold learning result is quite sensitive to the number of neighbors, because this choice determines the overall manifold structure that is established. If the number of neighbors is too small, the topological structure of the data cannot be captured effectively; if it is too large, neighboring points will no longer lie on locally linear patches, which violates the LLE assumption. It can be seen from Fig. 10 that the recognition rate of the multi-manifold method proposed in this paper fluctuates greatly as k changes, so the method is quite sensitive to k. In particular, k should not be too small, as this degrades the recognition performance.
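Selecting the number of neighbors k and the embedding dimension d amounts to a search over a parameter grid. The following Python sketch shows one way such a search and a simple sensitivity check on k could be organized. It is a minimal sketch under stated assumptions: the function names and grid values are hypothetical, and `toy_score` is an arbitrary placeholder that stands in for a cross-validated recognition rate and is not derived from the data of this study.

```python
from itertools import product


def grid_search(score_fn, k_values, d_values):
    """Evaluate score_fn(k, d) on every grid point and return the best pair."""
    scores = {(k, d): score_fn(k, d) for k, d in product(k_values, d_values)}
    best = max(scores, key=scores.get)
    return best, scores


def sensitivity_to_k(scores):
    """For each k keep the best score over d; return that curve and its spread."""
    best_per_k = {}
    for (k, _d), score in scores.items():
        best_per_k[k] = max(score, best_per_k.get(k, float("-inf")))
    spread = max(best_per_k.values()) - min(best_per_k.values())
    return best_per_k, spread


def toy_score(k, d):
    """Placeholder score peaking at an arbitrary grid point (k = 10, d = 20).

    Replace with a real cross-validated recognition rate.
    """
    return 100.0 - abs(k - 10) - 0.1 * abs(d - 20)


best, scores = grid_search(toy_score, (5, 10, 15, 20), (10, 20, 30))
curve, spread = sensitivity_to_k(scores)
print(best, curve, spread)
```

A large spread in the per-k curve indicates the kind of sensitivity to k discussed in this section; a real scoring function would train and validate the classifier for each (k, d) pair.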


Fig. 10. The influence of k on recognition rate.

4. Conclusion

The feasibility of applying a pattern recognition method that combines NMR spectra with a manifold learning dimension reduction algorithm to maize haploid recognition was investigated in this study. Firstly, the experimental results verified that the pattern recognition method based on NMR spectra can be used for haploid recognition, and the recognition rate for the high-oil-induced kernels reached as high as 98%. For maize kernels generated by the conventional inducer, the serious overlap in oil content between haploids and diploids made recognition considerably more difficult, and the highest recognition rate reached 90%. Secondly, considering that samples of different categories lie on different manifolds, the single-manifold method was extended to a multi-manifold method and a new multi-manifold learning and classification framework was proposed. An experiment on the conventionally induced kernel groups showed that it was superior to the single-manifold method, improving the recognition rate by about 5%.

Acknowledgements

The authors gratefully acknowledge the financial support from the National Key R&D Program of China (Grant No. 2017YFD0701702).

Appendix A. Supplementary material

Supplementary data to this article can be found online at https://doi.org/10.1016/j.compag.2020.105219.
