Interpolation of greenhouse environment data using multilayer perceptron


Computers and Electronics in Agriculture 166 (2019) 105023

Contents lists available at ScienceDirect

Computers and Electronics in Agriculture journal homepage: www.elsevier.com/locate/compag

Taewon Moon a, Seojung Hong a, Ha Young Choi a, Dae Ho Jung a, Se Hong Chang b, Jung Eek Son a,⁎

a Department of Plant Science and Research Institute of Agriculture and Life Sciences, Seoul National University, Seoul 08826, Republic of Korea
b Korea Electronics Technology Institute, Seongnam 13509, Republic of Korea

ARTICLE INFO

Keywords: Data loss; Linear; Multivariate regression; Random forest; Spline

ABSTRACT

For the analysis of greenhouse environments using big data, measurement data should be collected continuously without data loss caused by sensing and networking problems. Recently, deep learning approaches have been widely used for precision agriculture. However, deep learning methods require an enormous amount of reliable data. The objective of this study was to compare the interpolation accuracy of a multilayer perceptron (MLP) for greenhouse environment data with that of existing statistical and machine learning methods. Linear and spline interpolations were selected as statistical methods, and linear models, MLP, and random forest (RF) were selected as machine learning methods. The raw data used for interpolation were greenhouse environment data collected from October 2, 2016 to May 31, 2018 in a greenhouse where Irwin mango (Mangifera indica L. cv. Irwin) trees were cultivated. Under short-term data loss conditions, linear interpolation showed the highest R2 (average 0.96), while the MLP showed R2 = 0.95. Under long-term data loss conditions, however, the accuracies of the linear, spline, and regression interpolations decreased, whereas those of the MLP and RF remained stable, and the MLP was more accurate than the RF. Therefore, the MLP was better suited to interpolating greenhouse environment data, because short- and long-term data losses occur simultaneously when such data are collected. The trained MLP showed high accuracy in both short- and long-term data interpolations, indicating that the MLP can also be used to complement existing methods. The trained MLP accurately estimated the missing data in the greenhouse and will contribute to the analysis of big data collected from greenhouses.

1. Introduction

Greenhouse horticulture allows farmers to actively control the growth environment. In a controlled environment, crops can be produced year-round regardless of the climate. These benefits promote greenhouse use, and the area of protected horticulture is continuously increasing (Guo et al., 2012). To control growth environments efficiently, the information from cultivation facilities has to be collected and processed carefully. In previous studies, mathematical modeling was used to estimate the growth environment. Mathematical modeling generally does not need large amounts of data, so the amount of data collected was limited (Jung et al., 2016), and it was possible to collect and process each piece of data with care. In recent years, however, both the amount of data collected in greenhouses and the amount of data needed for research have been increasing (Jung et al., 2018). Machine learning and deep learning in particular are growing fast with improvements in computing power and storage capacity, and deep learning has been noted as a method for realizing artificial intelligence (Silver et al., 2017). Because of its academic influence and problem-solving potential, deep learning is being studied in many fields based on big data, including horticultural science (Moon et al., 2018b; Trejo-Perea et al., 2009; Yu et al., 2016).

To use deep learning methods, the available data must be sufficient in both amount and reliability. In greenhouses, environmental data are obtained from sensors at regular intervals, but the inside of a greenhouse is a harsh condition for electrical sensors, and the amount of data collected has increased abruptly, so the data are no longer collected with care. Because of the processing-time and power limitations of the equipment, sensors cannot obtain complete data without error, and even if a sophisticated sensor is used, data are likely to be lost due to various external causes such as blackouts. Data loss can distort results by adding bias to the overall data interpretation. To minimize this risk, a variety of methods have been studied in various fields to reduce the bias and analyze the data accurately by introducing a pre-processing step suited to each data characteristic (Batista and Monard, 2003; Schafer and Graham, 2002). Interpolation can compensate for data loss

Corresponding author. E-mail address: [email protected] (J.E. Son).

https://doi.org/10.1016/j.compag.2019.105023 Received 27 February 2019; Received in revised form 22 July 2019; Accepted 22 September 2019 0168-1699/ © 2019 Elsevier B.V. All rights reserved.


with a common technique of predicting unmeasured values from known data. The purpose of this study was to find a suitable method for overcoming the limitations of measurement equipment and supplementing missing data by comparing several interpolation methods applied to actual sensor data collected in a greenhouse. Based on the results of previous studies, we compared the multilayer perceptron (MLP) algorithm with linear and spline interpolation among statistical methods, and with multivariate regression and random forest (RF) among machine learning methods (Appelhans et al., 2015).

2. Materials and methods

2.1. Greenhouse and cultivation conditions

A double-span arch-type plastic house (34.4 W × 30.0 L × 5.7 H, m) located at Boryeong, Korea (36°23′34″N 126°29′12″E) was used for the experiment. The greenhouse was covered with polyolefin films with a thickness of 0.15 mm and a light transmittance of approximately 92%. The inside temperature was maintained at 25 ± 1 °C using a hot-water heating system, and ventilation started automatically at a set point of 27 °C. One hundred 4-year-old Irwin mango trees (Mangifera indica L. cv. Irwin) were grown in pots 0.8 m in diameter with a planting density of 6.25 m2. The soil in the pots had an organic content of 38–120 g·kg−1. Stem training and pruning were periodically performed to induce vegetative growth and to fix the tree structure. A drip irrigation system was used for watering. CO2 fertilization started on December 10, 2016, with a target CO2 concentration of 1200 μmol·mol−1.

2.2. Data collection and preprocessing

Environmental data such as temperature, relative humidity, light intensity, atmospheric pressure, and CO2 concentration were measured every 10 min using a complex sensor module (Korea Electronics Technology Institute, Seongnam, Korea) located at the center of the greenhouse. Datetime data such as date, hour, and minute were also used as training inputs. The environment data were normalized to the range of 0–1. In particular, the datetime data were normalized using a sine function to represent the time cycle. The ranges of the measured environmental data are described in Table 1. Assuming that no data are missing from measurements over the whole period, the total data size is 87,408. The data were measured from October 2, 2016 to May 31, 2018 (Table 1).

Table 1 Ranges of measured environmental factors and their corresponding observed data sizes with data loss ratio (with 87,408 as expected data size).

Environmental factor                                 Range           Measured data size (Data loss ratio, %)
Temperature (°C)                                     4.6–45.5        77,089 (11.8)
Relative humidity (%)                                16.0–98.3       77,237 (11.6)
Soil temperature (°C)                                11.1–34.9       76,589 (12.4)
Soil moisture content (%)                            9.6–42.7        76,886 (12.0)
Atmospheric pressure (hPa)                           997.4–1038.3    77,218 (11.7)
Photosynthetic photon flux density (μmol·m−2·s−1)    0.0–1213.6      77,761 (11.0)
CO2 concentration (μmol·mol−1)                       334.7–2720.7    77,130 (11.8)

2.3. Interpolation methods

For statistical methods, the distribution of missing data is estimated with a function based on known data values. There are several ways to define such functions. Among them, linear interpolation and spline interpolation, which are widely used, were selected as comparison targets for statistical interpolation. Linear interpolation, which is effective for high-density data (Malvar et al., 2004), estimates missing data from a linear equation through the two nearest data points (Eq. (1)).

$$f(x) = f(x_0) + \frac{f(x_1) - f(x_0)}{x_1 - x_0}\,(x - x_0) \tag{1}$$

where f is the linear interpolation function and x represents the target value as described in Table 2. The subscripts 0 and 1 denote the previous and next known values around the target value, respectively. Spline interpolation, on the other hand, uses polynomial equations of third or higher order. In this study, a cubic spline method using third-order polynomials was selected, which is the most natural method to fill missing data and the most accurate method for cost calculation (de Carvalho et al., 2007). The cubic curves are continuous up to the second derivative and simple to compute. The cubic interpolation method consists of finding a curve that passes through the known points and whose second derivatives coincide at those points (Eq. (2)).

$$g(x) = g(x_k) + a_1 (x - x_k) + a_2 (x - x_k)^2 + a_3 (x - x_k)^3 \tag{2}$$

where g is the spline interpolation function and x represents the target value. a_1, a_2, and a_3 are coefficients determined by x and g(x), and k is the index of the known points.

Among various machine learning methods, RF showed high performance in previous studies (Liess et al., 2012; Polishchuk et al., 2009). RF synthesizes the results from many decision trees to produce the final result. After several decision trees are trained on different randomly selected combinations of the training data, the average of the values predicted by the trees is taken as the final result for the missing data. This averaging process can filter irregular noise in the data and select important characteristics (Fig. 1). Recently, artificial neural networks and deep neural networks have shown high performance (Taormina and Chau, 2015; Wang et al., 2016). MLPs are neural network algorithms that learn patterns in data through several layers of connected perceptrons, which are modeled on the neural network structures of biological organisms (Fig. 2). In this study, a rectified linear unit (ReLU) was used as the activation function. The MLP consisted of four fully connected hidden layers. AdamOptimizer was used to train the MLP (Kingma and Ba, 2014), and the coefficients were modified to construct an MLP for solving a regression problem (Table 3). In general, neural networks are trained to minimize a cost (Rumelhart et al., 1988). In this study, the mean squared error (MSE), rather than the root mean squared error (RMSE), was used as the cost to reduce computation. To prevent overfitting, batch normalization was used (Ioffe and Szegedy, 2015). All programs were written in Python (v. 3.6.7, Python Software Foundation, Wilmington, USA) and its libraries. For the training of the RF and MLP, different combinations of environmental factors were used to estimate missing data (Table 2). The factors were selected empirically according to previous modeling studies (Froehlich et al., 1979; Chandra et al., 1981; Jolliet, 1994; He and Ma, 2010). The MLP was constructed and trained using PyTorch (Paszke et al., 2017). The independent variables for multivariate regression were the same as those used for the MLP.

2.4. Accuracy of interpolation methods

To compare the interpolation methods under short-interval data loss, 30% and 50% of the measured data were randomly selected with a random number generator and treated as missing data for testing. The average intervals of the randomly extracted data were 19.92 min for 30% and 21.23 min for 50%. The maximum interval was 50 min for both conditions. To compare the accuracy under long-interval data loss, the greenhouse environment data from June 14 to 16, 2017 and February 14 to 16, 2018 were selected, and missing data were created at intervals of 2, 6, 12, and 24 h. The selected periods were in the middle of the entire data collection period, and all the data were perfectly measured without


Table 2 Selected features for training individual environmental factor in the machine learning approach.

Environmental factor                         Selected feature
Air temperature                              Datetime, RH, atmospheric pressure, PPFD, CO2 concentration
Relative humidity (RH)                       Datetime, air temperature, soil temperature, MC, PPFD
Soil temperature                             Datetime, air temperature, RH, MC, PPFD
Soil moisture content (MC)                   Datetime, air temperature, RH, soil temperature
Atmospheric pressure                         Datetime, air temperature, RH
Photosynthetic photon flux density (PPFD)    Datetime, air temperature, RH, atmospheric pressure
CO2 concentration                            Datetime, air temperature, RH, atmospheric pressure, PPFD
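As a rough illustration of how the feature sets in Table 2 and the preprocessing described in Section 2.2 could be expressed in code, the sketch below stores the table as a mapping and normalizes inputs. The dictionary keys, the helper names `min_max` and `sine_time_of_day`, and the exact form of the sine transform (a 24-h period rescaled to 0–1) are assumptions made for illustration; the paper only states that a sine function was used to represent the time cycle.

```python
import math
from datetime import datetime

# Feature sets transcribed from Table 2; the identifiers themselves are illustrative.
FEATURES = {
    "air_temperature": ["datetime", "rh", "atmospheric_pressure", "ppfd", "co2"],
    "rh": ["datetime", "air_temperature", "soil_temperature", "mc", "ppfd"],
    "soil_temperature": ["datetime", "air_temperature", "rh", "mc", "ppfd"],
    "mc": ["datetime", "air_temperature", "rh", "soil_temperature"],
    "atmospheric_pressure": ["datetime", "air_temperature", "rh"],
    "ppfd": ["datetime", "air_temperature", "rh", "atmospheric_pressure"],
    "co2": ["datetime", "air_temperature", "rh", "atmospheric_pressure", "ppfd"],
}


def min_max(value, lower, upper):
    """Scale a measurement to the range 0-1 using the observed range (Table 1)."""
    return (value - lower) / (upper - lower)


def sine_time_of_day(timestamp):
    """Assumed cyclic encoding: a sine with a 24-h period, rescaled to 0-1."""
    minutes = timestamp.hour * 60 + timestamp.minute
    return 0.5 * (1.0 + math.sin(2.0 * math.pi * minutes / 1440.0))


# Example: air temperature range 4.6-45.5 degC from Table 1.
scaled_temperature = min_max(25.05, 4.6, 45.5)
scaled_time = sine_time_of_day(datetime(2017, 6, 14, 6, 0))
```

A single sine maps two different times of day to the same value, so a practical encoding might pair it with a cosine; that extension is not described in the paper.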

loss. The selected data were used to identify the difference in interpolation accuracy over data loss intervals. The remaining data were used for training in the machine learning methods. To avoid overfitting caused by too many training data, only the data from one month before the selected periods to one month after the selected periods were used for training in this case. The numbers of training data for the intervals of 2, 6, 12, and 24 h were 8271, 8253, 8250, and 8248, respectively. The root mean square error (RMSE) and coefficient of determination (R2) were calculated from the interpolated and measured data. The interpolation targets were the same as the collected environmental factors: air temperature, relative humidity, soil temperature, soil moisture content, atmospheric pressure, photosynthetic photon flux density, and CO2 concentration.

Fig. 1. Weekly averages of temperature, relative humidity, and photosynthetic photon flux density (PPFD) in the greenhouse from October 2, 2016 to May 31, 2018. Zeros were excluded when radiation was averaged.

Fig. 2. Structure of the multilayer perceptron (MLP) used in this study. Refer to Table 2 for input environmental factors.

3. Results and discussion

3.1. General accuracies of interpolation methods

The accuracy of the linear interpolation method was the highest, with R2 = 0.96 and RMSE = 12.69, in the estimation of random data loss (Fig. 3, Table 4). It was higher than the accuracies of the other methods, including the MLP. Spline interpolation is also a statistical method but generally showed a lower accuracy than linear interpolation. Even though the test data were randomly selected, the average data loss interval was short. In other words, random extraction gave results similar to those of the short-term missing condition, because the change of the greenhouse environment was approximately linear over such short intervals. Therefore, linear interpolation was suitable for the random data extraction condition. Meanwhile, interpolation using the MLP showed accuracies competitive with those of the statistical interpolation methods. However, multivariate regression did not show acceptable accuracies. Only one MLP structure was used, to keep the same conditions as the other interpolation methods. Because the environmental factors have different characteristics, it is difficult to generalize all environmental factors using a single MLP structure. Therefore, if the optimum number of layers, number of perceptrons, and coefficients were determined for each factor, the accuracies could be improved. If the structure of the MLP is appropriately designed, a high accuracy of 0.9 or higher can be obtained (Moon et al., 2018b). If MLPs with different structures are applied to each of the environmental factors, the computational complexity will be greater than that of the other interpolation methods used in this study, but higher accuracy can be expected. Among the environmental factors, the estimation accuracy of soil temperature was the highest on average (R2 = 0.93), while that of PPFD was the lowest (R2 = 0.69). In particular, the overall accuracies for PPFD were stable in the machine learning methods regardless of loss percentage but declined in the statistical methods (Fig. 4). All other

Table 3 Parameters used in the multilayer perceptron related to the structure and training options.

Parameter                Value     Description
Layer                    4         Number of hidden layers in the neural network
Number of perceptrons    512       The number of perceptrons used for hidden layers
Learning rate            0.01      Learning rate used by AdamOptimizer method
β1                       0.9       Exponential mass decay rate for the momentum estimates
β2                       0.999     Exponential velocity decay rate for the momentum estimates
ε                        0.0001    A constant for numerical stability
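To make the settings in Table 3 concrete, the following is a minimal PyTorch sketch of a regression MLP with four hidden layers of 512 perceptrons, ReLU activations, batch normalization, an MSE cost, and Adam with the listed learning rate, β1, β2, and ε. It is not the authors' code: the helper name `build_mlp`, the placement of batch normalization before each ReLU, the single output unit, the input width of 5, and the random stand-in batch are all assumptions for illustration.

```python
import torch
from torch import nn


def build_mlp(n_inputs, n_hidden=512, n_layers=4):
    """Stack n_layers hidden blocks (Linear -> BatchNorm -> ReLU) and one output unit."""
    layers = []
    width = n_inputs
    for _ in range(n_layers):
        # The BatchNorm-before-ReLU ordering is an assumption, not stated in the paper.
        layers += [nn.Linear(width, n_hidden), nn.BatchNorm1d(n_hidden), nn.ReLU()]
        width = n_hidden
    layers.append(nn.Linear(width, 1))  # single regression target
    return nn.Sequential(*layers)


model = build_mlp(n_inputs=5)  # assumed input width, e.g. datetime plus four factors
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, betas=(0.9, 0.999), eps=1e-4)
loss_fn = nn.MSELoss()

# One training step on random stand-in data (not greenhouse measurements).
inputs = torch.rand(32, 5)
targets = torch.rand(32, 1)
optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()
optimizer.step()
```

In practice one such network would be trained per target factor, with the inputs chosen according to Table 2.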


Fig. 3. Depiction of randomly extracted data from October 2, 2016 to May 31, 2018. Green and red colors represent existing and lost data, respectively. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

environmental factors, except with multivariate regression, had accuracies higher than 0.9, indicating that the interpolation methods accurately estimated the environmental factors. However, the sensor output was limited to the maximum unsigned 16-bit value of 65,535, and the measured photon values exceeded this limit. That is, because the correct value could not be recorded during the daytime in summer, when light intensity is very high, the original data have a discontinuous distribution. Therefore, the PPFD data were limited, resulting in the lowest interpolation accuracy among all environmental factors. In addition, the regression methods relate target values to input values (Kass, 1990). PPFD is directly related to solar radiation and is zero after sunset (Rich et al., 1993). However, the other environmental factors inside the greenhouse kept fluctuating after sunset, so PPFD would have been predicted as non-zero values. Therefore, the accuracy was likely reduced because the zero values were not predicted properly. Interpolation of PPFD would require a different approach from regression analysis.

Fig. 4. Validation R2 values at different loss percentages for soil temperature (A) and photosynthetic photon flux density (PPFD, B). The methods are abbreviated as linear (LIN), spline (SPL), multivariate regression (MVR), random forest (RF), and multilayer perceptron (MLP).
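The random-loss validation discussed in this section can be sketched in a few lines: hide a fraction of the samples, fill them with the linear rule of Eq. (1), and score the filled values with R2 and RMSE. The sketch below uses a synthetic sinusoidal series as a stand-in for one day of air temperature at 10-min steps; the function names, the synthetic data, and the fixed random seed are assumptions for illustration and do not reproduce the study's data or results.

```python
import math
import random


def linear_fill(times, values):
    """Fill None entries with Eq. (1), using the nearest known point on each side."""
    known = [i for i, v in enumerate(values) if v is not None]
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        prev = max((k for k in known if k < i), default=None)
        nxt = min((k for k in known if k > i), default=None)
        if prev is None or nxt is None:
            continue  # no known neighbour on one side: leave the gap open
        slope = (values[nxt] - values[prev]) / (times[nxt] - times[prev])
        filled[i] = values[prev] + slope * (times[i] - times[prev])
    return filled


def rmse(measured, estimated):
    """Root mean square error between measured and estimated values."""
    return math.sqrt(sum((m - e) ** 2 for m, e in zip(measured, estimated)) / len(measured))


def r2(measured, estimated):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - e) ** 2 for m, e in zip(measured, estimated))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in for one day of air temperature at 10-min steps (not real data).
random.seed(0)
times = [10 * i for i in range(144)]
truth = [25.0 + 5.0 * math.sin(2.0 * math.pi * t / 1440.0) for t in times]

# Hide about 30% of the interior samples at random, keeping both end points.
hidden = sorted(random.sample(range(1, 143), k=43))
hidden_set = set(hidden)
observed = [None if i in hidden_set else v for i, v in enumerate(truth)]

estimate = linear_fill(times, observed)
measured = [truth[i] for i in hidden]
estimated = [estimate[i] for i in hidden]
print(f"R2 = {r2(measured, estimated):.3f}, RMSE = {rmse(measured, estimated):.3f}")
```

On such a smooth series with short gaps, linear filling is nearly exact, which is in line with the observation that random extraction behaves like short-term loss.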

3.2. Accuracies of interpolation methods at different data loss intervals

In the case of short-term data loss, the interpolation accuracy was similar to that obtained with randomly extracted data (Tables 5 and 6). However, the accuracy of the statistical methods was unstable as the interval changed (Figs. 5 and 6). The accuracy of the RF and MLP was more stable than that of the statistical methods when interpolating long-term missing data in June (Table 5). However, multivariate regression was hardly predictive of missing data in either February or June. In this study, a basic regression structure was used; a more elaborate structure or regression method could show better results. Although the trained RF showed somewhat stable accuracies regardless of loss interval, the RF was hardly predictive of missing data in February compared to June (Table 6). Since the RF showed relatively high accuracy in only certain periods, the prediction failure can be attributed to overfitting of the RF (Díaz-Uriarte and De Andres, 2006). In this study, a commonly used structure was applied. If the structure is refined or the computation of the RF is increased, it could show better results. The MLP showed relatively high accuracy for all periods and intervals. With the statistical methods, the estimated value tended to deviate completely from the actual value as the missing period increased. However, when using the MLP, missing

Table 4 The test accuracy (R2) and root mean square errors (RMSEs) of environmental factors at different interpolation methods: linear, spline, multivariate regression (MVR), random forest (RF) and multilayer perceptron (MLP). The data were randomly missing in each environmental factor (from October 2016 to May 2018). Values are given as R2|RMSE.

30% data loss
Environmental factor              Linear          Spline          MVR              RF              MLP
Air temperature (°C)              0.971|0.945     0.957|1.139     0.465|4.034      0.861|2.054     0.977|0.839
Relative humidity (%)             0.970|2.573     0.959|2.987     0.387|11.587     0.866|5.421     0.976|2.285
Soil temperature (°C)             0.999|0.078     0.951|1.009     0.763|2.209      0.959|0.917     0.998|0.206
Soil moisture content (%)         0.998|0.253     0.976|0.995     0.115|6.085      0.864|2.382     0.987|0.740
Atmospheric pressure (hPa)        0.999|0.130     0.982|1.013     0.402|5.908      0.758|3.758     0.925|2.093
PPFDa (μmol·m−2·s−1)              0.858|69.574    0.756|91.397    0.283|156.560    0.783|86.014    0.849|71.720
CO2 concentration (μmol·mol−1)    0.987|15.272    0.982|17.240    0.264|116.006    0.785|62.755    0.978|20.199

50% data loss
Environmental factor              Linear          Spline          MVR              RF              MLP
Air temperature (°C)              0.970|0.966     0.956|1.164     0.461|4.086      0.856|2.113     0.974|0.833
Relative humidity (%)             0.970|2.542     0.961|2.885     0.380|11.562     0.866|5.372     0.975|2.305
Soil temperature (°C)             0.999|0.081     0.951|1.003     0.767|2.195      0.959|0.925     0.998|0.209
Soil moisture content (%)         0.999|0.247     0.978|0.976     0.112|6.148      0.865|2.401     0.988|0.727
Atmospheric pressure (hPa)        0.999|0.143     0.982|1.007     0.392|5.906      0.748|3.800     0.930|2.009
PPFDa (μmol·m−2·s−1)              0.851|71.794    0.767|89.823    0.292|156.645    0.777|87.918    0.843|73.654
CO2 concentration (μmol·mol−1)    0.987|15.046    0.984|16.831    0.264|115.099    0.791|61.377    0.978|19.734

a Photosynthetic photon flux density.
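As a quick arithmetic check on how the summary figures quoted in Section 3.1 (R2 = 0.96, RMSE = 12.69 for linear interpolation) relate to Table 4, the snippet below takes the unweighted mean over the seven factors at 30% data loss. That the published summary was computed as an unweighted mean is an assumption; the variable names are illustrative.

```python
# (R2, RMSE) pairs for linear interpolation at 30% data loss, copied from Table 4.
linear_30 = {
    "air_temperature": (0.971, 0.945),
    "relative_humidity": (0.970, 2.573),
    "soil_temperature": (0.999, 0.078),
    "soil_moisture_content": (0.998, 0.253),
    "atmospheric_pressure": (0.999, 0.130),
    "ppfd": (0.858, 69.574),
    "co2_concentration": (0.987, 15.272),
}

# Unweighted means over the seven environmental factors.
mean_r2 = sum(r for r, _ in linear_30.values()) / len(linear_30)
mean_rmse = sum(e for _, e in linear_30.values()) / len(linear_30)
print(f"mean R2 = {mean_r2:.3f}, mean RMSE = {mean_rmse:.2f}")
```

The mean RMSE reproduces the quoted 12.69, while the mean R2 comes out slightly higher than the quoted 0.96, so the published R2 may have been averaged or rounded differently. Note also that the RMSE mean mixes different units and is therefore dominated by PPFD and CO2 concentration.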


Table 5 The test accuracy (R2) and root mean squared error (RMSE) of environmental factors at different data loss intervals (from June 14 to 16, 2017). Values are given as R2|RMSE.

Linear
Environmental factor              Loss interval (h): 2    6                12               24
Air temperature (°C)              0.756|1.333             0.711|1.450      0.458|1.978      –y|3.318
Relative humidity (%)             0.881|3.057             0.771|4.230      0.433|6.627      –|9.742
Soil temperature (°C)             0.981|0.167             0.950|0.269      0.227|1.056      –|1.299
Soil moisture content (%)         0.960|0.429             0.945|0.504      0.837|0.868      0.593|1.372
Atmospheric pressure (hPa)        0.990|0.147             0.955|0.312      0.837|0.592      0.806|0.647
PPFDa (μmol·m−2·s−1)              0.850|113.459           0.756|129.956    0.188|264.341    –|362.455
CO2 concentration (μmol·mol−1)    0.923|27.545            0.903|30.859     0.342|80.652     –|122.922

Spline
Environmental factor              Loss interval (h): 2    6                12               24
Air temperature (°C)              0.741|1.375             0.706|1.462      –|2.827          –|3.332
Relative humidity (%)             0.877|3.099             –|30.176         –|9.083          –|10.130
Soil temperature (°C)             0.315|0.990             0.378|0.942      0.063|1.163      –|1.271
Soil moisture content (%)         0.776|1.018             0.768|1.035      0.719|1.142      0.396|1.672
Atmospheric pressure (hPa)        0.877|0.515             0.878|0.513      0.856|0.556      0.863|0.543
PPFDa (μmol·m−2·s−1)              0.827|121.620           0.673|168.622    –|429.287        –|362.455
CO2 concentration (μmol·mol−1)    0.907|30.165            0.849|38.513     –|104.133        –|124.287

Multivariate regression
Environmental factor              Loss interval (h): 2    6                12               24
Air temperature (°C)              0.692|1.498             0.680|5.230      0.679|1.521      0.683|1.520
Relative humidity (%)             0.652|5.225             0.650|5.230      0.647|5.228      0.652|5.215
Soil temperature (°C)             0.796|0.541             0.796|0.539      0.797|0.541      0.798|0.540
Soil moisture content (%)         –|2.477                 –|2.479          –|2.481          –|2.480
Atmospheric pressure (hPa)        –|1.529                 –|1.532          –|1.530          –|1.530
PPFDa (μmol·m−2·s−1)              0.615|181.482           0.608|184.615    0.609|184.499    0.611|184.928
CO2 concentration (μmol·mol−1)    0.838|39.849            0.831|40.715     0.834|40.562     0.834|40.557

Random forest
Environmental factor              Loss interval (h): 2    6                12               24
Air temperature (°C)              0.817|1.155             0.819|1.149      0.809|1.176      0.813|1.168
Relative humidity (%)             0.569|5.808             0.539|6.002      0.479|6.354      0.432|6.666
Soil temperature (°C)             0.877|0.420             0.857|0.453      0.795|0.544      0.786|0.555
Soil moisture content (%)         –|2.861                 –|3.523          –|3.718          –|4.392
Atmospheric pressure (hPa)        0.697|0.808             0.207|0.455      –|2.034          –|3.203
PPFDa (μmol·m−2·s−1)              0.887|98.173            0.879|102.525    0.877|102.960    0.881|102.097
CO2 concentration (μmol·mol−1)    0.960|19.687            0.947|22.833     0.948|22.740     0.947|22.915

Multilayer perceptron
Environmental factor              Loss interval (h): 2    6                12               24
Air temperature (°C)              0.879|0.939             0.824|1.133      0.806|1.182      0.767|1.304
Relative humidity (%)             0.890|2.930             0.727|4.620      0.661|5.123      0.670|5.078
Soil temperature (°C)             0.984|0.152             0.968|0.214      0.949|0.270      0.958|0.247
Soil moisture content (%)         0.908|0.651             0.352|1.731      –|2.347          –|2.482
Atmospheric pressure (hPa)        –|1.824                 –|3.176          –|3.896          –|4.417
PPFDa (μmol·m−2·s−1)              0.906|89.531            0.886|99.719     0.883|100.176    0.890|98.307
CO2 concentration (μmol·mol−1)    0.984|12.642            0.957|20.517     0.957|20.607     0.958|20.426

a Photosynthetic photon flux density.
y Negative value.
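The dashes in the table above stand for negative R2 values. The paper does not spell out its formulas, so the following assumes the usual definitions of the coefficient of determination and RMSE, with y_i the measured values, \hat{y}_i the interpolated values, \bar{y} the mean of the measured values in the test window, and n the number of test points:

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
```

Under this definition R2 becomes negative whenever the squared interpolation error exceeds the spread of the measurements around their own mean, that is, when the interpolation is worse than filling the gap with the window average. For example, measured values (1, 2, 3) with estimates (3, 2, 1) give a residual sum of 8 and a total sum of 2, so R2 = 1 − 8/2 = −3.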

Table 6 The test accuracy (R2) and root mean squared error (RMSE) of environmental factors at different data loss intervals (from February 14 to 16, 2018). Values are given as R2|RMSE.

Linear
Environmental factor              Loss interval (h): 2    6                12               24
Air temperature (°C)              0.894|0.914             0.338|2.284      –y|3.646         –|3.145
Relative humidity (%)             0.925|2.539             0.584|6.000      0.282|7.859      –|11.178
Soil temperature (°C)             0.981|0.089             0.851|0.250      0.431|0.488      0.044|0.633
Soil moisture content (%)         0.939|0.082             0.585|0.214      –|0.373          –|0.368
Atmospheric pressure (hPa)        0.981|0.289             0.893|0.685      0.675|1.195      –|2.192
PPFDa (μmol·m−2·s−1)              0.878|40.949            0.708|63.588     0.068|113.689    –|138.231
CO2 concentration (μmol·mol−1)    0.986|8.468             0.812|31.045     –|74.810         –|87.718

Spline
Environmental factor              Loss interval (h): 2    6                12               24
Air temperature (°C)              0.821|1.189             –|4.047          –|4.109          –|3.169
Relative humidity (%)             0.917|2.673             0.657|5.447      0.314|7.686      –|10.885
Soil temperature (°C)             0.229|0.568             0.212|0.574      0.073|0.623      –|0.662
Soil moisture content (%)         0.142|0.306             0.115|0.312      –|0.399          –|0.362
Atmospheric pressure (hPa)        0.803|0.932             0.719|1.113      0.468|1.529      –|2.240
PPFDa (μmol·m−2·s−1)              0.865|43.149            0.467|85.954     –|130.326        –|138.231
CO2 concentration (μmol·mol−1)    0.982|9.670             0.791|32.678     –|91.245         –|89.170

Multivariate regression
Environmental factor              Loss interval (h): 2    6                12               24
Air temperature (°C)              0.702|1.536             0.700|1.536      0.703|1.532      0.700|1.554
Relative humidity (%)             0.076|8.898             0.079|8.927      0.081|8.894      0.074|8.906
Soil temperature (°C)             –|0.844                 –|0.842          –|0.840          –|0.844
Soil moisture content (%)         0.317|0.273             0.318|0.274      0.319|0.273      0.329|0.273
Atmospheric pressure (hPa)        –|3.040                 –|3.044          –|3.043          –|3.044
PPFDa (μmol·m−2·s−1)              0.626|71.715            0.624|72.160     0.625|72.078     0.626|72.191
CO2 concentration (μmol·mol−1)    0.731|37.055            0.732|37.033     0.734|36.880     0.731|37.224

Random forest
Environmental factor              Loss interval (h): 2    6                12               24
Air temperature (°C)              0.801|1.256             0.793|1.277      0.769|1.351      0.750|1.418
Relative humidity (%)             0.623|5.680             0.585|5.995      0.560|6.153      0.558|6.155
Soil temperature (°C)             –|0.730                 –|0.823          –|1.073          –|1.065
Soil moisture content (%)         0.669|0.190             0.581|0.214      0.584|0.214      0.571|0.219
Atmospheric pressure (hPa)        –|3.444                 –|5.354          –|5.679          –|6.223
PPFDa (μmol·m−2·s−1)              0.600|74.149            0.588|75.572     0.577|76.557     0.577|76.777
CO2 concentration (μmol·mol−1)    0.822|30.131            0.818|30.506     0.827|29.709     0.809|31.336

Multilayer perceptron
Environmental factor              Loss interval (h): 2    6                12               24
Air temperature (°C)              0.915|0.822             0.834|1.144      0.872|1.007      0.836|1.148
Relative humidity (%)             0.919|2.633             0.788|4.281      0.771|4.444      0.772|4.421
Soil temperature (°C)             0.903|0.202             0.478|0.468      0.044|0.633      0.300|0.542
Soil moisture content (%)         0.877|0.116             0.722|0.175      –|0.278          0.379|0.263
Atmospheric pressure (hPa)        0.479|1.514             –|2.567          –|3.123          –|3.255
PPFDa (μmol·m−2·s−1)              0.833|47.940            0.779|55.303     0.735|60.575     0.715|63.054
CO2 concentration (μmol·mol−1)    0.951|15.847            0.864|26.334     0.852|27.463     0.814|30.945

a Photosynthetic photon flux density.
y Negative value.


Fig. 5. Comparisons of air temperature, relative humidity, and photosynthetic photon flux density (PPFD) measured by sensors and estimated by linear interpolations at different data loss intervals (A) and multilayer perceptron interpolation (B) from June 14 to 16, 2017. Dates on the x-axis are labeled at 23:50 of each day.

periods did not affect accuracy. Statistical methods do not reflect periodic changes in data. They do not relate the environmental factors to one another but simply fill in the blank data using data close to the front and back of the missing sequence, so they do not analyze how the actual environmental factors affect each other and how they change (Parrish and Derber, 1992). If the missing period is long, the data referenced by the statistical function are less related to the missing data. Therefore, it is difficult to accurately interpolate missing data using the statistical methods when a long-term omission occurs. In this study, multivariate regression was not affected by the length of the missing periods. Nevertheless, the regression model could not interpret the interaction among environmental factors and failed to interpolate missing values. In contrast, the MLP predicts the missing values from trends learned from the given training data (Freitag, 2000). The MLP learned the changes of the greenhouse environment during the experimental period and extracted a generalized relationship among the actual environmental factors. Since the missing data are predicted using other environmental factors, the accuracy does not decrease as the loss period lengthens. Therefore, if long-term interpolation is required, it may be appropriate to use the MLP, but it requires much data (Singh et al., 2017). The quantity of data needs to be checked when using machine learning techniques. In addition, the performance of the MLP could be improved by applying it to separate seasons rather than to the whole cultivation period. In the case of actual data, short-term omissions and long-term


Fig. 6. Comparisons of air temperature, relative humidity, and photosynthetic photon flux density (PPFD) measured by sensors and estimated by linear interpolations at different data loss intervals (A) and multilayer perceptron interpolation (B) from February 14 to 16, 2018. Dates on the x-axis are labeled at 23:50 of each day.

omissions were mixed because the collection period was long. An internal factor seemed to be that the sensors did not work for approximately 20 min each day because of data processing. In addition, since the sensors were electrically connected, measurements could fail in the short term due to electrical problems. Externally, data were missing for several days during data collection due to problems such as a shutdown of the greenhouse. Data loss from internal factors such as sensor errors can be interpolated using simple statistical methods. Since the interpolation methods using machine learning were not affected by the missing period, it is appropriate to use the machine learning methods in real situations if sufficient perfectly measured data are available. If the data are insufficient, the machine learning methods are difficult to use. The statistical methods can be used without any other environmental data, but the machine learning methods require sufficient integrity of the other environmental data (Liess et al., 2012). If only one sensor is disconnected, machine learning methods can refer to the other environmental factors, but in the case of long-term omissions, all sensors usually stop operating. In this case, even if the machine learning methods are perfectly trained, interpolation is impossible because there are no input values from which to obtain the missing value; the MLP cannot work when all data are missing at the same time. Therefore, it is necessary to study interpolation using past or future data relative to the point of missing data. In this case, deep learning can itself be a methodology for the interpolation that prepares data for deep learning. Recurrent neural networks, which process sequences of data, can be used (Jozefowicz et al., 2015; Moon et al., 2018a). In the case of the RF and MLP trained in this experiment, only the data


of the target greenhouse were used, so it cannot be guaranteed that the estimation will be correct in other greenhouses. To interpolate data in a new greenhouse, environmental data of the new greenhouse must be collected, and another machine learning model should be trained. However, because of the rapid development of computer science and data science, large amounts of data are accumulating, and increasingly more data will be accumulated in the future (Feng et al., 2017). Therefore, it is more efficient to use an interpolation method using machine learning than a statistical interpolation method.
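The discussion suggests that short sensor dropouts can be handled by simple statistical interpolation while longer gaps are better left to a trained model that draws on other sensors. One way this complementary use might be organized is sketched below. The function names, the `model_predict` callback (a stand-in for a trained MLP), and the threshold of five consecutive samples (50 min at 10-min sampling, matching the maximum interval in the random-loss test) are all assumptions for illustration; samples are assumed to be equally spaced.

```python
def find_gaps(values):
    """Return (start, end) index pairs, end exclusive, for each run of None values."""
    gaps, start = [], None
    for i, v in enumerate(values):
        if v is None and start is None:
            start = i
        elif v is not None and start is not None:
            gaps.append((start, i))
            start = None
    if start is not None:
        gaps.append((start, len(values)))
    return gaps


def hybrid_fill(values, model_predict, max_linear_gap=5):
    """Fill short interior gaps linearly and delegate all other gaps to a model."""
    filled = list(values)
    for start, end in find_gaps(values):
        bounded = start > 0 and end < len(values)
        if end - start <= max_linear_gap and bounded:
            # Short gap with known neighbours: straight line between them.
            left, right = values[start - 1], values[end]
            span = end - start + 1
            for j in range(start, end):
                filled[j] = left + (right - left) * (j - start + 1) / span
        else:
            # Long or unbounded gap: hypothetical model using other sensors.
            for j in range(start, end):
                filled[j] = model_predict(j)
    return filled


# Usage with a placeholder model that returns a constant.
series = [1.0, None, 3.0, None, None, None, None, None, None, 10.0]
result = hybrid_fill(series, model_predict=lambda index: -1.0)
```

As noted above, the model branch only helps when the other sensors kept recording during the gap; if every sensor stopped at once, there are no inputs for it to use.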

4. Conclusions

Various interpolation methods were compared using actual greenhouse data. Statistical methods showed high accuracy in estimating randomly extracted data and short-term missing data, but their accuracy in long-term interpolation was low because they did not reflect the trends of the greenhouse environment. On the other hand, the machine learning methods showed stable accuracy even when the experimental period was changed. Because actual cultivation environment data contain both short-term and long-term omissions, it would be more appropriate to use machine learning methods than statistical interpolation. In particular, among the machine learning methods, the MLP showed high accuracy in all experiments. Therefore, the MLP showed the best overall performance among the interpolation methods applied to the greenhouse data, and it can be applied to the analysis of big data obtained from greenhouses.

Acknowledgement

This work was supported by the Korea Institute of Planning and Evaluation for Technology in Food, Agriculture, Forestry and Fisheries (IPET) through the Agriculture, Food and Rural Affairs Research Center Support Program funded by the Ministry of Agriculture, Food and Rural Affairs (MAFRA; 717001-07-1-HD240).

References

Appelhans, T., Mwangomo, E., Hardy, D.R., Hemp, A., Nauss, T., 2015. Evaluating machine learning approaches for the interpolation of monthly air temperature at Mt. Kilimanjaro, Tanzania. Spat. Stat. 14, 91–113.
Batista, G.E., Monard, M.C., 2003. An analysis of four missing data treatment methods for supervised learning. Appl. Artif. Intell. 17, 519–533.
Chandra, P., Albright, L.D., Scott, N.R., 1981. A time dependent analysis of greenhouse thermal environment. Trans. ASAE 24, 442–449.
De Carvalho, O.A., Guimarães, R.F., Gomes, R.A.T., Da Silva, N.C., 2007. Time series interpolation. In: IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2007), Spain: Barcelona, pp. 1959–1961.
Díaz-Uriarte, R., De Andres, S.A., 2006. Gene selection and classification of microarray data using random forest. BMC Bioinf. 7, 3.
Feng, Y., Peng, Y., Cui, N., Gong, D., Zhang, K., 2017. Modeling reference evapotranspiration using extreme learning machine and generalized regression neural network only with temperature data. Comput. Electron. Agric. 136, 71–78.
Freitag, D., 2000. Machine learning for information extraction in informal domains. Mach. Learn. 39, 169–202.
Froehlich, D.P., Albright, L.D., Scott, N.R., Chandra, P., 1979. Steady-periodic analysis of glasshouse thermal environment. Trans. ASAE 22, 387–399.
Guo, S.R., Sun, J., Shu, S., Lu, X.M., Tian, J., Wang, J.W., 2012. Analysis of general situation, characteristics, existing problems and development trend of protected horticulture in China. China Veg. 18, 1–14.
He, F., Ma, C., 2010. Modeling greenhouse air humidity by means of artificial neural network and principal component analysis. Comput. Electron. Agric. 71, S19–S23.
Ioffe, S., Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
Jolliet, O., 1994. HORTITRANS, a model for predicting and optimizing humidity and transpiration in greenhouses. J. Agric. Eng. Res. 57, 23–37.
Jozefowicz, R., Zaremba, W., Sutskever, I., 2015. An empirical exploration of recurrent network architectures. In: Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), France: Lille, pp. 2342–2350. PMLR.
Jung, D.H., Kim, D., Yoon, H.I., Moon, T.W., Park, K.S., Son, J.E., 2016. Modeling the canopy photosynthetic rate of romaine lettuce (Lactuca sativa L.) grown in a plant factory at varying CO2 concentrations and growth stages. Hortic. Environ. Biotechnol. 57, 487–492.
Jung, D.H., Lee, J.W., Kang, W.H., Hwang, I.H., Son, J.E., 2018. Estimation of whole plant photosynthetic rate of Irwin mango under artificial and natural lights using a three-dimensional plant model and ray-tracing. Int. J. Mol. Sci. 19, 152.
Kass, R.E., 1990. Nonlinear regression analysis and its applications. J. Am. Stat. Assoc. 85, 594–596.
Kingma, D., Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980v9.
Liess, M., Glaser, B., Huwe, B., 2012. Uncertainty in the spatial prediction of soil texture: comparison of regression tree and random forest models. Geoderma 170, 70–79.
Malvar, H.S., He, L.W., Cutler, R., 2004. High-quality linear interpolation for demosaicing of Bayer-patterned color images. In: Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004), Canada: Montreal, pp. iii–485.
Moon, T., Ahn, T.I., Son, J.E., 2018a. Forecasting root-zone electrical conductivity of nutrient solutions in closed-loop soilless cultures via a recurrent neural network using environmental and cultivation information. Front. Plant Sci. 9, 859.
Moon, T.W., Jung, D.H., Chang, S.H., Son, J.E., 2018b. Estimation of greenhouse CO2 concentration via an artificial neural network using environmental factors. Hortic. Environ. Biotechnol. 59, 45–50.
Parrish, D.F., Derber, J.C., 1992. The National Meteorological Center's spectral statistical-interpolation analysis system. Mon. Wea. Rev. 120, 1747–1763.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A., 2017. Automatic differentiation in PyTorch. NIPS 2017 Autodiff Workshop: The Future of Gradient-based Machine Learning Software and Techniques, CA, US: Long Beach.
Polishchuk, P.G., Muratov, E.N., Artemenko, A.G., Kolumbin, O.G., Muratov, N.N., Kuz’min, V.E., 2009. Application of random forest approach to QSAR prediction of aquatic toxicity. J. Chem. Inf. Model 49, 2481–2488.
Rich, P.M., Clark, D.B., Clark, D.A., Oberbauer, S.F., 1993. Long-term study of solar radiation regimes in a tropical wet forest using quantum sensors and hemispherical photography. Agric. Forest Meteorol. 65, 107–127.
Rumelhart, D.E., Hinton, G.E., Williams, R.J., 1988. Learning representations by back-propagating errors. Cogn. Model. 5, 1.
Schafer, J.L., Graham, J.W., 2002. Missing data: Our view of the state of the art. Psychol. Methods 7, 147–177.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., et al., 2017. Mastering the game of go without human knowledge. Nature 550, 354–359.
Singh, B., Sihag, P., Singh, K., 2017. Modelling of impact of water quality on infiltration rate of soil by random forest regression. Model. Earth. Syst. Environ. 3, 999–1004.
Taormina, R., Chau, K.W., 2015. Neural network river forecasting with multi-objective fully informed particle swarm optimization. J. Hydroinform. 17, 99–113.
Trejo-Perea, M., Herrera-Ruiz, G., Rios-Moreno, J., Miranda, R.C., Rivasaraiza, E., 2009. Greenhouse energy consumption prediction using neural networks models. Int. J. Agric. Biol. 11, 1–6.
Wang, T., Gao, H., Qiu, J., 2016. A combined adaptive neural network and nonlinear model predictive control for multirate networked industrial process control. Trans. Neural Netw. Learn. Syst. 27, 416–425.
Yu, H., Chen, Y., Hassan, S.G., Li, D., 2016. Prediction of the temperature in a Chinese solar greenhouse based on LSSVM optimized by improved PSO. Comput. Electron. Agric. 122, 94–102.
