Copyright © 2004 IF AC Fifth International Workshop
011
Artificial Intelligence in Agriculture, Cairo, Egypt
DATA MINING PEST, PESTICIDE AND METROLOGICAL DATA Ahsan AbduIlah*, Stephen Brobst+, Muhammad Umer* *National University of Computers & Emerging Sciences, Islamabad, Pakistan
[email protected],
[email protected] ... Teradata Division, NCR, Day ton, OH, USA
Abstract: Agro-Meu'ological data is horizontally wide (dozens of variables) and vertically deep (tens of thousands of recordings). Traditional data analysis techniques are not suitable for such large multivariate and apparently unstructured data sets. Thus data mining is a viable option for automated detection of hidden patterns in Agro-Metrological data. As commercially available Data Mining tools are very expensive, therefore, we have developed and used our own tool employing indigenous technique of Recursive Noise Removal (RNR), based on crossing minimization paradigm. In this paper we have considered two years of pest scouting data i.e. 2001 and 2002 and analyzed the results of performing clustering exploring White Fly populations. Copyright © 2004IFAC Keywords: Algorithm, Agriculture, Data Bases, Data Processing, Information Technology
1. INTRODUCTION Pakistan is one of the five major cotton growing countries in the world and also a major consumer of pesticides. Almost 70% of world cotton is produced in China (Mainland), India, Pakistan, USA and Uzbekistan (Chaudhry, 2000). In 1983 the cotton crop in Pakistan was severely damaged by pest attacks. Therefore, in 1984 The Department of Pest Warning and Quality Control of Pesticides Punjab was established. For the last two decades the said department has been performing pest scouting by randomly visiting literally tens of hundreds of farmers in 34 districts of Punjab. Pest scouting is a systematic field sampling process that provide field specific information on pest pressure and crop injury (MU Extension 200 1). The detailed pest scouting data collected is recorded by hand-filling or typing standard data sheets. A typical data recording consists of 35-40 variables, and a data sheet consists of around 20 such recordings. Prior to our work, the said data sheets have never been digitized, hence never used for querying by any database system. The agricultural data being considered is horizontally wide (dozens of variables) and vertically deep (tens of thousands of recordings). Coupling this with metrological data further increases the size and complexity of the database. Pakistan Metrological department · has been recording metrological data for several decades. Currently the Pakistan Met Department operates 7 weather stations in the province of Punjab. A typical Met recording consists of more than 50 variables. A small research
58
group like ours could not afford the expensive metrological data. Therefore, decided to use the publicly available metrological forecasts. For this purpose we downloaded the metrological forecasts of Temperature, %Humidity and Outlook from the web site of The Daily Dawn www.dawn.com. The human mind is incapable of handling more than a few variables and that too for small data sets. Therefore, pest scouting data coupled with metrological data can not be processed by humans, with whatever level of expertise. Recognize that finding a needle in a haystack is easy, as you what you are looking for. But finding what you don't know what you should be looking for, such as hidden pattern and relationships is a significantly difficult problem. Therefore, for such difficult problems, especially for large data sets, data mining is the ideal choice. The cost of a typical data mining tool runs in tens of thousands of dollars. Hence are out of the reach of most departments in a developing country like Pakistan. Thus the only other viable option is 10 develop a data mining tool, using indigenous data mining techniques to. perform clustering. A cluster is a collection of data elements that are highly similar to one another within the same cluster, but weakly similar from the data elements in other clusters. More formally, let 0 = (0.,02,03, ... On) be a set of n objects and let C = (C., C 2, ••• Ck ) be a partition of 0 into subsets; such that Ci n Cj = 0, i to
3. METHOD
uCk
j and k = O. Each subset is called a cluster, and C is a clusterillg solution. There are basically two types of classification (i) un-supervised classification or clustering, when we are unaWare of the class labels and may not even know the number of classes (ii) supervised classification (or classification) when we know the class labels and the number of classes, e.g. Decision Tree Induction, Genetic Algorithms, Bayesian classification etc. Bayesian classification requires initial knowledge of many probabilities, and has significant computational cost. This work is about clustering. For a detailed study on data mining see (Han and Kambt:r 2000). In this paper we have successfully shown how data mining can be helpful in gaining useful insight into agrometrological data. To perform the mining, we have applied a data mining technique called "Clustering by Recursive Noise Removal (RNR)" , originally proposed in (Abdullah and Brobst 2(03), based on the crossing minimization paradigm. The tool based on RNR was used to perform the clustering for this paper. 2. RELATED WORK Data mining applications have quite recently found their way into agricultural research and a lot of activity can be seen in this area such as (Bertis et. al. 200 1a, Cunningham and Holmes 2001, Harms et al 2000, Scherte 2002, Yang 1999). (Voss 2(03) gives the detail of GIMMI project which is aiming at providing a one-stop and integrated access to the assessment of pesticide leaching into soil and groundwater. Several IT tools including data mining are to be implemented as part of this project. (Avesani 2(02) describes a new approach for acquisition and preprocessing of agricultural data mining. (Bertis et al. 200lb) used date mining for discovering fraudulent claims made by farmers . (Christensen and Di Cook 1998) describes simple numerical methods have been used to establish the relationship between 10 soil characteristic variables and corn yield. (Yang 1999) used remote sensing techniques in conjunction with AI neural networks to identify weeds in corn fields. In (Nugyen 200 I) data mining techniques on images have been used to identify trash in the ginned cotton. In (Scherte 2(02) a case study approach was used to help understand how data mining could be used in the manufacturing of textiles using commercially available SAS tools. In (Harms et. al 2(00) spatio-temporal knowledge discovery techniques are integrated into a Geo-Spatial Decision Support System (GDSS) using a combination of data mining techniques to find relationships between user-specified target episodes and other climatic events and to predict the target episodes. Our work in unique in the sense that it integrates pesticides, ·pest and weather data for the cotton crop, and performs mining using an indigenously developed data mining tool. Thus is very attractive in terms of cost and quality of results.
59
The data for this study was obtained from 200+ pest scouting sheets. The digitization of the sheets was performed by double typing each and every sheet. Subsequently each record was crossed checked with the same record typed by the other Data Entry Operator by running simple SQL queries. After the errors were identified, the same were reconciled by comparison with the actual data sheets. Standard data cleansing and quality management techniques were subsequently used for data integration. standardization etc. prior to using the data. This effort resulted in an Agriculture Data Warehouse. The details of the said Warehouse are given in (Abdullah et al 2003. Abdullah et al 2004) along the data validation examples. Clustering by Recursive Noise Removal (RNR) algorithm is an unsupervised clustering technique based on recursively removing noise and using different crossing minimization heuristics such as the Median Heuristic from (Eades and Wormald 1986), Bury Center heuristic from (Sugiyama et al. 1981) and MaxSort variant from (Abdullah 1993) etc. on bipartite graphs. The effect of minimizing crossings of a bipartite graph is reduction in edge lengths. The consequence of reduction of edge lengths is that vertices that are highly connected with each other gets placed contagiously, resulting in clustering (Abdullah and Brobst 2003). Figure-1(a) shows a randomly permuted bipartite graph consisting of three cliques and some white noise. The clique and noise edges are color coded. Figure-l (b) shows the same graph using a matrix data structure. Figure-l(c) shows the graph of Figure-1(a) after application of a crossing minimization heuristic. The grouping of highly connected vertices is evident, so is the reduction in crossings. Figure-l(d) shows the graph of Figure-I (c) using a matrix data structure. ' ~r
JWI',,''''{ll,,'j,1'>:....+'1
Fig-l(a): Input graph with 245 erossings
tJ Fig-1(e): Output graph with 82 erossings
Fig-l(b): Matrix representation of Fig-l(a)
• Fig-l(d): Matrix representation of Fig-l(e)
As the amount of noise increases i.e. increase in missing edges in cliques and/or increase of edges across cliques. the efficacy of crossing minimization heuristics deteriorates quickly. Using the RNR technique. the matrix representation of the graph is recursively divided into four parts. and after each iteration of the crossing minimization heuristic. the top-right and bottom-left quadrants are dropped. This results in significant increase in noise immunity of the crossing minimization heuristics. when used for clustering. We begin by taking the given database and creating a correlation matrix . As the crossing minimization heuristics being considered work for an un-weighted graph. hence the real correlation matrix is discretized to get a binary correlation matrix . Discretization is the process of partitioning continuous variables/attributes into categories. As our work deals with unsupervised clustering. hence we will not consider supervised discretization methods such as Fuzzy discretization. Entropy Minimization discretization etc instead consider unsupervised discretization i.e. making use of information about the distribution of values of attributes without class infomlation. For a detailed comparison of different discretization methods see (Yang and Webb 2002 ). For the examples in this paper we have used Equal Frequency Discretization (EFD) from (Catlett 1991). EFD divides the sorted values into k intervals so that each interval contains approximately the same number of instances . Thus each interval contains n=k (possibly duplicated) adjacent values. k is a user predefined parameter and is set as 2 in this paper. The entire process of clustering that we have followed is briefly given as follows :
Create an n x n similarity matrix S where = Correlation between Row i and Row j Discretize S using EFD Run RNR Extract clusters
Figure-2(a): Input Binary Matrix Figure-2(a) shows the discretized similarity matrix for 200\. Figure-2(b) shows the output with two clusters extracted. Records belonging to the same cluster appear contiguously. Data demographics of the involved columns as per industry standard collected over the used data set are given in Table I.
. , Attributes,- / · i\:~·" ·.' 2001 ~' . ""~ . ":* . 2002 L.i';<\\'i;~;7;;;~;s~,:i;i ~WEN~,' ;;WFM" ~WFN* MARows Distinct Values Max Rows/ Value Typical Rows / Value
488 36
488 57
697 38
697 48
62
83
71
60
19
12
12
\0
Table-I: Data Demographics corresponding to Fi~-2
WFN: Whitefly Nymph Population Adult Population
#WFA: Whitefly
Figure-3a and 3b show two clusters identified for each year. To understand the meaning of these cluster and thus to gain insight in the data requires understanding the relationship among the columns on which clustering was performed.
Si.j
Average Population WhlteAy 3 ,00
V 2001
2 .50
4. RESULTS We applied RNR on our pest scouting data for years 2001 and 2002 of district Multan. Whitefly nymph population (WFN). white fly adult population (WFA) and last pesticide used columns were chosen for clustering. Clustering was performed only on the records where pesticide was sprayed and the dose of pesticide used was recorded .
~
-~ 1/1
2 .00 1.50 t OO 0 .50 0 .00
Cluster
2
Figure-3a: Wbltefly population for 2001
60
~-. - - . -- . - -
,
--
- - - -- -- - ----·1
Average Population WhlteFly Y2002
WhlteRy Population Farmer 1471:KAR Y2001
3 .00 ..,..,.,-.,.,........,...-.....,,...,...,,... 2 .SO
.iIt
!, III
:~
2.5
2 .00
1_
2
~
1.5
C III
1.SO 1.00
~
O.SO
0 .5
0 .00
Cluster
2
,-------------Figure-3b: Whitefly population for 2002
i
An important aspect of Figure-3 is that they give the relationship among WFA and WFN populations at a particular point in time i.e. WFA population of a cluster hasn't have its roots in the WFN population of the same cluster or vice versa. To determine the real cause behind an increased WFA or (WFN) population one has to perform a temporal follow-up of WFA (or WFN) population. We performed this follow-up by isolating those fields that have contributed in one cluster at one point in time and in the second cluster at another time i.e. fields common to both the clusters. To clarify this concept Table 2 is presented that gives some sample records which relate to a field common to both clusters. Note that the records are identified on the basis of fields instead of farmer.
Field ID 'Visit ' Date ;WFN
WFA
1467
12-8-2001
2
1421 1467 1467 ..... . ... ... 620
15-8-2001 3.5 8 24-8-2001 4.5 3 28-8-2001 5.5 1.5 ... .......... ...... ...... ............. ...... .... .. 28- 105 3 2001 Table-2: Sample data
3
.-
... ,.
: Cluster
.. . 1
.. . ... .. . ... ...
.. .
! ________ _ ___ .__Date L. _ _. ___ _. __.. ___ ... _ . _ - - ... _...--- _._--_._ -_. WhlteFly Populallon Farmer 1471:FH900 Y 2002
I: I
l" I
28/07
0&'08
tl/08
26/08
Date
0\109
12/09
iI I
_. _ _ _ _ 1
Figure-4: White Fly population for fields 1471:KAR and 1471: FH900
1 I 2 . .... ... .. 2
5. DISCUSSION
After isolating the fields common to both the clusters we analyzed WFA and WFN population dynamics at those fields. To demonstrate the concept we have chosen one of our observation fields. Following bar graph gives a comparison of WFN population value for the observation field at one recording time and its corresponding WFA population value of the next recording as s~own in Figure-4.
61
Figure-3 revealed opposite population ratio in the clusters for both years; e.g. cluster-I of year 200 I showed a greater nymph population while the average popUlation of adult whitefly in the same fields was considerably less. On the other hand cluster-2 of the same year depicts an entirely opposite picture. Similar pattern can be observed in the clusters for the year 2002. Now we come to an even harder part,' i.e. to understand the reasons behind these clusters and to draw useful knowledge out of above shown patterns. We directly tried the met elements while clustering, but the same did not give any distinctive results. Therefore, we now look at the met elements along with the cluster records so as to fully reflect the true dynamics of discovered clusters. The detailed data corresponding to field 147l:KAR is given in Table-3 along with possible explanations with regard to the behavior.
Date 24/07
' Pesticide' Polytrin-C (Profenofos + Cypermethrin)
03/08
Polytrin-C
12108
Curacron (Profenfos)
21108
Curacron (Profenfos)
26/08
Curacron (Profenfos) + Polo (Difenthauron)
clusters with opposite characteristics i.e. Nymph population less than Adult in one cluster and Nymph population greater than Adult population in other cluster. This behavior was observed for both years. Checking detail data explained the behavior i.e. farmers either used wrong pesticide. or effect of using correct pesticide was nullified by weather. This leads to the conclusion that either the farmers did not followed the "simple" pesticide application guidelines. or they were unaware of such guidelines. Both scenarios point to room for improvement in the extension services.
:Forecast: i:EffeetiAnalysis Sunny Inappropriate pesticide application. Pest shift from nei2hboring fields. Sunny Inappropriate pesticide aoolication. Showers Correct pesticide is sprayed but rain canceled the effect Sunny Correct pesticide usage
. /;.,.:~
Sunny
Correct usage.
7. References
pesticide
Abdullah. A. 1993. On placement of logic schematics. in proc. of IEEE TENC0N'93. Beijing. China. pp. 885-888.
Table-3: Detailed results of field 1471:KAR for 2001 The detailed data corresponding to field 1471:FH900 for the year 2002. is given in Table-4 along with possible explanations with regard to the behavior.
28/07
Pesticide Polytrin-C
Forecast Cloudy
05108
Chlorpyrifo
Cloudy
13/08
No Spray
Cloudy
No Spray
Thunder Storm Cloudy
Date
Effect Analysis Inappropriate pesticide aoolication. Correct pesticide is sprayed
Abdullah A. and Brobst. S .• 2003. Clustering Weak Clusters in Gene Expression Data. in proc. of The 2nd Annual Conference of the Korean Society for Bioinformatics (KSBI 2003). Daejcon. South Korea Abdullah A. and Brobst. S.. 2003; Oustering by Recursive Noise Removal. in proc. of Atlantic Symposium on Computational Biology and Genome Informatics (CBGI'03) Cary. North Carolina. USA. pp. 973-977.
Pest hostile weather conditions Inappropriate 01109 Polytrin-C pesticide application. Pest favorable weather Data not 12109 Curacron Correct pesticide is (Profenfos) Available sprayed Table-4: Detailed results of field 1471:FH900 for 2002 26/08
Abdullah A. and Brobst. S .• 2003. Data Mining Pest Scouting Data, in proc. of 30th Annual conference of the Tennessee Entomological Society. McMinnville. USA.
These comparisons are made with the consideration that WFA population has to grow from WFN population of the same field. Now if the increase in population is disproportionate it may point to following possibilities. (i) If WFN population is significantly less than corresponding WFA population it may point towards either WFA invasion from neighboring fields or a gross error in data recording process. (ii) Similarly If WFN population is significantly larger than corresponding WFA popUlation it may point towards an effective pesticide usage behavior. weather conditions hostile to whitefly or presence of predators. 6. CONCLUSIONS In this paper we have discussed the results of performing unsupervised clustering on White Fly Nymph and Adult populations along with pesticide used for two years of pest scouting data i.e. 2001 and 2002. Quality of results could have been even better if we had actual metrological recordings of various elements. instead of just four forecasted elements. Despite this. the clustering revealed
62
Abdullah A. et al. 2003. Agri Data MininglWarehousing: Innovative Tools for Analysis of Integrated Agricultural & Metrological Data. in proc. of International Workshop on Frontiers of Information Technology. December 23-24. 2003. Islamabad. Pakistan. Abdullah A. et al. 2004. Learning Dynamics of Pesticide Abuse through Data Mining. in proc. of The Australasian Workshop on Data Mining and Web Intelligence. 18-22 Jan. 2004. Dunedin. New Zealand. in ACS Conference Series in RCSC;llJ'Ch & Practice in Information Technology. Vol. 32. Abdullah A. et al. 2004. The Case for an Agri Data Warehouse: Enabling Analytical Exploration of Integrated Agricultural Data, to appear in proc. of lASTED International Conference on Databases and Applications (DBA 2004). Feb. 17-19 2004. Innsbruck. Austria. Avesani. P .• Olivetti. E .• and Susi S .• 2002. Feeding Data Mining IRST Technical Report #0207-01. Istituto Trentino di Cultura, Povo (Trento). Italy. 2002.
Bertis, B., et aI, 2001, Data Mining in U.S. Corn Fields, in Proc. of the First SIAM Intl!rnational Conference on Data Mining, Chicago, IL., USA. Bertis, B., et aI, 2001, Collusion in the U.S. Crop Insurance Program: Applied Data Mining, Center for Agribusiness Excellence, Tarlatan State university, Stephenville, Texas, Chaudhry, M. R., 2000, New Frontiers in Cotton Production, International Cotton Advisory Committee, USA . Catlett, 1., 1991, On Changing Continuous Attributes into Ordered Discrete Attributes, in proc. of European Working Session on Learning, pp . 164-178. Christensen, W. F., and Di Cook, 1998, Data Mining Soil Characteristics Affecting Corn Yield, citeseer.nj .nec.com/christensen98data.html . Cunningham, S. J., and Holmes G., 2001, Developing Innovative Applications in Agriculture using Data Mining, Department of Computer Science, University of Waikato, Hanlilton, New Zealand. Eades, P., and Wormald, N., 1986, The Median Heuristic for Drawing 2-layers Networks, Technical Report 69, Departmerit of Computer Science, University of Queensland, Brisbane, Australia. Hanns, S., et aI, 2000, Data Mining in a Geospatial Support System for Drought Risk Management, in proc. of First National Conference on Digital Government Research, Los Angeles, California, pp. 9-16. Han and Kamber, 2000, Data Mining: Concepts and Techniques, Pub Morgan Kaufmann Publishers. Nugyen, H. T., 2001, Some Practical Applications of Soft Computing and Data Mining, Data Mining and Computational Intelligence, Springer- Verlag, Berlin, pp. 273-307 . Scherte, S. L., 2002, Data mining and its potential use in textiles : A Spinning Mill, PhD Dissertation, North Carolina State University, NC, USA. Sugiyama, K. , Tagawa, S., and Toda, M., 1981, Methods for Visual Understanding of Hierarchical Systems. IEEE Trans. Syst. Man Cybern ., SMC-ll(2): 109-125 . Voss, H., 2003 , Simulation, Visualization, and Decision Support in GIMMI, 9th EC GI & GIS Workshop, ESDI Serving the User, A Corona, Spain. Yang C., Prasher S. 0., and Landry, 1. A., 1999, Use of Artificial Neural Networks to Recognize Weeds in a Corn Field, Journee d'information scientifique et technique en genie agroalimentaire, Saint-Hyacinthe QC, Canada, pp. 60-65 .
63
Yang, Y., Webb, G .. I., 2002, A Comparative Study of Discretization Methods for Naive-Bayesian Classifiers, in proc. of Pacific Rim Knowledge Acquisition Workshop, National Center of Sciences, Tokyo, Japan.