
Geodata science and geochemical mapping


Journal Pre-proof

Renguang Zuo, Yihui Xiong

PII: S0375-6742(19)30496-0
DOI: https://doi.org/10.1016/j.gexplo.2019.106431
Reference: GEXPLO 106431
To appear in: Journal of Geochemical Exploration
Received date: 29 August 2019
Revised date: 13 November 2019
Accepted date: 23 November 2019

Please cite this article as: R. Zuo and Y. Xiong, Geodata science and geochemical mapping, Journal of Geochemical Exploration (2018), https://doi.org/10.1016/j.gexplo.2019.106431

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2018 Published by Elsevier.

Renguang Zuo*, Yihui Xiong

State Key Laboratory of Geological Processes and Mineral Resources, China University of Geosciences, Wuhan 430074, China

*Email: [email protected]

Abstract

Geodata science (GDS) is an interdisciplinary field in which geoscience data are mined to better understand the origin, evolution, and future of our Earth and planet, and to predict and assess its resources and environments. The data chain of GDS involves collecting geoscience data, mining geoinformation, discovering geo-knowledge, and making spatial decisions. There are three groups of GDS methods for exploring and mining geoscience data: data statistics, data mining, and data insight and prediction. A case study on mapping geochemical exploration data was conducted to demonstrate the use of GDS. The results show that GDS is a new research paradigm for exploring the spatial association of geochemical patterns, mining elemental associations, and recognizing geochemical anomalies associated with mineralization via geo-computation and geo-visualization techniques in support of mineral exploration.

Keywords: Geodata science; Data mining; Data insight and prediction; Geochemical exploration

1. Introduction

Data science (DS) is the science of extracting information and knowledge from data with the aim of better understanding the dataset itself. DS, first introduced by Naur (1974) in Concise Survey of Computer Methods, is defined as a new pattern for research on data mining. It is similar to, but distinct from, data mining. DS is an interdisciplinary approach to data mining that combines statistics, many fields of computer science, and scientific methods and processes in order to mine data in automated ways, without human interaction (Hayashi, 1998). Modern data science is increasingly concerned with big data. Mattmann (2013) and Cleveland (2014) further explained and promoted DS. DS complements computational science and statistics. With the growing focus on big data and artificial intelligence (AI), it is no longer restricted to the field of statistics. Both DS and AI techniques have gained great attention in the era of big data. AI is a technique for learning human skills to deal with problems and is a broad family containing machine learning (ML) and other techniques. Deep learning (DL), a subtype of ML, is a kind of artificial neural network with multiple hidden layers. DS intersects with AI, ML, and DL. In general, the workflow of DS differs from that of ML. The former involves collecting data, analyzing data, and suggesting hypotheses or actions. The latter involves collecting data, training a model, and deploying the model (https://www.deeplearning.ai/ai-for-everyone/).

Big data has attracted much attention and has stimulated research on data mining in multiple domains. Similarly, earth scientists are dedicated to unearthing potential information from big data to find solutions to problems in nature, such as climate change prediction, air pollution monitoring, predicting risks to infrastructure from natural hazards, consumption of water and mineral resources, and identifying factors of earthquakes, landslides, flooding, and volcanic eruptions (Karpatne et al., 2019). Research on the earth system is shifting from traditional patterns, such as collecting empirical data, theoretical derivation, and local simulation, to exploiting and mining earth datasets to discover the interrelationships between different variables (Tansley and Tolle, 2009). Earth system research has entered a stage of data-intensive scientific discovery, which benefits from new generations of sensors, instruments, and platforms for quick transmission rates in data storage facilities; publicly available datasets that give earth scientists the conditions for global research and resource sharing; and large efforts to standardize geoscience datasets to facilitate better mining of them (e.g., Baumann et al., 2016; Ma, 2018).

Geodata science (GDS) is an interdisciplinary field of science that mines geoscience data to better understand the origin, evolution, and future of our Earth and planet, with prediction and assessment of its resources and environments. Defined in a similar way to geophysics and geochemistry, GDS is an interdisciplinary subject combining geosciences and DS (Fig. 1). The data chain of GDS includes collecting geoscience datasets, mining geoinformation, discovering geoknowledge, and making spatial decisions (Fig. 2). Various methods are available to analyze and mine geoscience data. These methods can be classified as data statistics, data mining, and data insight and prediction (Fig. 3).

Data statistics mainly refers to traditional statistical methods that sort, filter, calculate, and count the data to reveal meaningful information. Data mining refers to the discovery of unknown, potentially useful, and hidden rules from geoscience data via association analysis, clustering analysis, factor analysis, and traditional AI algorithms. Data insight and prediction aim to provide insight into and prediction of geological events, which is the core application of big data, by extracting geoscience features and integrating geoscience variables to support decision-making (Zuo et al., 2019). The main aim of this paper is to introduce the concept of GDS and its application to geoscience problems through a case study in the Gejiu region, Yunnan Province, China. This case study demonstrates how the three groups of GDS methods can be applied to mine geochemical exploration data in support of mineral exploration.

2. Study area and Data

The study area, located in the Gejiu region, Yunnan Province, China, is one of the largest primary Sn mineral districts in the world. It contains ore reserves of 300 Mt Sn, 300 Mt Cu, 400 Mt of Pb + Zn, and > 1000 Mt of Mn (Cheng et al., 2013a). Yu et al. (1988) introduced its geological background. Two major geological units, igneous rocks and a sequence of Paleozoic to Mesozoic sedimentary rocks (Fig. 4), are developed in the study area. The outcrops in this district mainly comprise the Middle Triassic Gejiu Formation and Falang Formation (Qin and Li, 2008). The carbonate rocks of these two formations, including the interlayered Triassic basic lavas in the Gejiu Formation, are the main ore-hosting rocks for Sn mineralization (Cheng et al., 2012, 2013b). Based on texture and mineralogy, the Gejiu Batholith, a granitoid complex, can be divided into porphyritic biotite granite, fine-grained porphyritic biotite granite, porphyritic biotite granite, coarse- to medium-grained equigranular biotite granite, medium- to fine-grained leucogranite, and a fine-grained equigranular granitic dyke swarm and small stocks around granite margins (Dai, 1996; Cheng and Mao, 2010; Cheng et al., 2013b). The study area is divided into eastern and western parts by the Gejiu fault. Faults and folds trend mainly N–S and E–W in the eastern part, and NE–SW or NW–SE in the western part. The Sn polymetallic mineralization (such as Sn, Sn-Pb, Sn-Cu-Pb, and Sn-W) developed in this district mainly comprises four ore types: greisen, skarn, stratabound cassiterite-sulfide, and vein-type ore (Mao et al., 2008; Cheng et al., 2013b). Previous studies have pointed out that Sn mineralization in the study area is spatially correlated with the Gejiu Formation, the Gejiu Batholith, and fault structures (e.g., NNE–SSW and E–W trending faults) (Cheng, 2007; Mao et al., 2008; Cheng and Mao, 2010; Cheng et al., 2013b).

The stream sediment geochemical data were collected by the Chinese National Geochemical Mapping Project as part of the Regional Geochemistry National Reconnaissance (RGNR) Project initiated in 1979 (Xie et al., 1997). This project collected stream sediment samples in this district at a density of 1 sample per 4 km² for the determination of 39 major and trace element concentrations. Eleven elements (Bi, Cd, Co, Cu, La, Mo, Nb, Pb, Th, U, and W) were determined by inductively coupled plasma-mass spectrometry (ICP-MS); nine (Al, Cr, Fe, K, P, Si, Ti, Y, and Zr) by X-ray fluorescence (XRF); eleven (Ba, Be, Ca, Li, Mg, Mn, Na, Ni, Sr, V, and Zn) by inductively coupled plasma-atomic emission spectrometry (ICP-AES); three (Ag, B, and Sn) by emission spectrometry (ES); and two (As and Sb) by hydride generation-atomic fluorescence spectrometry (HG-AFS). Finally, Au, Hg, and F were determined by graphite furnace-atomic absorption spectrometry (GF-AAS), cold vapor-atomic fluorescence spectrometry (CV-AFS), and ion selective electrode (ISE), respectively (Xie et al., 2008). The dataset was used for the prediction of Sn mineral deposits by Cheng (2007). More details on the sampling, detection limits, and data quality can be found in Xie et al. (1997).

3. Methods

3.1. Exploratory spatial data analysis

As an extension of exploratory data analysis (Tukey, 1977), exploratory spatial data analysis (ESDA) aims to examine geospatial data using various approaches, such as histograms, Voronoi maps, normal QQ plots, trend analysis, semivariogram/covariance clouds, cross-covariance clouds, and other functions (Anselin, 1999). The core functions of ESDA are to visualize and explore spatial data. Symanzik (2014) considered the British anesthesiologist John Snow the “grandfather of ESDA”. Local indicators of spatial association (LISA) are a popular and key technique for ESDA (Symanzik, 2014). In 1995, Luc Anselin proposed the local Moran's I index to assess the spatial association at a location i (Anselin, 1995). The index I can measure local instability and identify local spatial clusters and outliers. A value of I > 0 indicates that a feature is part of a cluster, with neighboring features that have similarly high or low attribute values. A value of I < 0 indicates an outlier, a feature whose neighbors have dissimilar values. The results of a local Moran's I analysis fall into four patterns: a cluster of high values (HH), a cluster of low values (LL), an outlier in which a high value is surrounded primarily by low values (HL), and an outlier in which a low value is surrounded primarily by high values (LH).
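The sign rule for the local Moran's I index can be illustrated with a short Python sketch. It is only an illustration under stated assumptions and not the implementation used in this study: the function name `local_morans_i`, the row-standardized inverse-distance weights, and the toy coordinates and values are all hypothetical. It also omits the permutation-based significance test that a full LISA analysis uses to decide which samples are reported as neither clusters nor outliers.

```python
import math


def local_morans_i(points, values, band):
    """Return (I_i, label) for every sample (illustrative sketch only).

    points: list of (x, y) coordinates; values: attribute at each point;
    band: distance threshold beyond which neighbours are ignored.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    results = []
    for i in range(n):
        # Inverse-distance weights to neighbours inside the distance band.
        weights = {}
        for j in range(n):
            if j == i:
                continue
            d = math.dist(points[i], points[j])
            if 0.0 < d <= band:
                weights[j] = 1.0 / d
        total = sum(weights.values())
        if total == 0.0:
            results.append((0.0, "none"))  # no neighbours inside the band
            continue
        # Row-standardised spatial lag of the deviations (an assumption).
        lag = sum(w * dev[j] for j, w in weights.items()) / total
        # Variance computed from all samples except i (Anselin, 1995).
        s2 = sum(dev[j] ** 2 for j in range(n) if j != i) / (n - 1)
        i_local = dev[i] / s2 * lag
        if i_local > 0:
            label = "HH" if dev[i] > 0 else "LL"
        elif i_local < 0:
            label = "HL" if dev[i] > 0 else "LH"
        else:
            label = "none"
        results.append((i_local, label))
    return results


# Hypothetical toy data: a high-value cluster and a low-value cluster
# that contains one high-value outlier.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
values = [20, 22, 21, 1, 2, 28]
labels = [label for _, label in local_morans_i(points, values, band=2.0)]
print(labels)  # ['HH', 'HH', 'HH', 'LL', 'LL', 'HL']
```

A positive index arises when a sample and its weighted neighborhood deviate from the mean in the same direction, and a negative index when they deviate in opposite directions, which is how the four patterns above are separated.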

3.2. Robust principal component analysis

Principal component analysis (PCA) is a popular multivariate analysis method for reducing the dimensionality of datasets and integrating several correlated variables into a single principal component (PC) (Jolliffe, 2002). In applied geochemistry, the obtained principal component could represent a meaningful elemental association related to mineralization (Zuo, 2011). Traditional PCA relies on the classical variance or covariance matrix and is therefore sensitive to outliers. Robust PCA (RPCA) is instead based on the minimum covariance determinant estimator (Filzmoser et al., 2009; Zuo et al., 2013). RPCA can therefore overcome this shortcoming of traditional PCA and reduce the influence of outliers (Zuo, 2014). In addition, the isometric logratio (ilr) transformation (Egozcue et al., 2003) is used in RPCA to reduce the effects of the closed data problem. However, the ilr transformation loses the one-to-one correspondence between the original variables and the transformed variables, so results generated from ilr-transformed variables are difficult to interpret (Egozcue and Pawlowsky-Glahn, 2005). To interpret them, the results (e.g., loadings and scores of PCA) sometimes have to be back-transformed into the centered logratio (clr) space (Filzmoser et al., 2009; Reimann et al., 2008). With these steps, RPCA can lead to robust and reliable results.

Most geochemical datasets consist of many variables. For instance, the National Geochemical Survey of Australia dataset has 60 variables (de Caritat et al., 2010), and the Chinese National Geochemical Mapping dataset has 39 variables (Xie et al., 1997). However, only a few elements are linked to mineralization and can guide mineral exploration. Identifying elemental associations related to a specific mineralization is a difficult task when mapping geochemical exploration data. Zuo (2018) summarized three techniques for identifying elemental associations for a specific type of mineral deposit: studying the geological characteristics of known mineral deposits, multivariate analysis, and spatial analysis. In this study, RPCA was used as the multivariate analysis. More details on RPCA can be found in Filzmoser et al. (2009).
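The logratio step described above can be sketched in a few lines of Python. This is a minimal illustration, not the RPCA procedure itself: it covers only the clr and ilr transformations and the back-transformation from ilr to clr coordinates, and it leaves out the minimum covariance determinant estimator. The Helmert-type orthonormal basis is one common choice and is an assumption here, as are the function names and the three-part example composition.

```python
import math


def clr(x):
    """Centered logratio of one composition (all parts must be > 0)."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def helmert_basis(d):
    """Return d-1 orthonormal, zero-sum columns spanning the clr plane."""
    basis = []
    for k in range(1, d):
        scale = math.sqrt(k / (k + 1.0))
        column = [scale / k] * k + [-scale] + [0.0] * (d - k - 1)
        basis.append(column)
    return basis


def ilr(x, basis):
    """Isometric logratio coordinates: project clr(x) onto the basis."""
    c = clr(x)
    return [sum(ci * vi for ci, vi in zip(c, column)) for column in basis]


def ilr_to_clr(z, basis):
    """Back-transform ilr coordinates (or PCA loadings) into clr space."""
    d = len(basis[0])
    return [sum(zk * column[i] for zk, column in zip(z, basis)) for i in range(d)]


# Hypothetical three-part composition (e.g., percentages summing to 100).
x = [10.0, 20.0, 70.0]
basis = helmert_basis(len(x))
z = ilr(x, basis)            # two coordinates, free of the closure constraint
back = ilr_to_clr(z, basis)  # one value per original part again
```

The ilr coordinates have one dimension fewer than the composition, which is why they no longer map one-to-one onto elements. Multiplying by the same basis returns values in clr space, where each entry again corresponds to a single element and loadings can be read per element.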

3.3. Deep autoencoder network

ML algorithms have been widely applied in many fields. Recently, they have been employed for mapping exploration geochemical data (e.g., Chen et al., 2014, 2017; Zhao et al., 2016). This is because they can model complex and unknown multivariate geochemical distributions and extract meaningful elemental associations related to mineralization (Zuo, 2017; Zuo and Xiong, 2018). DL, as a type of ML algorithm, has a strong ability to automatically extract high-level representations from complex data and has potential for processing geochemical exploration data (Xiong and Zuo, 2016; Zuo and Xiong, 2018; Chen et al., 2019a, 2019b; Li et al., 2019; Zuo et al., 2019).

As a typical DL model, the deep autoencoder network (DAN) proposed by Hinton and Salakhutdinov (2006) has been successfully applied to anomaly detection based on reconstruction errors (e.g., Valentine and Trampert, 2012; Fiore et al., 2013; Sakurada and Yairi, 2014; Sun et al., 2014). The principle of DAN for geochemical anomaly detection is that geochemical anomalies are represented by small sample sizes and are poorly encoded because they occur with low probability. Therefore, geochemical anomalies have high reconstruction errors in the DAN model. Based on this principle, DAN has been successfully applied to recognize geochemical anomalies (Xiong and Zuo, 2016; Zuo et al., 2019) and to map mineral prospectivity (Xiong et al., 2018). The training of DAN consists of two phases. The first phase is pretraining, in which each restricted Boltzmann machine (RBM) is trained separately to initialize the weights: once the training of one RBM is finished, another RBM is "stacked" on top of it, taking its input from the output of the previous RBM. The second phase is unrolling and fine-tuning, in which the whole deep network is fine-tuned via back-propagation to adjust all the parameters simultaneously. Thus, the weights of the multi-layer feed-forward neural network are initialized by the weights of the stacked RBMs instead of by traditional random initialization.
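The reconstruction-error principle can be illustrated with a much simpler model than the DAN used in this study. The sketch below is an assumption-laden toy: it trains a single-hidden-layer autoencoder (tanh encoder, linear decoder) with plain stochastic gradient descent from small random weights, with no RBM pretraining and no stacking. The function names, hyperparameters, and synthetic data are hypothetical and do not reproduce the settings reported in this paper.

```python
import math
import random


def train_autoencoder(samples, n_hidden=2, epochs=300, lr=0.05, seed=0):
    """Fit a one-hidden-layer autoencoder by SGD (illustrative sketch)."""
    rng = random.Random(seed)
    n_in = len(samples[0])
    # Encoder weights w[k][i] and decoder weights v[i][k], small random init.
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    v = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_in)]
    for _ in range(epochs):
        for x in samples:
            h = [math.tanh(sum(wk[i] * x[i] for i in range(n_in))) for wk in w]
            y = [sum(v[i][k] * h[k] for k in range(n_hidden)) for i in range(n_in)]
            err = [y[i] - x[i] for i in range(n_in)]
            # Back-propagate the squared reconstruction error to the encoder.
            grad_h = [
                sum(err[i] * v[i][k] for i in range(n_in)) * (1.0 - h[k] ** 2)
                for k in range(n_hidden)
            ]
            for i in range(n_in):
                for k in range(n_hidden):
                    v[i][k] -= lr * err[i] * h[k]
            for k in range(n_hidden):
                for i in range(n_in):
                    w[k][i] -= lr * grad_h[k] * x[i]
    return w, v


def reconstruction_error(x, w, v):
    """Squared error between a sample and its reconstruction."""
    h = [math.tanh(sum(wk[i] * xi for i, xi in enumerate(x))) for wk in w]
    y = [sum(vi[k] * h[k] for k in range(len(h))) for vi in v]
    return sum((yi - xi) ** 2 for yi, xi in zip(y, x))


# Hypothetical background samples lying on one line in a 3-variable space,
# plus one sample that departs from that pattern.
background = [(t, 0.5 * t, -t) for t in (-1 + 0.1 * i for i in range(21))]
anomaly = (2.0, 2.0, 2.0)
w, v = train_autoencoder(background)
background_errors = [reconstruction_error(x, w, v) for x in background]
anomaly_error = reconstruction_error(anomaly, w, v)
```

Because the network only learns to reproduce the dominant background pattern, the sample that departs from that pattern is reconstructed poorly and receives the larger error, which is the quantity used as the anomaly score.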

4. Results and discussion

Mapping geochemical exploration data plays a critical role in mineral exploration. Various methods have been successfully applied to handle geochemical exploration data. These methods include traditional methods such as mean ± 2 × standard deviations (Hawkes and Webb, 1962), probability graphs (Sinclair, 1974), exploratory data analysis (Tukey, 1977), geostatistics (Matheron, 1962), the gap statistic (Miesch, 1981; Wang and Zuo, 2016), multivariate statistics (Grunsky, 2010; Zuo, 2011), fractal/multifractal models (Cheng et al., 1994, 2000; Cheng, 2007; Zuo and Wang, 2016), and ML algorithms (Chen et al., 2014, 2017; Xiong and Zuo, 2016; Zuo and Xiong, 2018; Zuo et al., 2019).

In general, in the field of exploration geochemistry, the main aims of mapping geochemical exploration data are to explore spatial geochemical patterns, reveal geochemical element associations related to mineralization, and identify geochemical anomalies associated with mineralization in support of mineral exploration. In this section, ESDA, RPCA, and DAN are employed as representative tools for data statistics, data mining, and data insight and prediction, respectively, to process geochemical exploration data and identify geochemical anomalies.

4.1. ESDA

The major ore-forming elements Sn and Cu were selected as examples for spatial autocorrelation analysis using the local Moran's I index in ArcGIS 10.2. In accordance with the First Law of Geography (Tobler, 1970), the weights matrix for the cluster and outlier analysis was created based on the inverse Euclidean distance between stream sediment sampling points. The distance band, or threshold, was set to 20 km; features more than 20 km from a target feature were ignored in the analysis of that feature.

There were a total of 593 points. For Sn, 538 points were classified as neither outliers nor clusters, and 51 HH clusters, 3 LH outliers, and 1 HL outlier were detected (Fig. 5a). Most of the known Sn polymetallic mineralization in the eastern part of the district is located around the HH clusters. For Cu, 554 points were classified as neither outliers nor clusters, and 37 HH clusters, 1 LH outlier, and 1 HL outlier were detected (Fig. 5b). The cluster and outlier pattern of Cu is similar to that of Sn: more than 89% (33/37) of the Cu HH clusters were also classified as Sn HH clusters, and the Cu HL outlier is in the same location as the Sn HL outlier. This implies that Cu and Sn are spatially correlated and could be related in origin (Cheng and Mao, 2010; Cheng et al., 2012).

4.2. RPCA

The major ore-forming elements Ag, Au, As, Bi, Cd, Co, Cu, Hg, Mo, Ni, Pb, Sb, Sn, W, and Zn were selected to reveal the elemental association related to Sn polymetallic mineralization. Here, the RPCA method first opened the raw data using the ilr transformation to address the data closure problem; it then combined multiple geochemical variables (Zuo et al., 2013). The results of RPCA on the ilr-transformed geochemical data suggest that the negative loadings of PC2, with the assemblage of Ag, Bi, Co, Cu, Ni, Pb, Sn, W, and Zn, may represent Sn polymetallic mineralization (Fig. 6a and Table 1). Areas with high PC2 scores, which show a strong spatial correlation with the Sn polymetallic mineralization, are mainly distributed in the eastern part of the study area (Fig. 6b). The areas with high PC2 scores that occupy 5% and 10% of the total area contain 34.1% and 54.5% of the known Sn deposits, respectively. The PC2 scores in the western part of the study area are relatively low compared to those in the eastern part. The high geochemical background in the eastern part might inhibit the recognition of relatively weak anomalies in the western part of the study area. A previous study suggested that the western part of the area has great potential for undiscovered Sn deposits (Cheng, 2007). Therefore, the elemental association anomalies based on the assemblage of Ag, Bi, Co, Cu, Ni, Pb, Sn, W, and Zn should be further studied in this area.

4.3. DAN

Based on the results of RPCA, Ag, Bi, Co, Cu, Ni, Pb, Sn, W, and Zn were selected for detecting geochemical anomalies associated with Sn mineralization via DAN. The anomaly recognition index of DAN is the reconstruction error, which is high for geochemical anomaly samples and low for geochemical background samples. Choosing appropriate model parameters is critical to the performance of DAN. Based on a number of experiments, the number of input layer units of the DAN was set to 5, and the numbers of hidden layer units were set to 5, 10, 20 and 40, respectively. The number of iterations and the learning rate of the model were fixed at 200 and 0.3, respectively. The detailed process for selecting optimal DAN parameters is described in Xiong and Zuo (2016).

With appropriate model parameters chosen, the reconstruction error corresponding to each cell in the study area was calculated by DAN. The extracted elemental association anomalies based on the assemblage of Ag, Bi, Co, Cu, Ni, Pb, Sn, W, and Zn occur not only in the eastern part of the area, where most of the known Sn polymetallic deposits fall within the target areas, but also in a number of target areas delineated in the western part of the area, where there have not yet been any significant discoveries (Fig. 7). One reason may be the strong ability of deep learning to automatically extract high-level features from complex geochemical data, whose distributions are neither normal nor log-normal but strongly skewed and multi-modal, owing to various complex geological processes, complex erosion processes, and the influence of the composition and distribution of regolith and bedrock (Reimann and Filzmoser, 2000; Reimann et al., 2002; Spadoni, 2006; Yousefi et al., 2013).

The Student’s t value was used to evaluate whether the anomalies obtained by DAN are spatially correlated with the locations of known Sn polymetallic mineralization. In general, a Student’s t value > 1.96 suggests a statistically significant correlation between anomalies and the known mineralization at the 95% confidence level. The larger the Student’s t value, the stronger the spatial correlation (Bonham-Carter, 1994). Fig. 8 shows that the maximum Student’s t value reaches 5.84, well above 1.96, suggesting a strong spatial correlation between the obtained anomalies and the locations of known Sn polymetallic mineralization.
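The text does not spell out how the Student’s t value is computed. A common reading, assumed here, is the studentized contrast of the weights-of-evidence method described by Bonham-Carter (1994), in which the contrast between the positive and negative weights is divided by its standard deviation. The Python sketch below follows that assumption; the function name and the unit-cell counts are hypothetical and are not taken from this study.

```python
import math


def studentized_contrast(n_total, n_anomaly, n_deposit, n_both):
    """Return (contrast C, Student's t) from unit-cell counts.

    n_total: all cells; n_anomaly: cells inside the anomaly;
    n_deposit: cells containing a deposit; n_both: cells in both sets.
    """
    a = n_both                                    # anomaly, deposit
    b = n_anomaly - n_both                        # anomaly, no deposit
    c = n_deposit - n_both                        # background, deposit
    d = n_total - n_anomaly - n_deposit + n_both  # background, no deposit
    n_barren = n_total - n_deposit
    w_plus = math.log((a / n_deposit) / (b / n_barren))
    w_minus = math.log((c / n_deposit) / (d / n_barren))
    contrast = w_plus - w_minus
    # Variance of the contrast is the sum of the variances of both weights.
    std = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
    return contrast, contrast / std


# Hypothetical counts: 1000 cells, 100 anomalous, 20 with deposits,
# 12 of which fall inside the anomaly.
contrast, t_value = studentized_contrast(1000, 100, 20, 12)
print(round(contrast, 2), round(t_value, 2))
```

In practice the calculation is repeated for a series of anomaly thresholds, and the threshold with the largest t value marks the anomaly map that is most strongly associated with the known deposits.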

In addition, from a geological point of view, most of the areas with high values occur either within the Gejiu Formation or around the Gejiu Batholith, and other anomalous areas are developed along structures (Fig. 7). These geological factors played a critical role in the formation of Sn polymetallic mineralization. The Gejiu Batholith provided fluids, heat, and part of the metals (Cheng et al., 2012); the Gejiu Formation offered depositional space for metals (Cheng et al., 2012, 2013b); and faults served as pathways for hydrothermal fluids and as space for the deposition of ore minerals, such as the six E–W trending faults in the Laochang Sn–W–Cu polymetallic deposit in this district (Sun et al., 1987; Jiang et al., 1997; Cheng et al., 2013a, 2013b). These observations indicate that the anomalies detected in this study have a close spatial association with Sn polymetallic mineralization. Therefore, the anomalous areas delineated by DAN in the western part can be considered favorable prospects for undiscovered Sn polymetallic deposits.

5. Conclusions

Geodata science is a discipline that deals with and mines geoscience data in order to derive the geoinformation and geoknowledge of interest. In this paper, a case study on mapping geochemical exploration data was reported to demonstrate the new research paradigm of GDS in geoscience. The following conclusions can be drawn:

(1) GDS is the science of studying and mining geospatial patterns and can derive meaningful and previously unknown geoinformation and geoknowledge;

(2) GDS is a new research paradigm in geoscience and can be used to explore the spatial association of geochemical patterns, mine elemental associations, and recognize geochemical anomalies associated with mineralization via geo-computation and geo-visualization techniques in support of mineral exploration; and

(3) In processing stream sediment geochemical data from the Gejiu district, Yunnan Province, China via GDS, we found that Sn and Cu have a close spatial association with each other, and that Ag, Bi, Co, Cu, Ni, Pb, Sn, W, and Zn can be regarded as pathfinder elements for Sn polymetallic mineralization. The anomalies obtained in this study provide a clue for the next round of mineral exploration in the study area.

Acknowledgements

Thanks are due to two reviewers for their comments and suggestions, which helped us improve this study. This study was jointly supported by the National Natural Science Foundation of China (Nos. 41972303 and 41772344) and the Most Special Fund from the State Key Laboratory of Geological Processes and Mineral Resources, China University of Geosciences (MSFGPMR03-3).

References

Anselin, L., 1995. Local indicators of spatial association—LISA. Geographical Analysis 27, 93–115.
Anselin, L., 1999. Interactive techniques and exploratory spatial data analysis. Geographical Information Systems: Principles, Techniques, Management and Applications 1, 251–264.
Baumann, P., Mazzetti, P., Ungar, J., Barbera, R., Barboni, D., Beccati, A., Campalani, P., 2016. Big data analytics for earth sciences: the EarthServer approach. International Journal of Digital Earth 9, 3–29.
Bonham-Carter, G.F., 1994. Geographic information systems for geoscientists: modeling with GIS. Pergamon Press, Oxford. 398 pp.
Chen, L., Guan, Q., Xiong, Y., Liang, J., Wang, Y., Xu, Y., 2019a. A spatially constrained multi-autoencoder approach for multivariate geochemical anomaly recognition. Computers & Geosciences 125, 43–54.
Chen, L., Guan, Q., Feng, B., Yue, H., Wang, J., Zhang, F., 2019b. A multi-convolutional autoencoder approach to multivariate geochemical anomaly recognition. Minerals 9, 270.
Chen, Y., Lu, L., Li, X., 2014. Application of continuous restricted Boltzmann machine to identify multivariate geochemical anomaly. Journal of Geochemical Exploration 140, 56–63.
Chen, Y., Wu, W., 2017. Application of one-class support vector machine to quickly identify multivariate anomalies from geochemical exploration data. Geochemistry: Exploration, Environment, Analysis 17, 231–238.
Cheng, Q., 2007. Mapping singularities with stream sediment geochemical data for prediction of undiscovered mineral deposits in Gejiu, Yunnan Province, China. Ore Geology Reviews 32, 314–324.
Cheng, Q., Agterberg, F.P., Ballantyne, S.B., 1994. The separation of geochemical anomalies from background by fractal methods. Journal of Geochemical Exploration 51, 109–130.
Cheng, Q., Xu, Y., Grunsky, E., 2000. Integrated spatial and spectrum method for geochemical anomaly separation. Natural Resources Research 9, 43–52.
Cheng, Y., Mao, J., 2010. Age and geochemistry of granites in Gejiu area, Yunnan province, SW China: constraints on their petrogenesis and corresponding tectonic setting. Lithos 120, 258–276.
Cheng, Y., Mao, J., Rusk, B., Yang, Z., 2012. Geology and genesis of Kafang Cu–Sn deposit, Gejiu district, SW China. Ore Geology Reviews 48, 180–196.
Cheng, Y., Mao, J., Spandler, C., 2013a. Petrogenesis and geodynamic implications of the Gejiu igneous complex in the western Cathaysia block, South China. Lithos 175–176, 213–229.
Cheng, Y., Mao, J., Chang, Z., Pirajno, F., 2013b. The origin of the world class tin-polymetallic deposits in the Gejiu district, SW China: Constraints from metal zoning characteristics and 40Ar–39Ar geochronology. Ore Geology Reviews 53, 50–62.

Cleveland, W.S., 2014. Data science: an action plan for expanding the technical areas of the field of statistics. Statistical Analysis and Data Mining, 414–417.
Dai, F., 1996. Characteristics and evolution of rock series, lithogenesis, metallogenesis of crust-derived anatectin magma in Gejiu ore field. Geol. Yunnan 15, 330–344 (in Chinese with English abstract).
de Caritat, P., Cooper, M., Pappas, W., Thun, C., Webber, E., 2010. National Geochemical Survey of Australia: analytical methods manual. Geoscience Australia Record 2010/15, 22 pp.
Egozcue, J.J., Pawlowsky-Glahn, V., Mateu-Figueras, G., Barceló-Vidal, C., 2003. Isometric logratio transformations for compositional data analysis. Mathematical Geology 35, 279–300.
Egozcue, J.J., Pawlowsky-Glahn, V., 2005. Groups of parts and their balances in compositional data analysis. Mathematical Geology 37, 795–828.
Filzmoser, P., Hron, K., Reimann, C., 2009. Principal component analysis for compositional data with outliers. Environmetrics 20, 621–632.
Fiore, U., Palmieri, F., Castiglione, A., De Santis, A., 2013. Network anomaly detection with the restricted Boltzmann machine. Neurocomputing 122, 13–23.
Grunsky, E.C., 2010. The interpretation of geochemical survey data. Geochemistry: Exploration, Environment, Analysis 10, 27–74.
Hawkes, H.E., Webb, J.S., 1962. Geochemistry in mineral exploration. Harper and Row, New York, NY.
Hayashi, C., 1998. What is data science? Fundamental concepts and a heuristic example. In: Data Science, Classification, and Related Methods, pp. 40–51. Springer, Tokyo.
Hinton, G.E., Salakhutdinov, R.R., 2006. Reducing the dimensionality of data with neural networks. Science 313, 504–507.
Jiang, Z.W., Nicholas, H.S.O., Teren, D.B., 1997. Numerical modeling of fault-controlled fluid flow in the genesis of tin deposits of the Malage ore field Gejiu mining district. Economic Geology 92, 228–247.
Jolliffe, I.T., 2002. Principal Component Analysis, 2nd edn. Springer, New York, NY. 487 pp.
Karpatne, A., Ebert-Uphoff, I., Ravela, S., Babaie, H.A., Kumar, V., 2019. Machine learning for the geosciences: Challenges and opportunities. IEEE Transactions on Knowledge and Data Engineering 31, 1544–1554.
Li, K., 2007. Integrated information metallogenic prediction of tin-polymetallic deposit in western Gejiu, Yunnan. A Dissertation Submitted to China University of Geosciences for the Degree of Master of Engineering (in Chinese with English abstract).
Li, S., Chen, J., Xiang, J., 2019. Applications of deep convolutional neural networks in prospecting prediction based on two-dimensional geological big data. Neural Computing and Applications. https://doi.org/10.1007/s00521-019-04341-3.
Ma, X., 2018. Data science for geoscience: leveraging mathematical geosciences with semantics and open data. In: Handbook of Mathematical Geosciences, pp. 687–702. Springer, Cham.
Mao, J., Cheng, Y., Guo, C., Yang, Z., Zhao, H., 2008. Gejiu tin polymetallic ore-field: deposit model and discussion. Acta Geologica Sinica 81, 1456–1468 (in Chinese with English abstract).
Matheron, G., 1962. Traité de géostatistique appliquée. Editions Technip.
Mattmann, C.A., 2013. A vision for data science. Nature 493, 473–475.
Miesch, A.T., 1981. Estimation of the geochemical threshold and its statistical significance. Journal of Geochemical Exploration 16, 49–76.
Naur, P., 1974. Concise Survey of Computer Methods. Petrocelli Books. 397 p.

Journal Pre-proof Qin, D., Li, Y., 2008. Studies on the Geology of the Gejiu Sn–Cu Deposit. Science Press, Beijing, pp. 1–180 (in Chinese with English abstract). Reimann, C., Filzmoser, P., 2000. Normal and lognormal data distribution in geochemistry: death of a myth. Consequences for the statistical treatment of geochemical and environmental data. Environmental geology, 39, 1001-1014. Reimann, C., Filzmoser, P., Garrett, R. G., 2002. Factor analysis applied to regional geochemical

oo

f

data: problems and possibilities. Applied Geochemistry 17, 185-206. Reimann, C., Filzmoser, P., Garrett, R., Dutter, R., 2008. Statistical Data Analysis Explained:

pr

Applied Environmental Statistics with R. John Wiley & Sons, Chichester. 362 pp.

Sakurada, M., Yairi, T., 2014. Anomaly detection using autoencoders with nonlinear dimensionality reduction. In: Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, p. 4.

Sinclair, A.J., 1974. Selection of threshold values in geochemical data using probability graphs. Journal of Geochemical Exploration 3, 129–149.

Spadoni, M., 2006. Geochemical mapping using a geomorphologic approach based on catchments. Journal of Geochemical Exploration 90, 183–196.

Sun, J., Jiang, Z., Lei, Y., 1987. Structure-geochemistry of Malage deposit in Gejiu district. Geochimica 4, 303–311 (in Chinese with English abstract).

Sun, J., Steinecker, A., Glocker, P., 2014. Application of deep belief networks for precision mechanism quality inspection. In: International Precision Assembly Seminar, pp. 87–93.

Symanzik, J., 2014. Exploratory spatial data analysis. In: Fischer, M.M., Nijkamp, P. (Eds.), Handbook of Regional Science. https://doi.org/10.1007/978-3-642-36203-3_76-1.

Tansley, S., Tolle, K.M., 2009. The fourth paradigm: data-intensive scientific discovery, Vol. 1. A.J. Hey. Microsoft Research, Redmond, WA.

Tobler, W.R., 1970. A computer movie simulating urban growth in the Detroit region. Economic Geography 46, 234–240.

Tukey, J.W., 1977. Exploratory data analysis. Addison-Wesley, Reading.

Valentine, A.P., Trampert, J., 2012. Data space reduction, quality assessment and searching of seismograms: autoencoder networks for waveform data. Geophysical Journal International 189, 1183–1202.

Wang, J., Zuo, R., 2016. An extended local gap statistic for identifying geochemical anomalies. Journal of Geochemical Exploration 164, 86–93.

Xie, X., Mu, X., Ren, T., 1997. Geochemical mapping in China. Journal of Geochemical Exploration 60, 99–113.

Xie, X., Wang, X., Zhang, Q., Zhou, G., Cheng, H., Liu, D., Cheng, Z., Xu, S., 2008. Multiscale geochemical mapping in China. Geochemistry: Exploration, Environment, Analysis 8, 333–341.

Xiong, Y., Zuo, R., 2016. Recognition of geochemical anomalies using a deep autoencoder network. Computers & Geosciences 86, 75–82.

Xiong, Y., Zuo, R., Carranza, E.J.M., 2018. Mapping mineral prospectivity through big data analytics and a deep learning algorithm. Ore Geology Reviews 102, 811–817.

Yousefi, M., Carranza, E.J.M., Kamkar-Rouhani, A., 2013. Weighted drainage catchment basin mapping of geochemical anomalies using stream sediment data for mineral potential modeling. Journal of Geochemical Exploration 128, 88–96.

Yu, C., Tang, Y., Shi, P., Deng, B., 1988. The dynamic system of endogenic ore formation in Gejiu tin–polymetallic ore region, Yunnan Province. China University of Geosciences Press, Wuhan, China, 394 pp. (in Chinese with English abstract).

Zhao, J., Chen, S., Zuo, R., 2016. Identifying geochemical anomalies associated with Au–Cu mineralization using multifractal and artificial neural network models in the Ningqiang district, Shaanxi, China. Journal of Geochemical Exploration 164, 54–64.

Zuo, R., 2011. Identifying geochemical anomalies associated with Cu and Pb–Zn skarn mineralization using principal component analysis and spectrum–area fractal modeling in the Gangdese belt, Tibet (China). Journal of Geochemical Exploration 111, 13–22.

Zuo, R., Xia, Q., Wang, H., 2013. Compositional data analysis in the study of integrated geochemical anomalies associated with mineralization. Applied Geochemistry 28, 202–211.

Zuo, R., 2014. Identification of geochemical anomalies associated with mineralization in the Fanshan district, Fujian, China. Journal of Geochemical Exploration 139, 170–176.

Zuo, R., Wang, J., 2016. Fractal/multifractal modeling of geochemical data: a review. Journal of Geochemical Exploration 164, 33–41.

Zuo, R., 2017. Machine learning of mineralization-related geochemical anomalies: a review of potential methods. Natural Resources Research 26, 457–464.

Zuo, R., Xiong, Y., 2018. Big data analytics of identifying geochemical anomalies supported by machine learning methods. Natural Resources Research 27, 5–13.

Zuo, R., 2018. Selection of an elemental association related to mineralization using spatial analysis. Journal of Geochemical Exploration 184, 150–157.

Zuo, R., Xiong, Y., Wang, J., Carranza, E.J.M., 2019. Deep learning and its application in geochemical mapping. Earth-Science Reviews 192, 1–14.

Figure and table captions

Figure 1. Geodata science as an interdisciplinary subject of geoscience and data science.

Figure 2. The data chain of geodata science.

Figure 3. Geodata science tools.

Figure 4. Simplified geological map of the Gejiu region, Yunnan Province, China (after Li, 2007).

Figure 5. Cluster and outlier analysis of (a) Sn and (b) Cu.

Figure 6. Results of robust principal component analysis: (a) biplot of PC1 and PC2, and (b) the spatial distribution of PC2.

Figure 7. Geochemical anomalies detected by the deep autoencoder network.

Figure 8. (a) Plot of Student's t-values vs. geochemical anomaly, and (b) geochemical anomaly map.

Table 1. Loading values of principal component analysis.

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5, panels (a) and (b)

Figure 6, panels (a) and (b)

Figure 7

Figure 8, panels (a) and (b). Axes in panel (a): geochemical anomaly (ticks 0 to 1) and t-values (ticks 0.0 to 7.0).

Table 1

Elements    PC1       PC2       PC3
Ag          0.022    -0.094     0.411
As         -0.275     0.327    -0.256
Au          0.082    -0.002    -0.026
Bi         -0.470    -0.227     0.012
Cd          0.164     0.058     0.512
Co          0.295    -0.238    -0.091
Cu          0.228    -0.249    -0.441
Hg          0.168     0.369     0.243
Mo          0.130     0.387    -0.086
Ni          0.370    -0.181    -0.339
Pb         -0.281    -0.145     0.123
Sb         -0.079     0.522    -0.213
Sn         -0.306    -0.232    -0.068
W          -0.314    -0.113    -0.012
Zn          0.266    -0.182     0.230

Highlights

GDS is the science of studying and mining geospatial patterns, and it can derive meaningful and previously unknown geoinformation and geoknowledge.

GDS is a new research paradigm in geoscience and can be used for geochemical mapping in support of mineral exploration.

GDS can reveal the spatial association, identify the elemental association, and recognize geochemical anomalies.
