Crystallographic prediction from diffraction and chemistry data for higher throughput classification using machine learning




Computational Materials Science xxx (xxxx) xxxx




Jeffery A. Aguiar a,⁎, Matthew L. Gong a,b, Tolga Tasdizen b

a Idaho National Laboratory, Nuclear Materials Department, Idaho Falls, ID 83415, USA
b University of Utah, Scientific Computing Imaging Institute, Salt Lake City, UT 84106, USA

ARTICLE INFO

Keywords: Microscopy; Machine learning; Data analytics; Materials discovery; Material informatics

ABSTRACT

Simultaneously capturing material structure and chemistry in the form of accessible data is often advantageous for drawing correlations and enhancing our understanding of measurable materials behavior and properties. Unfortunately, the data accessible at the required scale is often highly multidimensional and sparse, owing to the historical and evolving nature of materials science. To mitigate these difficulties, we develop and employ methods of data analytics, in conjunction with openly accessible chemistry and structure datasets, to classify and reduce the amount of data needed for extracting useful descriptors from multidimensional techniques. The construction and systematic ablation of our model highlights the potential for dimensional reduction in data sampling, improved classification, and identification of correlations between material crystallography and chemistry.

1. Introduction

Determining crystal structure and chemistry are important primary steps in materials science. Diffraction data encompasses a large range of modalities and acquisition methods, including X-ray diffraction (XRD), electron backscatter diffraction (EBSD), selected area electron diffraction (SAED), and high-resolution atomic scale (scanning) transmission electron microscopy (S/TEM). In parallel, modern spectrometers detect emission and ionization edges characteristic of elements over the entire periodic table at ultrafast timescales [1,2]. Collectively, the hardware to resolve atomic structure and chemistry has greatly improved and accelerated our ability to gather more detailed and specific data sets. Simultaneous collection of chemistry and structural data taken from the same single point on a specimen results in new challenges and opportunities [3–5]. Simultaneous measurements lend themselves to new opportunities in materials-by-design research, where the enormity of data presents new challenges. With this expanding volume of data, there is an increasing need to create tools and packages that can analyze multimodal data gathered on increasingly complex and modern equipment [6]. (See Figs. 1–3.) Automation is poised not only to revolutionize data collection, but also lends itself to new data-driven approaches to materials discovery and subsequent optimization [7–9]. To keep pace with the growing tide of data, new analysis tools and workflows must be created to aid users in parsing volumes of data [10–12]. Machine learning in other fields of study has been predicated on the creation of large curated datasets and access to orders of magnitude more data than previously available [13–15]. As social media created an explosion of images for computer vision algorithms, high-throughput and ultrafast microscopy combined with advances such as in-situ microscopy have the potential to revolutionize autonomous data collection methods, including adaptive image and feature tracking [9]. As materials communities create, curate, share, and aggregate data, new possibilities for machine learning arise, especially considering future exascale computing capabilities [16]. Materials science as a field is at the center of a confluence of technological advancements that will continue to create new research possibilities.

Historically, work on classifying material structure from collected datasets has been challenged by uncertainty, requiring multiple views and, in some cases, modeling to determine crystal structure. Correlations among materials systems, composition, fabrication method, and prior experience have since been expressed as generic rules of thumb for exploring materials [17]. Efforts to utilize diffraction data to determine space group date back decades in materials science, including transformations of pair distribution functions, diffraction profiles, and two-dimensional patterns using a variety of methods [18–23]. Determining space-group information from diffraction patterns has further been based on a number of statistical and brute-force searches [22,24–26]. In many of these cases, exploring a material system has required working within a familiar



Corresponding author. E-mail address: Jeff[email protected] (J.A. Aguiar).

https://doi.org/10.1016/j.commatsci.2019.109409
Received 6 August 2019; Received in revised form 20 October 2019; Accepted 11 November 2019
0927-0256/ © 2019 Published by Elsevier B.V.



Fig. 1. Schematic for Materials Data and Structure. a) The structure of the database used to create and store the training set. b) Relative composition of the database by crystallographic family, genera, and species. A high-level overview of the distribution shows an abundance or scarcity of certain families and genera. The black lines represent portions of data, starting from the inner circle of crystal family to the outer circle of point group data. c) Data allocation across the different folds during training and testing for cross-validation.

Fig. 2. Data Processing and Augmentation Schema for Diffraction. Data sources in the form of diffractograms, translated from high-resolution images or recorded diffraction patterns, lend themselves to extracting radial profiles using the same workflow. The legend separates the figure into acquisition modalities and the processing required to create a uniform feature vector.

material systems [3]. This situation is further complicated by the presence of confounding defects and additional phases that can otherwise be mislabeled, not identified, or ignored due to low signal to noise. Current models can miss this information if the signals are not pronounced or otherwise not identified. This is, however, an often-identified limitation of current techniques and has been an active area of research in extrapolating crystallography, where researchers focused on materials properties and behavior seek to understand how physical structure and chemistry are linked to properties [27–30]. One therefore naturally seeks a simplified and generalized representation of materials data that conveys the underlying complexity of these relationships, including the multidimensional materials sampling problem and accurate classification amid challenges in misclassification and signaling limitations [31,32]. The described scenario is a grand challenge among many communities looking to develop methods for equipping high throughput instruments with near-autonomous workflows utilizing the breadth of generated data [33–35]. To date, machine learning models for crystallography have demonstrated up to 81% accuracy in determining space group from simulated diffraction-based data, but have been challenged to deduce structure from experimental data, especially for lower-symmetry classes and crystal family classifications that require higher accuracy [36]. There is thus a need for improved quantitative tools for deducing both low- and high-symmetry classes from experimental data, and a corresponding need to leverage computational tools and platforms that utilize structural data, chemistry data, or both. Providing ranked predictions from multimodal data has not yet been fully demonstrated in the community. In this work, we present a modular neural network architecture with a simplified and generalizable representation of crystallography and chemistry to classify crystal structure. The generalized representation and space group prediction from diffraction data, chemistry data, or both allows greater flexibility for users to include the tool in their research. The service demonstrates a workflow and analysis tool for high throughput characterization and deduces crystal family, point group, or space group from experimental data. By reducing the complexity and the need for familiarity, the model lends itself to additional opportunities for under-represented and poorly understood materials. This work details the method for creating the deep learning service and demonstrates the increased speed of an automated workflow. The service is designed to alleviate data deluge and provide a simple, reliable workflow that does not require expert knowledge of material structure


Fig. 3. Schematic for modular architecture combining structure and chemistry. A) Modular network schematic and data concatenation for merging chemistry and structural determination from diffractograms. Boxes are color coded to three module types that are illustrated in (B), with encompassing layer architectures.

After checking the CIFs for formatting, diffraction profiles were simulated using pymatgen [47]. Each profile was converted into a single feature vector based on peak positions, forming the basis for the data folds in Fig. 1b. Profiles that contained no peaks in the range between 0.50 and 6 Angstroms were removed from the training set. The remaining profiles were stored in a SQL database with labels for family, genera, species, and chemistry. The cleaned training set shown in Fig. 1c consisted of 431,000 profiles with their associated chemical metadata, including composition.
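The conversion from a simulated profile to a peak-position feature vector might look like the following sketch; the bin count and d-spacing range match those described in Section 2.3, while the function name and plain-list representation are illustrative assumptions rather than the authors' code.

```python
# Sketch of the binary peak-position feature vector (assumed parameters:
# 900 bins spanning 0.5-6.0 Angstroms in reciprocal lattice spacing).
N_BINS = 900
D_MIN, D_MAX = 0.5, 6.0
BIN_WIDTH = (D_MAX - D_MIN) / N_BINS  # roughly 0.006 Angstrom per bin

def profile_to_feature_vector(peak_d_spacings):
    """Map a list of peak d-spacings (Angstroms) to a 0/1 vector of length 900.

    Peaks outside [D_MIN, D_MAX) are ignored; profiles with no in-range
    peaks would be dropped from the training set entirely.
    """
    vec = [0] * N_BINS
    for d in peak_d_spacings:
        if D_MIN <= d < D_MAX:
            vec[int((d - D_MIN) / BIN_WIDTH)] = 1
    return vec

# The 7.5 Angstrom peak falls outside the range and is dropped.
features = profile_to_feature_vector([0.9, 2.05, 3.3, 7.5])
```

Because only presence/absence per bin is recorded, the same vector can be produced from XRD, SAED, or Fourier-transformed S/TEM images, which is the technique-independence argued for in Section 2.3.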

and chemistry in real-time.

2. Methods

2.1. Data collection and cleaning

To assemble a training set, crystal information files (CIFs) were gathered from open materials and crystallography databases [37]. CIF files were primarily acquired from the Materials Project, AFLOW, and the Open Crystallography Database (OCD) [34,38–44]. The CIFs included crystals from all crystal families in the varying proportions reported in Fig. 1. The pre-simulation dataset consisted of 572,000 CIFs encoded with additional material descriptors reported in Fig. 1a. Additional CIFs for underrepresented space groups from the inorganic crystal structure database (ICSD) were included to provide further examples. The CIFs were then used to generate a structured query language (SQL) database with relevant crystallography data associated with computed diffraction profiles as a function of scattering angle, reciprocal lattice spacing, and chemical composition. Encoded into a diffraction profile are the structural fingerprints of any material, based on crystal geometry, underlying atomic coordination, and occupancy. An impinging X-ray, neutron, or electron beam produces a series of peaks in scattered intensity where there is constructive interference, forming a two-dimensional diffraction pattern. Depending on the scattering geometry, one specific pattern is generated per sample orientation, where a single crystal in one orientation will show a series of identifiable peaks in reciprocal space. Filling in all of reciprocal space requires a highly polycrystalline sample or, alternatively, a sample that is precessed through all orientations and diffracting conditions, completing the Ewald sphere. For the purposes of classification, all orientations and identifiable peaks are input into the model. Chemistry is input into the model as an additional descriptor and further augments the model's ability to deduce material structure [45,46].
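For reference, the relation between scattering angle and d-spacing used when computing such profiles is Bragg's law; this small helper (illustrative, not from the paper) converts a 2θ value to a d-spacing for an assumed Cu Kα wavelength:

```python
import math

def two_theta_to_d(two_theta_deg, wavelength=1.5406):
    """Convert a scattering angle 2-theta (degrees) to a d-spacing (Angstroms)
    via Bragg's law, n * lambda = 2 * d * sin(theta), with n = 1.

    The default wavelength is Cu K-alpha (~1.5406 Angstroms), a common
    laboratory X-ray source; electron or neutron sources would substitute
    their own wavelengths.
    """
    theta = math.radians(two_theta_deg / 2.0)
    return wavelength / (2.0 * math.sin(theta))

d = two_theta_to_d(44.0)  # a peak near 44 degrees maps to ~2.06 Angstroms
```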

2.3. Data representation

Simplifying the representation of diffraction and chemistry as shown in Fig. 3 allows for a broader set of data acquisition methods. Whether the input is a Fourier-transformed high-resolution atomic scale image or a diffraction profile acquired using electrons, neutrons, or X-rays, the relevant atomic scattering peaks are positioned with respect to their crystallographic scattering position in reciprocal space. A classification model that considers peak position alone in reciprocal space is therefore impervious to changes in technique. Training data were therefore built from CIFs, whose simulation provides a wealth of available features for training models, including chemical signatures. We opted for a minimalist representation of the features expected to be present across the widest range of acquisition methods. For diffraction, we reduced the profile to a vector of peak locations. Each position was represented in the vector as 0 if no peak was detected and 1 if a peak was detected. We divided the peaks into 900 bins uniformly partitioning the range from 0.5 to 6 Angstroms in reciprocal lattice spacing. This range was chosen to accommodate a wide variety of techniques to within 0.10 Angstrom resolution. By imposing fewer requirements and assumptions on the model inputs, we were able to create a generalized model that builds in uncertainty to deduce classification and augmentation.

2.4. Data augmentation schema

2.2. Data parsing and curation

In addition to the simulated diffraction profiles, a set of augmentation operations was defined on the dataset for use during training to bolster the training data. This includes a relative peak assignment uncertainty of ±0.3 Angstrom in reciprocal space. The value was chosen based on a window of uncertainty among common refinement methods and scattering sources. Neural networks require larger training sets than other machine

The CIFs were checked for consistency and proper formatting. CIFs that were missing structures or chemical formulas, or whose symmetry operations were inconsistent with their space group in Fig. 1a, were removed from the training set. The removal was performed in a manner consistent with established crystallographic rules and classes. CIFs missing one essential field, such as structure, were often missing other fields.


split classification into three stages along phylogenic lines: crystal families, point groups, and space groups. In keeping with the phylogenic schema, we refer to these hierarchical levels as family, genera, and species, respectively. At each level of the hierarchy new models were trained: an ensemble of models to predict family; within each family, models to predict genera; and lastly, for each genus, models to differentiate species. Due to the large branching nature of the schema, models were not trained end to end, but instead used the previous predictions to determine which model to use in the next step. To combine diffraction and chemistry data, two parallel networks were created to learn from the distinct inputs. The architecture shown in Fig. 3 was created from three modules: two designed to learn from the input data types and one to perform a task, in this case classification. The modular design was created to test various hypotheses and to allow flexibility in training and retraining portions of the network as submodules, without the challenge of retraining an entire network. The modular architecture also allows the network to be easily extended and retrained in parts to incorporate additional datatypes should other data or combinations become available. Comprised of a series of convolutional layers with max pooling, the diffraction module was designed to capture the spatial component of the signal. Lacking a spatial component, chemistry is captured by stacked dense layers. During training, each layer is followed by a normalization layer. The outputs of both networks were then concatenated and used as feature vectors for a classification module, which contained a series of stacked dense layers ending with a SoftMax layer for classification. Models at different levels of the hierarchy are based on the same architecture.
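The staged dispatch just described, where each prediction selects the model used at the next level, can be sketched as follows; the lambdas and registry keys are hypothetical stand-ins for the trained family, genera, and species networks:

```python
# Hypothetical sketch of hierarchical dispatch: predictions at one level
# select which model runs at the next. The lambdas stand in for trained
# networks; real models would map a feature vector to a class label.
family_model = lambda x: "cubic"
genera_models = {"cubic": lambda x: "m-3m"}          # one model per family
species_models = {("cubic", "m-3m"): lambda x: 225}  # one model per genus

def classify(feature_vector):
    """Run family -> genus -> species in sequence, using each prediction
    to look up the next model rather than training end to end."""
    family = family_model(feature_vector)
    genus = genera_models[family](feature_vector)
    species = species_models[(family, genus)](feature_vector)
    return family, genus, species

result = classify([0] * 900)
```

A benefit of this structure is that retraining one branch (say, the tetragonal genera model) leaves every other entry in the registries untouched, matching the modularity argued for in the text.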
Due to the large number of hyperparameters to test, the optimal parameters found at the family level were applied to all genera- and species-level models. The genera- and species-level models used different final SoftMax layers than the family model to accommodate the varying number of classes. As the ablation study will show, optimizing each stage separately or using a recurrent neural network architecture could be an area for further research. An example network for classifying family comprises a diffraction module, a chemistry module, and a classification module. The diffraction module contains four stacked blocks of convolution, pooling, normalization, and activation layers. The initial convolutional layer is comprised of 3 × 3 kernels with an output tensor of 1 × 40 × 900. After pooling, the output is batch size × 40 × 450. Repeating the process of convolution and pooling three times yields a final output shape of 1 × 40 × 112, which is then flattened into a 1 × 4032 tensor to be concatenated with the output of the chemistry module. The chemistry module contains four stacked blocks of dense, normalization, and activation layers. The initial chemistry input is a 1 × 118 tensor containing the atomic composition of the elements present in the structure. The first dense layer contains 20 nodes, and subsequent layers have 15, 11, and 8 nodes, respectively. The outputs of the two modules are then concatenated and passed to the dense layers of the classification module. The classification module had four blocks of dense, normalization, and activation layers. Its dense layers had 500, 250, and then C nodes, respectively, where C is the number of classes at that stage in the hierarchy. For example, if the classification module were for families, C would be seven. The last layer of the classification module is a SoftMax layer.
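The pooled tensor lengths quoted above can be checked with simple shape arithmetic; the helper below (an illustrative sketch, not the authors' code) tracks the spatial dimension through three stride-2 poolings, one consistent reading of the quoted 900 → 450 → 225 → 112 sequence:

```python
def pooled_lengths(length=900, stages=3, pool=2):
    """Track the spatial length through successive stride-2 poolings.

    With integer floor division, 900 -> 450 -> 225 -> 112, matching the
    1 x 40 x N shapes quoted for the diffraction module.
    """
    lengths = [length]
    for _ in range(stages):
        length //= pool
        lengths.append(length)
    return lengths

shapes = pooled_lengths()  # [900, 450, 225, 112]
```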

learning algorithms. To address both the scarcity and imbalance of the rarer classes, we defined a set of functions that would mimic data collected in an experimental setting. We defined two augmentations for the diffraction input and one for the chemistry. The functions were chosen to replicate experimental variations that are plausible across experimental modalities. Diffraction augmentation accounts for variations in camera calibration and peak localization methods. Peak positions were shifted by a number of bins drawn from a normal distribution centered at 0 with a variance of 1.5 bins, where the width of a bin equated to 0.006 1/Angstrom. The range of possible shifts was chosen to account for differences in binning method, centering of experimental data, and dispersion variation over the entire input profile. For atomic percentage, we allowed a composition to change by up to 5 atomic % (at.%) or 5 parts per million (ppm) to mimic the experimental uncertainty among common experimental modalities. Methods for chemical composition analysis of materials include energy dispersive X-ray spectroscopy (EDS), atom probe tomography (APT), mass spectrometry (MS), and electron energy loss spectroscopy (EELS). Absent quantified standards to calibrate the results of these techniques, there is an upper uncertainty bound of 5 at.%; we implemented this value in the model to cover a significant range and a higher ablation value. With higher certainty, the statistics lend themselves to improved classification, while the potential for higher background is captured with higher ablation.

2.5. Experimental validation set

A robust processing pipeline, reported in Fig. 2, was developed for the different collection modalities of diffraction to create the necessary feature vector for model input.
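The augmentation operations described in Section 2.4 might be sketched as below; the function names, the use of Python's random module, and the clamping choices are illustrative assumptions rather than the authors' implementation:

```python
import random

def augment_peaks(peak_bins, n_bins=900, variance=1.5):
    """Shift each detected peak by a whole number of bins drawn from a
    normal distribution centered at 0 (variance ~1.5 bins), mimicking
    camera-calibration and peak-localization uncertainty."""
    sigma = variance ** 0.5
    shifted = [0] * n_bins
    for i, present in enumerate(peak_bins):
        if present:
            j = i + round(random.gauss(0, sigma))
            if 0 <= j < n_bins:
                shifted[j] = 1
    return shifted

def augment_composition(at_pct, max_delta=5.0):
    """Perturb each atomic percentage by up to +/- 5 at.%, the assumed
    upper uncertainty bound across EDS, APT, MS, and EELS."""
    return [max(0.0, x + random.uniform(-max_delta, max_delta)) for x in at_pct]
```

Applying these functions on the fly during training effectively enlarges the dataset, since each epoch sees a slightly different version of every profile and composition.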
Two-dimensional diffraction data are azimuthally integrated to create a profile in pixel space that is used alongside calibration settings to determine the d-spacings of the peaks. To detect peak positions in reciprocal spacing, profiles are processed through a max-voting algorithm. The voting algorithm often does not require background subtraction to fit peaks; instead, it utilizes a max-pooling variational profile to define a rising feature as a peak. The detected peak locations are binned and cataloged by position. Chemistry data are implemented as a simple binary vector to capture the presence of elements in a material and, if available, a vector of atomic percentages.

3. Machine learning model development

3.1. Model selection

Machine learning algorithms including random forests, naïve Bayes, and support vector machines (SVMs) were compared with artificial neural networks to determine which algorithms would be best suited to the task of structural characterization. Training was performed on an Nvidia DGX-1 utilizing multiple Tesla V100 graphics processing units (GPUs) within the high performance computing resources at Idaho National Laboratory (INL) [48]. Neural networks were shown to have many positive properties and higher predictive capability for this task than the other machine learning algorithms. The availability of large datasets and augmentation methods made it possible to train neural networks. Neural networks (NNs) represent a class of learning algorithms. We use convolutional neural networks (CNNs) to capture the diffraction inputs due to their spatial component [49]. A series of dense layers was used to capture the chemistry.

4. Training

4.1. Training data

3.2. Hierarchical classification

Due to the imbalance in membership at all levels of the hierarchy, a leave-one-out cross-validation split was used to generate training and validation data instead of a single training, validation, and testing set, as a single balanced testing set would have either been small or used all

Attempts to directly classify space group in the past were challenged by overrepresented classes. A hierarchical approach decomposes the difficult classification problem into smaller, more manageable tasks. We


For diffraction, convolutional blocks containing at least four layers were designed to extract spatially ordered data (Fig. 4a). Comparatively, the diffraction module in Fig. 4b is composed of sequential convolution blocks ending with a flattening operation for classification. Chemistry modules composed of sequential dense blocks ending with a single dense layer (Fig. 4c) create a simplified representation of chemistry, while dense blocks containing three layers (Fig. 4d) are designed to find relationships between non-spatially ordered variables. These selections of layers highlight the refinements in the model for additional data and augmentations. For the genera and species levels, the ablation studies started with diffraction only and then added additional features, including chemistry.

the examples of the rare classes. Instead, as reported in Fig. 1b, we split the data into five folds, trained on four, and tested against the remaining portion. Models trained on each fold combination were aggregated and compared to determine how much the model was overfitting and whether generalization occurred. Within the open materials data repositories there are significant imbalances in the numbers of space groups, point groups, and crystal families. To address the imbalance between classes, a weighting was applied to the loss function during training to incentivize correctly predicting rarer classes and high symmetry. Considering uncertainties in peak position, ablated peaks, missing elements, and composition numbers lent itself to further data augmentation, bolstering rarer classes during training and generating more training data.
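One common recipe for such loss weighting is inverse-frequency class weights; the paper does not specify the exact scheme used, so the sketch below is an assumption:

```python
from collections import Counter

def class_weights(labels):
    """Weight each class inversely to its frequency so rare classes
    contribute as much to the training loss as common ones."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}

# With an 80/20 imbalance, the rare class receives a 4x larger weight.
w = class_weights(["cubic"] * 80 + ["triclinic"] * 20)
```

These per-class weights would then be passed to the loss function so that misclassifying a triclinic example is penalized more heavily than misclassifying a cubic one.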

4.4. Family level ablation results

4.2. Hyperparameter search

To determine the roles that chemistry and diffraction have in classification, versions of the model incorporating only one modality were trained for comparison. The model containing only chemistry had limited predictive power, as seen in Fig. 5A, compared to the other models. Within the cubic class the model performs at 98% accuracy, significantly better than random. Fig. 5B shows the model trained on diffraction only. The model performs well across all families with an average accuracy of ~88%, with the largest drop in accuracy between the monoclinic and triclinic families. These classes have minimal symmetry operations, resulting in primitive representations of the atomic arrangement of materials. Fig. 5C shows the effect the number of bins has on the model's ability to predict. The model was trained using a reduced feature vector with 180 features instead of 900, while parameters including the number of layers, stride, kernel size, and normalization remained the same. The model accuracy suffers noticeably, dropping an average of ~30% across all families, and mode collapse is observed in the band of misclassifications surrounding orthorhombic. The confusion matrices in Fig. 5D–G are for models utilizing both diffraction and chemistry data. These models had key features (chemistry augmentation, diffraction augmentation, and normalization) removed to evaluate their effectiveness and strength: diffraction only without normalization during training (Fig. 5D), diffraction only with normalization (Fig. 5E), diffraction and chemistry (Fig. 5F), and diffraction and chemistry with diffraction augmentation (Fig. 5G). There are marginal improvements from adding diffraction augmentation: ~1–2% improvement from the orthorhombic through the cubic family, but a decrease in performance for monoclinic of ~5%. Allowing the atomic percentage to vary by a margin of 5 at.% decreases performance at lower symmetry without a noticeable increase in accuracy at higher symmetries when predicting family. This behavior suggests that, generically, material chemistries are not organized over crystal

Starting with a modular architecture presents some challenges when optimizing the complete architecture, because it introduces several potential axes of tunable hyperparameters. For this case study, the architecture is composed of three modules that can contain a variable number of layers and types of connections. To determine which hyperparameters to hold fixed and which to include in the optimization, one-off model comparisons were used to see whether changing a specific parameter (e.g., layer depth, stride, or kernel size) yielded noticeable changes in model performance. Due to the size of the search space, initial comparisons were made between partially trained models to reduce the cycle time of iterations. Parameters of the augmentation functions were held constant across all levels of the hierarchy.

4.3. Hierarchical model ablation on crystal family, genera, and species classification

To determine which portions of the model contributed most to the predictive accuracy, an ablation study was performed at each level of the hierarchy that compared variations of the model. Augmentations produce different effects at each level of classification. We discuss the implications of these variations further as they appear during the hierarchical ablation study. To elucidate which portions of the deep learning model were most impactful for classification, variations of the model were trained with the same hyperparameters. At the family level, we tested models that contained only the diffraction module, only the chemistry module, permutations of augmentation, and no normalization. Due to the combinatorial nature of the possible variations and the specific attributes targeted by the ablation study, a selection of variations is shown in Fig. 4, progressing through the different case studies for predictions based on diffraction, chemistry, or both.

Fig. 4. Module Architectures and Block Descriptions. a) Convolution blocks contain four layers designed to extract spatially ordered data. b) Diffraction modules are composed of sequential convolution blocks ending with a flatten operation. c) Chemistry modules are composed of sequential dense blocks ending with a single dense layer to create a simplified representation of chemistry. d) Dense blocks contain 3 layers and are designed to find relationships between non-spatially ordered variables.


Fig. 5. Confusion matrices of family level predictions. Predicted and expected family classification, where predicted is the vertical axis and expected is the horizontal axis, starting with triclinic (1), monoclinic (2), orthorhombic (3), tetragonal (4), trigonal (5), hexagonal (6), and cubic (7). Confusion matrices trained on: A) chemistry only, B) diffraction only, C) diffraction with wider bins, D) diffraction only without normalization during training, E) diffraction only with normalization, F) diffraction and chemistry, G) diffraction and chemistry with diffraction augmentation, H) diffraction and chemistry with combined augmentations. Values reported in percentages.

Table 1. Results of the Family-to-Genera Ablation Study. Numbers reported in the table are averages across all genera present within the family; individual genera may perform better than the family average. Within each family, common genera have higher accuracy than rarer genera. Values reported in percentages.

Chemistry augmentation had a positive effect on more balanced datasets, with an average increase of 2–7% accuracy for most classes. For the orthorhombic and tetragonal crystal families, having only chemistry augmentation during training decreased correct classification of uncommon genera by ~10–15%. Diffraction augmentation had a positive effect for predicting genera within the trigonal and hexagonal families, with an average increase of 2–4%. For the other crystal families, it had a negative effect, lowering accuracies by 8–12%. Within the cubic family, distinguishing peaks form tightly clustered distributions with less variance; allowing diffraction peak positions to shift by more than 0.02 Angstroms (3 bins) obscures critical information and produces worse models that are heavily prone to mode collapse. At lower symmetry, there is a significant imbalance within the data, where the peak distributions have higher variance, consistent with the reduced set of symmetry operations for these more primitive classes. Combined augmentations produced more consistent models for predicting genera within the tetragonal, trigonal, and hexagonal crystal

family; however, within a crystal family at the point group level, considering chemistry improves classification by as much as 9%, as reported in Table 1.

5. Genera level ablation results

With general trends apparent at the family level, a reduced set of model variations was used in the ablation study of the genera-level classification. The five variations tested were: diffraction only; with chemistry and no augmentation; with chemistry and chemistry augmentation; with chemistry and diffraction augmentation; and with chemistry and both variants of augmentation. Table 1 shows the average changes across all genera within each family as different features were added to the model. Adding chemistry improved the predictive power of models across all crystal families except hexagonal. The largest improvement in accuracy was in distinguishing between cubic genera, with an average 9% increase in accuracy and significantly higher improvements for less common genera.


Within families with lower variance, models trained on different folds bettered the accuracy for rare and uncommon classes while decreasing the accuracy for common classes. The tradeoff between higher accuracy for common classes and better predictive power for rare classes represents a choice when considering how the model will be used. The variance in performance from augmentation appears to be a function of data balance and prevalence, as well as of symmetry within a crystal family, consistent with the arrangement and organization of materials and crystallography.

5.1. Species level ablation results

At the species level, imbalance between classes creates a noticeable effect, with mode collapse affecting several genera within the orthorhombic and cubic crystal families. Individual species comprising greater than 90% of the population of their genera present a significant imbalance when considering the training and implementation of our models. Two different accuracies are reported to highlight the disparity between common and rare species: raw accuracy is the percentage of correctly classified profiles across all species, while scaled accuracy is the average accuracy of each species within the genus. Raw accuracy exceeding scaled accuracy is a symptom of imbalance, where the trained model becomes preferential to common species due to their prevalence in the training set. Even outside of the extreme cases, imbalance between classes increased when going from the genera level to the species level. Models with diffraction only, combined diffraction and chemistry, and combined data with augmentation were compared for the ablation study, and the results are captured in Table 2. Chemistry had a pronounced effect, improving performance in predicting species within all genera by between 10% and 35%.

Table 2. Genera to species ablation study summary. Change in accuracy is the change in raw accuracy. Genera are color coded to match their respective families in Table 1 to highlight the structure of the species (space groups). Genera 1, 2, 5, and 22 are omitted because they contain only a single species. Genera 19 and 21 are omitted because insufficient profiles were available to train models, with fewer than 1000 profiles within the cleaned training set.

Despite the significant ability to classify materials, we note that the information contained in the training data is not uniformly distributed across all space groups, crystal families, or material classes. It is unclear whether the abundance of crystals in the common classes is representative of the true distribution of materials or a sampling bias that is a product of past research efforts being concentrated on specific materials. The imbalance between space groups within the dataset proved to be one of the greatest challenges in producing good models. Prior developments by Ward et al. and Oviedo et al. are similar tools utilizing machine learning for the purposes of evaluating and extracting crystallographic structure from high resolution imaging, diffraction-based, and first-principles-based datasets [31,50]. Differing in implementation, the model discussed here provides a tool for determining phase utilizing a modular architecture that takes clear advantage of augmentation and of structural and chemical data. In light of the significant imbalances in the data, we were able to train a hierarchical set of models and test them to a high accuracy, above 80%, at all levels of crystallography. There are, however, identified limitations and challenges worth noting, including distinguishing between nearly symmetrically similar space groups, overlapping diffraction peaks, and multiple phases captured in a single diffraction pattern. For these complications there are suggested strategies for utilizing less prevalent diffraction peaks and combinations that lend themselves to crystal family and space group differentiation. Similarly, additional sample orientations can be acquired to disambiguate the prediction further in scenarios where ambiguity may exist. An additional challenge in our approach is the potential for determining multiple unknown phases captured in a single diffraction pattern or a stack of patterns. Utilizing the same predictive framework, a strategy has been implemented to predict over all permutations, forming a statistical distribution rather than a single set of predictions. Because prediction with our deep learning framework completes in a few milliseconds, the architecture and speed lend themselves naturally to additional implementations and development beyond the initial training for crystallography. For the purposes of evaluating a deep learning model for material classification from diffraction and chemistry data, we were able to develop, evaluate, and deploy a modular predictive architecture for materials prediction. As additional information becomes available and experiments are performed, the modular architecture can be further evolved.

6. Concluding remarks

This paper shows the development and demonstration of a deep learning hierarchical model for materials classification and discovery from material structure, chemistry, or both combined. Modular neural networks provide a flexible framework to build multimodal models. With an average accuracy above 85% at each level of the hierarchy, the deep learning model can predict the space group of an unknown crystal structure without any a priori information. By providing a ranked list of possible space groups and potential chemistries, the deep learning-based model and workflow represents a milestone towards fully automated materials applications, where readily identifying materials and their behavior is a focus.
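The ranked-list output described above can be sketched by chaining per-level classifiers and scoring each candidate space group by its joint probability down the hierarchy. This is a hedged illustration, not the authors' code: the three-level family/genus/species hierarchy follows the paper, but the `predict_family`, `predict_genus`, and `predict_species` callables are hypothetical stand-ins for the trained networks.

```python
from typing import Callable, Dict, List, Tuple

def ranked_space_groups(
    profile,
    predict_family: Callable,   # profile -> {family: probability}
    predict_genus: Callable,    # (profile, family) -> {genus: probability}
    predict_species: Callable,  # (profile, genus) -> {space_group: probability}
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Chain the family -> genus -> species classifiers and rank
    candidate space groups by joint probability along the tree path."""
    scores: Dict[str, float] = {}
    for family, p_f in predict_family(profile).items():
        for genus, p_g in predict_genus(profile, family).items():
            for sg, p_s in predict_species(profile, genus).items():
                # Keep the best-scoring path to each space group.
                scores[sg] = max(scores.get(sg, 0.0), p_f * p_g * p_s)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

Returning the full ranked list rather than a single argmax is what allows downstream disambiguation, for example against additional sample orientations or candidate chemistries.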

Disclaimer This information was prepared as an account of work sponsored by an agency of the U.S. Government. Neither the U.S. Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the U.S. Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the U.S. Government or any agency thereof.


CRediT authorship contribution statement


Jeffery A. Aguiar: Supervision, Conceptualization, Data curation, Writing - review & editing, Validation, Methodology. Matthew L. Gong: Data curation, Software, Writing - original draft. Tolga Tasdizen: Supervision.


Conflict of interest

The authors declare no conflicts of interest.


References

[1] E. Pomarico, Y.-J. Kim, F.J.G. de Abajo, O.-H. Kwon, F. Carbone, R.M. van der Veen, Ultrafast electron energy-loss spectroscopy in transmission electron microscopy, MRS Bull. 43 (2018) 497–503, https://doi.org/10.1557/mrs.2018.148.
[2] R.F. Egerton, Electron Energy-Loss Spectroscopy in the Electron Microscope, second ed., Plenum Press, 1996.
[3] A. Kumar, O. Ovchinnikov, S. Guo, F. Griggio, S. Jesse, S. Trolier-McKinstry, S.V. Kalinin, Spatially resolved mapping of disorder type and distribution in random systems using artificial neural network recognition, Phys. Rev. B 84 (2011), https://doi.org/10.1103/PhysRevB.84.024203.
[4] S.V. Kalinin, B.G. Sumpter, R.K. Archibald, Big–deep–smart data in imaging for guiding materials design, Nat. Mater. 14 (2015) 973–980, https://doi.org/10.1038/nmat4395.
[5] M.L. Green, C.L. Choi, J.R. Hattrick-Simpers, A.M. Joshi, I. Takeuchi, S.C. Barron, E. Campo, T. Chiang, S. Empedocles, J.M. Gregoire, A.G. Kusne, J. Martin, A. Mehta, K. Persson, Z. Trautt, J. Van Duren, A. Zakutayev, Fulfilling the promise of the materials genome initiative with high-throughput experimental methodologies, Appl. Phys. Rev. 4 (2017) 011105, https://doi.org/10.1063/1.4977487.
[6] I. Takeuchi, M. Lippmaa, Y. Matsumoto, Combinatorial experimentation and materials informatics, MRS Bull. 31 (2006) 999–1003, https://doi.org/10.1557/mrs2006.228.
[7] R.D. King, J. Rowland, W. Aubrey, M. Liakata, M. Markham, L.N. Soldatova, K.E. Whelan, A. Clare, M. Young, A. Sparkes, S.G. Oliver, P. Pir, The Robot Scientist Adam, Computer 42 (2009) 46–54, https://doi.org/10.1109/MC.2009.270.
[8] Networking chemical robots for reaction multitasking, Nat. Commun. https://www.nature.com/articles/s41467-018-05828-8 (accessed July 26, 2019).
[9] D. Xue, P.V. Balachandran, J. Hogden, J. Theiler, D. Xue, T. Lookman, Accelerated search for materials with targeted properties by adaptive design, Nat. Commun. 7 (2016) 11241, https://doi.org/10.1038/ncomms11241.
[10] N. Bonnet, Artificial intelligence and pattern recognition techniques in microscope image processing and analysis, in: P.W. Hawkes (Ed.), Advances in Imaging and Electron Physics, Elsevier Academic Press Inc., San Diego, 2000, p. 114, https://doi.org/10.1016/S1076-5670(00)80020-8.
[11] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. 15 (2003), https://doi.org/10.1162/089976603321780317.
[12] S. Jesse, S.V. Kalinin, Principal component and spatial correlation analysis of spectroscopic-imaging data in scanning probe microscopy, Nanotechnology 20 (2009), https://doi.org/10.1088/0957-4484/20/8/085714.
[13] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: large-scale machine learning on heterogeneous distributed systems, arXiv:1603.04467 [cs] (2016). http://arxiv.org/abs/1603.04467 (accessed July 27, 2019).
[14] K. Lee, J. Caverlee, S. Webb, Uncovering social spammers: social honeypots + machine learning, in: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, NY, USA, 2010, pp. 435–442, https://doi.org/10.1145/1835449.1835522.
[15] C.-H. Liu, Y. Tao, D. Hsu, Q. Du, S.J.L. Billinge, Using a machine learning approach to determine the space group of a structure from the atomic pair distribution function, Acta Cryst. A 75 (2019) 633–643, https://doi.org/10.1107/S2053273319005606.
[16] J. Dongarra, P. Beckman, T. Moore, P. Aerts, G. Aloisio, J.C. Andre, D. Barkai, J.Y. Berthou, T. Boku, B. Braunschweig, F. Cappello, B. Chapman, X. Chi, A. Choudhary, S. Dosanjh, T. Dunning, S. Fiore, A. Geist, B. Gropp, R. Harrison, M. Hereld, M. Heroux, A. Hoisie, K. Hotta, Z. Jin, Y. Ishikawa, F. Johnson, S. Kale, R. Kenway, D. Keyes, The international exascale software project roadmap, Int. J. High Perform. Comput. Appl. 25 (2011), https://doi.org/10.1177/1094342010391989.
[17] A. Agrawal, A. Choudhary, Perspective: materials informatics and big data: realization of the "fourth paradigm" of science in materials science, APL Mater. 4 (2016) 053208, https://doi.org/10.1063/1.4946894.
[18] C. Giacovazzo, Direct Phasing in Crystallography, Oxford University Press. https://global.oup.com/academic/product/direct-phasing-in-crystallography-9780198500728?cc=us&lang=en& (accessed October 13, 2019).
[19] P. Dewolff, On the determination of unit-cell dimensions from powder diffraction patterns, Acta Crystallogr. 10 (1957) 590–595, https://doi.org/10.1107/S0365110X57002066.
[20] J. Visser, A fully automatic program for finding unit cell from powder data, J. Appl. Crystallogr. 2 (1969) 89, https://doi.org/10.1107/S0021889869006649.
[21] A.A. Coelho, Indexing of powder diffraction patterns by iterative use of singular value decomposition, J. Appl. Crystallogr. 36 (2003) 86–95, https://doi.org/10.1107/S0021889802019878.
[22] A. Altomare, G. Campi, C. Cuocci, L. Eriksson, C. Giacovazzo, A. Moliterni, R. Rizzi, P.-E. Werner, Advances in powder diffraction pattern indexing: N-TREOR09, J. Appl. Crystallogr. 42 (2009) 768–775, https://doi.org/10.1107/S0021889809025503.
[23] A. Boultif, D. Louer, Powder pattern indexing with the dichotomy method, J. Appl. Crystallogr. 37 (2004) 724–731, https://doi.org/10.1107/S0021889804014876.
[24] M.A. Neumann, X-Cell: a novel indexing algorithm for routine tasks and difficult cases, J. Appl. Crystallogr. 36 (2003) 356–365, https://doi.org/10.1107/S0021889802023348.
[25] A.J. Markvardsen, K. Shankland, W.I.F. David, J.C. Johnston, R.M. Ibberson, M. Tucker, H. Nowell, T. Griffin, ExtSym: a program to aid space-group determination from powder diffraction data, J. Appl. Crystallogr. 41 (2008) 1177–1181, https://doi.org/10.1107/S0021889808031087.
[26] A.A. Coelho, An indexing algorithm independent of peak position extraction for X-ray powder diffraction patterns, J. Appl. Cryst. 50 (2017) 1323–1330, https://doi.org/10.1107/S1600576717011359.
[27] A.V. Ievlev, M.A. Susner, M.A. McGuire, P. Maksymovych, S.V. Kalinin, Quantitative analysis of the local phase transitions induced by laser heating, ACS Nano 9 (2015) 12442–12450, https://doi.org/10.1021/acsnano.5b05818.
[28] C.J. Long, D. Bunker, X. Li, V.L. Karen, I. Takeuchi, Rapid identification of structural phases in combinatorial thin-film libraries using x-ray diffraction and non-negative matrix factorization, Rev. Sci. Instrum. 80 (2009) 103902, https://doi.org/10.1063/1.3216809.
[29] S. Jesse, P. Maksymovych, S.V. Kalinin, Rapid multidimensional data acquisition in scanning probe microscopy applied to local polarization dynamics and voltage dependent contact mechanics, Appl. Phys. Lett. 93 (2008), https://doi.org/10.1063/1.2980031.
[30] N. Artrith, A. Urban, G. Ceder, Efficient and accurate machine-learning interpolation of atomic energies in compositions with many species, Phys. Rev. B 96 (2017) 014112, https://doi.org/10.1103/PhysRevB.96.014112.
[31] F. Oviedo, Z. Ren, S. Sun, C. Settens, Z. Liu, N.T.P. Hartono, S. Ramasamy, B.L. DeCost, S.I.P. Tian, G. Romano, A.G. Kusne, T. Buonassisi, Fast and interpretable classification of small X-ray diffraction datasets using data augmentation and deep neural networks, NPJ Comput. Mater. 5 (2019) 60, https://doi.org/10.1038/s41524-019-0196-x.
[32] J.B. MacQueen, Some methods for classification and analysis of multivariate observations, in: L.M.L. Cam, J. Neyman (Eds.), Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, 1967, pp. 281–297.
[33] J.J. de Pablo, B. Jones, C.L. Kovacs, V. Ozolins, A.P. Ramirez, The Materials Genome Initiative, the interplay of experiment, theory and computation, Curr. Opin. Solid State Mater. Sci. 18 (2014) 99–117, https://doi.org/10.1016/j.cossms.2014.02.003.
[34] A. Jain, S.P. Ong, G. Hautier, W. Chen, W.D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder, K.A. Persson, Commentary: the Materials Project: a materials genome approach to accelerating materials innovation, APL Mater. 1 (2013) 011002, https://doi.org/10.1063/1.4812323.
[35] J.J. de Pablo, N.E. Jackson, M.A. Webb, L.-Q. Chen, J.E. Moore, D. Morgan, R. Jacobs, T. Pollock, D.G. Schlom, E.S. Toberer, J. Analytis, I. Dabo, D.M. DeLongchamp, G.A. Fiete, G.M. Grason, G. Hautier, Y. Mo, K. Rajan, E.J. Reed, E. Rodriguez, V. Stevanovic, J. Suntivich, K. Thornton, J.-C. Zhao, New frontiers for the materials genome initiative, NPJ Comput. Mater. 5 (2019) 41, https://doi.org/10.1038/s41524-019-0173-4.
[36] W.B. Park, J. Chung, J. Jung, K. Sohn, S.P. Singh, M. Pyo, N. Shin, K.-S. Sohn, Classification of crystal structure using a convolutional neural network, IUCrJ 4 (2017) 486–494, https://doi.org/10.1107/S205225251700714X.
[37] S.R. Hall, F.H. Allen, I.D. Brown, The crystallographic information file (CIF): a new standard archive file for crystallography, Acta Cryst. A 47 (1991) 655–685, https://doi.org/10.1107/S010876739101067X.
[38] S. Curtarolo, W. Setyawan, S. Wang, J. Xue, K. Yang, R.H. Taylor, L.J. Nelson, G.L.W. Hart, S. Sanvito, M. Buongiorno-Nardelli, N. Mingo, O. Levy, AFLOWLIB.ORG: a distributed materials properties repository from high-throughput ab initio calculations, Comput. Mater. Sci. 58 (2012) 227–235, https://doi.org/10.1016/j.commatsci.2012.02.002.
[39] S. Curtarolo, W. Setyawan, G.L.W. Hart, M. Jahnatek, R.V. Chepulskii, R.H. Taylor, S. Wang, J. Xue, K. Yang, O. Levy, M.J. Mehl, H.T. Stokes, D.O. Demchenko, D. Morgan, AFLOW: an automatic framework for high-throughput materials discovery, Comput. Mater. Sci. 58 (2012) 218–226, https://doi.org/10.1016/j.commatsci.2012.02.005.
[40] A. Merkys, A. Vaitkus, J. Butkus, M. Okulič-Kazarinas, V. Kairys, S. Gražulis, COD::CIF::Parser: an error-correcting CIF parser for the Perl language, J. Appl. Crystallogr. 49 (2016), https://doi.org/10.1107/S1600576715022396.
[41] S. Gražulis, A. Merkys, A. Vaitkus, M. Okulič-Kazarinas, Computing stoichiometric molecular composition from crystal structures, J. Appl. Crystallogr. 48 (2015) 85–91, https://doi.org/10.1107/S1600576714025904.
[42] S. Gražulis, A. Daškevič, A. Merkys, D. Chateigner, L. Lutterotti, M. Quirós, N.R. Serebryanaya, P. Moeck, R.T. Downs, A. Le Bail, Crystallography Open Database (COD): an open-access collection of crystal structures and platform for world-wide collaboration, Nucleic Acids Res. 40 (2012) D420–D427, https://doi.org/10.1093/nar/gkr900.
[43] R.T. Downs, M. Hall-Wallace, The American Mineralogist crystal structure database, Am. Mineral. 88 (2003) 247–250.
[44] S. Gražulis, D. Chateigner, R.T. Downs, A.F.T. Yokochi, M. Quirós, L. Lutterotti, E. Manakova, J. Butkus, P. Moeck, A. Le Bail, Crystallography Open Database – an open-access collection of crystal structures, J. Appl. Crystallogr. 42 (2009) 726–729, https://doi.org/10.1107/S0021889809016690.
[45] C. Nyshadham, C. Oses, J.E. Hansen, I. Takeuchi, S. Curtarolo, G.L.W. Hart, A computational high-throughput search for new ternary superalloys, Acta Mater. 122 (2017) 438–447, https://doi.org/10.1016/j.actamat.2016.09.017.
[46] S. Kirklin, J.E. Saal, V.I. Hegde, C. Wolverton, High-throughput computational search for strengthening precipitates in alloys, Acta Mater. 102 (2016) 125–135, https://doi.org/10.1016/j.actamat.2015.09.016.
[47] S.P. Ong, W.D. Richards, A. Jain, G. Hautier, M. Kocher, S. Cholia, D. Gunter, V.L. Chevrier, K.A. Persson, G. Ceder, Python Materials Genomics (pymatgen): a robust, open-source python library for materials analysis, Comput. Mater. Sci. 68 (2013) 314–319, https://doi.org/10.1016/j.commatsci.2012.10.028.
[48] A.R. Brodtkorb, T.R. Hagen, M.L. Sætra, Graphics processing unit (GPU) programming strategies and trends in GPU computing, J. Parallel Distrib. Comput. 73 (2013) 4–13, https://doi.org/10.1016/j.jpdc.2012.04.003.
[49] S.S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, New York, NY, 1999.
[50] L. Ward, K. Michel, C. Wolverton, Automated crystal structure solution from powder diffraction data: validation of the first-principles-assisted structure solution method, Phys. Rev. Mater. 1 (2017), https://doi.org/10.1103/PhysRevMaterials.1.063802.