Crystallographic prediction from diffraction and chemistry data for higher throughput classification using machine learning




Computational Materials Science xxx (xxxx) xxxx




Jeffery A. Aguiar a,⁎, Matthew L. Gong a,b, Tolga Tasdizen b

a Idaho National Laboratory, Nuclear Materials Department, Idaho Falls, ID 83415, USA
b University of Utah, Scientific Computing Imaging Institute, Salt Lake City, UT 84106, USA

ARTICLE INFO

Keywords: Microscopy; Machine learning; Data analytics; Materials discovery; Material informatics

ABSTRACT

Simultaneously capturing material structure and chemistry in the form of accessible data is often advantageous for drawing correlations and enhancing our understanding of measurable materials behavior and properties. Unfortunately, the data accessible at the required scale is often highly multidimensional and sparse, owing to the historical and evolving nature of materials science. To mitigate these difficulties, we develop and employ methods of data analytics, in conjunction with openly accessible chemistry and structure datasets, to classify and reduce the amount of data needed for extracting useful descriptors from multidimensional techniques. The construction and systematic ablation of our model highlights the potential for dimensional reduction in data sampling, improved classification, and identification of correlations between material crystallography and chemistry.

1. Introduction

Determining crystal structure and chemistry are important primary steps in materials science. Diffraction data encompasses a large range of modalities and acquisition methods, including X-ray diffraction (XRD), electron backscatter diffraction (EBSD), selected area electron diffraction (SAED), and high-resolution atomic scale (scanning) transmission electron microscopy (S/TEM). In parallel, modern spectrometers detect emission and ionization edges characteristic of elements over the entire periodic table at ultrafast timescales [1,2]. Collectively, the hardware to resolve atomic structure and chemistry has greatly improved and accelerated our ability to gather more detailed and specific data sets. Simultaneous collection of chemistry and structural data taken from the same single point on a specimen results in new challenges and opportunities [3–5]. Simultaneous measurements lend themselves to new opportunities in materials-by-design research, where the enormity of data presents new challenges. With this expanding volume of data, there is an increasing need to create tools and packages that can analyze multimodal data gathered on increasingly complex and modern equipment [6]. (See Figs. 1–3.) Automation is poised not only to revolutionize data collection, but also lends itself to new data-driven approaches to materials discovery and subsequent optimization [7–9]. To keep pace with the growing tide of data, new analysis tools and workflows must be created to aid users in parsing volumes of data [10–12]. Machine learning in other fields of study has been predicated on the creation of large curated datasets and access to orders of magnitude more data than previously available [13–15]. As social media created an explosion of images for computer vision algorithms, high-throughput and ultrafast microscopy combined with advances such as in-situ microscopy have the potential to revolutionize autonomous data collection methods, including adaptive image and feature tracking [9]. As materials communities create, curate, share, and aggregate data, new possibilities for machine learning arise, especially considering future exascale computing capabilities [16]. Materials science as a field is at the center of a confluence of technological advancements that will continue to create new research possibilities.

Historically, work on classifying material structure from collected datasets has been challenged by uncertainty, requiring multiple views and, in some cases, modeling to determine crystal structure. Correlations among materials systems, composition, fabrication method, and prior experience have since been expressed as generic rules of thumb for exploring materials [17]. Efforts to utilize diffraction data to determine space group date back decades in materials science, including transformations of pair distribution functions, diffraction profiles, and two-dimensional patterns using a variety of methods [18–23]. Determining space-group information from diffraction patterns has further been based on a number of statistical and brute-force searches [22,24–26]. In many of these cases, exploring a material system has required working within a familiar



Corresponding author. E-mail address: Jeff[email protected] (J.A. Aguiar).

https://doi.org/10.1016/j.commatsci.2019.109409
Received 6 August 2019; Received in revised form 20 October 2019; Accepted 11 November 2019
0927-0256/ © 2019 Published by Elsevier B.V.



Fig. 1. Schematic for Materials Data and Structure. a) The structure of the database used to create and store the training set. b) Relative composition of the database by crystallographic family, genera, and species. A high-level overview of the distribution shows an abundance or scarcity of certain families and genera. The black lines represent portions of data, starting from the inner circle of crystal family to the outer circle of point group data. c) Data allocation across the different folds during training and testing for cross-validation.

Fig. 2. Data Processing and Augmentation Schema for Diffraction. Data sources in the form of diffractograms, translated from high-resolution images or recorded diffraction patterns, lend themselves to extracting radial profiles using the same workflow. The legend separates the figure into acquisition modalities and the processing required to create a uniform feature vector.

material systems [3]. This situation is further complicated by the presence of confounding defects and additional phases that can otherwise be mislabeled, not identified, or ignored due to low signal to noise. Current models can miss this information if the signals are not pronounced or otherwise not identified. This is, however, an often-identified limitation of current techniques and has been an active area of research in extrapolating crystallography, where researchers focused on materials properties and behavior seek to understand how physical structure and chemistry are linked to properties [27–30]. One therefore naturally seeks a simplified and generalized representation of materials data that conveys the underlying complexity of these relationships, including the multidimensional materials sampling problem and accurate classification amid challenges in misclassification and signaling limitations [31,32]. The described scenario is a grand challenge among many communities looking to develop methods for equipping high throughput instruments with near-autonomous workflows utilizing the breadth of generated data [33–35]. To date, machine learning models for crystallography have demonstrated up to 81% accuracy in determining space group from simulated diffraction-based data, but have been challenged to deduce structure from experimental data, especially for lower-symmetry classes and crystal family classifications that require higher accuracy [36]. There is thus a need for improved quantitative tools for deducing both low- and high-symmetry classes from experimental data, and a corresponding need to leverage computational tools and platforms that utilize structural data, chemistry data, or both. Providing ranked predictions from multimodal data has not yet been fully demonstrated in the community. In this work, we present a modular neural network architecture with a simplified and generalizable representation of crystallography and chemistry to classify crystal structure. The generalized representation and space group prediction from diffraction data, chemistry data, or both allows greater flexibility for users to include the tool in their research. The service demonstrates a workflow and analysis tool for high throughput characterization and deduces crystal family, point group, or space group from experimental data. By reducing the complexity and the need for familiarity, the model lends itself to additional opportunities for under-represented and poorly understood materials. This work details the method for creating the deep learning service and demonstrates the increased speed of an automated workflow. The service is designed to alleviate data deluge and provide a simple, reliable workflow that does not require expert knowledge of material structure


Fig. 3. Schematic for modular architecture combining structure and chemistry. A) Modular network schematic and data concatenation for merging chemistry and structural determination from diffractograms. Boxes are color coded to three module types that are illustrated in (B), with encompassing layer architectures.

After checking the CIFs for formatting, diffraction profiles were simulated using pymatgen [47]. Each profile was converted into a single feature vector based on peak positions, forming the basis for the data folds in Fig. 1b. Profiles that contained no peaks in the range between 0.50 and 6 Angstroms were removed from the training set. The remaining profiles were stored in a SQL database with labels for family, genera, species, and chemistry. The cleaned training set shown in Fig. 1c consisted of 431,000 profiles with their associated chemical metadata, including composition.
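The conversion from a simulated profile to a peak-position feature vector might look like the following sketch; the bin count and d-spacing range match those described in Section 2.3, while the function name and plain-list representation are illustrative assumptions rather than the authors' code.

```python
# Sketch of the binary peak-position feature vector (assumed parameters:
# 900 bins spanning 0.5-6.0 Angstroms in reciprocal lattice spacing).
N_BINS = 900
D_MIN, D_MAX = 0.5, 6.0
BIN_WIDTH = (D_MAX - D_MIN) / N_BINS  # roughly 0.006 Angstrom per bin

def profile_to_feature_vector(peak_d_spacings):
    """Map a list of peak d-spacings (Angstroms) to a 0/1 vector of length 900.

    Peaks outside [D_MIN, D_MAX) are ignored; profiles with no in-range
    peaks would be dropped from the training set entirely.
    """
    vec = [0] * N_BINS
    for d in peak_d_spacings:
        if D_MIN <= d < D_MAX:
            vec[int((d - D_MIN) / BIN_WIDTH)] = 1
    return vec

# The 7.5 Angstrom peak falls outside the range and is dropped.
features = profile_to_feature_vector([0.9, 2.05, 3.3, 7.5])
```

Because only presence/absence per bin is recorded, the same vector can be produced from XRD, SAED, or Fourier-transformed S/TEM images, which is the technique-independence argued for in Section 2.3.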

and chemistry in real-time.

2. Methods

2.1. Data collection and cleaning

To assemble a training set, crystal information files (CIFs) were gathered from open materials and crystallography databases [37]. CIF files were primarily acquired from the Materials Project, AFLOW, and the Open Crystallography Database (OCD) [34,38–44]. The CIFs included crystals from all crystal families in the varying proportions reported in Fig. 1. The pre-simulation dataset consisted of 572,000 CIFs encoded with additional material descriptors reported in Fig. 1a. Additional CIFs for underrepresented space groups from the inorganic crystal structure database (ICSD) were included to provide further examples. The CIFs were then used to generate a structured query language (SQL) database with relevant crystallography data associated with computed diffraction profiles as a function of scattering angle, reciprocal lattice spacing, and chemical composition. Encoded into a diffraction profile are the structural fingerprints of any material, based on crystal geometry, underlying atomic coordination, and occupancy. An impinging X-ray, neutron, or electron beam produces a series of peaks in scattered intensity where there is constructive interference, forming a two-dimensional diffraction pattern. Depending on the scattering geometry, one specific pattern is generated per sample orientation, where a single crystal in one orientation will show a series of identifiable peaks in reciprocal space. Filling in all of reciprocal space requires a highly polycrystalline sample or, alternatively, a sample that is precessed through all orientations and diffracting conditions, completing the Ewald sphere. For the purposes of classification, all orientations and identifiable peaks are input into the model. Chemistry is input into the model as an additional descriptor and further augments the model's ability to deduce material structure [45,46].
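For reference, the relation between scattering angle and d-spacing used when computing such profiles is Bragg's law; this small helper (illustrative, not from the paper) converts a 2θ value to a d-spacing for an assumed Cu Kα wavelength:

```python
import math

def two_theta_to_d(two_theta_deg, wavelength=1.5406):
    """Convert a scattering angle 2-theta (degrees) to a d-spacing (Angstroms)
    via Bragg's law, n * lambda = 2 * d * sin(theta), with n = 1.

    The default wavelength is Cu K-alpha (~1.5406 Angstroms), a common
    laboratory X-ray source; electron or neutron sources would substitute
    their own wavelengths.
    """
    theta = math.radians(two_theta_deg / 2.0)
    return wavelength / (2.0 * math.sin(theta))

d = two_theta_to_d(44.0)  # a peak near 44 degrees maps to ~2.06 Angstroms
```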

2.3. Data representation

Simplifying the representation of diffraction and chemistry as shown in Fig. 3 allows for a broader set of data acquisition methods. Whether the input is a Fourier-transformed high-resolution atomic scale image or a diffraction profile acquired using electrons, neutrons, or X-rays, the relevant atomic scattering peaks are positioned with respect to their crystallographic scattering position in reciprocal space. A classification model that considers peak position alone in reciprocal space is therefore impervious to changes in technique. Training data were therefore built from CIFs, whose simulation provides a wealth of available features for training models, including chemical signatures. We opted for a minimalist representation of the features expected to be present across the widest range of acquisition methods. For diffraction, we reduced the profile to a vector of peak locations. Each position was represented in the vector as 0 if no peak was detected and 1 if a peak was detected. We divided the peaks into 900 bins uniformly partitioning the range from 0.5 to 6 Angstroms in reciprocal lattice spacing. This range was chosen to accommodate a wide variety of techniques to within 0.10 Angstrom resolution. By imposing fewer requirements and assumptions on the model inputs, we were able to create a generalized model that builds in uncertainty to deduce classification and augmentation.

2.4. Data augmentation schema

2.2. Data parsing and curation

In addition to the simulated diffraction profiles, a set of augmentation operations was defined on the dataset for use during training to bolster the training data. This includes a relative peak assignment uncertainty of ±0.3 Angstrom in reciprocal space. The value was chosen based on a window of uncertainty among common refinement methods and scattering sources. Neural networks require larger training sets than other machine

The CIFs were checked for consistency and proper formatting. CIFs that were missing structures or chemical formulas, or whose symmetry operations were inconsistent with their space group in Fig. 1a, were removed from the training set. The removal was performed in a manner consistent with established crystallographic rules and classes. CIFs missing one essential field, such as structure, were often missing other fields.


split classification into three stages along phylogenic lines: crystal families, point groups, and space groups. In keeping with the phylogenic schema, we refer to these hierarchical levels as family, genera, and species, respectively. At each level of the hierarchy new models were trained: an ensemble of models to predict family; within each family, models to predict genera; and lastly, for each genus, models to differentiate species. Due to the large branching nature of the schema, models were not trained end to end, but instead used the previous predictions to determine which model to use in the next step. To combine diffraction and chemistry data, two parallel networks were created to learn from the distinct inputs. The architecture shown in Fig. 3 was created from three modules: two designed to learn from the input data types and one to perform a task, in this case classification. The modular design was created to test various hypotheses and to allow flexibility in training and retraining portions of the network as submodules, without the challenge of retraining an entire network. The modular architecture also allows the network to be easily extended and retrained in parts to incorporate additional datatypes should other data or combinations become available. Comprised of a series of convolutional layers with max pooling, the diffraction module was designed to capture the spatial component of the signal. Lacking a spatial component, chemistry is captured by stacked dense layers. During training, each layer is followed by a normalization layer. The outputs of both networks were then concatenated and used as feature vectors for a classification module, which contained a series of stacked dense layers ending with a SoftMax layer for classification. Models at different levels of the hierarchy are based on the same architecture.
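The staged dispatch just described, where each prediction selects the model used at the next level, can be sketched as follows; the lambdas and registry keys are hypothetical stand-ins for the trained family, genera, and species networks:

```python
# Hypothetical sketch of hierarchical dispatch: predictions at one level
# select which model runs at the next. The lambdas stand in for trained
# networks; real models would map a feature vector to a class label.
family_model = lambda x: "cubic"
genera_models = {"cubic": lambda x: "m-3m"}          # one model per family
species_models = {("cubic", "m-3m"): lambda x: 225}  # one model per genus

def classify(feature_vector):
    """Run family -> genus -> species in sequence, using each prediction
    to look up the next model rather than training end to end."""
    family = family_model(feature_vector)
    genus = genera_models[family](feature_vector)
    species = species_models[(family, genus)](feature_vector)
    return family, genus, species

result = classify([0] * 900)
```

A benefit of this structure is that retraining one branch (say, the tetragonal genera model) leaves every other entry in the registries untouched, matching the modularity argued for in the text.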
Due to the large number of hyperparameters to test, the optimal parameters found at the family level were applied to all genera- and species-level models. The genera- and species-level models used different final SoftMax layers than the family model to accommodate the varying number of classes. As the ablation study will show, optimizing each stage separately or using a recurrent neural network architecture could be an area for further research. An example network for classifying family comprises a diffraction module, a chemistry module, and a classification module. The diffraction module contains four stacked blocks of convolution, pooling, normalization, and activation layers. The initial convolutional layer is comprised of 3 × 3 kernels with an output tensor of 1 × 40 × 900. After pooling, the output is batch size × 40 × 450. Repeating the process of convolution and pooling three times yields a final output shape of 1 × 40 × 112, which is then flattened into a 1 × 4032 tensor to be concatenated with the output of the chemistry module. The chemistry module contains four stacked blocks of dense, normalization, and activation layers. The initial chemistry input is a 1 × 118 tensor containing the atomic composition of the elements present in the structure. The first dense layer contains 20 nodes, and subsequent layers have 15, 11, and 8 nodes, respectively. The outputs of the two modules are then concatenated and passed to the dense layers of the classification module. The classification module had four blocks of dense, normalization, and activation layers. Its dense layers had 500, 250, and then C nodes, respectively, where C is the number of classes at that stage in the hierarchy. For example, if the classification module were for families, C would be seven. The last layer of the classification module is a SoftMax layer.
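The pooled tensor lengths quoted above can be checked with simple shape arithmetic; the helper below (an illustrative sketch, not the authors' code) tracks the spatial dimension through three stride-2 poolings, one consistent reading of the quoted 900 → 450 → 225 → 112 sequence:

```python
def pooled_lengths(length=900, stages=3, pool=2):
    """Track the spatial length through successive stride-2 poolings.

    With integer floor division, 900 -> 450 -> 225 -> 112, matching the
    1 x 40 x N shapes quoted for the diffraction module.
    """
    lengths = [length]
    for _ in range(stages):
        length //= pool
        lengths.append(length)
    return lengths

shapes = pooled_lengths()  # [900, 450, 225, 112]
```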

learning algorithms. To address both the scarcity and imbalance of the rarer classes, we defined a set of functions that would mimic data collected in an experimental setting. We defined two augmentations for the diffraction input and one for the chemistry. The functions were chosen to replicate experimental variations that are plausible across experimental modalities. Diffraction augmentation accounts for variations in camera calibration and peak localization methods. Peak positions were shifted by a number of bins drawn from a normal distribution centered at 0 with a variance of 1.5 bins, where the width of a bin equated to 0.006 1/Angstrom. The range of possible shifts was chosen to account for differences in binning method, centering of experimental data, and dispersion variation over the entire input profile. For atomic percentage, we allowed a composition to change by up to 5 atomic % (at.%) or 5 parts per million (ppm) to mimic the experimental uncertainty among common experimental modalities. Methods for chemical composition analysis of materials include energy dispersive X-ray spectroscopy (EDS), atom probe tomography (APT), mass spectrometry (MS), and electron energy loss spectroscopy (EELS). Absent quantified standards to calibrate the results of these techniques, there is an upper uncertainty bound of 5 at.%; we implemented this value in the model to cover a significant range and a higher ablation value. With higher certainty, the statistics lend themselves to improved classification, while the potential for higher background is captured with higher ablation.

2.5. Experimental validation set

A robust processing pipeline, reported in Fig. 2, was developed for the different collection modalities of diffraction to create the necessary feature vector for model input.
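The augmentation operations described in Section 2.4 might be sketched as below; the function names, the use of Python's random module, and the clamping choices are illustrative assumptions rather than the authors' implementation:

```python
import random

def augment_peaks(peak_bins, n_bins=900, variance=1.5):
    """Shift each detected peak by a whole number of bins drawn from a
    normal distribution centered at 0 (variance ~1.5 bins), mimicking
    camera-calibration and peak-localization uncertainty."""
    sigma = variance ** 0.5
    shifted = [0] * n_bins
    for i, present in enumerate(peak_bins):
        if present:
            j = i + round(random.gauss(0, sigma))
            if 0 <= j < n_bins:
                shifted[j] = 1
    return shifted

def augment_composition(at_pct, max_delta=5.0):
    """Perturb each atomic percentage by up to +/- 5 at.%, the assumed
    upper uncertainty bound across EDS, APT, MS, and EELS."""
    return [max(0.0, x + random.uniform(-max_delta, max_delta)) for x in at_pct]
```

Applying these functions on the fly during training effectively enlarges the dataset, since each epoch sees a slightly different version of every profile and composition.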
Two-dimensional diffraction data are azimuthally integrated to create a profile in pixel space that is used alongside calibration settings to determine the d-spacings of the peaks. To detect peak positions in reciprocal spacing, profiles are processed through a max-voting algorithm. The voting algorithm often does not require background subtraction to fit peaks; instead, it utilizes a max-pooling variational profile to define a rising feature as a peak. The detected peak locations are binned and cataloged by position. Chemistry data are implemented as a simple binary vector to capture the presence of elements in a material and, if available, a vector of atomic percentages.

3. Machine learning model development

3.1. Model selection

Machine learning algorithms including random forests, naïve Bayes, and support vector machines (SVMs) were compared with artificial neural networks to determine which algorithms would be best suited to the task of structural characterization. Training was performed on an Nvidia DGX-1 utilizing multiple Tesla V100 graphics processing units (GPUs) within the high performance computing resources at Idaho National Laboratory (INL) [48]. Neural networks were shown to have many positive properties and higher predictive capability for this task than the other machine learning algorithms. The availability of large datasets and augmentation methods made it possible to train neural networks. Neural networks (NNs) represent a class of learning algorithms. We use convolutional neural networks (CNNs) to capture the diffraction inputs due to their spatial component [49]. A series of dense layers was used to capture the chemistry.

4. Training

4.1. Training data

3.2. Hierarchical classification

Due to the imbalance in membership at all levels of the hierarchy, a leave-one-out cross-validation split was used to generate training and validation data instead of a single training, validation, and testing set, as a single balanced testing set would have either been small or used all

Attempts to directly classify space group in the past were challenged by overrepresented classes. A hierarchical approach decomposes the difficult classification problem into smaller, more manageable tasks. We


For diffraction, convolutional blocks containing at least four layers were designed to extract spatially ordered data (Fig. 4a). Comparatively, the diffraction module in Fig. 4b is composed of sequential convolution blocks ending with a flattening operation for classification. Chemistry modules composed of sequential dense blocks ending with a single dense layer (Fig. 4c) create a simplified representation of chemistry, while dense blocks containing three layers (Fig. 4d) are designed to find relationships between non-spatially ordered variables. These selections of layers highlight the refinements in the model for additional data and augmentations. For the genera and species levels, the ablation studies started with diffraction only and then added additional features, including chemistry.

the examples of the rare classes. Instead, as reported in Fig. 1b, we split the data into five folds, trained on four, and tested against the remaining portion. Models trained on each fold combination were aggregated and compared to determine how much the model was overfitting and whether generalization occurred. Within the open materials data repositories there are significant imbalances in the numbers of space groups, point groups, and crystal families. To address the imbalance between classes, a weighting was applied to the loss function during training to incentivize correctly predicting rarer classes and high symmetry. Considering uncertainties in peak position, ablated peaks, missing elements, and composition numbers lent itself to further data augmentation, bolstering rarer classes during training and generating more training data.
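One common recipe for such loss weighting is inverse-frequency class weights; the paper does not specify the exact scheme used, so the sketch below is an assumption:

```python
from collections import Counter

def class_weights(labels):
    """Weight each class inversely to its frequency so rare classes
    contribute as much to the training loss as common ones."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}

# With an 80/20 imbalance, the rare class receives a 4x larger weight.
w = class_weights(["cubic"] * 80 + ["triclinic"] * 20)
```

These per-class weights would then be passed to the loss function so that misclassifying a triclinic example is penalized more heavily than misclassifying a cubic one.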

4.4. Family level ablation results

4.2. Hyperparameter search

To determine the roles that chemistry and diffraction have in classification, versions of the model incorporating only one modality were trained for comparison. The model containing only chemistry had limited predictive power, as seen in Fig. 5A, compared to the other models. Within the cubic class the model performs at 98% accuracy, significantly better than random. Fig. 5B shows the model trained on diffraction only. The model performs well across all families with an average accuracy of ~88%, with the largest drop in accuracy between the monoclinic and triclinic families. These classes have minimal symmetry operations, resulting in primitive representations of the atomic arrangement of materials. Fig. 5C shows the effect the number of bins has on the model's ability to predict. The model was trained using a reduced feature vector with 180 features instead of 900, while parameters including the number of layers, stride, kernel size, and normalization remained the same. The model accuracy suffers noticeably, dropping an average of ~30% across all families, and mode collapse is observed in the band of misclassifications surrounding orthorhombic. The confusion matrices in Fig. 5D–G are for models utilizing both diffraction and chemistry data. These models had key features (chemistry augmentation, diffraction augmentation, and normalization) removed to evaluate their effectiveness and strength: diffraction only without normalization during training (Fig. 5D), diffraction only with normalization (Fig. 5E), diffraction and chemistry (Fig. 5F), and diffraction and chemistry with diffraction augmentation (Fig. 5G). There are marginal improvements from adding diffraction augmentation: ~1–2% improvement from the orthorhombic through the cubic family, but a decrease in performance for monoclinic of ~5%. Allowing the atomic percentage to vary by a margin of 5 at.% decreases performance at lower symmetry without a noticeable increase in accuracy at higher symmetries when predicting family. This behavior suggests that, generically, material chemistries are not organized over crystal

Starting with a modular architecture presents some challenges when optimizing the complete architecture, because it introduces several potential axes of tunable hyperparameters. For this case study, the architecture is composed of three modules that can contain a variable number of layers and types of connections. To determine which hyperparameters to hold fixed and which to include in the optimization, one-off model comparisons were used to see whether changing a specific parameter (e.g., layer depth, stride, or kernel size) yielded noticeable changes in model performance. Due to the size of the search space, initial comparisons were made between partially trained models to reduce the cycle time of iterations. Parameters of the augmentation functions were held constant across all levels of the hierarchy.

4.3. Hierarchical model ablation on crystal family, genera, and species classification

To determine which portions of the model contributed most to the predictive accuracy, an ablation study was performed at each level of the hierarchy that compared variations of the model. Augmentations produce different effects at each level of classification. We discuss the implications of these variations further as they appear during the hierarchical ablation study. To elucidate which portions of the deep learning model were most impactful for classification, variations of the model were trained with the same hyperparameters. At the family level, we tested models that contained only the diffraction module, only the chemistry module, permutations of augmentation, and no normalization. Due to the combinatorial nature of the possible variations and the specific attributes targeted by the ablation study, a selection of variations is shown in Fig. 4, progressing through the different case studies for predictions based on diffraction, chemistry, or both.

Fig. 4. Module Architectures and Block Descriptions. a) Convolution blocks contain four layers designed to extract spatially ordered data. b) Diffraction modules are composed of sequential convolution blocks ending with a flatten operation. c) Chemistry modules are composed of sequential dense blocks ending with a single dense layer to create a simplified representation of chemistry. d) Dense blocks contain 3 layers and are designed to find relationships between non-spatially ordered variables.


Fig. 5. Confusion matrices of family level predictions. Predicted and expected family classification, where predicted is the vertical axis and expected is the horizontal axis, starting with triclinic (1), monoclinic (2), orthorhombic (3), tetragonal (4), trigonal (5), hexagonal (6), and cubic (7). Confusion matrices trained on: A) chemistry only, B) diffraction only, C) diffraction with wider bins, D) diffraction only without normalization during training, E) diffraction only with normalization, F) diffraction and chemistry, G) diffraction and chemistry with diffraction augmentation, H) diffraction and chemistry with combined augmentations. Values reported in percentages.

Table 1. Results of the Family-to-Genera Ablation Study. Numbers reported in the table are averages across all genera present within the family; individual genera may perform better than the family average. Within each family, common genera have higher accuracy than rarer genera. Values reported in percentages.

Chemistry augmentation had a positive effect on more balanced datasets, with an average increase of 2–7% accuracy for most classes. For the orthorhombic and tetragonal crystal families, having only chemistry augmentation during training decreased correct classification of uncommon genera by ~10–15%. Diffraction augmentation had a positive effect for predicting genera within the trigonal and hexagonal families, with an average increase of 2–4%. For the other crystal families, it had a negative effect, lowering accuracies by 8–12%. Within the cubic family, distinguishing peaks form tightly clustered distributions with less variance; allowing diffraction peak positions to shift by more than 0.02 Angstroms (3 bins) obscures critical information and produces worse models that are heavily prone to mode collapse. At lower symmetry, there is a significant imbalance within the data, where the peak distributions have higher variance, consistent with the reduced set of symmetry operations for these more primitive classes. Combined augmentations produced more consistent models for predicting genera within the tetragonal, trigonal, and hexagonal crystal

family; however, within a crystal family at the point group level, considering chemistry improves classification by as much as 9%, as reported in Table 1.

5. Genera level ablation results

With general trends apparent at the family level, a reduced set of model variations was used in the ablation study of the genera-level classification. The five variations tested were: diffraction only; with chemistry and no augmentation; with chemistry and chemistry augmentation; with chemistry and diffraction augmentation; and with chemistry and both variants of augmentation. Table 1 shows the average changes across all genera within each family as different features were added to the model. Adding chemistry improved the predictive power of models across all crystal families except hexagonal. The largest improvement in accuracy was in distinguishing between cubic genera, with an average 9% increase in accuracy and significantly higher improvements for less common genera.


Within families with lower variance, models trained on different folds bettered the accuracy for rare and uncommon classes while decreasing the accuracy for common classes. The tradeoff between higher accuracy for common classes and better predictive power for rare classes represents a choice when considering how the model will be used. The variance in performance from augmentation appears to be a function of data balance and prevalence, as well as of symmetry within a crystal family, consistent with the arrangement and organization of materials and crystallography.

5.1. Species level ablation results

At the species level, imbalance between classes creates a noticeable effect, with mode collapse affecting several genera within the orthorhombic and cubic crystal families. Individual species comprising greater than 90% of the population of their genera present a significant imbalance when considering the training and implementation of our models. Two different accuracies are reported to highlight the disparity between common and rare species: raw accuracy is the percentage of correctly classified profiles across all species, while scaled accuracy is the average accuracy of each species within the genus. Raw accuracy exceeding scaled accuracy is a symptom of imbalance, where the trained model becomes preferential to common species due to their prevalence in the training set. Even outside of the extreme cases, imbalance between classes increased when going from the genera level to the species level. Models with diffraction only, combined diffraction and chemistry, and combined data with augmentation were compared for the ablation study, and the results are captured in Table 2. Chemistry had a pronounced effect, improving performance in predicting species within all genera by between 10% and 35%.

Table 2. Genera to species ablation study summary. Change in accuracy is the change in raw accuracy. Genera are color coded to match their respective families in Table 1 to highlight the structure of the species (space groups). Genera 1, 2, 5, and 22 are omitted because they contain only a single species. Genera 19 and 21 are omitted because insufficient profiles were available to train models, with fewer than 1000 profiles within the cleaned training set.

Despite the significant ability to classify materials, we note that the information contained in the training data is not uniformly distributed across all space groups, crystal families, or material classes. It is unclear whether the abundance of crystals in the common classes is representative of the true distribution of materials or a sampling bias that is a product of past research efforts being concentrated on specific materials. The imbalance between space groups within the dataset proved to be one of the greatest challenges in producing good models. Prior developments by Ward et al. and Oviedo et al. are similar tools utilizing machine learning for the purposes of evaluating and extracting crystallographic structure from high resolution imaging, diffraction-based, and first-principles-based datasets [31,50]. Differing in implementation, the model discussed here provides a tool for determining phase utilizing a modular architecture that takes clear advantage of augmentation and of structural and chemical data. In light of the significant imbalances in the data, we were able to train a hierarchical set of models and test them to a high accuracy, above 80%, at all levels of crystallography. There are, however, identified limitations and challenges worth noting, including distinguishing between nearly symmetrically similar space groups, overlapping diffraction peaks, and multiple phases captured in a single diffraction pattern. For these complications there are suggested strategies for utilizing less prevalent diffraction peaks and combinations that lend themselves to crystal family and space group differentiation. Similarly, additional sample orientations can be acquired to disambiguate the prediction further in scenarios where ambiguity may exist. An additional challenge in our approach is the potential for determining multiple unknown phases captured in a single diffraction pattern or a stack of patterns. Utilizing the same predictive framework, a strategy has been implemented to predict over all permutations, forming a statistical distribution rather than a single set of predictions. Because prediction with our deep learning framework completes in a few milliseconds, the architecture and speed lend themselves naturally to additional implementations and development beyond the initial training for crystallography. For the purposes of evaluating a deep learning model for material classification from diffraction and chemistry data, we were able to develop, evaluate, and deploy a modular predictive architecture for materials prediction. As additional information becomes available and experiments are performed, the modular architecture can be further evolved.

6. Concluding remarks

This paper shows the development and demonstration of a deep learning hierarchical model for materials classification and discovery from material structure, chemistry, or both combined. Modular neural networks provide a flexible framework to build multimodal models. With an average accuracy above 85% at each level of the hierarchy, the deep learning model can predict the space group of an unknown crystal structure without any a priori information. By providing a ranked list of possible space groups and potential chemistries, the deep learning-based model and workflow represents a milestone towards fully automated materials applications, where readily identifying materials and their behavior is a focus.
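The ranked-list output described above can be sketched by chaining per-level classifiers and scoring each candidate space group by its joint probability down the hierarchy. This is a hedged illustration, not the authors' code: the three-level family/genus/species hierarchy follows the paper, but the `predict_family`, `predict_genus`, and `predict_species` callables are hypothetical stand-ins for the trained networks.

```python
from typing import Callable, Dict, List, Tuple

def ranked_space_groups(
    profile,
    predict_family: Callable,   # profile -> {family: probability}
    predict_genus: Callable,    # (profile, family) -> {genus: probability}
    predict_species: Callable,  # (profile, genus) -> {space_group: probability}
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Chain the family -> genus -> species classifiers and rank
    candidate space groups by joint probability along the tree path."""
    scores: Dict[str, float] = {}
    for family, p_f in predict_family(profile).items():
        for genus, p_g in predict_genus(profile, family).items():
            for sg, p_s in predict_species(profile, genus).items():
                # Keep the best-scoring path to each space group.
                scores[sg] = max(scores.get(sg, 0.0), p_f * p_g * p_s)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

Returning the full ranked list rather than a single argmax is what allows downstream disambiguation, for example against additional sample orientations or candidate chemistries.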

Disclaimer This information was prepared as an account of work sponsored by an agency of the U.S. Government. Neither the U.S. Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the U.S. Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the U.S. Government or any agency thereof.


CRediT authorship contribution statement


Jeffery A. Aguiar: Supervision, Conceptualization, Data curation, Writing - review & editing, Validation, Methodology. Matthew L. Gong: Data curation, Software, Writing - original draft. Tolga Tasdizen: Supervision.


Conflict of interest

The authors declare no conflicts of interest.


References

[1] E. Pomarico, Y.-J. Kim, F.J.G. de Abajo, O.-H. Kwon, F. Carbone, R.M. van der Veen, Ultrafast electron energy-loss spectroscopy in transmission electron microscopy, MRS Bull. 43 (2018) 497–503, https://doi.org/10.1557/mrs.2018.148.
[2] R.F. Egerton, Electron Energy-Loss Spectroscopy in the Electron Microscope, second ed., Plenum Press, 1996.
[3] A. Kumar, O. Ovchinnikov, S. Guo, F. Griggio, S. Jesse, S. Trolier-McKinstry, S.V. Kalinin, Spatially resolved mapping of disorder type and distribution in random systems using artificial neural network recognition, Phys. Rev. B 84 (2011), https://doi.org/10.1103/PhysRevB.84.024203.
[4] S.V. Kalinin, B.G. Sumpter, R.K. Archibald, Big–deep–smart data in imaging for guiding materials design, Nat. Mater. 14 (2015) 973–980, https://doi.org/10.1038/nmat4395.
[5] M.L. Green, C.L. Choi, J.R. Hattrick-Simpers, A.M. Joshi, I. Takeuchi, S.C. Barron, E. Campo, T. Chiang, S. Empedocles, J.M. Gregoire, A.G. Kusne, J. Martin, A. Mehta, K. Persson, Z. Trautt, J. Van Duren, A. Zakutayev, Fulfilling the promise of the materials genome initiative with high-throughput experimental methodologies, Appl. Phys. Rev. 4 (2017) 011105, https://doi.org/10.1063/1.4977487.
[6] I. Takeuchi, M. Lippmaa, Y. Matsumoto, Combinatorial experimentation and materials informatics, MRS Bull. 31 (2006) 999–1003, https://doi.org/10.1557/mrs2006.228.
[7] R.D. King, J. Rowland, W. Aubrey, M. Liakata, M. Markham, L.N. Soldatova, K.E. Whelan, A. Clare, M. Young, A. Sparkes, S.G. Oliver, P. Pir, The Robot Scientist Adam, Computer 42 (2009) 46–54, https://doi.org/10.1109/MC.2009.270.
[8] Networking chemical robots for reaction multitasking, Nat. Commun. https://www.nature.com/articles/s41467-018-05828-8 (accessed July 26, 2019).
[9] D. Xue, P.V. Balachandran, J. Hogden, J. Theiler, D. Xue, T. Lookman, Accelerated search for materials with targeted properties by adaptive design, Nat. Commun. 7 (2016) 11241, https://doi.org/10.1038/ncomms11241.
[10] N. Bonnet, Artificial intelligence and pattern recognition techniques in microscope image processing and analysis, in: P.W. Hawkes (Ed.), Advances in Imaging and Electron Physics, Elsevier Academic Press Inc., San Diego, 2000, p. 114, https://doi.org/10.1016/S1076-5670(00)80020-8.
[11] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. 15 (2003), https://doi.org/10.1162/089976603321780317.
[12] S. Jesse, S.V. Kalinin, Principal component and spatial correlation analysis of spectroscopic-imaging data in scanning probe microscopy, Nanotechnology 20 (2009), https://doi.org/10.1088/0957-4484/20/8/085714.
[13] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: large-scale machine learning on heterogeneous distributed systems, arXiv:1603.04467 [cs] (2016). http://arxiv.org/abs/1603.04467 (accessed July 27, 2019).
[14] K. Lee, J. Caverlee, S. Webb, Uncovering social spammers: social honeypots + machine learning, in: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, NY, USA, 2010, pp. 435–442, https://doi.org/10.1145/1835449.1835522.
[15] C.-H. Liu, Y. Tao, D. Hsu, Q. Du, S.J.L. Billinge, Using a machine learning approach to determine the space group of a structure from the atomic pair distribution function, Acta Cryst. A 75 (2019) 633–643, https://doi.org/10.1107/S2053273319005606.
[16] J. Dongarra, P. Beckman, T. Moore, P. Aerts, G. Aloisio, J.C. Andre, D. Barkai, J.Y. Berthou, T. Boku, B. Braunschweig, F. Cappello, B. Chapman, X. Chi, A. Choudhary, S. Dosanjh, T. Dunning, S. Fiore, A. Geist, B. Gropp, R. Harrison, M. Hereld, M. Heroux, A. Hoisie, K. Hotta, Z. Jin, Y. Ishikawa, F. Johnson, S. Kale, R. Kenway, D. Keyes, The international exascale software project roadmap, Int. J. High Perform. Comput. Appl. 25 (2011), https://doi.org/10.1177/1094342010391989.
[17] A. Agrawal, A. Choudhary, Perspective: materials informatics and big data: realization of the "fourth paradigm" of science in materials science, APL Mater. 4 (2016) 053208, https://doi.org/10.1063/1.4946894.
[18] C. Giacovazzo, Direct Phasing in Crystallography, Oxford University Press. https://global.oup.com/academic/product/direct-phasing-in-crystallography-9780198500728?cc=us&lang=en& (accessed October 13, 2019).
[19] P. Dewolff, On the determination of unit-cell dimensions from powder diffraction patterns, Acta Crystallogr. 10 (1957) 590–595, https://doi.org/10.1107/S0365110X57002066.
[20] J. Visser, A fully automatic program for finding unit cell from powder data, J. Appl. Crystallogr. 2 (1969) 89, https://doi.org/10.1107/S0021889869006649.
[21] A.A. Coelho, Indexing of powder diffraction patterns by iterative use of singular value decomposition, J. Appl. Crystallogr. 36 (2003) 86–95, https://doi.org/10.1107/S0021889802019878.
[22] A. Altomare, G. Campi, C. Cuocci, L. Eriksson, C. Giacovazzo, A. Moliterni, R. Rizzi, P.-E. Werner, Advances in powder diffraction pattern indexing: N-TREOR09, J. Appl. Crystallogr. 42 (2009) 768–775, https://doi.org/10.1107/S0021889809025503.
[23] A. Boultif, D. Louer, Powder pattern indexing with the dichotomy method, J. Appl. Crystallogr. 37 (2004) 724–731, https://doi.org/10.1107/S0021889804014876.
[24] M.A. Neumann, X-Cell: a novel indexing algorithm for routine tasks and difficult cases, J. Appl. Crystallogr. 36 (2003) 356–365, https://doi.org/10.1107/S0021889802023348.
[25] A.J. Markvardsen, K. Shankland, W.I.F. David, J.C. Johnston, R.M. Ibberson, M. Tucker, H. Nowell, T. Griffin, ExtSym: a program to aid space-group determination from powder diffraction data, J. Appl. Crystallogr. 41 (2008) 1177–1181, https://doi.org/10.1107/S0021889808031087.
[26] A.A. Coelho, An indexing algorithm independent of peak position extraction for X-ray powder diffraction patterns, J. Appl. Cryst. 50 (2017) 1323–1330, https://doi.org/10.1107/S1600576717011359.
[27] A.V. Ievlev, M.A. Susner, M.A. McGuire, P. Maksymovych, S.V. Kalinin, Quantitative analysis of the local phase transitions induced by laser heating, ACS Nano 9 (2015) 12442–12450, https://doi.org/10.1021/acsnano.5b05818.
[28] C.J. Long, D. Bunker, X. Li, V.L. Karen, I. Takeuchi, Rapid identification of structural phases in combinatorial thin-film libraries using x-ray diffraction and non-negative matrix factorization, Rev. Sci. Instrum. 80 (2009) 103902, https://doi.org/10.1063/1.3216809.
[29] S. Jesse, P. Maksymovych, S.V. Kalinin, Rapid multidimensional data acquisition in scanning probe microscopy applied to local polarization dynamics and voltage dependent contact mechanics, Appl. Phys. Lett. 93 (2008), https://doi.org/10.1063/1.2980031.
[30] N. Artrith, A. Urban, G. Ceder, Efficient and accurate machine-learning interpolation of atomic energies in compositions with many species, Phys. Rev. B 96 (2017) 014112, https://doi.org/10.1103/PhysRevB.96.014112.
[31] F. Oviedo, Z. Ren, S. Sun, C. Settens, Z. Liu, N.T.P. Hartono, S. Ramasamy, B.L. DeCost, S.I.P. Tian, G. Romano, A.G. Kusne, T. Buonassisi, Fast and interpretable classification of small X-ray diffraction datasets using data augmentation and deep neural networks, NPJ Comput. Mater. 5 (2019) 60, https://doi.org/10.1038/s41524-019-0196-x.
[32] J.B. MacQueen, Some methods for classification and analysis of multivariate observations, in: L.M.L. Cam, J. Neyman (Eds.), Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, 1967, pp. 281–297.
[33] J.J. de Pablo, B. Jones, C.L. Kovacs, V. Ozolins, A.P. Ramirez, The Materials Genome Initiative, the interplay of experiment, theory and computation, Curr. Opin. Solid State Mater. Sci. 18 (2014) 99–117, https://doi.org/10.1016/j.cossms.2014.02.003.
[34] A. Jain, S.P. Ong, G. Hautier, W. Chen, W.D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder, K.A. Persson, Commentary: the Materials Project: a materials genome approach to accelerating materials innovation, APL Mater. 1 (2013) 011002, https://doi.org/10.1063/1.4812323.
[35] J.J. de Pablo, N.E. Jackson, M.A. Webb, L.-Q. Chen, J.E. Moore, D. Morgan, R. Jacobs, T. Pollock, D.G. Schlom, E.S. Toberer, J. Analytis, I. Dabo, D.M. DeLongchamp, G.A. Fiete, G.M. Grason, G. Hautier, Y. Mo, K. Rajan, E.J. Reed, E. Rodriguez, V. Stevanovic, J. Suntivich, K. Thornton, J.-C. Zhao, New frontiers for the materials genome initiative, NPJ Comput. Mater. 5 (2019) 41, https://doi.org/10.1038/s41524-019-0173-4.
[36] W.B. Park, J. Chung, J. Jung, K. Sohn, S.P. Singh, M. Pyo, N. Shin, K.-S. Sohn, Classification of crystal structure using a convolutional neural network, IUCrJ 4 (2017) 486–494, https://doi.org/10.1107/S205225251700714X.
[37] S.R. Hall, F.H. Allen, I.D. Brown, The crystallographic information file (CIF): a new standard archive file for crystallography, Acta Cryst. A 47 (1991) 655–685, https://doi.org/10.1107/S010876739101067X.
[38] S. Curtarolo, W. Setyawan, S. Wang, J. Xue, K. Yang, R.H. Taylor, L.J. Nelson, G.L.W. Hart, S. Sanvito, M. Buongiorno-Nardelli, N. Mingo, O. Levy, AFLOWLIB.ORG: a distributed materials properties repository from high-throughput ab initio calculations, Comput. Mater. Sci. 58 (2012) 227–235, https://doi.org/10.1016/j.commatsci.2012.02.002.
[39] S. Curtarolo, W. Setyawan, G.L.W. Hart, M. Jahnatek, R.V. Chepulskii, R.H. Taylor, S. Wang, J. Xue, K. Yang, O. Levy, M.J. Mehl, H.T. Stokes, D.O. Demchenko, D. Morgan, AFLOW: an automatic framework for high-throughput materials discovery, Comput. Mater. Sci. 58 (2012) 218–226, https://doi.org/10.1016/j.commatsci.2012.02.005.
[40] A. Merkys, A. Vaitkus, J. Butkus, M. Okulič-Kazarinas, V. Kairys, S. Gražulis, COD::CIF::Parser: an error-correcting CIF parser for the Perl language, J. Appl. Crystallogr. 49 (2016), https://doi.org/10.1107/S1600576715022396.
[41] S. Gražulis, A. Merkys, A. Vaitkus, M. Okulič-Kazarinas, Computing stoichiometric molecular composition from crystal structures, J. Appl. Crystallogr. 48 (2015) 85–91, https://doi.org/10.1107/S1600576714025904.
[42] S. Gražulis, A. Daškevič, A. Merkys, D. Chateigner, L. Lutterotti, M. Quirós, N.R. Serebryanaya, P. Moeck, R.T. Downs, A. Le Bail, Crystallography Open Database (COD): an open-access collection of crystal structures and platform for world-wide collaboration, Nucleic Acids Res. 40 (2012) D420–D427, https://doi.org/10.1093/nar/gkr900.
[43] R.T. Downs, M. Hall-Wallace, The American Mineralogist crystal structure database, Am. Mineral. 88 (2003) 247–250.
[44] S. Gražulis, D. Chateigner, R.T. Downs, A.F.T. Yokochi, M. Quirós, L. Lutterotti, E. Manakova, J. Butkus, P. Moeck, A. Le Bail, Crystallography Open Database – an open-access collection of crystal structures, J. Appl. Crystallogr. 42 (2009) 726–729, https://doi.org/10.1107/S0021889809016690.
[45] C. Nyshadham, C. Oses, J.E. Hansen, I. Takeuchi, S. Curtarolo, G.L.W. Hart, A computational high-throughput search for new ternary superalloys, Acta Mater. 122 (2017) 438–447, https://doi.org/10.1016/j.actamat.2016.09.017.
[46] S. Kirklin, J.E. Saal, V.I. Hegde, C. Wolverton, High-throughput computational search for strengthening precipitates in alloys, Acta Mater. 102 (2016) 125–135, https://doi.org/10.1016/j.actamat.2015.09.016.
[47] S.P. Ong, W.D. Richards, A. Jain, G. Hautier, M. Kocher, S. Cholia, D. Gunter, V.L. Chevrier, K.A. Persson, G. Ceder, Python Materials Genomics (pymatgen): a robust, open-source python library for materials analysis, Comput. Mater. Sci. 68 (2013) 314–319, https://doi.org/10.1016/j.commatsci.2012.10.028.
[48] A.R. Brodtkorb, T.R. Hagen, M.L. Sætra, Graphics processing unit (GPU) programming strategies and trends in GPU computing, J. Parallel Distrib. Comput. 73 (2013) 4–13, https://doi.org/10.1016/j.jpdc.2012.04.003.
[49] S.S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, New York, NY, 1999.
[50] L. Ward, K. Michel, C. Wolverton, Automated crystal structure solution from powder diffraction data: validation of the first-principles-assisted structure solution method, Phys. Rev. Mater. 1 (2017), https://doi.org/10.1103/PhysRevMaterials.1.063802.