The data-driven analysis system developed in the first-term SIP “Structural Materials for Innovation” is briefly explained using several practical applications. The developed system is composed of two major subsystems: the data-driven prediction system and the 3D/4D analysis system. In the data-driven prediction system, two methods from data science, data assimilation and sparse modeling, are applied to optimize model parameters for the physical and phenomenological models developed in other MI modules, such as the structure and performance prediction and microstructure prediction modules, using experimental and numerical databases. In the 3D/4D analysis system, meanwhile, it is demonstrated that the microstructural database can be efficiently utilized to predict mechanical properties, as well as to extract detailed geometrical information concerning the constituent microstructures.
This Paper was Originally Published in Japanese in Materia Japan 58 (2019) 503–510.

1. Introduction

The first-term SIP “Structural Materials for Innovation” started in 2014.1) We aimed to establish “Materials Integration” (MI) as the basis of an integrated system for materials development, seamlessly linking the basic elements of the field (processes, structures, properties, and performance) on computers. With such a system, predicting the structure and performance of almost any material would be possible. We envision the “property-space analysis system”, the theme of this article, as a system in which the methods of data science are applied to optimize model parameters for physical and phenomenological models, using a database built from experiments and numerical calculations, and thereby to rapidly improve the accuracy of the various modules in the MI system, such as the structure and performance prediction modules. The theme of the second-term SIP is “inverse problems”. We have carried out research and development not only on how to improve predictive accuracy, but also on how to select explanatory variables automatically, how to develop physical and phenomenological models that explain phenomena, how to manage large-scale three-dimensional material structure data effectively, and how best to utilize three-dimensional structure information.
The property-space analysis system is composed of two major subsystems: the data-driven prediction system and the 3D/4D analysis system. In the following, we briefly explain each system with reference to actual application examples.
2. Data-driven prediction system

The data-driven prediction system employs two methods from data science: data assimilation, which is essential for improving the accuracy of individual modules, and sparse modeling, which enables the selection of explanatory variables and models. We have modularized these two methods and applied the system to concrete problems in materials science.
2.1 Data assimilation

Data assimilation is a method that has been studied intensively in the fields of meteorology and oceanography.2,3) In these fields, one might think it possible to predict actual weather changes by constructing a spatiotemporal model based on physical laws and performing a large-scale numerical analysis. In practice, however, the nonlinearity of the system causes the prediction to deviate significantly from observations, depending on the initial conditions, the boundary conditions, and the parameters included in the model. It therefore became essential to have a method for performing numerical analyses that reproduce the actual phenomenon more accurately, by setting the initial conditions, the boundary conditions, and the model parameters to be consistent with observations. Data assimilation made this possible. For structural materials, the difficulty in making predictions stems, as it does in meteorology and oceanography, from multi-scale non-uniformity and the incompleteness of the physical models. Perhaps the most significant merit of using data assimilation instead of general machine learning is that it allows not only predictions that account for such non-uniformity and incompleteness, but also a quantitative evaluation of the uncertainty in the result of the numerical analysis (Uncertainty Quantification, UQ4)). There are two main approaches to data assimilation: the sequential data assimilation method, which proceeds forward in time, and the four-dimensional variational method, which traces time backward.
2.1.1 Sequential data assimilation method

The sequential data assimilation method is very useful if a numerical model of the target physical phenomenon is already in hand. To provide an overview of sequential data assimilation, we use a typical method, the Ensemble Kalman Filter (EnKF5)). The EnKF is widely used not only in meteorology and oceanography but also in other fields, because it can be applied even when the model is nonlinear, and its implementation (including parallelization) is relatively easy.
In the sequential data assimilation method, we first assume that the time-evolution equation (system model) describing the phenomenon is given in the following discrete form:

\begin{equation*} x_{t + 1} = f(x_{t},v_{t}) \end{equation*}

where $x_{t}$ is the state vector and $v_{t}$ is the system noise. The observation $y_{t}$ is related to the state through the observation model

\begin{equation*} y_{t} = h(x_{t}) + w_{t} \end{equation*}

where $w_{t}$ is the observation noise. Given the series of observations $y_{1:t} = \{y_{1},\ldots,y_{t}\}$, the posterior (filtered) distribution of the state is obtained from Bayes' theorem:

\begin{equation*} P(x_{t}|y_{1:t}) = \frac{P(y_{t}|x_{t})P(x_{t}|y_{1:t - 1})}{P(y_{t})} \end{equation*}
In the EnKF, the probability density function $P(x_{t}|y_{1:t})$ of the state vector $x_{t}$ is expressed by a set of discrete state vectors $x_{t|t}^{(i)}$ through the ensemble approximation

\begin{equation*} P(x_{t}|y_{1:t}) = \frac{1}{N}\sum\nolimits_{i = 1}^{N} \delta (x_{t} - x_{t|t}^{(i)}) \end{equation*}

where $N$ is the ensemble size and $\delta$ is the Dirac delta function. In the prediction step, each ensemble member is propagated by the system model,

\begin{equation*} x_{t|t - 1}^{(i)} = f(x_{t - 1|t - 1}^{(i)},v_{t - 1}) \end{equation*}

so that the predicted distribution is again approximated by the ensemble:

\begin{equation*} P(x_{t}|y_{1:t - 1}) = \frac{1}{N}\sum\nolimits_{i = 1}^{N} \delta (x_{t} - x_{t|t - 1}^{(i)}) \end{equation*}

When the observation model is linear,

\begin{equation*} y_{t} = H_{t}x_{t} + w_{t} \end{equation*}

the filtering (update) step corrects each member using the Kalman gain $\skew3\hat{K}_{t}$ estimated from the ensemble covariance and a perturbed observation $w_{t}^{(i)}$:

\begin{equation*} x_{t|t}^{(i)} = x_{t|t - 1}^{(i)} + \skew3\hat{K}_{t}(y_{t} + w_{t}^{(i)} - H_{t}x_{t|t - 1}^{(i)}) \end{equation*}
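To make the prediction and update steps above concrete, the following is a minimal sketch of one EnKF cycle in Python/NumPy. It is a generic textbook implementation written under our own assumptions, not the SIP module itself: the system model f, the observation matrix H, and the noise covariances Q and R are placeholders to be supplied by the user.

```python
import numpy as np

def enkf_step(ensemble, y, f, H, Q, R, rng):
    """One prediction + update cycle of the Ensemble Kalman Filter.

    ensemble : (N, n) array of state vectors x_{t-1|t-1}^{(i)}
    y        : (m,) observation y_t
    f        : system model, maps one state vector to the next state
    H        : (m, n) linear observation matrix
    Q, R     : system- and observation-noise covariances (placeholders)
    """
    N, n = ensemble.shape
    # Prediction step: propagate each member with system noise v_{t-1}
    pred = np.array([f(x) + rng.multivariate_normal(np.zeros(n), Q)
                     for x in ensemble])
    # Ensemble estimate of the predicted covariance
    mean = pred.mean(axis=0)
    A = pred - mean
    P = A.T @ A / (N - 1)
    # Kalman gain estimated from the ensemble covariance
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    # Update step: each member assimilates a perturbed observation
    m = y.shape[0]
    return np.array([x + K @ (y + rng.multivariate_normal(np.zeros(m), R)
                              - H @ x)
                     for x in pred])
```

Iterating `enkf_step` over the observation sequence yields the filtered ensembles, whose spread provides the uncertainty estimate (UQ) discussed above.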
2.1.2 Four-dimensional variational method

The four-dimensional variational method,6) unlike the EnKF, cannot use an existing numerical model as it is. However, it enables us to perform data assimilation effectively with limited computational resources, and it is applicable to larger-scale simulations.
In the four-dimensional variational method, we first assume that the system model is given by the following time-evolution equation:

\begin{equation*} \frac{\partial x}{\partial t} - F(x) = 0,\quad x(0) = x_{0} \end{equation*}

The misfit between the computed state $x$ and the observations $y$ over the assimilation window $[0,T]$ is measured by the cost function

\begin{equation*} J = \int\nolimits_{0}^{T} \mathcal{J}(x,y)\,dt \end{equation*}

Introducing the adjoint variable (Lagrange multiplier) $\lambda$, the constrained minimization of $J$ is rewritten as a stationarity problem for the Lagrangian

\begin{equation*} \mathcal{L} = \int\nolimits_{0}^{T} \mathcal{J}(x,y)\,dt + \int\nolimits_{0}^{T} \lambda^{\dagger}\left(\frac{\partial x}{\partial t} - F(x)\right)dt \end{equation*}

Taking the first variation and integrating by parts yields

\begin{align*} \delta\mathcal{L} & = \int\nolimits_{0}^{T}(\nabla_{x}\mathcal{J})^{\dagger}\delta x\,dt + \int\nolimits_{0}^{T}\delta\lambda^{\dagger}\left(\frac{\partial x}{\partial t} - F(x)\right)dt \\ &\quad + \int\nolimits_{0}^{T}\lambda^{\dagger}\left(\frac{\partial\delta x}{\partial t} - (\nabla_{x}F)\delta x\right)dt\\ & = \int\nolimits_{0}^{T} \delta\lambda^{\dagger}\left(\frac{\partial x}{\partial t} - F(x)\right)dt + \lambda^{\dagger}(T)\delta x(T) - \lambda^{\dagger}(0)\delta x(0)\\ &\quad + \int\nolimits_{0}^{T} \left((\nabla_{x}\mathcal{J})^{\dagger} - \lambda^{\dagger}(\nabla_{x}F) - \frac{\partial\lambda^{\dagger}}{\partial t}\right)\delta x\,dt \end{align*}

Requiring $\delta\mathcal{L} = 0$ for arbitrary $\delta\lambda$ and $\delta x$ gives the forward model

\begin{equation*} \frac{\partial x}{\partial t} - F(x) = 0 \end{equation*}

and the adjoint equation

\begin{equation*} \nabla_{x}\mathcal{J}(x,y) - (\nabla_{x}F(x))^{\dagger}\lambda - \frac{\partial\lambda}{\partial t} = 0 \end{equation*}

with the terminal condition

\begin{equation*} \lambda(T) = 0 \end{equation*}

Solving the adjoint equation backward in time from $t = T$ then gives the gradient of the cost function with respect to the initial state directly from the adjoint variable at $t = 0$:

\begin{equation*} \delta\mathcal{L} = \delta J = (\nabla_{x_{0}}J)^{\dagger}\delta x_{0} = \lambda^{\dagger}(0)\delta x_{0} \end{equation*}
As shown above, the four-dimensional variational method can obtain the optimum solution with high accuracy in a small number of update steps, using the gradient method. It therefore requires far fewer computational resources than the EnKF, which relies on an ensemble approximation. The formulation is also highly compatible with other numerical methods based on variational principles, such as the finite element method. On the other hand, since the conventional four-dimensional variational method is a maximum likelihood estimation method, it yields only a single point estimate of the optimum solution. As mentioned above, however, for the analysis of structural materials with large uncertainties, it is necessary to quantify the uncertainty in the optimum solution (UQ) in order to discuss the reliability of the estimation results. For this purpose, in the MI project, we also extended the method so that the uncertainty of the estimation can be shown explicitly.7,8) Please refer to Refs. 7), 8) for details.
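The adjoint-gradient machinery behind the equations above can be sketched for a toy problem. The following is a minimal example, assuming a linear system dx/dt = Ax discretized by forward Euler and observations at every step; the damped-oscillator matrix, step sizes, noise level, and learning rate are illustrative choices, not part of the MI system.

```python
import numpy as np

def fourdvar_gradient(x0, A, dt, ys):
    """Cost J = 0.5 * sum_k ||x_k - y_k||^2 and its gradient w.r.t. the
    initial state x0, for dx/dt = A x discretized by forward Euler,
    computed with one forward sweep and one backward (adjoint) sweep."""
    M = np.eye(len(x0)) + dt * A              # one-step propagator
    xs = [x0]
    for _ in range(len(ys) - 1):              # forward sweep
        xs.append(M @ xs[-1])
    residuals = [x - y for x, y in zip(xs, ys)]
    J = 0.5 * sum(r @ r for r in residuals)
    lam = residuals[-1]                       # adjoint terminal condition
    for r in reversed(residuals[:-1]):        # backward sweep
        lam = M.T @ lam + r
    return J, lam                             # lam = dJ/dx0

# Minimal usage: recover the initial state of a damped oscillator
rng = np.random.default_rng(0)
A = np.array([[0.0, 1.0], [-1.0, -0.1]])      # assumed toy model
dt, n_steps = 0.01, 200
M = np.eye(2) + dt * A
ys, x = [], np.array([1.0, 0.0])              # true initial state
for _ in range(n_steps):
    ys.append(x + 0.01 * rng.standard_normal(2))  # noisy observations
    x = M @ x
x0 = np.zeros(2)                              # first guess
for _ in range(500):
    J, g = fourdvar_gradient(x0, A, dt, ys)
    x0 -= 0.005 * g                           # gradient descent step
```

After the loop, `x0` approaches the true initial state (1, 0), illustrating how a single forward/backward pair supplies the gradient that an ensemble method would need many model runs to estimate.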
2.1.3 Example of the application of data assimilation

As an application of data assimilation, we show an example in which the recrystallization behavior of an Al alloy was investigated by data assimilation.9) Figure 1 shows the recrystallization behavior, observed by SEM-EBSD, of an Al alloy that was heavily deformed and then annealed. It has been shown that the recrystallization behavior of Al alloys, which have a large stacking-fault energy, can be well explained by the sub-grain growth model of Humphreys.10) Hence, we used the Multi-Phase-Field (MPF) model as the system model. It is also known that the interface mobility and energy can be well described by models that take the crystal orientation dependence into account.10–12) The strain energy introduced by processing likewise varies with crystal orientation, and several bottom-up model constructions have been reported. However, as Humphreys pointed out,13) such modeling is still under development. Therefore, in this study, we performed the analysis with a view to estimating the strain energy and its time evolution from experimental data.

Fig. 1 Recrystallization behavior of Al alloy (A1050, 573 K).
Figure 2 shows the time evolution of the area fraction and the strain energy for various crystal orientations, estimated by the EnKF. Several things can be seen from this analysis. A larger strain energy is stored in the S band than in the Brass band, and only a very small strain energy is stored in the Cu grains scattered in the S band. We can also see that the difference between the strain energies stored in the Cu grains and in the S band corresponds to a dislocation density of about $10^{14}$–$10^{15}\,\mathrm{m}^{-2}$, in good agreement with the dislocation density in heavily deformed Al alloys.14) In this example, only the information about the change of the texture was used; hence, the information obtained was also limited. Nevertheless, deep insight was obtained, and we expect to gain more detailed knowledge in the future by combining this method with more advanced measurements, such as the extraction of feature quantities from microstructure images15) and X-ray tomography.

Fig. 2 Results of assimilation by EnKF. (a) Transition of the area fraction; (b) Strain energy.
2.2 Sparse modeling

The cost of compiling a database of structural material properties is in many cases huge. Therefore, despite the massive effort currently underway, the available stock of information is limited, and applying machine learning methods that require large amounts of data is often unrealistic. By contrast, sparse modeling is a technique in information science that automatically extracts, from a limited amount of information, the parameters important for representing a phenomenon properly. This method is therefore extremely compatible with structural materials, for which the amount of information is limited. Because sparse modeling can automatically extract the essence of data, it has the potential to contribute greatly to the development of structural materials, for which many phenomena remain to be elucidated.
Using Bayes’ theorem, we can formulate sparse modeling as follows: The posterior distribution P(x|y, Mi) of a model parameter x in a model Mi after obtaining an observed value y is given by the following proportionality:
\begin{equation*} P(x|y,M_{i}) \propto P(y|x,M_{i})P(x|M_{i}) \end{equation*}

Similarly, the posterior probability of the model itself is

\begin{equation*} P(M_{i}|y) \propto P(y|M_{i})P(M_{i}) \end{equation*}

where the marginal likelihood $P(y|M_{i})$ is obtained by integrating out the model parameters:

\begin{equation*} P(y|M_{i}) = \int P(y|x,M_{i})P(x|M_{i})dx \end{equation*}

Its negative logarithm, $-\ln P(y|M_{i})$, is called the Bayesian free energy (BFE); under a uniform prior over models, the model with the smallest BFE is the most probable given the data.
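For the special case of Bayesian linear regression with Gaussian prior and noise, the evidence integral above has a closed form, which makes a compact illustration possible. The following is a minimal sketch, assuming arbitrary hyperparameters sigma2 and alpha and synthetic data; it is not the MI-system module.

```python
import numpy as np
from scipy.stats import multivariate_normal

def bayes_free_energy(X, y, sigma2=0.1, alpha=1.0):
    """BFE = -ln P(y|M) for y = X w + noise, with prior w ~ N(0, I/alpha)
    and Gaussian noise of variance sigma2. In this conjugate model the
    evidence is a multivariate normal density in y (closed form)."""
    n = len(y)
    cov = sigma2 * np.eye(n) + X @ X.T / alpha   # marginal covariance of y
    return -multivariate_normal(np.zeros(n), cov).logpdf(y)

# Compare two candidate models on synthetic data: y depends on x1 only
rng = np.random.default_rng(1)
x1, x2 = rng.standard_normal((2, 50))
y = 2.0 * x1 + 0.3 * rng.standard_normal(50)
M1 = np.column_stack([x1])            # correct model
M2 = np.column_stack([x1, x2])        # model with a superfluous variable
print(bayes_free_energy(M1, y), bayes_free_energy(M2, y))
# The smaller BFE (model M1) identifies the more plausible model.
```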
Next, we introduce variable selection in multiple regression analysis, and a method for selecting a particular physical model out of many possible ones.
2.2.1 Variable selection in multiple regression analysis

When only a limited amount of data is available and there are many candidate explanatory variables for the target phenomenon, a simple multiple-regression analysis causes over-fitting. It is therefore necessary to select effective explanatory variables from the many candidates, and various methods have been proposed for this over the years. For example, Pearson's correlation coefficient16) has long been used, but it gives nothing more than an index of the degree of linear relation between two variables, an explanatory variable and the output, and thus cannot by itself identify the optimum combination of variables. At present, the L1 regularization method called the LASSO (Least Absolute Shrinkage and Selection Operator)17) is widely used. Given an evaluation function J(x, y) for an observed value y, the LASSO compresses the dimension of the explanatory variables by imposing the constraint $\sum | x| \leq t$ on the coefficients x of the explanatory variables, where t is a regularization parameter; equivalently, an L1 penalty weighted by a hyperparameter λ is added to J. The combination of explanatory variables selected by the LASSO is indeed the optimum solution for a given λ. However, the optimum combination varies with the hyperparameter, and no information is provided about how strongly the selected optimum is preferred over other combinations. In structural materials, many phenomena are intertwined in complex ways, and it is hard to accept that a unique, absolute model completely describing them exists. Therefore, in the MI system, a module based on ES-LiR (Exhaustive Search for Linear Regression)18,19) was introduced for linear multiple regression analysis instead of the LASSO. ES-LiR computes the BFE exhaustively for all combinations of explanatory variables in linear regression; it not only identifies the optimum solution clearly, but also provides information about the density of states in the vicinity of the optimum. Owing to length limitations, the formulation is not detailed here; please refer to Refs. 18), 19).
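The exhaustive-search idea can be sketched in a few lines (this is an illustration of the concept, not the actual ES-LiR code of Refs. 18), 19)): evaluate the BFE of every non-empty subset of candidate variables and rank them, so that near-optimal combinations are visible alongside the optimum. The closed-form BFE of the previous sketch is repeated here so the block is self-contained.

```python
from itertools import combinations
import numpy as np
from scipy.stats import multivariate_normal

def bfe(X, y, sigma2=0.1, alpha=1.0):
    """Closed-form BFE for conjugate Bayesian linear regression
    (same assumptions as the sketch above)."""
    cov = sigma2 * np.eye(len(y)) + X @ X.T / alpha
    return -multivariate_normal(np.zeros(len(y)), cov).logpdf(y)

def exhaustive_search(X, y, names):
    """Rank every non-empty subset of candidate variables by BFE."""
    k = X.shape[1]
    results = []
    for r in range(1, k + 1):
        for idx in combinations(range(k), r):
            results.append((bfe(X[:, list(idx)], y),
                            [names[i] for i in idx]))
    return sorted(results, key=lambda t: t[0])   # smallest BFE first

# With ~20 candidate variables there are 2^20 - 1 subsets, which is
# still tractable; the full ranking also reveals near-optimal models,
# i.e. the "density of states" around the optimum mentioned above.
```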
As an application of ES-LiR, we show an example dealing with an estimation formula for the martensite-start temperature (Ms point) of steel.20) It is known that the Ms point of steel depends strongly on the concentrations of alloying elements such as carbon, and many empirical formulas have been proposed over the years.21–23) An example is the following:23)
\begin{align*} \textit{Ms} &= 512 - 453C + 15\textit{Cr} - 16.9\textit{Ni} - 9.5\textit{Mo} \\ &\quad + 217C^{2} - 71.5\,C\,\textit{Mn} - 67.6\,C\,\textit{Cr} \end{align*}
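As a direct transcription, the formula can be evaluated as follows (element symbols denote concentrations in mass% and the result is in °C; the composition in the usage line is illustrative only):

```python
def ms_estimate(C, Mn=0.0, Cr=0.0, Ni=0.0, Mo=0.0):
    """Empirical Ms-point estimate (deg C) from the formula above;
    arguments are alloying-element concentrations in mass%."""
    return (512 - 453*C + 15*Cr - 16.9*Ni - 9.5*Mo
            + 217*C**2 - 71.5*C*Mn - 67.6*C*Cr)

print(ms_estimate(C=0.15, Mn=1.5))   # e.g. the Fe-0.15C-1.5Mn steel of Sec. 2.2.2
```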

Fig. 3 (a) BFE distribution and estimated formulas; (b) Optimum solution.

Fig. 4 (a) Variable selection in the ten best models (by BFE); (b) Importance of variables.
2.2.2 Model selection

In problems involving structural materials, it is not uncommon for multiple mechanisms to be proposed to explain a single phenomenon. It is no exaggeration to say that the judgement as to which mechanism governs the actual phenomenon has rested on the intuition and experience of individual researchers. Making such judgements properly requires long years of experience, and ultimately depends on the depth of the individual researcher's tacit knowledge. However, such tacit knowledge is difficult to pass on to other researchers, and a methodology that does not depend on intuition and experience is needed. Here, we show an example in which we compute the BFEs of multiple proposed mechanisms and examine whether the most likely model can be selected.
In general, it is difficult to obtain the BFE analytically when a model is nonlinear. Therefore, approximate estimation methods based on Markov chain Monte Carlo (MCMC) sampling are widely used.24) Among these, the exchange MCMC (replica exchange) method was adopted in the MI system, because the sampling does not easily become trapped in a local solution even when the posterior distribution is multimodal, and the BFE can be obtained relatively inexpensively.25) In the exchange MCMC method, inverse temperatures βi are introduced, and the tempered posterior distribution for each βi is set to $P(y|x)^{\beta _{i}}P(x)$; exchanges between different inverse temperatures are performed in conjunction with conventional MCMC updates. In this case, the BFE is given by the following equation:
\begin{align*} \mathit{BFE} &= -\ln \int P(y|x,M_{i})P(x|M_{i})dx \\ &= -\sum_{j}\ln\left\langle\exp\left((\beta_{j + 1} - \beta_{j})\ln P(y|x,M_{i})\right)\right\rangle_{\beta_{j}} \end{align*}
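The following is a minimal self-contained sketch of an exchange MCMC estimate of the BFE for an illustrative nonlinear model (a two-parameter sine fit), not the MI-system module: the model, priors, ladder size, step widths, and iteration counts are all arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative nonlinear model: y = a*sin(b*t) + noise (assumed toy data)
t = np.linspace(0, 10, 40)
y_obs = 1.5 * np.sin(0.8 * t) + 0.2 * rng.standard_normal(t.size)

def log_lik(x):
    r = y_obs - x[0] * np.sin(x[1] * t)
    return -0.5 * np.sum(r**2) / 0.2**2 - t.size * np.log(0.2 * np.sqrt(2 * np.pi))

def log_prior(x):
    return -0.5 * np.sum(x**2) / 10.0**2      # broad Gaussian prior

# Replica-exchange MCMC over an inverse-temperature ladder (beta_0 = 0)
betas = np.concatenate([[0.0], np.geomspace(1e-3, 1.0, 15)])
xs = rng.standard_normal((betas.size, 2))
ll = np.array([log_lik(x) for x in xs])
ll_store = [[] for _ in betas]

for step in range(10000):
    for j, b in enumerate(betas):             # Metropolis update per replica
        prop = xs[j] + 0.1 * rng.standard_normal(2)
        dlp = b * (log_lik(prop) - ll[j]) + log_prior(prop) - log_prior(xs[j])
        if np.log(rng.random()) < dlp:
            xs[j], ll[j] = prop, log_lik(prop)
        ll_store[j].append(ll[j])
    j = rng.integers(betas.size - 1)          # propose a neighbour swap
    if np.log(rng.random()) < (betas[j + 1] - betas[j]) * (ll[j] - ll[j + 1]):
        xs[[j, j + 1]], ll[[j, j + 1]] = xs[[j + 1, j]], ll[[j + 1, j]]

# BFE = -sum_j ln < exp(dbeta_j * lnL) >_{beta_j}, with a log-sum-exp shift
bfe = 0.0
for j in range(betas.size - 1):
    s = np.array(ll_store[j][2000:])          # discard burn-in samples
    db = betas[j + 1] - betas[j]
    bfe -= np.log(np.mean(np.exp(db * (s - s.max())))) + db * s.max()
print("BFE =", bfe)
```

Running the same sampler for each candidate model and comparing the resulting BFEs is the model-selection procedure used in the example that follows.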
As an application of the exchange MCMC method, we show an example in which the model that most plausibly explains the actual transformation behavior of steel was selected from several candidate models using experimental data.26) Figure 5(a) shows the evolution of the ferrite transformation rate, derived from a thermal expansion measurement, when a steel (Fe–0.15C–1.5Mn) was cooled from 1400°C at 0.3°C/s. The question here is whether the kinetics of the ferrite transformation can be discussed using only this curve. With no prior knowledge of the formed structures, three ferrite morphologies can be supposed: the first is lenticular grain-boundary ferrite (LF), composed only of incoherent interfaces; the second is a grain-boundary ferrite called pill-box ferrite (PF), composed of both coherent and incoherent interfaces; and the third is a planar ferrite called ferrite side plate (FSP). There are then four candidate combinations: two cases in which LF or PF forms alone, one in which FSP forms after LF, and one in which FSP forms after PF (Fig. 5(b)). The change of the transformation rate in each model is described by the Johnson–Mehl–Avrami–Kolmogorov (JMAK) equation.27,28) Figure 6 compares the experimental results with the three models having the smallest estimation errors, where the parameters of all models were obtained by maximum likelihood estimation. The estimation error is sufficiently small for all the models, and it is difficult to select the optimum model on this basis alone. However, as shown in Table 1, the BFE indicates that the PF+FSP model has a much larger likelihood than the other models. In fact, metallurgical observations show that FSP forms after pill-box grain-boundary ferrite has formed, and the transition temperature is also consistent with the PF+FSP model, which supports the validity of the estimation based on the BFE.26)
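For reference, the JMAK equation gives the transformed fraction as X(t) = 1 − exp(−k tⁿ). A sketch of how a two-stage candidate model (e.g. PF followed by FSP) might be composed is shown below; the rate constants, exponents, and switching time are illustrative assumptions, not the fitted values of Ref. 26).

```python
import numpy as np

def jmak(t, k, n):
    """JMAK transformed fraction X(t) = 1 - exp(-k t^n)."""
    return 1.0 - np.exp(-k * np.maximum(t, 0.0)**n)

def two_stage_model(t, k1, n1, k2, n2, t_sw):
    """Illustrative two-stage kinetics: a first mode (e.g. PF) up to the
    switching time t_sw, after which a second mode (e.g. FSP) consumes
    the remaining untransformed fraction."""
    x_sw = jmak(t_sw, k1, n1)                 # fraction at the switch
    first = jmak(t, k1, n1)
    second = x_sw + (1.0 - x_sw) * jmak(t - t_sw, k2, n2)
    return np.where(t < t_sw, first, second)

t = np.linspace(0.0, 100.0, 200)
X = two_stage_model(t, k1=1e-3, n1=2.0, k2=5e-3, n2=1.5, t_sw=40.0)
```

Each of the four candidate combinations corresponds to such a composed curve, whose parameters enter the likelihood used in the exchange MCMC sketch above.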

Fig. 5 Transition of the ferrite transformation rate (1400°C, 0.3°C/s) and ferrite transformation models.

Fig. 6 Results of maximum likelihood estimation of the ferrite transformation rate.

3. 3D/4D analysis system

As the 3D/4D analysis system, we developed the Material Image Communication Cloud (MICC),29) a cloud system that efficiently handles image data for obtaining three-dimensional information on huge material structures, and the Materials Integration Phase Analyzer (MIPHA),30) which utilizes three-dimensional structures in the prediction of properties.
3.1 MICC

When considering the properties of structural materials, the details of three-dimensional material structures are important. To obtain such details, it is necessary to collect two-dimensional images of cross-sectional structures, and to have a three-dimensional image processing environment capable of reconstructing a three-dimensional image from them. Preparing such an environment, however, is costly and requires specialized knowledge. Hence, conventional analyses of material structures have used only two-dimensional cross-sectional images. Moreover, even when a three-dimensional image of a material structure is constructed, deep knowledge of structure formation in the material is needed to extract its complex geometric forms, which is not an easy task even for experts with years of experience. A system is therefore needed that allows images to be processed quickly by trial and error, rather than relying solely on knowledge and experience.
The MICC is a cloud-based three-dimensional image processing environment accessible from users' local terminals through a network (Fig. 7). It can manage access authorizations to image data for each laboratory or researcher, and is equipped with a mechanism for data sharing between users; it can thus support cooperation among multiple research institutions and the optimization of research collaborations. It also has functions to display and manage image processing histories, keeping records of the processing applied to the same image by different users (Fig. 8). Using such branched image histories, we expect that the optimum image processing procedure can in the future be found through the accumulated trial and error of many researchers.

Fig. 7 Overview of the MICC.

Fig. 8 Compilation of a database containing image processing histories.
As an application example of the MICC, we show a case in which it was applied to a welded part of a steel. Figure 9(a) shows cross-sectional images, in various orientations, of the three-dimensional structure of the material, obtained by reconstructing photomicrographs of the steel structure taken by serial sectioning. Extracting the material structure on arbitrary cross-sections in this way makes it easy to extract hierarchical three-dimensional structures, such as prior austenite grain boundaries or the blocks/packets of a martensite structure. Figure 9(b) shows the three-dimensional structure of a block extracted in this way. In this example, an expert extracted the structure boundaries from the structure forms on the various cross-sections, but it is expected that, by learning from data in which the complete work histories of such extraction processes are accumulated, structures can be discriminated and extracted automatically from photomicrographs obtained by serial sectioning.

Fig. 9 Reconstruction of a three-dimensional structure and extraction of a steel structure.
3.2 MIPHA

As mentioned before, every materials researcher supposes that the three-dimensional geometrical structure of a material should be important for predicting its properties. However, how to collect actual three-dimensional information on material structures, how to use the collected information, and how effective it is are not necessarily obvious. Therefore, before the MI system started to operate, it was necessary to demonstrate how three-dimensional structure information can be collected and utilized, and what effect it has. We developed the MIPHA as an advance prototype for this purpose.
In keeping with the overall philosophy of MI, the MIPHA is an integrated system in which the following five elements, each essential for dealing with three-dimensional information on material structures, are performed as individual modules: (1) structure discrimination, (2) two-dimensional feature extraction, (3) three-dimensional feature extraction, (4) property estimation, and (5) inverse estimation. The feature quantities include both those important for materials engineering (such as the grain size and the volume fraction) and those important in mathematics and image engineering (such as the autocorrelation function, mutual information, distances based on the two-point correlation function, and persistent homology). The MIPHA extracts the former from an image with one click; the shiny MIPHA (under development) extracts the latter. Furthermore, the MIPHA is a tool that uses these feature quantities to predict properties easily and with high accuracy.
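As an illustration of the second class of descriptors, the following sketch computes the two-point (auto)correlation of a binary phase image with FFTs. It is a generic implementation of the textbook definition under periodic boundary conditions, not MIPHA code.

```python
import numpy as np

def two_point_correlation(phase):
    """Two-point probability S2(r) of a binary (0/1) phase image:
    the probability that two points separated by vector r both lie
    in phase 1. Computed via FFT autocorrelation with periodic
    boundary conditions."""
    f = np.fft.fftn(phase.astype(float))
    s2 = np.fft.ifftn(f * np.conj(f)).real / phase.size
    return np.fft.fftshift(s2)      # put zero separation at the center

# Example: random two-phase image; S2 at r = 0 equals the volume fraction
img = (np.random.default_rng(3).random((128, 128)) < 0.3).astype(int)
s2 = two_point_correlation(img)
print(s2[64, 64], img.mean())       # both ~0.3
```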
As an application of the MIPHA, we show an example in which the properties of a DP steel were estimated. Figure 10 shows an example in which ferrite and martensite were automatically recognized by machine learning from photomicrographs obtained by serial sectioning, and three-dimensional structures were reconstructed from that information. Figure 11 shows the material structure factors that influence stress-strain curves. Many of the geometrical descriptors included there can be extracted by the MIPHA, and the stress-strain curve can be readily estimated by applying various machine learning techniques to these descriptors.30)
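A minimal sketch of this forward analysis (descriptors to properties, cf. Fig. 11) is shown below, using ridge regression from microstructure descriptors to the flow stress at a few fixed strains. The descriptor names and all numbers are synthetic placeholders, not the data or the model of Ref. 30).

```python
import numpy as np

def ridge_fit(X, Y, lam=1e-2):
    """Ridge regression: W minimizing ||X W - Y||^2 + lam * ||W||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Hypothetical descriptors per sample: [martensite volume fraction,
# mean ferrite grain size, a connectivity index]; targets: flow stress
# (MPa) at strains 0.02, 0.05, 0.10. All numbers are synthetic.
rng = np.random.default_rng(4)
X = rng.random((60, 3))
true_W = np.array([[300, 400, 500], [-50, -60, -70], [80, 100, 120]])
Y = X @ true_W + 500 + 5 * rng.standard_normal((60, 3))
Xb = np.column_stack([X, np.ones(len(X))])   # add an intercept column
W = ridge_fit(Xb, Y)
print(W.round(1))   # last row ~500 (intercept); other rows ~true_W
```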

Fig. 10 Reconstruction of a three-dimensional structure and extraction of DP steel structures.

Fig. 11 Descriptors available in the MIPHA and forward analysis/inverse analysis.
4. Summary

The goal of the property-space analysis system was to rapidly enhance the accuracy of both structure and performance prediction using methods from data science. It would clearly have been possible to follow the lead of preceding projects, e.g., the Materials Genome Initiative (MGI) in the United States, and to choose general machine learning models such as neural networks, support vector machines, and Gaussian processes. However, as researchers who have directly dealt with and observed structural materials, we could not dispel our doubts about whether the “evaluation by point estimation alone” provided by those methods is truly sufficient. The promising alternatives we identified in our survey of various fields are data assimilation and sparse modeling. We also noted that, although three-dimensional material structures have long been said to be of great importance, they are obtained and used by only a limited number of researchers, and how best to use them is far from clear. The MICC is a system that aims to break down the barriers to the availability of three-dimensional structures, while the MIPHA provides a prototype for their utilization. The libraries and modules developed in the first-term SIP and introduced in the MI system are just beginning to come into use. Many phenomena remain to be fully elucidated, but we are confident that the systems we have developed are highly compatible with structural materials exhibiting multi-scale non-uniformity. In the second-term SIP, we expect these systems to prove their merits through application to many metallurgical problems.
Acknowledgments

This research was performed under the research themes “Structural Materials for Innovation” and ‘“Materials Integration” for Revolutionary Design System of Structural Materials’ (management corporation: JST) of the Cross-ministerial Strategic Innovation Promotion Program (SIP) of the Council for Science, Technology and Innovation, Cabinet Office. We would like to express our gratitude here.