Interpretability of Deep Learning Classification for Low-Carbon Steel Microstructures

Tatsuya Maemura; Hidenori Terasaki; Kazumasa Tsutsui; Kyohei Uto; Shogo Hiramatsu; Kotaro Hayashi; Koji Moriguchi; Shigekazu Morito

doi:10.2320/matertrans.MT-M2020131

Abstract

In this paper, a model is developed to identify the microstructure of low-carbon steel by deep learning. In classifying steel microstructures using a machine learning model, predictions are interpreted using local interpretable model-agnostic explanations (LIME) for the first time. The constructed model can accurately distinguish between eight microstructure types, including upper bainite, lower bainite, martensite, and their mixed structures. The model accuracy is 94.1% when individually predicted and 97.9% when predicted by majority vote. In addition, as a result of interpreting the predictions of the model by LIME, it is evident that the recognition criterion of the constructed model is partially consistent with the classic recognition criterion.

1. Introduction

Steel is used for many applications, such as automobiles, ships, pipes, and structures. To obtain the desired properties, it is necessary to add alloying elements and conduct thermomechanical treatment to obtain an appropriate microstructure. An important additive element is carbon. The final form of carbon is diverse; for example, in low-carbon steel, in response to heat treatment, the following microstructures exist: upper bainite, wherein carbides form at the lath interface; lower bainite, wherein carbides form within the lath; martensite, wherein no carbides form (and it remains in solid solution); and pearlite with a lamellar morphology.¹⁾ In particular, in upper bainite and lower bainite, carbon dissolved in the austenitizing temperature range condenses in the ferrite temperature range to form carbide, which means that the final carbide precipitation form depends on the holding temperature and holding time in the ferrite range.²⁾ Therefore, upper bainite and lower bainite change depending on the heat treatment.

In material development and production, subtle microstructures are identified by skilled personnel based on micrographs obtained by scanning electron microscopy (SEM) or optical microscopy (OM); the results are therefore highly dependent on the evaluator. To examine microstructures more accurately, it is necessary to adopt a classification method based on mathematical algorithms. In recent years, with the rise of deep learning, steel microstructures have been increasingly examined using image identification and segmentation. For example, Adachi et al.³⁾ classified ferrite, pearlite, ferrite–pearlite, and ferrite–martensite structures using a convolutional neural network (CNN). Likewise, Mulewicz et al.⁴⁾ used a CNN to classify, with high accuracy, the microstructure OM images of eight steel specimens with different compositions. In addition, DeCost et al.⁵⁾ and Ling et al.⁶⁾ conducted feature extraction by CNN from ultrahigh-carbon steel microstructure images and classified the microstructures accordingly. Azimi et al.⁷⁾ conducted pixel-wise segmentation of low-carbon steel microstructure images using a fully convolutional neural network with a max voting scheme. DeCost et al.⁸⁾ applied a CNN-based segmentation model to characteristic objects in ultrahigh-carbon steel microstructure images. Chokshi et al.⁹⁾ presented a neural network-based model to predict the phase distribution of boron steel. These reports suggest that the use of deep learning can result in the development of new methods for identifying steel microstructures. However, deep learning achieves its accuracy by optimizing a large number of parameters, sometimes in the millions. Therefore, it is not easy to judge why a given prediction was made or to ensure its rationality. For example, if differences in the imaging environment, the image source, or the deterioration of SEM images are not taken into account, identification may not be robust to new images, and false judgments may be made based on background noise.¹⁰⁾

In the existing literature, there are also studies that conduct identification through feature extraction, such as texture analysis, instead of directly inputting images;¹¹⁻¹⁴⁾ other approaches to identification are based on crystallographic features.¹⁵,¹⁶⁾ These approaches are advantageous in that they can be implemented with relatively small datasets. However, when a sufficient amount of data can be secured and the parameters can be optimized, a more complex deep learning model is considered superior in terms of identification accuracy.

In recent years, an algorithm called LIME has been devised for interpreting the predictions of machine learning models.¹⁷⁾ Given a trained deep learning network that takes an image as input, this method creates a local surrogate model around a prediction result to estimate which regions of the image contributed to the classification. Therefore, using LIME in combination with a deep learning model is expected to enable quantitative, high-performance discrimination that does not depend on the evaluator, while also presenting the basis for steel microstructure classification.

In this study, a microstructure classification model is constructed end to end using deep learning with SEM images as the inputs. Furthermore, the interpretability of the predictions is investigated by LIME. The specific contributions of this research are outlined below.

(1) Identification of eight microstructures using deep learning, including the typical microstructures of low-carbon steel, such as upper bainite, lower bainite, martensite, and their mixed structures
(2) Using LIME to interpret the predictions made using the classification model for upper bainite, lower bainite, and martensite

In doing so, it is suggested that the microstructure identification criteria of the constructed deep learning model are partially consistent with the classical identification criteria.

2. Experimental Procedures

2.1 Materials

General low-carbon steel with a chemical composition of Fe–0.1 C–0.01 Si–2.0 Mn–0.008 P–0.001 S (mass%) was used. For the specimens, eight sheets (200 × 20 × 2 mm) were cut from the hot-rolled flat plate and subjected to different heat treatments using an electric heating device. At first, the eight specimens were heated to 1,273 K (1,000°C) and isothermally maintained at said temperature for 30 s to austenitize the microstructures. Thereafter, the specimens were cooled to room temperature using different processes, which are outlined below.

The specimens T1–T5 were cooled from 1,273 K (1,000°C) to 773 K (500°C), 723 K (450°C), 673 K (400°C), 623 K (350°C), and 573 K (300°C), respectively, at a rate of 50 K/s. They were then isothermally maintained at these temperatures for 1,000 s, after which they were cooled to room temperature at a rate of 50 K/s. In contrast, T6–T8 were cooled directly to room temperature at a cooling rate of 50 K/s, by helium gas cooling, and by water cooling, respectively. These heat treatments are shown in Fig. 1. As a result of the heat treatment, different microstructures formed in the eight specimens. Table 1 shows the labeling results of the test materials. T1 is a single-phase structure of upper bainite, T2–T5 consist of a mixed structure of upper bainite and lower bainite, T6 is a single-phase structure of lower bainite, T7 is a mixed structure of lower bainite and martensite, and T8 is a single-phase structure of martensite. Although T2–T5 all have mixed structures, their microstructures vary because of the different heat treatments.

Fig. 1

Thermal treatment.

Table 1 Isothermal holding temperature.

2.2 SEM image acquisition conditions and dataset

In preparation for SEM imaging, the specimens were polished using #320 sheet, 9 and 3 µm diamond abrasive grains and alumina and then etched using 3% nital liquid. The SEM was equipped with a tungsten hairpin thermionic emission electron gun. The SEM imaging conditions consisted of the following: acceleration voltage of 15 kV, a working distance of 8 mm, and a magnification of 1,500×. Overall, 30 SEM images were taken for each sample, and a dataset of 240 SEM images was prepared.

2.3 Method

This section describes the scheme of this research. An overview is shown in Fig. 2. First, the prepared SEM image dataset was divided into 80% and 20% portions for each test material, with the 80% used as training data and the 20% as test data. Thus, 24 SEM images of each test material were used as the training data, and six were used as the test data; the training data were augmented. Next, contrast limited adaptive histogram equalization (CLAHE) was applied to the training data and test data as preprocessing. We then trained the CNN, ResNet50, with the preprocessed training data. At this stage, a method called fine tuning was used, which increases learning efficiency by conducting additional learning using the weights trained on a large image dataset as the initial values. This method has been used in previous studies on microstructure classification, and its effectiveness has been confirmed.⁴,⁵,⁷⁾ The preprocessed test data were input to the fine-tuned model, and its behavior was evaluated. Then, we used the constructed microstructure identification model to explain the predictions for the SEM images of upper bainite (T1), lower bainite (T6), and martensite (T8), and attempted to interpret them.
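The per-sample 80%/20% division described above can be sketched as follows. This is a minimal illustration only: the random shuffling, the seed, the function name `split_per_sample`, and the image identifiers are assumptions, since the text does not state how the images were assigned to each portion.

```python
import random


def split_per_sample(image_ids_by_sample, train_fraction=0.8, seed=0):
    """Split each sample's image list into train and test parts independently."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = {}, {}
    for sample, image_ids in image_ids_by_sample.items():
        shuffled = list(image_ids)
        rng.shuffle(shuffled)
        n_train = round(len(shuffled) * train_fraction)
        train[sample] = shuffled[:n_train]
        test[sample] = shuffled[n_train:]
    return train, test


# Eight samples (T1-T8) with 30 SEM images each; the identifiers are hypothetical.
dataset = {f"T{k}": [f"T{k}_{i:02d}" for i in range(30)] for k in range(1, 9)}
train, test = split_per_sample(dataset)
```

Splitting within each sample, rather than over the pooled 240 images, keeps the 24/6 ratio identical for every microstructure class.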

Fig. 2

Scheme of our work with preprocessing, deep learning, testing, and LIME.

2.3.1 Data augmentation

When training deep learning models, a large dataset is generally required. Indeed, if a large amount of high-quality data is prepared, the possibility of developing a highly accurate model increases. Accordingly, for deep learning, it is common to extend the dataset by augmentation. In this paper, augmentation was only conducted on the training data. First, five squares (896 × 896 px) were cut out from each SEM image at the center, upper left, upper right, lower left, and lower right. In image classification with deep learning, it is common to inflate the data by shifting the field of view of an image immediately before inputting it into a model. In this study, instead of performing this shift at input time, the five extracted squares were allowed to overlap. Then, each extracted square was divided vertically and horizontally into 16 squares (224 × 224 px). Furthermore, these squares were augmented fourfold by adding copies rotated by 90, 180, and 270 degrees. With the processing up to this point, 320 training images were generated from one SEM image. Therefore, 9,600 training images were prepared for each sample, and a total of 76,800 training images were prepared for all samples. Finally, these 76,800 images were divided into two groups: 61,440 images for training and 15,360 images for validation. In addition, random horizontal and vertical flips were applied to augment the training data; these were performed immediately before the data were input into the model.
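The counting in this subsection can be checked with a short sketch that enumerates the tile positions instead of handling pixel data. The source image size of 1,280 × 960 px and the function names are assumptions made for illustration; the actual SEM image size is not stated in the text.

```python
CROP = 896   # side of each large square crop (px)
TILE = 224   # side of each tile fed to the network (px)
ROTATIONS = (0, 90, 180, 270)


def corner_and_center_crops(width, height, crop=CROP):
    """Top-left (x, y) corners of the center crop and the four corner crops."""
    right, bottom = width - crop, height - crop
    return [
        (right // 2, bottom // 2),  # center
        (0, 0),                     # upper left
        (right, 0),                 # upper right
        (0, bottom),                # lower left
        (right, bottom),            # lower right
    ]


def augmented_tiles(width, height):
    """Yield (x, y, rotation) for every 224 px tile derived from one image."""
    for cx, cy in corner_and_center_crops(width, height):
        for row in range(CROP // TILE):
            for col in range(CROP // TILE):
                for angle in ROTATIONS:
                    yield (cx + col * TILE, cy + row * TILE, angle)


# Hypothetical 1280 x 960 px source image (assumed size).
tiles = list(augmented_tiles(1280, 960))
print(len(tiles))  # 5 crops x 16 tiles x 4 rotations = 320
```

With 320 tiles per SEM image, 30 images give the stated 9,600 images per sample, eight samples give 76,800, and the 61,440/15,360 division corresponds to an 80%/20% ratio.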

2.3.2 Preprocessing

The images in the dataset underwent CLAHE processing. This was done to compensate for the brightness and contrast, which fluctuate during imaging, as well as for differences in the light source environment. For this, the OpenCV library¹⁸⁾ was used. The two parameters, clipLimit and tileGridSize, were set to 2.0 and (8, 8), respectively.

2.3.3 Deep learning method

In this study, a CNN was used, which is a neural network with a structure suitable for handling data with spatial information, such as images. Specifically, a residual network (ResNet) was adopted,¹⁹⁾ which mitigates the loss of gradients in deep networks and thereby allows accuracy to improve as layers are added.

Table 2 shows the computing environment used in this study. Keras²⁰⁾ was selected as the deep learning framework with Google TensorFlow²¹⁾ as the back end, and both were used in an Anaconda virtual environment. Version 1.13.1 was selected for Tensorflow-gpu and version 2.2.4 for Keras-gpu. Python 3.6 was specified, and NVIDIA’s CUDA and cuDNN, which are required to use the GPU for deep learning, were versions 10.1 and 7.6.0, respectively.

Table 2 Computing environment.

Next, we explain the five types of hyperparameters set during learning: batch size, loss function, number of learning epochs, learning rate, and optimization method. First, the batch size was determined as 48. For the loss function, categorical_crossentropy was used and the number of training epochs was set to 200. As the learning progressed, the learning rate decreased with the schedule shown in Table 3. Adam²²⁾ was selected as the optimization method. Adam has the characteristic that bias correction is conducted when parameters are updated. As recommended by Ref. 22), the parameters specified when using Adam in Keras were set as β_1 = 0.9 and β_2 = 0.999. The loss function, the number of training epochs, the learning rate, and the optimization method were determined according to Keras’ published code examples.²³⁾
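The bias correction mentioned above can be illustrated with a single Adam update for one scalar parameter. This is a sketch of the update rule from Ref. 22), not the Keras implementation; the step size `lr=1e-3` and `eps=1e-8` are assumed values, because the learning rates of Table 3 are not reproduced here.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (theta, m, v)."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)                # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# First step (t = 1) starting from zero moments, with an arbitrary gradient of 5.0.
theta, m, v = adam_step(theta=0.0, grad=5.0, m=0.0, v=0.0, t=1)
```

Because the moments start at zero, the uncorrected estimates at the first step are only 10% and 0.1% of their target values; after correction, the first update has a magnitude close to the step size regardless of the gradient scale.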

Table 3 Schedule of learning rate decay.

Finally, we explain the fine tuning. Material data (such as SEM images) are often expensive to acquire, and it is difficult to prepare large datasets.²⁴⁾ Fine tuning is an effective method for training deep learning models efficiently when the dataset is small. Specifically, the model is first trained on a large dataset; then, using the parameters of the trained model as initial values, it is retrained on the dataset at hand. Keras provides an application for fine tuning, a mechanism that can use parameters learned from a large dataset (called ImageNet²⁵⁾) as initial values. In this study, Keras was used to conduct fine tuning. The CNN model has convolutional layers on the input side and, on the output side, a fully connected layer that plays the role of a classifier together with an output layer corresponding to the number of classes. ImageNet, which was used for pre-training here, is a dataset for 1,000-class classification. Therefore, the classifier on the output side of the model trained on this dataset did not have a shape suitable for classifying the microstructures, and it was replaced. The pre-trained parameters were used as the initial values of the convolutional layers, and the structure of the output layer was changed to match the microstructure dataset. In particular, the output of the fine-tuned part was input to an average pooling layer with a filter size of (7, 7). The pooling layer receives the output of the convolutional layers and reduces its spatial size, which provides tolerance to positional shifts of objects in the input image. Then, the output of the average pooling layer was input to the fully connected layer, which produced the final output. In this fully connected layer, activation was set to the softmax function, and kernel_initializer was set to He initialization.²⁶⁾
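The replaced classifier head (average pooling, then a fully connected softmax layer with He initialization) can be sketched without a deep learning framework. This is an illustration under stated assumptions, not the Keras model used in the study: the 7 × 7 spatial map is assumed so that the (7, 7) pooling window covers it entirely, the channel count of 16 is chosen small for readability, and the weights are drawn from an untruncated zero-mean Gaussian with variance 2/fan_in as in Ref. 26).

```python
import math
import random

NUM_CLASSES = 8  # T1-T8


def average_pool(feature_map):
    """Average an H x W x C feature map over H and W (full-size pooling window)."""
    h, w, c = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    return [sum(feature_map[i][j][k] for i in range(h) for j in range(w)) / (h * w)
            for k in range(c)]


def he_normal(fan_in, fan_out, rng):
    """Weights drawn from a zero-mean Gaussian with variance 2 / fan_in."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(feature_map, weights, biases):
    """Pool the feature map, apply the dense layer, and return class probabilities."""
    pooled = average_pool(feature_map)
    logits = [sum(p * weights[i][j] for i, p in enumerate(pooled)) + biases[j]
              for j in range(len(biases))]
    return softmax(logits)


rng = random.Random(0)
channels = 16  # hypothetical; chosen small for illustration
feature_map = [[[rng.random() for _ in range(channels)] for _ in range(7)]
               for _ in range(7)]
weights = he_normal(channels, NUM_CLASSES, rng)
probs = classify(feature_map, weights, [0.0] * NUM_CLASSES)
```

The eight outputs form a probability distribution over T1–T8; during fine tuning, only the shape of this head changes, while the convolutional layers start from the pre-trained parameters.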

2.3.4 Evaluation method

In this study, evaluation was conducted in two ways, as shown in Fig. 3. First, predictions were made for one input image, and the accuracy was evaluated. Second, predictions were made for one original SEM image by conducting a majority vote with 20 individual predictions.²⁷⁾ The former is called “not voting,” and the latter is called “voting.”
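The “voting” evaluation can be sketched as a majority vote over the individual predictions of one SEM image. The function name and the example label sequence are illustrative, and the tie-breaking behavior (the label seen first wins) is an assumption, since the text does not say how ties were handled.

```python
from collections import Counter


def majority_vote(tile_predictions):
    """Return the label predicted most often across the tiles of one SEM image."""
    counts = Counter(tile_predictions)
    label, _ = counts.most_common(1)[0]  # ties resolve to the first label seen
    return label


# 20 individual predictions for one image (illustrative label sequence).
tile_predictions = ["T1"] * 18 + ["T4"] * 2
image_label = majority_vote(tile_predictions)
```

A few misclassified tiles therefore do not change the image-level label as long as the correct class keeps the largest share of the 20 votes.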

Fig. 3

Test method (“Not voting” and “Voting”).

2.3.5 Interpreting predictions using LIME

We attempted to interpret the model predictions using LIME,¹⁷⁾ which explains the predictions made by a machine learning model by locally approximating its nonlinear discriminant function with an interpretable model. The explanation by LIME is obtained according to eq. (1):¹⁷⁾

$\xi (x) = \mathop{\text{argmin}}\limits_{g \in G}\mathcal{L}(f,g,\pi_{x}) + \Omega (g)$ (1)

Here, f is the original model; g is an interpretable model drawn from a class G of candidate models; L is the loss that measures how poorly g approximates f in the neighborhood of the input x, where π_x is the similarity between x and the data sampled around it; and Ω is a term that limits the complexity of g. The g that minimizes the sum of L and Ω is obtained by linear regression. Finally, an explanation of the prediction is provided based on the weights of this linear regression.
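The idea behind eq. (1) can be sketched in a dependency-free form: image regions are switched on and off, the black-box model f is queried, and a locally weighted linear model g is fitted. This is a simplified illustration, not the LIME library: all on/off combinations are enumerated instead of being randomly sampled, the distance is taken as the fraction of hidden regions, the kernel width of 0.25 is an assumed value, Ω(g) is not modeled, and `black_box` is a hypothetical stand-in for the trained CNN.

```python
import itertools
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def explain(black_box, n_segments, kernel_width=0.25):
    """Fit a weighted linear surrogate g to black_box around the full image."""
    rows, targets, weights = [], [], []
    for z in itertools.product((0, 1), repeat=n_segments):
        distance = 1.0 - sum(z) / n_segments       # fraction of hidden segments
        weights.append(math.exp(-(distance ** 2) / kernel_width ** 2))  # pi_x(z)
        rows.append([1.0] + [float(val) for val in z])  # leading 1 is the intercept
        targets.append(black_box(z))
    dim = n_segments + 1
    # Normal equations of weighted least squares: (X^T W X) beta = X^T W y
    xtwx = [[sum(w * r[i] * r[j] for w, r in zip(weights, rows))
             for j in range(dim)] for i in range(dim)]
    xtwy = [sum(w * r[i] * y for w, r, y in zip(weights, rows, targets))
            for i in range(dim)]
    return solve(xtwx, xtwy)[1:]  # per-segment contributions, intercept dropped


def black_box(z):
    """Hypothetical class probability that depends on segments 0 and 2 only."""
    return 0.1 + 0.6 * z[0] + 0.2 * z[2]


contributions = explain(black_box, n_segments=4)
top_segment = max(range(4), key=lambda i: contributions[i])
```

The regression weights play the role of the explanation: the segments with the largest weights are the regions that would be highlighted in the output image.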

When this LIME is conducted on the image recognition model, the part of the image that contributed to the prediction is extracted and presented as a description of the prediction. In this study, we compared the explanation with classical recognition criteria.

3. Results and Discussions

Figure 4 shows the SEM images obtained in this study. We acquired a secondary electron image. T1, T2, T3, T4, and T5 are bainite structures formed at 773, 723, 673, 623, and 573 K, respectively; in particular, T1 corresponds to the single-phase structure of upper bainite. The bainite structure is known to change according to the temperature at which it is formed; accordingly, T2, T3, T4, and T5 have different structures. T6 corresponds to the single-phase structure of lower bainite, T7 corresponds to the mixed structure of lower bainite and martensite, and T8 corresponds to the single-phase structure of martensite. Here, we focus on the single-phase structures, T1, T6, and T8. T1 has carbide precipitated between the laths, whereas T6 has carbide precipitated in the laths. In addition, T8 has no carbide precipitation. These features are consistent with the classical identification criteria. Moreover, it is difficult to visually distinguish between the bainite structures of T2, T3, T4, and T5. The SEM images were acquired as 30 grayscale images with 256 gradations for each sample.

Fig. 4

SEM images.

In the computational environment, learning took roughly 21 h. Here, we discuss the results of verifying the generalization performance of the trained model using two different test methods. On the one hand, the not voting method has an accuracy of 94.1%. This is the result of correctly classifying 903 images in a total of 960 test images, indicating a high identification performance. The details of the test results are shown in the confusion matrix in Table 4, from which it is evident that, although the SEM images of T2 and T8 were correctly classified, the SEM images of T3, T4, and T6 were misclassified.

Table 4 Confusion Matrix of “Not voting”.

On the other hand, the voting method has an accuracy of 97.9%. This is the result of correctly classifying 47 images out of a total of 48 test images, with six images in each sample. Details of the test results are shown in the confusion matrix in Table 5, from which it is evident that one SEM image of T3 was misclassified as T4.
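The two reported accuracies follow directly from the counts given above, as the short check below shows. The helper name `accuracy_percent` is illustrative.

```python
def accuracy_percent(correct, total):
    """Accuracy in percent, rounded to one decimal place."""
    return round(100.0 * correct / total, 1)


test_images = 6 * 8      # six test SEM images for each of the eight samples
tiles_per_image = 20     # individual predictions per SEM image

not_voting = accuracy_percent(903, test_images * tiles_per_image)  # 903 of 960
voting = accuracy_percent(47, test_images)                         # 47 of 48
```

The 960 individual test images thus correspond to the 48 test SEM images with 20 predictions each.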

Table 5 Confusion Matrix of “Voting”.

The accuracy of both methods requires comparison. Obviously, accuracy was improved by incorporating the voting mechanism. This can be attributed to the differences in the target viewing area between the two test methods. In particular, the SEM images contain local noise that affected classification; the not voting method, which conducts prediction in the narrow field of view 224 × 224 px, was negatively affected by the said noise. Alternatively, in the case of the voting method, which conducts prediction in the wide field of view of 896 × 1,120 px, the effect of noise was reduced because the area where the noise was located was relatively small compared with the field-of-view area, thereby increasing the overall accuracy of the method. Figures 5 and 6 show examples when the voting method was effective. On the one hand, Fig. 5 is an SEM image of T1 where two of the 20 small images were misclassified as T4.

Fig. 5

T1 SEM image correctly classified with the voting method.

Fig. 6

T6 SEM image correctly classified with the voting method.

On the other hand, Fig. 6 shows an SEM image of T6 where four of the 20 small images were misclassified as T3 and two as T8. In comparing the not voting method and the voting method, it is evident that, when conducting microstructure recognition by machine learning, it is vital to use a large field of view and to make judgments using multiple images.

Figure 7 shows the SEM image that was misclassified using the voting method: an image of T3 that was classified as T4. The correct label T3 received 6 votes, whereas T4 received 11 votes and T5 received 3 votes. The predictions thus varied among bainite structures whose formation temperatures differ by 50 K between 573 and 673 K. Although the deep learning model for microstructure classification constructed in this study is highly accurate, it failed to distinguish these microstructures, which are not easily identifiable in general. Accordingly, identifying bainite structures generated in this temperature range is an important issue that requires further examination in future studies.

Fig. 7

T3 SEM image misclassified with the voting method.

3.1 Interpretation by LIME

We attempted to assess the rationality of the model, which is directly related to the trust that can be placed in a machine learning model. Figure 8(a), (b), and (c) show the input SEM images of T1, T6, and T8, respectively, for which the predictions of the constructed deep learning model were interpreted using LIME. The constructed model correctly classified these SEM images. Figure 8(d), (e), and (f) show the LIME results for the predictions of (a), (b), and (c), respectively. In (d), a block boundary and a large carbide between laths were selected, which is consistent with the characteristics of upper bainite. In (e), elongated thin blocks aligned on the same habit plane were selected, which is consistent with the characteristics of lower bainite. In (f), the packet and packet boundary were selected, which is consistent with the characteristics of martensite. Therefore, the microstructure identification criterion of the constructed deep learning model is partially consistent with the classical identification criterion.

Fig. 8

Input SEM images and LIME output images of T1 (Upper Bainite), T6 (Lower Bainite), and T8 (Martensite). (a), (b), and (c) ((d), (e), and (f)) correspond to the input (output) images of T1, T6, and T8, respectively.

4. Conclusion

In this study, we constructed an end-to-end microstructure classification model for eight specimens of low-carbon steel using deep learning. The obtained conclusions are outlined below.

(1) The identification of low-carbon steel microstructures can be automated by using deep learning.
(2) In deep learning with SEM images as the input, eight types of microstructures were identified with an accuracy of 94.1% when using the not voting method and 97.9% when using the voting method.
(3) The voting method is effective with respect to improving accuracy.
(4) As a result of interpreting the model predictions using LIME, it is evident that the deep learning model constructed in this study can recognize the microstructure features that conform to classical criteria.

REFERENCES

1) B.L. Bramfitt and J.G. Speer: Metall. Trans. A 21 (1990) 817–829.
2) H.I. Aaronson, M. Enomoto and J.K. Lee: Mechanisms of Diffusional Phase Transformations in Metals and Alloys, (CRC Press, Boca Raton, 2016).
3) Y. Adachi, M. Taguchi and S. Hirokawa: Tetsu-to-Hagané 102 (2016) 722–729.
4) B. Mulewicz, G. Korpala, J. Kusiak and U. Prahl: Mater. Sci. Forum. Vol. 949, (Trans Tech Publications, Stafa-Zurich, 2019).
5) B.L. DeCost, T. Francis and E.A. Holm: Acta Mater. 133 (2017) 30–40.
6) J. Ling, M. Hutchinson, E. Antono, B. DeCost, E.A. Holm and B. Meredig: Materials Discovery 10 (2017) 19–28.
7) S.M. Azimi, D. Britz, M. Engstler, M. Fritz and F. Mücklich: Sci. Rep. 8 (2018) 2128.
8) B.L. DeCost, B. Lei, T. Francis and E.A. Holm: Microsc. Microanal. 25 (2019) 21–29.
9) P. Chokshi, R. Dashwood and D.J. Hughes: Comput. Struct. 190 (2017) 162–172.
10) C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow and R. Fergus: arXiv:1312.6199 (2013).
11) J. Webel, J. Gola, D. Britz and F. Mücklich: Mater. Charact. 144 (2018) 584–596.
12) J. Gola, J. Webel, D. Britz, A. Guitar, T. Staudt, M. Winter and F. Mücklich: Comput. Mater. Sci. 160 (2019) 186–196.
13) D.L. Naik, H.U. Sajid and R. Kiran: Metals 9 (2019) 546.
14) D.S. Bulgarevich, S. Tsukamoto, T. Kasuya, M. Demura and M. Watanabe: Sci. Rep. 8 (2018) 2078.
15) H. Terasaki, Y. Miyahara, K. Hayashi, K. Moriguchi and S. Morito: Mater. Charact. 129 (2017) 305–312.
16) K. Tsutsui, H. Terasaki, T. Maemura, K. Hayashi, K. Moriguchi and S. Morito: Comput. Mater. Sci. 159 (2019) 403–411.
17) M.T. Ribeiro, S. Singh and C. Guestrin: KDD ’16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (2016) pp. 1135–1144.
18) G. Bradski and A. Kaehler: Dr. Dobb’s Journal of Software Tools 25(11) (2000) 120–123.
19) K. He, X. Zhang, S. Ren and J. Sun: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2016) pp. 770–778.
20) F. Chollet et al.: “Keras: The python deep learning library.” Astrophysics Source Code Library (2018).
21) M. Abadi et al.: arXiv:1603.04467 (2016).
22) D.P. Kingma and J.L. Ba: arXiv:1412.6980 (2014).
23) Keras Deep Learning for humans https://github.com/keras-team/keras/blob/master/examples/cifar10_resnet.py, (accessed 2020-04-20).
24) D.M. Dimiduk, E.A. Holm and S.R. Niezgoda: Integr. Mater. Manuf. Innov. 7 (2018) 157–172.
25) J. Deng, W. Dong, L.-J. Li, K. Li and L. Fei-Fei: 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, (2009) pp. 248–255.
26) K. He, X. Zhang, S. Ren and J. Sun: 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, (2015) pp. 1026–1034.
27) B. Zhang, P. Jaiswal, R. Rai, P. Guerrier and G. Baggs: Rapid Prototyping J. 25 (2019) 530–540.
