Non-negative Matrix Factorization and Its Extensions for Spectral Image Data Analysis

Scanning transmission electron microscopy combined with electron energy-loss spectroscopy and energy-dispersive X-ray spectroscopy is useful for analyzing chemical states and elemental components in a new material. Using these instruments, the spectra over the spatial grid points in a region of interest can be observed. This measurement technique is called spectral imaging (SI). Because of the large size of SI data, the analysis cost is a bottleneck in the evaluation process of the material. To reduce the analysis cost, machine learning techniques can be applied, which can automatically extract essential information from the data. This paper reviews our developed machine learning method, which is based on non-negative matrix factorization and its extensions. A spatial orthogonality constraint and a generalized noise model, which includes Gaussian and Poisson noise models, are introduced. Numerical experiments demonstrate the effectiveness and characteristics of our developed methods.


I. INTRODUCTION
Scanning transmission electron microscopy (STEM) combined with electron energy-loss spectroscopy (EELS) and energy-dispersive X-ray spectroscopy is useful for analyzing the chemical states and elemental components of a material. Using these instruments, the spectra over the spatial grid points in a region of interest (ROI) can be observed. This measurement technique is called spectral imaging (SI). Because of the large size of SI data, the analysis cost is a bottleneck of this evaluation process.
Machine learning and data mining techniques are useful for analyzing massive datasets. In materials science, these techniques have been used to automatically identify novel rules, such as groups of examples or relationships between a set of descriptors and a target variable, for real measurement datasets as well as for theoretically computed datasets [1−5]. The advantages of these techniques are as follows: i) reduction of analysis cost; and ii) utilization of automatic computational procedures to avoid the subjectivity of experts (which might be an obstacle to novel findings).
This paper reviews our machine learning method developed to automatically identify chemical components and their spectra from SI datasets [4]. The goal of SI analysis is to identify the chemical compositions (or chemical states) and their quantities at each measured spatial point solely from the observed SI data. This is an unsupervised learning problem because supervised labels, such as the chemical components, are unavailable. One useful approach to this problem is matrix factorization, such as singular value decomposition (SVD) or principal component analysis (PCA), which assumes that the observed spectrum at each spatial point is generated by linear mixing of the spectra of the chemical components. Non-negative matrix factorization (NMF), the main topic of this review paper, is a variant of matrix factorization in which the matrix elements are constrained to non-negative values. This assumption is reasonable for many datasets, such as natural images [6], because such data consist of non-subtractive signals of several components. The seminal NMF study decomposed face images into face parts such as eyes, noses, and mouths [6]. NMF can naturally decompose a face image into such parts, whereas PCA, a subtractive approach in which both non-negative and negative values are allowed, cannot identify them. Following the seminal study [6], NMF has been widely applied to other data, such as audio signals [7], remotely sensed hyperspectral data [8], and microscopic SI data [3−5]. The non-negativity assumption is also natural for SI data because the intensities of spectral signals and the component densities are non-negative. Thus, NMF is expected to identify physically natural chemical components from observed SI data. Our developed NMF extends this model by introducing a spatial orthogonality constraint and a sparsity constraint to obtain physically meaningful results.
This paper also introduces a generalized noise model, which includes the Gaussian and Poisson noise models as special cases. Numerical experiments are used to demonstrate the effectiveness of our developed methods.
The remainder of this paper is organized as follows. Section II explains the spectral imaging dataset and the problem setting of automatically identifying the spatial intensity distributions and spectra of the chemical components; it also describes the basic machine learning methods and our proposed approach based on non-negative matrix factorization. Section III presents an empirical demonstration of our developed methods using synthetically generated SI datasets. Finally, Section IV summarizes our developed methods and outlines possible future work.

II. NON-NEGATIVE MATRIX FACTORIZATION
A. Spectral image data and matrix factorization model
A spectral image dataset consists of a data cube (or a three-dimensional array) with two-dimensional spatial axes and a spectral channel axis, such as electron energy loss in an EELS spectrum. By stacking each observed spectrum along a row in a matrix, the data cube can be converted into a two-dimensional array called a data matrix.
When the essential chemical components in the ROI are limited to a small number, the data matrix can be described by a mixture of the spectra and spatial intensity distributions of these chemical components. In particular, under a linear mixture assumption, the element X_ij at the i-th spatial position and the j-th spectral channel of the data matrix X is modeled by the linear combination

X_ij ≈ Σ_{k=1}^{K} C_ik S_jk,  (1)

where K is the number of chemical components; N_xy is the number of spatial measurement points; N_ch is the number of spectrum channels; C_ik is the intensity of the k-th chemical component at the i-th spatial position; and S_jk is the intensity of the k-th component at the j-th spectrum channel. The model can be expressed in the matrix representation

X ≈ C Sᵀ,  (2)

where C ∈ ℝ^{N_xy×K} is the factorized matrix of chemical component intensity distributions and S ∈ ℝ^{N_ch×K} is that of chemical component spectra. The approximation model of Eq. (2) is called matrix factorization. When K is much smaller than both N_xy and N_ch, the total number of elements in C and S is much lower than that in the original data matrix X. This model is equivalent to a low-rank approximation in linear algebra. In this problem setting, X is known, and C and S are unknown. C and S can be estimated by minimizing the reconstruction error between X and C Sᵀ. This error is usually measured by the mean squared error under an additive Gaussian noise model. For another noise model, another error measure can be used, as discussed in a later section.
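As a concrete illustration, the linear mixture model of Eq. (2) can be written in a few lines of NumPy. This is a minimal sketch: the matrix sizes and the random factors are arbitrary stand-ins, not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

N_xy, N_ch, K = 100, 50, 3          # spatial points, spectral channels, components

C = rng.random((N_xy, K))           # spatial intensity distributions (non-negative)
S = rng.random((N_ch, K))           # component spectra (non-negative)

X = C @ S.T                         # matrix form of the linear mixture, X ≈ C S^T

# Element-wise form: X_ij = sum_k C_ik * S_jk
i, j = 10, 20
x_ij = sum(C[i, k] * S[j, k] for k in range(K))
assert np.isclose(X[i, j], x_ij)
```

Because K is much smaller than N_xy and N_ch, the factors C and S together hold far fewer numbers (here 450) than the data matrix X (here 5000).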
Several matrix factorization methods exist, with various constraints on the matrices C and S [1,5]. A basic method is SVD, which decomposes a data matrix into two orthonormal matrices, U ∈ ℝ^{N_xy×K} and V ∈ ℝ^{N_ch×K}, and a diagonal matrix of singular values, Σ ∈ ℝ^{K×K}. These matrices consist of the basis vectors corresponding to the K largest singular values, which gives

X ≈ U Σ Vᵀ.  (3)

After the decomposition, C = UΣ is considered the spatial intensity matrix (or redundant matrix), and S = V is considered the spectrum matrix (or basis matrix). The matrices C and S obtained in this way are orthogonal even when the true component spectral matrix and the true intensity distribution matrix are not orthogonal. This can cause an unnatural decomposition when the orthogonality assumption is violated by the true chemical components.
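A truncated SVD of this form can be sketched with NumPy as follows; the data matrix here is random, for illustration only. The final check shows the orthonormality constraint on the basis spectra discussed above.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((100, 50))                        # data matrix (N_xy x N_ch)
K = 3

U, sv, Vt = np.linalg.svd(X, full_matrices=False)
U_K = U[:, :K]                                   # left singular vectors
Sig_K = np.diag(sv[:K])                          # K largest singular values
V_K = Vt[:K].T                                   # right singular vectors

C_svd = U_K @ Sig_K                              # spatial intensity (redundant) matrix
S_svd = V_K                                      # spectrum (basis) matrix
X_hat = C_svd @ S_svd.T                          # rank-K approximation of X

# The basis spectra are orthonormal even if the true components are not:
assert np.allclose(S_svd.T @ S_svd, np.eye(K), atol=1e-8)
```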
Another well-known method is PCA. The main difference between SVD and PCA is that SVD finds basis vectors (basis spectra) from the origin, while PCA finds basis vectors from the mean vector (i.e., the averaged spectrum). Thus, the procedure in PCA is to first compute the averaged spectrum and then subtract it from each observed spectrum. The resulting matrix is passed to the SVD procedure, which yields the factorized matrices C and S. Similar to SVD, the C and S given by PCA are also orthogonal, which can likewise cause an unnatural decomposition when the assumption is violated.
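PCA can thus be sketched as mean subtraction followed by the same SVD procedure (again on a random stand-in matrix, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((100, 50))                 # data matrix (N_xy x N_ch)
K = 3

mean_spec = X.mean(axis=0)                # averaged spectrum over spatial points
Xc = X - mean_spec                        # subtract it from each observed spectrum
U, sv, Vt = np.linalg.svd(Xc, full_matrices=False)

C_pca = U[:, :K] * sv[:K]                 # scores: spatial intensities
S_pca = Vt[:K].T                          # loadings: basis spectra

# Reconstruction adds the mean back; entries of C_pca and S_pca may be negative.
X_hat = mean_spec + C_pca @ S_pca.T
```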
For SI analysis, there is another useful method called vertex component analysis (VCA) [9]. VCA assumes that the data are distributed on a convex cone whose vertices correspond to the spectra of pure chemical components, each containing only a single chemical state. The VCA algorithm first projects the original data into a low-dimensional space by SVD or PCA and then finds the vertices corresponding to pure spectra. Next, VCA estimates the spatial intensity of each component included in an observed spectrum. VCA is powerful when the observed SI contains pure spectra of all the components to be extracted. However, its performance may degrade drastically when the observed SI does not include pure spectra. To mitigate this problem, signal subspace sampling (SSS), in which pure spectra are virtually generated and estimated by sampling procedures, was proposed [10]. However, the problem is not fully solved when the observed spectra overlap heavily in an SI dataset. This is a limitation of inductive approaches in statistical inference and machine learning, rather than of the SSS procedure itself. The solution requires an additional assumption (or an effective constraint) in the estimation model.

B. Non-negative matrix factorization
A natural assumption for SI data is non-negativity, for both the spectrum and the spatial intensity of each chemical component at an observed spatial position. NMF is a matrix factorization model with a non-negativity assumption for the data matrix X ∈ ℝ₊^{N_xy×N_ch} and the factorized matrices C ∈ ℝ₊^{N_xy×K} and S ∈ ℝ₊^{N_ch×K}. The optimization problem is formulated as

min_{C≥0, S≥0} D(X, C Sᵀ),  (4)

where D(X, C Sᵀ) is the reconstruction error, such as the squared error

D(X, C Sᵀ) = Σ_{i,j} (X_ij − [C Sᵀ]_ij)².  (5)

This additional assumption is simple but powerful for obtaining a physically meaningful decomposition. However, it makes the optimization much more difficult. The joint optimization of C and S is a non-convex problem, and convergence to the global optimum cannot be guaranteed for any single initialization. Thus, NMF should be run many times from different initial matrices C and S, and the best result, i.e., the one minimizing the reconstruction error, is chosen.
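The NMF problem above can be sketched with the classic multiplicative update rules; this is illustrative code under a squared-error measure, not the authors' implementation. As noted above, several random restarts would be run in practice and the best result kept.

```python
import numpy as np

def nmf_mu(X, K, n_iter=300, eps=1e-12, seed=0):
    """Minimize ||X - C S^T||^2 subject to C, S >= 0 via multiplicative updates."""
    rng = np.random.default_rng(seed)
    C = rng.random((X.shape[0], K)) + eps
    S = rng.random((X.shape[1], K)) + eps
    for _ in range(n_iter):
        # Each factor is updated with the other fixed; updates preserve non-negativity.
        C *= (X @ S) / (C @ (S.T @ S) + eps)
        S *= (X.T @ C) / (S @ (C.T @ C) + eps)
    return C, S

# Exact non-negative rank-2 data: NMF should reconstruct it closely
rng = np.random.default_rng(1)
X = rng.random((60, 2)) @ rng.random((40, 2)).T
C, S = nmf_mu(X, K=2)
rel_err = np.linalg.norm(X - C @ S.T) / np.linalg.norm(X)
```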
To avoid incorrect local minima, in which the reconstruction error is much larger than the global minimum, and to converge to an optimum quickly, a number of optimization algorithms have been proposed. To the best of our knowledge, none of these algorithms updates the optimizing variables in D(X, C Sᵀ) simultaneously; instead, a subset of the variables is updated while the others are fixed. One of the seminal optimization algorithms is the multiplicative update (MU) algorithm [11]. It can be implemented using only matrix products, resulting in a small computational cost per update. However, its convergence is slow when the optimal matrices contain many zeros. Another efficient algorithm is the alternating least squares (ALS) algorithm [12]. This algorithm updates one matrix, with the other fixed, by solving a linear equation that minimizes the error without the non-negativity constraint; the constraint is then enforced after solving the equation. Empirically, the ALS algorithm converges faster than the MU algorithm, but it incurs a high computational cost per update to solve the linear equation. An extension of the ALS algorithm is the hierarchical ALS (HALS) algorithm [13], in which a single column vector of C or S, rather than a whole matrix, is updated. The optimal vector for each update is obtained analytically without solving a linear equation, resulting in a low computational cost. Empirically, the HALS algorithm converges to a local optimum faster than the ALS algorithm.
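The HALS idea, where each column of C or S is updated in closed form while everything else is fixed, can be sketched as follows (illustrative code, not the authors' package):

```python
import numpy as np

def hals_nmf(X, K, n_iter=200, eps=1e-12, seed=0):
    """HALS NMF: closed-form non-negative update of one column at a time."""
    rng = np.random.default_rng(seed)
    C = rng.random((X.shape[0], K))
    S = rng.random((X.shape[1], K))
    for _ in range(n_iter):
        XS, StS = X @ S, S.T @ S
        for k in range(K):
            # Optimal non-negative k-th column of C, all else fixed
            C[:, k] = np.maximum(
                C[:, k] + (XS[:, k] - C @ StS[:, k]) / (StS[k, k] + eps), 0.0)
        XtC, CtC = X.T @ C, C.T @ C
        for k in range(K):
            # Symmetric update for the k-th column of S
            S[:, k] = np.maximum(
                S[:, k] + (XtC[:, k] - S @ CtC[:, k]) / (CtC[k, k] + eps), 0.0)
    return C, S

rng = np.random.default_rng(1)
X = rng.random((60, 2)) @ rng.random((40, 2)).T   # exact non-negative rank-2 data
C, S = hals_nmf(X, K=2)
```

Each column update is the exact minimizer of the squared error over that column (clipped to non-negative values), which is why no linear system needs to be solved.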

C. Soft spatial orthogonality constraint
The abovementioned optimization algorithms impose no constraint on the factorized matrices other than non-negativity. Although novel optimization techniques can improve the convergence speed, they still suffer from incorrect local optima, which cause physically unnatural results. To mitigate this problem, an additional constraint suited to the analyzed datasets is necessary. Our paper [4] proposed a constraint under which the spatial positions of different chemical components cannot perfectly overlap. When a pair of components do not spatially overlap and their intensities are non-negative, the inner product between the corresponding column vectors of C is zero, that is, C_•kᵀ C_•k' = 0 for k ≠ k'. Thus, the spatial overlap can be measured by the inner product of different column vectors of matrix C, and the total overlap between all pairs of different components by the sum of the off-diagonal elements of CᵀC [14]. This condition of perfect orthogonality is too strict for real observation situations, where the components may slightly overlap. To relax the strict orthogonality constraint, our paper [4] introduced a weight parameter for the orthogonality penalty, called soft orthogonality. Using the weighted penalty, we proposed a new NMF called soft orthogonalized NMF (NMF-SO). NMF-SO can resolve unnatural spatial overlaps by adjusting the orthogonality penalty.
The NMF-SO optimization algorithm was derived based on the HALS approach. To update the k-th column vector C_•k, the cost function with the orthogonality penalty can be formulated as

J(C_•k) = ‖X^(k) − C_•k S_•kᵀ‖² + w λ_k C_•kᵀ u(C),  (6)

where X^(k) = X − Σ_{k'≠k} C_•k' S_•k'ᵀ is the residual excluding the k-th component; w is a weight used to adjust the orthogonality penalty; λ_k is the Lagrange multiplier for the perfect orthogonality constraint; and u(C) = Σ_{k'≠k} C_•k'. The second term is the orthogonality penalty, which evaluates the overlap between the k-th column vector and all the other columns; it is zero when the k-th component does not spatially overlap any other component. The cost function is quadratic in C_•k. The update rule minimizing Eq. (6) can be derived from the normal equation, in which the gradient with respect to C_•k is zero, as follows:

C_•k ← [ (X^(k) S_•k − w λ_k u(C)) / (S_•kᵀ S_•k) ]₊,  (7)

where the function [•]₊ replaces negative values with zeros. This update rule is applied iteratively for each column k = 1, 2, …, K. The update rule of S_•k is the same as in the original HALS algorithm because no additional constraint is imposed on matrix S.
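One sweep of this penalized column update can be sketched as follows. Here the penalty strength is folded into a single fixed weight `w`; the multiplier schedule of the actual NMF-SO algorithm [4] may differ, so this is an illustrative approximation, not the published update.

```python
import numpy as np

def update_C_soft_orth(X, C, S, w=0.1, eps=1e-12):
    """One HALS sweep over the columns of C with a soft orthogonality penalty.

    For each k, minimizes ||X - C S^T||^2 plus w times the overlap between
    column k of C and the sum of all other columns, so spatially overlapping
    components are penalized."""
    XS, StS = X @ S, S.T @ S
    for k in range(C.shape[1]):
        u = C.sum(axis=1) - C[:, k]                 # sum of the other columns
        grad = XS[:, k] - C @ StS[:, k] - w * u     # penalized gradient direction
        C[:, k] = np.maximum(C[:, k] + grad / (StS[k, k] + eps), 0.0)
    return C

rng = np.random.default_rng(0)
X = rng.random((30, 20))
C, S = rng.random((30, 2)), rng.random((20, 2))
C = update_C_soft_orth(X, C, S, w=0.1)
```

Setting `w=0` recovers the plain HALS column update; a larger `w` pushes the spatial maps of different components apart.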

D. Generalized noise model
The reconstruction error measured by Eq. (5) assumes additive Gaussian noise. When the number of signal counts (or the S/N ratio) is limited, a different noise model, such as a Poisson noise model, should be assumed. A generalized noise model is given by the Tweedie distribution, a family of probability distributions that includes the Gaussian (normal), gamma, and Poisson distributions [15]. The error measure corresponding to the Tweedie distribution is the β-divergence:

d_β(x, y) = x^{β+1} / (β(β+1)) − x y^β / β + y^{β+1} / (β+1),  (8)

where x is an observed value, y is a true value to be estimated, and the cases β = 0 and β = −1 are defined by the limits below. When β = 1, the divergence is equivalent to the (half) squared Euclidean distance:

d_1(x, y) = (x − y)² / 2.  (9)

This distance is the error measure under Gaussian noise. When β = 0, the divergence is the Kullback-Leibler (KL) divergence:

d_0(x, y) = x log(x/y) − x + y,  (10)

which is derived under the noise assumption of a Poisson distribution. When β = −1, the divergence is the Itakura-Saito divergence:

d_{−1}(x, y) = x/y − log(x/y) − 1,  (11)

which is equivalent to assuming a gamma noise model. Figure 1 shows the β-divergence with β = −1, 0, 1. The shape is left-right asymmetric about y = x when β < 1: the divergence grows more rapidly for y > x than for y < x. Thus, an approximation at a small observed value x should be more precise than that at a large value. This characteristic agrees with the Poisson noise assumption, whose variance is larger for a large mean than for a small mean.
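A small helper that evaluates the β-divergence in this parameterization, with the KL and Itakura-Saito limits handled explicitly (illustrative code):

```python
import numpy as np

def beta_div(x, y, beta):
    """Beta-divergence in the paper's parameterization:
    beta=1 -> half squared Euclidean, beta=0 -> KL, beta=-1 -> Itakura-Saito."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    if beta == 0:                                   # KL divergence (Poisson noise)
        return np.sum(x * np.log(x / y) - x + y)
    if beta == -1:                                  # Itakura-Saito (gamma noise)
        return np.sum(x / y - np.log(x / y) - 1)
    return np.sum(x**(beta + 1) / (beta * (beta + 1))
                  - x * y**beta / beta + y**(beta + 1) / (beta + 1))

# beta = 1 recovers half the squared Euclidean distance
x, y = np.array([1.0, 2.0]), np.array([1.5, 1.0])
assert np.isclose(beta_div(x, y, 1), 0.5 * np.sum((x - y)**2))
```

The ½ factor at β = 1 does not affect the minimizer, so the same C and S are obtained as with Eq. (5).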
Using the β-divergence, the reconstruction error for NMF is given by

D_β(X, C Sᵀ) = Σ_{i,j} d_β(X_ij, [C Sᵀ]_ij).  (12)

Although the formulation of the β-divergence, which has nonlinear terms, is more complicated than the Euclidean distance, a HALS algorithm can be derived via a Taylor expansion of the divergence [16]. Writing Y = C Sᵀ, the gradient of Eq. (12) with respect to C_•k is

∇_{C_•k} D_β = (Y^{⊙β} − X ⊙ Y^{⊙(β−1)}) S_•k,  (13)

where the ⊙ operator is the element-wise product between two matrices, known as the Hadamard product, and the powers are also element-wise. The second-order Taylor expansion of Eq. (12) around the current C_•k is quadratic, so the non-negative minimizer is again obtained in closed form, which yields the HALS-based update rule [14]:

C_•k ← [ C_•k − ∇_{C_•k} D_β ⊘ ((β Y^{⊙(β−1)} − (β−1) X ⊙ Y^{⊙(β−2)}) S_•k^{⊙2}) ]₊,  (14)

where ⊘ denotes element-wise division. For the soft orthogonality penalty, the update rule is derived analogously by adding the penalty gradient w λ_k u(C) to the numerator:

C_•k ← [ C_•k − (∇_{C_•k} D_β + w λ_k u(C)) ⊘ ((β Y^{⊙(β−1)} − (β−1) X ⊙ Y^{⊙(β−2)}) S_•k^{⊙2}) ]₊.  (15)

(Figure 1 caption: the plotted lines correspond to the Euclidean distance, the KL divergence, and the Itakura-Saito divergence, respectively.) An experimental demonstration is provided in Section III.

E. Automatic relevance determination
Another extension of NMF chooses the number of components from the observed data alone by introducing sparsity-inducing penalty terms into the NMF cost function [4,17]. This approach is important when the number of components is unknown and the statistically significant number must be evaluated from observed data without expert knowledge. It is called automatic relevance determination (ARD). The ARD term is designed based on a Bayesian approach, and the optimization of the factorized matrices C and S is performed by maximum a posteriori (MAP) estimation. The ARD approach can be combined with our soft spatial orthogonality penalty. Furthermore, the HALS-based algorithm with β-divergence can be derived by replacing the Euclidean distance term of [4] with its Taylor expansion [14,16]. Because of space limitations, the details are omitted.

III. NUMERICAL EXPERIMENTS

A. Synthetic data
This numerical experiment assumes a limited number of signal counts, which corresponds to a Poisson noise model. The probability of observing n events in an interval is given by

P(n) = λⁿ e^{−λ} / n!,

where λ is the average number of events per interval. Both the expectation and the variance under this distribution are equal to λ. For a sufficiently large λ, the distribution can be approximated by a Gaussian distribution. Thus, the Poisson distribution needs to be assumed only when the number of counts is limited.
Synthetic datasets were generated using i) a theoretical computation of an atomic-resolution STEM-EELS SI of the Mn-L2,3 edge of Mn3O4, which includes two chemical components, trivalent manganese (Mn3+) and divalent manganese (Mn2+), i.e., K = 2; and ii) an ideal decomposition result by NMF with K = 3 for real STEM-EELS SI data measured in a Si diode. The details of the true chemical components are described in Ref. 18. The true chemical component maps and spectra are shown in Figures 2 and 3. Using the distribution matrix C and spectrum matrix S, each element of an SI data matrix was generated by

X_ij ∼ Poisson(μ [C Sᵀ]_ij),

where μ is a parameter of the synthetic dataset generation that adjusts the average count in an SI dataset. Synthetic datasets were generated with μ = 10, 100 for the Mn3O4 data and the Si device data.
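The generation procedure can be sketched as follows. Here `C_true` and `S_true` are random stand-ins for the true maps and spectra of Figures 2 and 3 (which are not reproduced here), and the scaling convention for the average-count parameter is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the true maps C and spectra S
K, N_xy, N_ch = 2, 64, 128
C_true = rng.random((N_xy, K))
S_true = rng.random((N_ch, K))

mu = 10                                  # average-count parameter (10 or 100 in the paper)
lam = C_true @ S_true.T                  # noiseless linear-mixture signal
lam *= mu / lam.mean()                   # scale so the mean count equals mu
X = rng.poisson(lam).astype(float)       # Poisson-noised synthetic SI data matrix

# Poisson noise: the element-wise variance equals the mean intensity,
# so a smaller mu gives a relatively noisier dataset.
```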

B. Experimental results
Using our developed Python package [19], β-NMF (i.e., β-NMF-SO with w = 0) and β-NMF-SO with error parameters β = 1, 0 were applied to the STEM-EELS SI datasets. The soft orthogonality weight of β-NMF-SO was adjusted for each dataset: w = 0.1 for the atomic-resolution Mn3O4 datasets and w = 0.2 for the Si device datasets. For datasets with essentially overlapping components, the smaller the weight, the better the decomposition result.
The decomposition results for the Mn3O4 datasets are shown in Figures 4 and 5. Figure 4(a, b), which illustrates results for the low-count case, shows that the results by β-NMF with β = 0 (KL divergence) are more similar to the true components, which demonstrates the effectiveness of a suitable noise assumption. Figure 5 shows the decomposition results by β-NMF with the orthogonality penalty. In all four settings of β-NMF-SO, the results in Figure 5 are more similar to the true components in Figure 2 than those in Figure 4. These results demonstrate the effectiveness of the soft orthogonality penalty; the effect of the noise model assumption is minor by comparison. Figures 6 and 7 show results for the Si device datasets. For these datasets, the Poisson noise model does not improve the decomposition performance, although the soft orthogonality penalty improves it drastically. The optimum minimizing the reconstruction error for this dataset may have a large arbitrary solution space due to the non-sparse spectra. The orthogonality assumption could work by effectively removing this arbitrary space, which corresponds to a set of physically unnatural decompositions, whereas the noise model could not remove it.

IV. CONCLUSIONS
This paper reviewed automatic chemical component identification from spectral image data based on NMF. Extensions of the basic NMF, consisting of i) a generalized noise model and ii) a soft spatial orthogonality penalty, were introduced and demonstrated using synthetic STEM-EELS SI datasets. NMF with the KL divergence, which corresponds to a Poisson noise model, performed better than the Gaussian noise model when the number of observed signal counts was limited. However, the improvement was minor for these datasets. The main reason is that the S/N ratio of the data itself does not improve even when a suitable noise model is used. On the other hand, the additional assumption of the spatial orthogonality constraint drastically improved the decomposition performance by effectively restricting the solution space that minimizes the reconstruction error of NMF. This analysis conveys the important message that the quantity of information hidden in observed data (or the S/N ratio) does not increase by applying a machine learning method; rather, the method can efficiently extract the available information given a suitable assumption about the data.
A future study will address the treatment of background signals. Our previous study [4] noted that background signals are a source of failure for our method; thus, such signals currently need to be eliminated before our method is applied. Explicit modeling of background signals could lead to improved identification of chemical components via machine learning.