ISIJ International
Regular Article
Learning a Class-specific and Shared Dictionary for Classifying Surface Defects of Steel Sheet
Shiyang Zhou, Youping Chen, Dailin Zhang, Jingming Xie, Yunfei Zhou

2017 Volume 57 Issue 1 Pages 123-130

Abstract

An approach to class-specific and shared dictionary learning (CDSDL) for sparse representation is proposed to classify surface defects of steel sheet. The proposed CDSDL algorithm is modelled as a unified objective function covering a reconstructive error term, a sparsity constraint and discrimination-promoting constraints. With the resulting high-quality dictionary, a compact, reconstructive and discriminative feature representation of an image can be extracted, and classification can then be performed efficiently using the discriminative information contained in the reconstruction error or the sparse coding vector. The CDSDL algorithm is evaluated on a dataset of surface images captured from a practical steel production line. Experimental results indicate that the CDSDL algorithm classifies surface defects of steel sheet more effectively than competing algorithms.

1. Introduction

Machine vision-based classification of surface defects of steel sheet has attracted much attention because of its potential value and the challenges it poses in practical industrial applications.1,2,3,4,5,6,7,8) As a classical pattern recognition problem, its core consists of feature extraction (learning) and classifier design, and most existing work over the past few years has focused on these two aspects to improve classification performance.9,10,11,12,13,14,15,16,17,18,19,20) A variety of features have been studied, including grayscale features,9,10) geometric features,9,10) shape features,9,10,11) texture features,10,12) Gabor filter outputs,13) Fourier spectra,14) wavelet transforms,14,15,16,17) local binary patterns (LBP)18,19) and shearlet transforms;20) classifiers such as artificial neural networks (ANN)12,13,15) and support vector machines (SVM)9,10,14,17,18,19,20) have been discussed. However, these approaches treat feature extraction and classifier training as two separate steps, which makes it difficult to control the interaction between them. In fact, classification methods based on sparse representation and dictionary learning, which have no explicit feature extraction stage, may be more effective for classifying surface defects of steel sheet. Sparse representation and dictionary learning, which can be formulated as follows, has also been successfully applied in many machine vision applications,21) such as face recognition.22)

\min_{x} \|y - D x\|_2^2 + \lambda \|x\|_1 \quad (1)
where y ∈ ℝ^d is a given feature vector, D ∈ ℝ^{d×K} is a dictionary, ‖x‖_1 is the l1-norm of the coding vector x, the l2-norm of each atom d_i is constrained to be at most one, and λ is a positive parameter that balances the tradeoff between reconstruction error and sparsity. The optimization over x and D is carried out by an iterative method composed of two steps: (a) fixing D to update x; (b) fixing x to update D. Equation (1) can be solved efficiently by many kinds of algorithms,36,37,38,39,40) such as orthogonal matching pursuit (OMP).41)
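As a concrete illustration of the sparse coding step in Eq. (1), the following Python sketch implements a greedy pursuit in the spirit of OMP; the random dictionary, the target sparsity and all variable names are illustrative assumptions rather than part of the original formulation.

import numpy as np

def omp(D, y, n_nonzero=5):
    """Greedy orthogonal matching pursuit: approximate y by a sparse
    combination of at most n_nonzero atoms (columns) of D."""
    K = D.shape[1]
    x = np.zeros(K)
    residual = y.copy()
    support = []
    for _ in range(n_nonzero):
        # pick the atom most correlated with the current residual
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # least-squares fit of y on the selected atoms
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x[support] = coef
    return x

# toy usage: a random unit-norm dictionary and a test vector
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))
D /= np.linalg.norm(D, axis=0)          # unit l2-norm atoms
y = rng.standard_normal(64)
x = omp(D, y, n_nonzero=5)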

The choice of the dictionary D plays a key role in obtaining a sparse, high-fidelity and discriminative feature representation of an image. Many state-of-the-art results have shown that a good dictionary can be learned from the training dataset itself.23,24) For example, given a set of feature vectors {y_i}, i = 1, 2, ..., N, in ℝ^d, the goal of dictionary learning is to learn a dictionary D = {d_i}, i = 1, 2, ..., K, in ℝ^{d×K} such that each y_i can be compactly represented as a sparse linear combination of a few atoms from D, while keeping the reconstruction error ‖y_i - D x_i‖_2^2 as small as possible. Existing dictionary learning approaches can be divided into two categories: unsupervised and supervised dictionary learning. For unsupervised methods, such as the method of optimal directions (MOD)25) and K-SVD,26) the learned dictionary may not be best for classification because the class information of the training dataset is not utilized. In recent years, many supervised or discriminative dictionary learning methods have been proposed, such as sparse representation based classification (SRC),22) with which a reconstructive and discriminative dictionary better adapted to classification can be learned. The state-of-the-art classification performance reported in these studies27,28,29,30,31,32,33,34,35) suggests that sparse representation and dictionary learning is a promising approach for classifying surface defects of steel sheet.

As shown in Fig. 1, surface defects of steel sheet can be regarded as local anomalies against a relatively homogeneous background. The defect area (discriminative pattern) occupies only a small portion of an image, while the background texture (shared pattern), which is useful for reconstruction rather than discrimination, is nearly the same across images. Because the class-specific information of defects accounts for only a small proportion of the total information in an image, the feature representations of different defect images are not discriminative. If a traditional discriminative dictionary learning method is adopted, most atoms of the learned dictionary are used to represent the background information, which contributes nothing to distinguishing different defects, and only a small part of the atoms represent the class-specific information of defects. As a result, the discriminative ability of the coding vectors of different defect images is reduced, and poor classification performance may be obtained.

Fig. 1.

Examples of practical surface defects of steel sheet: (a) Folding; (b) Line; (c) Patch; (d) Scratch; (e) Non-defective.

Based on the above analysis, a class-specific and shared dictionary learning (CDSDL) algorithm is proposed, in which several class-specific sub-dictionaries and a single shared sub-dictionary are learned simultaneously. Each class-specific sub-dictionary encodes the pattern exclusive to the corresponding class, and the shared sub-dictionary encodes the explicit and hidden common pattern shared by all classes. With these sub-dictionaries, the exclusive pattern and the shared pattern of an image can be explicitly separated. Because the atoms of these sub-dictionaries capture the inherent structure of a defect image, an effective feature representation of the image for classification can be obtained.

The remainder of this paper is organized as follows. Section 2 briefly reviews related work on the classification of surface defects of steel sheet and on classification based on sparse representation and dictionary learning. Section 3 introduces the proposed CDSDL algorithm, including its formulation and optimization. Section 4 presents experimental results. Finally, conclusions are given.

2. Related Work

Over the past decades, many studies on the classification of surface defects of steel sheet have been reported. Caleb et al. presented a classification method for surface defects of hot rolled steel based on multi-layer perceptron (MLP) and self-organising map (SOM) classifiers.12) Medina et al. introduced an approach to classify six different kinds of defects of flat steel coils by combining an ANN, a k-nearest neighbor (KNN) classifier and a naive Bayesian classifier.13) Ghorai et al. developed an approach for automatic defect detection of hot rolled flat steel based on a three-level Haar feature set.17) Hu et al. established an SVM model to classify five types of surface defects of strip steel.9) Furthermore, they optimized the image feature vector and the SVM kernel function with a hybrid chromosome genetic algorithm and achieved better classification performance.10) Moreover, Chu et al. constructed an invariant feature extraction approach based on smoothed LBP.19) However, these methods largely depend on an appropriate set of hand-crafted features, which require a special design for each specific task.

Recently, classification methods based on sparse representation and dictionary learning, which fall mainly into two categories, have also received increasing attention. The first category aims to improve the discriminative capability of the dictionary.27,28) Ramirez et al. introduced a structured incoherence term to promote the independence of the sub-dictionaries associated with different classes.27) Gao et al. learned both category-specific sub-dictionaries and a shared dictionary by imposing a cross-incoherence constraint between different sub-dictionaries and a self-incoherence constraint within each sub-dictionary.28) For these methods, the reconstruction errors associated with different classes can be used for classification, but the coding vector itself is not discriminative and hence not suitable for classification. The second category aims to improve the discriminative capability of the coding vector based on a discriminative dictionary.29,30,31,32) Huang et al. presented a method that quantifies the discriminative ability of the coding vector by Fisher's discrimination criterion.29) Zhang et al. introduced the classification error term of a linear classifier.30) Furthermore, Jiang et al. exploited a label consistency regularization term.31) These methods learn a single dictionary shared by all classes and treat the coding vector as a new feature representation of the original image. Although the coding vector is discriminative, classification based on class-specific reconstruction errors is not possible. The main difference between the two categories is whether the discrimination is imposed on the dictionary or on the coding vector. If dictionary discrimination and coding vector discrimination were combined, better classification performance might be obtained.

3. Class-specific and Shared Dictionary Learning (CDSDL)

Given a training dataset of c classes, the CDSDL algorithm constructs one discriminative class-specific sub-dictionary for each of the c classes and a single shared sub-dictionary for all classes. Structural incoherence terms are added to the objective function to promote incoherence of the sub-dictionaries and discriminability of the coding vectors. The coding vector over the learned dictionary, which can be directly treated as a new feature representation of an image, contains the crucial information for reconstruction and the discriminative information for classification. Coding vectors of defect images of the same class are similar, while those of different classes are dissimilar. Therefore, both the reconstruction errors associated with different classes and the coding vector can be used for classification. In addition, the discriminative part of the feature representation of a defect image can be amplified and the shared part suppressed by controlling the number of atoms of each sub-dictionary. The CDSDL algorithm is illustrated in Fig. 2.

Fig. 2.

An illustration of the CDSDL algorithm. Each image can be sparsely represented by a few atoms from corresponding class-specific sub-dictionary Di (purple) and shared sub-dictionary Dc+1 (red).

3.1. Formulation of CDSDL

Let Y = [Y_1, ..., Y_i, ..., Y_c] ∈ ℝ^{d×N} denote a training dataset of c classes, where Y_i ∈ ℝ^{d×n_i} is a matrix formed by stacking n_i samples of length d, Σ_{i=1}^{c} n_i = N, d is the dimension of a sample, n_i is the number of samples of the i-th class, and N is the total number of training samples. Let D = [D_1, ..., D_i, ..., D_c, D_{c+1}] ∈ ℝ^{d×K} denote the learned dictionary of K atoms, where D_i ∈ ℝ^{d×k_i}, i = 1, 2, ..., c, is the class-specific sub-dictionary trained from the corresponding training subset Y_i, D_{c+1} ∈ ℝ^{d×k_{c+1}} is the shared sub-dictionary trained from the whole training dataset Y, K = Σ_{i=1}^{c+1} k_i, and k_i is the number of atoms of the i-th sub-dictionary. The objective function of the CDSDL algorithm is as follows.

J = \min_{D,X} F(D,X;Y) + \lambda_1 \|X\|_1 + \lambda_2 G(X) + H(D) \quad \text{s.t.} \; \|D_i(:,j)\|_2 = 1, \ \forall i, j \quad (2)

where

F(D,X;Y) = \sum_{i=1}^{c} \left( \|Y_i - D X_i\|_F^2 + \|Y_i - (D \tilde{Q}_i)((\tilde{Q}_i)^T X_i)\|_F^2 + \lambda_1 \|X_i\|_1 \right) \quad (3)

G(X) = \sum_{i=1}^{c} \|X_i - M_i\|_F^2 - \sum_{i=1}^{c} \|M_i - M\|_F^2 + \sum_{i=1}^{c} \|X_i\|_F^2 \quad (4)

H(D) = \sum_{i=1}^{c+1} \left( \frac{w_1 n_i}{k_i^2} \|D_i^T D_i - I_{k_i}\|_F^2 + \frac{w_2 n_i}{k_i (K - k_i)} \|D_i^T (D Q_{-i})\|_F^2 \right) \quad (5)

F(D, X; Y) is the reconstructive error term, G(X) is a Fisher discrimination term imposed on the coding vector matrix X, H(D) is a structural incoherence term imposed on the dictionary D, and each atom d_j of D is constrained to unit l2-norm to avoid a trivial solution for X.

‖·‖_F denotes the Frobenius norm, ‖X‖_1 is a sparsity regularizer, I_{k_i} is an identity matrix, λ_1, λ_2, w_1 and w_2 are scalar weight parameters, and n_{c+1} = N.

X_i = [X_i^1; ...; X_i^i; ...; X_i^c; X_i^{c+1}] ∈ ℝ^{K×n_i} is the coding matrix corresponding to the training subset Y_i, each column x_i(j) is the coding vector of the j-th training sample of the i-th class, and X_i^i is the coding matrix of Y_i corresponding to the sub-dictionary D_i.

Q_i ∈ ℝ^{K×k_i} is a selection operator; each column has exactly one nonzero element, equal to 1, whose row index equals the column index of the corresponding atom d_j in D. Q̃_i = [Q_i, Q_{c+1}] ∈ ℝ^{K×(k_i + k_{c+1})} and Q_{-i} = [Q_1, ..., Q_{i-1}, Q_{i+1}, ..., Q_c, Q_{c+1}] ∈ ℝ^{K×(K - k_i)}.
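To make the selection operators concrete, the following sketch builds each Q_i (and the stacked complement used for Q_{-i}) from the sub-dictionary sizes and evaluates the incoherence term H(D) of Eq. (5); the sizes, sample counts and weights used here are illustrative assumptions.

import numpy as np

def selection_operators(k_list):
    """k_list = [k_1, ..., k_c, k_{c+1}]: number of atoms per sub-dictionary.
    Returns Q_i (K x k_i) selecting the columns of D that belong to D_i."""
    K = sum(k_list)
    offsets = np.cumsum([0] + list(k_list))
    Q = []
    for i, k in enumerate(k_list):
        Qi = np.zeros((K, k))
        Qi[offsets[i]:offsets[i] + k, :] = np.eye(k)
        Q.append(Qi)
    return Q

def H(D, Q, k_list, n_list, w1=0.01, w2=0.1):
    """Structural incoherence term of Eq. (5): self-incoherence of each
    sub-dictionary plus cross-incoherence against all remaining atoms."""
    K = D.shape[1]
    val = 0.0
    for i, (k_i, n_i) in enumerate(zip(k_list, n_list)):
        Di = D @ Q[i]                                   # columns of D_i
        D_rest = D @ np.hstack([Q[j] for j in range(len(k_list)) if j != i])
        val += w1 * n_i / k_i**2 * np.linalg.norm(Di.T @ Di - np.eye(k_i))**2
        val += w2 * n_i / (k_i * (K - k_i)) * np.linalg.norm(Di.T @ D_rest)**2
    return val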

3.2. Analysis of CDSDL

The term ‖Y_i - D X_i‖_F^2 indicates that the dictionary D should represent Y_i well.

Because D_i is a class-specific sub-dictionary and D_{c+1} is a shared sub-dictionary, the term ‖Y_i - (D Q̃_i)((Q̃_i)^T X_i)‖_F^2 indicates that Y_i should be well represented by [D_i, D_{c+1}] but not by [D_j, D_{c+1}], j ≠ i, which forces samples from the same class to have similar feature representations.

The term G(X) operates directly on the coding vectors and promotes their discrimination between different classes. Based on Fisher's linear discriminant,43) which maximizes the ratio of the between-class scatter to the within-class scatter, we jointly minimize the trace of the within-class scatter matrix S_W(X) and maximize the trace of the between-class scatter matrix S_B(X). S_W(X) and S_B(X) are defined as follows.

S_W(X) = \sum_{i=1}^{c} \sum_{j=1}^{n_i} (x_i(j) - m_i)(x_i(j) - m_i)^T \quad (6)

S_B(X) = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^T \quad (7)

where x_i(j) denotes the coding vector of the j-th training sample of the i-th class, and m_i ∈ ℝ^K and m ∈ ℝ^K are the mean vectors of X_i and X, respectively: m_i = (1/n_i) \sum_{j=1}^{n_i} x_i(j), m = (1/N) \sum_{i=1}^{c} \sum_{j=1}^{n_i} x_i(j).

Therefore   

tr(S_W) = \sum_{i=1}^{c} \|X_i - M_i\|_F^2 \quad (8)

tr(S_B) = \sum_{i=1}^{c} n_i \|m_i - m\|_2^2 = \sum_{i=1}^{c} \|M_i - M\|_F^2 \quad (9)

where M_i ∈ ℝ^{K×n_i} is the matrix whose columns all equal m_i and, for the corresponding M_i, M ∈ ℝ^{K×n_i} is the matrix whose columns all equal m.
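The traces in Eqs. (8) and (9) can be checked numerically with a few lines of NumPy; the per-class coding matrices below are placeholders for illustration.

import numpy as np

def fisher_traces(X_list):
    """X_list[i] is the K x n_i coding matrix of class i.
    Returns (tr(S_W), tr(S_B)) as in Eqs. (8) and (9)."""
    X = np.hstack(X_list)
    m = X.mean(axis=1, keepdims=True)            # global mean vector
    tr_SW, tr_SB = 0.0, 0.0
    for Xi in X_list:
        mi = Xi.mean(axis=1, keepdims=True)      # class mean vector
        tr_SW += np.linalg.norm(Xi - mi)**2      # ||X_i - M_i||_F^2
        tr_SB += Xi.shape[1] * np.linalg.norm(mi - m)**2   # n_i ||m_i - m||_2^2
    return tr_SW, tr_SB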

The combination of ‖X‖_1 and ‖X‖_F^2 makes the solution X more stable.38)

The term ‖D_i^T D_i - I_{k_i}‖_F^2 enforces self-incoherence of each sub-dictionary, which enhances the incoherence within each sub-dictionary and has a direct impact on the computation speed.27,44)

The term ‖D_i^T D_{-i}‖_F^2 is a cross-incoherence constraint between different sub-dictionaries, which promotes the atoms of different sub-dictionaries to be as independent as possible.27,44)

The weights w_1 n_i / k_i^2 and w_2 n_i / (k_i (K - k_i)) alleviate the effect of the imbalance between the number of samples and the number of atoms.28)

Incoherence constraints on sub-dictionaries were proposed in refs. 27,28,35), a Fisher discrimination constraint on the coding vector was presented in refs. 33,34), and a class-specific and shared dictionary was adopted in refs. 28,34,35). However, there are two main differences between the proposed CDSDL algorithm and these algorithms. First, Ramirez27) and Yang33) learn only class-specific sub-dictionaries, which cannot separate the class-specific features from the shared features. Second, the constraints of these algorithms are imposed on either the sub-dictionaries27,28,35) or the coding vectors,33,34) while the proposed CDSDL algorithm explicitly imposes constraints on both the sub-dictionaries and the coding vectors.

3.3. Optimization of CDSDL

The CDSDL optimization problem is not jointly convex, but it is convex with respect to each variable when the others are fixed. The problem can be divided into three sub-problems: (1) solving X with D fixed; (2) solving D_i with X and D_{-i} fixed, where D_{-i} is the sub-matrix obtained by removing D_i from D; (3) solving D_{c+1} with X and D_{-(c+1)} fixed. An iterative algorithm that alternately optimizes the objective function J with respect to the coding matrix X, the class-specific sub-dictionaries D_i and the shared sub-dictionary D_{c+1} is introduced below and summarized in Table 1 (a Python skeleton of the overall loop is sketched after Table 1).

Table 1. Algorithm of class-specific and shared dictionary learning.
Algorithm 1: Class-specific and Shared Dictionary Learning
Input: Training dataset Y = {Y_i}, i = 1, 2, ..., c; sizes of the sub-dictionaries k_i, i = 1, 2, ..., c, c+1; parameters λ_1, λ_2, w_1, w_2 and T.
Output: The learned dictionary D = {D_i}, i = 1, 2, ..., c, c+1, where D_i (i ≤ c) is a class-specific sub-dictionary and D_{c+1} is the shared sub-dictionary.
Initialize: Each class-specific sub-dictionary D_i is initialized by K-SVD on Y_i; the shared sub-dictionary D_{c+1} is initialized by K-SVD on Y.
While the stop criterion is not reached
      Update of X:
          Equation (11) is solved by FISTA and TwIST
      Update of D_i, i = 1, ..., c:
          Equation (18) is solved by MOCOD
      Update of D_{c+1}:
          Equation (22) is solved by MOCOD
End While
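For orientation, a high-level Python skeleton of the alternating scheme in Table 1 is sketched below; initialize_ksvd, update_X, update_Di, update_D_shared and objective are hypothetical helpers standing in for the K-SVD initialization, the FISTA/TwIST step of Eq. (11), the MOCOD steps of Eqs. (18) and (22) and the evaluation of Eq. (2), and are not defined in the paper.

import numpy as np

def cdsdl(Y_list, k_list, T=20, tol=1e-4):
    """Alternating optimization of the CDSDL objective (sketch only).
    Y_list: per-class training matrices Y_1..Y_c;
    k_list: sub-dictionary sizes k_1..k_c, k_{c+1}."""
    c = len(Y_list)
    # class-specific sub-dictionaries from Y_i, shared sub-dictionary from all of Y
    D = [initialize_ksvd(Y_list[i], k_list[i]) for i in range(c)]
    D.append(initialize_ksvd(np.hstack(Y_list), k_list[c]))
    prev_J = np.inf
    for _ in range(T):
        X = update_X(D, Y_list)                  # Eq. (11), FISTA/TwIST
        for i in range(c):
            D[i] = update_Di(D, X, Y_list, i)    # Eq. (18), MOCOD
        D[c] = update_D_shared(D, X, Y_list)     # Eq. (22), MOCOD
        J = objective(D, X, Y_list)              # Eq. (2)
        if abs(prev_J - J) < tol:                # stop criterion
            break
        prev_J = J
    return D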

3.3.1. Update of X

With fixed D, J can be rewritten as follows.   

\min_{X_i} \|Y_i - D X_i\|_F^2 + \|Y_i - D \tilde{Q}_i (\tilde{Q}_i)^T X_i\|_F^2 + \lambda_2 \left( \|X_i - M_i\|_F^2 - \sum_{i=1}^{c} \|M_i - M\|_F^2 + \|X_i\|_F^2 \right) + \lambda_1 \|X_i\|_1 \quad (10)

Equation (10) can be rewritten as follows.   

\min_{X_i} F(X_i) + 2\tau \|X_i\|_1 \quad (11)

where F(X_i) = \|Y_i - D X_i\|_F^2 + \|Y_i - D \tilde{Q}_i (\tilde{Q}_i)^T X_i\|_F^2 + \lambda_2 \left( \|X_i - M_i\|_F^2 - \sum_{i=1}^{c} \|M_i - M\|_F^2 + \|X_i\|_F^2 \right) and \tau = \lambda_1 / 2.

Because F(X_i) is convex and differentiable with respect to X_i, the fast iterative shrinkage thresholding algorithm (FISTA)42) and the two-step iterative shrinkage/thresholding (TwIST) algorithm45) can be employed to solve Eq. (11).

Calculating the first derivative of F(X_i) with respect to X_i (detailed steps are given in Appendix 1), we have

\nabla_{X_i} F(X_i) = 2 D^T (D X_i - Y_i) + 2 \tilde{Q}_i (\tilde{Q}_i)^T D^T \left[ D \tilde{Q}_i (\tilde{Q}_i)^T X_i - Y_i \right] + 2 \lambda_2 \left[ X_i O_i O_i^T + X_i P_i P_i^T - R P_i^T + \sum_{j=1, j \neq i}^{c} \left( X_i Q_i^j (Q_i^j)^T - T_j (Q_i^j)^T \right) \right] \quad (12)

where E_i^j denotes the n_i × n_j all-ones matrix, O_i = I_{n_i×n_i} - E_i^i / n_i, Q_i^j = E_i^j / N, P_i = E_i^i / n_i - Q_i^i, R = \sum_{j=1, j \neq i}^{c} X_j Q_j^i, and T_j = X_j E_j^j / n_j - \sum_{l=1, l \neq i}^{c} X_l Q_l^j.

According to the FISTA and TwIST algorithms, we have

\{X_i\}_{t+1} = (1 - \alpha)\{X_i\}_{t-1} + (\alpha - \beta)\{X_i\}_t + \beta S_{\tau/\sigma}\left[ \{X_i\}_t - \frac{1}{2\sigma} \nabla_{X_i} F(X_i) \right] \quad (13)

where S_{\tau/\sigma}(\cdot) is the soft-threshold shrinkage operator, {X_i}_{t-1} is the previous iterate, {X_i}_t the current iterate, {X_i}_{t+1} the next iterate, and α > 0, β > 0, σ > 0.
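A minimal NumPy sketch of the shrinkage update in Eq. (13) is given below; grad_F stands for a callable returning the gradient of Eq. (12), and the step parameters alpha, beta and sigma are illustrative assumptions.

import numpy as np

def soft_threshold(Z, t):
    """Element-wise soft-threshold shrinkage operator S_t(Z)."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def x_update(X_prev, X_cur, grad_F, tau, alpha=1.0, beta=1.0, sigma=1.0):
    """One two-step (TwIST-like) iteration of Eq. (13) for X_i."""
    shrunk = soft_threshold(X_cur - grad_F(X_cur) / (2.0 * sigma), tau / sigma)
    return (1.0 - alpha) * X_prev + (alpha - beta) * X_cur + beta * shrunk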

3.3.2. Update of Di

With X and D_j, j ≠ i, fixed, J can be rewritten as follows.

\min_{D_i} \sum_{i=1}^{c} \left( \|Y_i - D X_i\|_F^2 + \|Y_i - D \tilde{Q}_i (\tilde{Q}_i)^T X_i\|_F^2 \right) + \frac{w_1 n_i}{k_i^2} \|D_i^T D_i - I_{k_i}\|_F^2 + \frac{w_2 n_i}{k_i (K - k_i)} \|D_i^T (D Q_{-i})\|_F^2 \quad (14)

Then, we have   

\min_{D_i} \|Y - D (X_1, \ldots, X_c)\|_F^2 + \left\| Y - D \left[ \tilde{Q}_1 (\tilde{Q}_1)^T X_1, \ldots, \tilde{Q}_c (\tilde{Q}_c)^T X_c \right] \right\|_F^2 + \frac{w_1 n_i}{k_i^2} \|D_i^T D_i - I_{k_i}\|_F^2 + \frac{w_2 n_i}{k_i (K - k_i)} \|D_i^T (D Q_{-i})\|_F^2 \quad (15)

Denote A = [Y, Y] ∈ ℝ^{d×2N}, B = [X_1, ..., X_c, \tilde{Q}_1 (\tilde{Q}_1)^T X_1, ..., \tilde{Q}_c (\tilde{Q}_c)^T X_c] ∈ ℝ^{K×2N} and C = D Q_{-i} ∈ ℝ^{d×(K - k_i)}; then we have

\min_{D_i} \|A - D B\|_F^2 + \frac{w_1 n_i}{k_i^2} \|D_i^T D_i - I_{k_i}\|_F^2 + \frac{w_2 n_i}{k_i (K - k_i)} \|D_i^T C\|_F^2 \quad (16)

Therefore   

\min_{D_i} \left\| A - \sum_{j=1, j \neq i}^{c+1} D_j B_j - D_i B_i \right\|_F^2 + \frac{w_1 n_i}{k_i^2} \|D_i^T D_i - I_{k_i}\|_F^2 + \frac{w_2 n_i}{k_i (K - k_i)} \|D_i^T C\|_F^2 \quad (17)

Denote \tilde{A} = A - \sum_{j=1, j \neq i}^{c+1} D_j B_j ∈ ℝ^{d×2N}; Eq. (17) is then equivalent to the following problem.

\min_{D_i} \|\tilde{A} - D_i B_i\|_F^2 + \frac{w_1 n_i}{k_i^2} \|D_i^T D_i - I_{k_i}\|_F^2 + \frac{w_2 n_i}{k_i (K - k_i)} \|D_i^T C\|_F^2 \quad (18)

where B_i denotes the k_i rows of B corresponding to the sub-dictionary D_i.

Equation (18) can be solved by the method of optimal coherence-constrained directions (MOCOD).44)
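The MOCOD solver of ref. 44) is not reproduced here; as a rough stand-in, the sketch below updates D_i for Eq. (18) with a few projected-gradient steps followed by column renormalization. This is a simplified alternative under assumed step sizes, not the authors' MOCOD implementation.

import numpy as np

def update_Di_pgd(A_tilde, Bi, C, a, b, Di0, n_steps=20, lr=1e-3):
    """Projected-gradient update of D_i for Eq. (18): data term plus
    self-incoherence weight a = w1*n_i/k_i^2 and cross-incoherence weight
    b = w2*n_i/(k_i*(K - k_i)). Columns are renormalized to unit l2-norm."""
    Di = Di0.copy()
    for _ in range(n_steps):
        grad = (-2.0 * (A_tilde - Di @ Bi) @ Bi.T
                + 4.0 * a * Di @ (Di.T @ Di - np.eye(Di.shape[1]))
                + 2.0 * b * (C @ C.T) @ Di)
        Di -= lr * grad
        Di /= np.linalg.norm(Di, axis=0, keepdims=True)   # unit-norm atoms
    return Di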

3.3.3. Update of Dc+1

With fixed X and D-(c+1), J can be rewritten as follows.   

\min_{D_{c+1}} \sum_{i=1}^{c} \|Y_i - D X_i\|_F^2 + \sum_{i=1}^{c} \|Y_i - D_i X_i^i - D_{c+1} X_i^{c+1}\|_F^2 + \frac{w_1 n_{c+1}}{k_{c+1}^2} \|D_{c+1}^T D_{c+1} - I_{k_{c+1}}\|_F^2 + \frac{w_2 n_{c+1}}{k_{c+1} (K - k_{c+1})} \|D_{c+1}^T (D Q_{-(c+1)})\|_F^2 \quad (19)

Because \sum_{i=1}^{c} \|Y_i - D X_i\|_F^2 denotes the global reconstruction error, we have

\min_{D_{c+1}} \left\| Y - \sum_{j=1}^{c} D_j X^j - D_{c+1} X^{c+1} \right\|_F^2 + \sum_{i=1}^{c} \|Y_i - D_i X_i^i - D_{c+1} X_i^{c+1}\|_F^2 + \frac{w_1 n_{c+1}}{k_{c+1}^2} \|D_{c+1}^T D_{c+1} - I_{k_{c+1}}\|_F^2 + \frac{w_2 n_{c+1}}{k_{c+1} (K - k_{c+1})} \|D_{c+1}^T (D Q_{-(c+1)})\|_F^2 \quad (20)

where X^{c+1} = (X_1^{c+1}, X_2^{c+1}, ..., X_c^{c+1}).

Denote A = Y - D Q_{-(c+1)} (Q_{-(c+1)})^T X, B = (Y_1 - D_1 X_1^1, Y_2 - D_2 X_2^2, ..., Y_c - D_c X_c^c) and C = D Q_{-(c+1)}; then we have

\min_{D_{c+1}} \|A - D_{c+1} X^{c+1}\|_F^2 + \|B - D_{c+1} X^{c+1}\|_F^2 + \frac{w_1 n_{c+1}}{k_{c+1}^2} \|D_{c+1}^T D_{c+1} - I_{k_{c+1}}\|_F^2 + \frac{w_2 n_{c+1}}{k_{c+1} (K - k_{c+1})} \|D_{c+1}^T C\|_F^2 \quad (21)

Denote U = [A, B] and V = [X^{c+1}, X^{c+1}]; Eq. (21) is then equivalent to the following problem.

\min_{D_{c+1}} \|U - D_{c+1} V\|_F^2 + \frac{w_1 n_{c+1}}{k_{c+1}^2} \|D_{c+1}^T D_{c+1} - I_{k_{c+1}}\|_F^2 + \frac{w_2 n_{c+1}}{k_{c+1} (K - k_{c+1})} \|D_{c+1}^T C\|_F^2 \quad (22)

Similarly to Eq. (18), Eq. (22) can be solved by the MOCOD algorithm.

3.4. Classification

For the CDSDL algorithm, both the reconstruction errors associated with the different classes and the coding vector can achieve good classification performance. Based on the coding vector, two kinds of classifier are considered.

For the global classifier, a test sample ŷ is coded over the whole dictionary D as follows.

\hat{x} = \arg\min_{x} \|\hat{y} - D x\|_2^2 + \lambda \|x\|_1 \quad (23)

The class of ŷ is assigned according to arg min_i ‖ŷ - D_i x̂_i‖_2^2, i.e. the index i of the smallest class-specific reconstruction error, where x̂ = [x̂_1; ...; x̂_i; ...; x̂_c; x̂_{c+1}] ∈ ℝ^K, i = 1, 2, ..., c, x̂_i is the sub-vector of x̂ corresponding to the sub-dictionary D_i, and the global coding vector x̂_G equals x̂.

For the local classifier, each class-specific sub-dictionary D_i is stacked with the shared sub-dictionary D_{c+1} to construct c compositional dictionaries. A test sample ŷ is coded over each compositional dictionary as follows.

\hat{x}_i = \arg\min_{x} \|\hat{y} - (D_i, D_{c+1}) x\|_2^2 + \lambda \|x\|_1 \quad (24)

The class of ŷ is assigned according to arg min_i ‖ŷ - (D_i, D_{c+1}) x̂_i‖_2^2, i.e. the index i of the smallest reconstruction error, where x̂_i = [x̂_i^i; x̂_i^{c+1}] ∈ ℝ^{k_i + k_{c+1}}, i = 1, 2, ..., c, x̂_i^i is the sub-vector corresponding to the class-specific sub-dictionary D_i, and x̂_i^{c+1} is the sub-vector corresponding to the shared sub-dictionary D_{c+1}.

In fact, x̂_i^{c+1}, which only contributes to reconstruction, can be omitted. Therefore, the c remaining sub-vectors x̂_i^i are stacked together to form the local coding vector x̂_L = [x̂_1^1; ...; x̂_i^i; ...; x̂_c^c] ∈ ℝ^{K - k_{c+1}}.

The vectors x̂_G and x̂_L can also be used directly as the input feature vector of a linear one-versus-rest (OVR) SVM.
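A sketch of the local classification rule of Eq. (24) is shown below; sparse_code stands for any l1 solver (for example the shrinkage iteration above) and is an assumption of this illustration.

import numpy as np

def classify_local(y, D_class, D_shared, sparse_code, lam=0.1):
    """Local classifier: code y over each [D_i, D_shared] and pick the class
    with the smallest reconstruction error (Eq. (24)). Also returns the
    stacked local coding vector x_L built from the class-specific parts."""
    errors, parts = [], []
    for Di in D_class:
        Dcomp = np.hstack([Di, D_shared])
        x = sparse_code(Dcomp, y, lam)              # solve Eq. (24)
        errors.append(np.linalg.norm(y - Dcomp @ x)**2)
        parts.append(x[:Di.shape[1]])               # keep only x_i^i
    x_L = np.concatenate(parts)                     # feature for the OVR SVM
    return int(np.argmin(errors)), x_L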

4. Experimental Results

The dataset of steel surface images shown in Fig. 1 was collected from a practical steel sheet production line. It comprises four kinds of typical surface defects (Folding, Line, Patch and Scratch) and one non-defective class. There are 300 8-bit grayscale images per class, each of size 200×200 pixels. Each image is resized from 200×200 pixels to 40×40 pixels to avoid over-fitting. Because of the non-uniform brightness of the images, each image is normalized to have zero mean and unit l2-norm (a preprocessing sketch is given after Eq. (25)). 50% of the images of each class are randomly chosen as the training dataset, and the rest are used as the testing dataset. The classification accuracy is defined as follows.

Acc = \frac{N_R}{N} \quad (25)
where N_R is the number of test samples that are correctly classified and N is the total number of test samples. To account for random variation, each experiment is repeated 10 times and the mean and standard deviation of the results are reported. The weight parameters λ_1, λ_2, w_1, w_2 and the sparsity T are tuned by 5-fold cross validation; finally, λ_1 = 0.1, λ_2 = 0.001, w_1 = 0.01, w_2 = 0.1 and T = 5 are used. Further fine-tuning may improve the classification performance.
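The preprocessing described above (down-sampling to 40×40, zero mean, unit l2-norm) can be written compactly as follows; the 5×5 block-averaging used for resizing is an illustrative choice, since the paper does not specify the resizing method.

import numpy as np

def preprocess(img):
    """img: 200x200 grayscale image -> normalized 1600-dim feature vector."""
    small = img.astype(np.float64).reshape(40, 5, 40, 5).mean(axis=(1, 3))
    y = small.ravel()
    y -= y.mean()                       # zero mean
    return y / np.linalg.norm(y)        # unit l2-norm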

The relationship between classification accuracy and sub-dictionary size is shown in Fig. 3. Figure 3(a), in which the size of the class-specific sub-dictionaries is fixed, shows that the classification performance decreases as the size of the shared sub-dictionary increases. The likely reasons are that a small shared sub-dictionary is enough to capture the shared features of the defect images, and that a larger shared sub-dictionary may introduce redundancy into the feature vector. Figure 3(b), in which the size of the shared sub-dictionary is fixed, shows that the classification performance improves as the size of the class-specific sub-dictionaries increases. The likely reason is that more discriminative information can be captured by a larger class-specific sub-dictionary. In general, a larger dictionary has a stronger ability to represent details and achieves better classification performance at the expense of an increased computational load, but once the class-specific sub-dictionary reaches a certain size, the classification performance no longer improves. Therefore, a tradeoff must be made between classification performance and computational efficiency. As shown in Fig. 3, with a shared sub-dictionary size of k_s = 2 and a class-specific sub-dictionary size of k_c = 25, CDSDL still achieves a high classification accuracy of 94.25%; from k_c = 50 to k_c = 25, the accuracy drops by only 1.58%. Therefore, the CDSDL algorithm is a simple and efficient way to learn a compact, reconstructive and discriminative dictionary, which boosts classification performance while reducing the computational load.

Fig. 3.

The classification accuracy versus the number of atoms in the sub-dictionaries, where the legend denotes the number of atoms in the class-specific sub-dictionary (top) and in the shared sub-dictionary (bottom), respectively.

Figure 4 shows that the objective function value of the CDSDL algorithm decreases rapidly and converges within about 5–8 iterations.

Fig. 4.

The convergence curve of the CDSDL algorithm.

Figure 5 shows that the global coding matrices of the training dataset and the testing dataset are nearly block-diagonal. Each block indicates that the main part of the coding matrix corresponds to the respective class-specific sub-dictionary. Therefore, the CDSDL algorithm enhances the discrimination of the coding vectors, which improves the classification performance.

Fig. 5.

The visualization image of global coding vector matrix (Folding, Line, Non-defect, Patch, Scratch): training dataset (top); testing dataset (bottom).

The CDSDL algorithm is compared with several state-of-the-art methods, including COPAR,35) FDDL,33) LCKSVD31) and SRC.22) As shown in Table 2, CDSDL with the local classifier achieves 94.25% classification accuracy, compared with 91.85% for COPAR, 51.89% for FDDL, 66.99% for LCKSVD and 56.96% for SRC. Compared with SRC, the baseline method in this experiment, CDSDL improves the classification accuracy by more than 37 percentage points, and it outperforms FDDL by about 42.36 percentage points. CDSDL also improves on the 91.85% achieved by COPAR by 2.4 percentage points. Therefore, CDSDL has better classification performance than these methods. In addition, the local classifier (94.25%) of CDSDL performs much better than the global classifier (83.51%), an improvement of about 10 percentage points. The probable reason is that the number of training samples of each class is relatively large, or that the learned class-specific sub-dictionaries are large.

Table 2. The comparison of classification accuracy of different methods.

Method      Accuracy (%)
COPAR       91.85 ± 1.02
FDDL        51.89 ± 4.31
LCKSVD      66.99 ± 1.42
SRC         56.96 ± 1.20
CDSDL       94.25 ± 0.42

According to Fig. 6, the error rate of scratch is higher than that of the other defects: 33 scratch images are misclassified, 9 as folding, 10 as line and 13 as patch. The possible reasons are as follows. (1) These defects are similar to scratch in size, direction and illumination. (2) The CDSDL algorithm is sensitive to the scratch defect and thus generates an inappropriate feature representation. This phenomenon will be investigated further. Figure 7 shows some scratch samples that are misclassified as other defects.

Fig. 6.

The confusion matrix, where the element in the i-th row and j-th column indicates the number of samples of the j-th class misclassified as the i-th class, j ≠ i.

Fig. 7.

Some scratch samples misclassified as other defects: (a) Scratch→Folding; (b) Scratch→Line; (c) Scratch→Patch.

To evaluate the robustness of the CDSDL algorithm, additive Gaussian noise was added to the surface images. 50% of the images of each class are randomly chosen as the training dataset, and the rest are used as the testing dataset. Gaussian noise with different signal-to-noise ratios (SNR) of 15 dB, 20 dB and 25 dB is added to all training samples. Figure 8 shows the same image corrupted by Gaussian noise at the different SNRs. According to Table 3, CDSDL still achieves 92.45% classification accuracy at an SNR of 25 dB.
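For reference, additive Gaussian noise at a prescribed SNR can be generated as in the sketch below; computing the signal power per image and the random seeding are assumptions of this illustration.

import numpy as np

def add_gaussian_noise(y, snr_db, rng=np.random.default_rng(0)):
    """Add zero-mean Gaussian noise so that 10*log10(P_signal/P_noise) = snr_db."""
    p_signal = np.mean(y**2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    return y + rng.normal(0.0, np.sqrt(p_noise), size=y.shape)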

Fig. 8.

The same image corrupted by Gaussian noise: (a) 15 dB; (b) 20 dB; (c) 25 dB; (d) Original.

Table 3. The classification accuracy of the CDSDL algorithm for different Gaussian noise.
SNR (dB)    Accuracy (%)
15          65.77 ± 2.50
20          85.88 ± 1.63
25          92.45 ± 0.29

5. Conclusion

A classification method based on class-specific and shared sub-dictionary learning is presented in this paper. The class-specific and shared sub-dictionary learning is formulated jointly with a reconstructive error term, a sparsity constraint and discrimination-promoting constraints. Each class-specific sub-dictionary represents the samples of the corresponding class well, and the shared sub-dictionary contributes to the reconstruction of all samples. Therefore, the feature representation based on the class-specific and shared sub-dictionaries not only faithfully represents samples from any class, but is also more compact and discriminative for classification. Experimental results show that the CDSDL algorithm achieves better accuracy in classifying surface defects of steel sheet.

Acknowledgements

This work was supported by the Natural Science Foundation of China under grant No. 51174151, and by the Specialized Research Fund for the Key Science and Technology Innovation Plan of Hubei Province under grant No. 2013AAA011. The authors would like to thank Kechen Song and Yungang Tan for providing the defect images. The MATLAB code was revised and optimized from refs. 27,33,35,44).

Appendix

Appendix 1: Derivation of F(Xi) in Eq. (12)

Denote by E_i^j the n_i × n_j all-ones matrix and by I_{n_i×n_i} the n_i × n_i identity matrix, and let O_i = I_{n_i×n_i} - E_i^i / n_i; then we have

\|X_i - M_i\|_F^2 = \left\| X_i I_{n_i \times n_i} - X_i \frac{E_i^i}{n_i} \right\|_F^2 = \|X_i O_i\|_F^2

Denote Q_i^j = E_i^j / N, P_i = E_i^i / n_i - E_i^i / N = E_i^i / n_i - Q_i^i and R = \sum_{j=1, j \neq i}^{c} X_j Q_j^i; then we have

\|M_i - M\|_F^2 = \left\| X_i \frac{E_i^i}{n_i} - \left( X_i \frac{E_i^i}{N} + \sum_{j=1, j \neq i}^{c} X_j \frac{E_j^i}{N} \right) \right\|_F^2 = \left\| X_i \frac{E_i^i}{n_i} - X_i Q_i^i - \sum_{j=1, j \neq i}^{c} X_j Q_j^i \right\|_F^2 = \left\| X_i P_i - \sum_{j=1, j \neq i}^{c} X_j Q_j^i \right\|_F^2 = \|X_i P_i - R\|_F^2

For the i-th class, we have \left\| X_i P_i - \sum_{j=1, j \neq i}^{c} X_j Q_j^i \right\|_F^2.

For the non-i-th classes, we have \sum_{j=1, j \neq i}^{c} \left\| X_j P_j - \sum_{l=1, l \neq j}^{c} X_l Q_l^j \right\|_F^2.

Taking the j-th term of the non-i-th classes, we have

\left\| X_j P_j - \sum_{l=1, l \neq j}^{c} X_l Q_l^j \right\|_F^2 = \left\| X_j \frac{E_j^j}{n_j} - X_j Q_j^j - \sum_{l=1, l \neq j}^{c} X_l Q_l^j \right\|_F^2 = \left\| X_j \frac{E_j^j}{n_j} - \left( X_j Q_j^j + \sum_{l=1, l \neq j}^{c} X_l Q_l^j \right) \right\|_F^2

Separating X_i, we have \left\| X_j \frac{E_j^j}{n_j} - \left( X_i Q_i^j + \sum_{l=1, l \neq i}^{c} X_l Q_l^j \right) \right\|_F^2 = \left\| X_j \frac{E_j^j}{n_j} - \sum_{l=1, l \neq i}^{c} X_l Q_l^j - X_i Q_i^j \right\|_F^2 = \|T_j - X_i Q_i^j\|_F^2

Summing over all non-i-th classes gives \sum_{j=1, j \neq i}^{c} \|T_j - X_i Q_i^j\|_F^2; then, consistently with Eq. (4), we have

\|X_i - M_i\|_F^2 - \sum_{i=1}^{c} \|M_i - M\|_F^2 = \|X_i O_i\|_F^2 - \|X_i P_i - R\|_F^2 - \sum_{j=1, j \neq i}^{c} \|T_j - X_i Q_i^j\|_F^2

Calculating the first derivatives with respect to X_i, we have

\frac{\partial \|X_i O_i\|_F^2}{\partial X_i} = 2 X_i O_i O_i^T, \qquad \frac{\partial \|X_i P_i - R\|_F^2}{\partial X_i} = 2 \left( X_i P_i P_i^T - R P_i^T \right), \qquad \frac{\partial \sum_{j=1, j \neq i}^{c} \|T_j - X_i Q_i^j\|_F^2}{\partial X_i} = 2 \sum_{j=1, j \neq i}^{c} \left[ X_i Q_i^j (Q_i^j)^T - T_j (Q_i^j)^T \right]

References
 
© 2017 by The Iron and Steel Institute of Japan