Journal of Forest Planning
Online ISSN : 2189-8316
Print ISSN : 1341-562X
Article
Convolutional Neural Network Applied to Tree Species Identification Based on Leaf Images
Yasushi Minowa Yui Nagasaki

2020 Volume 26 Pages 1-11

Abstract

We identified tree species based on leaf images using a convolutional neural network (CNN). We sampled approximately 200 to 300 leaves from each of five tree species on the Kyoto University campus. Twenty to thirty 1.0 × 1.0 cm (256 × 256 pixel) leaf images were extracted per leaf, from which 10,000 leaf images (2,000 × 5 tree species) were prepared as sample data. Color, grayscale, and binary images were used as image types. We constructed 36 learning models based on differences in learning patterns, image types, and learning iterations. Performance of the proposed models was evaluated using the Matthews correlation coefficient (MCC). Both training and test data were classified with high accuracy. The mean MCC of the five tree species ranged from 0.881 to 0.998 for the training data and from 0.815 to 0.994 for the test data. Classification accuracy was generally high for color images and low for grayscale images. We found that Cinnamomum camphora was often misclassified as Quercus myrsinifolia, and Quercus myrsinifolia as Quercus glauca; that most Quercus glauca, Ilex integra, and Pittosporum tobira images were correctly classified in the training data; and that misclassification of Ilex integra in the test data was very low.

INTRODUCTION

The use of mobile terminals has been widely examined in various fields. For example, in forest science, Kumar et al. (2012) proposed a tree-retrieval system, implemented as a smartphone application called Leafsnap, to retrieve and identify tree species. This application evaluates leaf-shape images taken with a smartphone camera and automatically identifies 184 tree species primarily inhabiting North America. In addition, the digital picture book BIOME was developed to promote the conservation of biological diversity while generating a profit. This application can recognize various living organisms by transmitting a photograph to an internet server and uses a game-like function to collect species information (BIOME, 2017). Similarly, we have focused on developing a high-precision auto-tree-identification system based on leaf images for various mobile terminals. Easily identifying tree species with such a system may improve forest investigations and forest environmental education. For example, the system could substitute for picture books of flora or offer digital tree-distribution maps on the web through GPS information, which may contribute to the forest science field. Moreover, by adding a game function as in BIOME, the system could support the learning and education of many people. Thus, we have developed tree-retrieval algorithms to promote these objectives. Minowa et al. (2011) classified tree species based on leaf-shape images using a self-organizing map and a decision-tree algorithm. Generally, most tree-retrieval systems perform identification based on leaf-shape images (Gouveia et al., 1997; Wang et al., 2000; Nam and Hwang, 2005; Shen et al., 2005; Lee and Chen, 2006; Du et al., 2007; Kumar et al., 2012). However, it is difficult to identify all tree species using only these images (Minowa et al., 2011). For example, when a leaf is large or compound, the photograph must be taken from a long distance with a smartphone to include the whole leaf, which causes the details of the leaf image to become indistinct. Thus, Minowa et al. (2019) classified tree species based not only on leaf shapes but also on venation patterns, and showed that classification accuracy improved when only venation information, without leaf-shape information, was used as image features. Although this approach shows high classification accuracy for training data, it does not always perform well on test data. Moreover, the authors used fractal dimensions or histograms of oriented gradients (Dalal and Triggs, 2005; Yamasaki, 2010a) as image features for venation patterns. Such features are difficult to use in practice because several image-processing steps are necessary to extract them before the classification model can be applied.

Since McCulloch and Pitts (1943) proposed the formal neuron model, studies of artificial intelligence and its applications have progressed greatly. In particular, with the evolution of computing resources and techniques after 2000, results obtained using "deep learning" techniques have attracted attention (Yamashita, 2016). Convolutional neural networks (CNNs), proposed by LeCun et al. (1998), have contributed substantially to the image-recognition field (LeCun et al., 1998; Okaya, 2015; Yamashita, 2016). A notable example of deep learning with CNNs is AlphaGo, developed by Google DeepMind, which was the first Go program to defeat a professional human player without a handicap (tagai-sen) by combining deep neural networks with tree search (Silver et al., 2016, 2017; Ohtsuki, 2019). Deep learning produces superior results in image recognition because of innovations in advanced computing hardware as well as the use of a method that differs substantially from past image-recognition methods. Previous image-recognition methods extract image features in advance, which are input as training data into a learning model. The types of image features used strongly affect classification accuracy; however, image features are difficult to specify because they vary between objects. Thus, extracting image features depends greatly on the experience of researchers or developers (Yamashita, 2016). By contrast, deep learning extracts image features by itself and is therefore highly accessible, meaning that non-experts in image recognition can use it easily (Makino and Nishizaki, 2018). In addition, most applications that perform deep learning use freely licensed software available to the public. Therefore, researchers in various fields can perform image recognition using deep learning. Both machine learning and deep learning have been applied to tree identification: machine learning has been used with venation patterns extracted from leaves in the National Cleared Leaf Collection, housed at the Smithsonian Institution (Wilf et al., 2016), whereas deep learning has been applied to point-cloud data from laser-scanned forests (Mizoguchi et al., 2016) and to tree species identification by fine-tuning on aerial photographs collected by a drone (Nakane and Wakatsuki, 2018). Wilf et al. (2016) performed tree identification based on venation patterns but used machine learning with hand-crafted leaf-image features. Moreover, in the deep-learning studies of Mizoguchi et al. (2016) and Nakane and Wakatsuki (2018), the classification granularity is coarse and varies widely between studies. Few reports are available on tree identification by deep learning based on venation patterns. However, this method would be advantageous for identifying tree species with mobile terminals, as it does not require an image of the whole leaf. This study was conducted to identify tree species based on leaf images with a CNN. We divided one section of a leaf image into dozens of pieces of sample data that were input into the CNN model. We constructed various learning models based on differences in learning patterns, image types, and learning iterations, and verified the classification accuracy of the models for both training and test data.

MATERIALS AND METHODS

Study Site

We sampled 200–300 leaves from each of five tree species, Cinnamomum camphora, Ilex integra, Pittosporum tobira, Quercus glauca, and Quercus myrsinifolia, on the Kyoto University campus. We chose these five species as our test species because, in a previous study (Minowa et al., 2019), they showed greater intraspecific variance in leaf shape and lower classification accuracy than other species. We used tree species planted on the Kyoto University campus primarily so that repeated sampling could be performed at the same sites using the same protocol (Minowa et al., 2019); thus, samples can be compared more easily in future studies.

Leaf Image Processing

Leaves sampled on the Kyoto University campus were scanned using a GT-X970 scanner (EPSON Co., Ltd., Suwa, Japan) at a color image resolution of 650 dpi. We used ImageJ 1.50 (NIH, 2014), an open-source image-processing program, for leaf-image processing, which was performed as follows. We randomly extracted 20–30 1.0 × 1.0 cm (256 × 256 pixel) images from each leaf, excluding the edges, and prepared 10,000 leaf images (= 2,000 × 5 tree species) as sample data. Color, grayscale, and binary images were used as image types. Figure 1 illustrates the three image types for the five tree species. Although our goal is to develop an auto-tree-identification system that analyzes leaf images photographed with mobile devices, we used scanned images mainly because their resolution is higher than that of photographs taken with mobile devices; the results of this study may thus serve as a benchmark for analyses using mobile devices. Grayscale and binary images were included mainly because color images are extremely useful when photography conditions are controlled, such as indoors, but are not always useful in uncontrolled environments such as outdoors, where they depend greatly on the hardware characteristics of individual cameras, parameters such as sensitivity, and lighting color (Sato, 2011). For example, light-and-shade-based techniques such as Haar-like features, which use grayscale images, are mainstream approaches for facial recognition (Papageorgiou et al., 1998). Moreover, binary images were used because binarization is an important preprocessing step in image recognition. Thus, we used grayscale and binary images in addition to color images in this study.
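The patch-extraction step above was performed in ImageJ; the following is a minimal Python sketch of the same procedure, assuming Pillow is installed. The file path, patch count, edge margin, and binarization threshold are illustrative assumptions, not the authors' exact settings.

```python
# Minimal sketch of the patch sampling described above (the authors used
# ImageJ). Assumes Pillow; the margin and threshold values are assumptions.
import random
from PIL import Image

PATCH = 256    # 256 x 256 pixels, about 1.0 x 1.0 cm at this scan resolution
MARGIN = 64    # keep patches away from the leaf edge (illustrative value)

def sample_patches(path, n_patches=25, seed=0):
    """Randomly crop n_patches PATCH x PATCH regions from one scanned leaf."""
    rng = random.Random(seed)
    img = Image.open(path).convert("RGB")
    w, h = img.size  # assumed larger than PATCH + 2 * MARGIN in both axes
    patches = []
    for _ in range(n_patches):
        x = rng.randint(MARGIN, w - PATCH - MARGIN)
        y = rng.randint(MARGIN, h - PATCH - MARGIN)
        patches.append(img.crop((x, y, x + PATCH, y + PATCH)))
    return patches

def to_three_types(patch, threshold=128):
    """Return the color, grayscale, and binary versions of one patch."""
    gray = patch.convert("L")
    binary = gray.point(lambda v: 255 if v >= threshold else 0)
    return patch, gray, binary
```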

Fig. 1

Samples of leaf images for each tree species.

a) Color image; b) Grayscale image; c) Binary image.

CNN and GoogLeNet

A CNN is a neural network model used mainly in the image-recognition field. In the 2000s, histograms of oriented gradients and the scale-invariant feature transform (Lowe, 1999; Yamasaki, 2010b) were commonly used as image features in classification problems solved with support vector machine classifiers (Vapnik and Lerner, 1963; Yamasaki, 2010a). In 2012, however, a CNN based on the AlexNet architecture won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) with overwhelming classification accuracy, far surpassing the previous state of the art (Sanchez and Perronnin, 2011). A CNN is a deep neural network consisting of several types of layers stacked in sequence: convolution layers, pooling layers, and a fully connected layer (Fig. 2). A convolution layer extracts local image features from a small area by applying convolution calculations to the input image. A pooling layer compresses the feature maps produced by the convolution layer. After many repetitions of these processes, the fully connected layer runs as the final step and outputs the final result (Yamashita, 2016).
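As an illustration of this convolution → pooling → fully connected flow, the following is a minimal sketch in PyTorch (the present study used Caffe and GoogLeNet, not this toy network); all layer widths are illustrative.

```python
# Toy illustration of the conv -> pool -> fully connected pipeline in Fig. 2,
# written with PyTorch for brevity (the study itself used Caffe/GoogLeNet).
import torch
import torch.nn as nn

class TinyLeafCNN(nn.Module):
    def __init__(self, n_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, padding=2),   # local feature extraction
            nn.ReLU(),
            nn.MaxPool2d(2),                              # feature compression
            nn.Conv2d(16, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 64 * 64, n_classes)  # fully connected output

    def forward(self, x):              # x: (batch, 3, 256, 256)
        x = self.features(x)           # -> (batch, 32, 64, 64)
        return self.classifier(x.flatten(1))

logits = TinyLeafCNN()(torch.randn(1, 3, 256, 256))  # scores for the 5 species
```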

Fig. 2

Structure of a convolutional neural network.

In the present study, we used GoogLeNet, the 2014 ILSVRC champion (Szegedy et al., 2015), as the CNN architecture. Similar to the network-in-network (NIN) algorithm proposed by Lin et al. (2014), GoogLeNet utilizes micro-networks that perform fully connected operations across feature maps in place of a simple activation function. GoogLeNet consists of 22 layers built from inception modules, each of which combines multiple convolution and pooling layers (Yamashita, 2016).
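The following sketch illustrates the idea of an inception module: parallel 1 × 1, 3 × 3, and 5 × 5 convolutions plus a pooling branch, concatenated along the channel axis. It is written in PyTorch with illustrative channel counts, not the published GoogLeNet configuration.

```python
# Simplified inception module in the spirit of Szegedy et al. (2015).
# Channel counts are illustrative only.
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 16, kernel_size=1)
        self.b3 = nn.Sequential(      # 1x1 reduction before the 3x3 convolution
            nn.Conv2d(in_ch, 16, kernel_size=1),
            nn.Conv2d(16, 24, kernel_size=3, padding=1),
        )
        self.b5 = nn.Sequential(      # 1x1 reduction before the 5x5 convolution
            nn.Conv2d(in_ch, 4, kernel_size=1),
            nn.Conv2d(4, 8, kernel_size=5, padding=2),
        )
        self.bp = nn.Sequential(      # pooling branch with a 1x1 projection
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, 8, kernel_size=1),
        )

    def forward(self, x):             # spatial size is preserved by all branches
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

y = InceptionBlock(64)(torch.randn(1, 64, 64, 64))  # -> (1, 16+24+8+8, 64, 64)
```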

Learning Environment and Models for the CNN

The learning environment for the CNN was a computer running Linux (Ubuntu 16.04 LTS) with an Intel Core i7-7700K central processing unit and an NVIDIA GeForce GTX 1080 Ti graphics processing unit. We used CUDA 8.0 and cuDNN 5.1 to support deep learning on the graphics processing unit (NVIDIA, 2019; Yamashita, 2016; Shimizu, 2017). For training, we used DIGITS 5.0.0 (NVIDIA website), which enables web-based learning, with Caffe 0.15.13 (NVIDIA, 2019) as the learning framework. Many previous studies have used DIGITS and Caffe for applications such as automatic recognition of the microstructure of steel materials (Adachi et al., 2016), medical-field methodology (Izaki, 2017), and automated classification of coronary angiography images (Hasegawa et al., 2018). Morino (2017) also recommended Caffe to scientists and technicians as a CNN method for deep-learning image analyses. Caffe is a representative framework in the deep-learning field, supporting advanced tuning of hyper-parameters and fast learning and calculation, and DIGITS runs on top of Caffe with a web-based interface. Thus, we used DIGITS, Caffe, and GoogLeNet as our deep-learning tools.

Simulation Conditions and Evaluation Performance of the Learning Models

We divided the 10,000 leaf images into 10 equal sets, each containing images of every tree species, and set up four learning model types (Fig. 3). Learning model-1 (LM-1) used one set (= 1,000 images) as training data and was verified on that same set. Learning model-2 (LM-2) used the parameters learned in LM-1 to classify the remaining nine sets as test data. Learning model-3 (LM-3) used nine sets (= 9,000 images) as training data and was verified on those same sets. Finally, learning model-4 (LM-4) used the parameters learned in LM-3 to classify the remaining one set as test data. We conducted 10 iterations without duplication. Although deep learning does not require image features to be prepared in advance, researchers or developers must still input a large amount of image data as training data, and the amount actually necessary is not well specified. By comparing LM-1 with LM-3, we hypothesized that the data requirements for identifying tree species based on leaf images could be determined. We set all hyper-parameters used by GoogLeNet to their default values (Table 1). The numbers of epochs in this study were 50, 100, and 500. In total, we constructed 36 learning models (= 4 learning model types × 3 image types × 3 epoch types) based on the learning patterns, image types, and learning iterations.

Fig. 3

Learning patterns for tree identification.

Note: The number in each figure is the number of data points in the dataset.
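The four learning patterns can be summarized as index logic over the ten sets. The sketch below, a hedged illustration rather than the authors' code, enumerates the (training, evaluation) set pairs used by LM-1 through LM-4 in each of the ten iterations.

```python
# Index logic for the four learning patterns in Fig. 3: ten sets of 1,000
# images each; iteration i holds out set i. Illustrative sketch only.
def learning_pattern_indices(n_sets=10):
    """Yield, per iteration, the (training sets, evaluation sets) for LM-1..LM-4."""
    for i in range(n_sets):
        one = [i]
        nine = [j for j in range(n_sets) if j != i]
        yield {
            "LM-1": (one, one),    # train on 1 set (1,000 images), verify on itself
            "LM-2": (one, nine),   # same parameters, test on the other 9,000 images
            "LM-3": (nine, nine),  # train on 9 sets (9,000 images), verify on themselves
            "LM-4": (nine, one),   # same parameters, test on the held-out 1,000 images
        }

for split in learning_pattern_indices():
    pass  # train GoogLeNet on each training set and evaluate per-species MCCs
```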

Table 1 Hyper-parameters of the GoogLeNet algorithm
Parameters Setting Explanation
Base learning rate 0.01 Begin training at a learning rate of 0.01
Learning rate policy step Decrease the learning rate by a factor of gamma every step size iterations
Gamma 0.1 Factor by which the learning rate is multiplied at each step
Momentum 0.9 Weight of the previous update
Weight decay 0.001 Strength of the penalty on large weights, used to suppress overfitting
Learning environments GPU Run using the GPU
Type of optimization algorithm SGD Stochastic gradient descent
Number of epochs 50, 100, 500 Number of learning iterations (passes over the training data)
Max iterations Set automatically per model Number of parameter updates
Step size Set automatically per model Number of iterations after which the learning rate is lowered
Snapshot Set automatically per model Frequency with which the parameters are stored
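For reference, the Table 1 settings correspond to fields of a Caffe solver definition. The sketch below writes such a file from Python; it is an assumed reconstruction, since DIGITS generated the solver automatically in this study, and the net file name and the max_iter, stepsize, and snapshot values here are placeholders (the text notes they were set automatically per model).

```python
# Hedged reconstruction of a Caffe solver definition carrying the Table 1
# values. DIGITS generated this file automatically in the actual study;
# the net file name and the three "automatic" values are placeholders.
solver_text = """\
net: "googlenet_train_val.prototxt"   # network definition (name assumed)
type: "SGD"                           # stochastic gradient descent
base_lr: 0.01
lr_policy: "step"                     # decay by gamma every stepsize iterations
gamma: 0.1
momentum: 0.9
weight_decay: 0.001
max_iter: 10000                       # placeholder; set automatically per model
stepsize: 3300                        # placeholder; set automatically per model
snapshot: 1000                        # placeholder; set automatically per model
solver_mode: GPU
"""
with open("solver.prototxt", "w") as f:
    f.write(solver_text)
```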

The performance of the proposed models was evaluated with the Matthews correlation coefficient (MCC) (Eq. (1)), an index for determining whether a classification is conducted without bias. The MCC ranges from −1 to 1 (Motoda et al., 2006; Witten and Frank, 2011). In Eq. (1), true positives (TP) and true negatives (TN) are instances classified correctly; the former are positive examples and the latter negative examples in the training data. A false positive (FP) occurs when the outcome is incorrectly predicted as 'yes' (or positive) when it is actually 'no' (or negative). A false negative (FN) occurs when the outcome is incorrectly predicted as negative when it is actually positive (Witten and Frank, 2011). Here, the MCC reported for each tree species is the average over the ten iterations.

$\mathrm{MCC} = \dfrac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$  (1)
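As a concrete illustration, Eq. (1) can be computed one-vs-rest for each species from a model's 5 × 5 confusion matrix (rows: true species, columns: predicted). The helper below is a minimal sketch, not the authors' evaluation code.

```python
# One-vs-rest MCC per species from a confusion matrix, implementing Eq. (1).
import numpy as np

def mcc_per_species(cm):
    """cm[i][j]: number of images of true species i predicted as species j."""
    cm = np.asarray(cm, dtype=float)
    mccs = []
    for k in range(cm.shape[0]):
        tp = cm[k, k]
        fn = cm[k, :].sum() - tp          # species k predicted as something else
        fp = cm[:, k].sum() - tp          # other species predicted as species k
        tn = cm.sum() - tp - fn - fp
        denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        mccs.append((tp * tn - fp * fn) / denom if denom else 0.0)
    return mccs
```

The per-species values reported in Table 2 correspond to such MCCs averaged over the ten iterations.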

RESULTS

Classification Accuracy for Both Training and Test Data

Table 2 shows the classification accuracy based on differences in learning patterns, image types, and learning iterations. The classification accuracy for the training data ranged from 0.881 (LM-1, grayscale, 50 epochs) to 0.998 (LM-1, color, 500 epochs) (Fig. 4). Similarly, that of the test data ranged from 0.815 (LM-2, grayscale, 50 epochs) to 0.994 (LM-4, color, 500 epochs) (Fig. 4). Both the training and test data showed remarkably high classification accuracy. In LM-1, the MCC was relatively low when the number of epochs was low and tended to improve as the number of epochs increased. The MCC in LM-2 was lower than that of the other learning models; similar to LM-1, it tended to improve as the number of epochs increased. The MCC in LM-3 indicated extremely high classification accuracy for all image types, although for color images in LM-3, 100 epochs (MCC = 0.990) showed slightly lower accuracy than 50 epochs (MCC = 0.993). The MCC values in LM-4 were similar to those of LM-3. Color images had the highest classification accuracy among all learning models, and the MCC for all epoch numbers was over 0.9 when using color images, even though there was much more test data than training data in LM-2. In both LM-1 and LM-2, the classification accuracy of binary images was higher than that of grayscale images, while in LM-3 and LM-4, the classification accuracy of grayscale images was similar to that of binary images, except for the case of 500 epochs in LM-3.
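The Fig. 4 values are simple means of the five per-species MCCs in Table 2; for example, the 0.815 minimum quoted above follows from the grayscale LM-2 row at 50 epochs:

```python
# Mean MCC over the five species for grayscale LM-2 at 50 epochs (Table 2).
row = [0.687, 0.958, 0.946, 0.813, 0.672]
print(round(sum(row) / len(row), 3))  # -> 0.815
```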

Table 2 Classification accuracy according to the differences in learning patterns, image types, and learning iterations
(1) Color image
Learning models Number of data Number of epochs Matthews correlation coefficient
Training Test Cinnamomum camphora Ilex integra Pittosporum tobira Quercus glauca Quercus myrsinifolia
1 1,000 1,000 50 0.966 1.000 0.996 0.962 0.928
100 0.981 1.000 1.000 0.999 0.981
500 0.996 1.000 1.000 0.999 0.996
2 1,000 9,000 50 0.890 0.991 0.905 0.936 0.812
100 0.922 0.994 0.894 0.944 0.834
500 0.934 0.996 0.920 0.941 0.862
3 9,000 9,000 50 0.990 1.000 1.000 0.999 0.989
100 0.989 1.000 1.000 0.999 0.988
500 0.993 1.000 1.000 1.000 0.993
4 9,000 1,000 50 0.985 0.999 1.000 0.998 0.982
100 0.983 0.999 0.994 0.991 0.981
500 0.989 0.999 0.998 0.997 0.986
(2) Grayscale image
Learning models Number of data Number of epochs Matthews correlation coefficient
Training Test Cinnamomum camphora Ilex integra Pittosporum tobira Quercus glauca Quercus myrsinifolia
1 1,000 1,000 50 0.825 0.978 0.997 0.860 0.747
100 0.854 0.989 0.996 0.906 0.828
500 0.936 0.998 0.999 0.967 0.931
2 1,000 9,000 50 0.687 0.958 0.946 0.813 0.672
100 0.721 0.957 0.911 0.863 0.769
500 0.793 0.967 0.937 0.890 0.827
3 9,000 9,000 50 0.951 0.998 0.999 0.969 0.950
100 0.969 0.999 1.000 0.979 0.969
500 0.988 0.999 0.999 0.992 0.988
4 9,000 1,000 50 0.936 0.993 0.997 0.954 0.929
100 0.955 0.996 0.999 0.962 0.955
500 0.980 0.997 0.999 0.983 0.975
(3) Binary image
Learning models Number of data Number of epochs Matthews correlation coefficient
Training Test Cinnamomum camphora Ilex integra Pittosporum tobira Quercus glauca Quercus myrsinifolia
1 1,000 1,000 50 0.899 0.975 0.997 0.909 0.835
100 0.950 0.989 0.999 0.945 0.916
500 0.974 0.999 1.000 0.981 0.966
2 1,000 9,000 50 0.866 0.924 0.984 0.834 0.754
100 0.888 0.926 0.972 0.883 0.847
500 0.898 0.941 0.975 0.894 0.871
3 9,000 9,000 50 0.972 0.996 0.999 0.969 0.955
100 0.986 0.998 1.000 0.976 0.970
500 0.993 0.999 1.000 0.985 0.982
4 9,000 1,000 50 0.961 0.974 0.993 0.950 0.931
100 0.972 0.983 0.997 0.960 0.954
500 0.980 0.975 0.996 0.964 0.959

Note: Underlines in each table show that the MCC equals 1.00.

Fig. 4

Classification accuracy according to the average for five species.

Classification accuracy for each tree species varied according to the image type. Using color images, the MCC of I. integra in LM-1 and both I. integra and P. tobira in LM-3 were 1.00 for all numbers of epochs (Table 2). The classification accuracy of Q. myrsinifolia was lower than that of other tree species for all learning model types; in particular, that of LM-2 ranged from 0.812 (50 epochs) to 0.862 (500 epochs), which was much lower than in other cases. Using grayscale images, only the MCC of P. tobira in LM-3 (100 epochs) was 1.00 (Table 2). Most MCC values for both I. integra and P. tobira were over 0.99, and the lowest was 0.911 (P. tobira, LM-2, 100 epochs). Similar to the color images, the classification accuracy in LM-2 was relatively low overall, and MCC values for both C. camphora and Q. myrsinifolia with 50 epochs were much lower than those of other tree species (0.687 and 0.672, respectively). Using binary images, some MCC values for P. tobira were 1.00 (Table 2); this tree species tended to be identified with higher accuracy than other species for both color and grayscale images. The MCC values of C. camphora, Q. glauca, and Q. myrsinifolia were lower than those of other tree species in the same image types, while most MCCs in binary images were higher than those of the grayscale images.

Tendency for Misclassification of Each Tree Species

Figure 5 illustrates the misclassification patterns according to tree species. Using color images, LM-1 correctly classified all tree species except for C. camphora, which was misclassified as Q. myrsinifolia (12 misclassifications), and Q. myrsinifolia, which was misclassified as Q. glauca (2). The misclassification in LM-2 using color images was greater than that of the other learning models. In LM-3 using color images, all tree species were correctly classified except for C. camphora, which was misclassified as Q. myrsinifolia (196), and Q. myrsinifolia, which was misclassified as Q. glauca (7). In LM-4 using color images, Q. glauca was correctly classified. Overall, C. camphora and Q. myrsinifolia tended to be misclassified as Q. myrsinifolia and Q. glauca, respectively; most Q. glauca, I. integra, and P. tobira images were correctly classified for all training data, while misclassifications of I. integra in the test data were very rare.

Fig. 5

Tendency for misclassification by learning patterns.

C.c: Cinnamomum camphora; I.i: Ilex integra; P.t: Pittosporum tobira; Q.g: Quercus glauca; Q.m: Quercus myrsinifolia. CL: Color image; GS: Grayscale image; WB: Binary image.

For grayscale images, misclassifications occurred in all tree species in LM-1 and LM-2. Generally, classifications using grayscale images were similar to those using color images; however, misclassification rates were higher than those of color images, and there were numerous misclassification types not observed when using color images.

In LM-1 using binary images, I. integra and P. tobira were correctly classified. Conversely, LM-2 had many cases of misclassification overall. Classifications made using binary images were similar to those using color and grayscale images, and misclassification rates were lower than those of grayscale images. There were some misclassification types that were not observed in the color images; for example, P. tobira and Q. glauca were misclassified as I. integra.

In general, the most frequent misclassifications were of C. camphora as Q. glauca or Q. myrsinifolia, and of Q. myrsinifolia as Q. glauca, in both the training and test data. Misclassification of I. integra was very limited, although it was occasionally misclassified as Q. glauca. How P. tobira and Q. glauca tended to be misclassified depended on the image type.

DISCUSSION

We found that CNNs are among the most effective methods for tree identification because they can identify test data with high accuracy. This study included diverse leaf images from the same species by randomly extracting 20–30 1.0 × 1.0 cm images from various parts of a given leaf, which resulted in high-accuracy identification models for the training data. Similarly, Wilf et al. (2016) used scale-invariant feature transform as image features for a support vector machine classifier to identify trees and achieved 72% accuracy for 19 botanical families with ≥100 images. Although the present study differed in terms of the amount of training data and types of species, our CNN model still showed high classification accuracy for both training and test data.

In addition, the CNN model can identify tree species with high accuracy even without specifying where on the leaf the data should be extracted. Thus, the CNN model offers advantages for smartphone users, because a photograph of the entire leaf is not required. We speculate that the much lower classification accuracy in LM-2 was due to the amount of verification data (9,000 samples) being considerably greater than the amount of training data (1,000 samples), and to the fact that the training and test images were randomly extracted from various parts of the leaf. Although the classification accuracy in LM-2 was lower than that of the other learning models overall, it was much higher than that of previously proposed models (Minowa et al., 2011, 2019; Minowa and Asao, in press), although the number of tree species in this study was lower. A previous study using a CNN to identify five types of weeds reported classification accuracies of 41% to 100% (Shindo et al., 2018), which is very low compared with the results of the present study, although the learning conditions differed (e.g., algorithms and image types); the authors suggested that a high classification accuracy cannot be assured merely because the number of target species is low. Nevertheless, the classification accuracy of both the training and test data was high in the present study.

Among image types, classification accuracy was highest for color images and lowest for grayscale images. The amount of information decreases in the order of color, grayscale, and binary, so we expected classification accuracy to decrease in the same order. In the present study, however, grayscale images provided lower accuracy than binary images. In addition, when users actually photograph a leaf with a smartphone, color images may be greatly influenced by the photo environment (Sato, 2011). Thus, when developing a tree-identification system for mobile terminals, binary images should be considered because they provide robust results under different environmental conditions at the cost of slightly lower accuracy than color images (Sato, 2011).

CONCLUSIONS

In this study, we identified tree species based on leaf images using a CNN. Although we used sub-regional leaf images composed mainly of venation, it was possible to identify tree species with high classification accuracy from sample data comprising only partial leaf images randomly extracted from the whole leaf. Compared with tree identification using decision-tree or neural-network models based on leaf shapes (Minowa et al., 2011, 2019; Minowa and Asao, in press), image recognition with deep-learning models such as a CNN makes it possible to build a classification model without extracting image features from the leaf image beforehand. The proposed CNN models show higher classification accuracy than previously proposed models (Minowa et al., 2011, 2019; Minowa and Asao, in press). In addition, the CNN is among the most effective techniques for building an auto-tree-identification system for mobile terminals because it can identify a tree from part of a leaf image, without the whole leaf. However, this study had some limitations; namely, only five tree species were used, and we did not verify the amount of training data necessary to identify tree species. Thus, future studies should consider a larger number of tree species, the types of leaf images used for training data (e.g., venation patterns or image types), differences in deep-learning algorithms, and differences in the amounts of training and test data.

ACKNOWLEDGEMENTS

We are grateful to the Terra Green Network (TGN) project members for their helpful comments and discussions at the TGN meeting, and to Ayumi Taniguchi at Kyoto University for support with tree sampling. Two anonymous referees provided valuable comments on earlier drafts of the manuscript.

LITERATURE CITED
 
© 2020 Japan Society of Forest Planning