Engineering in Agriculture, Environment and Food
Online ISSN : 1881-8366
ISSN-L : 1881-8366
Improvement of object detection in rice field environment with a fisheye-lens camera for robot combine based on a guided training method
Sikai CHEN, Michihisa IIDA, Jiajun ZHU, Masahiko SUGURI, Ryohei MASUDA

2026 Volume 19 Issue 1 Pages 1-5

Abstract

To achieve fully autonomous harvesting by robot combines, recognizing the external environment in rice fields is essential. Two deep neural networks, FCNResNet50 and FCNVGG16, were used to detect rice lodging. In this study, a fisheye-lens camera was deployed to capture a wider range of images. However, fisheye-lens cameras suffer from distortion across the imaging frame. To address this, we propose training models by adding images cropped from a specific, low-distortion area to the primitive dataset. With these cropped images, the models achieved improved performance when tested on original scenes: FCNResNet50 was 1–5 % better, while FCNVGG16 improved by 10 % on average. Mean intersection over union (IoU), pixel accuracy, and class accuracy also improved.

1. Introduction

To fill the gap between labor demand and the present labor force, Japan began using robotic combines in 2019. However, as reported by Iida et al. (2013a, 2013b), these combines could only operate along preset routes using a global navigation satellite system (GNSS). At that early stage, robotic combines lacked the ability to detect external environments.

Fisheye-lens cameras are often used in around-view systems in automobiles to monitor the vehicle's surroundings for safety. An around-view system consists of four fisheye-lens cameras: one installed at the front, one on each side, and one at the rear, as in private cars. Much research has reported using around-view systems to detect external environments with AI (Deng et al., 2019; Manzoor et al., 2024). Besides cameras, a 3D LiDAR mounted on top of the vehicle has been used to detect the external environment and monitor the surroundings (Yogamani et al., 2024). Alternatively, a combination of cameras and LiDAR can be used for task execution and collision avoidance (Iida et al., 2018).

Although fisheye-lens cameras have a wider field of view (FOV) than normal-lens cameras, they exhibit more severe distortion over the whole image; distortion worsens toward the edges, compressing the imaging of objects there. Because of this distortion, the same object can appear with different shapes depending on its position in the image plane (Blott et al., 2019; Cho et al., 2023). Kokilepersaud et al. (2023) further analyzed how different parts of the image are affected, highlighting that neural network models require significantly more training data to accurately recognize a single object. In typical use, a fisheye-lens camera is fixed at one spot as a surveillance camera, so the imaging of an object varies within the frame: as an object approaches the camera, its appearance changes greatly with distance. This is one difficulty of using fisheye-lens cameras for object detection.

To enable the robot combine to detect external environments, semantic segmentation was applied to determine whether lodging rice exists in front of the combine. Research on AI detection for combines using a rectilinear camera evolved from Li et al. (2020a), who detected four classes and then verified the method in fields (Li et al., 2020b), to Zhu et al. (2022, 2024), who detected seven classes with rectilinear cameras at high accuracy, with a mean IoU (intersection over union) of about 0.8. However, the monitoring range was limited; for large machinery such as a combine, fisheye-lens cameras are a better option.

It is time-consuming and tedious for researchers to collect data and preprocess images for training and testing. Fisheye images in particular require more effort to reach the same level of detection performance as rectilinear images. Many researchers have extended feature variety based on original images and achieved good results (Kumar et al., 2023; Huang et al., 2023; Playout et al., 2021). However, in our previous research, using transformed images did not perform as well as expected (Chen et al., 2023); using uniform fisheye images was optimal. This time, different regions of the image were used. This approach saves time and effort, as it eliminates additional tasks such as manual labeling. To effectively train for and detect an object, a large variety of fisheye images is required, meaning the same object must appear at different positions within the camera frame. By combining full images with selected regions of those images, neural networks may achieve improved performance.

This research aims to 1) use cropped images to amplify feature varieties for improving model performance on fisheye images, and 2) decrease preparation time and effort on datasets.

2. Materials and methods

2.1. Experimental equipment

Fisheye images were collected while the combine harvester was operating in a rice field. The combine used in this research was a Kubota WRH1200A (Kubota Corp., Japan). A fisheye-lens camera was fixed on the left side of the combine's cabin, angled 15 ° downward from horizontal, serving as the in-vehicle camera. This fisheye-lens camera has a large FOV of 210 ° horizontally and 170 ° vertically, compared with about 50 ° for most rectilinear cameras. A desktop PC was used to train and test the neural network models. Specifications of the PC are shown in Table 1.

Table 1 Desktop PC specs

Software              Hardware
Name      Version     Type    Model
Windows   10          CPU     i7-8700
Python    3.7.7       GPU     RTX2060 Super
Pytorch   1.5.0       RAM     DDR4 16 GB
CUDA      10.2
cuDNN     7.6.4

2.2. Semantic segmentation

Semantic segmentation (Long et al., 2015) is a pixel-level classification method: it analyzes every pixel and assigns all pixels sharing the same characteristics to one cluster. FCNVGG16 (Simonyan et al., 2014) and FCNResNet50 (He et al., 2016) were used to classify the objects in the fisheye images at the pixel level. Two example image-label pairs are shown in Fig. 1.

Fig. 1 Images and corresponding labels of semantic segmentation

2.3. Dataset

Two datasets were used: one contained only fisheye-lens images, and the other contained both fisheye-lens images and cropped images; details are shown in Table 2. The original size of images captured by the fisheye-lens camera was 1280 × 960 pixels, and the images fed to the networks were 640 × 480 pixels. Resizing causes a loss of information, whereas the cropped images, cut from the original images at 640 × 480 pixels, retain the original features. In addition, the ratio of trained pixels rises because background areas decrease, which benefits the training of neural networks. Compared with the transformation method, this approach takes almost no time beyond program processing time.
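The two preprocessing routes above can be sketched as follows; NumPy arrays stand in for the camera frames, and the centered crop window is an assumption, since the paper does not give the exact crop coordinates.

```python
import numpy as np

def make_training_images(frame, crop_xy=(320, 240)):
    """Return (resized, cropped) 640 x 480 images from a 1280 x 960 frame.

    resized: naive 2x downsample, which discards fine detail.
    cropped: native-resolution cut from the low-distortion central
             region; crop_xy is an assumed top-left corner.
    """
    assert frame.shape[:2] == (960, 1280)
    resized = frame[::2, ::2]          # every second row and column
    x0, y0 = crop_xy
    cropped = frame[y0:y0 + 480, x0:x0 + 640]
    return resized, cropped

frame = np.zeros((960, 1280, 3), dtype=np.uint8)
resized, cropped = make_training_images(frame)
print(resized.shape, cropped.shape)  # (480, 640, 3) (480, 640, 3)
```

Both outputs share the network input size, so cropped images can be mixed into the training set without any change to the model or to the existing labels beyond cropping them the same way.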

Table 2 Dataset composition

           Feature                            Training   Validation   Test
Dataset1   Fisheye-lens images                784        195          195
Dataset2   Fisheye-lens and cropped images    1,491      372

Figure 2 shows a fisheye-lens image and its cropped image. Cropped images use the central part of the frame, which has less distortion and uncompressed details. These two differences might be the key to improving model performance.

(a) Original image (1280 × 960)
(b) Cropped image (640 × 480)
Fig. 2 An original image and its cropped image

The other important factor was the ratio of background, which was treated as "not to be trained"; in other words, smaller background areas are better for training. In the original image in Fig. 3, the yellow areas show background and red marks the areas to be trained. The red area is very small compared with the yellow area, but in the cropped image the ratio rises considerably. During training, the model therefore memorizes more effective information, which is expected to yield greater performance.
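These class ratios can be measured directly from the label maps. The sketch below uses a hypothetical class-index order (the paper's label encoding is not specified) to compute per-class pixel ratios for one label map:

```python
import numpy as np

# Hypothetical index order for the seven classes of Table 3.
CLASSES = ["background", "harvested", "unharvested",
           "lodging", "ridge", "human", "other_combine"]

def class_pixel_ratios(label):
    """Fraction of pixels belonging to each class in an integer label map."""
    counts = np.bincount(label.ravel(), minlength=len(CLASSES))
    return dict(zip(CLASSES, counts / label.size))

# Toy 2 x 4 label map: 4 background, 2 lodging, 2 ridge pixels.
toy = np.array([[0, 0, 3, 3],
                [0, 0, 4, 4]])
print(class_pixel_ratios(toy)["background"])  # 0.5
```

Averaging these per-image ratios over a dataset gives figures directly comparable to the columns of Table 3, making it easy to check how much background a candidate crop region removes before committing to it.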

Fig. 3 Areas of background (yellow) and informative area (red), in the black frame is the region for cropped images

Table 3 shows the ratio of each class. Mixing cropped images into Dataset1 raised the ratio of effective information and decreased the ratio of background, increasing effective information by about 10 percentage points.

Table 3 Training ratio of each class (%)

                         Original images   Cropped images   Dataset2
                         (Dataset1)
Background               76                55               66
Harvested rice areas      6                11                8
Unharvested rice areas    9                17               13
Lodging rice areas        5                11                8
Ridge areas               3                 6                5
Humans                    0                 0                0
Other combines            0                 0                0

3. Results and discussion

Tables 4 and 5 show the test results of FCNVGG16 and FCNResNet50 on 195 fisheye-lens images.

In Table 4, six classes in total (excluding "background") were tested and evaluated, and each model was trained with the two datasets. For both models, Dataset2 yields better results than Dataset1 on almost all classes.

Table 4 Test results, IoU of each class

Model                    FCNVGG16              FCNResNet50
Training set             Dataset1   Dataset2   Dataset1   Dataset2
Harvested rice areas     0.730      0.794      0.790      0.802
Unharvested rice areas   0.642      0.745      0.797      0.811
Lodging rice areas       0.665      0.830      0.832      0.879
Ridge areas              0.650      0.744      0.747      0.775
Humans                   0.730      0.794      0.790      0.802
Other combines           0.987      0.987      0.987      0.987

Table 5 shows the overall performance evaluated by mean IoU, mean class accuracy, and pixel accuracy. IoU (intersection over union) integrates the class, size, and position of detected regions. Pixel accuracy and class accuracy measure how many pixels and classes, respectively, are correctly detected.
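As a sketch of how these three metrics relate, assuming they follow the standard confusion-matrix definitions (the paper does not spell out its exact formulas) and that every class occurs at least once in the test set:

```python
import numpy as np

def segmentation_metrics(conf):
    """Mean IoU, mean class accuracy, and pixel accuracy from a (C, C)
    confusion matrix, where conf[t, p] counts pixels of true class t
    predicted as class p. Assumes every class appears at least once."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)                                 # correct pixels per class
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp   # pred + true - overlap
    mean_iou = (tp / union).mean()
    mean_class_acc = (tp / conf.sum(axis=1)).mean()    # recall per true class
    pixel_acc = tp.sum() / conf.sum()
    return mean_iou, mean_class_acc, pixel_acc

# Toy 2-class example: 3 + 5 correct pixels, 2 misclassified.
conf = np.array([[3, 1],
                 [1, 5]])
miou, macc, pacc = segmentation_metrics(conf)
print(round(pacc, 2))  # 0.8
```

Note that mean IoU penalizes both missed and spurious pixels through the union term, which is why it is usually the lowest of the three figures in Table 5.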

Table 5 Model performance

Model                 FCNVGG16              FCNResNet50
Training set          Dataset1   Dataset2   Dataset1   Dataset2
Mean IoU              0.768      0.840      0.847      0.863
Mean class accuracy   0.769      0.832      0.838      0.849
Pixel accuracy        0.950      0.974      0.978      0.977

From the results above, models trained with Dataset2, which includes cropped images, performed better than those trained with Dataset1: FCNVGG16 was 10.6 %, 8.3 %, and 2.6 % greater in mean IoU, mean class accuracy, and pixel accuracy, while FCNResNet50 was 1.9 %, 1.3 %, and 0.0 % greater, respectively. Cropped images do help the network models achieve higher accuracy. As a result, using both fisheye-lens images and cropped images gives the models better performance.

Two videos from different scenes were tested to verify the stability of the models; one video was 25 s long, the other 12 s. In the video test, the unit was one second: if a visible false detection, such as the one shown in the first row of the Dataset1 results in Fig. 4, was noticed, that second was counted as "negative". The results are shown in Table 6. For the FCNVGG16 models, the number of "negative" seconds dropped from 22 (out of 37 s) to 9, i.e., 41 % of the Dataset1 result. For FCNResNet50, it dropped from 18 to 7, 39 % of the Dataset1 result.

Fig. 4 Comparison between models trained by different datasets
Table 6 Negative times in video test for detection stability evaluation

Model            FCNVGG16              FCNResNet50
Training set     Dataset1   Dataset2   Dataset1   Dataset2
Video 1 (25 s)   15         6          11         4
Video 2 (12 s)   7          3          7          3
Total            22         9          18         7

Fig. 4 contains four sets of images showing the differences between the two neural networks and the two training sets. Images in the second row are cropped from the first row. In the first row of the Dataset1 results, both models show false detection; in the second row, the performance is better. The third and fourth rows show that the models perform well on three different classes. In the Dataset2 results, all three sets show good detection performance; in the third set, FCNVGG16 performed slightly worse than FCNResNet50 in the left part of the image.

Some conclusions can be drawn from comparisons within each architecture. The model trained with Dataset2 produced better results than the one trained with Dataset1 owing to fewer false detections. Similarly, comparing the first and third columns, the FCNResNet50 model trained with Dataset2 was better except in the third row; meanwhile, some false detections at the bottom of the image disappeared after applying Dataset2.

By applying cropped images, model performance improved in most cases: the detection area was more stable and the detection more accurate. Detection in corner areas, previously weak, also improved after using the cropped dataset. FCNVGG16 showed visible improvement in both test values and image results; for FCNResNet50, although the change in test values was small, the image results improved markedly. Overall, this method can improve model performance with minimal workload.

Another important factor not fully addressed here was lighting conditions. The dataset includes images captured under various lighting scenarios (front light, back light, and side light), but their quantity was insufficient for a more detailed analysis. Moreover, intense sunlight between 10 a.m. and 2 p.m. often hindered the camera's ability to capture clear images, further limiting the dataset's quality and diversity.

4. Conclusions

In this paper, an improved method for constructing the training dataset was proposed and tested. The workload was minimal, and no extra annotation work was necessary. According to the results of the two models, FCNVGG16 and FCNResNet50, this dataset helped the models achieve higher accuracy in detecting almost all classes. For the FCNVGG16 model, mean IoU was 10.6 % greater; despite only a 1.9 % increase for the FCNResNet50 model, its performance in the video test was remarkable, with noticeable false detections dropping by about 40 % for both models. For the in-vehicle fisheye-lens camera, training with this dataset had a positive effect, and overall performance improved. In the future, this improved method should be applied first. Due to the limited size of the dataset, it remains unclear how much improvement this method can ultimately bring, or whether it might even negatively affect performance. Furthermore, the optimal ratio of cropped images within the dataset requires further investigation.

Acknowledgments

This work was supported by JST SPRING, Grant Number JPMJSP2110.

Declaration of conflicting interests

The authors declare no conflicts of interest.

Notes

(URLs on references were accessed on 2 October 2025.)

References
 
© Asian Agricultural and Biological Engineering Association

This article is licensed under a Creative Commons [Attribution 4.0 International] license.
https://creativecommons.org/licenses/by/4.0/