Proceedings of the Japan Academy, Series B
Online ISSN : 1349-2896
Print ISSN : 0386-2208
ISSN-L : 0386-2208

This article has now been updated. Please use the final version.

f4-statistics-based ancestry profiling and convolutional neural network phenotyping shed new light on the structure of genetic and spike shape diversity in Aegilops tauschii Coss.
Yoshihiro KOYAMA Mizuki NASUYoshihiro MATSUOKA
Author information
JOURNAL OPEN ACCESS FULL-TEXT HTML Advance online publication
Supplementary material

Article ID: pjab.101.023

Details
Abstract

Aegilops tauschii Coss., a progenitor of bread wheat, is an important wild genetic resource for breeding. The species comprises three genetically defined lineages (TauL1, TauL2, and TauL3), each displaying distinctive phenotypes in various agronomic traits, including spike shape. In the present work, we studied the relationship between population structure and spike shape variation patterns using a collection of 249 accessions. f4-statistics-based ancestry profiling confirmed the previously identified lineages and revealed a genetic component derived from TauL3 in the genomes of some southern Caspian and Transcaucasus TauL1 and TauL2 accessions. Spike shape variation patterns were analyzed using a convolutional neural network-based approach, trained on green and dry spike image datasets. This analysis showed that spike shape diversity is structured according to lineages and demonstrated that the lineages can be distinguished based on spike shape. The implications of these findings for the origins of common wheat and the intraspecific taxonomy of Ae. tauschii are discussed.

Tracing origins through spike shape using machine learning

Wild wheat, Aegilops tauschii, harbors a vast reservoir of alleles that remain untapped in breeding programs. These alleles could contribute valuable traits, such as drought tolerance and disease resistance, when introduced into bread wheat. To fully harness this genetic diversity, it is crucial to rapidly identify strains carrying potentially beneficially alleles. In Ae. tauschii, which is composed of strain groups (lineages) with unique genetic makeups, this can be done by determining a strain’s lineage based on spike shape. In this work, we trained machine learning models to predict lineage based on spike morphology and found that spike shape diversity mirrors lineage structure. These models demonstrated potential for practical use in assigning strains to their respective lineages based on spike shape. This work opens new avenues for the application of machine learning in wheat improvement, as well as in the genetic and evolutionary studies of wheat morphology.

1. Introduction

Aegilops tauschii Coss. (formerly known as Ae. squarrosa L.) is a wild annual diploid wheat species native to central Eurasia. Its range extends from Syria and eastern Turkey to Central Asia and China, passing through the Caucasus, northern Iran, Turkmenistan, Afghanistan, and Pakistan. The species’ geographic distribution pattern is distinctive because all other diploid species of the genus Aegilops spread westward from the Transcaucasia-northern Iran region. The habitats of this self-pollinating morphologically variable species are ecologically diverse: roadsides, wastelands, arable lands, grasslands, craggy slopes, temperate forests, and sandy seashore.1)3) Taxonomically, Ae. tauschii has two subspecies, Ae. tauschii Coss. subspecies tauschii and Ae. tauschii Coss. subspecies strangulata (Eig) Tzvel.4) Subspecies tauschii is described as having cylindrical spikes and elongated cylindrical spikelets, whereas subspecies strangulata is described as having pearly spikes and thick and bulbous spikelets. Some recent monographs do not formally describe these subspecific taxa, based on the observation that intermediate forms are often encountered.1),3)

Ae. tauschii (DD genome) is agronomically important because the species is thought to have given rise to common wheat (Triticum aestivum L., AABBDD genome) as the male parent through natural hybridization with T. turgidum L. (AABB genome) and its subsequent genome doubling of the F1 hybrid.5)8) Studies to understand the natural variation patterns of Ae. tauschii provide a valuable basis for wheat improvement because this species serves as a primary genetic resource in breeding programs.9) So far, much work has been done to clarify the genetic structure of Ae. tauschii.10)17)

Using nuclear and chloroplast DNA polymorphisms, recent population structure studies have shown that Ae. tauschii exhibits a genetically and geographically distinctive lineage structure: two large lineages (TauL1 and TauL2) and a small lineage (TauL3).18),19) TauL1 is widespread within the species range; its sublineage TauL1a occurs mainly in the western part, and TauL1b occurs mainly in the eastern part. TauL2, which is restricted to the western part of the species range with a few exceptions in Central Asia,20),21) has two sublineages: TauL2a, which tends to occur in the western part, and TauL2b, which tends to occur in the eastern part.22) TauL3 is currently known from Georgia only.

The genetic distance to the common wheat D genome is shorter in TauL2 and TauL3 than in TauL1. Along with other lines of evidence, this suggests that TauL2 may share a relatively recent ancestor with the common wheat D genome.18),23) Furthermore, recent studies have demonstrated the composite nature of the common wheat D genome.24),25) It primarily consists of a mosaic of TauL2 sublineage genomes interspersed with components from the TauL1 and TauL3 genomes. In addition, common wheat accessions with negligible to low proportions of the TauL3 component are spread across Eurasia and northern Africa, while those with substantial proportions are mainly confined to the Transcaucasus and nearby regions,24) suggesting a complex history involving the Ae. tauschii lineages in the formation of the bread wheat D genome.

No morphological trait that distinguishes those lineages is currently known. In addition, how those lineages relate to the subspecies is not clear. For example, Kihara and Tanaka26) classified only plants with markedly moniliform (i.e., pearly) spikes as subspecies strangulata (i.e., subspecies strangulata sensu stricto) and all the rest as subspecies tauschii (Fig. 1). This approach had an advantage in unambiguously assigning the intermediate forms to subspecies tauschii. However, recent studies showed that the subspecies defined using these criteria did not coincide with the major lineages TauL1 and TauL2: subspecies tauschii consists of TauL1 and a part of TauL2, whereas subspecies strangulata was nested within TauL2.27) In this case, subspecies strangulata may be viewed as a form derived from subspecies tauschii within TauL2.

Fig. 1.

(Color online) Green spikes of Ae. tauschii accessions. A cylindrical spike of IG 48747 (TauL1a) (far left), mildly moniliform spikes of KU-20-8 (TauL2a) (center left) and KU-20-10 (TauL2b) (center right), and a markedly moniliform spike of KU-2091 (TauL2b) (far right). Of these accessions, subspecies strangulata sensu stricto includes only KU-2091, while subspecies strangulata sensu lato includes KU-20-8 and KU-20-10, as well as KU-2091. Scale bar: 1 cm.

Another approach to dealing with the intraspecific classification was to include plants with markedly and mildly moniliform spikes in subspecies strangulata (subspecies strangulata sensu lato) and all the rest as subspecies tauschii. These criteria were found useful, because they made subspecies tauschii and subspecies strangulata largely correspond to TauL1 and TauL2, respectively.17) However, it is unclear to what extent the criteria are usable to classify TauL3. The use of the subspecies index “SI” (glume width/rachis segment width ratio) appears to be in line with these criteria: subspecies tauschii, defined by SI, corresponded to TauL1, whereas subspecies strangulata corresponded to TauL2 and TauL3.28),29) With these criteria, subspecies tauschii may be considered as the derived form of subspecies strangulata.30)

Each Ae. tauschii lineage displays distinctive phenotypes in various agronomic traits, including flowering time, salt tolerance, and disease resistance.31)33) Thus, methods for correctly assigning Ae. tauschii plants to specific lineages based on morphological grounds would be valuable in breeding programs that rely on Ae. tauschii as a source of novel alleles. Furthermore, if morphological traits that distinguish the lineages are identified, they may provide avenues for refining the within-species systematics in accordance with the genealogical evidence. To address these issues, it is important to clarify the extent to which spike shape, one of the key traits in the intraspecific classification of species, distinguishes the Ae. tauschii lineages.

In the present work, we studied the relationship between population structure and spike shape variation patterns using a collection of 249 Ae. tauschii accessions. Population structure was examined based on single nucleotide polymorphism (SNP) genotypes that were obtained using the genotyping by random amplicon sequencing-direct (GRAS-Di) method.34) Spike shape variation patterns were analyzed using a phenotyping method based on a convolutional neural network (CNN), which is a type of neural network capable of recognizing patterns in images. This method had an advantage in classifying the spike shape continuum between cylindrical and pearly because it can automatically extract spike shape features from digital images and develop classification models based on supervised training that uses the “lineages” (i.e., TauL1, TauL2, and TauL3) as the labels. Furthermore, explainable artificial intelligence (XAI) algorithms such as Gradient-weighted Class Activation Mapping (Grad-CAM)35) provided visual explanations for the classification models. Therefore, the use of this method had the potential to provide novel insights through computer vision that can complement human vision into how the spike shape differences are associated with the lineage structure.

Our goals in the present study were (1) to clarify the population structure of Ae. tauschii based on the genome-wide SNP genotypes of the accessions and discuss its implications for the origins of common wheat, and (2) to elucidate the structure of spike shape diversity using CNN phenotyping and to discuss its implications for applications in breeding and the intraspecific classification of Ae. tauschii.

2. Materials and methods

2.1. Plant materials.

A total of 249 Ae. tauschii accessions, which had been maintained in our laboratory by selfing since their introduction, were used (Supplementary Table S1). All accessions, except for AT 47, AT 55, AT 60, AT 76, AT 80, and AL8/78, were available from the respective genebanks. We used 211 accessions representing the entire natural species range for population structure analysis and CNN modeling. The remaining 38 accessions, sampled in North Caucasus (Dagestan and Russia), were used for testing the validity of CNN modeling in predicting the lineages based on spike shapes. When the geographical coordinates of the sampling sites were not available in the passport data, we estimated the latitude and longitude using Google Maps (https://maps.google.com/) based on locality information.

2.2. Spike samples.

For each accession, well-developed, non-disarticulated spikes were sampled about 14 days after flowering from a single healthy plant grown individually in a pot in a greenhouse during the winter-to-spring season. For 199 accessions, dried spike samples were prepared by air-drying well-developed, non-disarticulated spikes for about six months in envelopes.

2.3. SNP genotyping.

Total DNA was extracted from the young leaves of a single plant using the CTAB (cetyltrimethylammonium bromide) method.36) For each accession, about 10 million raw reads (150 base-pair long, paired-end) were obtained at Eurofins Genomics, Inc (Tokyo, Japan) using the GRAS-Di method.34) The sequencing libraries were prepared by pooling the final product of two sequential polymerase chain reaction steps performed for each accession, using total DNA as the template and a mix of 12 random primers. The raw reads were generated from the libraries using a NovaSeq 6000 instrument (flow cell type S4) and were subsequently adaptor-trimmed and quality-filtered (by setting the ILLUMINACLIP option to SLIDINGWINDOW:4:30 MINLEN:50) using Trimmomatic.37)

Quality reads were aligned to the Ae. tauschii AL8/78 reference genome sequence (Aet v5.0)38) using BWA version 0.7.18 (r1243) with the mem option.39) Reads failing to align to the chromosomal sequences were excluded from further analysis. Aligned reads with library insert sizes shorter than 70 base pairs in length or larger than 600 base pairs in length were filtered out using SAMtools version 1.20,40) applying the command 'samtools view -e '((pnext + 150) - pos) >= 70 && ((pnext + 150) - pos) <= 600''. Duplicated reads were identified and removed using SAMtools. Subsequently, SNPs were called across the accessions relative to the reference genome sequence using BCFtools mpileup version 1.2041) with the setting '-q 30' (minimum mapping quality) and BCFtools call with the setting '-G' (Hardy-Wineberg equilibrium not assumed). Filtering was performed using BCFtools with the setting '-i DP>=5 & MQ >= 40'. Sites with a minor allele frequency of less than 5% and missing data frequency exceeding 20% were further filtered out using VCFtools version 0.1.16.42) The resultant SNPs were pruned using PLINK version 1.90b7, setting the '--indep-pairwise' option to '20000 2000 0.5' for subsequent analyses.43)

2.4. Principal component analysis.

Principal component analysis (PCA) was performed based on a covariance matrix generated from the SNP genotypes of the accessions, with the genotype at each site coded as ‘-1’ for homozygous reference, ‘0’ for heterozygous, and ‘1’ for homozygous alternate, using the probabilistic PCA method available in the pcaMethods (version 1.94.0) package44) for R version 4.4.1.45) The principal component plot was generated using R’s plot function. For spike shape feature vector analysis, a covariance-matrix-based PCA was conducted using the scikit-learn library version 1.5.1 for Python version 3.10.14.46)

2.5. Struct-f4.

We used the Struct-f4 package for the population structure analysis based on the SNP genotypes of the accessions.47),48) Struct-f4 relies on f4-statistics, which estimate the amount of shared genetic drift across pairs of individuals.49) An f4-statistic, f4(A, B; C, D), is formulated as «(a-b)(c-d)», where «•» denotes the average over all genotyped sites, and a, b, c, and d represent the allele frequency for a given site in the four populations (or individuals, when population size n = 1) A, B, C, and D, respectively. Unlike ADMIXTURE,50) Struct-f4 characterizes individual genetic profiles as mixtures of K ancestral genetic components, based on patterns of allele sharing between quartets of individuals, without assuming Hardy-Weinberg equilibrium. This is particularly suitable for analysis in Ae. tauschii, as it is a selfing species. In the present study, K = 5 was chosen, because Ae. tauschii has five distinctive lineages/sublineages.18),19),22) To estimate parameters, the algorithm was run with 3,000,000 Markov chain Monte Carlo (MCMC) iterations in the first chain and a burn-in length of 7,500 followed by 20,000 MCMC iterations in the second chain.

2.6. Geographic distribution map.

Maps showing the geographic distribution of the accessions were created using spatial datasets obtained either from Natural Earth—free vector and raster map data available at naturalearthdata.com—via the rnaturalearth package51) with the ggplot2 package for R ver. 4.4.1,52) or from a bathymetry and topography database hosed by the National Oceanic and Atmospheric Administration using the marmap package.53)

2.7. Green glume shape comparison.

In general, glume shape tends to be elongated in Ae. tauschii subspecies tauschii (largely corresponding to TauL1), whereas it tends to be more quadrate in Ae. tauschii subspecies strangulata sensu lato (largely corresponding to TauL2 and TauL3).54) Green glume shape (length and width) data for 203 accessions were obtained from the previous work (Supplementary Table S1).54) Green glume length/width ratios were compared among lineages using lineage-wise box-and-dot plots, generated with the ggplot2 package for R ver. 4.4.1.52)

2.8. Spike image data preparation.

Spike samples were imaged using a flatbed scanner (EPSON GT-X830). Empty anthers remaining on the fresh spike samples were removed by hand using tweezers prior to imaging. For each accession, fresh or dried spike images (ca. 10 spikes per image) were saved at 96 dpi in a jpg format file with image size equaling to 3400 × 935 pixels.

2.9. Supervised learning of residual neural network (ResNet).

A CNN model, ResNet55) was used. ResNet was pretrained on ImageNet-1k56) implemented in PyTorch Image Models library.57) In the green spike image dataset (1,538 images in total), the numbers of images per accession varied from one to nine (mean 7.29), whereas in the dry spike image dataset (2,384 images in total), they ranged from four to 32 (mean 12.04). All images were resized to 400 by 800 pixels (height by width), and their pixel values were normalized using the following parameters: a mean RGB of (0.485, 0.456, 0.406), a standard deviation RGB of (0.229, 0.224, 0.225), and a maximum possible pixel value of 255.0. Supervised training where lineages served as the labels, was conducted using stratified group 5-fold cross-validation. For each green and dry spike image dataset, we split the whole dataset into five equal parts, trained each model on four parts, evaluated it on the fifth, and repeated the process with each part being used as the validation set once. During training, the images were augmented using Albumentations (v1.1.14) library58) with the following transformations: (1) horizontal, vertical, and both horizontal and vertical flips each with a probability of 0.5, (2) random adjustments to brightness and contrast, using default parameters except for brightness_limit, contrast_limit, and p, which were set to 0.2, 0.2, and 0.5, respectively, and (3) random cropping, using default parameters except for height, width, scale, ratio, and p, set to 400, 800, [0.7, 1.0], [2.0, 2.0], and 0.7, respectively. This training pipeline was executed using NVIDIA GeForce RTX 2080 Ti as a hardware accelerator. The macro F1 score, i.e., a harmonic mean of precision and recall over all lineages, was used to evaluate the effectiveness of the trained models in the image classification. The training pipeline was implemented using the PyTorch library version 1.12.1+cu113.59) We used Adam algorithm60) to optimize model parameters based on the value of cross entropy error with learning rate 0.0001. Eight spike images were batched to input into models, and we repeated the training for 60 epochs.

2.10. Feature visualization using attention maps.

To visualize the image features that influenced the CNN modeling, we used the Grad-CAM, Guided Backpropagation, and Guided Grad-CAM methods to generate attention maps on original spike images based on the final convolutional layer in each ResNet model.34)

2.11. Spike shape feature vector PCA.

The structure of spike shape diversity was examined by reducing the dimensionality of feature vectors generated in the five ResNet models via PCA for each dry and fresh spike dataset. In each model, feature vectors were obtained for each spike image by conducting a series of convolutions and pooling operations on the input images, with the final full connection layer, which functions as a classifier, being removed. Subsequently, PCA was performed for each model on the averaged feature vectors per accession for dimensionality reduction.

2.12. Blind test for lineage prediction.

We used 37 of the 38 North Caucasus accessions to test the trained ResNet models for their ability to infer the lineage from spike shape. The image dataset contained 201 green spike images in total and the numbers of images per accession varied from one to eight (mean 5.43). All images were resized to 400 by 800 pixels (height by width), with pixel values normalized according to these parameters: mean RGB values of (0.485, 0.456, 0.406), standard deviation RGB values of (0.229, 0.224, 0.225), and a maximum possible pixel value of 255.0. The lineage membership probabilities for each accession were calculated by feeding the image dataset to each of the five ResNet models trained on the green or dry spike images of the 211 accessions. In each spike image dataset, the resultant probabilities were averaged for each lineage, with the lineage exhibiting the highest probability inferred as the one to which the accession belonged. Concurrently, the 38 North Caucasus accessions underwent SNP genotyping using the GRAS-Di method. SNPs were called as described for the 211 accessions used for the ResNet modeling. The merged raw SNPs (called for 249 accessions in total) were then filtered and pruned as described. Subsequently, each North Caucasus accession was assigned to a lineage based on the result of a PCA performed on the SNP genotypes. The accuracy of the spike-image-based inference for each lineage was assessed by comparing it to the genotypic assignment for consistency.

3. Results

3.1. SNP genotyping.

In total, qualified SNPs were obtained at 16,746 sites across the 211 accessions relative to the AL8/78 reference genome sequence. The average observed heterozygosity varied from 0.06 to 0.11 between accessions (mean 0.09). The AL8/78 accession in our panel shared the same nucleotide with the reference genome sequence at 12,399 out of 13,936 homologous sites (89.0%). The SNP sites were distributed widely over the genome (Fig. 2A). By chromosome, the numbers of sites were 1,988 for chromosome 1D, 2,951 for 2D, 2,621 for 3D, 1,898 for 4D, 2,666 for 5D, 2,062 for 6D, and 2,560 for 7D.

Fig. 2.

(Color online) GRAS-Di-derived SNP genotyping of the 211 accessions. (A) Distribution of GRAS-Di-derived SNPs mapped to the Ae. tauschii AL8/78 reference genome sequence. Bar plots of the number of markers are displayed every 10 Mbp. (B) Graph of the first three axes principal components from a PCA based on the GRAS-Di-derived SNP genotypes of the 211 accessions. The first component (PC1) accounts for 13.74%, the second (PC2) for 5.13%, and the third (PC3) for 2.58% of the total variance. Lineage and sublineages are represented by color and shape for each accession according to the key. (C) Geographic distribution of the 211 accessions. The region of the TauL3 accessions is denoted by a purple arrowhead. Lineage and sublineage are represented by color and shape for each accession according to the key.

3.2. Population structure.

A PCA of the SNP genotypes revealed that the 211 accessions were categorized into the known lineages and sublineages: TauL1 (consisting of TauL1a and TauL1b), TauL2 (consisting of TauL2a and TauL2b), and TauL3 (Fig. 2B). Two exceptions were noted: KU-2109 (TauL1, previously classified as TauL2) and IG 127015 (TauL2, previously classified as TauL1).18) TauL1 (135 accessions) was spread across the species range, whereas TauL2 (71 accessions) was exclusively located in the western part of the range. TauL3 (five accessions) was confined to Georgia (Fig. 2C). Moreover, within the TauL1 lineage, TauL1a (52 accessions) predominantly occurred in the western region, while TauL1b (83 accessions) was primarily observed in the eastern part. Similarly, TauL2a (35 accessions) tended to occur in the western region of the lineage range, whereas TauL2b (36 accessions) occurred more in the eastern part (Fig. 2C). All these observations aligned with prior research findings.18),22)

To further examine the population structure and genetic profiles of individual accessions, a Struct-f4 analysis (K = 5) was conducted. This analysis showed that, within each lineage/sublineage, the accessions shared a specific ancestral genetic component that served as a major factor in their genetic profiles (Fig. 3A; Supplementary Table S1). The proportions of these components were greater than 0.75 in each accession, with a few exceptions (KU-2109 in TauL1a, KU-2068, KU-2122, and KU-2153 in TauL1b, KU-2158 in TauL2b), supporting the PCA observations that each lineage/sublineage is a distinctive genetic aggregate. Within the TauL1 and TauL2 lineages, these components were shared as minor factors by many accessions between TauL1a and TauL1b and between TauL2a and TauL2b. In contrast, the proportions of these components in the genetic profiles were negligible in most accessions belonging to different lineages (i.e., TauL1 vs. TauL2). These observations provided evidence that, despite the clear separation between the sublineages observed in the PCA plots, TauL1a and TauL1b, and TauL2a and TauL2b, form distinct lineages, TauL1 and TauL2, respectively (Fig. 3A).

Fig. 3.

(Color online) f4-statistics-based ancestry profiling and convolutional neural network spike shape phenotyping in the 211 accessions. (A) Proportions of the ancestral genetic components of the 211 accessions based on f4-statistics-based ancestry profiling at K = 5. The ancestral genetic components for TauL1a, TauL1b, TauL2a, TauL2b, and TauL3 are represented in red, yellow, blue, purple, and green, respectively. (B) Geographic distribution of Ae. tauschii accessions with TauL3 ancestral component proportions greater than 0.05, represented by pie charts. The ancestral genetic components are color-coded as in (A).

This analysis further indicated that certain TauL1 and TauL2 accessions exhibited small but substantial proportions of the genetic component originating from TauL3. The proportions of the TauL3 component ranged from 0.00 to 0.27 (mean 0.02) in TauL1a, from 0.00 to 0.05 (mean 0.01) in TauL1b, from 0.00 to 0.15 (mean 0.05) in TauL2a, from 0.00 to 0.33 (mean 0.04) in TauL2b, and from 0.91 to 0.97 (mean 0.95) in TauL3. The TauL1 and TauL2 accessions with proportions greater than 0.05 (33 accessions in total) were confined to TauL1a (eight accessions), TauL2a (15 accessions), and TauL2b (10 accessions), and geographically limited to the coastal Caspian as the center, with the Transcaucasus and adjacent regions (Fig. 3B).

3.3. Conventional neural network analysis of spike shape.

The green glume length/width ratios of the 203 accessions showed substantial overlap among the lineages, confirming that the Ae. tauschii lineages cannot be distinguished by this visible trait (Fig. 4A). Accordingly, we analyzed the natural variation patterns in Ae. tauschii spike shape based on CNN, aiming to clarify how its diversity is structured in relation to the population structure of the species. Firstly, we trained ResNet models55) using the lineages as the labels to predict the lineage (TauL1, TauL2, and TauL3) to which an accession belongs based on its spike shape. The 5-fold cross-validation generated five trained ResNet models for each green (1,538 images in total) and dry (2,384 images in total) spike image dataset. The trained ResNet models demonstrated very high macro F1 scores (0.95–1.00), indicating their highly effective assignment of the spike images to the correct lineages (Table 1). Next, PCA was performed on averaged feature vectors per accession for each trained ResNet model. For the green and dry spike image datasets, TauL1 and TauL2 formed separate clusters on the PC1 axis in each PC plot (Fig. 4B). This indicated that the spike shapes of the TauL1 and TauL2 accessions were similar within lineages compared to between lineages. When included in the fold, TauL3 tended to be separate from TauL1 and TauL2 on the PC2 axis. In each plot, accessions with the TauL3 component proportions greater than 0.05 showed no clear distribution pattern within their respective lineages (Fig. 4B).

Fig. 4.

(Color online) (A) Lineage-wise box-and-dot plots for green glume length/width ratios for the 203 Ae. tauschii accessions. (B) Graphs of the first two axes (PC1 and PC2) from PCAs performed on averaged feature vectors per accession for trained ResNet models, generated using the green and dry spike image validation datasets in 5-fold cross-validation. The variance explained by each axis is shown in each graph. The lineages are color-coded: red for TauL1, blue for TauL2, and green for TauL3. Crosses denote accessions with TauL3 ancestral component proportions greater than 0.05.

Table 1. Macro F1 scores of the five trained ResNet models, generated by stratified group 5-fold cross-validation, for the green and dry spike image datasets

Dataset Model1
(fold1)
Model2
(fold2)
Model3
(fold3)
Model4
(fold4)
Model4
(fold5)
Green spike image 1 0.98 0.99 0.99 0.96
Dry spike image 0.98 0.95 1 0.99 0.99

The trained ResNet models were further tested for their ability to predict the lineage to which an accession belongs based on its spike shape. Of the 38 North Caucasus accessions with unknown lineages, green spike images, which were obtained for 37 accessions (201 images in total), were used in the test. The ResNet models trained on green spike images classified the accessions into TauL1 (19 accessions) and TauL2 (18 accessions) based on their spike shapes (Supplementary Table S2). Notably, the ResNet models trained on dry spike images produced the same classification, assigning the same 19 accessions to TauL1 and the remaining 18 to TauL2 (Supplementary Table S2). This CNN classification was fully consistent with the categorization based on the genotypes: PCA of the genotypes based on 15,447 GRAS-Di-derived SNPs showed that each North Caucasus accession belonged to TauL1a or TauL2a (Fig. 5). These accessions had cylindrical to mildly moniliform spikes (Supplementary Fig. S1). In accordance with the CNN classification, they were largely assigned to TauL1 or to either TauL2 or TauL3 based on visual inspection.

Fig. 5.

(Color online) Population structure and geographic distribution and of the Caucasus accessions. (A) Graph of the first three principal components from a PCA based on the GRAS-Di-derived SNP genotypes of 249 accessions, including the 38 North Caucasus accessions used for the blind test. The first component (PC1) accounts for 13.55%, the second (PC2) for 5.02%, and the third (PC3) for 2.75% of the total variance. The lineage and sublineage are represented by color and shape for each accession according to the key. (B) Geographic distribution of Ae. tauschii accessions in the Caucasus region. The lineage and sublineage are color-coded and shape-coded for each accession according to the key.

In both green and dry spike image datasets, the image features influencing the ResNet modeling varied across different spike images within individual accessions, regardless of their respective lineages. For example, in the case of the dry spike images of KU-2091 (TauL2, 14 images in fold1), the features were observed at the tips in some attention map images, whereas they were distributed across the entire spikes in others (Fig. 6). Similarly, in the case of the green spike images of IG 48748 (TauL1, eight images in fold2), the features localized in the central parts in some attention map images, whereas they showed a broad distribution across the central spikes in others (Fig. 6).

Fig. 6.

(Color online) Examples of spike shape attention map images generated using the Grad-CAM method. The color gradient ranges from high attention (red, orange, and yellow) to low attention (green and blue). The Ae. tauschii accession from which the spike is derived is shown in each figure. All spike shape attention map images generated using the Grad-CAM, Guided Backpropagation and Guided Grad-CAM methods are available in the Zenodo repository.

4. Discussion

4.1. Population structure of Ae. tauschii and its implications for the origins of common wheat.

Based on the PCA and Struct-f4 analyses of GRAS-Di SNP genotypes of the accessions, the present study showed that the Ae. tauschii population comprises three genetically and geographically distinct lineages/sublineages: TauL1 comprising TauL1a and TauL1b, TauL2 comprising TauL2a and TauL2b, and TauL3. This finding aligns with previous research.18),19) In the lineage assignment, two accessions were placed in lineages different from those to which they had previously been assigned based on Diversity Arrays Technology marker genotypes. KU-2109, formerly classified as TauL2, was reassigned to TauL1, whereas IG 127015, formerly classified as TauL1, was reassigned to TauL2.18) The reassignment of KU-2109 (collected in Iran) to TauL1 was consistent with a recent study that used k-mers for genotyping.24) Further analysis is required to confirm the correct lineage of IG 127015 (collected in Armenia); however, for the purpose of this study, the accession was included in TauL2.

In the Struct-f4 analysis, several TauL1 and TauL2 accessions were found to have non-negligible proportions of genetic components originating from TauL3 (Fig. 3). Gene flow from TauL3 in ex-situ genebank fields, where the materials were propagated, might partially explain the sharing of TauL3 genetic components, because its relatively long anthers are capable of shedding pollen.61) However, it is more likely that the observed pattern of TauL3 component sharing originated from inter-lineage hybridization in natural habitats, rather than cross-pollen contamination in modern ex-situ fields. This conclusion is based on the following considerations. First, the TauL3 component was restricted to the southern Caspian and Transcaucasus. This geographic distribution pattern contrasts with the random distribution expected if gene flow had occurred in the ex-situ genebank fields. Second, 26 out of 33 accessions with TauL3 genetic component proportions greater than 0.05 (Supplementary Table S1), including KU-2109 (TauL1a, Iran), KU-2158 (TauL2b, Iran), and KU-2159 (TauL2b, Iran), which exhibit exceptionally high TauL3 genetic component proportions (>0.2), were provided by the National Bioresource Project/Kyoto University. These materials have been maintained through selfing via spike-bagging to prevent cross-pollen contamination since the introduction of the original samples collected from their respective natural habitats. Therefore, admixture within and between lineages may have shaped the observed distribution pattern of the TauL3 component. TauL1 and TauL2 accessions with TauL3 genetic component proportions greater than 0.05 were found primarily along the coastal Caspian, as well as in the Transcaucasus and adjacent regions. This may suggest that these areas served as centers of admixture involving this component (Fig. 3B).

Regarding the TauL3 component in the D genome, common wheat cultivars can be categorized into two types: those with little to low proportions of the TauL3 component, distributed across Eurasia and northern Africa, and those with substantial proportions, largely confined to the Transcaucasus and adjacent regions, including Georgia, Armenia, and the border areas of Turkey, Iraq, and Iran.24) Notably, more than 70% of the common wheat D genome exhibits “identity-by-state” with the genomes of southern Caspian TauL2 sublineages.24) However, because the common wheat D genome consistently forms a distinct cluster from Ae. tauschii in population structure studies,12),14),18) the direct descendants of the Ae. tauschii populations that gave rise to the common wheat D genome have likely not yet been identified.

The admixed nature of Ae. tauschii genomes, along with these previous findings, may have important implications for the evolution of common wheat. It can be inferred that the first type of cultivars likely originated from hybridizations involving Ae. tauschii populations with low proportions of the TauL3 component. However, those ancestral Ae. tauschii populations may not have been the direct progenitors of the current southern Caspian TauL2, although they were genetically closely related to it, as suggested by the findings mentioned above. The second type of cultivars likely originated from hybridizations involving TauL2-related Ae. tauschii populations that contained substantial TauL3 component proportions. A caveat to this non-exclusive scenario is the absence of populations genetically closely related to the current southern Caspian TauL2 (TauL2b) in the centers of the second type of cultivars, the Transcaucasus and adjacent regions (Fig. 3B). Because little is known about the historical genetic changes experienced by Ae. tauschii populations, addressing this knowledge gap may help further clarify the origins of common wheat.

4.2. Structure of spike shape diversity and its applications in breeding.

The common wheat D genome inherited only a small portion of Ae. tauschii alleles due to the limited number of ancestral populations (i.e., those closely related to the current southern Caspian TauL2 and Georgian TauL3) involved in natural hybridizations with T. turgidum. Consequently, modern Ae. tauschii represents a vast reservoir of alleles that have not yet been utilized in breeding. Such alleles may confer beneficial phenotypes, such as drought tolerance and disease resistance, when introduced into common wheat. In this context, TauL1, the most distant relative of the common wheat D genome, stands out as a rich source of such alleles, while each lineage may provide valuable genetic resources for wheat breeding. To efficiently utilize Ae. tauschii germplasm, it is essential to develop methods for distinguishing the lineages based on accession morphology, without the need for genotyping.

Ae. tauschii exhibits considerable diversity in spike shape, with some accessions having cylindrical outlines, others curved, and a range of intermediate forms. In the present study, we aimed to clarify the structure of spike shape diversity using CNN modeling. We found that, in the 249 accessions used, the lineages can be distinguished based on spike shape. Notably, the CNN-based approach demonstrated impressive potential for practical use in assigning accessions to the correct lineages based on the spike images. This was evident from the very high macro F1 scores (0.95–1.00) obtained for the trained ResNet models (Table 1), the clear, non-overlapping separation of the lineages on the feature vector PC plots (Fig. 4B), and the fully accurate assignment of the 37 North Caucasus accessions to their respective lineages in the blind test (Fig. 5; Supplementary Table S2). In the blind test, the ResNet models trained on dry spike images classified the green spike images into the correct lineages. This indicates that the models rely on the spike shapes rather than the colors when making classifications. Nevertheless, we caution that the under-representation of TauL3 accessions in our samples due to their limited availability may limit the ability of the CNN-based approach to distinguish this lineage from TauL1 and TauL2. The small sample size of TauL3 (five accessions) might have contributed to the observed variation in its discrimination accuracy by the ResNet models (Fig. 4B), as the TauL3 accessions were relatively uniform in spike shape. This issue requires further testing using an increased number of TauL3 accessions. The usefulness of this method for TauL3 must therefore be critically re-evaluated. Another caveat is that the specific morphological discriminators remain unidentified, as suggested by the Grad-CAM analysis. This may imply that the ResNet models generated complex discriminators that are not readily interpretable by humans.

Possible practical uses of the ResNet models trained on the spike images may include screening for salt-tolerant accessions that could be incorporated into breeding programs. In a previous study, TauL1 accessions outperformed TauL2 and TauL3 in seedling growth in NaCl stress conditions, indicating that TauL1 is a useful source of salt tolerance at the seedling stage.32) The ResNet models can directly facilitate the screening process by identifying TauL1 accessions without the need for genotyping. Similarly, the trained ResNet models can be used to identify TauL1 accessions for early flowering, and TauL2 or TauL3 accessions for increased seed size.22)

4.3. Spike shape diversity and intraspecific taxonomy of Ae. tauschii.

A taxonomy should provide a useful tool for field biologists while reflecting the phylogenetic history of the taxa as accurately as possible.62) Considering this perspective, our findings have implications for the intraspecific taxonomy of Ae. tauschii. First, TauL1, TauL2, and TauL3 should each ultimately be treated as a distinct taxon, because each represents a well-defined genealogical unit. This may require reconsidering the dichotomous nature of the current system—characterized by cylindrical spikes and elongated cylindrical spikelets in subspecies tauschii, and pearly spikes and thick, bulbous spikelets in subspecies strangulata—because it does not explicitly account for the existence of three distinct lineages. The sublineages of TauL1 and TauL2 could also be ranked as taxa within their respective lineage-level taxa. Second, to achieve this goal, morphological keys to the lineages and sublineages—essential for field biologists and breeders engaged in classification—must be identified. Spike shape remains the preferred trait for this purpose, as the ResNet models demonstrated its potential for distinguishing the lineages. In the present work, we were unable to identify such key phenotypes from the spike image features influencing the ResNet modeling, as visualized on the attention maps (Fig. 6). Further examination through human observation, combined with CNN-based analyses using a large spike image dataset that includes more representation of the TauL3 lineage, may help address this issue.

Conflict of interest

The authors declare no conflicts of interest.

Acknowledgements

We thank Pablo Librado for his assistance with running Struct-f4. We also extend our gratitude to Kazutoshi Okuno for providing collection site information on the 38 North Caucasus accessions, to the Genebank project at the National Agriculture and Food Research Organization (NARO) for supplying the seeds of these accessions, and to Hirokazu Handa for providing the seeds of AL8/78. Computations were partially performed on the NIG supercomputer at ROIS National Institute of Genetics. This work was supported by JPSP KAKENHI Grant (19H02935, 19KK0157, 20K20720, 23K23573, and 23K26927), Moonshot Agriculture, Forestry and Fisheries Research and Development Program (JPJ009237), Hyogo Science and Technology Association Academic Research Grant (number 4002, fiscal year 2022), and the Joint Research Program of Arid Land Research Center, Tottori University (04A2002 and 06B2007).

Supplementary materials

Supplementary materials are available at https://doi.org/10.2183/pjab.101.023.

Notes

Edited by Akira ISOGAI, M.J.A.

Correspondence should be addressed to: Y. Koyama and Y. Matsuoka, Graduate School of Agricultural Science, Kobe University, 1-1 Rokko, Nada-ku, Kobe, Hyogo 657-8501, Japan (e-mail: me@yox5ro.net (Y.K.); pinehill@port.kobe-u.ac.jp (Y.M.)).

References
Non-standard abbreviation list

CTAB

Cetyltrimethylammonium bromide

CNN

Convolutional neural network

Grad-CAM

Gradient-weighted class activation mapping

GRAS-Di

Genotyping random amplicon sequencing-direct

MCMC

Markov chain Monte Carlo

NARO

National Agriculture and Food Research Organization

PCA

Principal component analysis

ResNet

Residual neural network

SNP

Single nucleotide polymorphism

XAI

Explainable artificial intelligence

 
© 2025 The Author(s).

Published under the terms of the CC BY-NC license
https://creativecommons.org/licenses/by-nc/4.0/
feedback
Top