Analysis of nuclear mitochondrial DNAs and factors affecting patterns of integration in plant species

Sequences homologous to organellar DNA that have been integrated into nuclear genomes are referred to as nuclear mitochondrial DNAs (NUMTs) and nuclear plastid DNAs (NUPTs). NUMTs in nine plant species were analyzed to reveal the integration patterns and possible factors involved. The cumulative lengths of NUMTs in two-thirds of species analyzed were greater than those of NUPTs observed in a previous study. The age distribution of NUMTs was similar to that of NUPTs, suggesting similar mechanisms for integration and degradation of both NUPTs and NUMTs. Nuclear genome size and the cumulative length of NUMTs showed a significant positive correlation for older but not younger NUMTs. The same correlation was also found between nuclear genome size and older NUPTs in 17 species. These results suggested that genome size is a key factor to determine the cumulative length of relatively older NUPTs/NUMTs. Although the factor(s) determining the cumulative length of younger NUPTs/NUMTs is unclear, these sequences may be more deleterious, which could explain the different manner of determining the cumulative length of younger NUPTs/NUMTs in nuclear genomes. In addition, a relationship between the cumulative length of integrated NUMTs and complexity of mitochondrial genomes (i.e., the number of repeats) was found. The results indicate that the structural complexity of both NUMTs and their original mitochondrial sequences affects integration and degradation processes.


INTRODUCTION
Mitochondria are cytoplasmic organelles that have their own distinct genomes, which originated from the endosymbiosis of ancient bacteria (Mereschkowsky, 1905; Margulis, 1981). In comparison with those in the genomes of current, free-living bacteria, the number of genes in mitochondrial genomes is small, and the genes encode only a small fraction of the proteins that function in mitochondria (Kleine et al., 2009). During the process of endosymbiosis, a large number of genes were transferred from the organellar genomes to the host nuclear genome (Timmis et al., 2004). Recent studies have revealed that transfers of segments of DNA from organellar to nuclear genomes are ongoing. Both mitochondrial and plastid DNA sequences have been integrated into the host nuclear genome (Noutsos et al., 2005; Smith et al., 2011; Michalovova et al., 2013). Such organellar DNA-like sequences in nuclear genomes are called nuclear mitochondrial DNAs (NUMTs) and nuclear plastid DNAs (NUPTs). In previous studies, the patterns of integration in the model plants Arabidopsis thaliana and Oryza sativa have been extensively analyzed (e.g., Richly and Leister, 2004; Matsuo et al., 2005; Noutsos et al., 2005). Recently, other species also have been analyzed to reveal the patterns of integration (Smith et al., 2011; Michalovova et al., 2013; Yoshida et al., 2014). In addition, research has been conducted using tobacco plants to estimate the transfer rate (Lloyd and Timmis, 2011; Stegemann et al., 2013).
The fate of integrated DNA sequences (NUPTs and NUMTs) has been analyzed using both genomic and experimental data, and the degradation of these sequences by both small and large indels (insertions and deletions) was suggested (Richly and Leister, 2004; Matsuo et al., 2005; Sheppard and Timmis, 2009). The general patterns of genomic integration of NUPTs in 17 plant species have been analyzed (Yoshida et al., 2014). The estimated age distribution of NUPTs for those species was of two types. For some species, there was a very high proportion of young insertions of plastid genes with a rapid decrease of the number of insertions in older categories. In other species, there was a relatively even distribution of old and young insertions. However, it is not clear whether these differences are due to differences among the species. Because NUMTs originated from mitochondrial genomes, which are independent of plastid genomes, analyses of plant NUMTs provide an opportunity to compare the age distributions of NUMTs with those of NUPTs.
A distinct feature of mitochondrial genomes compared to plastid genomes is diversity in their structure and size. Mitochondrial genomes often contain repeat regions that cause intermolecular recombination and form multipartite structures. These genomes also differ in length between species. In some cases, even close relatives show large differences in genome sizes and coverage by these short repeats (Alverson et al., 2011a). Plastid genomes do not show such high levels of structural diversity among species, and so it is worth investigating whether the variable structure of mitochondrial genomes is related to the NUMT integration patterns.
In this study, we investigated the patterns of genomic integration of NUMTs for nine plant species whose mitochondrial and nuclear genome sequences are available. The results revealed two types of age distributions, as seen in NUPTs (Yoshida et al., 2014). In addition, they suggest that patterns of incorporation of NUMTs are related to the complexity of mitochondrial genomes.

MATERIALS AND METHODS
Identification of NUMTs. BLAST (http://blast.ncbi.nlm.nih.gov/) searches against nuclear genomic sequences were conducted using each mitochondrial genome as a query sequence. For C. papaya, V. vinifera and Z. mays, bulk genomic sequence data were obtained from Phytozome ver. 9.1 (Goodstein et al., 2012). For L. japonicus, genomic data were obtained from the website of the Kazusa DNA Research Institute (www.kazusa.or.jp/lotus/index.html). The NCBI BLAST server was used for A. thaliana, G. max, O. sativa and S. bicolor, whereas the local BLAST program was used for species with bulk data. For BLAST hits, p-distance (p = n_d/n, where n_d is the number of nucleotide differences and n is the number of aligned nucleotides) was calculated to estimate the relative integration time. A larger p-distance indicates a more ancient integration into the nuclear genome. As in the previous study of NUPTs (Yoshida et al., 2014), BLAST hits of 100 bp or longer and with relatively low p-distances (≤ 0.1) were selected. Compared with searches for NUPTs, BLAST searches with mitochondrial genomes resulted in a larger number of partially or completely overlapping BLAST hit coordinates (start position, end position) in nuclear genomes. This is due to the larger number of repeats in mitochondrial genomes than in plastid genomes. To prevent overestimation of the number of NUMTs, such overlapping BLAST hits were inspected and eliminated in the following manner: first, if more than one BLAST hit had identical/nested coordinates, the longest hit with the lowest p-distance was counted as an NUMT and the others were ignored (Supplementary Fig. S1, A and B); and second, if BLAST hits partially overlapped for more than half of their length, only the hit with the longest sequence was counted as an NUMT (Supplementary Fig. S1C). If BLAST hits were adjacent to each other and with overlaps of less than half of their lengths, each hit was independently counted as an NUMT (Supplementary Fig. S1D).
It should be noted that, in addition to independent, recurrent integrations (or simultaneous integrations) of distinct NUMTs, a single large insertion followed by multiple deletions can also produce a mosaic pattern of NUMTs. The nuclear coordinates of candidate NUMTs in each species are listed in Supplementary Table S2.
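The overlap rules described above can be illustrated with a minimal sketch. It is not the procedure actually used in this study, in which overlapping hits were inspected; it is one possible greedy reading of the rules. The `Hit` record, the function names and the choice to measure "half of their length" against the shorter of the two hits are assumptions made here for illustration.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One BLAST hit in nuclear coordinates (hypothetical record layout)."""
    chrom: str
    start: int          # 1-based, inclusive
    end: int            # 1-based, inclusive
    p_distance: float

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def overlap_bp(a: Hit, b: Hit) -> int:
    """Number of nuclear positions shared by two hits (0 if none)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def resolve_overlaps(hits: List[Hit]) -> List[Hit]:
    """Keep one hit per group of identical, nested or mostly overlapping hits.

    Hits are visited from longest to shortest (ties broken by lower
    p-distance). A hit is dropped when it shares more than half of the
    shorter hit's length with a hit that was already kept. Nested and
    identical hits share 100% of the shorter hit, so they are covered by
    the same test. Assumption: "half of their length" refers to the
    shorter hit of each pair.
    """
    ranked = sorted(hits, key=lambda h: (-h.length, h.p_distance))
    kept: List[Hit] = []
    for hit in ranked:
        redundant = any(
            overlap_bp(hit, k) > 0.5 * min(hit.length, k.length) for k in kept
        )
        if not redundant:
            kept.append(hit)
    # Return the surviving hits in genomic order.
    return sorted(kept, key=lambda h: (h.chrom, h.start, h.end))
```

Adjacent hits that overlap by less than half of the shorter hit are all retained under this sketch, which corresponds to the mosaic case in Supplementary Fig. S1D.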
Data analyses of NUMTs. The proportion of the nuclear genome that consisted of NUMTs and the cumulative length of those NUMTs were calculated and compared with nuclear genome size. The direction of mutation differs among nucleotides in integrated DNA sequences (Rousseau-Gueutin et al., 2011). This makes it difficult to estimate the corrected number of substitution events between NUMTs and mitochondrial genomes. Thus, as in Yoshida et al. (2014), p-distance was used as an indicator of relative time since integration into the nuclear genome. For each species, the age distribution of NUMTs, as indicated by their p-distance, was estimated.
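As an illustration of how p-distance can serve as a relative age indicator, the following minimal sketch computes p = n_d/n for a pair of aligned sequences and sums NUMT lengths per p-distance interval. The function names, the skipping of gap columns and the interval width of 0.01 are assumptions made for this example; only the 100-bp and p ≤ 0.1 thresholds come from the text.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def p_distance(aligned_a: str, aligned_b: str) -> float:
    """Proportion of differing sites, p = n_d / n, between aligned sequences.

    Assumption: columns containing a gap ('-') in either sequence are
    skipped, so n counts only positions where both sequences have a base.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differences = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no aligned nucleotides to compare")
    return differences / compared


def age_distribution(
    hits: Iterable[Tuple[int, float]],
    bin_width: float = 0.01,   # assumed interval width
    min_length: int = 100,     # hits of 100 bp or longer
    max_p: float = 0.1,        # p-distance <= 0.1
) -> Dict[int, int]:
    """Cumulative hit length (bp) per p-distance interval.

    `hits` yields (length_bp, p_distance) pairs. Key k of the result covers
    p-distances in [k * bin_width, (k + 1) * bin_width).
    """
    totals: Dict[int, int] = defaultdict(int)
    for length, p in hits:
        if length < min_length or p > max_p:
            continue
        totals[int(p // bin_width)] += length
    return dict(totals)
```

Under this sketch, interval 0 corresponds to the youngest class (p-distance < 0.01) and higher keys to progressively older integrations.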
The relationship between integration of NUMTs and structural diversity of mitochondrial genomes was investigated. To evaluate the structural complexity of the mitochondrial genomes of each species, the number of repeat regions in each was estimated by BLAST searches using MegaBLAST with word_size = 16. In the BLAST search, each mitochondrial genome was used as a query against its own sequence. To avoid multiple counting of repeat sequences, BLAST hits with unique coordinates in the mitochondrial genome were considered to be repeat sequences. The number of long (more than 1 kb) and short (21-1,000 bp) repeats was counted and the relationship with cumulative length of the NUMTs was examined.
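The repeat-counting step can be sketched as follows. The sketch assumes that the self-BLAST hits have already been parsed into coordinate tuples, and it reads "unique coordinates" as unique query intervals, so that each copy of a repeat is counted once; both points are assumptions, as is the function name.

```python
from typing import Dict, Iterable, Tuple

# (query_start, query_end, subject_start, subject_end); hypothetical layout.
SelfHit = Tuple[int, int, int, int]


def count_repeats(self_hits: Iterable[SelfHit]) -> Dict[str, int]:
    """Count short (21-1,000 bp) and long (> 1 kb) repeats.

    Hits of a region against itself (including the full-length match of the
    genome to itself) are discarded. Remaining hits are reduced to unique
    query intervals, so a repeat copy hit by several other copies is
    counted only once.
    """
    unique_intervals = set()
    for q_start, q_end, s_start, s_end in self_hits:
        query = (min(q_start, q_end), max(q_start, q_end))
        # Minus-strand hits report subject_start > subject_end; normalise.
        subject = (min(s_start, s_end), max(s_start, s_end))
        if query == subject:
            continue
        unique_intervals.add(query)

    counts = {"short": 0, "long": 0}
    for start, end in unique_intervals:
        length = end - start + 1
        if length > 1000:
            counts["long"] += 1
        elif length >= 21:
            counts["short"] += 1
        # Intervals shorter than 21 bp fall outside both classes.
    return counts
```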
For all species analyzed, there were mosaics of BLAST hits, each of which corresponds to a discrete region in the mitochondrial genome (Supplementary Fig. S1D). The existence of such mosaic BLAST hits may indicate the formation of mosaic organellar DNA fragments during or after insertion into nuclear genomes. The overlap between the coordinates of adjacent BLAST hits within a mosaic ranged from less than 10 bp to more than 1 kb. To compare such mosaic-structured NUMTs between species, the number of mosaic NUMTs and the length of overlap between adjacent hits were assessed based on the coordinates of BLAST hits. The hits were sorted by their coordinates (chromosome number, start position, and end position) and the end position of one NUMT (e.g., the purple-colored NUMT in Supplementary Fig. S1D) was compared with the start position of the next one (the light blue-colored NUMT).
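The comparison of adjacent hits described above can be sketched as follows. The tuple layout and function name are assumptions for illustration, and inclusive coordinates are assumed.

```python
from typing import List, Tuple

# (chromosome, start, end) with 1-based inclusive coordinates; assumed layout.
Coord = Tuple[str, int, int]


def adjacent_overlaps(hits: List[Coord]) -> List[int]:
    """Overlap length (bp) between each pair of neighbouring hits.

    Hits are sorted by chromosome, start and end position. The end of each
    hit is then compared with the start of the next hit on the same
    chromosome, and only positive overlaps are reported.
    """
    ordered = sorted(hits)
    overlaps: List[int] = []
    for (chrom_a, _start_a, end_a), (chrom_b, start_b, end_b) in zip(
        ordered, ordered[1:]
    ):
        if chrom_a != chrom_b:
            continue
        # min() guards against a second hit that ends before the first one.
        overlap = min(end_a, end_b) - start_b + 1
        if overlap > 0:
            overlaps.append(overlap)
    return overlaps
```

The returned lengths can then be grouped into size classes (for example, shorter than 10 bp or 21-50 bp) to compare species.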

RESULTS AND DISCUSSION
NUMTs in plant species. For six of the nine study species (A. thaliana, C. papaya, V. vinifera, C. sativus, S. bicolor and Z. mays), the cumulative length of NUMTs (Table 1) was greater than that found for NUPTs (Yoshida et al., 2014). For example, in A. thaliana, the cumulative length of NUMTs (282.1 kb) was about 16 times greater than that of NUPTs (17.7 kb). In addition to A. thaliana, C. sativus showed a large excess of NUMTs (309.2 kb) compared to NUPTs (49.0 kb). The excess of NUMTs compared with NUPTs in these species may reflect more frequent integrations from mitochondrial genomes than from plastid genomes. Several developmental processes have been proposed as pathways for integration of organellar-derived DNA: lysis of organelles during processes such as pollen development, nuclear inclusion of mitochondria, nuclear attachment of plastids, and stromules connecting plastids with nuclei (Leister, 2005). The integration of large NUMTs is more likely caused by nuclear inclusion of mitochondria than by fragmental integration after lysis of organelles. Such mitochondrial inclusion events have been observed in plants (Yu and Russell, 1994) and may be one of the reasons for the greater cumulative length of NUMTs compared to NUPTs.
In our previous study, a significant enrichment of transposable elements (TEs) in the vicinity of NUPTs was observed in A. thaliana and Z. mays (Yoshida et al., 2014), suggesting similar distributions of NUPTs and TEs in nuclear genomes. In the current study, A. thaliana NUMTs co-existed with TEs more frequently than expected (Supplementary Fig. S2). The average number
Age distribution of NUMTs. In our previous study of 17 plant species (Yoshida et al., 2014), NUPTs showed mainly two types of age distribution: the majority of NUPTs were relatively young for six species (p-distance < 0.01; two eudicots and four monocots), while there was an even distribution of estimated ages for the NUPTs of the remaining 11 eudicots. These same two patterns were found for NUMTs (Fig. 1). Seven of the nine study species (four eudicots and three monocots) showed a large proportion of the NUMTs falling in the youngest category of integrations, with an immediate decrease in the next-youngest category. The remaining two species, V. vinifera and G. max, did not show such a pattern. In four species, this predominance of young integrations was seen for NUMTs but not NUPTs (A. thaliana, C. papaya, L. japonicus and C. sativus). In A. thaliana, the excess of NUMTs in the youngest age category is due primarily to one extra-large integration: a nearly entire mitochondrial genomic sequence is integrated into the vicinity of the centromere region of chromosome 2 (Supplementary Fig. S3) (Lin et al., 1999; Stupar et al., 2001). Long NUMTs were also observed in other species. Five species (C. papaya, L. japonicus, C. sativus, O. sativa and Z. mays) have more than one extra-long NUMT (longer than 10 kb) (Fig. 1, Supplementary Fig. S4). The location of extra-long NUMTs varied among nuclear chromosome regions (Supplementary Fig. S3), suggesting recurrent integrations of such long sequences. These long NUMTs were relatively recent integrations (Supplementary Fig. S4), while older NUMTs were quickly degraded (Supplementary Fig. S5). The elimination of long NUMTs (Supplementary Fig. S4) and pervasive changes caused by small indels (Supplementary Fig. S5) were similar to those observed in NUPTs, suggesting that the degradation/elimination process after integration is similar between NUPTs and NUMTs.
Fig. 1. Age distribution of NUMTs. The cumulative length of NUMTs for each p-distance interval is shown. White bars represent the proportion of these NUMTs that are long insertions (length > 10 kb).

Relationship between nuclear genome sizes and NUMTs. Two previous studies suggested that the nuclear genome size and the cumulative length of NUMTs were positively correlated (Smith et al., 2011; Michalovova et al., 2013). In contrast, Richly and Leister (2004) reported no significant correlation of either nuclear or mitochondrial genome size with the cumulative length of NUMTs. In the current study, nuclear genome size and cumulative length of all NUMTs had a positive relationship but it was not significant (Kendall's τ = 0.278, P = 0.359). In previous analyses of NUPTs, we discussed the possibility that the age distribution of integrated DNAs is due to the different fitness effects of younger vs. older integrated DNAs. In that hypothesis, many recently integrated DNAs have a deleterious effect on the genome and are selected against, being either removed or reduced in size. Due to this selection, many fewer integrated DNAs survive to a relatively older age, and those DNAs that do survive are less deleterious and smaller in size. Based on that idea, we tentatively divided NUMTs into younger (p-distance < 0.01) and older NUMTs, and tested for a correlation between nuclear genome sizes and the cumulative lengths of each age class. Interestingly, while there was no relationship between nuclear genome sizes and cumulative lengths of younger NUMTs, a significant positive relationship was found between the nuclear genome sizes and cumulative lengths of older NUMTs (Table 2). To confirm that this difference is a common pattern for integrated organellar DNAs, we also tested the data for NUPTs from 17 species in our previous study (Yoshida et al., 2014) and found the same result (Table 2). One explanation is that a recent "burst" of integration events in some species may have resulted in the non-significant relationship between nuclear genome sizes and cumulative lengths of younger NUMTs/NUPTs.
The other possible explanation is a difference between younger and older NUMTs in their deleterious effect on nuclear genomes, as hypothesized above. Young NUMTs may have stronger deleterious effects and their toleration by nuclear genomes may be determined by conditions such as genome architecture and the strength of the genome immune system against foreign DNAs. Genes potentially involved in such epigenetic mechanisms show functional variation within species (Shen et al., 2014) or elevated amino acid substitution rates between species (Willing et al., 2015), suggesting differences in the genome immune system between species (Springer et al., 2016). In contrast, NUMTs that survived in the genome to an older age could have always been or could have become less deleterious, and the cumulative length of those NUMTs might be determined by the size of the nuclear genome.
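The rank correlation reported above can be illustrated with a minimal pure-Python sketch of Kendall's τ (the tau-b form, which adjusts for ties). The function name is an assumption, the sketch returns only the coefficient, and it is not the software used in this study.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b between two equally long sequences.

    Every pair of observations is classified as concordant, discordant,
    tied only in x, tied only in y, or tied in both. Tau-b is
    (C - D) / sqrt((C + D + Tx) * (C + D + Ty)).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue            # tied in both variables
        if dx == 0:
            ties_x += 1         # tied only in x
        elif dy == 0:
            ties_y += 1         # tied only in y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denominator = sqrt((untied + ties_x) * (untied + ties_y))
    if denominator == 0:
        raise ValueError("tau is undefined when a variable is constant")
    return (concordant - discordant) / denominator
```

With nine species there are 36 species pairs, so, if no values are tied, τ = 0.278 corresponds to 10 more concordant than discordant pairs (10/36). A P value requires a separate test; `scipy.stats.kendalltau` is one widely used implementation that reports both the coefficient and a P value.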
Relationship between complexity of mitochondrial genomes and NUMTs. The structural diversity of mitochondrial genomes could affect both the number and size of NUMTs. We estimated the number of repeat sequences as an indicator of the structural complexity of the mitochondrial genomes. For all species, both long (> 1 kb) and short repeats were found (Supplementary Table S3). The largest fraction of repeat sequences was short repeats of 21 to 50 bp in length (Supplementary Fig. S6). There was an especially large number of repeat sequences in the mitochondrial genome of C. sativus. The number of short repeats in C. sativus (n = 109,504) was more than 30 times larger than in the species with the second-highest number of repeats, V. vinifera (n = 3,065). The number of long repeats ranged from two (C. papaya and V. vinifera) to 25 (G. max; Supplementary Table S3). The relationship between number of short repeats in the mitochondrial genome and cumulative length of NUMTs was not significant (Kendall's τ = 0.5, P = 0.075; Supplementary Table S4). When the unusually high numbers from C. sativus were excluded, the test showed a significant correlation (Supplementary Table S4). In contrast, the number of long repeats did not show a relationship with the cumulative lengths of NUMTs. We also estimated the number of mosaic NUMTs and the overlapping sequence length for adjacent NUMTs within those mosaics (Table 3). Most overlapping regions were shorter than 10 bp for all species except C. sativus, for which the majority of overlaps were in the range of 21-50 bp. In previous studies, the mechanism of NUMT integration was analyzed, and nonhomologous end-joining (NHEJ) of double-stranded breaks (DSBs) was suggested as a common mechanism of integration among all eukaryotes (Ricchetti et al., 1999; Wang and Timmis, 2013; reviewed in Leister, 2005).
With this mechanism, termini of nuclear DSBs and organellar DNA fragments are attached to one another due to microhomology in the sequences (Leister, 2005). Homologous sequences may influence not only de novo insertions in the nuclear genome, but also rearrangements and deletions after integration (Noutsos et al., 2005); repeats are an example of sequences that may provide microhomologies. Our results suggest a relationship between repeat sequences within mitochondrial genomes and both the cumulative length and the complexity of NUMT regions. During integration, such repeat sequences can facilitate the formation of complex structures of mitochondrial DNA by end-joining. In the rearrangement of integrated NUMTs, repeat sequences can be the cause of replication slippage. It is unknown why the test that included C. sativus, which has the largest number of short repeats, did not show clear significance. One possibility is that the extra-large mitochondrial genome size (1.56 Mb) and a high number of repeats have an antagonistic effect on the amount of integrated sequences.
Factors affecting the pattern of NUMT integration. In this study, there were two types of age distribution of NUMTs across species (Fig. 1), similar to those seen for NUPTs (Yoshida et al., 2014). There was a relationship between the complexity of mitochondrial genomes and the integration pattern of NUMTs (Table 3, Supplementary Table S3). The cumulative length of NUMTs tentatively categorized by p-distance as being relatively young did not show any relationship to nuclear genome sizes, while older NUMTs showed a significant positive relationship. Although it is difficult to measure the deleterious effect of de novo NUMTs, the analysis of epigenetic modification as an indicator of the strength of the genome immune response against foreign DNAs (Kim and Zilberman, 2014) may bring new insights into the integration pattern of NUMTs.