2019 Volume 94 Issue 6 Pages 269-281
In the current era, as a growing number of genome sequence assemblies have been reported in animals, an in-depth analysis of transposable elements (TEs) is one of the most fundamental and essential studies for evolutionary genomics. Although TEs have, in general, been regarded as non-functional junk/selfish DNA, parasitic elements or harmful mutagens, studies have revealed that TEs have had a substantial and sometimes beneficial impact on host genomes in several ways. First, TEs are themselves diverse and thus provide lineage-specific characteristics to the genomes. Second, because TEs constitute a substantial fraction of animal genomes, they are a major contributing factor to evolutionary changes in genome size and composition. Third, host organisms have co-opted many repetitive sequences as genes, cis-regulatory elements and chromatin domain boundaries, which alter gene regulatory networks and in addition are partly involved in morphological evolution, as has been well documented in mammals. Here, I review the impact of TEs on various aspects of the genome, such as genome size and diversity in animals, as well as the evolution of gene networks and genome architecture in mammals. Given that a number of TE families probably remain to be discovered in many non-model organisms, unknown TEs may have contributed to gene networks in a much wider variety of animals than considered previously.
Approximately half of the human genome is composed of transposable elements (TEs) (Fig. 1A; International Human Genome Sequencing Consortium, 2001). TEs are mobile elements that can mobilize their sequences with or without proliferation within the genome. Although different kinds of TEs are found in various organisms, all major TEs can be classified into two main categories, retrotransposons and DNA transposons (Fig. 1B; Kazazian, 2004). Retrotransposons, also known as Class I TEs, propagate their copy sequences in the host genome by reverse transcription of their RNAs (retrotransposition; a copy-and-paste mechanism). LTR-retrotransposons and LINEs (long interspersed elements) are major classes of retrotransposons (Deininger and Batzer, 2002). LTR-retrotransposons have long terminal repeats (LTRs) consisting of several hundred base pairs that flank an internal coding region of the elements. LINEs, also called non-LTR-retrotransposons, encode a reverse transcriptase that recognizes the 3’ end sequence of the LINE RNA during retrotransposition (Moran et al., 1996; Kajikawa and Okada, 2002). DNA transposons, also known as Class II TEs, encode a transposase flanked by short terminal inverted repeats (TIRs or IRs). The transposase recognizes TIRs and mobilizes the elements (transposition; a cut-and-paste mechanism). Meanwhile, many non-autonomous TEs, which do not encode enzymes for transposition, have been identified. The non-autonomous TEs have a specific sequence that can be recognized by enzymes encoded by partner autonomous elements, which enables them to be transposed (Fig. 1B). For example, non-autonomous elements that have LTRs and no coding region, called LARDs (large retrotransposon derivatives) or TRIMs (terminal-repeat retrotransposons in miniature), can replicate using enzymes encoded by corresponding autonomous elements (Havecker et al., 2004). 
SINEs (short interspersed elements) are frequently observed in the genomes of multicellular organisms (Okada, 1991) and share 3’ end sequences with partner LINEs so that the 3’ sequence of the SINE RNA can be recognized in trans by the reverse transcriptase encoded by the partner LINEs during retrotransposition (Ohshima et al., 1996; Kajikawa and Okada, 2002). In the case of DNA transposons, MITEs (miniature inverted-repeat transposable elements) are non-autonomous elements that possess TIRs with no coding region (Wessler et al., 1995). Thus, all TEs have been hierarchically classified based on their mobilization mechanism, domain structure and consensus sequences (Kojima, 2018, 2019).
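The classification described above can be summarized as a small lookup table. The following is a minimal Python sketch rather than a formal taxonomy: the dictionary layout and the function name `partner_of` are illustrative assumptions, and only the TE types named in the text are included.

```python
# Illustrative sketch only: the table layout and all identifiers are
# assumptions, and only TE types mentioned in the text are listed.
TE_CLASSES = {
    # Class I (retrotransposons): copy-and-paste via reverse transcription
    "LTR-retrotransposon": {"class": "I", "mechanism": "copy-and-paste", "autonomous": True},
    "LINE": {"class": "I", "mechanism": "copy-and-paste", "autonomous": True},
    "LARD": {"class": "I", "mechanism": "copy-and-paste", "autonomous": False,
             "partner": "LTR-retrotransposon"},
    "TRIM": {"class": "I", "mechanism": "copy-and-paste", "autonomous": False,
             "partner": "LTR-retrotransposon"},
    "SINE": {"class": "I", "mechanism": "copy-and-paste", "autonomous": False,
             "partner": "LINE"},
    # Class II (DNA transposons): cut-and-paste via a transposase
    "DNA transposon": {"class": "II", "mechanism": "cut-and-paste", "autonomous": True},
    "MITE": {"class": "II", "mechanism": "cut-and-paste", "autonomous": False,
             "partner": "DNA transposon"},
}


def partner_of(te_type):
    """Return the autonomous partner whose enzymes mobilize te_type.

    Returns None when te_type is itself autonomous.
    """
    return TE_CLASSES[te_type].get("partner")
```

For instance, `partner_of("SINE")` returns `"LINE"`, reflecting the shared 3’ end sequence that lets the LINE-encoded reverse transcriptase act in trans, whereas `partner_of("LINE")` returns `None`.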
Fig. 1. Fraction of TEs in the human genome and classification of TEs. (A) Proportion of protein-coding sequences (gray), TEs (black), and other DNA (white) in the human genome. Major components of SINEs are Alu and MIR, whereas those of LINEs are L1 and L2. (B) Basic classifications of eukaryotic TEs. LTRs and TIRs are represented by boxes with triangles. The 3’ end sequences shared between LINEs and SINEs are represented by purple boxes. EN, endonuclease; RT, reverse transcriptase.
In general, TEs move throughout the host genome without providing a beneficial effect and have been traditionally regarded as selfish DNA, genetic parasites, or a part of “junk DNA” (Orgel and Crick, 1980), with some exceptions in bacteria (MacHattie and Jackowski, 1977). Indeed, almost all TE insertion events are considered neutral mutations because random nucleotide substitutions accumulate in the TE sequences after transposition (Lunter et al., 2006; Cordaux and Batzer, 2009). As a negative effect, TE insertions can occasionally cause genetic defects (Hancks and Kazazian, 2016). In contrast, given that TEs occupy a large fraction of the genome, it was proposed long ago that they have a beneficial function such as gene regulation (Britten and Davidson, 1969, 1971). Studies have indeed revealed a number of functional elements derived from TEs.
This paper is an overview of TEs as a component of and a driving force for the diversification of genomes in animals. In addition, several cases of TE exaptation (Brosius and Gould, 1992), in which these elements acquire functions that have an advantageous effect on the host in mammals, are reviewed. Finally, I discuss the possibility that TE exaptations have frequently occurred and thus have contributed to the evolution of many genetic and morphological features in various animal lineages.
There are several hundred kinds of TE families in humans, and over 1,200 of their consensus sequences are available in Repbase (Bao et al., 2015; Kojima, 2018). A large proportion of TEs consist of Class I elements (retrotransposons), such as SINEs, LINEs and LTR-retrotransposons, in the human genome (Fig. 1A). Among them, LINEs make up the largest fraction, and L1, in particular, is one of the most active TEs in the human genome (Moran et al., 1996). Also, Alu is an active SINE family that can retrotranspose using enzymes encoded by L1 (Dewannieux et al., 2003), and no less than 10% of the human genome is composed of a million copies of Alu elements (International Human Genome Sequencing Consortium, 2001). Among the many kinds of human TEs, only some of the families, such as L1 and Alu, are currently active (Kojima, 2018), whereas the vast majority of TE families are no longer active and can be observed only through their ancient remnants in the human genome. The L1 family is thought to have been active over 100 million years ago, and such insertions have been detected among distantly related species that diverged around that time (Churakov et al., 2009; Nishihara et al., 2009). Almost no TEs that were inserted before the last common ancestor of amniotes (310 million years ago) have been detected in the human genome other than TE sequences that are evolutionarily conserved. This observation suggests the possibility that most non-coding sequences, other than TEs, are remnants of TEs that can no longer be detected (Brosius, 2014).
During evolution, new TE families emerge and become transpositionally active. Many lineage-specific families of SINEs and LTR-retrotransposons have emerged during mammalian evolution, resulting in a large difference in TE composition among species. As shown in Fig. 2, over 100 SINE families have been identified in vertebrates to date (Ohshima and Okada, 2005; Nishihara and Okada, 2008; Vassetzky and Kramerov, 2013; Bao et al., 2015). For example, a number of CAN SINE (SINEC) sequences are observed in the dog genome, whereas the opossum harbors Opo-1, marsupial Mar1, and Mar3 (Fig. 2A). These SINE families have emerged independently, and their distribution is restricted in each clade at the order/family level in mammals (Fig. 2; Nishihara and Okada, 2008). Thus, although the overall prevalence of SINEs is similar among the human, dog and opossum genomes (>10% of each genome), the specific sequences vary considerably. In birds, only a few active SINE families are known in spite of the availability of nearly complete genome sequences for dozens of species. Bird genomes seem to harbor fewer kinds of TEs compared with other vertebrates, and TEs account for only 4–10% of avian genomes (Zhang et al., 2014). In other vertebrates, it is generally observed that multiple SINE families coexist in the genome (Fig. 2B), which may reflect the higher diversity of LINE families in these species (Nikaido et al., 2013). Thus, it is no exaggeration to say that TEs contribute to the genetic identity of the primary sequences of host genomes.
Fig. 2. Distribution of SINE families along with the phylogeny of mammals (A) and non-mammalian vertebrates (B). Synonyms are shown in parentheses.
Genome sizes of mammals range from 2 to 3.5 Gbp, and ~30–50% of these genomes are composed of TEs. If the human genome (3 Gbp) contained no TEs, it would be half its current size. Notably, the proportion of TEs is roughly correlated with the genome size in mammals (Fig. 3A). This suggests that the main factor for the increase in genome size is amplification of retrotransposons, which is consistent with a report that changes in genome size can be explained by a balance between the dynamic gain and loss of DNA through TE expansion and large segmental deletions, respectively (Kapusta et al., 2017). More importantly, given the distribution of many kinds of SINE families within each clade (Fig. 2), they may have a large impact on both genome diversification and genome size.
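The arithmetic behind the "half its current size" statement can be written out explicitly. This is a minimal sketch: the function name `size_without_tes` is hypothetical, and the 50% TE fraction is the approximate figure quoted above, not a measured value.

```python
def size_without_tes(genome_size_gbp, te_fraction):
    """Genome size (Gbp) that would remain if all TE-derived sequence were removed."""
    if not 0.0 <= te_fraction <= 1.0:
        raise ValueError("te_fraction must be between 0 and 1")
    return genome_size_gbp * (1.0 - te_fraction)


# Human genome: about 3 Gbp, roughly half of which is TE-derived.
human_remainder = size_without_tes(3.0, 0.5)
print(human_remainder)  # 1.5
```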
Fig. 3. TEs and genome sizes in animals. (A) The proportion of each class of TEs in mammalian genomes and the genome assembly sizes. The species are arranged in descending order of their TE content. TE content data were obtained from the RepeatMasker website (http://www.repeatmasker.org/genomicDatasets/RMGenomicDatasets.html). (B) Correlation between genome size and the proportion of TEs in animals. The data were retrieved from the original reports of genome sequencing projects. Vertebrata: human, mouse, opossum, platypus, chicken, alligator, painted turtle, lizard, frog, axolotl, coelacanth, Takifugu, Nile tilapia, Atlantic salmon, zebrafish, spotted gar and elephant shark; Deuterostomia: lamprey, lancelet, tunicate, sea urchin and acorn worm; Cnidaria: coral, Hydra and starlet sea anemone; Placozoa: Trichoplax; Ctenophora: comb jelly; Insects: mountain pine beetle, silkworm, honey bee, fruit fly, locust, dampwood termite, Blattella germanica and American cockroach; Crustaceans: amphipod and water flea; Chelicerates: tick, spider mite and velvet spider; Myriapoda: centipede; Molluscs: owl limpet, snail and octopus; Platyhelminthes: tapeworm; Nematodes: Caenorhabditis elegans; Annelids: freshwater leech and marine polychaete; Brachiopoda: brachiopod; Rotifera: rotifer.
Furthermore, among animals, the genome size and proportion of TEs are diverse, ~10⁸–10¹⁰ bp and 0.1–62%, respectively. Correlation between the proportion of TEs and the host genome size is also widely observed (Fig. 3B), and TEs may have been the most important determinant of genome size in these species. However, it is important to note that in some species repeat content may be underestimated because of a lack of sufficient characterization and accurate manual curation of TEs (Platt et al., 2016). Positive correlations between TE fractions and genome sizes are also observed in other eukaryotes (Kidwell, 2002; Tenaillon et al., 2010; Elliott and Gregory, 2015). An interesting recent finding is that species-specific amplification of SINEs caused genome expansions in larvaceans, a subgroup of tunicates (Naville et al., 2019). Thus, TEs have largely contributed to an increase in both the size and diversity of genomes during animal evolution.
In general, TEs have no essential function for host survival, and most cases of their insertion are considered to be neutral mutations (Lunter et al., 2006). However, some TE insertion events do provide either negative or positive effects on genetic function (Fig. 4A). For example, although transposition of TEs is repressed by epigenetic mechanisms in most somatic cells, accidental de novo TE insertions (mostly Alu and L1) within protein-coding exons can cause disorders, such as cancers (Helman et al., 2014; Hancks and Kazazian, 2016). Even when they are located outside exons, TEs can cause unequal homologous recombination between copies, occasionally resulting in genomic deletions associated with human diseases (Deininger and Batzer, 1999; Sen et al., 2006). In contrast, some TEs have been co-opted/exapted and utilized as exons. For instance, >100 copies of Alu and MIR (mammalian-wide interspersed repeat) elements have been identified as exonized in the human genome (Fig. 4A(a); Lev-Maor et al., 2003; Krull et al., 2005, 2007). Alu and other TEs in 3’ UTRs of genes can also provide a polyadenylation signal, whereas those located in introns can result in truncated RNAs (Fig. 4A(b); Lee et al., 2008; Chen et al., 2009).
Fig. 4. The evolutionary impact of TEs on genes and cis-regulatory elements. (A) Some TEs provide protein-coding exons (a), polyadenylation signals (b) or epigenetic regulation for gene expression (c). As cis-regulatory elements, some TEs can serve as promoters (d), enhancers (e) or insulators (f). (B) Retrotransposons can increase and spread potential transcription factor binding sites (TFBSs) that can be source sequences of cis-regulatory elements.
Epigenetic regulation of TE sequences can affect the expression level of neighboring genes (Fig. 4A(c)). For example, whereas rodent B1 SINEs are generally hypermethylated, those proximal to promoters are hypomethylated (Ichiyanagi et al., 2011). Given the enrichment of B1 in the promoters of genes with testis-specific expression, these SINEs may be involved in the regulation of their associated genes (Ichiyanagi et al., 2011). In addition, some TEs act as promoters for the transcription of proximal exons and have contributed to transcriptional changes in the host (Faulkner et al., 2009; Emera and Wagner, 2012). Interestingly, L1 possesses an antisense promoter activity in the 5’ UTR (in human) or ORF1 (in mouse) and contributes to the transcription of many genes (Speek, 2001; Li et al., 2014). Furthermore, the human L1 5’ UTR itself is also translated, and thousands of such potential ORFs are present in the human genome (Denli et al., 2015). Although the importance of their biological functions remains unclear, it is likely that TEs such as L1 have large effects on the transcriptome and proteome in humans. Remarkably, L1 is retrotranspositionally active in neuronal cells, unlike in other somatic cells (Muotri et al., 2005; Faulkner and Garcia-Perez, 2017). The genome mosaicism caused by L1 retrotranspositions may provide differences in somatic cell transcriptomes or proteomes and affect neuronal plasticity in humans (Singer et al., 2010).
Some ORFs encoded by TEs such as LTR-retrotransposons have been exapted to serve as protein-coding genes (domestication) that are responsible for mammalian morphogenesis. Peg10 (Sirh1) and Peg11 (Sirh2/Rtl1) are both paternally expressed imprinted genes derived from LTR-retrotransposons domesticated in a common ancestor of Theria and Eutheria, and were revealed to be responsible for the formation and functional maintenance of the placenta, respectively (Ono et al., 2006; Sekita et al., 2008). Another important example is the mammalian syncytin genes derived from endogenous retroviruses (ERVs). ERVs are a kind of TE and are considered to have originated evolutionarily from LTR-retrotransposons (Xiong and Eickbush, 1990). Human syncytin genes are involved in trophoblastic cell fusion during placental morphogenesis (Mi et al., 2000; Malassiné et al., 2007). Remarkably, syncytin genes found in various mammals, such as mice and cows, arose independently in their respective lineages via domestication of ERVs. Thus, LTR-retrotransposons have contributed to both common and lineage-specific features in placental formation. In contrast, Arc proteins derived from gag of LTR-retrotransposons were revealed to form virus-like capsids in mammalian neurons (Pastuzyn et al., 2018). As these proteins can mediate intercellular transfer of Arc mRNA, it is possible that Arc is involved in the intercellular communication of genetic information via mRNA transfer between neurons in the brain (Pastuzyn et al., 2018). In addition, in other animals, LTR-retrotransposon sequences can play a variety of important biological roles (for details, see Horie, 2019).
TEs have had a great impact on cis-regulatory evolution. In this section, I review studies about TE contributions in mammals to 1) the evolution of a novel enhancer involved in morphological innovation, 2) an increase in binding sites of CTCF, a chromatin architectural protein, through retrotransposition leading to an expansion of lineage-specific chromatin boundaries, 3) an increase in binding sites for transcription factors leading to an expansion of potential cis-regulatory elements, and 4) early embryonic development.
After TEs become integrated into the host genome, some may acquire biological functions (exaptation or co-option), such as promoters and enhancers, and become involved in the regulation of proximal genes (Fig. 4A(d, e)). There are hundreds of thousands of evolutionarily conserved non-coding elements (CNEs), which are putative functional elements (Bejerano et al., 2004; Siepel et al., 2005). Some CNEs overlap with TEs (Bejerano et al., 2006; Nishihara et al., 2006; Gentles et al., 2007; Lowe et al., 2007; Mikkelsen et al., 2007), suggesting that some TEs are involved in cis-regulatory alterations and possibly in morphological changes during evolution. Indeed, several copies of TEs serve as enhancers or promoters (Bejerano et al., 2006; Santangelo et al., 2007; Sasaki et al., 2008; Franchini et al., 2011; Emera and Wagner, 2012). For example, >100 elements of AmnSINE1, an anciently inactivated SINE family in amniotes, were found to overlap with CNEs (Nishihara et al., 2006; Sasaki et al., 2008; Hirakawa et al., 2009), and some of them serve as distal enhancers for genes involved in mammalian morphogenesis. One CNE derived from AmnSINE1 is an enhancer for fgf8 expression in the diencephalon (Sasaki et al., 2008; Nakanishi et al., 2012), while another acts as a distal enhancer of satb2 in the neocortex and may be involved in the formation of the corpus callosum, a eutherian-specific brain structure (Tashiro et al., 2011). Furthermore, a CNE derived from AmnSINE1 and two DNA transposons (X6b_DNA and MER117) act together as a distal enhancer of wnt5a during palatogenesis (Fig. 5A; Nishihara et al., 2016a). Wnt5a is responsible for complete closure of the secondary palate, which is a mammal-specific morphological feature (He et al., 2008; Angielczyk, 2009). 
As a closed secondary palate is required to allow neonates to suckle successfully (Erkan et al., 2013), acquisition of this feature is considered one of the most innovative morphological changes in a common ancestor of mammals. In addition to AmnSINE1, other TEs have also been broadly involved in cis-regulatory evolution in vertebrates, such as a neuronal enhancer derived from LF-SINEs (Bejerano et al., 2006), two mammalian neuronal enhancers that independently evolved from distinct TEs (Franchini et al., 2011), and an MER39 element that acts as a prolactin gene promoter (Emera and Wagner, 2012). Thus, many exaptation events have occurred during evolution, and alterations to gene expression patterns by TEs may have contributed to morphogenetic evolution (Okada et al., 2010).
Fig. 5. Four examples of TE exaptation. (A) Three TEs, AmnSINE1, X6b_DNA and MER117, have transposed stepwise during mammalian evolution and coordinately act as a distal enhancer for wnt5a expression in the frontonasal region (arrowhead) and the secondary palate in mice. (B) Rodent B2 SINEs contain CTCF binding sites, which may contribute to TAD formation. (C) MER41Bs contain a binding site for STAT1 and have contributed to cis-regulatory evolution involved in innate immunity through transcriptional control of related genes such as aim2 and apol1. (D) L2s include binding sites for four transcription factors (ERα, FoxA1, GATA3 and AP2γ) and have provided many regulatory elements during mammary gland evolution.
Some TEs act as insulators in mammals (Fig. 4A(f); Lynch et al., 2011; Wang et al., 2015). For example, murine B2 SINEs possess a CTCF binding site in their consensus sequence (Lunyak et al., 2007; Schmidt et al., 2012). Because CTCF binding sites act as insulators that show an enhancer-blocking property, this finding suggests that B2 has increased the number of potential insulator elements in the genome (Fig. 5B). CTCF proteins are highly enriched at the boundaries of topologically associating domains (TADs) and are involved in the demarcation of transcriptionally active/inactive domains (Dixon et al., 2012). B2 is also enriched in TAD boundary regions (Dixon et al., 2012), and an increase in CTCF binding sites as a result of B2 retrotransposition during evolution might have influenced genome organization in the nucleus. Because the distribution of the B2 family is restricted to a subset of rodents (Fig. 2A), the spread of the binding sites by this SINE occurred only in this group. Notably, other mammalian SINEs such as CAN (SINEC) in carnivores and Mar1 in marsupials also possess CTCF binding sites (Schmidt et al., 2012). Not only SINEs but also ERVs are responsible for the demarcation of TADs. For example, de novo insertion of HERV-H elements introduces new TAD boundaries, and deletion of HERV-H elements causes elimination of TAD boundaries in human pluripotent stem cells (Zhang et al., 2019). Thus, a wide variety of TEs may be involved in the lineage-specific dynamics of higher-order chromatin structure and gene regulatory networks via expansion of potential chromatin boundaries.
Furthermore, some TE families possess binding sites for transcription factors in their consensus sequences (e.g., Bourque et al., 2008; Kunarso et al., 2010; Chuong et al., 2016; Nishihara, 2019). Various LTR-retrotransposons have contributed to an increase in the number of binding sites for various transcription factors such as Oct4 (Jacques et al., 2013; Sundaram et al., 2017), Sox2 (Bourque et al., 2008; Jacques et al., 2013; Sundaram et al., 2017) and Elf5 (Chuong et al., 2013). For example, MER41B is an LTR-retrotransposon family and possesses a STAT1 binding site in its consensus sequence (Chuong et al., 2016). Hundreds of MER41B copies are bound by STAT1 in human cells, and some of these indeed serve as enhancers of proximal interferon-stimulated genes, such as aim2 and apol1 (Fig. 5C). MER41 elements, which have expanded independently in multiple mammalian clades, may have contributed to lineage-specific cis-regulatory changes involved in innate immunity (Chuong et al., 2016).
These findings support the Britten–Davidson model in which TEs can generate many source sequences for regulatory elements via transposition (Fig. 4B; Britten and Davidson, 1969, 1971; Sundaram and Wang, 2018). Furthermore, it has been proposed that TEs have altered gene regulatory programs, which could lead to the modification of developmental systems (Britten and Davidson, 1971). This proposal can be verified by identifying TEs that possess binding site(s) for a master regulator of a gene regulatory network responsible for morphogenesis, because, according to this model, TEs could increase the number of developmental enhancers having the same or similar functions, leading to an expansion of downstream genes of the regulator (Fig. 4B). For example, initial development of the mammary gland, a morphological feature characteristic of mammals, requires binding of ERα and related pioneer factors (FoxA1, GATA3 and AP2γ) to many enhancers and promoters to induce a dramatic change in the expression of downstream genes (Manavathi et al., 2014). Dozens of kinds of human TEs possess binding motifs for these four transcription factors, and indeed thousands of TE copies are bound by these factors and exhibit enhancer-specific chromatin states (Nishihara, 2019). These TEs, such as MIR and L2, may have had a substantial impact on cis-regulatory evolution associated with mammary gland development. Notably, L2 elements, which are ancient mammalian LINEs, most likely increased and spread the binding sites of all four factors and thus contributed extensively to the production of potential enhancers (Fig. 5D). In a similar way, it is possible that a large number of TEs have been involved in an expansion of gene regulatory networks leading to morphological innovation.
Recent studies have revealed that TEs are also involved in early embryogenesis in mammals. One type of ERV (HERVs) is highly expressed in human embryonic stem (ES) cells (Wang et al., 2014). Long non-coding RNAs derived from HERVs determine ES cell identity (Lu et al., 2014) and regulate pluripotency in early embryonic development by serving as a key component of the regulatory network (Durruthy-Durruthy et al., 2016). Also, HERVs, as well as other evolutionarily young TEs, serve as enhancers in human ES cells and significantly contribute to genome activation during human early embryogenesis (Lu et al., 2014; Pontis et al., 2019). Similarly, transcription of L1 is activated in the early mouse embryo. Silencing of L1 leads to a decrease in chromatin accessibility and to developmental delay, indicating that activation of L1 regulates chromatin opening and is essential for normal embryonic development (Jachowicz et al., 2017). In mouse gonocytes of a certain stage, TEs such as L1 are enriched in transiently accessible genomic regions, and transcriptional upregulation of evolutionarily young L1 elements occurs in these regions (Yamanaka et al., 2019). This result suggests that TEs also play a role in chromatin accessibility in the germline. Thus, TEs such as ERVs and L1 are deeply involved in the early stage of development in mammals.
The genomic distribution of TEs may be associated with nuclear architecture. LINEs are enriched in AT-rich and gene-poor regions in the genome and in heterochromatin, which corresponds to G-bands of chromosomes (Korenberg and Rykowski, 1988; International Human Genome Sequencing Consortium, 2001). In nuclei, LINEs are enriched in constitutive lamina-associated domains (cLADs) and the nuclear periphery wherein gene expression is largely repressed (Ichiyanagi et al., 2011; Meuleman et al., 2013; Solovei et al., 2016). In contrast, SINEs are enriched in gene-rich euchromatin regions, which correspond to R-bands of chromosomes. In nuclei, SINEs are depleted in cLADs but are observed in the nuclear interior (Ichiyanagi et al., 2011; Meuleman et al., 2013; Solovei et al., 2016). It remains unclear what molecular mechanism causes this mutually exclusive distribution of LINEs and SINEs in the genome. It may be a result of natural selection if SINEs have a regulatory function related to gene expression (Ichiyanagi, 2013).
In addition to TEs, other repetitive sequences make up a portion of the human genome, such as satellites, minisatellites and microsatellites (short tandem repeats; STRs). RepeatMasker annotation indicates that simple repeats occupy 1.5% of the human genome. Most of the simple repeats are found in non-coding sequences, and some of them can be used as genetic markers for population genetics because they represent length polymorphisms among individuals (Rosenberg et al., 2002). Recent studies have revealed that a portion of STRs act as regulatory elements to control gene expression in humans (Gymrek et al., 2016; Hannan, 2018). Remarkably, some satellite repeats are involved in the formation of TADs and larger domains. For example, the inactive X chromosome in mice can be separated into two large mega-domains, and the DXZ4 macrosatellite region located at the boundary of these domains is essential for their formation (Giorgetti et al., 2016). Also, dozens of STRs are associated with diseases, and the disease-associated STRs are enriched at TAD boundaries. Patients with fragile X syndrome with the mutation-length STR exhibit loss of CTCF occupancy, disruption of the TAD boundary, and a reduced expression of the associated gene FMR1 (Sun et al., 2018). Thus, it is likely that not only TEs but also many satellite repeats have contributed to the evolution of the cis-regulatory network and three-dimensional organization of the genome in the mammalian cell nucleus.
One interesting finding under this topic is that in nocturnal mammals, heterochromatin localizes to the central region of the nucleus in rod cells and functions like a lens to send light efficiently to the outer segments, whereas it is distributed mainly in the periphery of the nucleus in rod cells of diurnal mammals (Solovei et al., 2009). In mice, constitutive heterochromatin, which can be marked by major satellite repeats (MSRs), localizes to the central region of the nucleus in rod cells (Solovei et al., 2009). The rod cells of owl monkeys, the only genus of nocturnal/cathemeral simian primates, show a spherical heterochromatin block in the central region of the nucleus, and a primary component of the heterochromatin region is the OwlRep, a megasatellite DNA that has expanded specifically in the owl monkey lineage (Koga et al., 2017; Nishihara et al., 2018). Thus, because megasatellites as a whole can be a major component of heterochromatin, lineage-specific expansion of satellite repeats might have had an impact on the nuclear architecture associated with nocturnal adaptation in this lineage.
Together, TEs exhibit a large diversity in their transpositional machinery, structure and sequences (Fig. 1), and >1,200 consensus sequences of TEs have been reported at the subfamily level in humans. Although TEs have generally been regarded as junk/selfish DNA, it has been reported primarily in mammals that a subset of TEs has pivotal or assistive roles in gene regulation leading to morphogenesis (Figs. 4 and 5). However, it remains largely unknown how many of the total 4.5 million copies of TEs in the human genome are involved in gene regulation and morphogenesis. Apart from mammals, some TEs also have positive or negative effects on gene expression and contribute to morphological changes in non-mammalian vertebrates (Santos et al., 2014) and invertebrates (van’t Hof et al., 2016). It is expected that there are, in a variety of animals, diverse, lineage-specific exapted TEs that have contributed not only to genome size (Fig. 3) but also to morphological innovation.
Because SINEs exhibit lineage-specific distribution at the mammalian order/family level (Fig. 2), it is reasonable that some of them have provided lineage-specific functions. One of the remarkable features of SINEs is that several SINE families in different species share a highly similar central sequence and therefore can be grouped together as a SINE superfamily (Ogiwara et al., 2002), in contrast to the more typical SINEs that have different sequences in their central regions. SINE superfamilies such as CORE-SINEs (Gilbert and Labuda, 1999, 2000; Munemasa et al., 2008), V-SINEs (Ogiwara et al., 2002), DeuSINEs (Nishihara et al., 2006) and MetaSINEs (Nishihara et al., 2016b) are widely distributed in animals, but the functions of the central sequences of the SINE superfamilies remain unknown. AmnSINE1, some copies of which have been exapted in mammals as described above (Fig. 5A), is a member of the DeuSINE superfamily (Nishihara et al., 2006), and the central sequences of ancient MIRs, a member of the CORE-SINE superfamily, were a source of binding sites for multiple transcription factors (Fig. 4B; Bourque et al., 2008; Rohrmoser et al., 2018; Nishihara, 2019). These findings suggest the possibility that the central sequences of the SINE superfamily members have provided functional source sequences to the genome through retrotransposition and thus may have been frequently exapted in other animals.
To fully understand the TE exaptation landscape in animals, the following three approaches will be required. First, because various types of TEs have been reported and hierarchically classified in mammals (Kojima, 2018, 2019), it can be easily expected that in other animals, hundreds of kinds of TEs may also constitute a large fraction of the genome. An accurate annotation of TEs will be necessary to understand their variety and coverage in animal genomes (Platt et al., 2016), although such studies lag behind the recent exponential increase in whole-genome sequence assemblies. Second, genome-wide identification of functional elements using epigenetic studies such as ChIP-seq and Hi-C and epigenetic editing technologies (Hilton et al., 2015; Jachowicz et al., 2017), as well as genome engineering techniques such as the CRISPR-Cas9 system (Ran et al., 2013), will be applied widely to various non-model organisms in the future. ChIP-seq analyses enabled the identification of a number of TE-derived functional elements that have led to large-scale alteration of cis-regulatory networks (Fig. 5B–5D). Notably, new methods have recently been developed for epigenetic and chromatin interaction analyses in a single cell or a small number of cells in mammals (Nagano et al., 2013; Zhang et al., 2016; Harada et al., 2019). Application of such new technologies for identification of exapted TEs should unveil many cases of lineage-specific and tissue-specific acquisition/modification of gene regulatory networks driven by TE expansion. Third, multidisciplinary studies, including computational genomics, chromosomal analyses, cell biology technology and developmental studies, should be a powerful approach for revealing the multifaceted contributions of a number of TEs to the evolution of a wide variety of animals. 
In particular, studies about nuclear organization and chromosome territory as well as cell biological approaches may reveal a possible crosstalk between the organization of TEs and gene regulation in the nucleus. Thus, an omic atlas of TE exaptations, as revealed via such approaches, will provide a new perspective on the molecular mechanisms for the acquisition and evolution of functional elements that can lead to the emergence of morphological innovations.
This work was partially supported by JSPS KAKENHI and a Naito Foundation Natural Science Scholarship to H. N.