2023 Volume 98 Issue 6 Pages 337-351
Retrotransposons are transposable elements that are transposed via transcription and reverse transcription. Their copies have accumulated in the genome of mammals, occupying approximately 40% of mammalian genomic mass. These copies are often involved in numerous phenomena, such as chromatin spatial organization, gene expression, development and disease, and have been recognized as a driving force in evolution. Different organisms have gained specific retrotransposon subfamilies and retrotransposed copies, such as hundreds of Mus-specific subfamilies with diverse sequences and genomic locations. Despite this complexity, basic information is still necessary for present-day genomic and epigenomic studies. Herein, we describe the characteristics of each subfamily of Mus-specific retrotransposons in terms of sequence structure, phylogenetic relationships, evolutionary age, and preference for A or B compartments of chromatin.
Retrotransposons are transposable elements (TEs) that are transposed via the transcription of their own sequences and reverse transcription of their RNAs. This transposition mechanism is called retrotransposition because it involves a reverse transcription process, and is also called copy-and-paste, because the original sequence remains and a new copy is generated. As a result, retrotransposon sequences have increased in copy number and now occupy a large portion of the mammalian genome: for instance, ~35% and ~40% of the mouse and human genomes, respectively (International Human Genome Sequencing Consortium, 2001; Mouse Genome Sequencing Consortium, 2002). Accumulating evidence suggests that retrotransposons contribute significantly to host evolution.
Retrotransposons are classified into three groups based on their transposition mechanism and sequence features: long interspersed elements (LINEs), short interspersed elements (SINEs) and long terminal repeat (LTR) retrotransposons. LINEs are transcribed by RNA polymerase II (Pol II), and their sequences contain an open reading frame that encodes a protein with endonuclease and reverse transcriptase activities that are necessary for retrotransposition. The 5’ UTR functions as a promoter (Goodier and Kazazian, 2008; Fueyo et al., 2022). SINEs are short noncoding sequences transcribed by RNA polymerase III (Pol III). Their sequences in the 3’ region (either unique sequence or polyA) are homologous to the 3’ region of a LINE, enabling the reverse transcription of SINE RNAs by LINE-encoded reverse transcriptase (Goodier and Kazazian, 2008; Ichiyanagi, 2013; Fueyo et al., 2022). As the name suggests, LTR retrotransposons have LTR sequences at both ends. They encode gag and pol genes, similar to retroviruses. Several LTR retrotransposons also encode the env gene, and are therefore called endogenous retroviruses (ERVs) (Havecker et al., 2004; Goodier and Kazazian, 2008; Stocking and Kozak, 2008; Johnson, 2015; Fueyo et al., 2022).
Each class of retrotransposon contains many families, each family contains subfamilies, and each subfamily contains many genomic copies. Although the original retrotransposon copy and the retrotransposed new copy have the same sequences at the time of transposition, each copy gradually accumulates mutations on an evolutionary time scale. Assuming that the retrotransposed copies (i.e., the progeny of the original copy) have random mutations, their consensus sequence represents the original copy. Therefore, the retrotransposition time for a given retrotransposon copy can be estimated based on the sequence divergence of the copy from the consensus sequence of the subfamily to which it belongs. If a retrotransposon copy is highly similar to the consensus sequence (i.e., low divergence), it was likely transposed recently, whereas a copy showing high divergence from the consensus sequence was likely transposed in the distant past. Sometimes, the former is called a young copy, and the latter is called an old copy. Two species that are phylogenetically closely related share some retrotransposon subfamilies. In this case, such subfamilies were likely amplified in a common ancestor. Each copy of a shared subfamily usually shows a high degree of sequence divergence.
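The divergence-based reasoning above can be illustrated with a minimal Python sketch. It assumes that a copy has already been aligned to its subfamily consensus (equal length, with `-` marking gaps), it ignores indels, and it uses invented toy sequences. The function name `percent_divergence` is hypothetical, and this is not necessarily how the divergence values cited in this review were computed.

```python
def percent_divergence(copy_seq: str, consensus: str) -> float:
    """Percent of aligned, non-gap positions where a copy differs from the consensus."""
    if len(copy_seq) != len(consensus):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(copy_seq.upper(), consensus.upper()):
        if a == "-" or b == "-":
            continue  # gap columns are skipped; indels are ignored in this sketch
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * mismatches / compared


# Invented toy sequences (not real retrotransposon sequences).
consensus = "ACGTACGTACGTACGTACGT"
young_copy = "ACGTACGTACGTACGTACGA"  # 1 mismatch in 20 positions
old_copy = "ACGAACGTTCGTACCTACGA"    # 4 mismatches in 20 positions

print(percent_divergence(young_copy, consensus))
print(percent_divergence(old_copy, consensus))
```

Under these assumptions, the copy with the lower value would be interpreted as the more recently transposed ("young") copy.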
In contrast, some retrotransposon subfamilies are found specifically in a single species. They show a low degree of sequence divergence and thus likely transposed after the split of closely related species. Moreover, some copies of these subfamilies retain their transcriptional and retrotranspositional activities. Herein, we describe retrotransposons specific to the mouse, a mammalian model species.
Based on the phylogeny of reverse transcriptases that they encode, mouse LINEs are grouped into eight families, LINE-1 (L1), L2, CR1, RTE, RTE-X, Dong, Jockey and Tx1, together accounting for 20% of the mouse genome. These families except for L1 have become retrotranspositionally extinct. L1 is the most abundant family in the mouse genome, and some of its subfamilies maintain retrotransposition activity. The 5’ regions of mouse L1 sequences are composed of tandem repeats, and each repeat unit is called a monomer. These monomers function as internal promoters of Pol II transcription. There are several distinct monomer sequences, which characterize each L1 subfamily. The subfamilies L1M, L1M1, L1M2a, L1MA, L1MB, L1MC, L1MD, L1ME and HAL are ancient and are also present in other eutherians. The L1Md groups (Md stands for Mus musculus domesticus), such as L1Md_A, L1Md_Tf and L1Md_Gf, are Mus-specific, and many insertional polymorphisms between strains have been identified (Akagi et al., 2010).
The L1 family is deeply involved in biological phenomena, with or without retrotranspositional activity. For instance, their transcription is involved in developmental programs (Muotri et al., 2005; Fadloun et al., 2013; Blythe et al., 2021), X-chromosome inactivation (Chow et al., 2010) and many diseases (Miki et al., 1992; Morse et al., 1988; Ueno et al., 2016; Song et al., 2021; Takahashi et al., 2022). A correlation between LINE density and regional GC content is observed, as low-GC regions harbor more LINE copies (Fig. 1A). Likewise, in the context of spatial chromatin compartmentalization, LINE density is low in the A compartments, which are nuclear spaces consisting of transcriptionally active chromatin, whereas the B compartments, which are enriched with chromatin having repressive epigenetic modifications, have more LINE copies (Fig. 1B) (Meuleman et al., 2013; Lu et al., 2021). To understand the role of L1 in these biological phenomena, it is necessary to investigate L1 at the subfamily and locus levels.
The current classification of Mus-specific L1 subfamilies is based on the promoter type, length of the polymorphic region (LPR) of ORF1, and sequence differences caused by recombination and mutation (Sookdeo et al., 2013). The promoters of the Mus-specific L1s are grouped into A-, F- and V-types. V-type L1s and many F-type L1s have lost their retrotransposition activity, whereas two F-type subfamilies, Tf- and Gf-types, remain active. In Dfam (Storer et al., 2021) and RepBase (Bao et al., 2015), L1 elements are deposited as L1_5end (containing a 5’ UTR and ORF1), L1_orf2, or L1_3end (containing a 3’ UTR). In the RepeatMasker annotation file in the UCSC genome browser (Lee et al., 2022), neighboring portions are joined together to construct a single element. Due to the substantial revision of the subfamily classification, L1 copies are named differently in the mm10 and mm39 versions of RepeatMasker annotation files in the UCSC genome browser.
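As an illustration of how such annotation files can be summarized per subfamily, the following sketch reads rows in the column layout of the UCSC `rmsk` table and averages the `milliDiv` field (base mismatches in parts per thousand) for each `repName`. The rows are invented toy data, the function name `mean_divergence_by_subfamily` is hypothetical, and the averages reported in this review may have been computed differently.

```python
import csv
import io
from collections import defaultdict

# Column order of the UCSC rmsk table (tab-separated, no header line).
RMSK_COLUMNS = [
    "bin", "swScore", "milliDiv", "milliDel", "milliIns",
    "genoName", "genoStart", "genoEnd", "genoLeft", "strand",
    "repName", "repClass", "repFamily", "repStart", "repEnd",
    "repLeft", "id",
]


def mean_divergence_by_subfamily(handle):
    """Return {repName: mean percent divergence} from rmsk-formatted rows."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    reader = csv.DictReader(handle, fieldnames=RMSK_COLUMNS, delimiter="\t")
    for row in reader:
        sums[row["repName"]] += int(row["milliDiv"]) / 10.0  # per mille -> percent
        counts[row["repName"]] += 1
    return {name: sums[name] / counts[name] for name in sums}


# Invented toy rows; coordinates and scores are illustrative only.
toy_rows = [
    ("585", "5000", "8", "0", "0", "chr1", "1000", "7000", "-1", "+",
     "L1MdA_I", "LINE", "L1", "1", "6000", "0", "1"),
    ("586", "4000", "10", "0", "0", "chr2", "2000", "8000", "-1", "-",
     "L1MdA_I", "LINE", "L1", "0", "6000", "1", "2"),
    ("587", "900", "100", "5", "5", "chr3", "3000", "3150", "-1", "+",
     "B1_Mm", "SINE", "Alu", "1", "147", "0", "3"),
]
toy_table = io.StringIO("\n".join("\t".join(r) for r in toy_rows))
means = mean_divergence_by_subfamily(toy_table)
print(means)
```

Because subfamily names differ between the mm10 and mm39 annotations, the same grouping applied to the two files would yield different `repName` keys for the same L1 copies.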
L1Md_A
L1 copies with A-type monomers are called L1Md_A and have recently been subdivided into L1MdA_I, II and III. The sub-numbering reflects their ages; thus, the L1MdA_I subfamily is the youngest with an average divergence of 0.85% (Fig. 2A and Supplementary Table S1), and the estimated time of their burst is 0.2 million years ago (Mya) (Sookdeo et al., 2013). L1MdA_IV, V, VI and VII subfamilies also exist. L1MdA_IV and _VII were previously classified as “L1MdF2”, whereas L1MdA_V and _VI were previously classified as “L1MdF3”. Promoters of L1Md_A (currently, L1MdA_I, II and III) harbor a high number of CpG sites, which are the targets of DNA methylation (Supplementary Table S1) and are indeed densely DNA-methylated. Accordingly, transcription of L1Md_As is upregulated when DNA methylation is reduced in embryonic stem cells (ESCs) and male germ cells (Tsumura et al., 2006; Inoue et al., 2017). In addition, their chromatin is enriched with H3K9me3, a repressive histone methylation, and a decrease in this mark leads to derepression of L1Md_As (Karimi et al., 2011). Derepression and consequent retrotransposition of these subfamilies have also been observed during early development and in tumors (Vazquez et al., 2019; Gretarsson and Hackett, 2020; Blythe et al., 2021; Gerdes et al., 2022; Kong et al., 2022).
Subfamilies carrying F-type promoters are older than those with A-type promoters and were previously thought to be retrotranspositionally inactive. Later, active copies of the F-type promoters were identified and designated as Tf and Gf. Indeed, a comparison of average sequence divergence confirms that Tf- (1.1–1.8%) and Gf-type (2.0–2.5%) subfamilies are the youngest among F-type subfamilies (3.4–5.8%) (Supplementary Fig. S1, Supplementary Table S1). L1Md_Tf is subdivided into L1MdTf_I, II and III, with estimated burst times of 0.25, 0.27 and 1.23 Mya, respectively. The current L1Md_Gf is subdivided into L1MdGf_I and L1MdGf_II with estimated burst times of 0.75 and 2.16 Mya. Similar to A-type L1s, transcription of the Tf- and Gf-type subfamilies is regulated by DNA methylation and H3K9me3. Tf-type copies not only carry the sense promoter to transcribe the copies, but also harbor an antisense promoter in their ORF1 region (Li et al., 2014). Interestingly, the transcription of these antisense promoters is also regulated by DNA methylation (Inoue et al., 2017). The older F-type subfamilies L1Md_F, L1Md_F2 and L1Md_F3 were reclassified as L1MdF_I, II, III and IV. The majority of L1MdF_V copies were previously classified as “L1VL1.” The ORF2 regions of the Tf and Gf subfamilies are more similar to those of L1Md_A than to those of L1Md_F. This inconsistency is likely explained by recombination events between L1 copies and subsequent expansion of these recombinants. The relationship between the old and new classifications of L1Md is slightly complicated. The older names are listed in Supplementary Table S1.
L1MdV and L1Lx
The V-type promoters, consisting of L1MdV_I, II and III, are older (average divergence of 6.8–11.1%) (Fig. 2A, Supplementary Fig. S1, and Supplementary Table S1) and less active than the A- and F-type promoters (Jubier-Maurin et al., 1992). The estimated burst times of L1MdV_I (a mixture of “L1VL1” and “L1_Mus1” in the previous classification), L1MdV_II (part of “L1_Mus3”) and L1MdV_III (part of “Lx”) are 8.4 to 10.1 Mya. In addition, there are L1Lx_I (part of “L1_Mus3”), _II (part of “L1_Mus4”), _III (part of “L1_Mus4”) and _IV (part of “Lx”) subfamilies with estimated burst times of 10.2 to 14.1 Mya, showing average divergences of 9.9–12.2% (Fig. 2A, Supplementary Fig. S1 and Supplementary Table S1). The Lx-type promoter has also been found in other Rodentia species, suggesting that Mus-specific L1 subfamilies are derived from the Lx-type L1 subfamilies.
L1MdFanc, L1MdMus and L1MdN
Recently, L1MdFanc, L1MdMus and L1MdN were identified (Sookdeo et al., 2013). Promoters of L1MdFanc (representing the ancestral F) are similar to those of the F-type but show distinctive features. L1MdMus has not been previously characterized and is absent from the rat genome. The N-type promoters (L1MdN, N standing for novel) do not show any similarity to the other L1 promoters. L1MdFanc_I (parts of “L1Md_F” and “L1_Mus1” in the previous classification), L1MdFanc_II (part of “L1_Mus2”), L1MdMus_I (part of “L1_Mus1”) and L1MdMus_II (part of “L1_Mus2”) are old subfamilies (burst 6.6–9.3 Mya, average divergences of 6.3–8.4%) (Supplementary Fig. S1 and Supplementary Table S1), whereas L1MdN is as young as L1MdA_II (1.9 Mya, 5.2%).
MusHAL1
MusHAL1 is a Mus-specific HAL1 (half-L1) consisting of ORF1-like and polyA sequences. It constitutes an independent group, although it likely arose from the division of an ancestral L1 copy (Smit, 1999). MusHAL1 in the mouse genome became extinct millions of years ago (with an average divergence of 13.4%) (Fig. 2A and Supplementary Table S1), whereas rat-specific HAL1, named RNHAL1, still retains transposition activity. While the 5’ region is highly similar between RNHAL1 and MusHAL1, the 3’ region is dissimilar (Rat Genome Sequencing Project Consortium, 2004). Interestingly, MusHAL1 copies have accumulated on the Y chromosome (25.3% of total copies). The loss of DNA methylation in ESCs and germ cells does not result in a marked upregulation of MusHAL1 (Karimi et al., 2011; Inoue et al., 2017), which may be related to the fact that there are only 10 CpG sites in its 5’ region, a particularly low number compared to other L1 subfamilies (Supplementary Table S2).
SINEs are non-autonomous TEs that are 100–500 bp long and do not encode proteins. Their retrotransposition is dependent on the enzymes encoded by LINEs present in the same genome. There are >1,000,000 copies of SINE in the mouse genome, making up approximately 8% of the genomic sequence. The SINE sequence has a Pol III promoter called the A-box and B-box; therefore, its transcriptional regulation may be different from that of Pol II-transcribed TEs, although the details remain unknown. There are three groups of SINEs, SINE1/7SL, SINE2/tRNA and SINE3/5S, according to the RNA genes from which they were derived. In mice, SINE1 includes the B1 family, SINE2 includes the B2, B3, B4, ID, MIR, tRNA-Deu and tRNA-RTE families, and SINE3 includes the AmnSINE1 family. Despite their independent retrotransposition, SINEs are enriched in gene-rich genomic regions (Fig. 1A), suggesting their role in gene regulation (Ichiyanagi, 2013). In addition, SINE copies are preferentially retained in the A compartments (Fig. 1B and Supplementary Table S1).
SINE1/7SL group
The B1 SINEs (B1_Mm, B1_Mus2, B1_Mus1, B1_Mur4, B1_Mur3, B1_Mur2, B1_Mur1, B1F2, B1F, B1F1, PB1D7, PB1D10, PB1D11 and PB1) are approximately 150 bp in length and originate from the first 85-bp region (containing the Pol III promoter sequence) of the 7SL RNA gene. The DNA sequences of these SINEs, and consequently their RNA sequences, contain a polyadenylate sequence that is recognized by an L1-encoded reverse transcriptase for retrotransposition (Dewannieux and Heidmann, 2005). PB1 represents proto-B1, and is the ancestor of recent B1 SINEs and human Alu SINEs. Average divergences of PB1 subfamilies are 21.2–27.1% (Supplementary Fig. S1 and Supplementary Table S1). B1_Mm, B1_Mus2 and B1_Mus1 (average divergence of 8.4–10.7%) (Fig. 2B, Supplementary Fig. S1 and Supplementary Table S1) are specific to the mouse and have retained retrotranspositional activity, creating insertional polymorphisms between mouse strains (Akagi et al., 2010; Ichiyanagi et al., 2021). While B1 expression is normally limited to the testes in adult tissues (Ichiyanagi et al., 2011; Mori and Ichiyanagi, 2021), heat-shock treatment, DNA damage reagents and viral infection induce B1 expression (Liu et al., 1995; Li et al., 1999; Rudin and Thompson, 2001; Williams et al., 2004). Such transcribed B1 RNA inhibits the transcriptional activity of Pol II (Mariner et al., 2008). B1 DNA serves as a binding site for many transcriptional regulators. A subtype of B1 binds to an aryl hydrocarbon receptor and SNAI2, which underlies the insulating activity of these copies (Roman et al., 2008; Román et al., 2011). B1 also binds to KRAB zinc-finger proteins ZFP92, ZFP266 and ZFP819, and to the orphan nuclear receptor Nr5a2, resulting in the regulation of totipotency and pluripotency (Tan et al., 2013; Gassler et al., 2022; Kaemena et al., 2023; Osipovich et al., 2023).
B1 has also been shown to regulate the DNA methylation status of neighboring regions (Ichiyanagi et al., 2011; Estécio et al., 2012).
SINE2/tRNA group
This group includes B2 and B3 SINEs (B2_Mm1a, B2_Mm1t, B2_Mm1o, B2_Mm2, B3 and B3A; note that B2_Mm1o was recently identified and is not present in the UCSC RepeatMasker table for mm39) as well as B4, ID, MIR, tRNA-Deu and tRNA-RTE. B2 and B3 are approximately 190 and 210 bp in length, respectively, and their first 130-bp regions share homology, 70 bp of which originate from the tRNA gene containing the Pol III promoter sequence. B3 and B3A are evolutionarily older elements (average divergence of 23.3–26.8%), whereas B2_Mm1a, B2_Mm1t, B2_Mm1o and B2_Mm2 are younger (5.5–11.2%) (Fig. 2B, Supplementary Fig. S1 and Supplementary Table S1). Indeed, some loci of the B2_Mm1a/t/o subfamilies are insertionally polymorphic in mouse strains (Akagi et al., 2010; Ichiyanagi et al., 2021). Their DNA and RNA sequences also contain a polyadenylate sequence required for retrotransposition by L1-encoded reverse transcriptase (Dewannieux and Heidmann, 2005). While B2 expression is also normally limited to the adult testis (Ichiyanagi et al., 2021; Mori and Ichiyanagi, 2021), heat-shock treatment, DNA damage reagents and viral infection induce B2 expression (Liu et al., 1995; Li et al., 1999; Rudin and Thompson, 2001; Williams et al., 2004; Karijolich et al., 2017). Similar to B1 and Alu RNAs, B2 RNA can inhibit the transcriptional activity of Pol II (Allen et al., 2004; Espinoza et al., 2007; Mariner et al., 2008). Interestingly, upon heat shock, the polycomb protein EZH2, which usually functions as a transcriptional repressor, is recruited to the promoter regions bound by B2 RNA and enhances the cleavage of B2 RNA, resulting in the release of transcriptional repression (Zovoilis et al., 2016; Hernandez et al., 2020). B2 and B3 DNAs also serve as binding sites for several transcriptional regulators.
These DNA sequences bind to CTCF, a pivotal player in chromatin organization, and form chromatin boundaries (Schmidt et al., 2012; Thybert et al., 2018; Kaaij et al., 2019; Ichiyanagi et al., 2021; Gualdrini et al., 2022). Such an effect is partially restricted by histone H3K9me3 modifications, as Setdb1 disruption increases the number of CTCF-bound B2 loci in ESCs (Gualdrini et al., 2022). Binding of the ChAHP complex can also compete with CTCF binding (Kaaij et al., 2019; Han et al., 2021). B2 DNA carries a TATA box; therefore, it binds to TBP and Pol II to drive Pol II transcription in an orientation opposite to that of Pol III transcription (Ferrigno et al., 2001; Lunyak et al., 2007), which is also suggested to be involved in the formation of chromatin boundaries. It was recently shown that many copies of B2_Mm2 bind to STAT1 and likely generate mouse-specific interferon-inducible genes upon viral and microbial infection (Horton et al., 2023).
Other families (not Mus-specific)
SINE families of B4 (B4, B4A and RSINE1) and ID (ID, ID2, ID4, ID4_v and ID_B1) are older (average divergence of 25.2–29.2%) (Supplementary Fig. S1 and Supplementary Table S1) than B1 and B2, but rodent-specific. B4 is 294 bp in length, of which the first 80-bp region is derived from tRNA and the last 145-bp region is homologous to B1. RSINE1 has a 140-bp internal deletion in B4 (position 80–220 in B4) but retains the tRNA-derived region. ID families, which are approximately 80 bp long, are composed of only a tRNA-derived region.
Subfamilies of MIR (MIR, MIR1_Amn, MIR3, MIRb and MIRc), tRNA-Deu (AmnSINE2), tRNA-RTE (MamSINE1) and tRNA (LFSINE_Vert) families are all tRNA-derived SINEs, while AmnSINE1 (subfamily of the 5S-Deu-L2 family) is derived from the 5S rRNA gene. The 3’ regions of MIR and AmnSINE1 are homologous to L2, and that of MamSINE1 is homologous to an RTE LINE, suggesting that L2- and RTE-encoded enzymes are involved in retrotransposition. AmnSINE2 and LFSINE_Vert do not show homology with known LINEs. These SINEs are found in the genomes of other mammals (MIR, MIR3, MIRb, MIRc and MamSINE1), amniotes (MIR1_Amn, AmnSINE1 and AmnSINE2) and tetrapods (LFSINE_Vert). Copies from these families have accumulated mutations; therefore, they are not expressed even in germ cells (Mori and Ichiyanagi, 2021). However, they significantly contribute to the genomic DNA sequence. Many copies of AmnSINE1 have been shown to serve as developmental enhancers of genes involved in mammal-specific traits (Nishihara et al., 2006, 2016; Sasaki et al., 2008; Tashiro et al., 2011; Nakanishi et al., 2012). A copy of LFSINE_Vert also exhibits enhancer activity (Bejerano et al., 2006). Experiments with human cells have demonstrated that MIR copies serve as a CTCF-independent insulator (Wang et al., 2015), an enhancer (Jjingo et al., 2014; Zeng et al., 2020) and a binding site for estrogen receptor α (Nishihara, 2019) and ZFP768 (Rohrmoser et al., 2019).
The LTR retrotransposon sequence consists of an internal protein-coding region and two direct repeats (LTRs), one at each end (Table 1). Families carrying gag (encoding structural proteins for viral particles), pol (encoding reverse transcriptase, integrase and proteinase) and env (encoding an envelope protein) are called endogenous retroviruses (ERVs) in vertebrates. Their replication mechanism is analogous to retroviral replication. The full-length internal sequences of members of the LTR class are LTR-int, whereas the LTR sequences are named LTR. The LTR sequence consists of U3 (unique to 3’), R (regulatory) and U5 (unique to 5’) regions. The R region carries a polyA signal, transcription factor-binding sites and a Pol II promoter. Thus, the 5’ LTR drives the transcription of the internal region. The RNA carries a primer-binding site (PBS) complementary to the 3’ sequence of tRNA in the host. This tRNA serves as a primer for reverse transcription; therefore, the PBS is necessary for retrotransposition. ERVs and exogenous retroviruses are two possible states of these elements that can interchange during evolution, and several retroviruses, such as murine leukemia virus (MuLV), are considered to have originated from ERVs in the host genome (Khan and Martin, 1983; Stocking and Kozak, 2008; Kozak, 2014). However, we note that many families currently lack env genes. Mammalian LTR retrotransposons are classified into three families based on their retroviral group: ERV1, which is similar to Gammaretroviruses, ERV2/ERVK, which is similar to Alpharetroviruses and Betaretroviruses, and ERV3/ERVL, which is similar to Spumaretroviruses. Similar to LINEs, LTR transcription by Pol II is repressed by epigenetic mechanisms, including DNA methylation and H3K9me3. However, some LTR sequences provide tissue-specific enhancer activity to host genes (Jern and Coffin, 2008; Fueyo et al., 2022). 
Moreover, they have provided new functional genes during mammalian evolution, such as Syncytin-A, Syncytin-B, Peg10 and Peg11 (Kaneko-Ishino and Ishino, 2012; Mager and Stoye, 2015; Kitazawa, 2023). In contrast to the striking preference for a particular genomic location of LINEs (B compartments) and SINEs (A compartments), LTR elements accumulate in both compartments (and GC isochores) depending on the subfamilies (Fig. 1A and 1B).
Internal sequence | Main LTR sequences |
---|---|
RLTR4_MM-int | RLTR4_MM |
MuRRS-int | LTRIS_Mus and LTRIS_Mm |
MuRRS4-int | MURVY-LTR, LTRIS2 and LTRIS4 |
MURVY-int | RLTR5_Mm |
MMERGLN-int | MMERGLN_LTR |
MMVL30-int | RLTR6_Mm and RLTR6C_Mm |
RLTR6-int | RLTR6B_Mm |
MERV1_I | MERV1_LTR |
IAPEz-int | IAPLTR1a_Mm, IAPLTR1_Mm and RLTR27 |
MMERVK10C-int | RLTR10C and RLTR27 |
MMERVK10D3_I | MMERVK10D3_LTR |
ETnERV-int | ERVB7_1-LTR_MM, RLTR13G and ERVB4_1C-LTR_Mm |
MMETn-int | RLTRETN_Mm, RLTR13G and ERVB4_1C-LTR_Mm |
SRV_MM-int | RLTR8 |
MERVL-int | MT2_Mm and MT2C_Mm |
MT-int | MTA_Mm |
ERV1 includes MuLV, MMERGLN, MERV1, MMVL30, nine subgroups of LTRIS and 28 subgroups of RLTR, including RLTR1, 4, 5, 6, 24, 30, 41, 47 and 48.
MuLV
MuLV (murine leukemia virus) is a gammaretrovirus that appeared 150 Mya, which remains infectious today and exists in the genomes of the Mus genus (Kozak, 2014). This element invaded the genome mostly through its infection of female germ cells, but few endogenous copies have spread (Kozak, 2014). Their internal sequence is deposited as MuLV-int and the LTR sequence is RLTR4_MM (see below). Owing to its relatively ancient origin, the average divergence of MuLV-int is 24.6% (Fig. 2C and Supplementary Table S1). MuLV-int and RLTR4_MM are enriched slightly in the A compartments (Supplementary Table S1).
RLTR4_MM-int and RLTR4_MM
RLTR4_MM was originally identified as an LTR of MuLV (Amanuma et al., 1988). RLTR4_MM-int (registered in RepBase but not yet in Dfam) was then identified as an element that is homologous to MuLV-int and was shown to have RLTR4_MM as its LTRs. A phylogenetic tree of ERV1 LTR sequences suggests that RLTR4_MM originated from the LTRIS subfamilies (Supplementary Fig. S2). In the mouse genome, several copies are located adjacent to MERV1_I, IAPEz-int, MuRRS-int and MMVL30-int (Supplementary Table S1). RLTR4_MM-int does not show a preference for A or B compartments (Supplementary Table S1).
MuRRS-int and LTRIS_Mm/LTRIS_Mus, MuRRS4-int and LTRIS4
Murine retrovirus-related sequences, MuRRS-int and MuRRS4-int, have been identified as elements with LTRIS (or LTRIS4) as their LTR sequences (Schmidt et al., 1985). MuRRS4-int accumulates on the Y chromosome, although to a lesser extent than MURVY (see below) (Supplementary Table S1).
MURVY-int and MURVY-LTR/RLTR5_Mm
MURVY (murine retrovirus on the Y chromosome) was identified as an element localized on the Y chromosome in the mouse genome, and consists of MURVY-int and the LTR regions, MURVY-LTR or RLTR5_Mm (Phillips et al., 1982; Hutchison and Eicher, 1989; Fennelly et al., 1996). As the Y chromosome has a low gene density and very limited similarity to the X chromosome, the selection pressure is very low, even for deleterious insertions. Therefore, retrotransposon copies transposed into this chromosome are considered more likely to be retained than those in other chromosomes. Even so, the exceptionally high enrichment of MURVY (90% of the total copies are on the Y chromosome; Supplementary Table S1) is very interesting in terms of evolution and development.
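The chromosome-level enrichment described here amounts to a simple proportion: the number of copies of a subfamily on the Y chromosome divided by its total copy number. A minimal sketch follows, assuming a list of (chromosome, subfamily) pairs taken from an annotation table; the toy copy counts are invented for illustration and are not the values in Supplementary Table S1, and the function name `chromosome_fraction` is hypothetical.

```python
from collections import Counter


def chromosome_fraction(copies, chrom="chrY"):
    """Return {subfamily: fraction of its copies located on `chrom`}."""
    totals = Counter()
    on_chrom = Counter()
    for chrom_name, subfamily in copies:
        totals[subfamily] += 1
        if chrom_name == chrom:
            on_chrom[subfamily] += 1
    return {name: on_chrom[name] / totals[name] for name in totals}


# Invented toy annotation: 9 of 10 MURVY-int copies and 1 of 4 MERV1_I copies on chrY.
toy_copies = (
    [("chrY", "MURVY-int")] * 9
    + [("chr3", "MURVY-int")]
    + [("chrY", "MERV1_I")]
    + [("chr1", "MERV1_I")] * 3
)
fractions = chromosome_fraction(toy_copies)
print(fractions)
```

A fuller comparison would also account for chromosome length, since a raw fraction does not by itself distinguish enrichment from the expectation under uniform insertion.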
MMERGLN-int and MMERGLN_LTR
MMERGLN is a recently identified retrotransposon with a PBS that is a complementary sequence to glutamine (GLN)-tRNA. The divergence of each copy from the consensus sequence of MMERGLN-int does not show a monomodal distribution (Fig. 2C), suggesting that the elements have spread multiple times or could be classified into subgroups. The elements are present in both the A and B compartments (Supplementary Table S1). The LTR sequence MMERGLN_LTR is closely related to the RLTR1 subgroup (Supplementary Fig. S2). Similar to the A-, Tf- and Gf-groups of L1, de novo methylation of MMERGLN in male germ cells is dependent on piRNAs, whereas other LTRs are generally methylated by a piRNA-independent mechanism (Inoue et al., 2017). MMERGLN_LTR and MMERGLN-int contain a high density of CpG sites (Supplementary Table S2), and the loss of DNA methylation results in the derepression of MMERGLN in ESCs and germ cells (Karimi et al., 2011; Inoue et al., 2017) (Supplementary Table S2).
MMVL30 and RLTR6_Mm/RLTR6C_Mm
VL30 (virus-like 30S) is a non-autonomous element, and its retrotransposition depends on MuLV (French and Norton, 1997). As the internal sequence of mouse VL30 (MMVL30) is homologous to that of rat VL30, MMVL30 is considered to have been inserted into the ancestral genome at least 10 Mya (Courtney et al., 1982). The LTR sequences of MMVL30 are RLTR6_Mm and RLTR6C_Mm. The average divergences of MMVL30-int and these LTR regions are 8.2–8.6% (Fig. 2C, Supplementary Fig. S1 and Supplementary Table S1).
RLTR6-int and RLTR6B_Mm
Approximately 500 copies of RLTR6-int are present with its LTR sequence, RLTR6B_Mm. This subfamily is abundant in the B compartments (Supplementary Table S1). The average divergences of RLTR6-int and RLTR6B_Mm are 8.2% and 4.1%, respectively (Supplementary Fig. S1, Supplementary Table S1).
MERV1_I and MERV1_LTR
MERV1 consists of the internal region, MERV1_I, and the LTR sequence, MERV1_LTR. There are approximately 1,000 genomic copies, present in both A and B compartments, and the average divergences of MERV1_I and MERV1_LTR are 14.7% and 12.2%, respectively (Supplementary Table S1).
ERVK (ERV2) family
The ERVK family has 176 Mus-specific subfamilies, including ETnERV, MMETn, IAP and MMERVK10C groups. Of these, 31 subfamilies are internal sequences, such as ETnERV-int, MMETn-int, IAPEz-int and MMERVK10C-int, whereas 145 subfamilies are LTR sequences, such as MLTR18, MLTR31, RLTR13, RLTR20 and RMER17 (see Supplementary Table S1 for all subfamilies).
IAP group
The IAP group is present in the Mus genus and has been intensively investigated since the 1970s. Virus-like particles have been observed and isolated from mouse tumors (Kuff et al., 1972), and an RNA sequence specifically associated with the particle (named intracisternal A-type particle or IAP) has been found to be endogenous in the mouse genome (Lueders and Kuff, 1977, 1980). The virus-like particles are formed intracellularly and are not exported. IAP retrotransposition occurs frequently and can cause several diseases, such as leukemia and lymphoma (Kuff and Lueders, 1988; Dewannieux et al., 2004; Qin et al., 2010). Moreover, IAP transcription and formation of virus-like particles have been observed in oocytes and early embryos (Pikó et al., 1984; Svoboda et al., 2004). The sequence of the first identified IAP lacked coding regions, such as an envelope. Later, an IAP copy encoding the envelope region was discovered in the mouse genome and named IAPE (IAP coding for the envelope) (Reuss and Schaller, 1991). Nowadays, internal regions of IAP copies are classified into 10 subfamilies (IAP-d-int, IAP1-MM_I, IAPA-int, IAPLTR3-int, IAPLTR4_I, IAPEz-int, IAPEy-int, IAPEY3-int, IAPEY4_I and IAPEY5_I) and LTR regions into 15 subfamilies (IAP1_MM_LTR, IAPLTR1_Mm, IAPLTR1a_Mm, IAPLTR2_Mm, IAPLTR2a, IAPLTR2a2_Mm, IAPLTR2b, IAPLTR3, IAPLTR4, IAPEY_LTR, IAPEY2_LTR, IAPEY3_LTR, IAPEY3C_LTR, IAPEY4_LTR and IAPEY5_LTR) in the Dfam database (Storer et al., 2021). These IAP subfamilies are preferentially located in the B compartments (Supplementary Table S1).
The IAPEz subfamily has 7,600 copies, many of which are flanked by various LTR sequences such as IAPLTR1a_Mm, IAPLTR1_Mm, IAPLTR2a and RLTR27. Owing to recent retrotransposition events, the average divergences of these elements are low (3.4–5.8%) (Fig. 2D, Supplementary Fig. S1 and Supplementary Table S1). The transcription of these LTRs is regulated by both DNA methylation and H3K9me3 in ESCs, somatic cells and male germ cells (Karimi et al., 2011; Inoue et al., 2017) (Supplementary Table S2). Copies of the IAPEy subfamilies, IAPEy, IAPEY3, IAPEY4 and IAPEY5 (average divergence of 6.1–11.1%), are slightly older than IAPEz and have accumulated on the Y chromosome. Specifically, 35% and 45% of IAPEy-int and IAPEY3-int, respectively, are on the Y chromosome (Supplementary Table S1). The DNA methylation levels in IAPEY copies are variable in sperm (Shimosuga et al., 2017), and the loss of DNA methylation results in the upregulation of IAPEY subgroups in male germ cells (Karimi et al., 2011; Inoue et al., 2017). Indeed, the IAPEY group LTR sequences are enriched with CpG sites (34 sites in 385 bp of IAPEY_LTR, 29 sites in 357 bp of IAPEY2_LTR and 26 sites in 332 bp of IAPEY3_LTR) (Supplementary Table S2).
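To make the CpG figures quoted above easier to compare across LTRs of different lengths, they can be normalized to sites per 100 bp. The sketch below does this for the three reported (sites, length) pairs and also shows one simple way to count CpG dinucleotides in a sequence string. The function names are hypothetical, the short test sequence is invented, and this is not necessarily how Supplementary Table S2 was generated.

```python
def count_cpg(seq: str) -> int:
    """Count CpG dinucleotides (a C immediately followed by a G) on one strand."""
    return seq.upper().count("CG")


def cpg_per_100bp(n_sites: int, length_bp: int) -> float:
    """Normalize a CpG count to sites per 100 bp."""
    return 100.0 * n_sites / length_bp


# (CpG sites, consensus length in bp) as quoted in the text.
reported = {
    "IAPEY_LTR": (34, 385),
    "IAPEY2_LTR": (29, 357),
    "IAPEY3_LTR": (26, 332),
}
densities = {name: cpg_per_100bp(n, length) for name, (n, length) in reported.items()}
for name, density in densities.items():
    print(f"{name}: {density:.1f} CpG per 100 bp")

# Invented toy sequence with three CpG dinucleotides.
print(count_cpg("ACGCGTTCG"))
```

With the quoted numbers, the three LTRs come out at roughly 8.8, 8.1 and 7.8 CpG sites per 100 bp, respectively.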
MMERVK10C and RLTR10C/RLTR27
MMERVK10C uses lysine (K)-tRNA as a primer for reverse transcription; therefore, it is a member of the ERVK family. This element is made up of MMERVK10C-int and RLTR10C as LTR regions, with average divergences of 8.8% and 5.2%, respectively (Fig. 2D, Supplementary Fig. S1 and Supplementary Table S1). In addition, some MMERVK10C-int copies are flanked by RLTR27, which is closely related to RLTR10C. Therefore, RLTR27 is present as an LTR sequence of both MMERVK10C and IAPEz-int. The copy number of MMERVK10C in the mouse genome is over 3,000, with preferential localization in the B compartments (Supplementary Table S1). The phylogenetic tree suggests that RLTR10C is closely related to the LTR sequences of IAP elements (Supplementary Fig. S3). Similar to IAP, MMERVK10C and RLTR10C are repressed by DNA methylation and H3K9me3 in ESCs and male germ cells (Karimi et al., 2011; Inoue et al., 2017). Moreover, several RLTR10C copies serve as enhancers that activate germ cell-specific genes in wild-type male mice. These enhancer-type RLTR10C copies are solitary elements that are not linked to MMERVK10C-int (Sakashita et al., 2020). On the other hand, RLTR27 is evolutionarily old (average divergence of 20.9%) and is not transcriptionally activated even when DNA methylation is impaired (Supplementary Table S2).
RLTR10 group
The LTR regions designated as RLTR10 are classified into nine subgroups, and seven subgroups are Mus genus-specific, with a low average divergence (Supplementary Table S1). RLTR10C is one of the main LTR sequences of MMERVK10C. In addition, many RLTR10B, RLTR10B2 and RLTR10E sequences serve as LTR sequences for MMERVK10C. In contrast, the RLTR10 and RLTR10A copies are linked to RLTR10-int, an element as young as MMERVK10C-int (Supplementary Fig. S1 and Supplementary Table S1). RLTR10D copies are linked to IAP-d-int. These LTR sequences (RLTR10, RLTR10A and RLTR10D) are closely related to MMERVK10D3_LTR, an LTR sequence that flanks MMERVK10D3_I but is distantly related to the MMERVK10C-linked LTRs (RLTR10C, RLTR10B, RLTR10B2 and RLTR10E) (Supplementary Fig. S3).
ERVB2, ERVB3, ERVB4 and ERVB5
ERVB2_1-I_MM and ERVB2_1A-I_MM have a few hundred copies in the mouse genome, and their sequences are very similar. These copies are localized in the B compartments (Supplementary Table S1). ERVB3_1-I has approximately 200 copies, almost none of which have an LTR sequence, such as ERVB3_1_LTR, in the mouse genome. Part of this region is similar to the relatively ancient RMER16-int, which has spread to diverse species of Muridae. The average divergences of these subfamilies are 5.8–19.6% (Supplementary Fig. S1 and Supplementary Table S1).
ERVB4 and ERVB5 are highly similar to part of the Xist sequence, which is a key factor in X-chromosome inactivation, suggesting that mouse Xist originated from a copy of these elements (Elisaphenko et al., 2008; Lu et al., 2017; Liu and Fang, 2022).
ETnERV (MusD) and MMETn (ETn)
ETnERVs, also known as MusD or ERVB7, include ETnERV, ETnERV2 and ETnERV3. They are autonomous elements without the env gene, whereas MMETn (known as ETn or early transposon) is a non-autonomous element that does not have a protein-coding region but shares the LTR sequence (ERVB7_1-LTR_MM, ERVB4_1C-LTR and RLTR13G) with ETnERVs. Additionally, some MMETn-int sequences have RLTRETN_Mm sequences as their LTR. The phylogenetic tree suggests that RLTRETN_Mm, ERVB4_1C-LTR and ERVB7_1-LTR_MM are closely related to the RLTR9 subfamilies, whereas RLTR13G is distantly related to these elements (Supplementary Fig. S3). MMETn retrotransposition is dependent on ETnERV-encoded proteins (Ribet et al., 2004). MMETn is transcribed early in development, especially in two stages: first between E3.5 and E7.5, in the inner cell mass of the blastocyst and embryo proper, and second during E8.5–E11.5, in several tissues such as the neural tube and limb buds (Brûlet et al., 1983, 1985; Loebel et al., 2004). An analysis using ESCs showed that MMETn has a higher transcriptional activity than ETnERV, which correlates with lower DNA methylation of MMETn LTRs compared with ETnERV LTRs (Maksakova et al., 2009). It is also noteworthy that ETnERV elements are enriched in the B compartments, whereas MMETn elements reside in both the A and B compartments at similar densities (Supplementary Table S1). These elements possess germline retrotransposition activity, and can cause genetic diseases in laboratory mice. Owing to recent retrotransposition events, the average divergences of these copies are relatively low (5.7–14.6%), and highly similar copies are present in the genome (Fig. 2D, Supplementary Fig. S1 and Supplementary Table S1).
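The contrast drawn above between enrichment in the B compartments and similar densities in both compartments can be summarized as a log2 ratio of copy densities. The sketch below is illustrative only: the function name, the copy counts and the compartment sizes are hypothetical placeholders, not values taken from Supplementary Table S1, and this is not necessarily how the preference was scored in the original analysis.

```python
import math


def compartment_log2_ratio(copies_a: int, copies_b: int,
                           size_a_mb: float, size_b_mb: float) -> float:
    """Return log2(density in A / density in B), with density in copies per Mb.

    Negative values indicate a preference for B compartments, positive values
    a preference for A compartments, and values near zero similar densities.
    """
    if min(copies_a, copies_b) <= 0 or min(size_a_mb, size_b_mb) <= 0:
        raise ValueError("counts and compartment sizes must be positive")
    density_a = copies_a / size_a_mb
    density_b = copies_b / size_b_mb
    return math.log2(density_a / density_b)


# Hypothetical inputs for illustration only (not from the paper):
# equal-sized compartments, twice as many copies in B as in A.
b_biased = compartment_log2_ratio(copies_a=300, copies_b=600,
                                  size_a_mb=1000.0, size_b_mb=1000.0)

# Hypothetical element with the same density in both compartments.
unbiased = compartment_log2_ratio(copies_a=500, copies_b=500,
                                  size_a_mb=1000.0, size_b_mb=1000.0)

print(f"B-biased element: {b_biased:+.2f}")
print(f"Unbiased element: {unbiased:+.2f}")
```

Normalizing by compartment size matters because raw copy counts alone would conflate a true preference with differences in the total length of A and B compartments.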
MMERVK9C/E and RLTR9C/9E
MMERVK9C and MMERVK9E are Mus-specific ERVKs. Their internal regions, MMERVK9C_I and MMERVK9E_I, are flanked by the LTR sequences RLTR9C and RLTR9E, respectively. These LTR sequences are closely related to those of MMETn and ETnERVs. MMERVK9C/E, showing average divergences of 11.1–12.9%, are slightly older than ETnERVs/MMETn. These elements are enriched in the B compartments (Supplementary Table S1).
MMTV
MMTV (mouse mammary tumor virus) is a Betaretrovirus that causes tumors in mice. Although infection with MMTV has specific effects on the mammary gland, the endogenous MMTV-int copies in the mouse genome have accumulated mutations (average divergence of 27%) and most are fragmented (Fig. 2D and Supplementary Table S1). An MMTV-like Env protein has been detected in human breast cancers; thus, mice overexpressing MMTV proteins have been used as models for studying pathogenesis (Callahan and Smith, 2000; Li et al., 2000).
SRV_MM
SRV_MM is an endogenous retrovirus homologous to the simian Betaretrovirus SRV. While most of the endogenous copies (approximately 100) in the mouse genome lack the LTR sequence, some have internal (SRV_MM-int) and LTR (RLTR8) sequences. Their average divergences are 13.7% and 10.1%, respectively (Supplementary Table S1).
BGLII
The BGLII group was identified as an LTR-related sequence in a DNA band that appeared after BglII digestion of mouse genomic DNA. The group consists of BGLII, BGLII_A/B/B2/C, BGLII_Mur and BGLII_Mus. BGLII, BGLII_C and BGLII_Mus are Mus-specific (Propst and Vande Woude, 1984), showing an average divergence of 9.6–18% (Supplementary Fig. S1 and Supplementary Table S1). Most are solo LTRs, although a limited number of copies sandwich internal sequences such as RMER3D-int, MYSERV-int or MYSERV6-int.
RLTR31 and MLTR31
This group of elements has a limited similarity to BGLII. Among the eight RLTR31 subfamilies, RLTR31_Mm, RLTR31A_Mm, RLTR31B_Mm, RLTR31C_Mm and RLTR31D_Mm are Mus-specific. The six MLTR31 subgroups are closely related to RLTR31 and are Mus-specific. The average divergence of Mus-specific RLTR31/MLTR31 is 13.4–19.8% (Supplementary Fig. S1 and Supplementary Table S1). While most RLTR31 and MLTR31 elements are solo LTRs, a small number of their copies comprise LTR sequences for RMER3D-int or RMER16-int.
RLTR18 and MLTR18/32
RLTR18 and RLTR18B are LTR sequences of RLTR18-int. The MLTR18 subfamilies (MLTR18 and MLTR18A–MLTR18D) are Mus-specific subgroups of RLTR18, and most MLTR18 copies are solo LTRs, whereas the others comprise LTR regions for RLTR18-int. MLTR32C_MM is also Mus-specific and similar to RLTR18. Their average divergences are 19.6–23.0%. Notably, RLTR18 copies are enriched in the A compartments (Supplementary Table S1).
ERVL (ERV3) family
The ERVL family contains four Mus-specific subfamilies such as MERVL. The ERVL-MaLR family, described below, is also included in this family. MaLR (mammalian apparent LTR retrotransposon) is a group of non-autonomous elements closely related to autonomous MERVL elements. MaLR contains six Mus-specific subfamilies. A close relationship between the LTR sequences of MERVL (MT2 and MT2C) and those of MaLR (MTA and MTB) is shown in Supplementary Fig. S4.
MERVL and MT2_Mm/MT2C_Mm
MERVL (mouse ERV with a PBS for leucine (L)-tRNA) consists of internal (MERVL-int or MERVL_2A-int) and LTR (MT2) sequences. MT2 is subdivided into six subgroups including Mus-specific MT2_Mm and MT2C_Mm. MERVL copies likely underwent two retrotranspositional bursts, first at 10 Mya and then at 2 Mya, during the evolution of Mus musculus (Costas, 2003). The average divergences of MERVL-int and MT2 are 4.9% and 2.2%, respectively (Supplementary Fig. S1 and Supplementary Table S1), and highly similar copies are present in the mouse genome. In contrast to IAP, MMERVK10C and ETnERV, MERVL elements are present in the A and B compartments at equal densities (Supplementary Table S1). MERVL is highly expressed during early development, especially at the two-cell stage, owing to the transcription factor Dux (Peaston et al., 2004; Macfarlan et al., 2012; Hendrickson et al., 2017). Interestingly, the LTR region activates the expression of several genes required for embryonic development (Modzelewski et al., 2021). Consequently, MERVL expression at the two-cell stage is essential for development and cell fate determination (Sakashita et al., 2023).
RLTR35B_MM and RLTR28
RLTR35B_MM, RLTR28 and RLTR28B are closely related LTR sequences, with RLTR35B_MM being Mus-specific and RLTR28/RLTR28B being widely present in Muridae. These elements are likely LTR sequences for RMER17C-int but have already become retrotranspositionally extinct. They are enriched in the A compartments (Supplementary Table S1).
MT-int and MTA_Mm/MTB_Mm
MT (mouse transcript) was first identified as repetitive DNA sequences that hybridize to brain cDNA and was initially regarded as a LINE- or SINE-related element (Heinlein et al., 1986; Bastien and Bourgaux, 1987; Schaal et al., 1987). Later, however, the LTR sequence of MT (and ORR) was shown to resemble the LTR sequence of THE1, a primate MaLR (Smit, 1993; McCarthy and McDonald, 2004). The internal sequences are MT-int, MTC-int and MTE-int, while the LTR sequences are MTA, MTB, MTC, MTD, MTE and MT2. Typical MT-int copies have MTA_Mm or MTB_Mm as LTR, but some copies have MT2_Mm or MT2C_Mm. These elements are Mus-specific. MTC-int and MTE-int, as well as their associated LTRs, MTC, MTD and MTE, are older than MT-int/MTA_Mm/MTB_Mm, and are distributed in Muridae as well as some Rodentia species. The average divergences of MT-int, MTA_Mm and MTB_Mm are 4.4–13.8% (Supplementary Fig. S1 and Supplementary Table S1). These elements are present in the A and B compartments at similar densities (Supplementary Table S1).
ORR1A1-int and ORR1A1/ORR1A0
ORR1 (origin region repeat 1) was identified as a repetitive element within the replication origin in the Dhfr locus (Caddle et al., 1990). The internal sequences are ORR1A1-int, ORR1A3-int, ORR1B1-int and ORR1D-int. Interestingly, ORR1A1-int, ORR1A3-int and ORR1B1-int show preferential localization in the A compartments (Supplementary Table S1). As their burst occurred long ago (average divergences of 25.1–26.0%) (Fig. 2E, Supplementary Fig. S1 and Supplementary Table S1), it is conceivable that copies in the A compartments have been preferentially retained by natural selection. ORR1A1-int, which is flanked by the LTR sequences ORR1A0 and ORR1A1, is Mus-specific. There are 15 subgroups of ORR1 LTR sequences (ORR1A0 to ORR1G, in order of evolutionary age). Notably, ORR1 sequences contain binding motifs for transcription factors. ORR1A0 transcription is promoted by the transcription factor KLF1 (Krüppel-like Factor 1) and is strongly repressed by KLF3 in mouse erythrocytes, suggesting its role in hematopoietic development (Mak et al., 2014; Upton and Faulkner, 2014). On the other hand, in mouse ESCs, ORR1A1 binds to KLF4, a master regulator of pluripotency (Bakoulis et al., 2022). Given that ORR1 is enriched in gene-rich compartments and has high transcriptional activity in oocytes and preimplantation embryos, these elements likely shape the transcriptomic program during early development.
The RLTR25 and MLTR25 subgroups are closely related, with RLTR25 being Muridae-specific, and MLTR25 being Mus-specific. Some of these copies comprise LTR sequences for ORR1A1-int, ORR1A3-int or ORR1B1-int. The transcriptional activities of RLTR25 and MLTR25 have yet to be investigated.
Although most of the retrotransposon insertions retained in the present mouse genome are likely nearly neutral, the preferential distribution of some specific retrotransposons may be a result of natural selection. As noted above, many subfamilies of LINEs reside in the B compartments. This suggests their roles in the formation and/or maintenance of heterochromatin. For example, it has been reported that L1 copies are involved in X-chromosome inactivation in mice (Chow et al., 2010). On the other hand, SINEs are enriched in the gene-rich A compartments, and some of their copies regulate gene expression by serving as binding sites for transcription factors (Román et al., 2008; Roman et al., 2011; Schmidt et al., 2012; Tan et al., 2013; Thybert et al., 2018; Kaaij et al., 2019; Ichiyanagi et al., 2021; Gassler et al., 2022; Gualdrini et al., 2022; Horton et al., 2023; Kaemena et al., 2023; Osipovich et al., 2023). Some LTR retrotransposons, especially those of ERVK, are enriched in the B compartments, whereas those of ERVL are often found in the A compartments. Because the R regions of LTR sequences contain various kinds of transcription factor-binding sites, some LTR copies can also drive gene transcription as tissue-specific enhancers (Jern and Coffin, 2008; Fueyo et al., 2022) or as promoters of developmentally essential genes (see above). It is also interesting to speculate that other copies, especially those located in the A compartments, can also serve as enhancers or promoters of various genes.
Because copies of evolutionarily young families share a high degree of sequence identity, it has been difficult to analyze the epigenetic and transcriptional states of individual retrotransposon copies by ChIP-seq and mRNA-seq. However, long-read sequencing technologies are now advancing, which should promote the epigenetic study of retrotransposons at locus-level resolution.
This work was supported by a research grant from the Ministry of Education, Culture, Sports, Science and Technology of Japan to K. I. (grant number 23H02523), and by research grants from the SECOM Science and Technology Foundation, the Uehara Memorial Foundation and the Astellas Foundation for Research on Metabolic Disorders.