Genes & Genetic Systems
Online ISSN : 1880-5779
Print ISSN : 1341-7568
ISSN-L : 1341-7568
Special reviews
The role of transposable elements in human evolution and methods for their functional analysis: current status and future perspectives
Kei Fukuda

2023 Volume 98 Issue 6 Pages 289-304

ABSTRACT

Transposable elements (TEs) are mobile DNA sequences that can insert themselves into various locations within the genome, causing mutations that may provide advantages or disadvantages to individuals and species. The insertion of TEs can result in genetic variation that may affect a wide range of human traits including genetic disorders. Understanding the role of TEs in human biology is crucial for both evolutionary and medical research. This review discusses the involvement of TEs in human traits and disease susceptibility, as well as methods for functional analysis of TEs.

INTRODUCTION

Transposable elements (TEs) are DNA sequences that can move around the genome and have played a significant role in shaping the evolution of life on Earth. They are found in almost all organisms, from bacteria to humans. TEs make up around half of the human genome, making them a significant contributor to genetic variation and diversity (Lander et al., 2001; de Koning et al., 2011). TEs are classified into two major categories: DNA transposons and retrotransposons. DNA transposons move within the genome by a “cut-and-paste” mechanism, where the TE is excised from one location and reinserted at a new location. On the other hand, retrotransposons use a “copy-and-paste” mechanism, where the TE is first transcribed into RNA and then reverse-transcribed into DNA, which is then inserted at a new location in the genome (Bourque et al., 2018). TEs can cause mutations that may provide advantages or disadvantages to the host organism (Payer and Burns, 2019; Senft and Macfarlan, 2021). While some TE insertions may disrupt a gene, leading to a loss of function (Payer and Burns, 2019), which can be harmful if the gene is essential for survival or reproductive success, TE insertions can also create new genes or regulatory elements, leading to the acquisition of new functions or traits (Senft and Macfarlan, 2021). The insertion of TEs also creates genetic diversity within a population (O’Donnell and Burns, 2010). If a TE insertion provides an advantage to an individual, that individual may be more likely to survive and reproduce, passing on the TE insertion to their offspring. Over time, inheritance of the TE leads to the fixation of the TE insertion in the population, creating genetic diversity (Carroll et al., 2001; Salem et al., 2003).
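The difference between the two transposition modes described above can be illustrated with a toy model. The following is only a schematic sketch: the genome is represented as a list of labels, and the function names `cut_and_paste` and `copy_and_paste` are hypothetical helpers introduced for illustration, not part of any published tool. It ignores all molecular detail (transposase, RNA intermediates, target-site duplications) and shows only the effect on copy number.

```python
def cut_and_paste(genome, source, target):
    """DNA-transposon-like move: excise the element at `source`
    and reinsert it at index `target` of the remaining sequence."""
    new_genome = list(genome)
    element = new_genome.pop(source)   # excision removes the original copy
    new_genome.insert(target, element)
    return new_genome


def copy_and_paste(genome, source, target):
    """Retrotransposon-like move: the element at `source` stays in place
    and a new copy is inserted at index `target`."""
    new_genome = list(genome)
    new_genome.insert(target, new_genome[source])  # original copy is retained
    return new_genome


# Hypothetical toy genome: three genes and one TE.
genome = ["geneA", "TE", "geneB", "geneC"]

moved = cut_and_paste(genome, source=1, target=3)
copied = copy_and_paste(genome, source=1, target=3)

print(moved)   # the single TE copy has changed position
print(copied)  # the TE copy number has increased from one to two
```

In this sketch the copy number stays constant under cut-and-paste and grows under copy-and-paste, which is consistent with retrotransposons accounting for most of the TE-derived sequence in the human genome.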

TEs play a significant role in shaping the evolution of humans (Britten, 2010; Wang et al., 2021). One of the best-known examples is the Alu element, which makes up around 10% of the human genome (Lander et al., 2001). Alu insertions have been shown to be responsible for many of the differences between human and chimpanzee genomes (Hedges et al., 2004), and are thought to have contributed to the evolution of human-specific traits. Despite their importance in human evolution and diversity, TEs can also cause harmful mutations and have been linked to several genetic disorders (Hancks and Kazazian, 2016; Larsen et al., 2018; Payer and Burns, 2019; Gorbunova et al., 2021), such as hemophilia (Nakamura et al., 2015) and neurological diseases (Poduri et al., 2013). Aberrant activation of evolutionarily young TEs such as HERVK and L1 is known to be associated with aging and autoimmune diseases through the induction of the interferon (IFN) response (De Cecco et al., 2019; Wang et al., 2020; Liu et al., 2023). It is still unknown whether the acquisition of species-specific transposons contributes to the formation of species-specific aging phenotypes, and further research is needed to address this question.

Functional analysis of TEs has been an active area of research in recent years, with several new methods and tools being developed to study their impact on gene regulation and evolution. One such method is the use of high-throughput sequencing technologies to map the insertion sites of TEs in the genome, which has led to the discovery of many species-specific or population-specific TE insertions (Keane et al., 2013; Thung et al., 2014; Tubio et al., 2014; Zhuang et al., 2014; Gardner et al., 2017; Chu et al., 2021). Another recent improvement in the functional analysis of TEs is the development of new computational tools for analyzing their impact on gene expression and regulation (Goerner-Potvin and Bourque, 2018). Several studies have used RNA sequencing data to identify TEs that function as alternative splicing sites or alternative promoters (Elbarbary et al., 2016; Fueyo et al., 2022). Other studies have used chromatin immunoprecipitation followed by sequencing (ChIP-seq) to map the binding sites of transcription factors in TE sequences, revealing a potential role for TEs in regulating nearby genes (Sundaram and Wysocka, 2020). In addition, recent studies have advanced our understanding of TE function through control of TEs using CRISPR systems (Fuentes et al., 2018), massively parallel reporter assay (MPRA) (Du et al., 2022a) and expression quantitative trait loci (eQTL) analysis (Goubert et al., 2020). However, the function of human-specific TEs remains largely unknown. This review summarizes the functions of TEs that are inserted specifically in humans, as well as the methods for functional analysis of TEs. Additionally, I discuss future directions for TE analysis.

TRANSPOSABLE ELEMENTS IN THE HUMAN GENOME AND EVOLUTION

In humans, TEs make up around 45% of the genome and are classified into two main types: DNA transposons and retrotransposons. DNA transposons were active during early primate evolution until ~37 million years ago (Mya), but have lost their transpositional activity in humans (Pace and Feschotte, 2007). DNA transposons are relatively rare in the human genome, accounting for about 3% of the genome, whereas retrotransposons make up the majority of TEs, accounting for about 42% of the genome. Within retrotransposons, there are two subtypes: long terminal repeat (LTR) retrotransposons and non-LTR retrotransposons. LTR retrotransposons have sequences at their ends that are similar to those of retroviruses and use a reverse transcriptase enzyme for their transposition. Human LTR elements are endogenous retroviruses (HERVs), which account for ~8% of the genome (Lander et al., 2001; Cordaux and Batzer, 2009). Most HERVs were inserted into the human genome >25 Mya, and their activity is presently very limited in humans (Lander et al., 2001; Mills et al., 2007). By contrast, most human TEs result from the activity of non-LTR retrotransposons, which lack LTRs; autonomous non-LTR elements use their own encoded reverse transcriptase to transpose. In humans, LINE-1 (long interspersed element 1, L1), Alu and SINE-VNTR-Alu (SVA) elements are currently active non-LTR retrotransposons, and they make up over a quarter of the human genome (L1: 16.9%; Alu: 10.6%; SVA: 0.2%) (Lander et al., 2001) (Fig. 1). In this review, I first summarize the evolutionary history of human TEs.

Fig. 1. Structure and evolutionary history of human transposable elements. (A) Representative structure of each TE type. Full-length HERVK provirus, which encodes group-specific antigen (Gag), protease (Pro), polymerase (Pol) and envelope (Env) proteins, is approximately 9.5 kb and is flanked by LTRs of around 1 kb, with the 5′ LTR containing the promoter for HERVK. A functional full-length LINE-1 element is ~6 kb, and encodes at least two open reading frames (ORF1 and ORF2). The 5′ untranslated region (5′UTR) harbors the endogenous LINE-1 promoter and an antisense promoter. Alu elements are primate-specific SINE transposons of approximately 280–300 bp. They consist of two similar halves derived from a 7SL RNA gene, which are separated by an A-rich linker sequence (A5TACA6) and terminated with a poly(A) tail. The SVA element is a composite hominid-specific retrotransposon containing a CCCTCT(n) hexamer repeat, an Alu-like region consisting of two antisense Alu fragments and an intervening unique sequence, a VNTR region, and a short interspersed element of retroviral origin (SINE-R) region. (B) Evolutionary history of human TEs. DNA transposons lost their transposition activity before the split with New World monkeys. Although no currently active copies of HERVs have been found, the most recent type of HERVK, HERVK (HML-2), had transposition activity for some time after splitting from chimpanzees and is often polymorphic between species and individuals. For each of the L1, Alu and SVA families, a single lineage has undergone amplification and evolution over an extended period, and these families are currently active and contribute to genome diversity in human populations.

DNA TRANSPOSON

The human genome contains over 380,000 copies of DNA transposons derived from at least 125 families belonging to four superfamilies (Pace and Feschotte, 2007). Two superfamilies, hAT and Tc1/mariner, make up the majority of DNA transposons, accounting for 69% and 28%, respectively. The hAT superfamily’s name derives from three members: the hobo element from Drosophila melanogaster, the Ac element from maize and the Tam3 element from snapdragon (Rubin et al., 2001). The Tc1/mariner superfamily has members in several taxa, including bacteria, vertebrates, invertebrates and plants, and is named after its two best-studied members, the Tc1 transposon of Caenorhabditis elegans and the mariner transposon of Drosophila (Plasterk et al., 1999). Typical autonomous hAT and Tc1/mariner DNA transposons encode only a single protein called transposase, which acts as an endonuclease and catalyzes the transfer of transposon DNA strands from one genomic site to another (Kapitonov and Jurka, 2006). Most DNA transposons in the human genome were acquired in the ancestors of mammals between 80 Mya and 150 Mya. These DNA transposons include 85 families, comprising approximately 284,000 elements, most belonging to the hAT superfamily (Pace and Feschotte, 2007). The ancestors of primates acquired 29 families of DNA transposons, comprising approximately 74,000 elements, most belonging to the Tc1/mariner superfamily (Pace and Feschotte, 2007). Then, 11 families, comprising approximately 23,000 elements, were integrated into anthropoid species, most belonging to the hAT superfamily, but no active DNA transposons have been found in the human genome after the split from New World monkeys (Pace and Feschotte, 2007) (Fig. 1). Therefore, the number of active DNA transposon families has decreased during the evolution of primates.
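The family and copy numbers quoted above for the three acquisition periods can be tallied to check that they agree with the overall figures (at least 125 families, over 380,000 copies). The short sketch below uses only the approximate numbers stated in this section; the variable names are illustrative.

```python
# Approximate figures quoted in the text (Pace and Feschotte, 2007):
# (period, number of families, approximate number of elements)
acquisition_waves = [
    ("mammalian ancestors", 85, 284_000),
    ("primate ancestors", 29, 74_000),
    ("anthropoid lineage", 11, 23_000),
]

total_families = sum(families for _, families, _ in acquisition_waves)
total_elements = sum(elements for _, _, elements in acquisition_waves)

# Share of all DNA transposon copies contributed by each period.
element_share = {
    period: elements / total_elements
    for period, _, elements in acquisition_waves
}

print(f"families: {total_families}, elements: ~{total_elements:,}")
for period, share in element_share.items():
    print(f"{period}: {share:.1%} of copies")
```

On these figures, roughly three quarters of human DNA transposon copies date from the mammalian ancestors, with progressively smaller contributions from later periods.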

HERV

HERVs are the genetic legacy of ancient germline infections by exogenous retroviruses, which have become integrated into the genetic lineage. They share the same structure as exogenous retroviruses, consisting of four genes (gag, pro, pol and env) flanked by two LTRs (Sverdlov, 2000; Vargiu et al., 2016). HERVs can undergo recombination between their LTRs, leading to the formation of solo LTRs, and the majority of HERVs exist as solo LTRs (Sverdlov, 1998). ERVs are classified into three groups based on their similarity to exogenous viruses: Class I includes viruses related to the gamma and epsilon genera, Class II ERVs are related to betaretroviruses or distantly to deltaretroviruses and lentiviruses, and Class III consists of elements similar to spumaviruses. Gifford and Tristem (2003) identified 26 HERV families, each thought to have originated from independent integration events into human chromosomes. Of these 26 families, 18 are classified into Class I, four into Class II and four into Class III. The most ancient HERVs known are probably 60–70 million years (Myr) old, a time when the first primates had just begun to appear (Tristem, 2000). The majority of HERV families were amplified in the germline after the separation of Old and New World monkeys, that is, 30–45 Mya (Sverdlov, 2000). Although no reports of active infectious ERVs in humans have been made (Bannert and Kurth, 2006), the HERVK (HML-2) family, which is the most recently evolved HERV family, was actively amplified after the split between humans and chimpanzees (Fig. 1). As a result, approximately 90% of HERVK (HML-2) elements are specific to humans (Buzdin et al., 2003). 
At least 89 HERVK (HML-2) proviruses and 944 (Subramanian et al., 2011) to 1,200 solo LTRs (Babaian and Mager, 2016) have been identified in the human genome, and some of them show insertional polymorphisms among individuals (Turner et al., 2001; Buzdin et al., 2003; Hughes and Coffin, 2004; Mamedov et al., 2004; Belshaw et al., 2005; Marchi et al., 2014; Wildschutte et al., 2016). Wildschutte et al. (2016) identified 36 non-reference HERVK (HML-2) elements by examining 2,500 genome sequences, most of which are full-length proviruses (Marchi et al., 2014). HERVs are considered a source of genomic diversity in humans.

L1

The human genome has >500,000 L1 copies (Lander et al., 2001). The canonical full-length L1 is ~6 kb long and includes a 5′ untranslated region (UTR) with an internal RNA polymerase II promoter, two open reading frames (ORF1 and ORF2), and a 3′ UTR that contains a polyadenylation signal and an oligo(dA)-rich tail of varying length (Khan et al., 2006). L1 retrotransposons are the most abundant retroelements in mammals and have been actively multiplying for approximately 170 Myr, playing a significant role in shaping the organization and function of mammalian genomes (Smit, 1996; Lander et al., 2001; Kazazian, 2004). A single lineage of L1 families underwent amplification and evolution over an extended period to become the dominant L1 family (Furano, 2000). Phylogenetic analysis reveals the existence of three well-supported L1 lineages, namely L1MA, L1PB and L1PA, which evolved in parallel in ancestral primate genomes, but only the L1PA lineage survived to the present day, while the other two became inactive (Khan et al., 2006). The most recent L1PA family, L1PA1 (also referred to as L1Hs), emerged approximately 3 Mya, which was after the divergence of the human and chimpanzee lineages (Khan et al., 2006). L1Hs is still capable of transposition activity and contributes to genomic diversity in humans (Brouha et al., 2003) (Fig. 1).

Alu

The human genome contains over 1 million Alu copies (Lander et al., 2001), which have been able to multiply over the past 65 Myr (Batzer and Deininger, 2002), making them the most successful TEs in the genome in terms of copy number. Alu elements are typically 300 bp long and are formed by the fusion of two monomers derived from the 7SL RNA gene (Kriegs et al., 2007). The 5′ end of the element includes an internal promoter for RNA polymerase III, which consists of A and B boxes. The Alu transcript terminates at an RNA polymerase III terminator sequence (e.g., TTTT) and contains oligo(dA) sequences of various lengths near the tail (Batzer and Deininger, 2002). Alu elements are classified as nonautonomous TEs because they do not possess coding capacity. As a result, Alu elements rely on the transposition machinery of L1 elements to move within the genome (Dewannieux et al., 2003). The majority of Alu elements (> 890,000 copies) belong to the AluS/J family and appeared before the divergence of the owl monkey lineage 35 Mya (Batzer and Deininger, 2002). Subsequently, due to the accumulation of mutations, AluS/J lost its ability to transpose, and AluY emerged from AluS/J (Batzer and Deininger, 2002). After divergence of the orangutan from the human lineage, AluYa5/Yb8 emerged and currently exists in around 3,500 copies, remaining active and forming polymorphisms within human populations and between humans and chimpanzees (Batzer and Deininger, 2002) (Fig. 1).

SVA

SVAs have been active throughout hominoid evolution for about 25 Myr, and there are now roughly 3,000 copies in the human genome (Ostertag et al., 2003; Wang et al., 2005). A standard SVA element measures approximately 2 kb and is made up of a hexamer repeat region, an Alu-like region, a region with varying numbers of tandem repeats, a HERV-K10-like region, and a polyadenylation signal that terminates with a tail of oligo(dA) sequences of varying length (Ostertag et al., 2003; Wang et al., 2005). Like Alu elements, SVA elements are nonautonomous TEs that are believed to be mobilized by the L1 transposition machinery (Ostertag et al., 2003; Wang et al., 2005). While VNTRs were already in existence before the split from Old World monkeys, the oldest SVA family, SVA_A, which resulted from a fusion with other sequences, emerged after the split from gibbons. Subsequently, the sequence evolved to evade transcriptional suppression mechanisms by the host (Jacobs et al., 2014; Fukuda et al., 2022), leading to the formation of five subfamilies, SVA_B to F (Wang et al., 2005). Of these, SVA_E/F are human-specific SVA subfamilies that still retain transposition activity and contribute to generating human-specific genome structures and genomic diversity among individuals (Wang et al., 2005) (Fig. 1).

HUMAN-SPECIFIC TE INSERTIONS AFTER DIVERGENCE FROM CHIMPANZEES

Chimpanzees are the living species genetically most similar to humans. The ancestors of humans and chimpanzees diverged about 6.5 to 7.5 Mya (Amster and Sella, 2016). The two genomes differ in several ways: single-nucleotide changes account for approximately 1.23% of human DNA, and larger deletions and insertions account for about 3% of the human genome (The Chimpanzee Sequencing and Analysis Consortium, 2005). Moreover, there are significant differences in the structure of chromosomes, including inversions, translocations and chromosome fusion (Yunis et al., 1980). There are also substantial differences in the number of TE copies between humans and chimpanzees. Studies suggest that humans have approximately 15,000 TE insertions specific to their genome, which increase the size of the human genome by about 14 Mb (Mills et al., 2006; Tang et al., 2018). Tang et al. (2018) identified 14,870 human-specific TEs, consisting of 8,817 Alus, 3,912 L1s, 1,571 SVAs and 530 HERVs. According to estimates, humans have about twice as many species-specific TE copies as chimpanzees (The Chimpanzee Sequencing and Analysis Consortium, 2005; Mills et al., 2006). About 58% of human-specific Alu insertions are derived from AluYa5/Yb8, while the remainder are from various AluY subtypes (Tang et al., 2018). Only about 1,500 Alu copies are unique to the chimpanzee genome. Most of these copies are classified under the AluYc1 and AluY subfamilies (The Chimpanzee Sequencing and Analysis Consortium, 2005; Mills et al., 2006; Tang et al., 2018), suggesting significant differences in Alu activity between species. SVA is also twice as active in humans as in chimpanzees, and 57% and 26% of human-specific SVA insertions are derived from SVA_D and SVA_F, respectively (Tang et al., 2018).
In contrast to Alus and SVAs, the activity of L1 is comparable in humans and chimpanzees, and 25% and 44% of human-specific L1 insertions are derived from L1HS and L1PA2, respectively (Tang et al., 2018). After the split of human and chimpanzee ancestors, the HERVK (HML-2) family of endogenous retroviruses also proliferated in both genomes (Suntsova and Buzdin, 2020). Human and chimpanzee genomes contain ~140 and at least 45 species-specific HERVK (HML-2) copies, respectively (Suntsova and Buzdin, 2020). Additionally, two new retroviral families specific to chimpanzees, PtERV1 and PtERV2, have emerged in the chimpanzee genome and consist of at least 250 copies (The Chimpanzee Sequencing and Analysis Consortium, 2005; Mun et al., 2014).
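To make the proportions quoted above easier to compare, the sketch below converts the stated percentages into approximate insertion counts per subfamily. It uses only the numbers given in this section (attributed to Tang et al., 2018); the variable names are illustrative, and the derived counts are rounded estimates rather than figures reported in the cited study. Note that the four class counts as listed add up to 14,830, so the class shares are computed relative to that sum rather than to the reported total of 14,870.

```python
# Human-specific TE insertions per class, as listed in the text.
class_counts = {"Alu": 8817, "L1": 3912, "SVA": 1571, "HERV": 530}

# Fractions of each class attributed to particular subfamilies in the text.
subfamily_fractions = {
    ("Alu", "AluYa5/Yb8"): 0.58,
    ("SVA", "SVA_D"): 0.57,
    ("SVA", "SVA_F"): 0.26,
    ("L1", "L1HS"): 0.25,
    ("L1", "L1PA2"): 0.44,
}

listed_total = sum(class_counts.values())
class_share = {name: n / listed_total for name, n in class_counts.items()}

# Approximate number of insertions per subfamily (rounded estimate).
subfamily_estimate = {
    subfamily: round(class_counts[te_class] * fraction)
    for (te_class, subfamily), fraction in subfamily_fractions.items()
}

for name, share in class_share.items():
    print(f"{name}: {share:.1%} of listed human-specific insertions")
for subfamily, estimate in subfamily_estimate.items():
    print(f"{subfamily}: ~{estimate} insertions")
```

On these figures, Alu accounts for roughly 59% of the listed human-specific insertions, and AluYa5/Yb8 alone contributes about 5,100 of them.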

DNA methylation is a repressor of transposon expression. As DNA of SVAs in sperm is hypomethylated in humans but not in chimpanzees (Molaro et al., 2011; Fukuda et al., 2017), differences in the degree of transposon control in germ cells may contribute to differences in transposition activity between species. Despite the overall hypomethylation of SVA in human sperm, SVA copies that are inserted into highly transcriptionally active regions are found to be highly methylated in human sperm (Fukuda et al., 2022). This suggests that an evolutionary mechanism has developed in humans to inhibit excessive SVA transposition while permitting a certain level of transposition.

LINK BETWEEN HUMAN-SPECIFIC TE INSERTIONS AND HUMAN-SPECIFIC TRAITS

Our early ancestors experienced substantial changes in brain structure and function, resulting in a three-fold increase in brain size, primarily in higher-order association areas of the neocortex (Sousa et al., 2017; Pollen et al., 2023). These modifications, along with alterations to tongue and vocal cord anatomy and associated neural circuits, played a crucial role in human speech and language (Rilling et al., 2011; Pollen et al., 2023). Humans have also undergone significant structural modifications to their skeletal, muscle and joint systems, allowing for upright walking, improved object grasping, projectile throwing and the ability to move over long distances (Roach et al., 2013; Zihlman and Bolter, 2015; Pollen et al., 2023). These changes also include modifications to the pelvis to support upright walking and accommodate a larger cranium during childbirth (Gruss and Schmitt, 2015; Young et al., 2022; Pollen et al., 2023). In addition, encounters with pathogens throughout ancient and modern history have led to modifications in our immune systems (Wang et al., 2012; Dannemann et al., 2016; Enard and Petrov, 2018; Khan et al., 2020; Vespasiani et al., 2022; Pollen et al., 2023).

The insertion of TEs affects transcriptional regulation (Modzelewski et al., 2022), gene structure (Sorek, 2007; Hancks and Kazazian, 2010; Lin et al., 2016; Florea et al., 2021) and translational efficiency (Shen et al., 2011; Zucchelli et al., 2015). TEs act as transcriptional regulatory regions by providing binding motifs for transcription factors, thereby promoting the transcription of nearby genes, while also being subject to host suppression mechanisms that lead to heterochromatinization and the suppression of transcription of neighboring genes (Fueyo et al., 2022). The KRAB-ZNF-KAP1-H3K9 methylation enzyme axis is a representative TE suppression mechanism (Imbeault et al., 2017; Fukuda and Shinkai, 2020). TEs also regulate transcription by functioning as epigenetic boundaries and controlling the 3D genome structure (Ichiyanagi et al., 2021; Fueyo et al., 2022). Some TE families are more prevalent at the boundaries of topologically associating domains (TADs) or possess insulator activity (Wang et al., 2015; Cournac et al., 2016). Furthermore, experimental manipulation suggests that certain TEs directly impact chromosome architecture and folding (Zhang et al., 2019). TEs have promoted the evolution of various traits such as the placenta (Chuong et al., 2013; Sun et al., 2021; Du et al., 2023; Frost et al., 2023), secondary palate (Nishihara et al., 2016), mammary gland (Nishihara, 2019) and innate immunity (Chuong et al., 2016) by modulating gene expression patterns. The majority of primate-specific regulatory sequences are derived from TEs (Jacques et al., 2013; Trizzino et al., 2017). TEs also modify gene structure by providing or disrupting splice donor and acceptor sites (Hancks and Kazazian, 2010; Shen et al., 2011; Florea et al., 2021). Splicing pattern changes caused by SVA insertions are known to be responsible for diseases such as Fukuyama-type muscular dystrophy and Lynch syndrome (Taniguchi-Ikeda et al., 2011; Yamamoto et al., 2021a).

TEs are associated with human traits, especially in the brain. Human-specific non-LTR transposon insertions are linked to an increase in transcriptional and splicing variation of the genes they are inserted in, and are enriched in genes expressed more highly in the brain, particularly in undifferentiated neurons (Guichard et al., 2018). In addition, human genes containing intronic SVA are enriched among genes involved in neurodevelopment and neurological diseases (Nadler et al., 2023). As SVAs are repressed by the KRAB-ZNF pathway in human primed embryonic stem cells and epiblast but are de-repressed and function as enhancers in the fetal brain, pineal gland and hippocampus (Trizzino et al., 2018; Pontis et al., 2019), human-specific SVA insertions potentially affect the regulation of genes nearby, especially in the brain. In addition to non-LTR transposons, LTR transposons also affect the transcription of nearby genes. Human- or chimpanzee-specific HERVK/LTR5 insertions confer species-specific activation of neighboring genes in induced pluripotent stem cells (iPSCs) (Hirata et al., 2022). The intronic regions of the SLC4A8 and IFT172 genes contain human-specific HERV LTRs that act as promoters in the antisense orientation, generating RNAs complementary to adjacent exons. These antisense transcripts, produced from the LTR promoter, have been shown to decrease the mRNA levels of their corresponding genes in the brain (Gogvadze et al., 2009). A human-specific insertion of HERVK (HML-2) upstream of the schizophrenia-linked gene PRODH enhances its expression in the hippocampus. PRODH plays a crucial role in regulating proline catabolism, which is essential for normal central nervous system functioning (Suntsova et al., 2013).

Comparative analysis of the 3D genome among human, macaque and mouse brains revealed that evolutionarily young TEs, especially AluY, were enriched in human-specific TAD boundaries (Luo et al., 2021). One example is the human-specific TAD boundary located in an intron of CNTN5, a gene involved in neuron circuit formation and autism spectrum disorders (ASDs). This human-specific TAD boundary contains human-specific AluY and is correlated with higher expression of the CNTN5 gene in humans than in rhesus monkeys (Luo et al., 2021). Although there is currently no direct evidence that the insertion of human-specific TEs directly alters the 3D genome of the brain, the accumulation of young TEs at human-specific TAD boundaries suggests that these TE insertions have contributed to brain evolution by influencing changes in the 3D genome. Future studies are warranted to investigate this possibility.

iPSCs are a useful tool for investigating interspecies differences in cellular functions. Comparative analysis of neural progenitor cells and brain organoids derived from human and non-human primate iPSCs has revealed interspecies differences in neural proliferation, timing of maturation, and migration ability (Mora-Bermúdez et al., 2016; Otani et al., 2016; Marchetto et al., 2019; Schörnig et al., 2021). These findings suggest that this interspecies variation contributes to the distinctive morphological and functional characteristics, such as the enlarged neocortex, observed in humans. Patoori et al. (2022) compared chromatin accessibility and gene expression in human and chimpanzee iPSC-derived TBR2-positive hippocampal intermediate progenitor cells (hpIPCs). IPCs express the neurodevelopmental transcription factor TBR2, and genetic ablation of TBR2 in IPCs results in impaired neurogenesis during hippocampal formation (Hodge et al., 2012). It is speculated that IPCs are involved in human-specific hippocampal expansion. Patoori et al. (2022) showed that species-specific enrichment for ERV and SVA sequences within human- or chimpanzee-specific accessible genomic sites is associated with species-specific expression of nearby genes in hpIPCs. Notably, the human-specific SVAs serve as the basis for the creation of thousands of new TBR2-binding sites. Repression of these SVAs using the CRISPR-Cas9 system led to a reduction in the expression of approximately 25% of the genes that are overexpressed in human hpIPCs compared to their chimpanzee counterparts (Patoori et al., 2022). Thus, changes in gene expression caused by human-specific SVAs may be involved in the increased hippocampal volume observed in humans (Barger et al., 2014).

Interestingly, TEs regulate gene expression not only in cis, but also in trans. An SVA-containing long non-coding RNA (lncRNA) named AK057321 is found in humans, chimpanzees, bonobos and gorillas. AK057321 is encoded by three exons, including a full-length SVA transposon sequence at the 3′ end. This lncRNA is duplicated in rare cases of ASD. AK057321 is upregulated during neuronal maturation, and its depletion results in reduced expression of genes with intronic SVAs such as CHAF1B, KCNJ6, POFUT2, HTT, CDK5RAP2 and SCN8A. Notably, these genes with intronic SVAs are repressed in neural progenitor cells by ZNF91 (Nadler et al., 2023), which binds to SVAs and recruits H3K9 methyltransferases to them (Jacobs et al., 2014), and are activated with the increase in AK057321 expression during neural maturation (Nadler et al., 2023). Two potential mechanisms have been proposed to explain how this lncRNA regulates genes containing intronic SVAs. One possibility is that the lncRNA forms an RNA:DNA heteroduplex with intronic SVAs, which somehow enhances the transcription of the genes containing them. The other possibility is that the lncRNA interacts with ZNF91, thereby preventing ZNF91 from binding to intronic SVAs and subsequently increasing the transcription of genes harboring intronic SVAs. Deficiency of ZNF91 and overexpression of the lncRNA in neural stem cells have been shown to upregulate the expression of genes with intronic SVAs and promote neuronal maturation (Nadler et al., 2023). Depending on the target gene, the control of gene expression through intronic SVA by AK057321 can have varied impacts on the nervous system. AK057321 enhances CDK5RAP2 and SCN8A expression to promote neuronal maturation, while it enhances CHAF1B expression to promote proliferation and stem cell gene expression (Nadler et al., 2023). CDK5RAP2 and SCN8A have a human-specific SVA insertion, while the SVA located in the CHAF1B intron is found in humans, chimpanzees and gorillas.

The arms race between TEs and host suppression mechanisms during hominoid evolution could have contributed to the formation of new gene expression networks by modulating the timing and quantity of gene expression (Fig. 2). In humans, TE-mediated gene expression regulatory mechanisms could have played a role in delaying neural maturation and contributing to brain expansion (Fig. 2). SVAs display a significant degree of polymorphism, with more than 25% of the human-specific SVA subfamilies SVA_E/F being polymorphic (Wang et al., 2005). Additionally, de novo SVA insertions have been found in non-coding regions of the genome, including introns (Borges-Monroy et al., 2021). Some of them are associated with neurological disorders such as X-linked dystonia parkinsonism (Makino et al., 2007). These reports imply that SVA contributes to the diversity of brain function among human populations and individuals through changes in transcriptional regulation.

Fig. 2. Evolution of the transcriptional control network by SVA and host suppression factors. SVA_A, the oldest SVA type, appeared before the divergence of orangutans, and, subsequently, SVA_B, C and D types appeared after the divergence of orangutans. After the split from chimpanzees, human-specific SVA_E and F types emerged and they still remain active. On the other hand, ZNF91, a KRAB-ZNF that binds to SVA and suppresses its transcription, appeared before the divergence of Old World monkeys, but underwent significant structural changes after the split from orangutans to effectively suppress SVA. SVA is inserted into introns of many neural-related genes, and in neural progenitors, ZNF91 binds to SVA to suppress the expression of genes containing SVA in their introns, forming a new transcriptional suppression network with SVA and the host suppression mechanism. After the split from orangutans, the insertion of SVA_B led to the formation of the long non-coding RNA AK057321 containing an SVA sequence. AK057321 is expressed at higher levels during neural maturation, binds to the SVA region to form RNA:DNA hybrids, and interacts with ZNF91. Although the detailed mechanism is unclear, AK057321 inhibits the transcriptional suppression mechanism by ZNF91 and increases the expression of genes containing SVA in their introns. This makes it possible to regulate the strength and timing of expression of genes containing SVA in their introns. Such genes are different among species, forming a species-specific expression network through the SVA control mechanism. Human-specific SVA insertions are inserted into introns of CDK5RAP2 and SCN8A, both of which function to delay neural maturation, which may contribute to the expansion of the human brain.

While the role of TEs in human evolution has mainly been studied in relation to brain function, as described above, various human-specific traits have emerged since the split from chimpanzees. Therefore, it is expected that future research will also reveal the role of TEs in human-specific traits other than brain function.

ASSOCIATION BETWEEN POLYMORPHIC TEs IN HUMANS AND TRAITS

TE insertions have been reported to cause various diseases, including neurological disorders and immune disorders (Hancks and Kazazian, 2012, 2016; Payer and Burns, 2019; Fueyo et al., 2022). The genome-wide association study (GWAS) is a powerful tool for identifying genomic variation associated with diseases and traits. Several GWAS analyses have reported trait-associated TEs (Payer et al., 2017; Wang et al., 2017a; Kojima et al., 2023). For example, the insertion of an SVA into a cell type-specific enhancer reduces the expression of the B4GALT1 gene and is associated with susceptibility to autoimmune disease (Wang et al., 2017a; Fueyo et al., 2022). A polymorphic HERVK insertion increases the expression of genes encoding the C4A and C4B complement factors and is correlated with higher schizophrenia risk (Sekar et al., 2016). The majority of 44 polymorphic Alu insertions in non-coding regions that are linked with GWAS single-nucleotide polymorphisms are associated with various traits (Payer et al., 2017; Fueyo et al., 2022). Interestingly, an intronic Alu polymorphism in the ACE gene is correlated with the amount of circulating ACE protein and is hypothesized to be associated with both susceptibility and morbidity in SARS-CoV-2 infection (Hatami et al., 2020; Li et al., 2021; Yamamoto et al., 2021b). Together, these findings suggest that TEs are involved in regulating transcriptional control networks, leading to various human traits and diseases.
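As a rough illustration of the kind of test that underlies a case-control association at a single polymorphic TE locus, the sketch below compares insertion-allele counts between cases and controls with a 2x2 chi-square test. It is a deliberate simplification: published GWAS typically use regression models with covariates such as ancestry, and the function name `allelic_association` and the example counts are hypothetical, not taken from any of the cited studies.

```python
from math import erfc, sqrt


def allelic_association(case_ins, case_ref, ctrl_ins, ctrl_ref):
    """Allelic 2x2 chi-square test for one polymorphic TE locus.

    Arguments are allele counts (not individuals): insertion and
    reference (no insertion) alleles in cases and in controls.
    Returns (odds_ratio, chi_square, p_value) with 1 degree of freedom.
    """
    a, b, c, d = case_ins, case_ref, ctrl_ins, ctrl_ref
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        raise ValueError("a row or column of the 2x2 table is empty")
    chi2 = n * (a * d - b * c) ** 2 / margins
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c) if b * c > 0 else float("inf")
    return odds_ratio, chi2, p_value


if __name__ == "__main__":
    # Hypothetical allele counts, for illustration only.
    odds_ratio, chi2, p_value = allelic_association(30, 70, 10, 90)
    print(f"OR={odds_ratio:.2f} chi2={chi2:.2f} p={p_value:.2e}")
```

In a genome-wide setting the same test would be repeated for every genotyped polyTE locus, so the resulting p-values would need a multiple-testing correction before any locus is called trait-associated.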

eQTL analysis statistically associates genomic variation with gene expression data from multiple individuals and is a powerful method for identifying genetic variants that underlie traits and diseases through their effects on gene expression regulation (Cookson et al., 2009; Kim-Hellmuth et al., 2020). eQTL analysis is also useful for identifying polymorphic TE (polyTE) loci that contribute to gene expression divergence among individuals. Wang et al. (2017b) searched for associations between polyTE loci and human gene expression levels in B cells using the eQTL approach and identified more than 1,000 polyTE loci associated with gene expression. Some polyTE loci corresponded to TE-eQTLs for more than one human gene. For example, an Alu polymorphism downstream of the PAX5 gene, a transcription factor related to B cell differentiation, correlates with increased PAX5 expression and altered expression of its target genes, including immune-related genes such as PIK3AP1, REL and ZSCAN23. Modulation of the gene expression network by this Alu polymorphism may contribute to inter-individual or inter-population differences in immune function (Wang et al., 2017b). Another study using lymphoblastoid cell lines (LCLs) and iPSCs discovered 211 and 176 TE-eQTLs in LCLs and iPSCs, respectively (Goubert et al., 2020). Cao et al. (2020) performed both eQTL and splicing QTL (sQTL) analyses in 48 tissues using the GTEx dataset and identified 3,522 TE-eQTLs and 3,717 TE-sQTLs. There have been several reports on the association between polyTEs and traits, but the detection power has been low, partly because the identification of polyTEs has not been sufficiently efficient or accurate.
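The core of a TE-eQTL test can be sketched as a simple linear regression of a gene's expression on the number of TE insertion alleles each individual carries (0, 1 or 2). The code below is a minimal sketch under that assumption; the cited studies use dedicated eQTL software with covariates and multiple-testing correction, and the function name `te_eqtl_regression` and the example values are hypothetical.

```python
from math import sqrt


def te_eqtl_regression(genotypes, expression):
    """Ordinary least squares of expression on TE insertion dosage.

    genotypes:  insertion allele counts per individual (0, 1 or 2).
    expression: normalized expression of one gene in the same individuals.
    Returns (slope, intercept, t_statistic); the slope is the estimated
    change in expression per additional insertion allele.
    """
    n = len(genotypes)
    if n != len(expression) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("locus is monomorphic in this sample")
    sxy = sum((g - mean_g) * (e - mean_e)
              for g, e in zip(genotypes, expression))
    slope = sxy / sxx
    intercept = mean_e - slope * mean_g
    # Residual variance with n - 2 degrees of freedom.
    rss = sum((e - (intercept + slope * g)) ** 2
              for g, e in zip(genotypes, expression))
    std_err = sqrt(rss / (n - 2) / sxx)
    t_stat = slope / std_err if std_err > 0 else float("inf")
    return slope, intercept, t_stat


if __name__ == "__main__":
    # Hypothetical data: six individuals, expression rising with dosage.
    dosage = [0, 0, 1, 1, 2, 2]
    expr = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
    print(te_eqtl_regression(dosage, expr))
```

A positive slope with a large t statistic would mark the polyTE as a candidate TE-eQTL for that gene; it does not by itself show that the insertion is causal.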

Long-read sequencing can accurately identify TE insertions, but the number of samples sequenced by long-read sequencing is still limited. Recently, Kojima et al. (2023) developed the mobile element genotype analysis environment (MEGAnE), software that efficiently and accurately identifies TE insertions from short reads. They applied MEGAnE to the genomic sequences of 180,000 individuals stored in BioBank Japan, conducted GWAS, and reported a strong correlation between Alu/L1 insertions and keloid, schizophrenia, type 2 diabetes and prostate cancer. For example, the insertion of LINEs in an intron of NEDD genes is associated with keloid severity and increases the expression of NEDD genes (Kojima et al., 2023). The authors also performed eQTL analysis using GTEx v8 data, and found that TE insertions were more frequently associated with changes in gene expression than single-nucleotide polymorphisms and tended to weaken enhancers, providing essential insights into the importance of TE insertions in human traits.

FUNCTIONAL ANALYSIS OF POLYMORPHIC TEs

Although GWAS and eQTL analyses are helpful for identifying loci associated with traits, it is difficult to definitively determine whether the identified polymorphisms cause the traits and gene expression changes. Thus, verifying the function of polyTEs through additional functional assays is necessary. The reporter assay is useful to validate the effect of a polyTE on gene regulation (Cao et al., 2020; Payer et al., 2021). Payer and colleagues investigated the impact of Alu polyTEs on transcription by inserting regions containing polyTE loci and their surrounding DNA, with or without TE sequences, into a luciferase reporter vector. They analyzed the impact of 110 polyTE loci on transcription with this assay and reported that the insertion of Alu can lead to either an increase or a decrease in transcription. Although the number of loci that can be analyzed by luciferase assay is limited, it is possible to scale up the assay using techniques such as MPRA (Melnikov et al., 2012) and self-transcribing active regulatory region sequencing (STARR-seq) (Arnold et al., 2013). In an MPRA experiment, a library of synthetic DNA fragments, each containing a unique barcode, is inserted into a reporter vector. The vector is then transfected into cells, and the RNA transcripts and DNA of the inserted fragments are sequenced and quantified using next-generation sequencing technology. By comparing the levels of transcripts and DNA from different vectors, we can determine how DNA sequence variation affects transcriptional activity. MPRA can analyze tens of thousands of DNA fragments in a single experiment, making it a powerful tool for understanding the function of non-coding regions of the genome. MPRA has already been used to study the evolution of transcriptional regulation in humans (Girskis et al., 2021; Uebbing et al., 2021; Whalen et al., 2023), and also to study the evolution of transcriptional regulation by TEs. Du et al. (2022b) used the primate-specific TE family LTR18A as a model and performed MPRA using LTR18A DNA sequences derived from seven primate species and reconstructed ancestral DNA sequences. They identified important transcription factor binding motifs for LTR18A enhancer activity and revealed the evolutionary origin and dynamic evolution of LTR18A enhancer activity.
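The comparison of transcript and DNA levels in an MPRA can be sketched as a per-barcode log2(RNA/DNA) ratio, summarized per tested element. The code below is a minimal sketch assuming a simple count table and depth normalization by library totals; real analyses typically use replicates and dedicated statistical models, and the function name `mpra_activity`, the record layout and the example counts are hypothetical.

```python
from collections import defaultdict
from math import log2
from statistics import median


def mpra_activity(records, pseudocount=1.0):
    """Median log2(RNA/DNA) activity per element from barcode counts.

    records: iterable of (element, barcode, dna_count, rna_count).
    Counts are scaled by the total DNA and RNA library sizes so that
    sequencing depth does not bias the ratio. With pseudocount=0 every
    count must be non-zero, otherwise the ratio is undefined.
    """
    records = list(records)
    total_dna = sum(r[2] for r in records)
    total_rna = sum(r[3] for r in records)
    if total_dna <= 0 or total_rna <= 0:
        raise ValueError("DNA and RNA libraries must both contain reads")
    per_element = defaultdict(list)
    for element, _barcode, dna, rna in records:
        dna_frac = (dna + pseudocount) / total_dna
        rna_frac = (rna + pseudocount) / total_rna
        per_element[element].append(log2(rna_frac / dna_frac))
    # The median across barcodes dampens the effect of outlier barcodes.
    return {element: median(vals) for element, vals in per_element.items()}


if __name__ == "__main__":
    # Hypothetical counts for one locus assayed with and without the TE.
    counts = [
        ("with_TE", "bc1", 100, 400),
        ("with_TE", "bc2", 100, 400),
        ("without_TE", "bc3", 100, 100),
        ("without_TE", "bc4", 100, 100),
    ]
    activity = mpra_activity(counts, pseudocount=0.0)
    print(activity["with_TE"] - activity["without_TE"])
```

The difference in activity between the TE-containing and TE-lacking versions of the same locus is the quantity of interest when asking whether an insertion strengthens or weakens regulatory activity.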

MPRA is a powerful tool for investigating enhancer activity of short TEs, but synthesizing DNA sequences, including both the TE and its flanking regions, which is necessary for studying human-specific TE insertions, is challenging and costly. STARR-seq and its derivatives (Gallego Romero and Lea, 2023), on the other hand, enrich for target regions within cells using methods such as target capture using oligonucleotide probes (Vanhille et al., 2015), insert them into vectors, and examine their impact on transcription. It may be possible to investigate the impact of human-specific TEs on gene expression by selectively capturing regions that contain TE insertions unique to humans from the genomic DNA of humans and other primate species, and performing STARR-seq using these DNA fragments (Fig. 3).

Fig. 3. Potentially effective methods for comprehensively investigating the function of human-specific transposable element insertions. 1: Screening for functional TEs using machine learning. Recent advances in machine learning have enabled the prediction of the effects of genomic mutations on transcription and chromatin modifications. Enformer is a machine learning tool that can predict the effects of genomic mutations up to 130 kb away (Avsec et al., 2021). Pre-trained models using various ENCODE samples are already available, making it very useful. However, the accuracy of predicting the effects of TE insertions or deletions is not yet well analyzed and requires further validation. 2: Reporter assays to investigate the effects of TE insertions on transcription. Reporter assays have already been used to investigate the effects of polyTEs on transcription, but they have low throughput (Payer et al., 2021). MPRA has been used to examine the origin and dynamics of enhancer ability in LTR18A (Du et al., 2022b), but the effects of human-specific insertions have not been investigated. MPRA involves inserting synthetic oligonucleotides into a reporter vector and performing an assay, but even the smallest human-specific TE type, AluY, is 300 bp, making synthesis difficult or expensive. Additionally, investigating the effects of SVA or L1 insertion polymorphisms using MPRA is challenging due to their large size. One potential solution is to enrich polymorphic TE regions from humans and chimpanzees to comprehensively investigate the effects of TE insertions. However, it may still be difficult to assay the effects of several-kilobase insertions of SVA or L1, and such assays may be more suitable for investigating the effects of Alu. 3: CRISPR TE knockout screening. Proposal for identifying TEs involved in neural maturation using PRIME-Del. PRIME-Del achieves precise genome deletion by using a fusion protein of Cas9 H840A and reverse transcriptase and paired pegRNAs targeting specific regions (Chen and Liu, 2023). Further research is needed to determine if this method is applicable for genome-wide screening. This figure illustrates a screening model to identify TEs involved in neural maturation. GFP is incorporated downstream of mature neuronal marker genes in a neural progenitor cell line such as NTERA-2. Paired-pegRNA vectors targeting the insertion sites of human-specific TEs are transfected, and GFP-positive cells are sorted to identify TEs involved in neuronal maturation by sequencing the pegRNAs.

The limitations of MPRA and STARR-seq include their inability to account for the genomic environment surrounding the TEs and the fact that they measure enhancer activity only in the assayed cell type. Recent advances in machine learning have made it possible to predict chromatin states and gene expression states from DNA sequences (Zhou et al., 2018; Agarwal and Shendure, 2020; Kelley, 2020; Avsec et al., 2021). For example, Enformer uses a transformer architecture to incorporate genomic information within 130 kb around a target locus and can accurately predict the epigenome, especially transcriptional states and active chromatin modifications (Avsec et al., 2021). Enformer provides models for epigenome prediction in over 5,000 human and 2,000 mouse samples using ENCODE data, enabling prediction of epigenomic states, including those of surrounding regions, for a wide range of samples, and is therefore potentially useful for investigating the effects of TE insertions on the epigenome. However, the effects of genomic regions further than 130 kb away cannot be predicted, so additional development of techniques and computing capabilities is necessary.
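One practical detail of sequence-based prediction is that models with a fixed input length need the allele with the TE and the allele without it as equal-length windows centered on the insertion site. The sketch below shows one way to build such windows and compare predictions. It does not call Enformer: `predict` is a hypothetical stand-in for any sequence-to-score model, and `gc_fraction` is a toy scorer used only to make the example runnable.

```python
def center_crop(seq, center, window):
    """Return `window` bases of `seq` centered on index `center`."""
    start = center - window // 2
    end = start + window
    if start < 0 or end > len(seq):
        raise ValueError("flanking sequence too short for this window")
    return seq[start:end]


def make_alleles(left_flank, right_flank, te_seq, window):
    """Build equal-length sequences without and with a TE insertion.

    The empty-site allele is centered on the insertion point; the
    insertion allele is centered on the midpoint of the inserted TE.
    """
    empty = center_crop(left_flank + right_flank, len(left_flank), window)
    filled = center_crop(left_flank + te_seq + right_flank,
                         len(left_flank) + len(te_seq) // 2, window)
    return empty, filled


def insertion_effect(predict, empty, filled):
    """Predicted score with the TE minus predicted score without it."""
    return predict(filled) - predict(empty)


def gc_fraction(seq):
    """Toy scorer (assumption): stands in for a trained model."""
    return sum(base in "GC" for base in seq) / len(seq)


if __name__ == "__main__":
    # Hypothetical short sequences, for illustration only.
    empty, filled = make_alleles("A" * 10, "C" * 10, "GGGG", window=8)
    print(empty, filled, insertion_effect(gc_fraction, empty, filled))
```

Because the window length is fixed, a long insertion such as an SVA or L1 displaces flanking sequence out of the window, which is one reason predictions for large TE insertions would need careful validation.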

The above methods are helpful for understanding the overall impact of TE insertions and identifying candidate TEs associated with traits. However, ultimately, to determine how TEs affect traits, it will be necessary to study the effects of TE deletion or insertion using iPSCs, differentiated cells or organoids.

PERSPECTIVE

Recent advances in genome sequencing technology and analysis methods have rapidly revealed the importance of TEs in species evolution and diversity. However, although over 10,000 copies of TEs have been inserted in the human genome since the human–chimpanzee divergence, the effects of individual TE copies on traits are largely unknown. CRISPR-Cas9-based gene knockout screening is a powerful tool for identifying genes involved in various cellular phenotypes, such as drug resistance, transcriptional control, viral suppression and disease susceptibility (Fukuda et al., 2018; Przybyla and Gilbert, 2022). It could also be a powerful tool for investigating the effects of human-specific TE insertions by performing knockout screens for individual TE copies, although this requires precise TE deletion. The most frequently used method for DNA deletion is a double-strand break induced by CRISPR-Cas9 and a pair of gRNAs. Although deletion screening for lncRNAs using a paired-gRNA CRISPR-Cas9 library has already been reported (Zhu et al., 2016), deletions generated by this method are often inefficient and imprecise, indicating a need for more precise and efficient methods to investigate the impact of TE insertions. A potential solution is the prime-editing-based method (PRIME-Del), which induces deletion using a pair of prime editing gRNAs (pegRNAs) that target opposite DNA strands, effectively programming not only the sites that are nicked but also the outcome of the repair (Chen and Liu, 2023). PRIME-Del can mediate large deletions (up to 10 kb) with efficiencies of up to 25% at endogenous genomic sites (Chen and Liu, 2023). Although it is still inefficient for screening purposes, further improvements in efficiency are expected to allow screening for functional TE insertions. In the future, further development of machine learning, reporter assays and knockout screening for TEs may enable comprehensive investigation of the functions of human-specific TE insertions (Fig. 3).

While GWAS and eQTL analysis are powerful for identifying genetic variants related to phenotypes, it is statistically challenging to investigate the effects of rare polyTEs. However, as methods for examining the effects of rare variants on traits are progressing (Chen et al., 2022), it should become possible to analyze the effects of rare polyTEs. Additionally, technologies such as PASTE, which allows the insertion of any sequence up to 36 kb at any location, have also been developed (Yarnall et al., 2023) and are potentially useful for analyzing the effects of rare TE insertions by inserting them into endogenous loci. The functions of rare polyTEs are expected to become increasingly clear as such new analytical methods advance.

Finally, recent advances in genome sequencing projects have revealed the diversity of TEs among species (Osmanski et al., 2023). By conducting the functional analyses of TEs described above in various organisms, we can gain a deeper understanding of the role of TEs in species evolution.

CONFLICTS OF INTEREST

The author declares no conflicts of interest associated with this manuscript.

ACKNOWLEDGMENTS

I would like to express my sincere gratitude to Dr. Trent Newman for his invaluable contribution in reviewing and providing constructive feedback on the manuscript. His expertise and input significantly improved the clarity and structure of the paper.

REFERENCES
 
© 2023 The Author(s).

This is an open access article distributed under the terms of the Creative Commons BY 4.0 International (Attribution) License (https://creativecommons.org/licenses/by/4.0/legalcode), which permits the unrestricted distribution, reproduction and use of the article provided the original source and authors are credited.