2023 Volume 70 Issue 1 Pages 9-17
Genetic factors are involved in the etiology of most diseases, but prior to 2000, the methods for identifying such factors were very limited. Genome-wide association study (GWAS), developed in the 2000s, is an analytical method that can be applied to most diseases, including endocrine disorders. GWAS has provided a wealth of information on disease risks and the molecular pathogenesis of many human diseases. This review summarizes key findings from GWAS for thyroid physiology and diseases, and illustrates how GWAS is a powerful research tool to elucidate the molecular mechanisms of the diseases.
The risk for having diseases is modulated by complex interplay of multiple environmental factors and multiple genetic factors (Fig. 1A). As for disease-associated environmental factors, epidemiological studies can be used to identify them. For example, the association between smoking and lung cancer was first postulated through a case-control study reported in 1950 [1]. As for identifying disease-associated genetic factors, the most versatile method known to date is genome-wide association study (GWAS). GWAS is a “genetics version” case-control study (Fig. 1B), first reported from Japan [2]. In this review, the rationale for GWAS is explained, and then three groups of GWAS related to thyroid physiology and diseases are then summarized: GWAS for serum TSH levels, GWAS for thyroid cancer and GWAS for thyroid dysplasia. Through these studies, it will be illustrated what insights GWAS can provide into medical research.
Etiology of human diseases, and how to determine the contributing factors.
(A) Most diseases are influenced by multiple environmental factors and genetic factors. (B) To identify disease-associated environmental factors, epidemiological studies such as case-control studies, are used. As for genetic factors, genome-wide association studies (GWAS), which is a “genetics version” case-control studies, are used.
One set of genetic information of Homo sapiens is referred to as the human genome. The human genome serves as the blueprint of the human body, and translates genes to function through the flow of information from DNA to RNA to protein. The genome is the basis for the control of various cellular processes, such as the proliferation of itself, the metabolism of nutrients, the generation of energy, and the synthesis of many types of molecules. It is well accepted that changes of the DNA sequence of the human genome may cause diseases. Examples for such diseases would include Mendelian disorders, which are caused by a single mutated gene, and malignancy. In both cases, it is necessary to know the normal “reference” human genome sequence in order to identify sequence changes. For this purpose, The Human Genome Project was undertaken. In 2001, the draft version human genome sequence, consisting of approximately 3,000,000,000 bp with >20,000 genes, was released [3, 4], allowing researchers to interpret the meaning of the changes that occur in the human genome. For example, the guanine to adenine substitution at the position 81,143,407 of chromosome 14 (GRCh38) is often detected in Japanese patients with congenital hypothyroidism [5, 6]. By referring to the reference human genome sequence, we can determine that this nucleotide change results in the replacement of the Arg450 residue of the TSH receptor to a mutated histidine residue (i.e., p.Arg450His).
Single nucleotide polymorphismIn the early 2000s, when the draft human genome sequence was released, the focus of human genome research shifted from single reference sequence to the diversity of sequences between individuals. A study comparing the sequences of multiple individuals in selected genome regions revealed sites of inter-individual diversity [7]. Most of the observed variations were changes of a single nucleotide, and were named single nucleotide polymorphisms (SNPs). The HapMap consortium investigated the distribution of SNPs throughout the genome, and showed that SNPs were found at a frequency of approximately one in every 300 bp throughout the genome [7]. Today, not only single nucleotide substitutions, but also single nucleotide insertions, deletions, and short changes of two to a dozen consecutive bases are included in the category of SNPs. Considering that the human genome consists of approximately 3,000,000,000 bp [3, 4], approximately 10,000,000 positions are SNPs [8]. In terms of diversity, some alternative alleles are found at almost the same frequency of the reference alleles (e.g., 55% for the reference allele and 45% for the alternative allele), while other alternative alleles are rare, such as 0.1%. Ultimately, there are DNA changes that are found in only one person in the world. Generally, the word “polymorphism” in biology is defined as “the occurrence of two or more different forms or morphs in the population of a species” [9]. This indicates that SNPs are expected to exist at a certain frequency in a population. Initially, there was a debate as to what percentage of frequency should be accepted as SNPs. There was a suggestion that if there is a frequency of 1% or more in a population, these variants should be called SNPs. However, as research progressed, it became clear that there were many instances where variants with frequencies of more than 1% in one population were found to be less frequent in other populations. For this reason, no uniform frequency threshold is now used to define SNPs now.
As we have seen above, the human genome contains a huge number of SNPs, approximately 10,000,000 [8]. Most of these SNPs are considered to be functionally neutral and to have no deleterious effects, because those SNPs would be eliminated by negative selection during evolution. However, some exceptional SNPs located in genes or in regulatory regions may have significant impact on human health. Identification of such SNPs would contribute to predict disease risk and susceptibility to drugs and the environment. For this purpose, GWAS is the most versatile method.
GWASDNA microarray is an analytical tool to determine SNP genotypes using a glass slide, on which a number of DNA fragments are arranged. In the late 2000s, the technology of DNA microarray matured and became applicable to human genetic studies including GWAS. In a typical GWAS, more than 500,000 SNPs are genotyped for patients and controls, and the distribution of SNP genotypes (for example, CC/CT/TT) are statistically compared to determine if there are significant deviations between the two groups. There are approximately 10,000,000 SNPs in the human genome, but why are only 500,000 SNPs (corresponding to 5%) genotyped in GWAS? The important concept needed to understand the rationale for GWAS is linkage disequilibrium (LD). LD is the correlation of genotypes around neighboring SNPs. For example, suppose that the frequencies of SNPs A, B, C and D are each 0.33 (Fig. 2A). If SNP A and SNP D are located distantly, the expected frequency of chromosome that carries both SNP A and SNP D would be 0.33 × 0.33 = 0.11. Now let us assume that SNP B and SNP C exist in close proximity and are transmitted from ancestor to descendant without homologous recombination between them (Fig. 2A). In this case, a chromosome with SNP B will always have SNP C, while a chromosome without SNP B will lack SNP C. Therefore, the frequency of chromosome that carries both SNP B and SNP C would be 0.33, and this value is higher than expected (0.11) under the assumption of no LD. Therefore, by combining the frequency data of each SNP and combinations of SNPs (i.e., haplotype), one can identify a group of SNPs whose genotypes are correlated. For SNPs in LD, each genotype can be inferred from genotypes of a few representative SNPs (Fig. 2B). The HapMap project revealed that there are about 60,000 LD blocks in the human genome, and each block can be characterized by genotyping 5 to 8 SNPs (referred to as tag SNPs) [7]. This is the fundamental evidence that genotyping 500,000 SNPs would be sufficient to characterize all 60,000 LD blocks. The purpose of GWAS is not to directly identify “causal” SNPs with biological effects (Fig. 2B). Instead, it is expected to identify SNPs associated with disease, and to list the candidate “susceptibility genes” located inside the LD block. Usually, multiple susceptibility genes are identified for a single disease, and the identified genes are often enriched in specific biological pathways. The accumulation of effects on the specific biological pathways is thought to confer the risk for disease. Identification of such susceptibility genes would contribute to the understanding of the molecular pathogenesis of the disease. This knowledge made disease prevention and risk-based stratification possible, leading to the practice of precision medicine.
Linkage disequilibrium (LD) and tag SNPs.
(A) Schematic diagrams showing the generation of chromosomes from three ancestral chromosomes. In this example, each chromosome consists of 16 LD blocks. The yellow-colored ancestral chromosome has four SNPs (A, B, C and D). Note that B and C are located in the same LD. After the recombination events, chromosomes of descendants have variable combinations of LD blocks. If one analyzes SNPs on the nine chromosomes, the observed genotypes B and C are always observed simultaneously, but this is not valid for A and D. (B) In the LD block, SNPs are inherited as a set. Therefore, by genotyping a few representative SNPs (tag SNPs), one can infer the genotypes of the remaining SNPs. In GWAS, the information of 500,000 tag SNPs will represent the majority of the information for the entire SNPs (10,000,000).
In the late 2000s, SNP genotyping using DNA microarray was already widely used. Techniques for GWAS were later expanded to copy number variations (CNVs) [10] which are variations of genomic deletions and duplications. However, even if there are CNVs associated with a disease, their presence can be captured with SNPs due to LD. In the 2010s, as exome sequencing became available, DNA microarray that covered rare SNPs in exome was introduced [11]. Although microarrays specialized to rare coding variants have the potential to identify causal variants directly, their advantage would be limited as the majority of causal variants are considered to be in the non-coding regions [12, 13]. For these reasons, DNA microarray-based SNP genotyping remains the most effective and efficient method for GWAS as of 2022.
Although GWAS is a powerful method for quantifying the genetic effects on various types of traits, it can elucidate only a minor subset of heritability. Classically, heritability, defined as the proportion of phenotypic variation that can be explained by genetic variance (see Ref [14] for detail), has been measured through twin studies. For example, the heritability of height estimated from twin studies is about 0.8 [15], but only 15% of the heritability could be explained by susceptible SNPs identified in GWAS with N = 250,000 [16]. This gap is referred to as missing heritability. There are several explanations for the missing heritability, including insufficient sample size to capture variants with weak effects, inability to detect rare susceptible SNPs because only tag SNPs are genotyped, and inability to detect susceptible SNPs in regions with no LD. These limitations may be overcome in the future by using whole genome sequencing (WGS) data. In a very recent study of height using WGS data, the measured heritability reached as high as 0.68 [17]. Although the cost of WGS makes biobank-wide WGS impractical at present, WGS will become the primary methodology for GWAS in the future when the sequencing cost declines further.
In the following part of this review, three groups of GWAS related to thyroid physiology and diseases are reviewed, including GWAS for serum TSH levels, GWAS for thyroid cancer, and GWAS for thyroid dysgenesis.
Thyroid hormones are essential for normal growth, development, and maintenance of metabolic state; therefore, their levels in the circulation are tightly regulated. TSH secreted from the pituitary gland acts as a signal to stimulate the thyroid gland to produce thyroid hormones. TSH secretion is regulated by circulating thyroid hormone levels, creating a negative feedback loop. In the clinical setting, serum TSH levels are sensitively increased in response to subtle decrease of serum thyroid hormone levels. Previous studies using twin pairs have reported that serum TSH levels have high heritability (approximately 65%) [18, 19], suggesting the existence of genetic regulation. To date, several GWAS for serum TSH levels have been conducted.
In 2008, the first GWAS for serum TSH levels was performed on 4,300 Sardinians with both serum TSH level measurement and SNP genotyping [20]. The study identified a single genomic region on 5q13.3 (top-hit SNP rs4704397) that passed the genome-wide significance level. The initial finding was replicated in an independent Sardinian cohort, an Italian cohort in Tuscany, and the Old Order Amish cohort [20]. rs4704397 is located in intron of PDE8B, which encodes phosphodiesterase 8B involved in degradation of cAMP. As is widely known, TSH receptor employs cAMP as the major second messenger. Since phosphodiesterase 8B is expressed in the thyroid gland [21], it is reasonable to assume that its activity can modulate the intracellular TSH signaling.
The second GWAS for serum TSH levels was performed two years later [22]. In this study, serum levels of free T4 and free T3 were also evaluated. A total of 2014 female twins of European descent (17% monozygotic twins) were studied, and a single genome-wide significant region at 1p36.13 (rs10917469) was detected. This association was replicated in an independent cohort derived from the Busselton Health Study [22]. GWAS for serum levels of free T4 and free T3, however, did not identify any signals that reached genome-wide significance.
Although the two initial GWAS did not produce consistent signals, the next GWAS reported from Germany (N = 3,736) [23] detected both signals (5q13.3 and 1p36.13), in addition to novel signals on 16q23 and 4q31. In 2013, the large-scale GWAS meta-analysis was performed with 26,420 Europeans [24], and detected the four above mentioned signals and 15 novel loci. GWAS for free T4 was performed in this study as well, but did not show co-localization with the signals from GWAS for serum TSH levels. In 2018, further large-scale GWAS meta-analysis with 54,288 Europeans was performed [25] to replicate 18 of the 19 previously reported signals, and to find 19 additional loci including TSHR and TG.
In 2020, the largest (N = 116,445) GWAS for serum TSH levels known to date was performed by deCODE genetics and collaborators [26], reporting about 40 loci consistent with previous GWAS and >20 new loci. Of note, a Mendelian randomization study was performed in this study with the obtained data. A Mendelian randomization study is an analysis that takes advantage of the random allocation of genes during gametogenesis (which can be considered equivalent to the process in randomized controlled trials) to examine the association between a factor of interest (e.g., high serum TSH levels) and a disease (e.g., thyroid cancer) by substituting the factor with strongly associated genetic information as instrumental variables. The study group used the medical history of thyroid cancer in three cohorts (Michigan Genomics Initiative, deCODE and UK Biobank), and showed that an increase in TSH level by 1 SD (which was equal to 1.036 mU/L) was associated with a 45% reduction in the risk of developing thyroid cancer
By reviewing about 100 genes implicated in serum TSH levels, at least three categories can be recognized (Fig. 3). The first group is made up of the genes responsible for the monogenic forms of congenital hypothyroidism (GLIS3 [27], TG [28], TPO [29] and TSHR [30]). It is assumed that severe dysfunction of these genes may cause a major effect on thyroid development or hormone production, while mild effect (e.g., reduced gene expression) may result in subtle change of thyroid functions. The second group is made up of the genes comprising the Gs/cAMP pathway (ADCY9, GNG7, GNG12, PDE4D, PDE8B and PDE10A). As discussed above, cAMP is the major second messenger of the TSH receptor, and genetic changes affecting the signaling pathway would result in changes in serum TSH levels via the negative feedback loop to the pituitary gland. The third group is made up of genes is involved in angiogenesis (BCAS3, ELK3, PTEN, VAV3, VEGFA and VEGFC). The thyroid gland, like other endocrine organs, is rich in blood flow, and the degree of angiogenesis may influence the efficiency of hormone synthesis. Other signals of interest include CGA, encoding the alpha subunit of TSH, and THADA located at a chromosomal translocation breakpoint found in nodular goiter cases [31].
Biological pathways and genes identified through three groups of thyroid-related GWAS (serum TSL levels, thyroid cancer and thyroid dysgenesis).
The relative tissue expression levels of the genes located within or near disease-associated regions are shown in Fig. 4. Several genes are expressed predominantly in the thyroid, indicating their role in thyroid development and physiology.
Relative tissue mRNA expression levels of genes implicated in serum TSH levels, thyroid cancer and thyroid dysgenesis. Expression data were obtained from GTEx (https://gtexportal.org/) and relative expression levels were expressed as color intensity. Gene and tissue names are arranged in alphabetical rank order, respectively. Genes that showed the highest relative expression in the thyroid are indicated by asterisks.
Thyroid cancer is the most common endocrine malignancy, with an incidence of 14.7/100,000 persons per year in Japan according to the cancer statistics published by National Cancer Center, Japan [32]. A comparative study of heritability in 15 common cancers suggested that the strongest contribution of genetic factors was observed in thyroid cancer [33].
In 2009, the first GWAS for thyroid cancer was conducted by deCODE genetics [34]. The patient cohort included four histologic types (papillary, follicular, medullary and undifferentiated), of which 85% were papillary thyroid cancer (PTC). SNP genotype data derived from 192 Icelandic patients with thyroid cancer and 37,196 unaffected controls were used, and disease-associated SNPs on 9q22.33 (rs965513) and 14q13.3 (rs944289) were identified. These associations were replicated in the American cohort and the Spanish cohort [34]. The gene closest to rs965513 is a thyroid-specific transcription factor FOXE1, which may modulate the disease risk through dysregulation of the transcription factor. NKX2-1, another gene encoding a thyroid-specific transcription factor, was first reported to be relevant to rs944289. However, since the distance between NKX2-1 and rs944289 is more than 300 kb and there are other genes in between, it is still inconclusive whether the downstream effector of rs944289 is NKX2-1.
Exposure to radioactive iodine following the 1986 Chernobyl nuclear power plant accident resulted in a high incidence of thyroid cancer in children and adolescents [35]. To investigate the genetic background of this radiation-induced PTC, GWAS using 187 Belarusian patients (age 0–18 years at the time of accident) and 172 Belarusian controls (study 1), and another GWAS using 214 Belarusian patients and 448 Russian controls (study 2) were conducted [36]. Although there was no genome-wide significant signal in any study, meta-analysis of studies 1 and 2 identified a significant SNP rs965513 on 9q22.23, which was identical to the top-hit SNP in the previous GWAS [34]. These results indicate that thyroid cancer patients may share a genetic risk, even if they have different environmental exposures.
In 2012, deCODE genetics performed GWAS of 27,758 Icelanders with serum TSH level data, and identified 22 SNPs including rs965513 on 9q22.23 [37]. In the second step, rs965513 was excluded from the analysis and the remaining 21 SNPs were genotyped in 561 Icelandic thyroid cancer patients (any histologic types) and 40,013 controls. Positive signals were further validated in 595 non-Icelandic patients and 2,604 controls. The second step revealed that signals on 14q13.3 (already suggested in 2009 [34]), 2q35 and 8p12 confer the risk for thyroid cancer.
In 2017, deCODE genetics reported an even larger GWAS [38]. The study used samples derived from 3,001 European thyroid cancer patients (any histologic type) and 287,550 controls from five sites (Iceland, Columbus USA, Houston USA, The Netherlands and Spain). They reported six novel loci (1q42.2, 3q26.2, 5p15.33, 5q22.1, 10q24.33 and 15q22.33) in addition to the above-mentioned four loci. Of these, SNPs rs6793295 on 3q26.2 and rs7902587 on 10q24.33 were located in genomic regions associated with telomere length [39, 40], and SNP on 5p15.33 was located near TERT, encoding telomere reverse transcriptase (Fig. 3). These GWAS-based evidence for the association between telomeres and thyroid cancer seem consistent with previous studies showing the importance of TERT promoter mutations in thyroid cancer [41]. Recent Mendelian randomized studies also support the association between telomere length and risk of thyroid cancer [42].
GWAS for thyroid cancer performed in Korea detected the major previously known signals and several novel risk loci, suggesting the ethnicity of genetic risk [43]. Future replication studies, including one with Japanese patients, will be required to draw firm conclusions. More recently, an attempt was made to quantify the risk of thyroid cancer by calculating a score based on a large number of SNP information (i.e., polygenic risk score) [44]. Although the area under the ROC curve for thyroid cancer prediction is inadequate (0.7), its performance is expected to improve as more susceptible loci are identified in the future.
GWAS for thyroid dysgenesisAs we have seen with the examples of GWAS for serum TSH levels and for thyroid cancer, the progression of GWAS has usually been associated with increase of the sample size. In particular, since 2017, when the UK Biobank (N = 500,000) released their genotype data, it has become possible to perform GWAS on the scale of 500,000 for many common traits and diseases [45]. In contrast, for rare diseases, the number of affected individuals in a 500,000-person general population is limited to a few, making GWAS difficult. In fact, GWAS for rare diseases have barely been performed with a few exceptions such as amyotrophic lateral sclerosis [46].
Thyroid dysgenesis refers to abnormalities of thyroid organogenesis, and is clinically recognized as aplasia, hypoplasia and ectopia of the thyroid. Thyroid dysgenesis accounts for approximately 80% of severe and permanent congenital hypothyroidism cases. Genetic defects involving three transcription factors showing predominant expression in the thyroid (PAX8, NKX2-1, and FOXE1) are known to cause thyroid hypoplasia with Mendelian inheritance [47-49], but genetic screening studies for patient cohorts have shown that such abnormalities are seen in less than 5% of the patients [50]. A large-scale epidemiologic study in France found that as few as 2% of patients with thyroid dysgenesis had a family history of thyroid dysgenesis in first-degree relatives [51]. However, the rate (2%) was 15 times higher than would be expected based on chance alone [51]. It has been speculated that some genetic factors are involved in thyroid dysgenesis, but such factors have remained undetermined. In 2022, a GWAS of 140 Japanese patients with thyroid dysgenesis was reported [52], and a single disease-associated genomic region was identified on 2q33.3 (top-hit SNP rs9789446). The deviation of rs9789446 genotypes in patient group as compared with controls was replicated in a cohort of German patients.
Follow-up analyses using several public databases, including GTEx [53] and 3D-Genome Interaction Viewer [54], were conducted to evaluate the biological effects of the 2q33.3 region. The bioinformatic analyses suggested that this region may act as a distal enhancer of two genes (FZD5 and CCNYL1) located in the centromeric side of the disease-associated region [52]. Interestingly, both of these two genes encode components of the Wnt pathway, and the risk allele enhances expression of FZD5 and CCNYL1, suggesting that the pathway is over-activated (Fig. 3). This assumption was consistent with recent findings based on a zebrafish model [55, 56], although further studies are needed to draw clear conclusions.
GWAS for thyroid dysgenesis was able to identify a signal that reached genome-wide significance with a very small sample size (N = 140). This is probably due to the strong biological effect of the disease-associated region, that is, a per-allele odds ratio of 3.1 among patients with thyroid aplasia or ectopia [52]. Collectively, this study showed that GWAS can be used not only for common diseases where large sample sizes are available, but also for rare diseases with strong genetic factor(s).
Over the past decade, GWAS have been applied to the field of endocrinology (e.g., type 2 diabetes, obesity, hypertension), revealing a variety of new findings. There are already mega-GWAS using a biobank of >100,000 individuals, and it is expected that the sample size will be further increased through international collaborations of multiple biobanks. These efforts will contribute to further identification of more weak genetic factors for common diseases. On the other hand, GWAS can be a powerful tool for the analysis of rare diseases, as exemplified by the GWAS for thyroid dysgenesis. There are many endocrine disorders that are rare, and GWAS may be applicable to the elucidation of their pathophysiology.
The author declares no conflict of interests.
None.