2024 Volume 99 Article ID: 24-00020
The budding yeast Saccharomyces cerevisiae is an excellent model organism for studying chromatin regulation with high-resolution genome-wide analyses. Since newly generated genome-wide data are often compared with publicly available datasets, expanding our dataset repertoire will be beneficial for the field. Information on transcription start sites (TSSs) determined at base pair resolution is essential for elucidating mechanisms of transcription and related chromatin regulation, yet no datasets that cover two different cell types are available. Here, we present a CAGE (cap analysis of gene expression) dataset for a-cells and α-cells grown in defined and rich media. Cell type-specific genes were differentially expressed as expected, ensuring the reliability of the data. Some of the differentially expressed TSSs were medium-specific or detected due to unrecognized chromosome rearrangement. Comparison of the CAGE data with a high-resolution nucleosome map showed that major TSSs were primarily located in +1 nucleosomes, with a peak approximately 30 bp from the promoter-proximal end of the nucleosome. The dataset is available at DDBJ/GEA.
For years, transcription start sites (TSSs) in the budding yeast genome have been determined at base pair resolution, using different techniques and growth conditions (Zhang and Dietrich, 2005; Park et al., 2014; Lu and Lin, 2019, 2021). In particular, recent studies have exploited the power of cap analysis of gene expression (CAGE), which can quantitatively assess the extent of transcription occurring from distinct TSSs on the genomic coordinate (Lu and Lin, 2019, 2021). This technique can be distinguished from conventional RNA-seq, which provides quantitative information on transcription levels by counting the number of reads that align anywhere within annotated gene regions. In contrast, CAGE only detects the TSS, which corresponds to the extreme 5' end of the transcript (Murata et al., 2014). The high resolution of previously reported TSSs is critical for understanding the molecular mechanisms involved in transcription and related chromatin regulation. For example, previous studies have revealed that budding yeast prefers an adenine base at nucleotide position -8 relative to the TSS, implying the importance of this position for TSS determination (Zhang and Dietrich, 2005; Lu and Lin, 2019, 2021). Importantly, previous CAGE studies have not covered different yeast cell types, and such datasets are needed for elucidating the detailed mechanism of chromatin regulation regarding cell identity.
In association with chromatin regulation, TSSs are known to be located in the first nucleosome (+1 nucleosome) of protein-coding genes (Albert et al., 2007; Challal et al., 2018; Bagchi et al., 2020). In previous studies, TSS positions were compared with nucleosomes at relatively low resolution, as the nucleosomes were determined with micrococcal nuclease, which has sequence preferences in digestion (Chereji et al., 2017). This biased digestion obscures the midpoint of nucleosomal DNA on the genomic coordinate, preventing accurate comparisons between TSS and nucleosome positions. Thus, it is worth reanalyzing the TSS positions using more precisely determined nucleosomes. The comparison should be performed with nucleosomes that are determined by chemical mapping, as this determines the midpoints of nucleosomal DNA (i.e., dyad) at base pair resolution (Chereji et al., 2018). Examining the positions of TSSs relative to these precisely mapped nucleosomes allows us to discuss how nucleosomes spatially influence transcription initiation. One potential problem regarding this kind of comparison is that the publicly available TSSs could be representative of growth or genetic conditions that differ from those of the compared datasets. To overcome this problem, and for the future benefit of the field, we decided to obtain a yeast CAGE dataset that incorporates different cell types and growth conditions.
Yeast strains FY23 (MATa, ura3-52, trp1∆63, leu2∆1) and FY24 (MATα, ura3-52, trp1∆63, leu2∆1) (Fuse et al., 2017), which are considered to be isogenic, were grown in synthetic complete medium (SC) and rich medium (YPD). Cells grown in two independent cultures per condition were harvested (OD600 ~1.0), and total RNA was extracted as described (Schmitt et al., 1990).
The RNA samples were sent to DNAFORM (Yokohama, Japan), where library construction, sequencing and initial analyses were performed with the RECLU pipeline (Murata et al., 2014; Ohmiya et al., 2014). RNA quality was assessed by a Bioanalyzer (Agilent). In total, 10–11 million reads of 75 nt for each sample, generated by a NextSeq 500 sequencer, passed a quality check by FastQC (ver. 0.11.9), and 7.5–8.7 million reads were uniquely mapped onto the R64-1-1 reference genome with STAR (ver. 2.7.9a) (Dobin et al., 2013). CAGE-detected TSS counts in the bedGraph format were visualized with the genome browser IGV (ver. 2.12.3) (Thorvaldsdottir et al., 2013). Differential expression analysis of combined TSS clusters was performed with edgeR (ver. 3.22.5) (Robinson et al., 2010). TSS clusters of log2-transformed fold change (logFC) > 2 (or logFC < -2), log2-transformed mean counts per million (logCPM) > 2, and false discovery rate (FDR) < 0.05 were selected as significant. The dataset, including the bedGraph, lists of TSS clusters, and edgeR outputs, is available at DDBJ/GEA under the accession number E-GEAD-672.
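The three significance cutoffs described above can be expressed as a small filter. This is a minimal sketch, not the pipeline used in the study: the `ClusterStat` record and its field names are hypothetical, and the logCPM value in the example is a placeholder because logCPM is not listed in Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterStat:
    """One row of a differential expression table (hypothetical field names)."""
    cluster_id: str
    log_fc: float   # log2-transformed fold change
    log_cpm: float  # log2-transformed mean counts per million
    fdr: float      # false discovery rate


def is_significant(stat, min_abs_log_fc=2.0, min_log_cpm=2.0, max_fdr=0.05):
    """Apply the three cutoffs stated in the text to one TSS cluster."""
    return (
        abs(stat.log_fc) > min_abs_log_fc
        and stat.log_cpm > min_log_cpm
        and stat.fdr < max_fdr
    )


# logFC and FDR are taken from Table 1 (cluster downstream of RDN5-1).
# The logCPM of 5.0 is a placeholder, not a reported value.
in_sc = ClusterStat("XII_459849_459961_-", 0.306, 5.0, 0.475)
in_ypd = ClusterStat("XII_459849_459961_-", 2.44, 5.0, 4.49e-06)
print(is_significant(in_sc), is_significant(in_ypd))
```

Using the absolute value of logFC covers both directions of change (logFC > 2 or logFC < -2) with a single comparison.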
For comparative analysis with precisely mapped +1 nucleosomes of MATa cells grown in YPD (Chereji et al., 2018), the TSS counts of the two replicates for FY23 grown in YPD were combined. The highest TSS peak in a 401-bp window on the nucleosomal coordinate (-200 to +200 bp from the nucleosome dyad) was identified for each gene, and the numbers of those peaks were summed at their respective nucleotide positions in a distribution array. When a given gene had more than one highest TSS peak in the search window, a value of one was divided equally among those peaks and added at their respective nucleotide positions. When considering DNA sequences around the TSS, we separated the highest TSS peaks into those with and those without adenine at nucleotide position -8 relative to the respective TSS. For the generation of sequence logos, genomic sequences around the highest peaks were processed with WebLogo (Crooks et al., 2004). The code for these procedures is available at https://github.com/hkatomed/Kawakamietal2024.
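The tie-splitting rule for the distribution array can be illustrated as follows. This is a sketch written for this text, not the code from the repository above. It assumes that each gene's TSS counts have already been placed on the nucleosomal coordinate as a list of 401 values, and that genes without any signal in the window are skipped. The function names and toy counts are hypothetical.

```python
HALF_WINDOW = 200  # window spans -200 to +200 bp around the dyad (401 positions)


def new_distribution():
    """Return an empty distribution array; index i is offset i - HALF_WINDOW."""
    return [0.0] * (2 * HALF_WINDOW + 1)


def add_gene(distribution, window_counts):
    """Add one gene's highest TSS peak(s) to the distribution array.

    window_counts[i] is the TSS count at offset i - HALF_WINDOW from the dyad.
    Tied highest peaks share a total weight of one.
    """
    if len(window_counts) != len(distribution):
        raise ValueError("window and distribution lengths differ")
    top = max(window_counts)
    if top <= 0:
        return  # no TSS signal in the window (skipping is an assumption)
    tied = [i for i, count in enumerate(window_counts) if count == top]
    weight = 1.0 / len(tied)
    for i in tied:
        distribution[i] += weight


# Toy example: gene_a has one highest peak, gene_b has two tied peaks.
dist = new_distribution()
gene_a = [0] * 401
gene_a[-44 + HALF_WINDOW] = 12
gene_b = [0] * 401
gene_b[-44 + HALF_WINDOW] = 7
gene_b[-10 + HALF_WINDOW] = 7
add_gene(dist, gene_a)
add_gene(dist, gene_b)
```

Because every gene contributes a total weight of one, the sum of the array equals the number of genes with a detectable peak, regardless of how many ties occur.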
The TSS signals were highly reproducible for each condition (r > 0.998, Pearson’s correlation tests on CPM values for combined TSS clusters). When monitored by IGV, the TSS signals were visually very similar among the conditions, with some exceptions (Fig. 1). In fact, as summarized in Table 1, a differential expression analysis between different cell types selectively detected known cell type-specific loci as significant (Parnell et al., 2021). Interestingly, the main TSS clusters on the sense strands of STE3, AFB1, MF(ALPHA)1 and MFA2 were associated with upstream TSS clusters, transcribed in the opposite direction. For MFA2, an a-cell-specific gene, activation of an upstream TSS cluster (XIV_351562_351645_+) was observed in α-cells, and was more prominent in YPD than in SC. Moreover, as shown in Figure 1, we detected a-cell-specific TSS clusters around the recombination enhancer (RE) of chromosome III (Wu and Haber, 1996; Li et al., 2019; Dinda et al., 2023). The detection of these TSS clusters may help to clarify how cell type-specific transcription and recombination are regulated, since knowing the precise positions of TSSs on the genomic coordinate for each gene is the key to understanding the roles of factors involved.
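The reproducibility check mentioned at the start of this section can be sketched in plain Python. The CPM function here divides by the library total only and ignores any normalization factors that a package such as edgeR may apply, so it is a simplification. The replicate counts are invented for illustration.

```python
import math


def cpm(counts):
    """Convert raw counts to counts per million using the library total only."""
    total = sum(counts)
    if total == 0:
        raise ValueError("library has no counts")
    return [c * 1_000_000 / total for c in counts]


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical counts for three TSS clusters in two replicates.
replicate_1 = [120, 30, 850]
replicate_2 = [240, 60, 1700]
r = pearson_r(cpm(replicate_1), cpm(replicate_2))
```

Converting to CPM first removes differences in sequencing depth, so two replicates with proportional counts correlate perfectly.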
Cluster IDa | Locus | logFCb (α-cell vs. a-cell, SC) | FDR (SC) | logFCb (α-cell vs. a-cell, YPD) | FDR (YPD) |
---|---|---|---|---|---|
II_429114_429132_-f | PHO3 | 10.4 | 0 | 12.7 | 0 |
VII_345150_345209_- | MF(ALPHA)2 | 12.1 | 0 | 10.8 | 0 |
X_444870_444940_- | SAG1 | 10.0 | 0 | 8.64 | 0 |
XI_114635_114800_- | STE3 | 13.3 | 0 | 11.1 | 0 |
XI_115081_115203_+c | upstream of STE3 | 3.20 | 3.72E-147 | 3.91 | 4.38E-238 |
XII_229628_229667_- | AFB1 | 5.43 | 0 | 5.39 | 0 |
XII_229912_230016_+c | upstream of AFB1 | 4.65 | 1.10E-79 | 4.05 | 7.14E-59 |
XII_459849_459961_-d | downstream of RDN5-1 | 0.306 | 0.475 | 2.44 | 4.49E-06 |
XIV_351562_351645_+c, d | YNL146W (MFA2) | 1.41 | 8.22E-10 | 2.01 | 2.40E-17 |
XVI_192794_192829_-c | upstream of MF(ALPHA)1 | 9.26 | 4.42E-36 | 9.25 | 2.15E-40 |
XVI_193555_193638_+ | MF(ALPHA)1 | 12.6 | 0 | 12.5 | 0 |
XVI_193902_193909_+ | MF(ALPHA)1 (internal) | 6.81 | 8.14E-60 | 9.54 | 1.22E-50 |
XVI_855769_855783_+d | YPRCTy1-4 | 1.07 | 5.81E-07 | 4.94 | 1.05E-23 |
III_29041_29082_+ | 700 bp RE | -7.70 | 3.51E-99 | -8.10 | 2.15E-82 |
III_29385_29398_- | 700 bp RE | -6.82 | 5.60E-63 | -7.68 | 1.63E-65 |
III_30854_30899_+ | RDT1 | -11.7 | 8.46E-196 | -11.8 | 4.13E-178 |
III_293794_293913_+e | a1 (HMRA1) | -13.5 | 0 | -13.1 | 0 |
IV_1385071_1385168_+ | MFA1 | -9.06 | 0 | -10.6 | 0 |
VI_82533_82565_+ | STE2 | -8.68 | 0 | -9.60 | 0 |
VII_436873_436876_- | AGA2 | -7.00 | 4.99E-70 | -6.75 | 5.73E-55 |
VII_436886_436897_- | AGA2 | -7.40 | 0 | -8.99 | 7.17E-239 |
IX_322266_322324_+ | BAR1 | -9.63 | 7.94E-231 | -9.72 | 5.86E-272 |
XI_46297_46390_- | STE6 | -12.6 | 2.00E-291 | -8.23 | 2.38E-286 |
XIV_351773_351853_-c | upstream of MFA2 | -10.1 | 8.77E-59 | -9.72 | 1.21E-53 |
XIV_352343_352422_+ | MFA2 | -10.6 | 0 | -11.4 | 0 |
a Chromosomal coordinate and transcribed strand (i.e., + or -) are connected by underscored lines.
b Positive values indicate that the TSSs were detected more in α-cells; negative, more in a-cells.
c Antisense strand for the indicated gene is transcribed.
d Significant four-fold difference was observed only in YPD medium.
e The TSS cluster for a1 was detected in the HMRA1 locus, as the reference strain S288C is MATα.
f The promoter of the PHO3 gene, including the detected TSS cluster, in the MATa strain FY23 has been lost (see Fig. 2).
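As footnote a explains, each cluster ID joins the chromosome, two coordinates and the strand with underscores. A minimal parser for IDs of this form might look like the following. The `TssCluster` type and the function name are hypothetical, and the input is assumed to be a clean ID without the trailing footnote letters shown in the table.

```python
from typing import NamedTuple


class TssCluster(NamedTuple):
    chromosome: str  # Roman numeral chromosome name, e.g. "XIV"
    start: int
    end: int
    strand: str      # "+" or "-"


def parse_cluster_id(cluster_id):
    """Split an ID such as 'XIV_351562_351645_+' into its four fields."""
    # rsplit from the right so that only the last three underscores are used.
    chromosome, start, end, strand = cluster_id.rsplit("_", 3)
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand in {cluster_id!r}")
    return TssCluster(chromosome, int(start), int(end), strand)


mfa2_upstream = parse_cluster_id("XIV_351562_351645_+")
re_cluster = parse_cluster_id("III_29385_29398_-")
```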
Among the differentially expressed TSS clusters listed in Table 1, three TSS clusters, downstream of RDN5-1, inside YPRCTy1-4 and upstream of PHO3, were unexpected because the respective loci were either not expressed or only partially recognized as being expressed in a cell type-specific manner.
RDN5-1 is the sole copy of the complete 5S rRNA gene in the reference genome (McMahon et al., 1984). Expression of the TSS cluster downstream of RDN5-1 (XII_459849_459961_-) was only affected by cell type when cells were cultured in YPD, and not in SC. Thus, this differential expression could be associated with culture conditions.
YPRCTy1-4, which is located in the Crick strand of the genomic coordinate XVI:850,629-856,554, is a Ty1 retrotransposable element (Lesage and Todeschini, 2005). The position and transcription direction of the detected TSS cluster (XVI_855769_855783_+) suggest that it can modulate the activity of Ty1 itself and its neighboring genes. Indeed, the block II element (XVI:855,622-855,736), which contributes to cell type-dependent activation of adjacent gene expression (Company and Errede, 1988), resides next to the detected TSS cluster. Consistently, this TSS cluster was more active in α-cells and was repressed in a medium-dependent manner, with full repression in YPD.
PHO3 encodes a phosphatase responsible for thiamine pyrophosphate dephosphorylation and uptake of the reaction product (Nosaka et al., 1989). This gene is reportedly prone to local rearrangement with the neighboring paralog gene PHO5 (Takashita et al., 2013). Indeed, the PHO3 and PHO5 genes are fused to generate a chimeric gene in the MATa strain FY23 (Fig. 2). The rearrangement led to the amplification of a 1,660-bp PCR fragment, instead of the original 3,510-bp fragment. The junction was within the 98-bp identical regions located from nucleotides 298 to 395 in both genes’ coding sequences (yellow in Fig. 2A and 2C). Thus, in FY23, the promoter for PHO3 had been lost, causing the misdetection of the respective TSS cluster (II_429114_429132_-) as cell type-specific.
When the TSS peak distribution was analyzed with respect to the +1 nucleosomes, which have been precisely localized by chemical mapping (Chereji et al., 2018), the highest TSS signals peaked at -44 bp from the dyad (Fig. 3A). The TSS peak was previously reported to be ~13 bp from the promoter-proximal end of the nucleosomal DNA, which was protected from micrococcal nuclease digestion (Albert et al., 2007; Challal et al., 2018; Bagchi et al., 2020). In contrast, our analysis suggested that the recognized TSS peak position should be shifted inward: it is located ~30 bp from the promoter-proximal end of the 147-bp nucleosomal DNA. Strikingly, 87% of the genes had the highest TSS signals in the 147-bp region potentially occupied by their +1 nucleosomes (gray in Fig. 3A), mostly in the promoter-proximal half, indicating that the TSS is generally hidden by the +1 nucleosome if it is fully wrapped with DNA.
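The relationship between the dyad-relative peak position and the distance from the nucleosome edge is simple arithmetic, sketched below. It assumes that negative offsets point toward the promoter and that the 147-bp footprint extends 73 bp on either side of the dyad. The function names are illustrative.

```python
NUCLEOSOME_LENGTH = 147
HALF_FOOTPRINT = (NUCLEOSOME_LENGTH - 1) // 2  # 73 bp on each side of the dyad


def within_footprint(offset_from_dyad):
    """True if the position lies in the 147-bp region centered on the dyad."""
    return -HALF_FOOTPRINT <= offset_from_dyad <= HALF_FOOTPRINT


def distance_from_proximal_end(offset_from_dyad):
    """Distance in bp from the promoter-proximal edge (offset -73) to the position."""
    return offset_from_dyad + HALF_FOOTPRINT


peak_distance = distance_from_proximal_end(-44)
```

For the peak at -44 bp this gives 29 bp from the edge base, which matches the approximately 30 bp stated above.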
The highest TSSs predominantly had adenine at nucleotide position -8, in accordance with the literature (Zhang and Dietrich, 2005; Lu and Lin, 2019, 2021) (Fig. 3B). When we focused on this position, the highest TSS signals that lacked the preferred adenine oscillated with a pitch of 10 bp, while those with adenine did not exhibit such clear oscillation (Fig. 3C). This observation suggests that the TSS is partially determined by the spatial orientation of the potential TSS relative to the rotationally positioned +1 nucleosome, especially when the preferred adenine at position -8 is absent.
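Sorting TSSs by the base at position -8 can be sketched as below. The sketch assumes that the sequence is given in the direction of transcription and that the TSS is identified by a 0-based index into it, so position -8 is eight bases upstream of that index. The example sequences are invented.

```python
def has_adenine_at_minus_8(sequence, tss_index):
    """Check whether the base 8 nt upstream of the TSS is adenine.

    sequence: DNA string in the direction of transcription.
    tss_index: 0-based index of the TSS base within the sequence.
    """
    if not 0 <= tss_index < len(sequence):
        raise ValueError("tss_index is outside the sequence")
    upstream_index = tss_index - 8
    if upstream_index < 0:
        raise ValueError("sequence does not extend to position -8")
    return sequence[upstream_index].upper() == "A"


# Invented sequences; the TSS is at index 10 in both.
with_a = has_adenine_at_minus_8("TTACCCCCCCATGG", 10)
without_a = has_adenine_at_minus_8("TTGCCCCCCCATGG", 10)
```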
In conclusion, we established a yeast CAGE dataset consisting of two cell types, each grown in SC and YPD. In addition to the anticipated cell type-specific TSS clusters, novel clusters were also identified. Using PHO3 as a model locus, we also demonstrated that unrecognized chromosomal rearrangements can lead to the misdetection of related TSS clusters as being regulated. Moreover, the utility of this dataset was demonstrated by conducting a comparative analysis with precisely mapped nucleosomes. Therefore, this dataset will be informative for future studies on cell type specificity and chromatin regulation of this excellent model organism.
Raw nucleotide sequence data and processed data are available at DDBJ under the accession numbers PRJDB17334 (BioProject) and E-GEAD-672 (GEA), respectively.
Experimental code for this study: https://github.com/hkatomed/Kawakamietal2024
The authors are grateful to anonymous reviewers of GGS for providing constructive comments on our manuscript. We declare no conflicts of interest associated with this manuscript. This study was supported by research grants from the Ohsumi Frontier Science Foundation to H. K., and by JSPS KAKENHI grant numbers JP21K05505 and JP21H00255 to H. K., JP23K05646 and JP23H02178 to K. K., and by the Priority Research Funding from Meisei University to M. S. The authors thank the Interdisciplinary Center for Science Research, Shimane University, for use of facilities.