A budding yeast CAGE dataset comprising two cell types

Kei Kawakami; Shin-ichi Maeda; Yoshiko Tanimoto; Mitsuhiro Shimizu; Hiroaki Kato

doi:10.1266/ggs.24-00020

ABSTRACT

The budding yeast Saccharomyces cerevisiae is an excellent model organism for studying chromatin regulation with high-resolution genome-wide analyses. Since newly generated genome-wide data are often compared with publicly available datasets, expanding our dataset repertoire will be beneficial for the field. Information on transcription start sites (TSSs) determined at base pair resolution is essential for elucidating mechanisms of transcription and related chromatin regulation, yet no datasets that cover two different cell types are available. Here, we present a CAGE (cap analysis of gene expression) dataset for a-cells and α-cells grown in defined and rich media. Cell type-specific genes were differentially expressed as expected, ensuring the reliability of the data. Some of the differentially expressed TSSs were medium-specific or detected due to unrecognized chromosome rearrangement. By comparing the CAGE data with a high-resolution nucleosome map, major TSSs were primarily found in +1 nucleosomes, with a peak approximately 30 bp from the promoter-proximal end of the nucleosome. The dataset is available at DDBJ/GEA.

MAIN

For years, transcription start sites (TSSs) in the budding yeast genome have been determined at base pair resolution, using different techniques and growth conditions (Zhang and Dietrich, 2005; Park et al., 2014; Lu and Lin, 2019, 2021). In particular, recent studies have exploited the power of cap analysis of gene expression (CAGE), which can quantitatively assess the extent of transcription occurring from distinct TSSs on the genomic coordinate (Lu and Lin, 2019, 2021). CAGE differs from conventional RNA-seq, which quantifies transcription levels by counting the reads that align anywhere within the annotated gene regions. In contrast, CAGE detects only the TSS, which corresponds to the extreme 5' end of the transcript (Murata et al., 2014). The high resolution of previously reported TSSs is critical for understanding the molecular mechanisms involved in transcription and related chromatin regulation. For example, previous studies have revealed that budding yeast prefers an adenine base at nucleotide position -8 relative to the TSS, implying the importance of this position for TSS determination (Zhang and Dietrich, 2005; Lu and Lin, 2019, 2021). Importantly, previous CAGE studies have not covered different yeast cell types, and such datasets are needed for elucidating the detailed mechanism of chromatin regulation regarding cell identity.

In association with chromatin regulation, TSSs are known to be located in the first nucleosome (+1 nucleosome) of protein-coding genes (Albert et al., 2007; Challal et al., 2018; Bagchi et al., 2020). In previous studies, TSS positions were compared with nucleosomes at relatively low resolution, as the nucleosomes were determined with micrococcal nuclease, which has sequence preferences in digestion (Chereji et al., 2017). This biased digestion obscures the midpoint of nucleosomal DNA on the genomic coordinate, preventing accurate comparisons between TSS and nucleosome positions. Thus, it is worth reanalyzing the TSS positions using more precisely determined nucleosomes. The comparison should be performed with nucleosomes that are determined by chemical mapping, as this determines the midpoints of nucleosomal DNA (i.e., dyad) at base pair resolution (Chereji et al., 2018). Examining the positions of TSSs relative to these precisely mapped nucleosomes allows us to discuss how nucleosomes spatially influence transcription initiation. One potential problem regarding this kind of comparison is that the publicly available TSSs could be representative of growth or genetic conditions that differ from those of the compared datasets. To overcome this problem, and for the future benefit of the field, we decided to obtain a yeast CAGE dataset that incorporates different cell types and growth conditions.

Yeast strains FY23 (MATa, ura3-52, trp1∆63, leu2∆1) and FY24 (MATα, ura3-52, trp1∆63, leu2∆1) (Fuse et al., 2017), which are considered to be isogenic, were grown in synthetic complete medium (SC) and rich medium (YPD). Cells grown in two independent cultures per condition were harvested (OD₆₀₀ ~1.0), and total RNA was extracted as described (Schmitt et al., 1990).

The RNA samples were sent to DNAFORM (Yokohama, Japan), where library construction, sequencing and initial analyses were performed with the RECLU pipeline (Murata et al., 2014; Ohmiya et al., 2014). RNA quality was assessed by a Bioanalyzer (Agilent). In total, 10–11 million reads of 75 nt for each sample, generated by a NextSeq 500 sequencer, passed a quality check by FastQC (ver. 0.11.9), and 7.5–8.7 million reads were uniquely mapped onto the R64-1-1 reference genome with STAR (ver. 2.7.9a) (Dobin et al., 2013). CAGE-detected TSS counts in the bedGraph format were visualized with the genome browser IGV (ver. 2.12.3) (Thorvaldsdottir et al., 2013). Differential expression analysis of combined TSS clusters was performed with edgeR (ver. 3.22.5) (Robinson et al., 2010). TSS clusters of log2-transformed fold change (logFC) > 2 (or logFC < -2), log2-transformed mean counts per million (logCPM) > 2, and false discovery rate (FDR) < 0.05 were selected as significant. The dataset, including the bedGraph, lists of TSS clusters, and edgeR outputs, is available at DDBJ/GEA under the accession number E-GEAD-672.
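The significance criteria described above can be sketched as a simple filter. This is a minimal Python illustration, not the pipeline used for the dataset: the dictionary keys (`cluster`, `logFC`, `logCPM`, `FDR`) are assumed names for columns of an edgeR-style result table, and the logCPM values in the toy rows are made up.

```python
# Thresholds as stated in the text; the field names are assumptions.
LOGFC_CUTOFF = 2.0   # |logFC| > 2, i.e. more than a four-fold difference
LOGCPM_CUTOFF = 2.0  # logCPM > 2
FDR_CUTOFF = 0.05    # FDR < 0.05


def is_significant(row):
    """Return True if a TSS cluster passes all three thresholds."""
    return (
        abs(row["logFC"]) > LOGFC_CUTOFF
        and row["logCPM"] > LOGCPM_CUTOFF
        and row["FDR"] < FDR_CUTOFF
    )


def select_significant(rows):
    """Keep only the clusters that pass the filter, preserving input order."""
    return [row for row in rows if is_significant(row)]


# Toy rows: logFC and FDR are the SC values from Table 1;
# the logCPM values are invented for illustration only.
example_rows = [
    {"cluster": "XI_114635_114800_-", "logFC": 13.3, "logCPM": 6.0, "FDR": 0.0},
    {"cluster": "XII_459849_459961_-", "logFC": 0.306, "logCPM": 6.0, "FDR": 0.475},
    {"cluster": "VI_82533_82565_+", "logFC": -8.68, "logCPM": 6.0, "FDR": 0.0},
]
significant_ids = [row["cluster"] for row in select_significant(example_rows)]
```

Taking the absolute value of logFC treats clusters enriched in α-cells and in a-cells symmetrically, which matches the "logFC > 2 (or logFC < -2)" wording.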

For comparative analysis with precisely mapped +1 nucleosomes of MATa cells grown in YPD (Chereji et al., 2018), the TSS counts of the two replicates for FY23 grown in YPD were combined. The highest TSS peak in a 401-bp window on the nucleosomal coordinate (-200 to +200 bp from the nucleosome dyad) was searched for each gene, and the numbers of those peaks were summed at their respective nucleotide positions in a distribution array. When a given gene had more than one highest TSS peak in the searching window, the value one was divided by the number of such peaks and added to the respective nucleotide positions. When considering DNA sequences around the TSS, we selected the highest TSS peaks with or without adenine at nucleotide position -8 from the respective TSS. For the generation of sequence logos, genomic sequences around the highest peaks were processed with WebLogo (Crooks et al., 2004). The code for these procedures is available at https://github.com/hkatomed/Kawakamietal2024.
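The peak-counting step described above can be sketched as follows. This is a hedged Python illustration rather than the code in the repository: it assumes that the TSS counts of each gene have already been converted to offsets relative to the +1 nucleosome dyad (negative toward the promoter), and the function names and the input layout (one `offset -> count` dictionary per gene) are hypothetical.

```python
WINDOW_HALF = 200  # window spans -200 to +200 bp from the dyad (401 bp)


def highest_peak_offsets(counts_by_offset):
    """Return the offsets within the window that share the maximum count.

    counts_by_offset maps offset-from-dyad -> TSS count for one gene.
    Offsets outside the window and zero counts are ignored.
    """
    in_window = {
        off: n for off, n in counts_by_offset.items()
        if -WINDOW_HALF <= off <= WINDOW_HALF and n > 0
    }
    if not in_window:
        return []
    top = max(in_window.values())
    return sorted(off for off, n in in_window.items() if n == top)


def peak_distribution(genes):
    """Sum the highest peaks of all genes into a 401-element array.

    Each gene contributes a total weight of one; tied peaks split it equally.
    """
    dist = [0.0] * (2 * WINDOW_HALF + 1)
    for counts_by_offset in genes:
        peaks = highest_peak_offsets(counts_by_offset)
        for off in peaks:
            dist[off + WINDOW_HALF] += 1.0 / len(peaks)
    return dist


# Toy input: made-up counts for two genes.
toy_genes = [
    {-44: 10, -30: 4},          # single highest peak at -44
    {-44: 7, -34: 7, 250: 99},  # tie at -44 and -34; +250 lies outside the window
]
dist = peak_distribution(toy_genes)
```

Splitting the weight among tied peaks keeps the total contribution of every gene equal to one, so the array sums to the number of genes that have at least one TSS count in the window.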

The TSS signals were highly reproducible for each condition (r > 0.998, Pearson’s correlation tests on CPM values for combined TSS clusters). When monitored by IGV, the TSS signals were visually very similar among the conditions, with some exceptions (Fig. 1). In fact, as summarized in Table 1, a differential expression analysis between different cell types selectively detected known cell type-specific loci as significant (Parnell et al., 2021). Interestingly, the main TSS clusters on the sense strands of STE3, AFB1, MF(ALPHA)1 and MFA2 were associated with upstream TSS clusters, transcribed in the opposite direction. For MFA2, an a-cell-specific gene, activation of an upstream TSS cluster (XIV_351562_351645_+) was observed in α-cells, and was more prominent in YPD than in SC. Moreover, as shown in Figure 1, we detected a-cell-specific TSS clusters around the recombination enhancer (RE) of chromosome III (Wu and Haber, 1996; Li et al., 2019; Dinda et al., 2023). The detection of these TSS clusters may help to clarify how cell type-specific transcription and recombination are regulated, since knowing the precise positions of TSSs on the genomic coordinate for each gene is the key to understanding the roles of factors involved.
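As a rough illustration of the reproducibility check mentioned above, the sketch below scales raw counts to CPM and computes Pearson's r between two replicates. The counts are invented, and the plain library-size scaling is an assumption: CPM values produced by edgeR may include additional normalization that is not reproduced here.

```python
import math


def cpm(counts):
    """Scale raw counts to counts per million of the library total."""
    total = sum(counts)
    return [c * 1_000_000 / total for c in counts]


def pearson_r(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up counts for four TSS clusters in two replicates.
rep1 = [120, 30, 850, 0]
rep2 = [240, 60, 1700, 0]  # exactly twice rep1, so the CPM values coincide
r = pearson_r(cpm(rep1), cpm(rep2))
```

Because CPM removes differences in sequencing depth, the two toy replicates become identical after scaling and r is 1 up to floating-point error.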

Fig. 1. Genome browser view of TSS signals. An IGV snapshot of the genomic coordinate III:22,163–32,438. Genes on Watson (orange) and Crick (light blue) strands and other elements (gray) are shown at the top. Positions of differentially expressed TSS clusters are indicated with asterisks. Growth condition (SC and YPD) and genotype (a-cell and α-cell) are indicated to the left. TSS counts for each condition are colored orange when the Watson (W) strand is transcribed and light blue when the Crick (C) strand is transcribed. TSS counts for the Crick strand are plotted as negative values. TSS counts were scaled on a per-medium basis: for SC samples, the Watson strand scale is 0 to 16 and the Crick strand -16 to 0; for YPD samples, the scales are 0 to 11 and -11 to 0, respectively.

Table 1. TSS clusters with differential expression patterns

Cluster ID^a	Locus	logFC^b, α-cell vs. a-cell in SC	FDR, SC	logFC^b, α-cell vs. a-cell in YPD	FDR, YPD
II_429114_429132_-^f	PHO3	10.4	0	12.7	0
VII_345150_345209_-	MF(ALPHA)2	12.1	0	10.8	0
X_444870_444940_-	SAG1	10.0	0	8.64	0
XI_114635_114800_-	STE3	13.3	0	11.1	0
XI_115081_115203_+^c	upstream of STE3	3.20	3.72E-147	3.91	4.38E-238
XII_229628_229667_-	AFB1	5.43	0	5.39	0
XII_229912_230016_+^c	upstream of AFB1	4.65	1.10E-79	4.05	7.14E-59
XII_459849_459961_-^d	downstream of RDN5-1	0.306	0.475	2.44	4.49E-06
XIV_351562_351645_+^c,d	YNL146W (MFA2)	1.41	8.22E-10	2.01	2.40E-17
XVI_192794_192829_-^c	upstream of MF(ALPHA)1	9.26	4.42E-36	9.25	2.15E-40
XVI_193555_193638_+	MF(ALPHA)1	12.6	0	12.5	0
XVI_193902_193909_+	MF(ALPHA)1 (internal)	6.81	8.14E-60	9.54	1.22E-50
XVI_855769_855783_+^d	YPRCTy1-4	1.07	5.81E-07	4.94	1.05E-23
III_29041_29082_+	700 bp RE	-7.70	3.51E-99	-8.10	2.15E-82
III_29385_29398_-	700 bp RE	-6.82	5.60E-63	-7.68	1.63E-65
III_30854_30899_+	RDT1	-11.7	8.46E-196	-11.8	4.13E-178
III_293794_293913_+^e	a1 (HMRA1)	-13.5	0	-13.1	0
IV_1385071_1385168_+	MFA1	-9.06	0	-10.6	0
VI_82533_82565_+	STE2	-8.68	0	-9.60	0
VII_436873_436876_-	AGA2	-7.00	4.99E-70	-6.75	5.73E-55
VII_436886_436897_-	AGA2	-7.40	0	-8.99	7.17E-239
IX_322266_322324_+	BAR1	-9.63	7.94E-231	-9.72	5.86E-272
XI_46297_46390_-	STE6	-12.6	2.00E-291	-8.23	2.38E-286
XIV_351773_351853_-^c	upstream of MFA2	-10.1	8.77E-59	-9.72	1.21E-53
XIV_352343_352422_+	MFA2	-10.6	0	-11.4	0

^aChromosome, start and end coordinates, and transcribed strand (+ or -) are joined by underscores.

^bPositive values indicate that the TSSs were detected more in α-cells; negative, more in a-cells.

^cAntisense strand for the indicated gene is transcribed.

^dSignificant four-fold difference was observed only in YPD medium.

^eThe TSS cluster for a1 was detected in the HMRA1 locus, as the reference strain S288C is MATα.

^fThe promoter of the PHO3 gene, including the detected TSS cluster, in the MATa strain FY23 has been lost (see Fig. 2).
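For readers who want to process the cluster IDs in Table 1 programmatically, the naming scheme described in footnote a can be split into its fields as sketched below. This is a minimal Python sketch; the `TssCluster` type and `parse_cluster_id` function are hypothetical names, and the footnote markers attached to some IDs in the table must be removed before parsing.

```python
from typing import NamedTuple


class TssCluster(NamedTuple):
    chrom: str   # chromosome in Roman numerals, e.g. "XI"
    start: int   # first coordinate in the ID
    end: int     # second coordinate in the ID
    strand: str  # "+" or "-"


def parse_cluster_id(cluster_id):
    """Split an ID such as 'XI_114635_114800_-' into its four fields."""
    chrom, start, end, strand = cluster_id.split("_")
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand in {cluster_id!r}")
    return TssCluster(chrom, int(start), int(end), strand)


# The main STE3 cluster from Table 1.
cluster = parse_cluster_id("XI_114635_114800_-")
```

The strand check makes the function fail loudly if a footnote marker or any other suffix is still attached to the ID.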

Among the differentially expressed TSS clusters listed in Table 1, three TSS clusters, downstream of RDN5-1, inside YPRCTy1-4 and upstream of PHO3, were unexpected because the respective loci were previously either not recognized or only partially recognized as being expressed in a cell type-specific manner.

RDN5-1 is the sole copy of the complete 5S rRNA gene in the reference genome (McMahon et al., 1984). Expression of the TSS cluster downstream of RDN5-1 (XII_459849_459961_-) was only affected by cell type when cells were cultured in YPD, and not in SC. Thus, this differential expression could be associated with culture conditions.

YPRCTy1-4, which is located in the Crick strand of the genomic coordinate XVI:850,629-856,554, is a Ty1 retrotransposable element (Lesage and Todeschini, 2005). The position and transcription direction of the detected TSS cluster (XVI_855769_855783_+) suggest that it can modulate the activity of Ty1 itself and its neighboring genes. Indeed, the block II element (XVI:855,622-855,736), which contributes to cell type-dependent activation of adjacent gene expression (Company and Errede, 1988), resides next to the detected TSS cluster. Consistently, this TSS cluster was more active in α-cells and was repressed in a medium-dependent manner, with full repression in YPD.

PHO3 encodes a phosphatase responsible for thiamine pyrophosphate dephosphorylation and uptake of the reaction product (Nosaka et al., 1989). This gene is reportedly prone to local rearrangement with the neighboring paralog gene PHO5 (Takashita et al., 2013). Indeed, the PHO3 and PHO5 genes are fused to generate a chimeric gene in the MATa strain FY23 (Fig. 2). The rearrangement led to the amplification of a 1,660-bp PCR fragment, instead of the original 3,510-bp fragment. The junction was within the 98-bp identical regions located from nucleotides 298 to 395 in both genes’ coding sequences (yellow in Fig. 2A and 2C). Thus, in FY23, the promoter for PHO3 had been lost, causing the misdetection of the respective TSS cluster (II_429114_429132_-) as cell type-specific.

Fig. 2. Loss of the PHO3 promoter in the MATa strain FY23. (A) A schematic of the rearrangement found in the PHO3-PHO5 locus. Positions of primers ENY32 (5'-GTCGAGGTTAGTATGGCTTC-3') and ENY35 (5'-GCTCCGATATCTATTTCAGC-3') are shown with arrows. (B) PCR products amplified with the primers ENY32 and ENY35. Genomic DNA of S288C (BYD1, NBRP-Yeast, Japan) was used as the control template. PCR products were separated alongside a DNA Ladder One 1 kbp (Nacalai Tesque, Japan) (lane M) in a 0.8% agarose gel, stained with ethidium bromide, and photographed with a Gel Scene GS-GU (Astec, Japan). (C) Results of Sanger sequencing of the PCR products with the upstream primer ENY32.

When the TSS peak distribution was analyzed with respect to the +1 nucleosomes, which have been precisely localized by chemical mapping (Chereji et al., 2018), the highest TSS signals peaked at -44 bp from the dyad (Fig. 3A). The TSS peak was previously reported to be ~13 bp from the promoter-proximal end of the nucleosomal DNA, which was protected from micrococcal nuclease digestion (Albert et al., 2007; Challal et al., 2018; Bagchi et al., 2020). In contrast, our analysis suggested that the recognized TSS peak position should be shifted inward: it is located ~30 bp from the promoter-proximal end of the 147-bp nucleosomal DNA. Strikingly, 87% of the genes had the highest TSS signals in the 147-bp region potentially occupied by their +1 nucleosomes (gray in Fig. 3A), mostly in the promoter-proximal half, indicating that the TSS is generally hidden by the +1 nucleosome if it is fully wrapped with DNA.
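The relationship between the peak position relative to the dyad and its distance from the promoter-proximal end follows from simple arithmetic, sketched here in Python. The sketch assumes that the 147-bp nucleosomal DNA extends 73 bp on either side of the dyad and that negative offsets point toward the promoter; the function names are illustrative.

```python
NUCLEOSOME_LENGTH = 147               # bp of DNA in a fully wrapped nucleosome
HALF_LENGTH = NUCLEOSOME_LENGTH // 2  # 73 bp on each side of the dyad


def distance_from_proximal_edge(offset_from_dyad):
    """Distance (bp) from the promoter-proximal end, measured inward.

    Negative offsets lie on the promoter-proximal side of the dyad.
    """
    return offset_from_dyad + HALF_LENGTH


def is_inside_nucleosome(offset_from_dyad):
    """True if the position falls within the 147-bp footprint."""
    return abs(offset_from_dyad) <= HALF_LENGTH


# The peak at -44 bp from the dyad lies 29 bp (roughly 30 bp) inside the edge.
peak_distance = distance_from_proximal_edge(-44)
```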

Fig. 3. Location of TSS relative to +1 nucleosomes. (A) Distribution of the highest TSS signals around the +1 nucleosomes of 5,542 genes, which were previously determined by a chemical mapping method (Chereji et al., 2018). (B) Logo of DNA sequences around the TSS, located at position zero on the x-axis. (C) Distribution of the highest TSS signals on the nucleosomal coordinate with or without adenine at nucleotide position -8 from the respective TSS. In A and C, the 147-bp regions potentially occupied by the +1 nucleosomes are shaded in gray. In C, vertical dashed lines are drawn at nucleosomal nucleotide positions -(10n+3), where n is 0 to 7.

The highest TSSs predominantly had adenine at nucleotide position -8, in accordance with the literature (Zhang and Dietrich, 2005; Lu and Lin, 2019, 2021) (Fig. 3B). When focused on this position, the highest TSS signals that lacked the preferred adenine oscillated with a pitch of 10 bp, while those with adenine did not exhibit such clear oscillation (Fig. 3C). This observation suggests that the TSS is partially determined by the spatial orientation of the potential TSS relative to the rotationally positioned +1 nucleosome, especially when the preferred adenine at position -8 is absent.
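One simple way to look for a 10-bp pitch in such a distribution is to fold it by position modulo 10, as sketched below in Python. This is an illustrative approach and not necessarily the analysis behind Fig. 3C: the function names are hypothetical, the toy distribution is invented, and the range of -73 to 0 is an assumed choice covering the promoter-proximal half of the nucleosome.

```python
PITCH = 10  # the 10-bp pitch mentioned in the text


def dashed_line_positions(n_max=7):
    """Positions -(10n+3) for n = 0..n_max, as drawn in Fig. 3C."""
    return [-(PITCH * n + 3) for n in range(n_max + 1)]


def fold_by_phase(dist_by_offset, lo=-73, hi=0):
    """Sum a distribution over offsets lo..hi by (offset mod 10).

    A 10-bp oscillation shows up as a few phases that collect most of
    the signal; a flat profile spreads evenly over all ten phases.
    """
    folded = [0.0] * PITCH
    for off, value in dist_by_offset.items():
        if lo <= off <= hi:
            folded[off % PITCH] += value
    return folded


lines = dashed_line_positions()
# Toy distribution with signal placed only on the dashed-line positions.
toy = {off: 1.0 for off in lines}
folded = fold_by_phase(toy)
```

In Python the `%` operator returns a non-negative result for a positive divisor, so all positions of the form -(10n+3) fall into phase 7 and the toy signal accumulates in a single bin.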

In conclusion, we established a yeast CAGE dataset consisting of two cell types, each grown in SC and YPD. In addition to the anticipated cell type-specific TSS clusters, novel clusters were also identified. Using PHO3 as a model locus, we also demonstrated that unrecognized chromosomal rearrangements can lead to the misdetection of related TSS clusters as being regulated. Moreover, the utility of this dataset was demonstrated by conducting a comparative analysis with precisely mapped nucleosomes. Therefore, this dataset will be informative for future studies on cell type specificity and chromatin regulation of this excellent model organism.

DATA AVAILABILITY

Raw nucleotide sequence data and processed data are available at DDBJ under the accession numbers PRJDB17334 (BioProject) and E-GEAD-672 (GEA), respectively.

Experimental code for this study: https://github.com/hkatomed/Kawakamietal2024

ACKNOWLEDGMENTS

The authors are grateful to anonymous reviewers of GGS for providing constructive comments on our manuscript. We declare no conflicts of interest associated with this manuscript. This study was supported by research grants from the Ohsumi Frontier Science Foundation to H. K., and by JSPS KAKENHI grant numbers JP21K05505 and JP21H00255 to H. K., JP23K05646 and JP23H02178 to K. K., and by the Priority Research Funding from Meisei University to M. S. The authors thank the Interdisciplinary Center for Science Research, Shimane University, for use of facilities.
