Verification of Ribosomal Proteins of Aspergillus fumigatus for Use as Biomarkers in MALDI-TOF MS Identification

We have previously proposed a rapid identification method for bacterial strains based on the profiles of their ribosomal subunit proteins (RSPs), observed using matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS). This method can perform phylogenetic characterization based on the masses of housekeeping RSP biomarkers, ideally calculated from amino acid sequence information registered in public protein databases. With the aim of extending its field of application to medical mycology, this study investigates the actual state of the information on RSPs of eukaryotic fungi registered in public protein databases, through the characterization of ribosomal protein fractions extracted from the genome-sequenced Aspergillus fumigatus strains Af293 and A1163 as a model. In this process, we found two problems in the public protein databases. First, the RSP names are inconsistent, so we have provisionally unified them using the yeast naming system. Second, and more seriously, many incorrect sequences are registered. Surprisingly, more than half of the sequences are incorrect, due chiefly to mis-annotation of exon/intron structures. These errors could be corrected by a combination of in silico inspection by sequence homology analysis and MALDI-TOF MS measurements. We were also able to confirm conserved post-translational modifications in eleven RSPs. After these verifications, the masses of 31 expressed RSPs under 20,000 Da could be accurately confirmed. These RSPs have the potential to be useful biomarkers for identifying clinical isolates of A. fumigatus.


INTRODUCTION
Aspergillus is a diverse genus of very common fungi that have high economic and social impact. 1) Some strains are used industrially for microbial fermentation and production of organic compounds and enzymes. Several Aspergillus species are also known to be causative agents of mycoses, causing aspergilloses that include allergic bronchopulmonary aspergillosis, aspergilloma, and invasive aspergillosis. 2) Because susceptibilities to antifungal agents vary according to Aspergillus species, accurate identification of unknown Aspergillus clinical isolates is the key to selecting an appropriate antifungal agent.
Identification of Aspergillus species has traditionally been performed based on the morphology of the conidia and conidiogenesis. 1,2) However, morphological discrimination is subjective and requires special skills and experience. This has led to the increasing use of DNA-based characterizations to determine Aspergillus species. Identification of Aspergillus species has been reported using the internal transcribed spacer (ITS) region between the 18S, 5.8S, and 28S ribosomal RNA (rRNA) genes, 3) the D1/D2 region of the 28S rRNA gene, 4) and housekeeping genes such as the β-tubulin 5) and calmodulin 6) genes.
On the other hand, we have proposed a ribosomal protein-based MALDI-TOF MS method for bacterial characterization. 7–14) Our method can identify the species of a bacterium based on the profiles of its ribosomal subunit proteins (RSPs), which are highly abundant housekeeping proteins and easily observed by MALDI-TOF MS. The results of identification at the species level and discrimination at the strain level are correlated with the molecular evolution of these housekeeping proteins. Prokaryotic (bacterial) ribosomes consist of more than 50 subunit proteins, so using RSPs as biomarkers gives results equivalent to analyzing many genes. The key to the RSP-based method is the reliability of the reference mass list of RSP biomarkers, whose preparation is supported by bioinformatics. The theoretical masses of RSP biomarkers can be calculated from their amino acid sequences registered in public protein databases such as the National Center for Biotechnology Information (NCBI) database and the UniProt Knowledgebase (UniProtKB). Therefore, this method has the potential for universal use, since it is not circumscribed by commercial databases.
To extend this ribosomal protein-based method to the identification of eukaryotic Aspergillus species, we first attempted to characterize the RSPs of various genome-sequenced Aspergillus strains by MALDI-TOF MS. However, most RSPs in every strain were hard to assign. We found that the difficulty is mainly caused by two problems in the public protein databases. The first problem originates from the confusion of the nomenclature in fungi. Prokaryotic (bacterial) ribosomes consist of 57 kinds of RSPs, whereas eukaryotic ribosomes typically consist of 78 RSPs. The difference in numbers induces disagreements in the names of RSPs. So far, one nomenclature has been proposed for prokaryotes, based on Escherichia coli, while two nomenclatures have been proposed for eukaryotes, based on yeast and rats. Names based on these different nomenclatures are now mixed together, which makes it difficult to search databases and references by RSP name. Although a unified naming system for RSPs has also been proposed, 15) this proposal is not employed in the public protein databases at this time.
The second problem is that many amino acid sequences in the databases seem to be incorrect. Unlike prokaryotic genes, the genes of eukaryotes, including Aspergillus fungi, contain intron sequences. We performed a homology analysis of the RSPs of Aspergillus species and found low-homology parts in the amino acid sequences. Because the housekeeping ribosomal proteins should be highly conserved, we speculated that the intron sequences may be mis-annotated. Therefore, the sequences of RSPs could be corrected by combining in silico inspection by sequence homology analysis with verification of the expressed masses of RSPs by MALDI-TOF MS measurements.
In this paper, we describe the detailed procedures for the verification and correction of information on RSPs (i.e., protein names, intron sequences, amino acid sequences, and post-translational modifications) using two genome-sequenced strains of A. fumigatus as a model.

EXPERIMENTAL
Cell culture and preparation of ribosomal protein samples
The genome-sequenced strains of A. fumigatus Af293 (=IFM 54229) and A1163 (=IFM 53842), the neotype strain IFM 57323 NT, and a clinical isolate, IFM 62104, were provided by Chiba University's Medical Mycology Research Center. The genome-sequenced strains and IFM 57323 NT were grown in potato dextrose broth (PDB) medium at 25°C for three days. The IFM 62104 strain was grown in PDB medium at 37°C for four days.
After incubation, the growth medium was centrifuged at 5,800 g for 10 min. The harvested fungal bodies were ground (twice, for 20 s each time, at 7,000 rpm) with zirconia silica beads (ca. 1,300 mg, 0.1 mm in diameter) in a MagNA Lyser (Roche). After removing the beads and cell debris by centrifugation, the fungal lysates were subjected to ultracentrifugation at 73,400 g for 1 h to isolate the ribosome fraction as a precipitate.

MALDI-TOF MS measurements
Sample preparation, apparatus, and MALDI-TOF MS data acquisition methods were similar to those described in our previous papers. 7–14) The ribosomal protein sample solution (approx. 1 µL) was spotted onto the MALDI target. Approx. 1 µL of sinapinic acid matrix solution at a concentration of 20 mg/mL in 50% acetonitrile with 1% trifluoroacetic acid was then overlaid and dried in air. The MALDI-TOF MS measurements were performed using an AXIMA CFRplus time-of-flight mass spectrometer (Shimadzu/Kratos, Kyoto, Japan) in positive linear mode. More than three mass spectra for each sample were collected from more than three sample spots. External mass calibration was carried out using three peaks of ACTH (human,

Calculation of the theoretical mass of RSPs
The amino acid sequence of each RSP was obtained from UniProtKB (http://www.uniprot.org/). The sequence mass of each RSP was predicted using the Compute pI/Mw tool on the ExPASy proteomics server (http://www.expasy.org/tools/pi_tool.html), with N-terminal methionine loss considered first as a possible post-translational modification. The possibilities of other modifications are discussed below in the Results and Discussion section. The theoretical mass of each expressed RSP was calculated as the [M+H]+ ion.
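The mass calculation described above can be illustrated with a minimal Python sketch. It is a stand-in for the ExPASy tool, not a reimplementation of it: the function name `mh_plus`, the rounded average residue masses, and the water and proton constants are assumptions introduced here, so results may differ slightly from the values used in this paper.

```python
# Average (not monoisotopic) residue masses in Da; rounded textbook values.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153   # H2O accounting for the free N- and C-termini
PROTON = 1.0073   # proton carried by the singly charged [M+H]+ ion


def mh_plus(sequence, remove_initiator_met=True):
    """Return the average-mass [M+H]+ value (Da) for a protein sequence.

    If remove_initiator_met is True, a leading methionine is dropped first,
    mirroring the N-terminal methionine loss considered in the text.
    """
    seq = sequence.upper()
    if remove_initiator_met and seq.startswith("M") and len(seq) > 1:
        seq = seq[1:]
    neutral = sum(AVERAGE_RESIDUE_MASS[aa] for aa in seq) + WATER
    return neutral + PROTON
```

Calling `mh_plus(seq)` and `mh_plus(seq, remove_initiator_met=False)` on the same sequence gives the two candidate masses, with and without methionine loss, to compare against an observed peak.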

RESULTS AND DISCUSSION
Unification of the RSP name system
The nomenclature of RSPs is in a state of confusion. Names are typically composed of a letter (L for large subunit proteins and S for small subunit proteins) and a number, for which the numbering rule differs between species. The first nomenclature of RSPs was proposed for bacterial (Escherichia coli) RSPs in 1971. 16) For eukaryotic RSPs, mammalian (rat) RSPs were the first to be characterized and named, 17) and the proposal for the yeast (Saccharomyces cerevisiae) RSP naming system 18) followed. To resolve the nomenclatural confusion, a unified naming system for RSPs has been discussed, in which homologous RSPs are assigned the same name, independent of organism species. The first proposal was based on protein families, 19) and it was further modified into a new system for naming RSPs proposed in 2014. 15) Unfortunately, the new unified naming system 15) is not employed in the public protein databases at this time. This paper therefore provisionally adopts the yeast name system 18) for convenience of homology search, since Aspergillus and Saccharomyces are related organisms.

Table 1 (partial)
Yeast name  Unified name  Af293 registered name  Af293 accession  A1163 registered name  A1163 accession
            eS10          S10b                   Q4WLQ8           S10b                   B0Y8V2
S11         uS17          S11                    Q4WHU8           S11                    B0XUT5
S12         eS12          S12                    Q4WJM1           S12                    B0XP41
S13         uS15          S13                    Q4WGJ9           S13                    B0YCP0
S14         uS11          S11                    Q4X1C6           S11                    B0XS79
S15         uS19          S15, putative          Q4X1G1           S15, putative          B0XS46
S16         uS9           Rps16, putative        Q4X1C0           S9                     B0XS84
S17         eS17          S17, putative          Q4X1E0           S17, putative          B0XS66
S18         uS13          S13p/S18e              Q4WLH1           S13p/S18e              B0XM75
S19         eS19          S19                    Q4WJN7           S19                    B0XP26
S20         uS10          S10a                   Q4WIE3           S10a
To unify the name of each A. fumigatus RSP with the yeast name, a homology search of A. fumigatus RSPs against the RSPs of S. cerevisiae was performed using the NCBI blastp program (http://blast.ncbi.nlm.nih.gov/). Table 1 summarizes the data on A. fumigatus RSPs, namely the accession number and registered name in UniProtKB, the name under the yeast name system, and the name under the unified naming system as a reference for the future. Most of the RSPs of A. fumigatus registered in UniProtKB were named using the yeast name system. The remaining RSPs, named using another naming system, were renamed to the yeast name in the following manner. For example, L37a of A. fumigatus Af293, registered in UniProtKB as Q4WZH8, showed high homology with S. cerevisiae L43A (where A denotes one of the duplicate genes). Because L37a is based on the mammalian ribosome name, it was renamed L43 in line with the yeast name (incidentally, it corresponds to eL43 in the unified name 15) ).
This L43 protein showed more than 95% similarity to L43 of A. clavatus NRRL1, A. terreus NIH2624, and A. niger CBS513.88. These homologs in other Aspergillus species are registered under the yeast name. To prevent such confusion, all RSPs of A. fumigatus Af293 and A1163 were unified to the yeast name.
Ribosomal proteins L40, S30, and S31 are synthesized as fusion proteins with ubiquitin 20,21) (note that S31 is assigned as S27a in ref. 20). There are several different types of ubiquitin, all of which are highly conserved and well characterized, so identifying the ubiquitin part in a fusion protein sequence is an easy task. In UniProtKB, L40 is registered as "Ubiquitin UbiA" (accession numbers: A4D9S6 for Af293 and B0XNB9 for A1163). In this fusion protein, ubiquitin forms the N-terminal 76 amino acids, whereas L40 is the remaining C-terminal 52 amino acids. 20) In the case of S31, registered as "Ubiquitin (UbiC)" (Q4WXZ8 and B0XXM3), the N-terminal 76 amino acids are ubiquitin, so the remaining C-terminal chain is S31. Adding to the confusion, S30, which is registered as "S30/ubiquitin fusion" (Q4WCU4 and B0YDK3), is not a fusion protein, and the full length of the registered amino acid sequence corresponds to S30. The page for alkaline serine protease in UniProtKB (Q4WI20) includes the "ribosomal protein L14P family" in the Family & Domains field. L14P is the bacterial RSP name, which corresponds to yeast L23. The amino acid sequence of this protein showed high homology with L23 of S. cerevisiae, so the name of this protein was changed to L23. All the names of A. fumigatus RSPs were verified and changed to the yeast name using this procedure.
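The handling of the ubiquitin fusion entries described above can be sketched in a few lines of Python. This is only an illustration: the function name `split_ubiquitin_fusion` is hypothetical, and the sketch assumes, as stated in the text, that ubiquitin occupies exactly the N-terminal 76 residues. It does not check that the N-terminal part is actually ubiquitin, so an entry such as S30 (registered as a fusion but not one) must be excluded beforehand.

```python
UBIQUITIN_LENGTH = 76  # N-terminal ubiquitin length stated in the text


def split_ubiquitin_fusion(fusion_sequence, ubiquitin_length=UBIQUITIN_LENGTH):
    """Split a ubiquitin-RSP fusion into (ubiquitin part, ribosomal protein part)."""
    if len(fusion_sequence) <= ubiquitin_length:
        raise ValueError("sequence is too short to contain ubiquitin plus an RSP")
    ubiquitin = fusion_sequence[:ubiquitin_length]
    ribosomal_protein = fusion_sequence[ubiquitin_length:]
    return ubiquitin, ribosomal_protein
```

For a 128-residue L40 fusion entry, the second element of the returned pair is the 52-residue L40 sequence whose mass should be compared with the spectrum.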
Observation of MALDI-TOF mass spectra and peak assignment
The next step is the calculation of the theoretical mass of each RSP based on the corresponding amino acid sequence obtained from UniProtKB. The theoretical mass was then compared with the observed mass. Figure 1 shows the mass spectra of the ribosomal protein fractions prepared from A. fumigatus Af293 and A1163, with the peaks under m/z 20,000 assigned. We were ultimately able to assign 31 RSPs, but at this stage only eight peaks could be assigned for each strain using the amino acid sequences registered in UniProtKB and taking only N-terminal methionine loss into account. These peaks are indicated by the boxed protein names in Fig. 1. In our previous studies of bacterial RSPs, 7–14) most could be assigned by referring to the theoretical mass calculated from the registered amino acid sequences while considering only N-terminal methionine loss. The main reasons why only eight RSPs could be assigned are presumably that (1) many incorrect amino acid sequences are registered in the protein databases and (2) post-translational modifications other than N-terminal methionine loss occur. The following sections discuss the actual state of the registered information and how to correct erroneous sequences and infer post-translational modifications.
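The comparison of theoretical and observed masses can be pictured as a simple tolerance match. The sketch below is an assumption-laden illustration rather than the procedure used in this study: the function name `assign_peaks` and the 500 ppm default tolerance are hypothetical, and the paper does not state a matching tolerance.

```python
def assign_peaks(observed_mz, theoretical, tolerance_ppm=500.0):
    """Match each theoretical [M+H]+ mass to the closest observed peak.

    observed_mz: list of observed m/z values
    theoretical: dict mapping protein name -> theoretical [M+H]+ mass
    Returns a dict mapping protein name -> matched observed m/z.
    Proteins with no peak inside the tolerance are left out.
    """
    assignments = {}
    if not observed_mz:
        return assignments
    for name, mass in theoretical.items():
        closest = min(observed_mz, key=lambda mz: abs(mz - mass))
        if abs(closest - mass) / mass * 1e6 <= tolerance_ppm:
            assignments[name] = closest
    return assignments
```

A protein that is missing from the returned dictionary is a candidate for the sequence correction or modification analysis discussed in the following sections.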

Correction of registered amino acid sequences
Incorrectly registered amino acid sequences of bacterial RSPs were mainly caused by mis-annotation of start codons. 9,12) In this study, we found that incorrect sequences of eukaryotic RSPs of A. fumigatus were caused by mis-annotation of the exon/intron structure. The accurate coding DNA sequence (CDS) was determined by a combination of informatics procedures involving a homology search and a manual inspection of the DNA sequence of the corresponding gene, followed by confirmation of the correct mass of the expressed RSP by MALDI-TOF MS measurements. The details of the correction procedures are described below.
The amino acid sequences of RSPs tend to be highly conserved and show high homology with the proteins of other species. However, RSPs not assigned at the beginning tended to have different sequence lengths registered in the database. For example, Fig. 2 shows the multiple alignment of S29 of A. fumigatus, for which no peak could be observed at the calculated mass, with S29 of other Aspergillus strains, namely A. clavatus NRRL1, A. nidulans FGSC A4, and A. niger CBS513.88. The amino acid sequences between positions 1 and 54 are highly conserved among these strains, while the homology and length of the C-terminal side are markedly different. Eukaryotic S29 is highly conserved from yeast to humans, 22) and has 56 amino acids containing a specific zinc finger-like motif (C-x-x-C). 23) Since S29 of A. niger CBS513.88 and A. nidulans FGSC A4 have the zinc finger-like motif and 56 amino acids, these sequences are more likely to be correct. The DNA sequence of the S29 gene (rps29) of A. fumigatus Af293 was therefore compared to that of A. niger CBS513.88. The rps29 gene of A. niger CBS513.88 is located on c482296-481588 (708 bp) of supercontig An06 (NT_166522.1 in NCBI) and consists of 5 exons and 4 introns. The rps29 gene of A. fumigatus Af293 is located on c3211760-3211177 (583 bp) of chromosome 6 (NC_007199.1 in NCBI) and consists of 5 exons and 4 introns.
Figure 3 shows the sequence alignment of these genes, with exon regions underlined. In spite of the high sequence similarity of exon-1 to exon-3, the length of exon-4 is different: it is 57 bp for A. niger CBS513.88 and 61 bp for A. fumigatus Af293. Thus, the 4-bp difference indicated by the box in Fig. 3 seems to be redundant. If these 4 bp are assigned to an intron, as they are in A. niger S29, a frame shift occurs at exon-5, resulting in a shift of the stop codon (i.e., removal of the redundant italic sequence at the 3′-side in Fig. 3). The numbers of base pairs now match, with the correct amino acid sequence being 56 aa, which is common to a wide range of eukaryotes. The corrected amino acid sequence of S29 showed more than 90% similarity to those of A. clavatus and A. nidulans. The correct mass of the S29 ion ([M+H]+) was calculated as 6646.7 Da, and the corresponding peak was clearly observed in the mass spectra, as shown in Fig. 1. The same procedure was performed for S29 of A. fumigatus A1163, revealing the same sequence and mass as those of the Af293 strain.
The sequence information of L39 of both the Af293 and A1163 strains was not registered in the protein databases. We tried to find the open reading frame (ORF) of the L39 gene (rpl39) in the genome sequences of the Af293 and A1163 strains by manual inspection, using the rpl39 gene sequences of other Aspergillus species. As a result of a blast search performed using known rpl39 gene sequences, highly homologous sequences of rpl39 gene    1, c422041-421524). An alignment analysis of the putative rpl39 gene sequences with those of several Aspergillus species gave the exon/intron structure and a total of 156 bp of CDS. The resulting amino acid sequences were the same between the Af293 and A1163 strains, and also the same as L39 of A. oryzae RIB40 and A. flavus AF70. The theoretical mass of the L39 ion ([M+H]+) was determined as 6151.2 Da, and the corresponding peak was observed as shown in Fig. 1. These results strongly support the inferred sequence and expressed mass of L39 of the A. fumigatus strains.
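The effect of a misplaced exon/intron boundary, as in the S29 case above, can be reproduced on a toy sequence. The Python sketch below is purely illustrative: the helper names (`splice`, `introns_follow_gt_ag`, `translate`) and the toy gene are made up for this example and are not real rps29 or rpl39 data. It uses the standard genetic code and 0-based, end-exclusive exon coordinates.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in TCAG order for each codon position.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def splice(gene, exons):
    """Concatenate exon regions given as 0-based, end-exclusive (start, end) pairs."""
    return "".join(gene[start:end] for start, end in exons)


def introns_follow_gt_ag(gene, exons):
    """Return True if every intron between consecutive exons starts with GT and ends with AG."""
    for (_, end), (next_start, _) in zip(exons, exons[1:]):
        intron = gene[end:next_start]
        if not (intron.startswith("GT") and intron.endswith("AG")):
            return False
    return True


def translate(cds):
    """Translate a CDS up to (not including) the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy gene (hypothetical): exon, 10-bp intron, exon.
toy_gene = "ATGGCT" + "GTAAGTCCAG" + "TGGTAA"
good_exons = [(0, 6), (16, 22)]
bad_exons = [(0, 6), (12, 22)]  # second exon annotated 4 bp too early
```

With `good_exons` the spliced CDS is in frame and ends at a stop codon, whereas `bad_exons` leaves 4 extra bases in the CDS, shifts the reading frame, and loses the stop codon. This is the kind of length and C-terminal discrepancy that the homology alignment exposes and the observed mass then confirms or rejects.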
In this manner, the verification of A. fumigatus RSPs under 20,000 Da could be performed by a combination of manual sequence inspection and MALDI-TOF MS measurements. Surprisingly, more than half (17 of 31) of the RSPs were incorrectly registered in the public protein databases, mainly due to erroneous annotations of exon/intron structures. In addition, two RSPs were registered as fusion proteins, and L39 was absent. The corrected CDS and amino acid sequences of these 17 RSPs are summarized in the supporting information. The automatic annotation of exon/intron structures after whole-genome sequencing is likely to be imperfect, since the only clue applied to determine introns is the GT-AG rule (most introns start with GT and end with AG). Because accurate determination of cDNA by mRNA sequencing is both expensive and time-consuming, a full set of experimental cDNA sequence data for Aspergillus RSPs has not yet been reported. Our approach appears to be a simple and effective method of inferring accurate amino acid sequences of RSPs.

Post-translational modifications
Unidentified RSPs still remained after sequence correction, suggesting the presence of post-translational modifications. In this study, post-translational modifications could be inferred for 11 RSPs, as described in this section. These modifications appear to be conserved in eukaryotes.
Acetylation, especially at the N-terminus, seems to be a common post-translational modification in eukaryotic RSPs. Nine RSPs (L31, L35, S11, S15, S16, S18, S21, S24, and S28) showed clear peaks at +42 Da over the calculated sequence mass, suggesting acetylation. For example, although the amino acid sequence of S21 is slightly different between the Af293 and A1163 strains, clear peaks are seen at the +42 Da position for both samples, as shown in Fig. 4.
In yeast RSPs, when the penultimate amino acid residue is serine, N-terminal methionine loss followed by N-terminal acetylation is likely to occur. 24,25) Among the nine probably acetylated RSPs, L31, L35, and S18 have an MS- sequence at the N-terminal side. In yeast RSPs, S21 with ME- and S28 with MD- are acetylated. 25) This information strongly suggests the acetylation of S21 and S28 of the A. fumigatus strains, which have the same N-terminal sequences. Yeast S11, S15, S16, and S24 with MS- sequences are N-acetylated. 25) However, rat S11 (in UniProtKB, P62282) and S15 26) with MA- would also be N-acetylated. Therefore, S11 and S15 (and also probably S16) with MA- are likely to be N-acetylated.
Methylation is another possible post-translational modification of RSPs. Methylation of L42 at Lys-55 is evolutionarily conserved among eukaryotes. 27) Because sequence homology around Lys-55 is high (yeast Lys-55 corresponds to Lys-50 of A. fumigatus by similarity), methylation is likely to be a post-translational modification of L42 of A. fumigatus. A clear peak could in fact be observed around m/z 12028.3, which takes into account +14 Da added to the calculated sequence mass.
Prolyl dihydroxylation of eukaryotic S23 is known as an evolutionarily conserved modification, 28) and Pro-64 is hydroxylated in yeast S23. High sequence homology around Pro-64 of S23 suggests that S23 of the A. fumigatus strains is also hydroxylated, resulting in a +32 Da shift. The corresponding peaks could be clearly observed around m/z 15802.5.
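The mass-shift reasoning in this section can be summarized as a small lookup. The following Python sketch uses the nominal shifts quoted in the text (+42, +14, and +32 Da). The function name `explain_peak` and the 2 Da tolerance are assumptions made for illustration, and a matching shift is only a hint that still needs support from conserved modifications reported in the literature.

```python
# Nominal mass shifts (Da) quoted in the text.
PTM_SHIFTS = {
    "unmodified": 0.0,
    "acetylation": 42.0,
    "methylation": 14.0,
    "dihydroxylation": 32.0,
}


def explain_peak(observed_mz, sequence_mass, tolerance_da=2.0):
    """Return the names of modifications whose shifted mass matches the peak.

    sequence_mass is the calculated [M+H]+ mass of the unmodified sequence
    (after any N-terminal methionine loss has been taken into account).
    """
    return [
        name
        for name, shift in PTM_SHIFTS.items()
        if abs(observed_mz - (sequence_mass + shift)) <= tolerance_da
    ]
```

An empty list means that none of the listed modifications accounts for the peak, in which case the registered sequence itself is the more likely source of the discrepancy.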

List of ribosomal protein biomarkers and their applicability
In this way, we could finally confirm the masses of 31 of 50 expressed RSPs under 20,000 Da. Most of the intense peaks observed under m/z 20,000 could be identified, as shown in Fig. 1. The remaining RSPs were probably not identified because of low ionization efficiency due to their acidic properties and because of unclear post-translational modifications (we found more putatively methylated and acetylated RSPs, but they are omitted from this paper due to a lack of supporting references). Tables 2 and 3 summarize the assigned ribosomal proteins of the A. fumigatus Af293 and A1163 strains, together with calculated masses and possible post-translational modifications. Almost all identified RSPs have the same sequence and mass in both strains, except for S21, which differs by only one amino acid.
To confirm the applicability of the reference mass list, the RSPs of the neotype strain IFM 57323 NT and a clinical isolate, IFM 62104, were further characterized. Because the criterion for species identification is similarity to the type strain, the characterization of the RSPs of IFM 57323 NT is important for establishing a reliable biomarker list for the identification of A. fumigatus. The characterization of the clinical isolate IFM 62104, which had already been identified as A. fumigatus, was performed as a demonstration of the analysis of real samples. Figure 5 shows the partial mass spectra of ribosomal protein fractions obtained from (a) Af293, (b) A1163, (c) IFM 57323 NT, and (d) IFM 62104 (whole mass spectra of IFM 57323 NT and IFM 62104 are shown in Figs. SI-1 and SI-2 in the supporting information). In this mass range, seven identified RSPs (S31, L38, L43, S21, L37, L30, and L36) are commonly observed. Of the two types of S21, the peak for IFM 57323 NT and IFM 62104 appeared at the same position as S21 of A1163. In the entire mass spectra, all 31 RSP biomarkers could be observed for the IFM 57323 NT and IFM 62104 strains. These results suggest that the reference mass list can be used as a clue for the species identification of A. fumigatus.

CONCLUSION
In this study, we investigated the actual state of RSP information in the public protein databases by characterizing the RSPs of the genome-sequenced A. fumigatus strains Af293 and A1163. As a result, we could resolve the problems with the RSP information registered in the public protein databases.
As for the problem of confused nomenclature, all the RSP names were verified and unified to the yeast-based names, which are the most prevalent in the public protein databases (the names under the new unified naming system 15) are also listed). As for the second problem, which originates from incorrect sequence information, we have pointed out that more than half of the A. fumigatus RSP sequences are incorrect, mainly due to mis-annotation of exon/intron structures. Because RSPs are highly conserved, we could easily find candidates for the correct sequences and verify them by comparing the theoretical mass with the observed mass. In addition, post-translational modifications such as acetylation and methylation could also be confirmed.
By solving these problems, we have successfully completed the reference mass lists of two genome-sequenced strains of A. fumigatus. By using the completed sequence information of the RSPs of A. fumigatus as a reference, information on the RSPs of other related fungal strains can be more easily verified by combining in silico inspection with MALDI-TOF MS measurements. We are proceeding with the characterization of the RSPs of other genome-sequenced Aspergillus strains to make reliable lists of biomarker RSPs for the identification of Aspergillus species. Once the Aspergillus RSP biomarker lists have been compiled, ribosomal protein-based MALDI-TOF MS is anticipated to be a powerful and reliable tool in the field of clinical microbiology.