A strategy for predicting gene functions from genome and metagenome sequences on the basis of oligopeptide frequency distance

Takashi Abe; Ryo Ikarashi; Masaya Mizoguchi; Masashi Otake; Toshimichi Ikemura

doi:10.1266/ggs.19-00041

ABSTRACT

As a result of the extensive decoding of a massive amount of genomic and metagenomic sequence data, a large number of genes whose functions cannot be predicted by sequence similarity searches are accumulating, and such genes are of little use to science or industry. Current genome and metagenome sequencing largely depends on high-throughput and low-cost methods. In the case of genome sequencing for a single species, high-density sequencing can reduce sequencing errors. For metagenome sequences, however, high-density sequencing does not necessarily increase the sequence quality because multiple and unknown genomes, including those of closely related species, are likely to exist in the sample. Therefore, a function prediction method that is robust against sequence errors is increasingly needed. Here, we present a method for predicting protein gene function that does not depend on sequence similarity searches. Using an unsupervised machine learning method called BLSOM (batch-learning self-organizing map) for short oligopeptide frequencies, we previously developed a sequence alignment-free method for clustering bacterial protein genes according to clusters of orthologous groups of proteins (COGs), without using information from COGs during machine learning. This allows function-unknown proteins to cluster with function-known proteins, based solely on similarity with respect to oligopeptide frequency, although the method required high-performance supercomputers (HPCs). Based on a wide range of knowledge obtained with HPCs, we have now developed a strategy to correlate function-unknown proteins with COG categories, using only oligopeptide frequency distances (OPDs), which can be conducted with PC-level computers. The OPD strategy is suitable for predicting the functions of proteins with low sequence similarity and is applied here to predict the functions of a large number of gene candidates discovered using metagenome sequencing.

INTRODUCTION

Since the development of base sequence decoding technologies, the rate of deciphering genomic and metagenomic sequences has accelerated dramatically. Nucleotide and amino acid sequence similarity searches, such as BLAST (Altschul et al., 1997), are widely used for evolutionary analyses and are indispensable tools for predicting gene functions when genomes and metagenomes are newly decoded, thus serving as basic bioinformatics tools. While the usefulness of sequence similarity searches is apparent, the functions of nearly half of the inferred genes cannot be predicted by these searches, especially when highly novel genomes, as well as metagenomes from novel environmental samples, are deciphered. Therefore, while enormous numbers of gene candidates continue to accumulate, a substantial portion is of no use either scientifically or industrially.

Current genome sequencing largely depends on the wide use of high-throughput and low-cost technologies. In the case of genome sequencing for a single species, sequencing at high density can undoubtedly reduce sequence errors. However, when analyzing a metagenome sample, high-density sequencing does not necessarily increase the sequence quality because multiple genomes coexist in the sample, and thus the predicted gene sequences may have non-negligible errors.

Our group has analyzed the microbiome of ticks, which transmit a variety of viral, bacterial and protozoal pathogens and cause various tick-borne diseases (Nakao et al., 2013). Because of the medical and social importance of tick-borne diseases, we have searched for and characterized pathogens and pathogenic genes from numerous metagenomic sequences derived from the microbiome of multiple ticks. As mentioned above, the sequence quality of metagenomic sequences is thought to be inevitably low. When considering gene function prediction for low-quality sequences, comprehensive judgment using multiple function prediction methods involving different algorithms, including a new method that is robust against sequence errors, becomes important, and this is the starting point of the present study.

Regarding protein function, the three-dimensional organization of functional peptide segments is more important than the one-dimensional amino acid sequence, and, therefore, clear conservation of the amino acid sequence over the whole polypeptide often cannot be found between proteins with the same or similar functions. Function prediction programs (e.g., InterPro (Finn et al., 2017)) and databases (e.g., SCOP (Andreeva et al., 2014), FUGUE (Shi et al., 2001) and CATH (Sillitoe et al., 2015)), which focus on functionally important domains and motifs common among proteins with the same or similar functions, have been widely utilized (Das and Orengo, 2016). Additionally, function prediction techniques for proteins with low sequence similarity (proteins in the so-called “twilight zone”; sequence identity with available reference sequences is less than 30%) have been developed (Rost, 1999; Chang et al., 2008; Khor et al., 2015). Even so, these methods are inadequate for estimating the functions of the enormous numbers of function-unknown genes that are accumulating at present and will continue to do so in the future. The establishment of function prediction methods based on new principles that complement sequence similarity searches is urgently needed.

The self-organizing map (SOM) developed by Kohonen is an unsupervised machine learning algorithm that provides an effective tool for clustering and visualizing high-dimensional complex data on a two-dimensional map (Kohonen, 1990; Kohonen et al., 1996). We previously modified the conventional SOM for genome informatics, making the learning process and resulting map independent of data input order: this is the batch-learning SOM (BLSOM) (Kanaya et al., 2001; Abe et al., 2003). The unsupervised method thus developed is suitable for high-performance parallel computing and therefore for big data analysis. A BLSOM for oligonucleotide composition (e.g., 256 dimensions for tetranucleotide composition) can cluster genomic fragment sequences (e.g., 10 kb) of a wide range of species according to phylotype, without information regarding species during learning, and can thereby reveal various novel genome characteristics from genomic and metagenomic sequences (Uehara et al., 2011; Iwasaki et al., 2013, 2017). In addition, by focusing on short oligopeptide frequencies, we established a protein function prediction method that is robust against mutations including insertions/deletions (Abe et al., 2009). This sequence alignment-free method can predict the functions of a large number of proteins derived from environmental microorganisms, even for proteins with low identity and coverage levels in similarity searches. The BLSOM, however, requires substantial computer resources and computation time and is thus suited to high-performance supercomputers (HPCs), which are not widely used among experimental groups. Based on a wide range of information obtained via this artificial intelligence method using HPCs, we develop here a new function prediction strategy that can be conducted with PC-level computers. This strategy predicts protein function directly on the basis of oligopeptide frequency distance (OPD).

In the present study, amino acid sequences from the clusters of orthologous groups of proteins (COGs) database compiled by NCBI were used as the data set for function-known proteins (Galperin et al., 2015). The COG database clusters proteins derived from perfectly decoded genomes based on the relationship of the best hit from a sequence similarity search, thus generating protein function groups. These function groups are so useful that they are used in annotation by almost all microbial genome projects (Koonin and Wolf, 2008; Kuzniar et al., 2008; Altermann et al., 2017; Kristensen et al., 2017).

To predict the function of function-unknown query proteins, we calculated short oligopeptide frequencies in both query and COG proteins and, for each query, searched for the COG protein having the minimum Euclidean distance for its oligopeptide frequencies. We built an analytic workflow to perform this function prediction, utilizing the knowledge accumulated in BLSOM analyses using HPCs. To evaluate the performance of the present strategy, function prediction was performed on proteins obtained using environmental metagenome sequencing.

MATERIALS AND METHODS

Amino acid sequences

Amino acid sequences of COGs were obtained from ftp://ftp.ncbi.nih.gov/pub/COG/COG2014/data/, and sequences of proteins from environmental metagenome samples were obtained from ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/env_nr.gz. To reduce the computation time, we employed the method described by Ferrán et al. (1994), in which amino acids having similar physicochemical properties are grouped into the same residue set, and tripeptide or tetrapeptide frequencies are calculated over the resulting 11 or 6 degenerate groups, respectively: {V, L, I}, {T, S}, {N, Q}, {E, D}, {K, R, H}, {Y, F, W}, {M}, {P}, {C}, {A} and {G} for tripeptide frequencies, or {V, L, I, M}, {T, S, P, G, A}, {E, D, N, Q}, {K, R, H}, {Y, F, W} and {C} for tetrapeptide frequencies.
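The residue grouping described above can be illustrated with a minimal Python sketch. The group memberships are taken from the text; the names `GROUPS_11`, `GROUPS_6`, `build_map` and `degenerate`, and the use of integer group indices, are assumptions made only for this illustration and are not part of the published software.

```python
# Residue groups as listed in the text. The integer index given to each
# group is an arbitrary choice for this sketch.
GROUPS_11 = ["VLI", "TS", "NQ", "ED", "KRH", "YFW", "M", "P", "C", "A", "G"]
GROUPS_6 = ["VLIM", "TSPGA", "EDNQ", "KRH", "YFW", "C"]


def build_map(groups):
    """Map each amino acid letter to the index of the group containing it."""
    return {aa: index for index, group in enumerate(groups) for aa in group}


MAP_11 = build_map(GROUPS_11)
MAP_6 = build_map(GROUPS_6)


def degenerate(seq, mapping):
    """Recode a sequence as group indices.

    Letters outside the 20 standard amino acids (e.g., X) become None;
    how such residues were handled in the study is not stated in the text.
    """
    return [mapping.get(aa) for aa in seq.upper()]


if __name__ == "__main__":
    print(degenerate("VLIKRH", MAP_11))  # V, L, I share one group; K, R, H another
    print(11 ** 3, 6 ** 4)               # dimensions of tripeptide/tetrapeptide vectors
```

With 11 groups, the tripeptide vector has 11³ = 1,331 dimensions instead of 20³ = 8,000, which is the reduction in computation that the grouping is meant to provide.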

Workflow for function prediction based on oligopeptide frequency distance

An overview of the workflow is shown in Fig. 1. From approximately 1.8 million COG-assigned proteins, those proteins 100 amino acids (aa) in length or longer were selected and fragmented with a window size of 100 aa. This fragmentation was done because our previous BLSOMs showed that fragmented protein sequences gave a better prediction performance than did full-length sequences. For each fragmented sequence, dipeptide frequencies were calculated; the 400 (= 20²) dimensional vectorial data were abbreviated as Di20. In addition, amino acids having similar physicochemical properties were assembled into 11 groups as described above, and for the 11 amino acid-degenerate groups, the tripeptide frequencies were calculated; the 1,331 (= 11³) dimensional vectorial data were abbreviated as Tri11. These two sets of vectorial data yielded high performance regarding function prediction in the previous BLSOM and were used in the present study. To be more precise, from 1,785,722 amino acid sequences from the NCBI COG database, a total of 7,218,021 sequences fragmented by a 100-aa window size were obtained, and two oligopeptide frequencies (Di20 and Tri11) in all 100-aa fragments were calculated and used as two separate sets of reference data for COGs.
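As a rough illustration of how the Di20 and Tri11 reference vectors could be computed, the following Python sketch fragments a protein into 100-aa windows and counts dipeptides over 20 amino acids and tripeptides over the 11 residue groups. Several points are assumptions rather than statements from the text: the reference fragments are taken as non-overlapping, counts are normalized to relative frequencies, k-mers containing non-standard residues are skipped, and all function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
GROUPS_11 = ["VLI", "TS", "NQ", "ED", "KRH", "YFW", "M", "P", "C", "A", "G"]

AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
GROUP_INDEX = {aa: i for i, group in enumerate(GROUPS_11) for aa in group}


def fragments(seq, window=100, step=100):
    """Yield fixed-length windows; step == window gives non-overlapping
    fragments (assumed here for the COG reference set)."""
    for start in range(0, len(seq) - window + 1, step):
        yield seq[start:start + window]


def kmer_frequencies(codes, k, alphabet_size):
    """Relative k-mer frequencies over an integer-coded sequence.

    The vector has alphabet_size ** k entries; k-mers containing None
    (an unrecognized residue) are skipped.
    """
    vec = [0.0] * (alphabet_size ** k)
    total = 0
    for i in range(len(codes) - k + 1):
        kmer = codes[i:i + k]
        if any(c is None for c in kmer):
            continue
        index = 0
        for c in kmer:
            index = index * alphabet_size + c
        vec[index] += 1
        total += 1
    return [v / total for v in vec] if total else vec


def di20(fragment):
    """400-dimensional dipeptide frequency vector (Di20)."""
    return kmer_frequencies([AA_INDEX.get(a) for a in fragment], 2, 20)


def tri11(fragment):
    """1,331-dimensional degenerate tripeptide frequency vector (Tri11)."""
    return kmer_frequencies([GROUP_INDEX.get(a) for a in fragment], 3, 11)
```

Because every fragment has the same length, normalizing by the number of counted k-mers mainly matters when some k-mers are skipped; raw counts would otherwise rank reference fragments in the same order.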

Fig. 1.

Overview of protein function prediction using oligopeptide frequency distance. As an example of total proteins used in this analysis, 2,183,152 fragments were generated in step 1. After calculating the minimum Euclidean distance of oligopeptide frequencies between these fragments and databases (DBs) in step 2, 31,272 and 40,006 proteins were predicted by Di20 and Tri11, respectively, in step 3. Finally, in step 4, 28,378 proteins were predicted as opdCOG proteins.

In the case of function-unknown query proteins (i.e., those derived from metagenome sequences in the present study), fragmentation with a 100-aa window sliding with a 10-aa step was performed, as previously done in BLSOM, to reduce the effect of the fragmentation start position on matching with COG-derived 100-aa fragments. Di20 and Tri11 frequencies were calculated in each 100-aa fragment from query proteins (Step 1 in Fig. 1), their Euclidean distances to all 100-aa COG reference data were calculated, and the reference sequence with the minimum Euclidean distance was assigned (Step 2 in Fig. 1). The tentative function was thus predicted for each 100-aa fragment from the query (abbreviated as COG-ID). Steps 3 and 4 in Fig. 1 are explained in the Results.
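Steps 1 and 2 for a query protein can be sketched as follows, assuming the frequency vectors have already been computed. The brute-force linear scan, the in-memory list of `(cog_id, vector)` pairs and the function names are assumptions for illustration; the text does not describe how the distance search is implemented in the released software.

```python
import math


def sliding_fragments(seq, window=100, step=10):
    """100-aa windows taken every 10 aa along the query protein."""
    return [seq[i:i + window] for i in range(0, len(seq) - window + 1, step)]


def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_cog(query_vec, reference):
    """Return (cog_id, distance) of the reference fragment closest to the query.

    `reference` is a list of (cog_id, vector) pairs, one per 100-aa COG
    fragment. A linear scan is used for clarity, not for speed.
    """
    best_id, best_distance = None, float("inf")
    for cog_id, vec in reference:
        d = euclidean(query_vec, vec)
        if d < best_distance:
            best_id, best_distance = cog_id, d
    return best_id, best_distance


if __name__ == "__main__":
    # Toy two-dimensional vectors with made-up COG identifiers.
    toy_reference = [("COG_A", [1.0, 0.0]), ("COG_B", [0.0, 0.9])]
    print(nearest_cog([0.0, 1.0], toy_reference))
```

Each query fragment is searched independently, so the work can be divided across processes or machines by splitting either the query fragments or the reference set.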

The web service and source code can be accessed freely at http://opd.bio.ie.niigata-u.ac.jp/en/.

RESULTS

Preparation of the test data reliably assigned to COGs with BLASTP

To evaluate the performance of the OPD strategy, we used 7,003,524 protein sequences derived from metagenome studies compiled by NCBI as query data for function-unknown proteins. To examine a possible length effect of the query proteins on prediction performance, we produced nine sets with different lengths: 100–149, 150–199, 200–249, 250–299, 300–349, 350–399, 400–449, 450–499 and equal to or longer than 500 aa. After the set assignment, 10,000 proteins were randomly selected from each set.
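The length binning and random selection described above might be sketched as follows. The bin labels, the fixed random seed and the function names are assumptions for illustration; the text does not state how the random selection was implemented.

```python
import random


def length_bin(length):
    """Return the length-set label for a protein, or None if shorter than 100 aa."""
    if length < 100:
        return None
    if length >= 500:
        return ">=500"
    low = (length // 50) * 50
    return f"{low}-{low + 49}"


def sample_by_length(proteins, per_bin=10000, seed=0):
    """Randomly select up to `per_bin` protein IDs from each length set.

    `proteins` maps a protein ID to its amino acid sequence.
    """
    rng = random.Random(seed)
    bins = {}
    for protein_id, seq in proteins.items():
        label = length_bin(len(seq))
        if label is not None:
            bins.setdefault(label, []).append(protein_id)
    return {
        label: rng.sample(ids, min(per_bin, len(ids)))
        for label, ids in bins.items()
    }
```

Sampling the same number of proteins from every set keeps the nine length groups directly comparable, at the cost of not reflecting the length distribution of the full data set.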

For the total of 90,000 proteins thus selected, a similarity search using BLASTP (ver. 2.6.0+) was conducted against the original COG sequences (i.e., COG proteins without the 100-aa fragmentation). Using a threshold value of 1E-5 or less and requiring both sequence identity and coverage levels to be 70% or more, 15,184 proteins (16.9%) were reliably assigned to COGs via BLASTP, and the best hit COG for each protein was selected and abbreviated as blastCOG. The 15,184 blastCOG-assigned proteins were used to compare the prediction performance of BLASTP and the present strategy.
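The reliability criteria stated above (E-value of 1E-5 or less, identity and coverage of 70% or more) can be expressed as a small filter. This is only a sketch: the dictionary keys (`cog`, `evalue`, `identity`, `coverage`) are hypothetical names for already-parsed BLASTP results, and applying the thresholds before choosing the hit with the lowest E-value is an assumption about the order of operations.

```python
def best_reliable_hit(hits, max_evalue=1e-5, min_identity=70.0, min_coverage=70.0):
    """Return the COG of the best reliable hit for one query, or None.

    `hits` is an iterable of dicts with the hypothetical keys
    'cog', 'evalue', 'identity' (percent) and 'coverage' (percent).
    """
    reliable = [
        h for h in hits
        if h["evalue"] <= max_evalue
        and h["identity"] >= min_identity
        and h["coverage"] >= min_coverage
    ]
    if not reliable:
        return None
    # Among reliable hits, take the one with the lowest E-value.
    return min(reliable, key=lambda h: h["evalue"])["cog"]
```

A query with no hit passing all three thresholds returns `None`, which corresponds to a protein that is not counted as blastCOG.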

Evaluation using the test proteins reliably assigned to COGs with BLASTP

For the 15,184 blastCOG proteins, we separately predicted their candidate COGs based on only OPD, as follows. First, from the 15,184 sequences, 399,864 fragment sequences generated by fragmentation with a 100-aa window and a 10-aa sliding step were obtained (Step 1 in Fig. 1). Di20 and Tri11 values were then calculated for each fragment, and function prediction was performed by selecting a COG-ID as that with the minimum Euclidean distance in the 100-aa data set derived from COGs (Step 2 in Fig. 1), as described in Materials and Methods.

When the same COG-ID was found in more than 60% of all 100-aa fragments from one query protein, the COG was set as the most probable candidate for the function of the protein (Step 3 in Fig. 1). Finally, when the identical COG was predicted from both Di20 and Tri11, this COG was assigned as the final prediction and abbreviated as opdCOG for this protein (Step 4 in Fig. 1). If these two did not match, the function was assumed not to be predictable by the present strategy.
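Steps 3 and 4 amount to a per-protein vote followed by a consensus check, which can be sketched in Python as follows. The function names are hypothetical, and "more than 60%" is read here as a strict inequality.

```python
from collections import Counter


def majority_cog(fragment_cogs, threshold_percent=60):
    """Step 3: return the COG-ID found in more than `threshold_percent`
    of a protein's fragments, or None if no COG-ID reaches that share."""
    if not fragment_cogs:
        return None
    cog, count = Counter(fragment_cogs).most_common(1)[0]
    # Integer arithmetic avoids floating-point issues at exactly 60%.
    if count * 100 > threshold_percent * len(fragment_cogs):
        return cog
    return None


def opd_cog(di20_fragment_cogs, tri11_fragment_cogs, threshold_percent=60):
    """Step 4: accept a COG only when Di20 and Tri11 agree on it."""
    by_di20 = majority_cog(di20_fragment_cogs, threshold_percent)
    by_tri11 = majority_cog(tri11_fragment_cogs, threshold_percent)
    if by_di20 is not None and by_di20 == by_tri11:
        return by_di20
    return None


if __name__ == "__main__":
    # Toy per-fragment assignments with made-up COG identifiers.
    di = ["COG_A"] * 7 + ["COG_B"] * 3
    tri = ["COG_A"] * 9 + ["COG_C"]
    print(opd_cog(di, tri))
```

Requiring agreement between two different frequency vectors trades some coverage for a lower rate of assignments to unrelated COGs.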

When comparing these opdCOGs with blastCOGs, the percentage accordance was 96.9% for Di20 and 98.6% for Tri11 on average for nine different length groups (Table 1).

Table 1. Comparison between blastCOGs and opdCOGs

Sample ID^*1	#protein^*1	#fragment^*2	#Di20^*3	#Tri11^*4	#Tetra6^*5	#opdCOGs^*6	#Identical COGs among Di20, Tri11 and Tetra6^*7	#blastCOGs^*8	#Di20 in blastCOGs^*9	#Identical Di20 in blastCOGs^*10	%Identical Di20 in blastCOGs^*11
100–149	10,000	40,743	5,278	5,926	5,129	3,187	2,266	1,357	1,343	1,334	98.3
150–199	10,000	89,997	2,963	3,965	2,525	2,664	1,871	1,262	1,221	1,215	96.3
200–249	10,000	140,520	2,937	4,014	2,307	2,819	1,899	1,364	1,323	1,321	96.8
250–299	10,000	189,405	3,220	4,391	2,567	3,147	2,179	1,592	1,537	1,531	96.2
300–349	10,000	232,919	3,797	4,933	2,726	3,737	2,422	2,218	2,141	2,136	96.3
350–399	10,000	288,912	3,397	4,452	2,350	3,330	2,104	1,809	1,776	1,773	98.0
400–449	10,000	339,686	3,409	4,439	2,421	3,364	2,218	1,876	1,832	1,825	97.3
450–499	10,000	389,483	3,157	4,029	2,348	3,082	2,132	1,806	1,748	1,737	96.2
≧500	10,000	471,487	3,114	3,857	2,154	3,048	1,982	1,900	1,842	1,835	96.6
Total	90,000	2,183,152	31,272	40,006	24,527	28,378	19,073	15,184	14,763	14,707	96.9

Sample ID^*1	#Tri11 in blastCOGs^*12	#Identical Tri11 in blastCOGs^*13	%Identical Tri11 in blastCOGs^*14	#Tetra6 in blastCOGs^*15	#Identical Tetra6 in blastCOGs^*16	%Identical Tetra6 in blastCOGs^*17	#opdCOGs in blastCOGs^*18	#Identical opdCOGs in blastCOGs^*19	%Identical opdCOGs in blastCOGs^*20	#opdCOG-only^*21	%opdCOG-only^*22
100–149	1,350	1,344	99.0	1,188	1,102	81.2	1,330	1,328	97.9	1,857	21.5
150–199	1,243	1,240	98.3	965	954	75.6	1,210	1,208	95.7	1,454	16.6
200–249	1,350	1,348	98.8	1,003	996	73.0	1,318	1,316	96.5	1,501	17.4
250–299	1,576	1,568	98.5	1,193	1,178	74.0	1,532	1,526	95.9	1,615	19.2
300–349	2,193	2,187	98.6	1,461	1,448	65.3	2,139	2,134	96.2	1,598	20.5
350–399	1,799	1,794	99.2	1,229	1,223	67.6	1,772	1,769	97.8	1,558	19.0
400–449	1,865	1,859	99.1	1,339	1,327	70.7	1,828	1,822	97.1	1,536	18.9
450–499	1,784	1,770	98.0	1,381	1,366	75.6	1,746	1,736	96.1	1,336	16.3
≧500	1,878	1,867	98.3	1,251	1,240	65.3	1,839	1,832	96.4	1,209	14.9
Total	15,038	14,977	98.6	11,010	10,834	71.4	14,714	14,671	96.6	13,664	18.3

^*1 100–149 shows the protein group having lengths between 100 and 149 aa, and so on. From each group, 10,000 proteins were randomly selected.

^*2 Number of 100-aa fragments with a sliding step of 10 aa.

^*3 Number of proteins assigned to a COG by Di20.

^*4 Number of proteins assigned to a COG by Tri11.

^*5 Number of proteins assigned to a COG by Tetra6.

^*6 Number of total proteins assigned to a COG by both Di20 and Tri11.

^*7 Number of total proteins assigned to a COG by Di20, Tri11 and Tetra6.

^*8 Number of proteins reliably assigned to a COG by BLASTP.

^*9 Number of blastCOG proteins assigned to a COG by Di20. The percentage of false positives is 100 × (#Di20 in blastCOGs – #Identical Di20 in blastCOGs) / #Di20 in blastCOGs.

^*10 Number of blastCOG proteins assigned to a COG by both Di20 and BLASTP.

^*11 %Identical Di20 in blastCOGs is 100 × (#Identical Di20 in blastCOGs / #blastCOGs).

^*12 Number of blastCOG proteins assigned to a COG by Tri11. The percentage of false positives is 100 × (#Tri11 in blastCOGs – #Identical Tri11 in blastCOGs) / #Tri11 in blastCOGs.

^*13 Number of blastCOG proteins assigned to a COG by both Tri11 and BLASTP.

^*14 %Identical Tri11 in blastCOGs is 100 × (#Identical Tri11 in blastCOGs / #blastCOGs).

^*15 Number of blastCOG proteins assigned to a COG by Tetra6. The percentage of false positives is 100 × (#Tetra6 in blastCOGs – #Identical Tetra6 in blastCOGs) / #Tetra6 in blastCOGs.

^*16 Number of blastCOG proteins assigned to a COG by both Tetra6 and BLASTP.

^*17 %Identical Tetra6 in blastCOGs is 100 × (#Identical Tetra6 in blastCOGs / #blastCOGs).

^*18 Number of blastCOG proteins assigned to a COG by both Di20 and Tri11. The percentage of false positives is 100 × (#opdCOGs in blastCOGs – #Identical opdCOGs in blastCOGs) / #opdCOGs in blastCOGs.

^*19 Number of blastCOG proteins assigned to a COG by both opdCOGs and blastCOGs.

^*20 %Identical opdCOGs in blastCOGs is 100 × (#Identical opdCOGs in blastCOGs / #blastCOGs).

^*21 #opdCOG-only is (#opdCOGs – #opdCOGs in blastCOGs).

^*22 %opdCOG-only is 100 × (#opdCOG-only / (#protein – #blastCOGs)).
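As a worked check of the footnote formulas, the short Python sketch below recomputes the derived percentages from the counts in the Total row of Table 1. The counts are copied from the table; the dictionary layout and helper names are assumptions for illustration.

```python
# Counts from the "Total" row of Table 1.
N_PROTEIN = 90_000
N_BLASTCOG = 15_184
N_OPDCOG = 28_378

# method: (# assigned within blastCOGs, # identical to blastCOG)
TOTALS = {
    "Di20": (14_763, 14_707),
    "Tri11": (15_038, 14_977),
    "Tetra6": (11_010, 10_834),
    "opdCOG": (14_714, 14_671),
}


def percent_identical(identical, n_blastcog=N_BLASTCOG):
    """Footnotes *11, *14, *17, *20: 100 x (#identical / #blastCOGs)."""
    return 100 * identical / n_blastcog


def percent_false_positive(assigned, identical):
    """Footnotes *9, *12, *15, *18: 100 x (#assigned - #identical) / #assigned."""
    return 100 * (assigned - identical) / assigned


if __name__ == "__main__":
    for method, (assigned, identical) in TOTALS.items():
        print(
            f"{method}: identical {percent_identical(identical):.1f}%, "
            f"false positives {percent_false_positive(assigned, identical):.2f}%"
        )
    # Footnotes *21 and *22.
    opd_only = N_OPDCOG - TOTALS["opdCOG"][0]
    print(opd_only, f"{100 * opd_only / (N_PROTEIN - N_BLASTCOG):.1f}%")
```

Note that the two percentages use different denominators: accordance is relative to all blastCOG proteins, whereas the false-positive rate is relative only to those blastCOG proteins that the OPD method actually assigned.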

As an additional comparison to the previous BLSOM analysis (Abe et al., 2009), we tested the tetrapeptide composition after grouping into six categories based on similar physicochemical properties (abbreviated as Tetra6), and found that the accordance was lower than that of Di20 and Tri11, as found previously. Therefore, the following analyses in the present study focus only on Di20 and Tri11, as conducted in the previous BLSOM study. A Venn diagram shows a high level of overlap between the COG assignments obtained for opdCOGs and blastCOGs (Fig. 2).

Fig. 2.

Venn diagram representing COG predictions obtained by three OPD methods for the comparison of opdCOGs and blastCOGs. The number and percentage values in parentheses show the number of opdCOGs properly assigned to a COG with BLASTP and its percentage, respectively.

As in the previous BLSOM study, the OPD method using Tri11 predicted function better than the methods without residue grouping (Di20) or with Tetra6, although Di20 performed nearly as well as Tri11. Levels of false positives (i.e., assignment to unrelated COGs) were 0.38%, 0.41% and 1.60% for Di20, Tri11 and Tetra6, respectively, and the level was approximately 0.29% for opdCOGs. We believe that the final identification (i.e., opdCOGs) should be the most reliable prediction obtained with the present OPD strategy. Table 1 shows that for two data sets (100–149 aa and equal to or longer than 500 aa), there was no significant difference in the accordance level between opdCOGs and blastCOGs for the proteins reliably assigned to COGs with BLASTP.

Although we used a sequence alignment-free method, the results were almost identical to those obtained with BLASTP, in the case where COGs could be reliably assigned with BLASTP. This finding indicates that oligopeptide frequencies should represent basal features of functional motif peptides and of constituent elements for 3D structure formation. The proportions of genes categorized into particular COG categories with opdCOGs and blastCOGs are shown in Supplementary Fig. S1. Since opdCOGs matched almost completely with blastCOGs, a wide range of functional categories should be assignable by the OPD strategy.

Comparison with BLSOM method

To compare the present performance with our previous BLSOM, COGs for 4,240 Sargasso proteins, which were predicted from metagenomic sequences obtained from the Sargasso Sea (Venter et al., 2004) and analyzed in the previous BLSOM study, were predicted using OPD and BLSOM methods. The Sargasso proteins were first robustly assigned to COGs with BLASTP, based on the strict criterion described by Kosuge et al. (2006) and used for evaluation in our previous BLSOM method. Venn diagrams of COG assignments obtained by the BLSOM and OPD methods (Fig. 3A and 3B, respectively) show that the OPD method gives higher accordance than the BLSOM method (Table 2). Levels of false positives in the OPD method were almost the same as in the BLSOM method. The computation time for protein groups having sizes between 100 and 149 aa was approximately 11 h with one PC (CPU: Intel (R) Xeon (R) E5-2630, 2.6 GHz × 2). In the case of the OPD method, we can use several PCs simultaneously to reduce the computation time, and this method is therefore suitable for a PC-level study.

Fig. 3.

Venn diagrams representing COG predictions against 4,240 Sargasso proteins obtained by BLSOM (A) and OPD (B) methods. The number and percentage values in parentheses show the number of Sargasso proteins properly assigned to a COG with BLSOM and OPD methods and its percentage, respectively. The details of the BLSOM method and 4,240 Sargasso proteins assigned to a COG by BLASTP were described by Abe et al. (2009).

Table 2. Comparison between BLSOM and OPD

Oligopeptide freq.	#Seq.^*2	BLSOM^*1 Number	BLSOM^*1 Percentage	OPD Number	OPD Percentage
Di20	4,240	3,590	84.7	4,172	98.4
Tri11	4,240	3,740	88.2	4,171	98.4
Tetra6	4,240	3,347	78.9	4,162	98.2
Consensus	4,240	2,986	70.4	4,148	97.8

^*1 The BLSOM method was described in Abe et al. (2009).

^*2 Number of Sargasso proteins assigned to a COG by BLASTP.

Proteins whose functions were not conclusively predicted with BLASTP

The value of a new strategy is reflected in its capability to predict functions of proteins for which sequence similarity searches are inconclusive. Hence, we examined the query proteins for which BLASTP could not conclusively assign COGs, i.e., the residual 74,816 proteins after subtraction of the above-analyzed 15,184 proteins from the total of 90,000 proteins. After the above-mentioned four steps shown in Fig. 1, the OPD strategy assigned COGs to 28,378 proteins in total. Excluding the 14,714 of these that were blastCOG proteins left 13,664 proteins (18.3% of the 74,816 proteins), which were abbreviated as opdCOG-only. Table 1 shows that the percentage of opdCOG-only proteins differed only slightly between the 100–149 aa data set and the data set of proteins equal to or longer than 500 aa.

Notably, when these opdCOG-only proteins were analyzed separately with BLASTP against the COG database, the best-hit COG was the same as that of the opdCOG in 99.5% of cases. Importantly, if we had used only BLASTP, the predicted functions for these 13,664 proteins would be tentative because their identity and coverage levels were less than 70%, but the combined prediction with the present alignment-free method should make the prediction more reliable.

We finally characterized features of the 13,664 opdCOG-only proteins in more detail, to investigate the applicability of the present method to the “twilight zone” proteins, which show a very low BLASTP identity and coverage level. Figure 4 displays the BLASTP identity and coverage levels found for the opdCOG-assigned protein with the lowest E-value for each of the environmental metagenome queries against the COG database. Although a major portion of the data had more than 50% identity and more than 90% coverage, nine proteins had less than 30% identity or coverage. For many cases where BLASTP yielded a low percentage identity (e.g., < 50%; there were 402 proteins with less than 50% identity), the OPD strategy could assign COGs regardless of differences in protein length (Fig. 4). Among the 402 proteins in the example above, 238 gave no similarity by BLASTP, when default parameters were used to search the NCBI non-redundant database. The number of proteins having low coverage (e.g., 60% or lower) was very low, indicating that the OPD strategy could assign COGs to proteins that had relatively low but long-range similarity.

Fig. 4.

Identity and coverage levels found for the opdCOG protein with the lowest E-value for each of the environmental metagenome queries, which excluded blastCOG proteins. Nine sets with different lengths were as follows: (A) 100–149, (B) 150–199, (C) 200–249, (D) 250–299, (E) 300–349, (F) 350–399, (G) 400–449, (H) 450–499 and (I) equal to or longer than 500 aa.

Additionally, even for the substantial number of proteins showing very low identity percentages (e.g., < 40%, including approximately 30%), the present strategy could predict opdCOGs. Therefore, this strategy provides a new tool for predicting the functions of proteins with low sequence identity, because the OPD strategy and BLASTP are based on clearly distinct principles: the former is sequence alignment-free and the latter is alignment-dependent.

DISCUSSION

In the present study, we randomly selected a total of 90,000 protein genes from 7,003,524 candidates derived from metagenome studies compiled by NCBI. Because the same number (10,000) of proteins with different lengths were randomly selected, the results shown in Fig. 2 could be directly compared without the confounding effect of data quantity. Since the strategy described here was established, we have begun to analyze all 7,003,524 proteins and will publish elsewhere the resulting opdCOGs along with the best-hit COGs obtained using BLASTP. When function assignments are made using only one method, they appear to be tentative, but the combined assignment with the sequence similarity search and the alignment-free method should give a more reliable prediction than that obtained using either method separately.

In this study, to test the performance of the newly developed method, we have focused on query proteins derived from metagenome sequences and COG proteins. When analyzing metagenome samples such as those obtained from novel and poorly characterized environments, protein genes of eukaryotes (i.e., fungi, nematodes, etc.) may become important. In this case, the OPD method can be easily adapted by using data of eukaryotic orthologous groups (KOGs) (Koonin et al., 2004), eggNOG (Huerta-Cepas et al., 2019) or OrthoDB (Kriventseva et al., 2019) as the reference. In analyzing higher eukaryote proteins, large multifunctional and multidomain proteins should become more important because higher eukaryote proteins are often very large. In the present study, one function (COG in the present case) was predicted if more than 60% of all 100-aa fragments (with a sliding step of 10 aa) of one protein were predicted as having the same function. In the case of a large protein, the protein can be divided into long segments with window lengths of 500 aa or more with a sliding step of 10 aa, and a possible function can be predicted for each large segment. Our group is currently developing this strategy.

Oligopeptides are components of proteins and can be involved in the formation of both functional motifs and 3D structures. The present strategy, which focuses on oligopeptide frequency distance, resembles functional motif searches such as PRINTS (Attwood, 2002) and PROSITE (Sigrist et al., 2013), which are powerful, indispensable methods for the prediction of protein function. The functional motif search, however, does not adequately incorporate information pertaining to 3D structure formation. In the case of the OPD strategy, not only oligopeptides for functional motifs but also those contributing to 3D structure formation should be included for predicting functions. Furthermore, because functional motifs are obtained mostly from experimentally well-characterized proteins, it may be less useful in predicting the functions of less-characterized proteins originating from poorly characterized organisms. In contrast, the present strategy has the advantage that no prior knowledge about the target protein is required and therefore should be appropriate for analyses of diverse novel proteins.

Sequence similarity and functional motif searches undoubtedly are indispensable tools for predicting protein functions, but even after combining these powerful methods, many function-unknown proteins remain. To complement these conventional methods and provide additional new information that is helpful for integrative assessments, we previously developed BLSOM mainly by using HPCs, and here, we developed the OPD method suitable for PC-level computers. In addition, if we use HPCs, the present method can analyze not only the massive amount of data currently available but also the unimaginably big data that will be accumulated in the near future.

The most important contribution of these two alignment-free methods (OPD and BLSOM) is to predict the functions of the increasingly large number of function-unknown proteins derived from poorly characterized organisms, such as those studied using metagenomic approaches (Dudhagara et al., 2015; Noecker et al., 2017), and thus to serve as a new and powerful tool in the post-genome era.

ACKNOWLEDGMENTS

This work was supported by Grants-in-Aid for Scientific Research (C: nos. 26330334 and 17K00401) from the Ministry of Education, Culture, Sports, Science and Technology, Japan. The computation in the early stage of the present study was done with the Earth Simulator of the Japan Agency for Marine-Earth Science and Technology.

REFERENCES

Abe, T., Kanaya, S., Kinouchi, M., Ichiba, Y., Kozuki, T., and Ikemura, T. (2003) Informatics for unveiling hidden genome signatures. Genome Res. 13, 693–702.
Abe, T., Kanaya, S., Uehara, H., and Ikemura, T. (2009) A novel bioinformatics strategy for function prediction of poorly-characterized protein genes obtained from metagenome analyses. DNA Res. 16, 287–297.
Altermann, E., Lu, J., and McCulloch, A. (2017) GAMOLA2, a comprehensive software package for the annotation and curation of draft and complete microbial genomes. Front. Microbiol. 8, 346.
Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402.
Andreeva, A., Howorth, D., Chothia, C., Kulesha, E., and Murzin, A. G. (2014) SCOP2 prototype: a new approach to protein structure mining. Nucleic Acids Res. 42, D310–D314.
Attwood, T. K. (2002) The PRINTS database: a resource for identification of protein families. Brief. Bioinform. 3, 252–263.
Chang, G. S., Hong, Y., Ko, K. D., Bhardwaj, G., Holmes, E. C., Patterson, R. L., and van Rossum, D. B. (2008) Phylogenetic profiles reveal evolutionary relationships within the “twilight zone” of sequence similarity. Proc. Natl. Acad. Sci. USA 105, 13474–13479.
Das, S., and Orengo, C. A. (2016) Protein function annotation using protein domain family resources. Methods 93, 24–34.
Dudhagara, P., Bhavsar, S., Bhagat, C., Ghelani, A., Bhatt, S., and Patel, R. (2015) Web resources for metagenomics studies. Genomics Proteomics Bioinformatics 13, 296–303.
Ferrán, E. A., Pflugfelder, B., and Ferrara, P. (1994) Self-organized neural maps of human protein sequences. Protein Sci. 3, 507–521.
Finn, R. D., Attwood, T. K., Babbitt, P. C., Bateman, A., Bork, P., Bridge, A. J., Chang, H.-Y., Dosztányi, Z., El-Gebali, S., Fraser, M., et al. (2017) InterPro in 2017-beyond protein family and domain annotations. Nucleic Acids Res. 45, D190–D199.
Galperin, M. Y., Makarova, K. S., Wolf, Y. I., and Koonin, E. V. (2015) Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Res. 43, D261–D269.
Huerta-Cepas, J., Szklarczyk, D., Heller, D., Hernández-Plaza, A., Forslund, S. K., Cook, H., Mende, D. R., Letunic, I., Rattei, T., Jensen, L. J., et al. (2019) eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47, D309–D314.
Iwasaki, Y., Abe, T., Wada, K., Wada, Y., and Ikemura, T. (2013) A novel bioinformatics strategy to analyze microbial big sequence data for efficient knowledge discovery: batch-learning self-organizing map (BLSOM). Microorganisms 1, 137–157.
Iwasaki, Y., Abe, T., Wada, K., Wada, Y., and Ikemura, T. (2017) An artificial intelligence approach fit for tRNA gene studies in the era of big sequence data. Genes Genet. Syst. 92, 43–54.
Kanaya, S., Kinouchi, M., Abe, T., Kudo, Y., Yamada, Y., Nishi, T., Mori, H., and Ikemura, T. (2001) Analysis of codon usage diversity of bacterial genes with a self-organizing map (SOM): characterization of horizontally transferred genes with emphasis on the E. coli O157 genome. Gene 276, 89–99.
Khor, B. Y., Tye, G. J., Lim, T. S., and Choong, Y. S. (2015) General overview on structure prediction of twilight-zone proteins. Theor. Biol. Med. Model. 12, 15.
Kohonen, T. (1990) The self-organizing map. Proc. IEEE 78, 1464–1480.
Kohonen, T., Oja, E., Simula, O., Visa, A., and Kangas, J. (1996) Engineering applications of the self-organizing map. Proc. IEEE 84, 1358–1384.
Koonin, E. V., Fedorova, N. D., Jackson, J. D., Jacobs, A. R., Krylov, D. M., Makarova, K. S., Mazumder, R., Mekhedov, S. L., Nikolskaya, A. N., Rao, B. S., et al. (2004) A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biol. 5, R7.
Koonin, E. V., and Wolf, Y. I. (2008) Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res. 36, 6688–6719.
Kosuge, T., Abe, T., Okido, T., Tanaka, N., Hirahata, M., Maruyama, Y., Mashima, J., Tomiki, A., Kurokawa, M., Himeno, R., et al. (2006) Exploration and grading of possible genes from 183 bacterial strains by a common protocol to identification of new genes: Gene Trek in Prokaryote Space (GTPS). DNA Res. 13, 245–254.
Kristensen, D. M., Wolf, Y. I., and Koonin, E. V. (2017) ATGC database and ATGC-COGs: an updated resource for micro- and macro-evolutionary studies of prokaryotic genomes and protein family annotation. Nucleic Acids Res. 45, D210–D218.
Kriventseva, E. V., Kuznetsov, D., Tegenfeldt, F., Manni, M., Dias, R., Simão, F. A., and Zdobnov, E. M. (2019) OrthoDB v10: sampling the diversity of animal, plant, fungal, protist, bacterial and viral genomes for evolutionary and functional annotations of orthologs. Nucleic Acids Res. 47, D807–D811.
Kuzniar, A., van Ham, R. C. H. J., Pongor, S., and Leunissen, J. A. M. (2008) The quest for orthologs: finding the corresponding gene across genomes. Trends Genet. 24, 539–551.
Nakao, R., Abe, T., Nijhof, A. M., Yamamoto, S., Jongejan, F., Ikemura, T., and Sugimoto, C. (2013) A novel approach, based on BLSOMs (batch learning self-organizing maps), to the microbiome analysis of ticks. ISME J. 7, 1003–1015.
Noecker, C., McNally, C. P., Eng, A., and Borenstein, E. (2017) High-resolution characterization of the human microbiome. Transl. Res. 179, 7–23.
Rost, B. (1999) Twilight zone of protein sequence alignments. Protein Eng. 12, 85–94.
Shi, J., Blundell, T. L., and Mizuguchi, K. (2001) FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol. 310, 243–257.
Sigrist, C. J. A., de Castro, E., Cerutti, L., Cuche, B. A., Hulo, N., Bridge, A., Bougueleret, L., and Xenarios, I. (2013) New and continuing developments at PROSITE. Nucleic Acids Res. 41, D344–D347.
Sillitoe, I., Dawson, N., Thornton, J., and Orengo, C. (2015) The history of the CATH structural classification of protein domains. Biochimie 119, 209–217.
Uehara, H., Iwasaki, Y., Wada, C., Ikemura, T., and Abe, T. (2011) A novel bioinformatics strategy for searching industrially useful genome resources from metagenomic sequence libraries. Genes Genet. Syst. 86, 53–66.
Venter, J. C., Remington, K., Heidelberg, J. F., Halpern, A. L., Rusch, D., Eisen, J. A., Wu, D., Paulsen, I., Nelson, K. E., Nelson, W., et al. (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science 304, 66–74.

