Development of a Protein-Gene Motif Dictionary System for One-Stop Motif Analysis

Masahiro OHTOMO; Hiroaki KATO

doi:10.2477/jccjie.2020-0008

Abstract

The amino acid sequence of a protein is closely related to its structure and function. This is especially true for particular structural features called motifs, which are well-conserved sites in genome sequences. Biological data, such as the data for biopolymers, are rapidly increasing. Constructing a database for efficient analysis is therefore important for identifying the structure and function of unknown biological data. Here, we constructed a protein-gene motif dictionary system for several model species using a NoSQL database management system. This dictionary stores protein sequence motifs based on PROSITE, along with their corresponding mRNA sequences. Additionally, the database stores 3D structural information of the corresponding protein sequence motifs. The protein-gene dictionary has 49,265 registered entries, 120,047 sequence motif sites, and 57,452 3D structural motif sites from 7 model species. Software tools with a graphical user interface were also developed to assist with intuitive search and analysis using the system. As a result, we discovered that the zinc protease motif co-occurs with the cysteine switch motif. The two motifs were separated by a gap of 117 to 293 amino acids in sequence; however, their 3D Euclidean distance was preserved at around 12 Å.

1 INTRODUCTION

Genetic information is preserved in nucleotide sequences such as DNA. The amino acid sequences of proteins are obtained by transcribing and translating this information. The 3D structure of a protein, which is dependent on its amino acid sequence, determines its biological function. A particular structural feature called a motif is well known to be closely related to the structure and function of proteins. Discovering common structural features of protein and gene motifs is important, not only for analyzing protein sequences, structures, and functions, but also for analyzing gene functions. Biological data are increasing rapidly owing to the progress made since the Human Genome Project and to advances in experimental devices. Meaningful biological knowledge can be discovered by combining many sequence and structural data. However, it is almost impossible to search and analyze multiple motif feature data without a computer. Therefore, constructing a knowledge database for these combined data is desirable for comprehensive feature analysis.

Databases for storing amino acid sequence motif information are available and accessible [1,2,3]. The PROSITE database is a widely used amino acid sequence motif database [1]. TRANSFAC, JASPAR, and H-DBAS are well-known gene sequence motif databases [4,5,6]. The DSMP and MegaMotifBase databases store 3D information of protein sequence motifs [7, 8]. Protein and gene motif information is distributed across multiple databases, which are operated independently by organizations and researchers for various purposes. Databases for sequence and structural motifs tend to be aimed at proteins. Gene sequence motif databases tend to store well-known sequence motifs such as the TATA box and GC box. These motifs are conserved in untranslated regions. Gene sequence motifs in coding regions have not been well explored. GenomeNet, NCBI, and ExPASy are examples of integrated databases that provide cross-search across collaborating biological databases [9,10,11]. Using these databases, the user can cross-search motif structure information and gene information for a protein. However, to obtain the gene sequence and 3D structural information for a given motif, the user has to extract them step by step by referring to annotations.

Shoji et al. proposed a protein-gene motif dictionary system that stores protein sequence motifs based on the PROSITE regular expression patterns and their corresponding gene motifs in humans [12]. They showed that the codon usage of a gene motif can explain the sequential features of protein sequence motifs in more detail. Additionally, they successfully used the gene motif to extract protein sequence motif candidates that could not be extracted using the regular expressions in PROSITE. Kobayashi et al. proposed the motif codon, a reduced representation of the gene motif, and developed a system that estimates preserved motif positions in a gene sequence including introns [13]. This system successfully extracted EF-hand motif sites that were interrupted by introns. These results show that the protein-gene motif makes it possible to analyze sequence motif features in more detail.

Using the protein-gene motif, we can describe sequence motif features in more detail than with the amino acid sequence alone. A protein structure determines a biological function by folding according to this sequence information. By relating the protein-gene motif to the structural information of the protein sequence motif, the sequence and structural features of a motif can be analyzed in more detail. Here, we constructed a protein-gene motif dictionary system for several species, storing the protein-gene motif and the 3D structural information of the protein sequence motif. We also implemented a protein-gene motif dictionary management system for one-stop search and analysis of protein-gene motif information.

2 MATERIALS AND METHODS

2.1 Overview of constructing the protein-gene motif dictionary

We considered constructing a protein-gene motif dictionary system which stores protein sequence motifs and their corresponding gene sequence motifs and 3D structural motifs. Using the dictionary system, we can easily obtain genetic and 3D structural information of protein sequence motifs. Therefore, we developed the protein-gene motif dictionary system infrastructure for feature analysis of motif structures.

The protein-gene motif dictionary system was developed in three steps. First, we mapped protein sequences to mRNA sequences, and extracted each protein sequence motif and its corresponding gene sequence motif. Second, we mapped protein sequences to structure information, and extracted the 3D structure information of each protein sequence motif. Finally, the information on these three kinds of motifs was stored in the protein-gene motif dictionary system. Figure 1 shows an overview of the construction of the protein-gene motif dictionary system.

Figure 1.

Overview of constructing a protein-gene motif dictionary.

2.2 PROSITE Motif

The PROSITE database registers various kinds of amino acid sequence motif information, such as binding sites and enzyme active sites, for example those involving calcium and zinc. PROSITE motifs are defined in three ways: as a regular expression pattern of the amino acid sequence, as a weight matrix with scores derived from sequence alignment, and as a rule written in natural language. The regular expression pattern is the most frequently registered type in the PROSITE database. Here, we used 1,309 regular expression patterns. Table 1 shows several regular expression patterns in PROSITE.

Table 1. Regular expression patterns in PROSITE

PROSITE ID	Name	Pattern
PS00024	Hemopexin	[LIFAT]-{IL}-x(2)-W-x(2,3)-[PE]-x-{VF}-[LIVMFY]-[DENQS]-[STA]-[AV]-[LIVMFY].
PS00142	Zinc protease	[GSTALIVN]-{PCHR}-{KND}-H-E-[LIVMFYW]-{DEHRKP}-H-{EKPC}-[LIVMFYWGSPQ].
PS00546	Cysteine switch	P-R-C-[GN]-x-P-[DR]-[LIVSAPKQ].

A regular expression in PROSITE follows several rules. Each amino acid is expressed as one character, and the elements of a pattern are separated by “-”. A “[ ]” pattern allows any one of the amino acids in the brackets. For example, “[GSTALIVN]” allows eight amino acids: glycine (G), serine (S), threonine (T), alanine (A), leucine (L), isoleucine (I), valine (V), and asparagine (N). A “{ }” pattern disallows the amino acids in the braces. For example, “{KHD}” disallows three amino acids: lysine (K), histidine (H), and aspartic acid (D). An “x” pattern allows any amino acid. If an element is repeated, the number of repetitions is given in “( ).” An “x(3)” pattern repeats the “x” pattern three times, and an “x(n,m)” pattern repeats it between n and m times, so “x(2,3)” repeats the “x” pattern two or three times.
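The rules above map almost one-to-one onto an ordinary regular expression. The following is a minimal sketch in Python, not the C# implementation used in this work; the function name `prosite_to_regex` is hypothetical, and the sketch assumes that only the elements described above occur (it does not handle the N- and C-terminal anchor symbols that PROSITE also defines).

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern into a Python regular expression string."""
    pattern = pattern.strip().rstrip(".")        # drop the terminating period
    parts = []
    for element in pattern.split("-"):
        element = element.replace(" ", "")
        # Split an element into its core and an optional repeat: (3) or (2,3).
        m = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                           # any amino acid
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"       # disallowed amino acids
        # "[...]" and single letters are already valid regex syntax.
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)

cysteine_switch = prosite_to_regex("P-R-C-[GN]-x-P-[DR]-[LIVSAPKQ].")
zinc_protease = prosite_to_regex(
    "[GSTALIVN]-{PCHR}-{KND}-H-E-[LIVMFYW]-{DEHRKP}-H-{EKPC}-[LIVMFYWGSPQ].")
```

With this translation, `re.search(zinc_protease, "VAAHELGHSL")` finds a match, which agrees with the zinc protease site reported for human MMP1 in the Results.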

Several regular expression patterns in PROSITE do not indicate a specific biological function but match many sequences as plain strings. These patterns are marked with “SKIP-FLAG = TRUE” in the “PROSITE.DAT” file. In our study, these motifs were excluded from the dictionary system. The “PROSITE.DAT” file includes a definition for each motif. A regular expression pattern is distinguished by a “PATTERN” string in the “ID” record. The definition file also contains a variety of other descriptions. When batch processing a motif search, it is costly to scan the whole definition file to find the regular expression patterns one by one. To avoid this, we created a file in advance that contains only the “ID,” “AC,” and “PA” records of the definition file. The “ID” record defines the entry name and type, the “AC” record defines the PROSITE ID, and the “PA” record defines the regular expression pattern of the motif.
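The pre-filtering step described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' tool: it assumes that each record line starts with a two-letter code followed by its value, that entries are terminated by a “//” line, and that the skip flag appears in a “CC” line. The sample text, including the entry names and the PS99998 and PS99999 accessions, is made up for illustration; only PS00142 and its pattern come from Table 1.

```python
def extract_pattern_records(dat_text: str):
    """Return (entry name, PROSITE ID, pattern) for usable PATTERN entries."""
    records = []
    entry = {"ID": "", "AC": "", "PA": "", "skip": False}
    for line in dat_text.splitlines():
        if line.startswith("//"):                      # end of one entry
            if "PATTERN" in entry["ID"] and entry["PA"] and not entry["skip"]:
                name = entry["ID"].split(";")[0]
                records.append((name, entry["AC"].rstrip(";"), entry["PA"]))
            entry = {"ID": "", "AC": "", "PA": "", "skip": False}
            continue
        code, value = line[:2], line[5:].strip()
        if code in ("ID", "AC"):
            entry[code] = value
        elif code == "PA":                             # may span several lines
            entry["PA"] += value
        elif code == "CC" and "SKIP-FLAG=TRUE" in value.replace(" ", ""):
            entry["skip"] = True
    return records

# Illustrative input only; the layout is an assumption, not a verbatim excerpt.
SAMPLE = """\
ID   ZINC_PROTEASE; PATTERN.
AC   PS00142;
PA   [GSTALIVN]-{PCHR}-{KND}-H-E-[LIVMFYW]-{DEHRKP}-H-{EKPC}-
PA   [LIVMFYWGSPQ].
//
ID   EXAMPLE_SKIPPED; PATTERN.
AC   PS99999;
PA   N-{P}-[ST]-{P}.
CC   /SKIP-FLAG=TRUE;
//
ID   EXAMPLE_PROFILE; MATRIX.
AC   PS99998;
//
"""

records = extract_pattern_records(SAMPLE)
```

Only the first sample entry survives: the second is dropped because of its skip flag, and the third because it is not a regular expression pattern.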

2.3 Mapping of protein sequences and gene sequences

To construct the dictionary system, we collected protein sequences and their corresponding mRNA sequences. Protein and mRNA sequences were taken from the NCBI RefSeq database on December 6, 2016 [14]. These sequences are described in FASTA format. An entry in this format is composed of multiple lines: the first line is a header that starts with “>,” and the second and following lines contain the sequence information. The header contains the ID, protein name, and species. To use only high-quality sequences, we collected protein sequences of 50 to 3,000 residues consisting only of natural amino acids. We collected the sequence data for 7 model species: human, mouse, rat, cow, pig, frog, and zebrafish. Using the cross-reference data in the RefSeq database, we mapped protein sequences to mRNA sequences. To extract the cross-reference data of a specific model species, we used the Taxonomy ID (tax-id) defined by NCBI. Table 2 shows the tax-ids of the 7 model species.

Table 2. List of Taxon identifiers

Taxonomy	Tax-id
Homo sapiens (human)	9606
Mus musculus (mouse)	10090
Rattus norvegicus (rat)	10116
Bos taurus (cow)	9913
Sus scrofa (pig)	9823
Xenopus tropicalis (frog)	8364
Danio rerio (zebrafish)	7955
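The sequence selection rule described in this section (50 to 3,000 residues, natural amino acids only) can be illustrated with a short Python sketch. The function names and the sample entries are hypothetical, and "natural amino acids" is assumed here to mean the 20 standard one-letter codes.

```python
NATURAL = set("ACDEFGHIKLMNPQRSTVWY")   # assumed: the 20 standard amino acids

def read_fasta(text):
    """Yield (header, sequence) pairs from multi-FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)          # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)

def is_high_quality(seq, min_len=50, max_len=3000):
    """Apply the length and alphabet filter to one protein sequence."""
    return min_len <= len(seq) <= max_len and NATURAL.issuperset(seq)

# Hypothetical entries; the IDs and sequences are not real RefSeq records.
SAMPLE = """\
>NP_EXAMPLE1.1 example protein A [Homo sapiens]
MKVAAHELGH
SLGACDEFGH
>NP_EXAMPLE2.1 example protein B [Homo sapiens]
MKXAAHELGHSLG
"""

entries = list(read_fasta(SAMPLE))
kept = [h for h, s in entries if is_high_quality(s, min_len=10)]
```

The lowered `min_len=10` in the last line is only there to keep the toy sequences short; the second entry is rejected because it contains the ambiguity code X.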

An mRNA sequence in RefSeq includes non-coding regions in addition to the coding region. To extract the protein-gene motif using the Shoji system, we prepared, for each protein sequence, the part of the corresponding mRNA sequence that coincides with the protein sequence and whose translation has the same length as the protein sequence. Using the Kobayashi system, we extracted the coding sequence (CDS) region in the mRNA sequence. A CDS region is extracted by global pairwise alignment, with dynamic programming, of the gene sequence and a reverse-translated gene sequence, which is the protein sequence reverse translated using a representative codon expression [13]. We extended the Kobayashi system to accept protein sequences, mRNA sequences, and cross-reference data, and to extract the CDS regions of the mRNA sequences by batch processing. Supplying the cross-reference data to the Kobayashi system speeds up the extraction process because the system knows in advance which protein and mRNA sequences correspond. For example, Figure 2 shows the estimated CDS region in a matrix metalloprotease 1 (MMP1) mRNA sequence using an MMP1 protein sequence.

Figure 2.

The estimated CDS region, using a protein and mRNA sequence of human MMP1. In the result (right), “M” is the original mRNA sequence and “Q” is the reverse-translated protein sequence. A “Segment” represents the CDS region information. Here, the CDS region spans bases 144 to 1550 (with the stop codon removed). A part of this estimated result and sequence information was omitted.
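To make the goal of this step concrete, the sketch below locates a CDS by a much simpler technique than the one used in this work: it translates the mRNA in each of the three forward reading frames with the standard genetic code and looks for an exact occurrence of the protein sequence. This is a deliberately simplified stand-in for the alignment-based Kobayashi system, it tolerates no mismatches, and the function names and toy sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated with base order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def find_cds(mrna, protein):
    """Return the zero-based (start, end) of the CDS, end exclusive, or None.

    The returned region excludes the stop codon.
    """
    mrna = mrna.upper().replace("U", "T")
    for frame in range(3):
        pos = translate(mrna[frame:]).find(protein)
        if pos >= 0:
            start = frame + 3 * pos
            return start, start + 3 * len(protein)
    return None

region = find_cds("ggatgaaagtttaagc", "MKV")   # toy mRNA with short flanks
```

For the toy input, the coding region is the slice `"ggatgaaagtttaagc"[2:11]`, that is, `atgaaagtt`, which has exactly three times the length of the protein.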

2.4 Extracting the protein-gene sequence motifs

We extracted the protein-gene sequence motifs using the Shoji system. Protein and CDS sequences were input into the system as multi-FASTA files, each containing multiple FASTA entries. Note that the order of the entries in the protein multi-FASTA file must correspond to the order of the entries in the CDS multi-FASTA file. To implement the Shoji system in the C# language, the regular expression patterns in PROSITE were translated into C# regular expression patterns. First, for each protein sequence, we obtained the start index and length of each match by a sequence motif search. The start index is zero-based. If the same motif matched a protein sequence more than once, each match was treated as an individual motif. Then, we extracted the gene sequence motif corresponding to the protein sequence motif from the estimated CDS sequence. The CDS sequence corresponds to the protein sequence in codon units, so the CDS sequence is three times as long as the corresponding protein sequence. The gene sequence motif can therefore be extracted from the CDS sequence by multiplying the start index and length of the protein sequence motif by three. Figure 3 shows the results of the extracted protein-gene motif of zinc protease in the human MMP1 protein and CDS sequence.

Figure 3.

Result of extracting zinc protease protein-gene sequence motif in human MMP1 protein sequence and its corresponding CDS sequence. “Total” represents the sequence length, and the extracted results are described under the “total” information.
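The index arithmetic of this step can be sketched in a few lines of Python. This is an illustration, not the C# code of the Shoji system: the function name and dictionary keys are hypothetical, and the flanking residues and codons of the toy sequence are invented. The motif itself and its codons are the ones reported for human MMP1 in the Results. Note that `finditer` reports non-overlapping matches only, which is an assumption about how repeated matches should be counted.

```python
import re

# Zinc protease pattern (PS00142) written as a Python regular expression.
ZINC_PROTEASE = re.compile(
    r"[GSTALIVN][^PCHR][^KND]HE[LIVMFYW][^DEHRKP]H[^EKPC][LIVMFYWGSPQ]")

def extract_protein_gene_motifs(protein, cds, regex):
    """Return every motif match with its zero-based position and gene motif."""
    if len(cds) != 3 * len(protein):
        raise ValueError("CDS must be exactly three times the protein length")
    hits = []
    for m in regex.finditer(protein):
        start, length = m.start(), m.end() - m.start()
        hits.append({
            "start": start,
            "length": length,
            "protein_motif": m.group(),
            # Codon units: multiply the start index and the length by three.
            "gene_motif": cds[3 * start:3 * (start + length)],
        })
    return hits

# Toy sequence: invented flanks around the motif reported for human MMP1.
protein = "MK" + "VAAHELGHSL" + "G"
cds = "atgaaa" + "gttgcagctcatgaactcggccattctctt" + "ggc"
hits = extract_protein_gene_motifs(protein, cds, ZINC_PROTEASE)
```

In the full MMP1 sequence the same arithmetic gives a gene motif starting at base 3 × 214 = 642 with a length of 3 × 10 = 30 bases, in line with the positions 642 to 671 reported in the Results.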

2.5 Extraction of 3D structural motif

We collected 3D structural data of proteins from a snapshot of the Protein Data Bank (PDB) taken on January 1, 2017 [15]. We mapped sequence data to 3D structural data using cross-reference data from the UniProt database [16]. There are no direct cross-reference data between protein sequences in RefSeq and structural data in PDB. We therefore created them by joining two sets of cross-reference data: UniProt ID to PDB ID, and UniProt ID to RefSeq ID.
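Joining the two cross-reference sets through the shared UniProt ID can be sketched as below. The function name and the in-memory dictionary layout are assumptions made for illustration; the identifiers in the example (P03956, NP_002412.1, 1SU3, 1AYK) are the ones that appear elsewhere in this paper for human MMP1, and the example lists only two of its PDB structures.

```python
from collections import defaultdict

def join_cross_references(uniprot_to_refseq, uniprot_to_pdb):
    """Build a RefSeq ID -> PDB IDs map by joining on the UniProt ID."""
    refseq_to_pdb = defaultdict(set)
    for uniprot_id, refseq_ids in uniprot_to_refseq.items():
        for refseq_id in refseq_ids:
            refseq_to_pdb[refseq_id].update(uniprot_to_pdb.get(uniprot_id, ()))
    # Sort for a stable, reproducible result.
    return {refseq_id: sorted(pdb_ids)
            for refseq_id, pdb_ids in refseq_to_pdb.items()}

refseq_to_pdb = join_cross_references(
    {"P03956": ["NP_002412.1"]},
    {"P03956": ["1SU3", "1AYK"]},
)
```

A RefSeq entry whose UniProt ID has no PDB cross-reference ends up with an empty list, which makes it easy to see which entries have no structural data.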

We extracted 3D structural information corresponding to the protein sequence motif. However, there were two problems when mapping the protein sequence motif to its corresponding 3D structural information. First, the protein structural data in the PDB that correspond to a protein sequence may lack, or leave undefined, the 3D structural information for the region of the protein sequence motif. Therefore, it is necessary to confirm whether the 3D structural information of the protein sequence motif exists in the PDB, which would normally require scanning the PDB data. Second, the corresponding PDB data may be registered as a quaternary structure that consists of several tertiary structures called chains. A quaternary structure is classified as a homo- or hetero-quaternary structure according to whether or not it is composed of identical chains. To extract the 3D structural information of a protein sequence motif, it is necessary to check each chain in the PDB data. Here, 3D structural information was extracted from all the chains.

To solve these problems, we constructed a protein sequence database that stores each chain identifier (the PDB ID with the Chain ID appended) together with its sequence information. Using this database, it is possible to check whether the 3D structural information of a motif exists by string matching. Therefore, it is not necessary to scan the PDB data merely to confirm whether the 3D structural information of a motif exists, which speeds up the extraction process. To extract the 3D structural information of a protein sequence motif, the PDB ID is identified using the cross-reference data. Then, if the 3D structural information of the protein sequence motif exists in the PDB data, all the atom information is extracted by scanning the PDB data. The extracted atom information is saved with header information in a separate file named “(PDB ID)[(RefSeq ID)](PROSITE ID)_(order).pdb.” For example, the human MMP1 protein sequence (RefSeq ID: NP_002412.1) conserves three sequence motifs, and a corresponding PDB protein structure is PDB ID: 1SU3A. The filename for the 3D structural information of the second motif is “1SU3A[NP_002412.1]PS00142_2.pdb.” Appending the order of appearance of a sequence motif identifies each motif uniquely, even when a sequence conserves the same motif more than once.

In addition, a protein structure in the PDB may have been determined by nuclear magnetic resonance (NMR). Such a structure is registered as multiple models; we used the first model, written as “MODEL 1,” as the representative model.
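The three mechanical parts of this procedure (the string-matching check against the chain sequence database, the file naming rule, and the restriction to the first model) can be sketched in Python as follows. All function names are hypothetical and the chain sequences in the example are toy data; the sketch relies only on the standard `ATOM` and `ENDMDL` record names of the PDB file format.

```python
def chains_containing(motif_seq, chain_sequences):
    """Map each chain identifier (PDB ID + chain ID) that contains the motif
    to the zero-based position of the motif in that chain's sequence."""
    return {chain_id: seq.find(motif_seq)
            for chain_id, seq in chain_sequences.items()
            if motif_seq in seq}

def motif_filename(pdb_chain_id, refseq_id, prosite_id, order):
    """Build '(PDB ID)[(RefSeq ID)](PROSITE ID)_(order).pdb'."""
    return f"{pdb_chain_id}[{refseq_id}]{prosite_id}_{order}.pdb"

def first_model_atom_lines(pdb_text):
    """Keep the ATOM records of the first model only.

    Files without MODEL/ENDMDL records yield all of their ATOM records.
    """
    lines = []
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break
        if line.startswith("ATOM"):
            lines.append(line)
    return lines

# Toy chain sequence database; the sequences are invented for illustration.
chain_db = {"1SU3A": "GGVAAHELGHSLGG", "9XYZA": "GGGGGGGG"}
found = chains_containing("VAAHELGHSL", chain_db)
name = motif_filename("1SU3A", "NP_002412.1", "PS00142", 2)
```

Here `name` reproduces the example filename given in the text, and `found` contains only the chain in which the motif string occurs.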

2.6 The protein-gene motif dictionary system using MongoDB

We constructed the protein-gene motif dictionary system containing protein sequence motifs and their corresponding gene sequence motifs and 3D structural motifs. This system was constructed with MongoDB, a document-oriented Not Only SQL (NoSQL) database [17]. The data schema of MongoDB is expressed in JavaScript Object Notation (JSON), which is highly readable by both humans and computers. Relational databases such as MySQL and Oracle require a schema to be defined in advance. Therefore, the schema has to be redefined each time a different schema is needed. To solve this problem, NoSQL was developed as a class of database management systems that do not require a fixed schema. Traditional relational database management systems may be insufficient because diverse biological data are increasing rapidly. Indeed, several studies have constructed biological databases using highly flexible NoSQL systems [18, 19].

Table 3 shows the JSON schema of the protein-gene motif dictionary system. The NAME column is the header information in FASTA format. The SEQUENCE and LENGTH columns are the sequence information and the sequence length. The MOTIF column is the protein-gene motif information, which contains the sequence, length, and start position of each sequence motif. The 3D MOTIF column is the reference information to the 3D structural motif. By managing the NAME and MOTIF columns separately, it is possible to narrow a search to a protein family or to proteins that conserve a specific motif.

Table 3. JSON schema of the protein-gene motif dictionary system in MongoDB

Name	Description
NAME	ID, entry name, species
SEQUENCE	Information of each sequence
LENGTH	Length of each sequence
MOTIF	Protein-gene motif information
3D MOTIF	Reference for 3D motif information
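As an illustration of what one stored document might look like, the sketch below builds an entry with the five top-level fields of Table 3 and serializes it to JSON in Python. Only the top-level field names come from Table 3; every nested key is a hypothetical choice, the full sequences are replaced by placeholder strings, and the lengths are derived from the CDS range given in the caption of Figure 2 (bases 144 to 1550, that is, 1,407 bases or 469 codons). The motif values are those reported for human MMP1 in the Results.

```python
import json

entry = {
    "NAME": {
        "id": "NP_002412.1",
        "entry_name": "interstitial collagenase isoform 1 preproprotein",
        "species": "Homo sapiens",
    },
    # Placeholders: the real documents would hold the complete sequences.
    "SEQUENCE": {"protein": "(protein sequence omitted)",
                 "gene": "(CDS sequence omitted)"},
    "LENGTH": {"protein": 469, "gene": 1407},
    "MOTIF": [
        {
            "prosite_id": "PS00142",
            "protein": {"sequence": "VAAHELGHSL", "start": 214, "length": 10},
            "gene": {"sequence": "gttgcagctcatgaactcggccattctctt",
                     "start": 642, "length": 30},
        }
    ],
    "3D MOTIF": ["1SU3A[NP_002412.1]PS00142_2.pdb"],
}

document = json.dumps(entry, indent=2)   # JSON text of the document
```

Because the motif information is a list nested inside the entry, an entry with several motifs needs no change to the layout, which is the flexibility that motivated the choice of a document-oriented database.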

3 RESULTS AND DISCUSSION

3.1 Constructing the protein-gene motif dictionary system for human

First, we extracted the protein-gene motifs for human. The zinc protease sequence motif was extracted from 215 sites in 213 entries. The 3D structural motif was related to 459 sites in 23 entries. The zinc protease pattern matched the “VAAHELGHSL” sequence at residues 214 to 223 of the MMP1 protein (NP_002412.1). The corresponding gene sequence motif was the “gtt gca gct cat gaa ctc ggc cat tct ctt” sequence at bases 642 to 671 in the CDS region of the MMP1 mRNA sequence (NM_002421.3). This result was confirmed to be accurate because the position of the gene sequence motif corresponded to that of the protein sequence motif. The protein also conserved the cysteine switch and hemopexin motifs, which were likewise correctly extracted and related. The human MMP1 protein was related to 24 PDB structures. The corresponding 3D structural motif was related to residues 215 to 224 in the 1SU3A and 1SU3B structures, and to residues 115 to 124 in the 1AYKA structure. Therefore, a protein-gene motif could be extracted and related even when the position of the protein sequence motif differed from that of the 3D structural motif, and when the motif existed in multiple chains. The 3D structural motifs of the cysteine switch and hemopexin motifs were also extracted and related to the 1SU3A and 1SU3B structures.

The protein-gene motif dictionary system for human contained 15,761 entries and 907 motifs with a total of 45,273 sites. These motifs covered about 69% of the 1,309 PROSITE regular expression patterns. Also, the 3D structural motifs comprised 407 motifs with a total of 44,242 sites. Table 4 shows the results of the constructed protein-gene motif dictionary system for human.

Table 4. Statistical data of the protein-gene motif dictionary system

	human	mouse	rat	cow	pig	frog	zebrafish
Entries	44,567	30,499	17,569	13,235	4,075	8,566	15,508
Residues	23,271,034	16,134,417	8,780,680	6,443,710	1,696,481	3,792,530	7,162,373
Hit entries	15,761	11,279	7,036	4,793	1,823	2,911	5,662
Motifs	907	899	880	874	632	719	797
Motif sites	45,273	27,573	14,585	10,603	3,402	5,843	12,768
3D motifs	475	196	139	149	77	1	17
3D motif sites	44,242	3,976	3,138	4,808	1,191	4	93
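The overall totals quoted in the Abstract and Conclusion can be checked against Table 4 by summing three of its rows over the 7 species. The short Python check below does only that arithmetic; the variable names are arbitrary, and the coverage figure assumes the 1,309 regular expression patterns stated in Section 2.2 as the denominator.

```python
# Rows of Table 4, in the order human, mouse, rat, cow, pig, frog, zebrafish.
TABLE4 = {
    "Hit entries":    [15761, 11279, 7036, 4793, 1823, 2911, 5662],
    "Motif sites":    [45273, 27573, 14585, 10603, 3402, 5843, 12768],
    "3D motif sites": [44242, 3976, 3138, 4808, 1191, 4, 93],
}

totals = {row: sum(values) for row, values in TABLE4.items()}

# Share of the PROSITE regular expression patterns found in human.
human_coverage = 907 / 1309
```

The sums are 49,265 hit entries, 120,047 motif sites, and 57,452 3D motif sites, and the human coverage is about 69%.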

3.2 Constructing the protein-gene motif dictionary system for other species

We then constructed the protein-gene motif dictionary system for the other model species: mouse, rat, cow, pig, frog, and zebrafish. The zinc protease sequence motif was extracted from 164 sites in mouse, 86 sites in rat, 67 sites in cow, 27 sites in pig, 40 sites in frog, and 71 sites in zebrafish. The 3D structural motif of zinc protease was related to 14 sites in mouse, 4 sites in rat, 9 sites in pig, and 1 site in zebrafish. No 3D structural motifs were related in cow and frog. No sequences with multiple conserved zinc protease motifs were observed in frog and zebrafish. The zinc protease sequence motif “VTAHELGHSL” in the rat MMP1 protein differed from that of the other species in the second residue, which was threonine instead of alanine. There was no stored MMP1 protein in the zebrafish database. Among the species other than human, only the pig database had 3D structural information related to the MMP1 protein. The cysteine switch and hemopexin motifs co-occurred with the zinc protease motif in the human MMP1 protein, but this co-occurrence was not observed in the frog MMP1 protein. Using the extracted protein-gene motifs for the other model species, it is possible to analyze the differences in motifs between species, even for the same motif or protein types.

Table 4 shows more details of the protein-gene motif dictionary system for the other model species. Among the species, the mouse database had the most stored entries after the human database, and the pig database had the fewest. Since the pig database had the fewest entries, it is reasonable that it also had the fewest stored motifs. However, the number of motif types in the pig database was not very different from that in the frog database, which had about twice as many entries. In addition, the 632 motif types in the pig database covered about 70% of the 907 motif types in the human database, which had about 10 times as many entries. The 3D structural motifs of the pig database also had more motif types and sites than those of the frog and zebrafish databases. Conversely, they covered only about 20% of those in the human database. This trend was also true for the mouse database. We consider this to be because much less protein structural data is available for the other model species than for human.

As a result, we constructed a protein-gene motif dictionary system for several species that contains the protein sequence motifs and their corresponding gene sequence motifs and 3D structural motifs. Because the information is organized by protein, one-stop search and analysis of the protein-gene motifs conserved in a protein, and of their co-occurrence information, is possible. Furthermore, comparative analysis of motif features between species is possible. This comparative analysis was not possible with the previous protein-gene motif dictionary system.

3.3 Implementing the management system for the protein-gene motif dictionary system

For one-stop motif analysis, we implemented a management system for the protein-gene motif dictionary system with a graphical user interface. The management system allows users to search and analyze the protein-gene motifs intuitively. The management system was developed in Visual C# with .NET Framework 4.5 on Windows 10.

To search for the protein-gene motif of a protein, the user specifies the species and inputs the protein name. Search results are shown as a list, from which the protein can be selected. After a protein is selected, its protein-gene motif information is displayed. The protein-gene motif information includes the PROSITE ID, sequence information, and the length and position of each motif. The 3D structure information includes the PDB ID and Chain ID. Additionally, the protein-gene motif dictionary management system has a visualization function for the 3D structural information using the Jmol software [20]. The management system also has a search function using the RefSeq ID and PROSITE ID. Using the PROSITE ID, proteins that conserve a specific motif can be searched. Additionally, we implemented an export function that outputs the protein-gene sequence motif information in CSV format and the 3D structural information in PDB format. The export can be performed for a selected protein or for a whole species database. Figure 4 shows the main window of the protein-gene motif dictionary management system and the results of a narrow search in the human database using the PROSITE ID of the zinc protease motif.

Figure 4.

A snapshot of the management system and results from searching for a protein-gene motif. ① Human species is selected. ② Then, a protein with a zinc protease motif (PROSITE ID: PS00142) is searched. ③ The search results are shown as a list, and interstitial collagenase isoform 1 preproprotein (RefSeq ID: NP_002412.1) is selected from the list. ④ Protein-gene motif and protein information, such as the sequence and the co-occurrence of each motif, are obtained. ⑤ By selecting the tab button, the display can be switched between the motif information of the protein sequence, the gene sequence, and the structural information. ⑥ In the “3D structure” tab, ID information of the motif can be obtained. ⑦ By clicking the “View Jmol” button, the 3D model can be viewed in the Jmol software.
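The narrow search by PROSITE ID and the CSV export described in this section can be imitated on a small in-memory data set. The sketch below is written in Python for illustration and is not the C# management system; the function names, the record layout, and the CSV column names are all hypothetical. The first record uses the values reported for human MMP1, while the second record (NP_EXAMPLE.1) and its sequences are invented toy data.

```python
import csv
import io

ENTRIES = [
    {"refseq_id": "NP_002412.1",
     "motifs": [{"prosite_id": "PS00142", "start": 214, "length": 10,
                 "protein_motif": "VAAHELGHSL",
                 "gene_motif": "gttgcagctcatgaactcggccattctctt"}]},
    {"refseq_id": "NP_EXAMPLE.1",          # hypothetical entry
     "motifs": [{"prosite_id": "PS00546", "start": 5, "length": 8,
                 "protein_motif": "PRCGAPDL",
                 "gene_motif": "cctcgttgtggtgctcctgatctt"}]},
]

def search_by_prosite_id(entries, prosite_id):
    """Return the entries that conserve the motif with this PROSITE ID."""
    return [e for e in entries
            if any(m["prosite_id"] == prosite_id for m in e["motifs"])]

def export_motifs_csv(entries):
    """Write one CSV row per motif site and return the CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["refseq_id", "prosite_id", "start", "length",
                     "protein_motif", "gene_motif"])
    for e in entries:
        for m in e["motifs"]:
            writer.writerow([e["refseq_id"], m["prosite_id"], m["start"],
                             m["length"], m["protein_motif"], m["gene_motif"]])
    return buffer.getvalue()

zinc_hits = search_by_prosite_id(ENTRIES, "PS00142")
csv_text = export_motifs_csv(zinc_hits)
```

Passing the search result to the export function mirrors the option of exporting only a selected subset rather than a whole species database.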

3.4 3D structural feature analysis of co-occurrence motifs

Several proteins have multiple biological functions. These proteins may conserve multiple motifs, which then have a co-occurrence relationship. As a typical co-occurrence motif pattern, the S100 protein conserves two distinct EF-hand type calcium binding motifs. The two motifs are known to differ in the strength of their calcium binding [21]. A protein structure is composed of multiple secondary structure elements such as alpha helices and beta sheets. The conformations of these elements in a protein structure are known to preserve their space distance even when the sequential order of the elements in the protein sequence is not preserved [22, 23]. Therefore, both sequence distance and space distance are considered important for the functions and structures of proteins. Analyzing the sequence distance and space distance between co-occurrence motifs is expected to provide new knowledge about the function and structure of proteins. To analyze the sequence and space distances between motifs, we implemented a sequence and space distance calculation module in the management system.

We calculated the sequence distance and space distance between motifs in a protein. Sequence distance is defined as the number of residues between the index following the end residue of the N-terminal motif and the start index of the C-terminal motif. Space distance is defined as the Euclidean distance between the centroids of the two 3D structural motifs. The centroid is defined as the mean coordinate of all alpha carbons in a 3D structural motif.
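These definitions translate directly into code. The following Python sketch is an illustration rather than the module implemented in the management system: the function names are hypothetical, alpha carbons are identified as `ATOM` records whose atom name is `CA`, and the coordinates are read from the fixed columns of the PDB file format. The sequence distance is assumed to count only the residues lying strictly between the two motifs.

```python
import math

def ca_centroid(pdb_atom_lines):
    """Mean coordinate of all alpha carbons in the given PDB ATOM records."""
    coords = []
    for line in pdb_atom_lines:
        # PDB fixed columns: atom name 13-16, x 31-38, y 39-46, z 47-54.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append((float(line[30:38]),
                           float(line[38:46]),
                           float(line[46:54])))
    if not coords:
        raise ValueError("no alpha carbon found")
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))

def space_distance(centroid_a, centroid_b):
    """Euclidean distance between two motif centroids (in Å for PDB data)."""
    return math.dist(centroid_a, centroid_b)

def sequence_distance(n_start, n_length, c_start):
    """Residues between the end of the N-terminal motif and the start of the
    C-terminal motif, using zero-based start indices."""
    return c_start - (n_start + n_length)

# Zero-based start 89 corresponds to the one-based residue range 90 to 97.
gap = sequence_distance(n_start=89, n_length=8, c_start=214)
```

For an N-terminal motif of 8 residues starting at zero-based index 89 and a C-terminal motif starting at zero-based index 214, the function returns a gap of 117 residues.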

We explored the motifs that co-occur with the zinc protease motif. This operation was executed using the narrow search function of the management system with the PROSITE ID. As a result, we found that the cysteine switch motif is conserved in certain types of zinc protease, such as the matrix metalloproteinases (MMPs). These motifs co-occurred in three proteins comprising four chains. Although the sequence distance ranged from 117 to 293 residues, the space distance was preserved at about 12 Å in all of them. Figure 5 shows the 3D structures of the MMP1 and MMP9 proteins and their co-occurrence motifs. Co-occurrence motif analysis could not be performed for the other MMP proteins because the 3D structural information of the motifs had not been determined.

Figure 5.

3D view of the MMP proteins (1SU3A and 1L6JA) with the co-occurrence motifs colored. The cysteine switch motif and the zinc protease motif are represented in blue and red, respectively. The green point at the center of each motif represents its centroid.

The MMPs form a family of proteolytic enzymes. This protein family degrades matrix components and plays an important role in physiological phenomena such as wound healing and angiogenesis [24, 25]. An MMP protein is secreted from cells as a zymogen. In the zymogen, the zinc ion of the active site is coordinated with the cysteine residue conserved in the pro-peptide region, and the enzyme is activated by degradation of the pro-peptide region [26]. This mechanism is named the cysteine switch, and the corresponding site is registered as the cysteine switch motif in PROSITE [27]. Furthermore, a region including two histidine residues that coordinate the zinc ion is registered as the zinc protease motif in PROSITE [28]. In MMP proteins, the zinc protease and cysteine switch motifs co-occur, and the 3D distance between them is constant and conserved regardless of the sequence distance, which is important in structural biology. For these reasons, it is reasonable that the cysteine switch motif and the zinc protease motif are conserved in the same protein and are kept at a certain Euclidean distance. Using the management system, users can complete a one-stop analysis of co-occurrence motifs.

There is no system that is directly comparable with our dictionary system. Therefore, the results have to be verified by manually referring to each entry in the RefSeq, UniProt, and PDB databases. We verified the results using the MMP1 protein (NP_002412.1) as an example. First, the extracted translated region and its related protein sequence were confirmed against the ‘CDS’ feature section in RefSeq. This feature describes the RefSeq ID of the related gene and the CDS region. In the case of NM_002421.3, the CDS region of the gene sequence is described as bases 144 to 1553 (including the stop codon). Second, the extracted motifs were verified against the ‘Region’ feature section in RefSeq. According to the ‘Region’ feature, the cysteine switch motif is conserved at residues 90 to 97, and the zinc protease motif is included in the Metalloprotease region. Third, the mapping of the protein sequence to 3D structures was verified by confirming that the UniProt sequence is mapped to the RefSeq sequence. The UniProt sequence (P03956) is mapped to 24 structures in the ‘Structure’ section in UniProt. Finally, the extracted 3D structural motifs were verified against the PDB. The PDB structure (1SU3) conserves the three motifs, and their 3D information was correctly extracted.

The sequence distance was also obtained manually as the number of residues between the co-occurrence motifs. The space distance is the Euclidean distance between the mean coordinates, which were calculated from the 3D information in the PDB. Through the above verification processes, we confirmed that the co-occurrence motif analysis was correct.

4 CONCLUSION

In this work, we developed a protein-gene motif dictionary system that stores protein sequence motifs and their corresponding gene sequence motifs and 3D structural motifs. This dictionary stores 956 sequence motifs (a total of 120,047 sites). Furthermore, it stores related protein structure information for 569 motifs (a total of 57,452 sites). We also implemented a management system with which users can perform one-stop search and analysis using the protein-gene motif dictionary system. This system has a distance analysis method for co-occurrence motifs. As a result, we found that the Euclidean distance between the cysteine switch motif and the zinc protease motif was preserved at about 12 Å, although the sequence distance was flexible. This is novel knowledge discovered by our system because the distance between co-occurrence motifs has not been shown quantitatively in any database. Unfortunately, we could not perform comparative co-occurrence motif analysis between species because of insufficient protein structure information for PROSITE sequence motifs. However, this problem can be solved by further growth of the PDB database.

In future work, we plan to study co-occurrence motif analysis that combines our dictionary system with the motif search system based on the sequence binary decision diagram (seqBDD) [29]. Combining the two systems will enable a comprehensive search for co-occurring motifs, and the search results can then be analyzed efficiently and registered in the dictionary system. A zip file containing our protein-gene motif dictionary system and the protein-gene motif data is available at the following URL. Within that archive, we also provide a zip file containing only the human data set.

https://sunflower.kuicr.kyoto-u.ac.jp/~ohtomo/ProteinGeneMotifDictionary

This research has been partially supported by the Kayamori Foundation of Informational Science Advancement. We would like to thank Editage (www.editage.com) for English language editing.

REFERENCES

[1] C. J. A. Sigrist, E. de Castro, L. Cerutti, B. A. Cuche, N. Hulo, A. Bridge, et al., Nucleic Acids Res., 41, D1, D344 (2013). doi:10.1093/nar/gks1067 PMID:23161676
[2] T. K. Attwood, M. E. Beck, A. J. Bleasby, D. J. Parry-Smith, Nucleic Acids Res., 22, 3590 (1994). PMID:7937065
[3] S. El-Gebali, J. Mistry, A. Bateman, S. R. Eddy, A. Luciani, S. C. Potter, et al., Nucleic Acids Res., 47, D1, D427 (2019). doi:10.1093/nar/gky995 PMID:30357350
[4] V. Matys, E. Fricke, R. Geffers, E. Gössling, M. Haubrock, R. Hehl, et al., Nucleic Acids Res., 31, 374 (2003). doi:10.1093/nar/gkg108 PMID:12520026
[5] A. Sandelin, W. Alkema, P. Engström, W. W. Wasserman, B. Lenhard, Nucleic Acids Res., 32, D91 (2004). doi:10.1093/nar/gkh012 PMID:14681366
[6] J. Takeda, Y. Suzuki, R. Sakate, Y. Sato, T. Gojobori, T. Imanishi, et al., Nucleic Acids Res., 38, suppl_1, D86 (2010). doi:10.1093/nar/gkp984 PMID:19969536
[7] K. Guruprasad, M. S. Prasad, G. R. Kumar, Bioinformatics, 16, 372 (2000). doi:10.1093/bioinformatics/16.4.372 PMID:10869035
[8] G. Pugalenthi, P. N. Suganthan, R. Sowdhamini, S. Chakrabarti, Nucleic Acids Res., 36, Database, D218 (2007). doi:10.1093/nar/gkm794 PMID:17933773
[9] GenomeNet, https://www.genome.jp/
[10] National Center for Biotechnology Information (NCBI), https://www.ncbi.nlm.nih.gov/
[11] Expert Protein Analysis System (ExPASy), https://www.expasy.org/
[12] J. Shoji, H. Kato, Joint Conference on Informatics in Biology, Medicine and Pharmacology, 35 (2013).
[13] T. Kobayashi, H. Kato, The 43th Symposium on Structural Activity Relationships 2015 & the 10th Japan-China Joint Symposium on Drug Discovery and Development, 43, 81 (2015).
[14] NCBI RefSeq, RefSeq FTP, ftp://ftp.ncbi.nlm.nih.gov/refseq/ (Reference to 2016/12/06).
[15] P. W. Rose, A. Prlić, A. Altunkaya, C. Bi, A. R. Bradley, C. H. Christie, et al., Nucleic Acids Res., 45, D1, D271 (2017). doi:10.1093/nar/gkw1000 PMID:27794042
[16] The UniProt Consortium, Nucleic Acids Res., 47, D1, D506 (2019). doi:10.1093/nar/gky1049 PMID:30395287
[17] MongoDB, Inc., MongoDB, ver. 3.0.3, https://www.mongodb.com/ (Reference to 2015/05/12).
[18] N. K. Gundla, Z. Chen, Procedia Comput. Sci., 91, 460 (2016). doi:10.1016/j.procs.2016.07.120
[19] S. Wang, I. Pandis, C. Wu, S. He, D. Johnson, I. Emam, et al., BMC Genomics, 15, Suppl 8, S3 (2014). doi:10.1186/1471-2164-15-S8-S3 PMID:25435347
[20] R. M. Hanson, J. Appl. Cryst., 43, 1250 (2010). doi:10.1107/S0021889810030256
[21] D. Kligman, D. C. Hilt, Trends Biochem. Sci., 13, 437 (1988). doi:10.1016/0968-0004(88)90218-6 PMID:3075365
[22] A. Abyzov, V. A. Ilyin, BMC Struct. Biol., 7, 78 (2007). doi:10.1186/1472-6807-7-78 PMID:18005453
[23] N. V. Grishin, J. Struct. Biol., 134, 167 (2001). doi:10.1006/jsbi.2001.4335 PMID:11551177
[24] W. G. Stetler-Stevenson, J. Clin. Invest., 103, 1237 (1999). doi:10.1172/JCI6870 PMID:10225966
[25] M. P. Caley, V. L. C. Martins, E. A. O’Toole, Adv. Wound Care, 4, 225 (2015). doi:10.1089/wound.2014.0581 PMID:25945285
[26] G. Bozzuto, P. Ruggieri, A. Molinari, Ann. Ist. Super. Sanita, 46, 66 (2010). doi:10.4415/ANN_10_01_09 PMID:20348621
[27] J. F. Woessner, Jr., FASEB J., 5, 2145 (1991). doi:10.1096/fasebj.5.8.1850705 PMID:1850705
[28] C. V. Jongeneel, J. Bouvier, A. Bairoch, FEBS Lett., 242, 211 (1989). doi:10.1016/0014-5793(89)80471-5 PMID:2914602
[29] K. Yamato, H. Kato, T. Katsuragi, Y. Takahashi, J. Comput. Chem. Jpn., 19, 1 (2020). doi:10.2477/jccj.2019-0028

