Genes & Genetic Systems
Online ISSN : 1880-5779
Print ISSN : 1341-7568
ISSN-L : 1341-7568
Full papers
RNA design using simulated SHAPE data
Mohadeseh LotfiFatemeh Zare-Mirakabad Soheila Montaseri
Author information
JOURNAL OPEN ACCESS FULL-TEXT HTML

2017 Volume 92 Issue 6 Pages 257-265

Details
ABSTRACT

It has long been established that in addition to being involved in protein translation, RNA plays essential roles in numerous other cellular processes, including gene regulation and DNA replication. Such roles are known to be dictated by higher-order structures of RNA molecules. It is therefore of prime importance to find an RNA sequence that can fold to acquire a particular function that is desirable for use in pharmaceuticals and basic research. The challenge of finding an RNA sequence for a given structure is known as the RNA design problem. Although there are several algorithms to solve this problem, they mainly consider hard constraints, such as minimum free energy, to evaluate the predicted sequences. Recently, SHAPE data has emerged as a new soft constraint for RNA secondary structure prediction. To take advantage of this new experimental constraint, we report here a new method for accurate design of RNA sequences based on their secondary structures using SHAPE data as pseudo-free energy. We then compare our algorithm with four others: INFO-RNA, ERD, MODENA and RNAifold 2.0. Our algorithm precisely predicts 26 out of 29 new sequences for the structures extracted from the Rfam dataset, while the other four algorithms predict no more than 22 out of 29. The proposed algorithm is comparable to the above algorithms on RNA-SSD datasets, where they can predict up to 33 appropriate sequences for RNA secondary structures out of 34.

INTRODUCTION

Non-coding RNAs participate in various biological processes essential for cell survival, such as gene regulation and DNA replication (Storz and Gottesman, 2006). The function of these RNAs is dictated by their tertiary structures. Therefore, understanding the mutual relation between RNA structure and sequence sheds light on the biological role of RNAs. In fact, generating artificial RNA sequences folded to a desired RNA structure, RNA design (RD), is of known importance in drug design (Lagoja and Herdewijn, 2007).

In the last two decades, many algorithms and computational methods have been proposed for solving the RD problem (Churkin et al., 2017). For each given RNA secondary structure, RNAinverse (Hofacker et al., 1994) in the Vienna package (Lorenz et al., 2011) generates a sequence based on local search. INFO-RNA (Busch and Backofen, 2006) includes two main steps for designing an RNA sequence. First, a dynamic programming algorithm is employed to generate initial sequences, and then a stochastic local search is applied on the sequences. In the second step, neighbor sequences are evaluated based on distance minimization score. MODENA (Taneda, 2011) applies a multi-objective genetic algorithm in which a set of weak Pareto optimal sequences is generated based on the combination of structural stability and similarity scores. Frnakenstein (Lyngsø et al., 2012) uses a genetic algorithm based on multiple structural constraints. GGI-Fold (Ganjtabesh et al., 2013) also uses a multi-objective genetic algorithm to design sub-sequences for sub-structures derived from the target structure. The Gibbs sampling method is then applied to improve each sub-sequence. Finally, the sub-sequences are combined to design a sequence corresponding to the target structure. ERD (Esmaili-Taheri et al., 2014) uses an existing database of sequences (namely, STRAND [http://www.rnasoft.ca/strand/]) to design RNAs. In this algorithm, the target structure is divided into sub-structures such that each of them contains one multi-loop. A sub-sequence is then extracted from the database for each sub-structure. Finally, an evolutionary algorithm is employed to improve the combined sub-sequences folded to the target structure. RNAifold (Garcia-Martin et al., 2013, 2015) applies constraint programming and large neighborhood search algorithms for solving the RD problem.

In addition to the aforementioned RD algorithms using hard constraints, soft constraint-based methods have emerged to enhance the design process. AntaRNA employs both IUPAC format sequence and GC-content constraints to find the best sequence for a given structure using ant colony optimization (Kleinkauf et al., 2015). IncaRNAfbinv (Retwitzer et al., 2016) merges incaRNAtion (Reinharz et al., 2013) with RNAfbinv (Weinbrand et al., 2013) to solve the RD problem based on biologically meaningful constraints. RNAexinv (Avihoo et al., 2011) predicts a sequence for a given structure from shape representation and physical attributes.

In all of these methods, an RNA secondary structure prediction (RSSP) algorithm is applied to predict the structures of the designed sequences. Therefore, the quality of the final results is highly dependent on the RSSP algorithm. These approaches mainly apply hard constraints, such as minimum free energy (MFE), to evaluate RSSP. However, the results of these methods show that MFE cannot correctly predict some RNA secondary structures. Recently, the accuracy of RSSP has been improved by adding experimental SHAPE (selective 2’-hydroxyl acylation analyzed by primer extension) data (Wilkinson et al., 2006) to the free energy function as a soft constraint. SHAPE data correlate reactivity with flexibility of RNA nucleotides to predict the secondary structure (Low and Weeks, 2010). Although some algorithms, such as RNAstructure (Reuter and Mathews, 2010), GTfold (Mathuriya et al., 2009) and RNAsc (Zarringhalam et al., 2012), use SHAPE data as pseudo-free energy to increase the accuracy of RSSP, these data have never been employed to solve the RD problem. The reason is that obtaining the data experimentally itself requires RNA sequence, while the sequence is unknown in the RD problem. Thus, using SHAPE data in the RD algorithms remains a challenge. In this regard, a SHAPE data simulator could help the RD algorithms in designing more accurate sequences. Sükösd and colleagues presented a statistical model to simulate these data using three probability distribution functions for unpaired, stacked and helix-end nucleotide pairs (Sükösd et al., 2013).

In this paper, we show how to use simulated SHAPE data for RNA design. In this respect, a harmony search algorithm is presented to accurately predict RNA sequences folded to the target structure based on MFE and simulated SHAPE data as pseudo-free energy. For simulating SHAPE data, we use an extended version of the Sükösd simulator to provide a better solution for the RD problem.

We perform our algorithm on 29 RNA structures extracted from the database Rfam, known as a standard benchmark (Taneda, 2011), and 34 RNA structures from the RNA-SSD dataset (Andronescu et al., 2004). We compare our algorithm with four well-known algorithms, INFO-RNA (Busch and Backofen, 2006), ERD (Esmaili-Taheri et al., 2014), MODENA (Taneda, 2011) and RNAifold 2.0 (Garcia-Martin et al., 2015), on these datasets. Our algorithm precisely predicts 26 new sequences for the extracted structures from the Rfam dataset while the other algorithms predict 22 out of 29. The proposed algorithm is comparable to those algorithms on the RNA-SSD dataset, where they can predict 33 appropriate sequences for RNA secondary structures out of 34.

MATERIALS AND METHODS

In this section, some definitions and notations are explained to facilitate understanding of the proposed method. An RNA sequence is defined as follows:   

R= r 1 r 2 r n ,| R |=n, r i { A,C,G,U },1in,
where A, C, G and U stand for adenine, cytosine, guanine and uracil, respectively. Each RNA secondary structure is formed by hydrogen bonds between complementary bases A = U, CG and GU called base pairs. An assortment of unpaired bases and a set of consecutive base pairs are called a loop and a helix, respectively. Therefore, a secondary structure of RNA R can be defined as follows (Hofacker et al., 1994):   
S= s 1 s 2 s n ,| S |=n, s m {( , ) , .},1mn,
where ′(′ and ′)′ represent a paired position and ‘.’ shows an unpaired position. A helix sub-structure (′h′) of S is displayed as a 4-tuple <′h′,i,j,k> where i < j and   
s m ={ (,      imi+k-1, ),      jmj+k-1.

A loop sub-structure (′l′) of S is represented as a 4-tuple <′l′,i,i,k> where sm = ′. ′ and imi + k − 1.

Free energy of a given RNA secondary structure is computed as follows (Low and Weeks, 2010):   

Δ G FE =ΣΔ G helices +ΣΔ G loops , (1)
where ΔGhelices and ΔGloops represent energy values of helices and loops of the secondary structure according to the laws of thermodynamics (Turner and Mathews, 2009). For each position i in a base pair, ΔGi is calculated as pseudo-free energy using the following equation (Low and Weeks, 2010):   
Δ G i =mln( SHAPE   reactivity( i ) +1 ) +b, (2)
where SHAPE reactivity(i) is the SHAPE data of nucleotide i, and parameters m and b are set to 2.6 kcal/mol and −0.8 kcal/mol, respectively. Pseudo-free energy of the structure is computed as follows (Low and Weeks, 2010):   
Δ G SHAPE =1×ΣΔ G start-end +2×ΣΔ G interior ,
where ΔGstartend and ΔGinterior are calculated based on ΔGi (Equation 2) for base pairs at the start, end (start-end) and interior of the helix. This pseudo-free energy is added to the free energy (Equation 1) according to the following formula for RNA secondary structure prediction (Low and Weeks, 2010):   
Δ G total =Δ G FE +Δ G SHAPE . (3)

In the RD problem, our aim is to minimize ΔGtotal (Equation 3) instead of ΔGFE (Equation 1) for designing an RNA sequence that folds to a target structure.

The proposed algorithm for the RD problem

Harmony search is known as an appropriate evolutionary algorithm to solve multi-objective optimization problems (Yang, 2009). Since RNA design can be defined as a multi-objective problem (distance and free energy functions) (Taneda, 2011), it would be convenient to employ a harmony search algorithm in order to solve it.

A harmony search algorithm called HRDSSD is proposed for designing an RNA sequence using SHAPE data (http://bioinformatics.aut.ac.ir/HRDSSD/). The algorithm takes as input a target secondary structure S in order to predict the best sequence that can fold to the input structure. The details of random solutions for harmony memory, fitness function, a new harmonic and termination condition of the algorithm are described below.

Harmony memory

The numbers of harmonics, random RNA sequences, are produced as a harmony memory. To generate the kth harmonic,   R k = r 1 k r 2 k r n k ,  for target structure S = s1s2sn, a probabilistic approach is used. For each position i of Rk, a nucleotide is chosen with probability of Pt{A,C,G,U} as follows:   

P A ( r i k =A| s i {( , . } ) = j=1 n χ . ( s j ) n ,
where   
χ a ( b ) ={ 1a=b, 0else.
And   
P t{ C,G,U } =p( r i k =t| s i {( , . } ) = 1- P A 3 .
The empty positions, r i k , are filled with the complements of their paired bases (A = U, CG and GU).

Fitness function

For each pair of target secondary structure and RNA sequence in harmony memory, SHAPE data are simulated using the probability density functions (see sub-section 2.2). A secondary structure is computed for each designed sequence, based on minimizing ΔGtotal (Equation 3) using GTfold (Mathuriya et al., 2009). The distance between predicted and target secondary structures is considered a fitness function.

Generating a new harmonic

To construct a new harmonic, the target structure S is decomposed to its sub-structures (helices, loops) and is added in a set Ω. Let a new harmonic be as follows:   

R new = r 1 new r 2 new r n new ,  r i new =N,    1in,
where ‘N’ displays an unknown nucleotide in position i. A harmonic such as R H = r 1 H r 2 H r n H is randomly selected from the harmony memory. The predicted structure of RH is decomposed to sub-structures; and is added to the set ΩH. Then, some 4-tuples, such as <p,i,j,k>∈ΩH∩Ω, p∈{′l′, ′h′}, are randomly selected to set each r m new to the nucleotide r m H where, imi + k − 1 and jmj + k − 1.

Random selection of a harmonic from the harmony memory is repeated to fill Rnew with ‘A’s, ‘C’s, ‘G’s and ‘U’s until the generated random value is smaller than the harmony memory, considering rate. Finally, the unknown positions of Rnew are randomly filled with ‘A’s, ‘C’s, ‘G’s or ‘U’s. The nucleotides corresponding to base pairs should be complementary.

Using this method, we generate 10 new harmonics. For each new harmonic, SHAPE data are simulated to compute ΔGtotal (Equation 3) function using the probability functions (see sub-section 2.2). Then, these harmonics are sorted by the value of ΔGtotal. One of the three first harmonics with the lowest fitness value (see sub-section 2.1.2) is replaced with the worst harmonic in the harmony memory, subject to the fitness value of this harmonic being better than the fitness value of the worst harmonic.

The algorithm terminates when the fitness value of the best harmonic is equal to 0, or the number of generations reaches |S|/4.

Simulating SHAPE data

As mentioned above, the HRDSSD algorithm generated some sequences for the target structure. For each sequence, the algorithm must compute the value of fitness function. A part of this function needs the SHAPE data for each nucleotide. Hence, we require a simulator to generate SHAPE data for each sequence and target structure. Although Sükösd et al. presented a statistical model to simulate these data for unpaired, stacked and helix-end pair bases (Sükösd et al., 2013), their model does not consider the type of nucleotides.

In this section, Sükösd’s simulator is extended to consider the type of nucleotides. The proposed stochastic method is constructed using the RNA sequences, native structures and experimental SHAPE data of 16S rRNA (1542 nt) and 23S rRNA (2904 nt) (K. M. Weeks, personal communication). We partitioned (into 12 clusters) the extracted SHAPE data from the following regions of these RNAs based on types of nucleotides:

(1) Unpaired region: represents a loop of RNA.

(2) Stacked region: shows consecutive base pairs except start and end of the helix.

(3) Helix-end region: indicates base pairs at the start and end of a helix.

For each cluster, a probability density function is generated by Easyfit 5.5 software (http://www.mathwaves.com). We employed this software because the size of each cluster (see Table 1) is not too large and it has been shown that the software is accurate for small sets (Disfani et al., 2012).

Table 1. Frequency of various nucleotides in unpaired, stacked and helix-end regions
ACGU
Unpaired769208421366
Stacked236534600387
Helix-end7819630290

Figures 1, 2, 3 display twelve probability density functions for estimating SHAPE data based on various nucleotides in different regions.

Fig. 1.

Probability density functions in unpaired regions. (A)–(D) show the probability density functions in the unpaired regions on nucleotides A, C, G and U, respectively.

Fig. 2.

Probability density functions in stack regions. (A)–(D) show the probability density functions in the stack regions on nucleotides A, C, G and U, respectively.

Fig. 3.

Probability density functions in helix-end regions. (A)–(D) show the probability density functions in the helix-end regions on nucleotides A, C, G and U, respectively.

RESUTS

In this section, we evaluate our algorithm, HRDSSD, on 29 RNAs extracted from the Rfam database, called dataset A (Taneda, 2011), and also datasets B and C obtained from RNA-SSD containing 24 and 10 RNAs, respectively (Andronescu et al., 2004). In this study, the HRDSSD algorithm is run on a machine with Intel(R) Core(TM) i5-2450M CPU 2.50 GHz and 4 GB RAM. We constrained the running time of the proposed algorithm to 10 minutes (Garcia-Martin et al., 2015) in each execution.

We compared three different versions of the proposed algorithm, our model (HRDSSD), the Sükösd model (HRDSSD-suko) and one without using SHAPE data (HRDWSD), to show that the simulated SHAPE data can improve the algorithm. HRDSSD is then compared to four well-known algorithms, ERD (Esmaili-Taheri et al., 2014), RNAifold 2.0 (Garcia-Martin et al., 2015), MODENA (Taneda, 2011) and INFO-RNA (Busch and Backofen, 2006).

Comparison of three different versions of the proposed algorithm

In this sub-section, we show the impact of the Sükösd model (three probability density functions (Sükösd et al., 2013)) and our model (twelve probability density functions) to design RNA sequences. Thus, we made a new version of HRDSSD called HRDSSD-suko by replacing the Sükösd model with our model to simulate SHAPE data. In addition, another version named HRDWSD is constructed by replacing ΔGtotal (Equation 3) with ΔGFE (Equation 1) to represent the effectiveness of simulated SHAPE data on the RD problem.

Table 2 shows the average time (Avg Time) and the number of RNA sequences designed accurately (NSDA) by the three versions of our algorithm on datasets A, B and C after 50 runs. The number of RNA sequences that are accurately designed by HRDSSD is higher than the number attained by the other algorithms. The results show that considering the type of nucleotides is important in SHAPE data simulation. The average time of HRSWSD is lower than the other methods because there is no extra computation for simulating SHAPE data.

Table 2. Comparison of HRDSSD, HRDSSD-suko and HRDWSD on datasets A, B and C (50 runs per structure, time limit set to 10 minutes)
dataset A
(29 structures)
dataset B
(24 structures)
dataset C
(10 structures)
Algorithm nameAvg TimeNSDAAvg TimeNSDAAvg TimeNSDA
HRDSSD413.23263868.722435.749
HRDSSD- suko338.24254812.9323185.29
HRDWSD71.70241085.75249.288

The details of our results for each RNA molecule in datasets A, B and C can be seen in columns 3–8 of Tables 5, 6, 7. The first and second columns of these tables show the name and length of the RNAs. The success count (SC) for each approach specifies the number of successful executions out of a total of 50 executions (10 minutes for each execution). The running time of each algorithm is the average time of successful runs in less than 10 minutes. If there is no successful run, the computational time is denoted by ‘–’.

Table 5. Comparison of HRDSSD and the other methods on dataset A (50 runs per structure, time limit set to 10 minutes)
Rfam AClengthHRDSSDHRDSSD-sukoHRDWSDERDMODENAINFO-RNARNAifold 2.0
SCtimeSCtimeSCtimeSCtimeSCTimeSCtimeSCtime
RF00001117501.945010.12502.18500.92280.79490.15278.52
RF00002151501.12504.14501.64492.98132.000360.03
RF00003161502.00508.202317.00415.94112.1800
RF00004193500.94502.66500.58502.51480.77192.73460.35
RF0000574500.28500.66500.30500.03350.37490.04470.08
RF0000689500.38500.96500.44500.17440.34400.06500.10
RF00007154500.86502.96500.82500.71480.58420.30440.40
RF0000854500.22500.60500.24500.01460.24500.00500.06
RF00009348509.505032.044911.734922.52393.13081.27
RF000103570000000
RF000113821282.00000000
RF00012215501.02503.50500.82501.52470.96323.24500.41
RF00013185500.96503.20500.70501.42381.13132.00500.42
RF0001487500.26500.80500.26500.04400.40500.02500.12
RF00015140500.54502.08500.58500.86440.57191.74470.33
RF000161290000000
RF00017301502.185010.00501.14502.13422.76480.80482.49
RF000183605010.125042.46507.040000
RF0001983500.36501.04500.30500.09460.33470.03500.10
RF00020119500.54501.8000000
RF00021118500.28500.74500.26500.13400.63500.09500.58
RF00022148500.50501.44500.48501.02440.6881.05420.21
RF000244510000000
RF00025210501.06503.50500.68503.36450.8710.93500.33
RF00026102500.24500.54500.22500.02450.3623.11500.06
RF0002779500.28500.64500.28500.03490.35500.04500.21
RF000283444687.3341174.544517.675026.420011.54
RF0002973500.34500.88500.40500.06390.2830.01500.08
RF00030340507.985028.74505.940442.340265.55
sum1247413.231241338.24116771.70108972.8887522.0554336.3288983.25
Table 6. Comparison of HRDSSD and the other methods on RNA-SSD dataset B (50 runs per structure, time limit set to 10 minutes)
id rnassd1lengthHRDSSDHRDSSD-sukoHRDWSDERDMODENAINFO-RNARNAifold2.0
SCtimeSCtimeSCTimeSCtimeSCtimeSCtimeSCtime
AB01582785750152.6649223.025013.32507.154520.643711.683940.66
AF029195105450525.4248354.405029.665011.5804732.67432.15
AF056938139950841.30381738.505030.085043.85043112.640
AF0968366475022.7649118.06504.48504.320433.403516.13
AF106618351505.185024.62501.54501.84413.56470.58494.58
AF107506338503.765024.60501.78502.38393.59442.0412.19
AF1414854745010.645078.30503.32503.42455.93334.610
AJ0111493775012.385041.38486.85501.430371.72222.90
AJ1307795075013.5849109.71503.98502.41476.19491.72199.87
AJ13257278150103.6238312.26509.98507.1204212.902218.87
AJ13362212971599.003462.0626176.235018.79049233.9245289.25
AJ2364557526314.172341.5011136.185018.2205255.650
D3877785950162.0820342.205017.085014.14042164.690
L11935265501.92507.04500.80500.78491.51490.81501.21
L7711714762298.504238.004238.005023.6305055.280
LIU92530290790.574191.001562.73501.020221.26434.08
S708383904650.574689.80487.38502.990462.964918.11
U63350419505.145026.06501.76501.54494.14483.49444.73
U81771492507.805057.76502.28501.990462.710
U846293004819.464653.48494.55500.97519.40402.40502.43
X617716604986.1836383.284992.475013.1901717.090
X81949120118340.503982.0047206.455017.76048128.073119.23
X9967614432544.500440.505031.17046220.370
Z83250261506.10503.70501.72500.68352.20500.74451.85
sum9293868.729024812.9310011085.751200232.3535567.179801273.41587438.22
Table 7. Comparison of HRDSSD and the other methods on RNA-SSD dataset C (50 runs per structure, time limit set to 10 minutes)
id SSD2lengthHRDSSDHRDSSD-sukoHRDWSDERDMODENAINFO-RNARNAifold2.0
SCtimeSCtimeSCtimeSCtimeSCTimeSCtimeSCtime
165500.2500.72500.28500.04390.31500.02500.10
279500.26500.7500.3500.03460.35500.01500.09
3122501.7505.44500.96500.2963.33500.06500.18
4166500.72502.26500.44500.34500.86500.10500.53
5180505.125040.58089.28000
6314502.26509.24501.26504.83491.921110.95111.51
7340504.125012.84501.56503.19452.641423.02242.29
8372505.15016.34501.665013.94463.02431.17252.19
93760000000
105835016.265097.08502.82504.39459.24438.00129.68
Sum45035.74450185.24009.2840836.3232621.6827273.3427216.58

Next, we show that the 3D structures of the RNA sequences designed using SHAPE data are more similar to the native structures. To do this, we select the 3D structures of six RNAs from PDB whose secondary structures are pseudoknot-free (see Table 3). The 3D structures of these RNAs are then converted to secondary structures by DSSR software (Lu et al., 2015).

Table 3. The RNA dataset extracted from PDB
PDB idLengthPDB idLength
1a51411kxk70
1cq5431jxr46
1kqs chain 91223JQ4 chain B118

Next, each structure is given as an input to HRDSSD, HRDSSD-suko and HRDWSD algorithms ten times. The output sequences are given to RNAcomposer (Popenda et al., 2012) to predict the 3D structures. Finally, these structures are aligned to the PDB structures by the Rclick server (Nguyen and Verma, 2015). Table 4 shows the average overlap between the predicted 3D structure of each designed RNA sequence and its real 3D structure.

Table 4. Average structure overlap between the predicted and real 3D structures
PDB idHRDSSDHRDSSD-sukoHRDWSD
1a5156.119.7653.90
1cq519.7723.2521.86
1kqs chain 918.3618.4418.28
1kxk19.2918.7114.86
1jxr21.3020.2220
3JQ4 chain B35.3432.9732.46
Average28.3622.2226.89

The average structure overlap represented by HRDSSD shows that this method may design RNA sequences more similar to those in nature.

Comparison of HRDSSD with four well-known algorithms

Columns 9 to 16 of Tables 5, 6, 7 present the results of the INFO-RNA, RNAifold 2.0, MODENA and ERD algorithms on datasets A, B and C. As shown, the proposed approach precisely predicts 26 new sequences for the extracted structures from dataset A, while the other four methods can design at most 22 accurate RNA sequences. HRDSSD is comparable to the other methods on datasets B and C where they can predict 33 accurate sequences for RNA secondary structures out of 34.

As shown in Tables 5, 6, 7, the running time of our algorithm (HSDSSD) is not as fast as the other algorithms, because most of these algorithms employ the Vienna RNA package (Lorenz et al., 2011) for predicting the secondary structure of designed sequences. Although GTfold is the fastest software that uses SHAPE data, the Vienna RNA package, which uses only the free energy function, is faster. In other words, applying SHAPE data in any RNA secondary structure prediction algorithm would consume more time.

CONCLUSION

In this paper, we developed a model to simulate SHAPE data taking into account different types of nucleotides in each region of an RNA. In this model, twelve probability density functions were fitted for various nucleotides in loops, stacked and helix-end pair regions. A harmony search algorithm called HDRSSD was then employed to design an RNA sequence for the given secondary structure by simulated SHAPE data. The proposed algorithm, HDRSSD, predicted more accurate RNA sequences than the other algorithms on the databases Rfam and RNA-SSD. The results showed that SHAPE data not only improve RNA secondary structure prediction, but also improve the RD algorithms.

ACKNOWLEDGMENTS

Special thanks to Ms. Parisa Hosseinzadeh and Bita Pourmohsenin for their help in editing the article.

REFERENCES
 
© 2017 by The Genetics Society of Japan
feedback
Top