Inferring gene correlation networks from transcription factor binding sites

Gene expression is a highly regulated biological process that is fundamental to the existence of phenotypes of any living organism. The regulatory relations are usually modeled as a network; simply, every gene is modeled as a node and relations are shown as edges between two related genes. This paper presents a novel method for inferring correlation networks , networks constructed by connecting coexpressed genes, through predicting co-expression level from genes promoter’s sequences. According to the results, this method works well on biological data and its outcome is comparable to the methods that use microarray as input. The method is written in C++ language and is available upon request from the corresponding author.


INTRODUCTION
Gene expression is at the basis of molecular biology. Through this procedure, cells use the information stored in DNA, the genes, to perform and control the vast number of functions required for them to survive in various conditions (Davidson et al., 2003;Lodish et al., 2007). Gene expression consists of a series of regulated processes, to be precise, transcription, RNA modifications (eukaryotic genes), and translation. By regulating these processes and thereby controlling the amount of produced proteins, cells can respond to any stimulus (Brazhnik et al., 2002;Latchman, 1998;B. Lewin, 2004).
Recently, there have been many efforts for discovering how genes regulate expression of themselves (Hecker et al., 2009;Huynh-Thu et al., 2010;Kabir et al., 2010;Meyer et al., 2007;Yip et al., 2010). Transcription, the first step in the gene expression which converts DNA to RNA via a complex machinery is an extremely regulated process (Jacob and Monod, 1961;Lodish et al., 2007). This step is mainly dependent upon the successful binding of Transcription Factors (TFs) proteins to explicit positions in genes upstream or promoters, also known as TF Binding Sites (TFBSs) (Chua et al., 2004;B. Lewin, 2004;Mahdevar et al., 2012). In general, 4% of all genes produce TFs and usually multiple TFs act together to regulate the expression of a given gene (Bar-Joseph et al., 2003;Schlitt and Brazma, 2007;Yu et al., 2003). Unfortunately, finding these TFs and their corresponding TFBSs is costly and laborious; therefore, computational discovery of TFs and TFBSs has attracted considerable attention (Bailey and Elkan, 1994;GuhaThakurta, 2006;Hertz and Stormo, 1999;Mahdevar et al., 2012;Thijs et al., 2002;Tompa et al., 2005).
Information flow from DNA to proteins begins with transcription process; besides, it is the origin of response to numerous stimuli (Latchman, 1998;Lodish et al., 2007). As a result, regulation of transcription machinery is more crucial to cells than other aforementioned regulatory steps (Davidson, 2001).
TFs that involve in transcription process are also proteins, i.e., products of some genes can influence the expression of other genes. This means that genes are in a network and interact with each other in order to react to stimuli or to perform vital processes of cells. Of course, there are other biological networks in cells, e.g., protein -protein -interaction networks or metabolic pathways. These networks are usually depicted as nodes connected by edges. Nodes represent genes, proteins, or metabolites that are cell response to stimuli. Edges often represent relations, such as the binding of a TF to its target TFBS or direct molecular interactions.
Understanding how genes regulation occurs and identifying the interactions between genes in a living system, namely, network inference, are important topics in today's biology. In this paper we have focused on the later topic.
In general, there are two strategies for modeling gene networks: physical modeling and influence modeling (Gardner and Faith, 2005). Following two paragraphs describe these strategies in brief.
Physical approach tries to find true molecular interaction between TFs and TFBs that control transcript synthesis. Specifically, in physical approach the goal is to identify the TFs and the corresponding TFBSs that regulate transcription process. Restricting regulators to TFs made this approach easy to apply but also weak to infer the network and to describe regulatory mechanism.
The goal in the influence approach is to discoverindirect -regulatory influence of each transcript on the synthesis of all transcripts; that is, these models do not utilize or consider proteins, TFBSs, or TFs data explicitly, but, implicitly.
After choosing the network model, network parameters, specifically, scores of edges between all pairs of genes or nodes in the network must be inferred through available data. As aforesaid, this process is known as network inference or reverse engineering.
High-throughput technologies such as Microarray have led to an exceptional growth in data related to sequence features and gene expression. Essentially, four kinds of data are available to employ in gene networks inference process, namely, 1 -amount of existing transcripts, 2proteins, or 3 -metabolites and 4 -genome data.
Measuring the concentrations of RNA transcripts, produced proteins, or metabolites give an insight into the gene expression and are the fundamental resources for several studies. Comprehensive overviews of these methods studies can be found in (Barabasi and Oltvai, 2004;DeJong, 2002;Filkov, 2005;Hecker et al., 2009;Sîrbu et al., 2010).
Genomics data, e.g., genes and TFBSs, are supportive to the networks inference: relations between TFBSs and gene expression can be revealed by this kind of data. Methods that use genome sequence data fall into two categories: those that rely on prior knowledge of TFBSs, and those that discover TFBSs by computational effort and do not take prior information about TFs and TFBSs into consideration. The research work done by Beer and Tavazoie (2004) is a good example for the first category; they employed both known instances of TFBSs and novel TFBSs extracted from promoter sequences to predict the expression level of Saccharomyces cerevisiae (yeast) genes by training a Bayesian Network on microarray data of environmental stress and cell cycle (Beer and Tavazoie, 2004). Accuracy of their method on yeast genome was around 0.70. In a more recent work that belongs to the latter category, Pavesi and Valentini (2009) applied a machine learning approach to classify sets of co-expressed genes using information extracted from their regulatory regions.
In this paper, we propose a new method that infers genes correlation network from their promoter sequences by using a neural network without exploiting any a priori knowledge about input promoters, their potential TFBSs, or need to clustered input. A collection of genes promoter sequences is the only input of the proposed method and the result is correlation network of genes corresponding to these promoters. In brief, the method begins by extraction of potential TFBSs from all input sequences through computation; then, it tries to find expression similarity between all pairs of inputs with respect to the discovered TFBSs by a specific type of neural networks; at last, it uses calculated similarities to find groups of inputs and to build correlation network of given genes. Performance and limitations of the proposed method on inferring correlation networks form both simulated and biological data are presented in the result section.

METHOD
In this section, we explain the proposed method in detail. Simply, the input of the method is a collection of promoter sequences, P = {p 1 , p 2 ,…, p n }, with n sequences of arbitrary lengths; and the output is correlation network of input, inferred from properties of computationally discovered TFBSs that exist among input promoter sequences. The predicted network is shown by graph G = (V, E) with n nodes.
In order to assign weight to edge set E from discovered collection of TFBSs, a neural network and a neighbor joining method are utilized, as follows: TFBSs occurring in P are used as inputs to the neural network, its outcome will be the amount of correlation between expression patterns of two P members. At last, an application of neighbor joining method to these correlations yields elements of E.
The general steps of our method, including Extracting Promoters Information, Finding Expression Correlation, and Building Correlation Network are described in the following subsections.

Extracting promoters information
Aim of this step is to extract information of all input sequences. TFBSs, short subsequences in promoter sequences with high frequency in comparison to random sequences (Das and Dai, 2007;GuhaThakurta, 2006), are very important to transcription machinery (Jacob and Monod, 1961). Because of their higher frequency, it is possible to discover known and novel TFBSs by searching for overrepresented subsequences. Certainly, many computationally TFBSs discovery methods are available; for an overview of these methods see (MacIsaac and Fraenkel, 2006;Tompa et al., 2005). Among them, AmotiF tool (Mahdevar et al., 2012) has good sensitivity and specificity, runs fast, and is planned to find overrepresented pattern with respect to the biological facts about TFBSs (See Mahdevar et al. (2012) for more details about this method). Therefore, we have chosen AmotiF tool to find TFBSs. Experiments with other TFBSs discovery methods confirm the validity of this choice: AmotiF achieved the best accuracy in the discovery of random TFBSs planted in a set of sequences. Precisely, the accuracy of AlignACE (Roth et al., 1998), AmotiF (Mahdevar et al., 2012), BioProspector (Liu et al., 2001), and MEME (Bailey and Elkan, 1995) in finding 10 TFBSs planted in 100 random sequences of 500 nucleotides was 39%, 62%, 56%, and 53%. Where, TFBSs range from 6 to 20 base pairs in length and have 5 to 10 instances with 60% to 90% identity. Furthermore, AmotiF gained the highest score in the famous assessment provided by Tompa et al. (2005), as stated by Mahdevar et al. (2012). By investigating TFBSs of S. cerevisiae Promoter Database (SCPD) (Zhu and Zhang, 1999) and TFBSs of Escherichia coli Database (RegulonDB) (Salgado et al., 2004), we have found that TFBSs positions are far from being random, to be exact, TFBSs positions have little standard deviation: 28% of maximum value, which is consistent with the results of Beer and Tavazoie (2004). Moreover, we have found that in these two species 70% of all TFBSs are placed in the first 250 nucleotides of the promoter sequence. We have employed these properties to adjust scores of TFBSs returned by AmotiF; simply, by reducing score of TFBSs that violates them.
Existing TFBSs discovery methods may split a TFBS into two or more TFBSs. For instance, they may report first few nucleotides of a TFBS as a single TFBS and the rest of it as another one. Thus, the proposed method merges TFBSs whose occurrences are very similar in all input sequences into one TFBS. Furthermore, the proposed method removes TFBSs that exist in almost all inputs, since these types of TFBSs have no discriminative information.
At last, score of every TFBS in each input promoter will where m is the number of remaining TFBSs and t i,j is the score of jth TFBS in the ith sequence, normalized by dividing it to the square root of number of nonzero t k,j , where, 1 ≤ k ≤ n. Explicitly, , for all 1 ≤ i ≤ n and 1 ≤ j ≤ m. This normalization reduces scores of weak TFBSs interspersed throughout the input promoters.

Finding expression correlation
Here, the goal is to achieve expression correlation between all two members p i and p i of P from matrix T and store them in We have developed a Self -Organizing Map (SOM), a special type of neural networks, for this purpose.
We choose SOM because it is an unsupervised clustering method with no needs to specify the exact number of clusters or even estimating this number by fine-tuning a parameter. SOM is relatively stable, e.g., its outcome is not dependent to the initial weights, in contrast to methods such as K-means clustering (Martin et al., 2007). SOMs are not prone to over-fitting or overlearning (Meissner et al., 2009) and their training procedure is just a simple iteration (Kohonen, 1989). Furthermore, replacing SOM by other clustering methods (e.g. K-means or hierarchical clustering) has resulted in worse performance.
Fundamentally, neural networks are particular type of machine learning algorithms which consist of a set of neurons and applicable for pattern classification, identifying correlations in the data, and many more problems. There exist two kinds of learning for a neural network: supervised and unsupervised.
In supervised learning, training patterns with known classification repeatedly are presented to the network and network parameters or weights are adjusted such that the number of misclassified inputs decreases. These two procedures, i.e. presenting patterns to the network and adjusting the weights of the network, are iterated until some convergence criterion is satisfied.
Unsupervised learning is appropriate when there is no prior information about classification of inputs. In unsupervised learning, the task is to classify or to cluster inputs such that close members, or members who belong to single class or cluster, share common properties or show higher similarity according to a predefined measure.
SOMs are unsupervised neural networks suitable to both classifying and clustering data according to their similarities, consisted of two layers of neurons (Kohonen, 1989): input layer and output layer, as illustrated in Fig. 1.
Input of the developed SOM is information collected from every input promoter sequence, stored as rows of T; therefore, the network has m neurons in its input layer.
Each input neuron connects to every output neurons via variable connection weights. Weights are random at start and adjusted during training process, such that close nodes become more sensitive to similar rows. In order to get to stable weights, training or presenting whole input to the network and adjusting its weights, should be iterated multiple times. By examination, we have found that the number of training (λ) should at least be proportional to n and 50 × n is an appropriate value. Networks weights would eventually become fixed, because the training procedure has a decreasing variable called the learning rate; in almost all cases that we have tested, 50 × n is enough to reach this point and further iterations are unnecessary.
Output neurons are in connection with their adjacent or neighboring neurons and compete to have minimum distance to an input pattern. There are ways to define neighborhood relationship and topology of the output layer, e.g., 2D, 3D, honeycomb, or grid neighboring. Output layer of SOM in this paper has grid structure as depicted in Fig. 1. In order to capture all distinct patterns exist in inputs, number of the output neurons that are not very close to each other should be more than the number of possible input patterns. Again, by examination we have found that 121 × min(n, 2 m ), which depends on both m and n, is appropriate for the number of neurons in output layer, σ. Since we have n sequences, and each of which is composed of m two valued variables, number of distinct patterns is less than or equal to the minimum of n and 2 m . Thus, to have distance of 10 neurons between winning neuron of each pattern on the output layer, number of output neurons should be (10 + 1) 2 × min(n, 2 m ) or more. (Calculating network specifications from the number of input sequences and number of discovered TFBSs has made our method more practical on inputs with various sizes.) After training procedure, Pearson correlation coefficient is calculated between activities of output neurons when ith or jth rows of input are presented to input neurons and results reported as correlation between ith and jth input promoters. Algorithm 1 shows procedure of finding expression correlation through the scores of TFBSs found in the input promoters with more computational details.
Algorithm 1: Training SOM Initialization: let the weight between input neuron 1 ≤ i ≤ m and output neuron 1 ≤ j ≤ σ , w i,j , be a random value in range [-1, +1]; Training: for p ← 1 to λ for r ← 1 to n and function δ(.), neighborhood function, is defined as (4) where functions row(.) and col(.) respectively return row and column of input index in the output layer. μ is time decreasing variable that affects neighboring magnitude .
( 5 ) end for end for Compute Correlations: Let c i,j be the Pearson correlation coefficient of the output layer activity when information of p i or p j are presented to the input layer; specifically, let (6) for all 1 ≤ i, j ≤ n.
Building correlation network In this step, the goal is to construct graph G = (V, E) from matrix C. Nevertheless, in the biological networks genes are in groups with many interactions (Barabasi and Oltvai, 2004), these groups connect to other groups via special member genes with large number of connections. We are interested to find these groups. Graph G has n nodes correspond to n genes, that is, V = {v 1 , v 2 , …, v n }. The edge between gene i and j, e i,j , has initial weight of c i,j . In the beginning, genes are placed randomly on a n by n 2d plain; then, each one slightly moves toward genes with higher mutual correlation by a small amount. To be exact, position of gene 1 ≤ i ≤ n at time t + 1, , is computed from by the following formula: (7) where function α(t) is time -decreasing temperature function, calculated according to the rule , and , 1 ≤ i ≤ n, are randomly scattered on a n by n 2d plain.
This procedure repeats until genes positions become fixed or temperature reaches specific value of α(t) < 0.001. Now, genes with similar expressions are closer and it is easy to identify gene groups. We have used a method called neighbor joining to find these groups as follows.
Initially, each group consists of only one unique gene; then, two groups with minimum distance, i.e., neighbors, join to form a new group. Distance of newly formed group to others is average of distances of two joined neighbors to them. Groups join if the distance between them is smaller than a certain threshold value. Experiments revealed that this threshold value should increase to some extent by growth of n, and ln(n) works well. (The threshold value for joining groups should depend to some extent on the number of genes (n), because in the beginning genes are randomly scattered on a n by n plain and initial distances between them depends on n. We have examined several functions of n, among them ln(n) is good for this purpose.) After finding groups, edges weights will be adjusted: weight of edges whose both vertices are in the same group increases by 25%; and, finally, edges with low weights, less than 0.25, will be removed.

RESULTS
This section presents the results of running our method on both simulated and biological datasets. Nonetheless, we need some measures to evaluate them; thus, we discuss employed measures at the outset.

Measures used to evaluate the results
In order to judge the accuracy of proposed method on inferring correlation networks, a number of well -known standard statistical measurements, shown in Table 1, are employed. Measures introduced in Table 1 rely on four statistics: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). TP is the number of edges in real network that predicted to be edges by the method; TN is the number of edges that does not exist in both real and constructed networks; FN is the number of edges ignored by the method; and, FP is the number of edges that the method draws incorrectly.
Simply, sensitivity (SN) and specificity (SP) quantify prediction accuracy, respectively, via true positive and true negative values. Correlation coefficient (CC) calculates same value by considering all four statistics. By dividing the number of true prediction by all predictions, accuracy (ACC), probability of obtaining true predictions, can be calculated.

Simulated datasets
We have used rMotifGen tool (Rouchka and Hardin, 2007) to generate simulated promoters datasets with respect to properties of biological promoters. Each dataset has 5 to 50 random promoters, each of which is a sequence of length 200 to 600 base pairs. Promoters are in separate groups and members of each group have common TFBSs. A TFBS is 7 to 19 base pairs long, has zero to at most one-third of its length random mutations, and standard deviation of its position in promoters is 28% compared to random positioning. Figure 2 presents the average of SN, SP, CC, and ACC over 100 datasets: all four measures are very high when sequences have at least one TFBS. Moreover, Fig. 2 reveals that accuracy of predictions can be improved by increasing the number of common TFBSs; also it shows that if sequences have no TFBSs at all, then these four measures will dramatically reduce to about 0.20. This means that sequences are clustered according to their TFBSs and networks inferred by our method are more similar to real networks in comparison to those generated v t i + ( ) at random. Figure 3 shows the network generated for a dataset with three sets of sequences; members of each set share a single common TFBS with one mutation, so, the result must be a tripartite graph. There are only five wrong edges in displayed graph and SN, SP, CC, and ACC respectively are 1.00, 0.94, 0.90, and 0.95.

Comparison with gene ontology resource The Gene
Ontology™ (GO) project provides another resource to look for functional, locational, or procedural similarities within a list of genes. In brief, GO describes genes characteristics by an organized vocabulary of three categories or ontologies: molecular function, cellular component, and biological process. Ontologies are hierarchies of defined terms and each gene might be labeled with one or more terms of ontologies.
Several studies have provided us with evidences that genes that are biologically related would maintain this relation both in their expression levels as well as in their GO data (Sevilla et al., 2005). For instance, Sevilla et al. (2005) have concluded that correlation coefficient between gene expression and GO data is about 0.70 when Resnik measure of ontology similarity were exploited to estimate genes relationship (Sevilla et al., 2005). Consequently, it is reasonable to compare networks generated by GO data with networks generated by promoter data which our method provides. To perform this comparison, we built 100 datasets, each with 20 to 40 promoter sequences that were selected randomly from yeast genome (promoter sequences were downloaded from Saccharomyces Genome Database Project (Cherry et al., 1998)). We run our method on these datasets and compare resulted network with networks generated by GO data. SN, SP and ACC of this comparison are 0.60, 0.64 and 0.61, respectively.
In another experiment, we build several datasets from yeast genes that contribute in disparate biological processes, e.g. translation process and cytokinesis process. Since genes of a dataset belong to very distinct processes, their expression should be very dissimilar (Wang et al., 2004); and, networks of these datasets should tend to be multi -partite. Running our method on these datasets resulted in networks that are very similar to multi -partite graphs, as expected; SN, SP, and ACC all are about 0.60.
Biological dataset Our biological dataset contains 500 base pairs of upstream sequences of S. cerevisiae (yeast) genes downloaded from YEASTRACT (Teixeira et al., 2006). Among all yeast promoters, we have chosen 254 genes which known to be regulated by 29 TFs.
Here our goal is to calculate the accuracy of the proposed method in predicting expression correlation from downloaded upstream sequences; thus, we have compared the output of our method for each pair of genes in the dataset with correlation reported by microarray technology (Lee et al., 2007). Mean square deviation of this comparison was 0.27. Comparison derived a SN of 0.51, SP of 0.56, and CC of 0.41.
It is common to evaluate the accuracy of methods that use microarray data to infer regulatory interactions by their ability to reproduce the experimental data or by their ability to identify verified interactions. For example, Sîrbu et al. (2010) have compared evolutionary algorithms presented for regulatory network inference; they concluded that five studied methods, namely DE+AIC (Noman and Iba, 2006), GA+ANN (Keedwell and Narayanan, 2005), GLSDC (Kimura et al., 2003), PEACE1 (Kikuchi et al., 2003), and GA+ES (Spieth et al., 2004) have overall accuracy of 0.44 in predicting identified interactions, which is comparable with accuracy of our method: correlation coefficient value of our method is 0.41. Also, they have found that average of mean square deviation of five studied methods over five sample genes is about 0.21; as noted above, mean square deviation of our method is 0.27, which is slightly higher than this value. These results suggest that capability of our method for approximating the microarray data of yeast is   3. A sample network generate by our method. Each five genes, starting from gene number 1, have their unique TFBS; therefore the perfect result is a tripartite graph. Nodes with higher correlation are closer and edge thickness indicates correlation magnitude between its two heads. Inferring gene correlation networks from TFBSs as good as other five methods. Figure 4 shows two networks that are generated from microarray data (left) and from the data that are provided by our method (right) for eight yeast genes: HSC82, HSP26, SIS1, SSA1, and SSA4 that are regulated by the heat shock factor protein (HSF), and DAL2, DAL4, and DAL7, which have a TFBS known as the upstream induction sequence (UIS). Furthermore, according to the GO™ database the first five genes are involved in the protein folding process and the rest contribute to the heterocycle metabolic process (Ashburner et al., 2000); further exploration using the Amigo ® browser ("AmiGO: official online tool-set for searching GO at http:// amigo.geneontology.org,") has revealed that these two processes are separated just in the second level from the root of the GO biological processes hierarchy. Thus, expression pattern of genes that have protein folding process label should be different from those that share heterocycle metabolic process label (Wang et al., 2004). Genes DAL2, DAL4, and DAL7 have strong expression correlation with each other and correctly are positioned far from other genes in both networks, whereas according to microarray data there is no similarity between expression patterns of DAL4 and HSP26 (Lee et al., 2007) and our method incorrectly identified slightly strong correlation between them (Fig. 4, right). Also predicted similarities between DAL2 and SIS1 or SSA4 are incorrect. Sensitivity, specificity, correlation coefficient, and accuracy of this example are 0.89, 0.94, 0.53, and 0.75, respectively.

CONCLUSION
This paper presented a new method for inferring gene correlation networks. The input of this method is a set of genes upstream sequences or promoters, each with arbitrary length, and the output is correlation network of given genes.
Briefly, the method begins by extracting TFBSs of input promoters, altering their scores according to the standard deviation of positions where they occurred; and, merging similar ones into single one. Finding expression correlation between each pair of genes from these scores by a self -organizing neural network is the next step. In the final step, it generates the network graph by moving genes with similar expression patterns towards each other and joining them into genes groups.
This method has been tested on simulated and biological data. Results show that the accuracy and efficiency of this method in inferring correlation networks, merely, from promoters is almost equivalent to the accuracy of methods that use microarray data as input, assessments of some of which have been presented by Sîrbu et al. (2010) and Allen et al. (2012); and to the accuracy of methods that start with clustered promoters, as presented by Beer and Tavazoie (2004).
Accuracy of the proposed method is a function of the quality of its input, which is the result of a TFBSs discovery method; therefore, improving those methods will improve the proposed method as well.
Collecting the results of two or more TFBSs discovery methods on the given upstream sequences and presenting those TFBSs that are reported by many of them to the network, seems to be another promising way to improve accuracy. Also studies have found associations between histone acetylation patterns and gene activity (Kurdistani et al., 2004), which in turn could prove helpful to our method.
When biological data are used instead of simulated data, the accuracy of proposed method drops by 0.30 in terms of CC; the reason for this significant drop is that there are more determining factors in regulation of genes Fig. 4. Networks generated by two data sources: left by microarray data, right by data that our method provides. Expression patterns of genes DAL2, DAL4, and DAL7 are very different from others, so they are positioned at distant relative to others in both networks. Genes with high expression correlation are close to each other and weak correlation coefficients (< 50%) are depicted by dotted lines. than presence or absence of TFBSs; factors like those stated in the above paragraph or epigenetic regulation (van Driel et al., 2003).
Combining data provided by this method with other existing data, e.g. GO data and microarray data and testing above mentioned possibilities is our planning for future works.