Random matrix theory for an inter-fragment interaction energy matrix in fragment molecular orbital method

The statistical properties of the inter-fragment interaction energy matrix of the fragment molecular orbital method are analyzed using the random matrix theory. The eigenvalue and eigenvector distributions, the inverse participation ratio, and the unfolded eigenvalue statistics are compared with the corresponding random matrix ensemble. Cluster analysis of the fragments with strong correlations is presented using the inverse participation ratio of the random matrix theory.


Introduction
The random matrix theory (RMT) was introduced in physics in the 1950s to describe nuclear energy spectra. Because the first principal calculation of the spectrum of such a complex interacting many-body system is difficult, we attempted to consider an ensemble of the Hamiltonian, which is the mathematical representation of the hermitian matrix, with random components instead of a single Hamiltonian with definite components [1][2][3][4][5]. Many findings and applications have been established, both in Mathematics and Physics, which correspond to RMT. A prominent application is the nonperturbative description of strings and D-branes, which is known as the matrix model [6].
Recently, RMT has been widely applied to multivariate techniques. One example is in principal component analysis (PCA) [7][8][9][10], which is a multivariate statistical technique that decomposes a data set into several inter-correlated quantitatively dependent variables called principal components. In the past two decades, RMT has been applied to time-series analysis, particularly in the field of econophysics [11][12][13]. In particular, if the cross-correlation matrix of stocks is regarded as a random matrix, then the matrix can be described by a corresponding ensemble of RMT [1][2][3][4]. Several studies confirmed the assumption that correlation among stocks would be detected as a deviation from the random matrix [11][12][13][14][15]. One of the advantages of RMT is that it provides criteria to distinguish eigenvectors whose components contain correlated elements from those whose components do not contain correlated elements. This has shed some light on the problem, unsolved for a long time, of the number of eigenvectors deemed to be meaningful in PCA where only some of the eigenvectors (i.e., the one with the largest eigenvalues) have been typically analyzed as the principal component. The application of RMT has been adopted in other areas as well [16]. For example, in biophysics, RMT was used to analyze the NMR spectra of biomolecule [17,18], to classify RNA [10][11][12][13][14][15][16][17][18][19][20][21], to investigate the optical properties of proteins [22,23], to extract a specific network topology in proteome [24,25], to describe the displacement covariance matrix of the elastic network model [26], and to perform sequence analysis of protein residues [27,28]; moreover, RMT was used in molecular dynamics [29][30][31].
The fragment molecular orbital (FMO) method [33][34][35][36] is an ab initio computational method that can be used to determine the electronic states of a complex molecule, particularly a huge biomolecule, very efficiently. In this method, structurally complex molecules are divided into small fragments and ab initio or density functional quantum-mechanical calculations are performed for each fragment and fragment pair. During the calculation, a quantity of the pair interaction energies is obtained -also called as the inter-fragment interaction energies (IFIEs) -that is associated with the two fragments and interpreted as a two-body interaction between them. The IFIEs contain useful information on intra-and inter-molecular interactions, which we cannot be obtained by using any other method. Several methods have been proposed to analyze the pair interactions in detail. Configuration analysis for fragment interaction (CAFI) allowed extracting the stable components of the polarization and charge transfer for IFIEs [37]. In the pair interaction energy decomposition analysis (PIEDA), pair interactions were decomposed into the electrostatic, exchange-repulsion, charge transfer, and dispersion components [38]. A fragment interaction analysis based on local MP2 (FILM) was implemented to extract the electron correlation component of the interaction energy in each orbital [39]. A theoretical scheme to evaluate effective, screened interactions between fragments was proposed [40]. Recently a new attempt was made for the cluster analysis of IFIE matrix [41], where advanced clustering analyses based on a self-organizing map and multi-dimensional scaling were carried out, and a novel method for virtual ligand screening to explore drug candidates binding to target proteins was proposed. The interaction of protein complexes was extensively investigated ; in particular, numerous important findings were obtained by the FMO method on the influenza virus [50,55,58,68,69,73,75,76,79,85]. The matrix representation of IFIE values is called the IFIE matrix, and an IFIE map was introduced for the comprehensive analysis of the entire set of IFIEs [53], in order to extract the information regarding the interaction among fragments through the colored map. The visualized cluster analysis of the protein-ligand interaction (VISCANA) was also introduced for the virtual ligand screening based on the FMO [46]. The VISCANA classifies the structurally similar ligands in terms of the interaction patterns of a ligand with amino acids. Because the IFIE matrix is a type of a cross-correlation matrix, investigating it by the RMT is quite intriguing.
Herein, we study the inter-fragment interaction energy matrix of the FMO method by the random matrix theory. We mainly present the statistical properties of the eigenvalues and eigenvectors and the inverse participation ratio (IPR). The probability density of the eigenvalues is not agreement with Wigner's semicircle law [90]. There are four eigenvalues located outside the Wigner's semicircle distribution. The nearest neighbor and the next-to-nearest neighbor unfolded eigenvalue statistics that are consistent with those of the Gaussian orthogonal and symplectic ensembles, respectively. All IPRs are above the RMT values; this implies that the correlation sectors of fragments are rather localized. The distribution of the components of the eigenvector with the largest eigenvalue is a unique feature. It is a binary distribution; this means that the fragments belonging to the corresponding sector interact with the same weight. This may have profound implications for the interaction between the protein and DNA. Thus, we propose a cluster analysis based on IPR and RMT.

Random matrix
For an N×N real symmetric random matrix, we denote the eigenvalues k, where k=1, 2, …, N and set 1 > 2 > …> N, and also denote the i-th component of the k-th eigenvector associated with the k-th eigenvalue 1 by uki, where i=1, 2, …, N. In the limit N → ∞, the probability density, , of the eigenvalues is analytically given [90] by (1) where  is the variance of the matrix elements. Equation (1) is known as Wigner's semicircle law. The probability density is bounded by (2) Some of the universal quantities in the RMT are the nearest-and the next-to-nearest-neighbor spacings, Pnn(s) and Pnnn (s), respectively. (3) where a spacing s is defined by s=i+1-i and i is the unfolded eigenvalue [91,92]. In the limit N → ∞, the components of the eigenvectors uki are described by the Gaussian distribution with a mean of zero and unit variance [91,92] as follows: The IPR is a useful quantity in characterizing the distribution of the components [5,[93][94][95][96][97][98][99][100], which is defined for each eigenvector, uk, by (6) IPR measures the reciprocal of the number of eigenvector components that contribute significantly. A delocalized eigenvector with identical components uki=1/√N has Ik=1/N, and a completely localized eigenvector with a single nonzero component has uki=1 for particular i, and the remaining zeros have Ik=1. For the Gaussian orthogonal ensemble, the average is < Ik > =3/N. Therefore, a deviation from 3/N implies a correlation among the fragments. In particular, a deviation greater than 3/N implies a localized correlation and that less than 3/N implies a global correlation. In RMT, the eigenstate whose eigenvalue deviates from the RMT value (1), i.e., goes outside of the RMT boundary (2), is considered as a correlated sector. The correlation magnitude is evaluated from the deviation magnitude. The eigenvectors with especially small eigenvalues have large IPR values in most cases, implying a strong correlation between the fragments inside the corresponding sector. This is in contrast with PCA, where only some of their largest eigenvalues and the corresponding eigenvectors are typically analyzed.
To analyze the correlated clusters in more detail, we study IPRs further, using the method of Utsugi et. al. [15], where the z-index is defined in each fragment as follows: (7) where i represents the fragment number and  is the threshold for IPR when extracting the eigenstates which deviate from RMT prediction. The summation in Eq. (7) is over the components that satisfy the IPR condition Ik >= . The i-th fragment is regarded as significantly correlated if zi is larger than a certain threshold .
In FMO, the RMT applied as follows. If all fragments are correlated at random, the IFIE matrix is a random matrix. Because IFIE matrix is a real symmetric matrix, it becomes a real symmetric random matrix. The universality class of the real symmetric random matrix is the Gaussian orthogonal ensemble. Therefore, the eigenvalue distribution of the random IFIE matrix follows (1); the eigenvalues are bounded by (2); the nearest-and next-to-nearest-neighbor eigenvalue spacings, are (3) and (4), respectively, after unfolding the eigenvalue; the distribution of the eigenvalue components follows (5); and the IPR is < Ik > =3/N. Let us assume that there is a single cluster composed of correlated fragments. In that case, the components of the correlated fragment cluster are embedded in a random IFIE matrix. The eigenvalue distribution of the IFIE matrix slightly deviates from (1), where one of the eigenvalues is usually pushed outside of the upper limit of (2); the IPR of the corresponding eigenvalue deviates from 3/N, and the eigenvector components that correspond to the correlated fragment shift from (5), where it is usually the largest or the second largest eigenvalue. The nearest-and next-to-nearest-neighbor eigenvalue spacings, (3) and (4), are not subject to change. When there are several clusters of correlated fragments, each fragment interacts with the other in the same cluster. In that case, the components of the correlated fragment cluster are also embedded in a random IFIE matrix. Some of the eigenvalues are usually pushed outside of the upper limit of (2), and the eigenvalue distribution of the IFIE matrix slightly deviates from (1); sometimes, some of the eigenvalues are also pushed outside of the lower limit of (2) to satisfy the sum rule of the eigenvalues. The IPR of the corresponding eigenvalues deviates from 3/N. The eigenvector components that correspond to the correlated fragment shift from (5) indicate a different eigenvector for each correlated cluster. Therefore, among the eigenvalues obtained by diagonalizing the IFIE matrix, each eigenvalue that deviates from (2) is a cluster of fragments. In addition, examining the ingredients of the eigenvector, the components whose distribution deviates from (5) are fragments constituting the cluster. This is in principle the same as PCA; however RMT utilizes the quantitative criterions, (1), (2), and (5) and sometimes (3) and (4) as judgment conditions for the procedure. The RMT is a taxonomy of the random matrix, and a statistical property is analytically derived for every category of the random matrix [5,16]. For example the IFIE matrix examined here corresponds to the Gaussian orthogonal ensemble whose eigenvalue distribution follows Wigner's semicircle law (1), and in the case of the time series analysis [11-15, 26, 29-32], it is a kind of the symplectic matrix, so the eigenvalue distribution follows the Marcenko-Pastur law [101].

IFIE matrix and Dataset
We analyze the IFIE matrix [47,53,102] of the cyclic AMP receptor protein complex ((CRP)-cAMP-DNA), PDBID:1O3Q [103]. The IFIE matrix is available on URL [102]. Previous studies [47,53] elucidate the sequence-specific binding of the CRP complexed with a cAMP and DNA duplex and the stability using the FMO calculations. In these studies, the IFIEs were analyzed considering the interactions of CRP-cAMP with each base pair, the DNA duplex with each amino acid residue, and each base pair with each residue. In the calculation [47,53], they adopted the fragments in FMO as the amino acid residue, DNA bases, DNA backbone, and cAMP. The quantum-mechanical calculation is at MP2/6-31G level. The IFIE is a 245×245 matrix. The consistency with the IFIE matrix which is used here is checked with the figures for the same IFIE matrix in Refs. [47,53].

Results and statistical properties
The probability density of the components of the IFIE matrix, normalized to unit variance, i.e.,  =1, is shown in Fig. 1. The function is characterized by a sharp peak around the origin and two broad low peaks on either side.
We diagonalize the IFIE matrix numerically and obtain the eigenvalues k. The probability density of the eigenvalues is shown in Fig. 2 To test for universal properties, we calculate their distributions (Fig. 3); these are consistent with the predictions of the Gaussian orthogonal ensemble in RMT. Because of the small size of the IFIE matrix, the convergence is poor and we cannot make a definite conclusion. Fig. 4 represents the IPRs. All IPRs exceed the value predicted by the RMT, 3/245. This means that a cluster of correlated fragments tends to be localized. The IPRs of a large number of eigenvectors considerably deviate from the values of the RMT.
Figs. 5-10 show the components of the eigenvectors and their probability densities, for u1 to u8, u50 to u57, u100 to u107, u150 to u157, u200 to u206, u239 to u245, and the eight eigenvectors with the largest IPR values. On the horizontal axes of the eigenvector components shown in Figs u235, u243, u232, u230, u237, u221, u239, u234, u242, u223, u229, and u244 from left to right and top to bottom in descending order of IPR.
The absolute value of the components is shown as the diameter of a sphere. The strength of the correlation is also expressed in order of red, orange, yellow, and green. Table 1. Correlation sectors obtained from u1 to u4, and from 10 eigenvectors in descending order of IPR.

The 1st eigenvector
In general, the sector seen from eigenvector u1 is interpreted as the global trend of the whole system. In the present study, the distribution of the components is binary; this means that the fragments belonging to the corresponding sectors interact with the same weight. The fragments belonging to this sector are as follows: ; here, the letters B and C after the semicolon represent the chain numbers, and the fragments without a letter correspond to chain A and the former 53 fragments, i.e., up to GLY207, and the remaining 31 fragments, i.e., after HIS21, form two subsectors in the sense that their components have opposite signs in the eigenvector. These sectors consist of a DNA backbone (sugar and phosphate), CRP amino acids, and CMPs. This is extended over the entire complex, except for the bases of DNA. This mode is consistent with the interaction between a DNA backbone and CMP, as shown in the lower right panel of Fig.8 in Fukuzawa et. al. [47].

Sectors with high IPRs
In RMT an eigenvector with an IPR that deviates from the RMT value describes a localized strong correlation sector. On the contrary, in PCA only some of the largest eigenvalues and eigenvectors are typically analyzed. In this IFIE matrix, there are eigenvectors whose IPRs are considerably larger than the RMT value. In particular, the sector associated with u253 is composed of LYS22, TYR23, LYS44, and GLU93, and it is a subsector of u2. The sector associated with u243 is composed of LYS188, GLU191, and ASP192, and it is a subsector of u2 and u4. The sector associated with u232 is composed of ASP138, VAL139, THR140, GLY141, ARG142, ILE143, GLN145, THR146, and GLY177, and it is a subsector of u3. The sector associated with u230 is composed of ASP8, THR10, GLU12, ASP111, ILE112, MET114, ARG115, and GLN119, and it is an isolated sector. The sector associated with u237 is composed of GLU34, LYS35, GLU37, PHE76, GLU77, GLU78, and ARG103, and it is a subsector of u2 and u4. The sector associated with u221 is composed of CMP1, CMP2, ILE20, HIS21, TYR23, TYR41, ASP68, ILE70, SER83, and GLU96, and it is a subsector of u2. The sector associated with u239 is composed of GLU34, LYS35, GLU37, GLU78, GLU81, and ARG103, and it is a subsector of u2 and u4. The sector associated with u234 is composed of CMP1, CMP2, GLN32, GLU34, LEU39, ASP68, GLU72, GLU81, ARG82, SER83, and ARG122, and it is a subsector of u2. The sector associated with u242 is composed of CMP1, CMP2, ASP68, ILE70, GLU72, ARG82, GLN119, ARG122, and ARG123, and it is a subsector of u2. The sector associated with u223 is composed of ASN149, LYS152, ASP155, MET157, HIS159, ASP161, GLN164, LYS166, THR202, and VAL204, which is an isolated sector. These findings are summarized in Table 1.
Next, we examine the  dependence of the z index defined by Eq. (7). The results are shown in Fig. 13 for  = 0.25, 0.2, 0.15, 0.1, and 0.05. Because  is an IPR threshold, a large  implies a cluster with strong local correlation. The results show that there are strong local correlations between the DNA-binding helix-turn-helix motif and the CRP on the opposite side of the DNA.

Summary
We studied the statistical properties of the IFIE matrix eigensystem of the FMO method using RMT and reported results in detail. The probability density of the eigenvalues is not consistent with Wigner's semicircle law. Conversely, the nearest and next-to-nearest neighbors unfolded eigenvalue statistics that are consistent with those of the Gaussian orthogonal and symplectic ensembles, respectively. The eigenvector u1 is a unique feature. The distribution of the components is binary and the IPR agree with the RMT value, implying that only specific fragments perform interactions with a uniform strength and that those fragments are randomly scattered over the entire cyclic AMP receptor protein complex. Because all the correlation sectors obtained from u2, u3, and u4 are subsets of u1, it is understood that there exists a hierarchical structure, where specific fragments form subsectors with strong interactions between the fragments that belong to the u1 sector. The IPRs of a large number of eigenvectors considerably deviate from the values of RMT, implying that there are many local sectors of fragments with strong correlation. Since most of these local sectors form a subsector of u2, u3, or u4, we find that there is an additional hierarchical structure under the first hierarchy. It would be quite an interesting task to perform an in-depth analysis for other proteins, protein complexes, and complexes of proteins and nucleic acids including PIEDA.

Acknowledgements
I greatly appreciate the FMO tutorial course of the Chem-Bio Informatics (CBI) society annual meeting held in 2013, as the ideas underlying this report were inspired from this meeting. This study was performed using numerical data from the cpf file, trunc-DNA_CRP_mp2D_631G_FZC.cpf, from the Revolutionary Simulation Software for the 21st century (RSS21) project [102]. The corresponding software is the BioStationViewer Version 16.00 produced by T. Nakano, K. Kitaura, N. Sasaki, Y. Mochizuki, and S. Amari and these results were achieved by using Innovative Simulation Software for an Industrial Science Project. Figs. 11, 12, and 13 were obtained using Rasmol [104].