Edited by Hiroshi Iwasaki. Yoshihide Hayashizaki: Corresponding author. E-mail: yosihide@gsc.riken.jp
What is written in the human genome sequence? If we think about it carefully, we realize that, even when we have decoded the full base sequence of the genome, we will still be a long way from grasping the mechanism that guides the phenomenon of life and allows this genetic code to produce an individual from a fertilized egg, thereby creating the next generation. Although it is true that the story of life unfolds within the dynamic created by interaction with the environment, the initial settings are all contained in the genome sequence. The phenomenon of life has a layered structure: RNA is made from genomic DNA, and proteins are in turn made from RNA. The genome sequence represents the most basic of the settings in the genetic information inherited from the parent, and its function is expressed toward the next layer up through the chain reaction set off by the code. To understand the phenomenon of life, we therefore also need systematic analysis of each layer and of the linkages between layers. As mentioned above, the RIKEN Omics Science Center (OSC), of which I am the director, has been engaged in a project running parallel to the human genome project, with the aim of achieving a complete analysis (transcriptome analysis) of the RNA that plays a complementary role to the genome and of developing an analysis of the transcription network that transcends layer boundaries, thus tackling head-on the question of what is written in the human genome sequence. In recent years especially, we have been engaged in high-volume data analysis using a next-generation sequencer.
We had the idea of embarking on a transcriptome analysis at a time when the world’s attention was focused on the human genome project. It was in 1995, when, as mentioned above, the US NIH announced its intention to complete the decoding of the human genome within eight years, that I was appointed the Project Director of RIKEN Genome Science Laboratory. With Americans leading the world in genome science with the backing of their government, I needed to work out a meaningful research strategy which would allow us to hold our own in the world. My proposal in response was that we develop a full-length cDNA technology capable of a complete RNA analysis and use it to decode the transcriptome. The advance of the life sciences requires the decoding not only of the genome, but of the whole of RNA, as the genome sequence alone cannot tell us which gene is expressed when. However, with the technology available at the time, it was not possible to efficiently prepare a full-length cDNA library. We therefore set up a full-length cDNA project and set about developing a full-length cDNA technology to allow efficient analysis of large volumes of RNA data. When the human genome sequence draft was published in 2001, the full-length cDNA database assembled using our technology provided the most important information for the decoding of the genome; namely, it indicated which parts of the genome are transcribed, in other words whereabouts (on the genome) the genes are located. In this way it contributed to the international human genome project by furnishing complementary data.
Looking back, it seems to me that, although I may always have deliberately steered in a different direction from the world’s major projects, I have sought to follow a research strategy capable of providing complementary results (Table 1). This strategy of deliberately following a different direction perhaps characterizes my natural research style. For instance, as mentioned above, at a time when the international human genome project was focusing its energies on the whole human genome sequence, we deliberately turned our attention to RNA and set up the full-length cDNA project. Then, after the decoding of the whole human genome was completed, the US NIH set up ENCODE, the Encyclopedia of DNA Elements project, which aimed to identify all functional elements in the human genome sequence; but in Japan, we at the RIKEN OSC became the base for the Genome Network Project sponsored by the Ministry of Education, Culture, Sports, Science and Technology. In this new research activity, we at RIKEN organized, and collaborated in, an international research consortium for the Functional Annotation of the Mammalian Genome (FANTOM), a novel attempt to carry out network analysis using the full-length cDNA data. The contrarian streak that led me to follow an independent path came to the fore again when, directly after the decoding of the human genome was completed, the US NIH launched the $1,000 genome technology project, which aimed to develop an ultra-fast sequencer that would permit human genome decoding for $1,000. We instead chose to attempt the development not of a complete sequencing technology that could decode the whole human genome, but of a typing technology (SmartAmp) that would simply and quickly measure a target section. This technology is very simple and can be used as a point-of-care technology in clinical and outpatient practice and other environments.
Table 1 Projects in the world and Hayashizaki’s group
Meaningful research, I think, means research that, while independent in nature, produces results that complement worldwide research. I believe that contributing to world science demands both the independence to establish a strong position and the insight to take part in international collaboration. In the present review, I would like to present our transcriptome research while making reference to the contrarian research strategy that I have pursued within the world’s cutting-edge research.
As outlined above, in 1995, when the US NIH announced that it would complete the decoding of the human genome within eight years, I felt that I needed to establish a research strategy that could hold its own in the world in the field of genome research. It was obvious that we could not compete in human genome research against the US, which had a budget several tens of times larger. So, for the three following reasons, we decided to develop a full-length cDNA technology and use it for a comprehensive and systematic analysis of RNA.
Development of a sequencer was already proceeding in the US with government backing, but in Japan, there was no such backing, and the opportunity had been missed.
The cDNA synthesis technology of the time was still a long way from anything that could be described as full-length. I will come back to this later, but to analyze mature RNA, from which introns (intervening sequences: domains that are removed after transcription into RNA and not translated) have been removed by a special mechanism known as splicing, the full-length cDNA has to be synthesized and then sequenced. In the case of alternative splicing in particular, where the introns removed differ according to the tissue and the stage of differentiation, producing different mature RNAs even when they are transcribed from the same gene on the genome, it is very important to analyze the full-length RNA sequence. Because of this, full-length cDNA is an essential technology and contributes to progress in the life sciences. We believed that, if we could develop a full-length cDNA technology, then once a sequencer was developed, the field of transcriptome analysis would receive a great boost and our contribution to the world would be evident. Today, when the world has come as far as developing next-generation sequencers, the full-length cDNA technology we developed is still the only full-length RNA analysis technology in the world.
Gene prediction and function analysis based on the human genome sequence are performed using a variety of computer programs with reference to previously obtained genetic information. These programs are designed on the basis of the open reading frames (ORFs) that code for proteins, sequence conservation between species, promoter sequence motifs, and so on. In other words, they were effective in finding the typical protein-coding genes featured in textbooks of the time. However, it is not easy to predict genes with previously unknown features, for instance ncRNA genes or genes with long introns. Full-length cDNA, on the other hand, is a copy made using as a template the mRNA transcribed from the genome. In general, genes on the genome undergo various modifications after being transcribed, assuming the form of mature mRNA after splicing removes the unnecessary sections (introns). Full-length cDNA is a copy of this mature mRNA, and determining its base sequence gives access to the sequence information of the exons (the sections that code for protein). In this way, the primary structure of a protein can also be predicted. The reliability of the analytical data can be improved by adding these full-length cDNA data as experimental information for the previously impossible task of gene prediction.
It is clear that combining genome and cDNA analysis data, rather than using them separately, is likely to lead to new discoveries. For instance, comparing genome sequence and full-length cDNA sequence information reveals information on the position of the genes on the chromosome. Also, if a number of full-length cDNA clones are mapped onto a specific domain of the genome, and variation is seen in the combination of exons used, information on alternative splicing can be obtained.
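As a toy illustration of this point (the clone names and exon coordinates below are invented for the example, not project data), mapping several full-length cDNA clones onto the same genomic locus and comparing which exons each clone uses is enough to reveal alternative splicing:

```python
# Hypothetical example: each full-length cDNA clone aligned to one genomic
# locus is represented as a tuple of (start, end) exon coordinates.
clone_alignments = {
    "cloneA": ((100, 200), (300, 400), (500, 600)),
    "cloneB": ((100, 200), (500, 600)),            # skips the middle exon
    "cloneC": ((100, 200), (300, 400), (500, 600)),
}

def alternative_splicing_detected(alignments):
    """True if clones mapped to the same locus use different exon combinations."""
    distinct_isoforms = set(alignments.values())
    return len(distinct_isoforms) > 1

print(alternative_splicing_detected(clone_alignments))  # True: cloneB skips an exon
```

Real pipelines must of course handle alignment errors and fuzzy exon boundaries; the sketch only shows the underlying logic of comparing exon combinations across clones.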
Single-stranded RNA transcribed from DNA forms hydrogen bonds between its own bases and thus readily forms higher-order structures. Because this makes synthesis of full-length cDNA difficult, the main focus of cDNA analysis at that time was limited to collecting large amounts of cDNA fragments known as expressed sequence tags (EST). In the full-length cDNA project, the following ways of overcoming this difficulty and collecting full-length cDNA clones in a highly efficient manner were developed:
1) A full-length cDNA elongation method using trehalose thermostabilization (Fig. 1)
Fig. 1 Full-length cDNA synthesis with trehalose.
2) A method of selecting full-length cDNA using the 5’ cap structure (cap trapper method, Fig. 2)
Fig. 2 Cap trapper method.
3) Normalization and subtraction methods using biotinylated known RNA to collect unknown cDNA in a highly efficient manner, and
4) High-efficiency cloning vector systems.
The trehalose method is a technology elaborated together with Piero Carninci, who was working in my group as a postdoctoral fellow at the time. We concluded that, because RNA assumes secondary structure, reverse transcriptase terminates synthesis at stem-and-loop structures during the synthesis of first-strand cDNA. Raising the reaction temperature to melt the RNA secondary structure should resolve this problem, but when the temperature is raised, reverse transcriptase is deactivated. In this connection, we found a number of interesting reports in the literature: ‘when baker’s yeast is subjected to heat shock, trehalose concentration increases’ (Hottiger et al., 1994); and ‘when a baker’s yeast strain deficient in trehalose synthetase is subjected to heat shock, it dies, as it has very poor heat resistance’ (De Virgilio et al., 1994). Generally, heat shock causes the production of denatured protein within the cell, which, because it is highly toxic to the cell, sets off two phenomena: the ubiquitin pathway becomes active, and chaperone proteins refold the denatured protein. Accordingly, we derived from the two reports quoted above the working hypothesis that trehalose may act as a chaperone that promotes refolding and protects proteins from heat denaturation. Based on this working hypothesis, we found that adding trehalose to the reverse transcriptase reaction fluid raised the optimum temperature by 20 degrees and made possible the synthesis of first-strand cDNA with a length of 16,000 bases. As the longest cDNA at the time was no more than 2,000 to 3,000 bases, this was a startling result. We subsequently succeeded in the synthesis of full-length cDNA (Fig. 1).
After completing these full-length cDNA technologies, in order to sequence the synthesized cDNA quickly and at high volume, we worked jointly with the Shimadzu Corporation to develop the world’s first 384-lane capillary-type sequencer, RISA (Fig. 3). The 96-capillary sequencer ABI 3700, the main device used by the international human genome project, came into use in 1998, but RISA had been developed for RIKEN’s full-length cDNA production one year earlier, in 1997. To exclude human error in the process from clone creation through sequence data analysis to database creation, we also elaborated an integrated system in which all processes operate in the 384 format. To this end we also had to develop a fully automated line for extracting plasmids from Escherichia coli. Finally, using this system, which allows analysis of approximately 46,000 full-length cDNA clones per day, we were able to assemble mouse full-length cDNA in large volumes.
Fig. 3 RISA sequencer.
From first thinking about the full-length cDNA project, then developing the full-length cDNA technology and constructing the pipeline for analysis, through to producing the actual data, took up a total of four years. This four-year period with no papers published was the hardest time for me, but within the RIKEN organization, the scientists who were my supervisors showed a generous understanding of the situation and waited patiently. Thanks to them, in 1999, I was able to make a presentation at the Cold Spring Harbor Laboratory on our full-length cDNA technology and the data we had collected. To my great honor, the content of this presentation was featured in the scientific journal Science under the title ‘A Mouse Chronology’ (Pennisi, 2000).
There was a range of opinion as to the publication of the full-length cDNA data we had analyzed. We could continue to use the data exclusively for our own research without publication. However, in the interest of advances in life science research, we decided that the full-length cDNA data should be made publicly available. Moreover, seeing from day to day the consistently open and fair attitude of my supervisor, the first director of the RIKEN Genomic Sciences Center, Dr. Akiyoshi Wada, I began to feel that being open would increase the value of the full-length cDNA.
It is only after data has begun to be used around the world as a database that its value becomes apparent. Our database acquired functional annotation through the activities of the international FANTOM consortium described below and was published as the world’s first international standard transcriptome database for a higher animal in Nature on February 8, 2001 (Kawai et al., 2001). The full-length cDNA clone bank of 20,000 clones and the international standard database have undergone a wide range of improvements and modifications, but even today these data are used by researchers around the world and the database is accessed once every five seconds. This database was used in the analysis of the human genome draft sequence prepared by the International Human Genome Sequencing Consortium as the experimental basis for identifying transcriptional units. Thus, the first paper on the human genome draft sequence (Lander et al., 2001) was published as joint research in the same journal, Nature, on February 15, 2001, one week after the publication date of our above paper. This demonstrated that, as originally expected, our full-length cDNA data was able to contribute complementary information to the human genome sequence. Further, in 2002, a special issue on the mouse by the journal Nature carried data on 103,000 clones of full-length cDNA isolated from 2,000,000 clones extracted from 263 mouse tissues (Okazaki et al., 2002). This was the world’s first map to present a complete correspondence between the genome and RNA and was the basis for an upsurge in positional candidate cloning of pathogenic genes. Another example of our contribution is the induced pluripotent stem (iPS) cell research. Dr. Shinya Yamanaka of Kyoto University selected 23 transcription factors specific to the ES cell from our full-length cDNA database, as a result of which he successfully established the iPS cell.
After assembling the full-length cDNA, we appealed to scientists around the world to cooperate in joint research aimed at achieving its functional annotation, and the international research consortium Functional Annotation of Mammalian Genome (FANTOM) was formed in 2000. Currently this consortium has benefited from the participation of researchers at 51 institutions in 19 countries around the world active in a wide range of specialist fields including molecular biology, cell biology, and bioinformatics.
The consortium has so far completed four project stages, and the fifth stage is in progress (Fig. 5). As of 2011, the international FANTOM consortium had grown into the ‘jamboree’-type consortium with the longest history in the world. In FANTOM 1 and 2, full-length sequencing and functional annotation were carried out on a total of 60,770 full-length cDNA clones (Kawai et al., 2001; Okazaki et al., 2002). The following FANTOM 3 saw the development of two new transcriptome analysis tools, the Cap Analysis of Gene Expression (CAGE) method and the Gene Signature Cloning (GSC) method, which can identify transcription start sites and transcription termination sites comprehensively and at high throughput. This afforded many new insights that could not have been obtained from full-length cDNA technology alone. Particularly noteworthy were the findings that more than 70% of the genome undergoes transcription and that more than half of the transcripts are ncRNA (Carninci et al., 2005). This discovery of the RNA continent had a great impact, and in 2010 the journal Science chose this research as one of the ten pieces of research with the greatest impact of the previous 10 years, under the title ‘The Dark Genome’ (Pennisi, 2010).
Working in close coordination with the Genome Network Project of the Ministry of Education, Culture, Sports, Science and Technology, the next stage, FANTOM 4, combined the next-generation sequencer and the CAGE method to establish a technique for obtaining the gene expression profile. Then, through work centered on this technology, the world’s first dynamic transcriptional regulatory network functioning at promoter level in cell differentiation was successfully elaborated, using a model of monoblastoid to monocytoid cell differentiation in the human acute monocytic leukemia cell line THP-1 (Suzuki et al., 2009). The findings were published as a database and have been used widely as a world standard in life science research.
This network analysis posited the new and original concept of the Basin Network. For the cell to maintain a constant character, a limited number of regulatory transcription factors and ncRNA form a network, and through positive and sometimes negative feedback, regulate each other to maintain a constant concentration within the nucleus. This stable energy state was called a ‘Basin Network’. Once this network is formed, the cell controls peripheral genes through the concentration of this limited number of transcription factors and other regulatory factors. As this Basin Network controls the phenotype the cell takes on, it can be seen as the definition of the cell itself. The aim of FANTOM 4 was to elucidate this network of factors at molecular level.
In this research stage, we made ample use of the CAGE method independently developed by our research group. Building on the cap trapper method we had developed to capture the RNA cap site, CAGE is a technology that traps the cap structure, creates a library of cDNA fragments derived exclusively from the cap-site domain, and sequences them with a next-generation sequencer. This method can be used to measure which part of genomic DNA contains what degree of promoter activity in which direction (i.e., where RNA is transcribed from and in which direction). In this way, we can discover in what part of the genome the promoter of a transcription factor itself is located.
Because the activity of a promoter consists of the sum of the activities of the transcription factors that regulate it, acting through the recognition sequences (motifs) present in that promoter, the formula in Fig. 4 applies. The results of tracking the process of differentiation of the THP-1 cell from monoblast to monocyte using the CAGE method revealed, to our surprise, the presence of 29,857 promoters on the whole genome. This meant that there was a finite number of promoters on the whole genome that were susceptible to analysis, and that analysis of the changes in the regulatory effects of these promoters could be used to explain changes in the character of human cells at molecular level. An actual analysis of the ‘Basin Network’ in the THP-1 cell differentiation process allowed us to successfully explain at molecular level the transcriptional regulatory network operating in this process.
Fig. 4 The linear expression model and the motif activity. e_ps is the number of CAGE tags measured for promoter p in sample s, and R_pm is the response of promoter p to motif m. From these data, A_ms, the activity of motif m in sample s, can be calculated.
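A linear model of this kind can be fitted by least squares. The sketch below is a minimal illustration of the idea, not the project's actual pipeline: the tag counts, motif responses, and matrix sizes are invented, and the exact formulation (log transforms, normalization, regularization) is assumed away.

```python
import numpy as np

# Hypothetical data: expression e_ps of 4 promoters (rows) in 3 samples
# (columns), and a response matrix R_pm of 4 promoters x 2 motifs.
e = np.array([[5.0, 6.0, 7.0],
              [2.0, 2.5, 3.0],
              [4.0, 5.0, 6.0],
              [1.0, 1.2, 1.5]])
R = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.0, 0.5]])

# Motif activities A_ms solve e ~= R @ A in the least-squares sense.
A, *_ = np.linalg.lstsq(R, e, rcond=None)
print(A.shape)  # (2, 3): activity of each motif in each sample
```

The key point is that, with a finite set of promoters and motifs, the unknown motif activities are overdetermined by the measured expression values, so they can be inferred sample by sample.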
Fig. 5 History of FANTOM projects.
These activities have been carried forward into FANTOM 5, which is now under way. In FANTOM 5, based on the network analysis technology built up in FANTOM 4, promoter mapping for a range of cell types is being carried out. The aim of this is to explain the mechanism of cell diversity, whereby cells sharing the same internal genome nevertheless exercise diverse functions.
A glance back over the path of the FANTOM projects to date shows that, having set itself the aim of understanding life at molecular level, the project’s object of research is moving steadily up the layers in the system of life, progressing from an understanding of the ‘elements’ - the transcripts - to an understanding of the ‘network’ - the transcriptional regulatory network, in other words the ‘system’ of an individual life form. This kind of data-driven research is often dominated by America and Europe for budgetary reasons, but with FANTOM, the originality of the Japanese strategy has paid off, making this what I am proud to think is an encouraging example of our country playing a world-leading role.
In the following, I will present the activities of FANTOM to date.
In the initial activities of FANTOM, the rules and methods of gene functional annotation were set and definitions were laid down to allow gene functional annotation to be carried out efficiently (Kawai et al., 2001). These functional annotation methods and the database have now become the international standard.
Using a definition of the transcriptional unit (TU) as the transcription domain from the initial exon to the terminal exon, FANTOM 2 carried out base sequence determination, functional annotation, and categorization of 60,770 sets of mouse full-length cDNA. These activities represented the world’s first standardization of mammalian full-length cDNA, and the resulting scientific paper appeared in a special issue of Nature together with a report on mouse genome decoding (Okazaki et al., 2002). As a result of this comprehensive full-length cDNA analysis, not only was it possible to identify the protein-coding gene sequence and its position on the chromosome, but an unexpected diversity was discovered in ncRNA and alternative splicing. Moreover, the discovery was made of many sense-antisense pairs, in which a certain domain is transcribed in mutually opposite directions, suggesting that the expression of one of the pair is controlled. This result, which could not have been obtained from a computer prediction using genome sequence information alone, illustrates the value of full-length cDNA analysis.
In the third phase of activities, the functional annotation of a total of approximately 103,700 full-length cDNA clones was carried out. It was in this project that we developed the new transcriptome analysis tools known as the Cap Analysis of Gene Expression (CAGE) method and the Gene Signature Cloning (GSC) method, which can identify the transcription start site and transcription termination site comprehensively and at high throughput. Because these technologies make it possible to detach fragments (tags) of specific length from the 3’ terminal and 5’ terminal of a cDNA clone, join the tags up, and sequence them in one go, they are used in high-efficiency gene expression analysis, transcription start site identification, and promoter domain prediction analysis. Using these technologies and the Gene Identification Signature (GIS) method developed by the Genome Institute of Singapore, analytical information on transcription start and termination sites was collected for 11,567,973 tags in the mouse and 13,706,472 in the human. These analysis results overturned the conventional assumption that only 2% of the genome is transcribed by showing that 70% or more of the genome is transcribed as RNA (Carninci et al., 2005). Further, the existence of more than 23,000 non-protein-coding RNAs was confirmed, establishing that half or more of transcription products consist of ncRNA (the discovery of the RNA continent). It was also shown that ncRNA has a variety of functions and that 73% of transcription products undergo sense-antisense transcription (Katayama et al., 2005).
These findings were made possible precisely because of a data-driven approach to analysis which focused directly on the data (Fig. 6).
Fig. 6 Examples of mammalian ncRNAs.
In FANTOM 3, the existence of a large amount of functional RNA was established. If the action of this functional RNA and its relationship with other molecules could be used to elucidate the molecular network present in the cell, it would be possible to better understand the origin and differentiation of life and the mechanism of disease onset.
In cell differentiation, the fertilized egg sometimes passes through a number of precursor cell stages before developing into the final target cell type. Through mutual regulation, the transcription factors expressed in each of these cell states reach a specific equilibrium state of free energy. In this state, the concentrations of the transcription factors, and of the peripheral genes they control, remain within a fixed range. We called this steady state the attractor basin. The differentiation of the cell into a cell type of the next stage constitutes a shift to the next specific attractor basin.
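The attractor basin idea can be illustrated with a toy dynamical model (entirely hypothetical equations and parameters, not drawn from the FANTOM analysis): two mutually repressing transcription factors settle into one of two stable steady states depending on where they start, and each stable state corresponds to a basin.

```python
def step(x, y, dt=0.1):
    """One Euler step of a toy mutual-repression circuit: each factor is
    produced at a rate repressed by the other, and decays linearly."""
    dx = 4.0 / (1.0 + y**2) - x
    dy = 4.0 / (1.0 + x**2) - y
    return x + dt * dx, y + dt * dy

def settle(x, y, steps=2000):
    """Iterate the circuit until it has (approximately) reached a steady state."""
    for _ in range(steps):
        x, y = step(x, y)
    return x, y

# Different starting points fall into different basins of attraction:
print(settle(2.0, 0.1))  # settles with x high and y low
print(settle(0.1, 2.0))  # settles with y high and x low
```

A shift between basins, as in differentiation to the next cell stage, corresponds in this picture to a perturbation large enough to push the system across the boundary between the two stable states.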
Using this attractor basin concept, FANTOM 4 was able to represent as a time course the interactive network of transcription factors at work in the cell differentiation process. Using the THP-1 model cell line (a human acute monocytic leukemia cell line), 30 motifs which govern the cell differentiation process were identified from among approximately 200 transcription factor binding motifs in the process of monoblastoid to monocytoid cell differentiation, allowing a representation of the dynamic transcriptional regulatory network (Suzuki et al., 2009).
This stage saw the development of the deep CAGE method (deep sequencing with CAGE), which enables large-volume expression analysis at promoter level through combination with a next-generation sequencer. This technology made it possible to sequence at a 10-fold greater depth than previous analyses, making the analysis more quantitative, more comprehensive, and more accurate. As a result, it became possible to detect, with a probability of 99.99% or more, a single copy of a transcription product expressed in one cell, thus permitting the almost complete identification of the transcription start sites of the THP-1 cell.
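The intuition behind such a detection probability can be sketched with a simple sampling calculation. The figures below (total transcripts per cell, number of tags sequenced) are illustrative assumptions, not the values used in the FANTOM 4 analysis, and the binomial sampling model is a simplification:

```python
def detection_probability(copies_per_cell, transcripts_per_cell, tags_sequenced):
    """Probability of sampling at least one tag from a transcript present at
    `copies_per_cell` among `transcripts_per_cell` total transcripts, after
    sequencing `tags_sequenced` tags (independent-draws assumption)."""
    p_single = copies_per_cell / transcripts_per_cell
    return 1.0 - (1.0 - p_single) ** tags_sequenced

# Illustrative figures: ~300,000 mRNA copies per cell, a transcript present
# at 1 copy, and 3 million sequenced tags.
p = detection_probability(1, 300_000, 3_000_000)
print(p > 0.9999)  # True under these assumed numbers
```

The point is simply that sequencing depth an order of magnitude beyond the transcript population drives the miss probability toward zero, which is what makes near-complete promoter maps feasible.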
This analysis of the transcriptional regulatory network demonstrated that cell differentiation is not managed by a small number of master genes, but comes about through the joint roles played by a number of transcription factors. Also, the technique we developed for analyzing the dynamic transcriptional regulatory network at promoter level made it possible to make a high-level prediction of the transcriptional regulatory network from experimental data alone with no need to refer to information from the literature on transcriptional regulation. The application of this technique will increase the possibility of achieving artificial control of target cell differentiation, and will no doubt also contribute to regenerative medicine.
At present, the FANTOM 5 project has only just begun. The aim of this stage, based on the analysis methods developed in FANTOM 4, is to analyze the promoter map and transcriptional regulatory network in a wide variety of cell types to gain an integrated understanding of cell diversity. The resulting findings are likely to be useful, among other things, in the development of RNA biomarkers, and to contribute to preemptive medicine.
Scientific breakthroughs often originate in new technology. The RIKEN genome strategy which I devised was designed in the belief that developing original technologies was the basic key to establishing a strong position. I also believe that establishing such a position is a prerequisite for contributing to world science.
Based on this view, I pursued the three policies listed below as director of the RIKEN Omics Science Center. They are written down on a slide which I always use and I hope you will forgive me if they sound a little arrogant.
1) Don’t copy other people.
2) Start by developing independent technology.
3) Follow a data-based research style (data-driven technology verification research).
In Japan, research focused on RNA from an early date, and a unique position was achieved which allowed us to lead the world in technology for full-length cDNA clone collection and base sequence determination. Going forward, research into RNA including ncRNA will no doubt continue to be a biologically important theme. Genome-wide gene analysis data is also likely to prove increasingly useful in developmental biology, immunology, and a wide range of other research fields. Japan was a later starter in genome sequencing and in the development of next-generation sequencers, but it has elaborated a wide variety of methods in sample preparation technology and information processing technology for transcriptome analysis. As these technologies influence the creative originality of gene research and its very character, I believe that there will be an increasing need to devote energies to them.
Winning the Kihara Prize from the Genetics Society of Japan has proven a good opportunity to summarize the path that I have followed. My research resume can in one sense be described as the continuous pursuit of the question: what is written in the genome? Through my efforts, I have sought to answer various questions: what kind of RNA will be found encoded when the base sequence of the genome is unlocked? What kind of network regulates that RNA? How does the regulatory network determine the physiological and pathological character of cells at various stages of differentiation, cancer cells, and so on? The question ‘What is written in the genome?’ is now being replaced by the question ‘What elements are written there and what kind of network does that group of elements use to form itself into a system?’. I will be happy if presenting this one part of my own research path encourages the younger generation of researchers to carve out new paths of their own in the future.