Edited by Etsuko Matsuura. Masa-Toshi Yamamoto: Corresponding author. E-mail: yamamoto@kit.jp |
Since D. melanogaster genome sequence was published in 2000 (Adams et al., 2000), D. pseudoobscura genome sequence was released in 2003, and additional ten Drosophila species genome sequences were reported recently (Clark et al., 2007; Gilbert, 2007). The genomes of the melanogaster subgroup, D. erecta, D. sechellia, D. simulans and D. yakuba, were included. Sequencing and comparative analysis among the genomes of evolutionarily related species are important approach which will give us great insights into the functional aspects of genomes, for example, genome annotation and identification of genome features, as discussed in several reports (Bergman et al., 2002; Richards et al., 2005; Zdobnov et al., 2002).
BAC libraries are the important and basic resource for any kind of genomic analyses concerning the whole genome, such as full sequencing of BAC inserts, construction of BAC and BAC-based maps (Ren et al., 2003), comparative analysis of gene structures and synteny (Larkin et al., 2003; Leeb et al., 2006), and powerful transformation technologies (Bischof et al., 2007; Oberstein et al., 2005; Venken et al., 2006) using DNA fragments larger than 100 kb to specific sites in the Drosophila genome.
We thus constructed BAC libraries and analysed the BAC end sequences (BESs) of five species, which were D. melanogaster, D. simulans, D. sechellia, D. auraria, and D. ananassae. D. melanogaster was selected as a control in order to measure the various qualities of the BAC libraries and the construction processes. The other four species are extensively utilized and recognized important organisms for research on speciation, hybrid incompatibility (Sawamura et al., 1993, 2004), food preference (Higa and Fuyama, 1993), mating behaviour (Tomaru et al., 1995), and male recombination (Tobari, 1993). D. auraria was specially selected for a species in the montium subgroup, which are interesting model organism for the research on cold resistance and diapause in adaptation biology (Kimura, 1988). The five species are abbreviated by DME1 for D. melanogaster, DSM1 for D. simulans, DSE1 for D. sechellia, DAU1 for D. auraria, and DAN1 for D. ananassae. All five species belong to the melanogaster species group. As these species diverged at different times as shown in Fig. 1 (Tamura et al., 2004), the BAC and the BES-based comparison between these species would be expected to provide sufficient information and abundant molecular resources for various research with the references of the genome sequence information recently reported (Clark et al., 2007; Gilbert, 2007).
![]() View Details | Fig. 1 Phylogenetic relationships of the melanogaster Species. The divergence times are based on the work in Tamura et al. (2004). The species for which we constructed BAC libraries are written in bold. |
After sequencing BESs, we assessed the sequence quality in order to perform accurate BES mapping. The high quality portion of the sequences were then compared with the D. melanogaster genome sequence (Berkeley Drosophila Genome Project (BDGP) release 3: http://www.fruitfly.org/). These mapping processes allowed us to identify which BAC clone includes the homologous genes or regions corresponding to the region of interest in D. melanogaster. The high quality BESs data have been deposited to DDBJ under accession nos. AG909574 through AG981700. All information concerning the BAC libraries and BESs are also available via Drosophila Genetic Resource Center, Kyoto Institute of Technology (http://www.dgrc.kit.ac.jp/en/), and searchable at the web site (http://fly.gsc.riken.jp/). The results for genome-wide comparative BES mapping can be viewed through the web site of RIKEN with data embedding into UCSC genome browser (Kent et al., 2002).
Drosophila strains from which genomic DNA was extracted were Hikone-R of D. melanogaster, Rakujuen RM13 of D. simulans, ss01 of D. sechellia, A662 of D. auraria, and AABBg1 of D. ananassae. All strains are available from DGRC (Kyoto Inst. Technology), Kyorin University or Ehime University through the web sites (http://kyotofly.kit.jp/stocks/).
BAC libraries were constructed according to the procedures previously described (Fujiyama et al., 2002). In brief, frozen fly bodies were homogenized in 0.2 M sorbitol, 20 mM Tris-HCI (pH 7.5), and 10 mM EDTA, then centrifuged at 5,000 rpm for 10 min after removal of large debris. Sediments were embedded in 1% agarose gel, treated with Sac I, and subjected to pulsed-field gel electrophoresis. The DNA fragments ranging from 80 to 200 kb were isolated and ligated with pKS145 vector. Transformation was carried out electronically using E. coli DH10B as a host strain. Ampicillin-resistant transformants were collected and stored in 384-format plates.
BAC clones from each library were rearrayed into 96 well plates and then cultured overnight in 1.5 ml of 2x LB medium containing 50 μg/ml ampicillin at 37°C. The BAC DNA was isolated by using a Montage BAC 96 Miniprep kit (Millipore) and dissolved in 35 μl of 10 mM Tris-HCl buffer (pH 8.0). Cycle sequencing reaction was performed in a 15 μl reaction volume containing 10 μl BAC DNA solution, 1.5 μl Big Dye terminator Ready Reaction mix (Applied Biosystems) and 4.8 pmole of sequence primer. Custom-designed sequence primers at the forward and reverse sides were TTCCCAGTCACGACGTTGTA and TCACTATAGGGCGGCCGCTA, respectively. PCR conditions were as follows: one cycle of 5 min at 95°C, followed by 75 cycles of 30 sec at 95°C, 10 sec at 50°C and 4 min at 60°C. The reaction mixture was purified by isopropanol precipitation followed by 70% ethanol wash and loaded on ABI 3700 automated capillary DNA sequencers (Applied Biosystems).
Base-calling and vector identification was performed by the Phred program (Ewing and Green, 1998; Ewing et al., 1998), which assigns a quality value (QV) to each base. After the Phred scoring step, we further eliminated vector sequences containing E. coli-derived sequences from the sequence as follows. We queried each BES on both the E. coli and D. melanogaster genomes by BLASTN (Altschul, 1997). If a BES hit with E. coli, and the hit meets the criteria (length ≥ 120 bp and percent identity ≥ 90%), then the hit was regarded as a sequence derived from E. coli. We confirmed that this criterion does not eliminate the D. melanogaster sequence.
We identified additional bacterial sequences, such as a commensal bacterium, Wolbachia, by searching with the BESs on the non-redundant nucleotide database, called ‘nt’ provided by NCBI (ftp://ftp.ncbi.nlm.nih.gov/blast/db/). When significant hits (condition “length ≥ 50 bp and identity ≥ 80%”) were observed against bacterial sequences, we regarded them as bacterial contamination and eliminated them from the BES data set.
In determination of the region of BESs with high enough score to be analyzed, we have developed a new approach by which unnecessary removal of useful sequence information is reduced as much as possible. Determination of a high QV region in a BES is often carried out by extracting a continuous region of QV ≥ 20 and length ≥ 100 bp. It is generally observed that QVs are low at the first 1–50 bp region and the end of the BES, therefore a QV-position curve usually shows a single broad peak, which shows a high QV region to be selected. However for some BESs, the QV-position curve drops under 20 (QV) around 200 bp position, which is possibly caused by a dye blob as pointed out previously (Zhao et al., 2001). The low quality regions typically are several bases in length. In the case that there are two or more high quality peaks in a BES, the single-peak assumption allows us to obtain only a single block of high QV region either before or after the short low QV regions, resulting in the elimination of other high QV regions from the BES. To solve the problem, and significantly extend the query sequence without lowing the mapping quality, we employed all long high QV regions as a query, even if the sequence contains intervening low QV regions, as long as they were short. To utilize this idea, we first identified blocks with high QV (QV ≥ 20 runs continuously at least 100 bp), and then we combine the blocks containing intervenient low QV regions. We call the procedure ‘multi-block setting’. This is useful especially for comparative BES mapping to identify homologous regions in a genome of related species. Because even homologous or syntenic DNA sequences of phylogenetically distant species sometimes differ significantly, (i.e. nucleotide identity < 75%), it is important to use a longer sequence as the query for the accurate identification of homologous regions. The low frequency of low quality or incorrect bases in a query DNA sequence may be a minor problem when the query is long enough. For instance, in the case that a BES contains a 10 bp region with the QV = 10, the error is estimated to be only 1 base. As the number of sequence errors in the low quality regions combined by the method should be very small, the sequence quality hardly gets worse by the multi-block setting. The multi-block settings extended the BES by 46%–66% longer than before, recovering those useful regions. The average extended regions per BES are ranged from 178 bp to 218 bp, while low QV bases incidentally included in the query were only 1.9%–2.8% of the resultant BES on average.
For self-comparison, BLASTN was performed with the parameters ‘-U T -F ''m D'' -e 1e-4’. The “U” option uses lower-case filtering of FASTA sequence. The “W” option specifies the word size in initial seeds. The “F” option uses filtering by the Dust program. The “e” option specifies expectation value, or the expected number of sequences to be reported. We set a criterion to identify the region (assumed to be novel repetitive sequence) in the BES which has multiple hits in the self comparison as follows. When only one region of BES matched with only another BES, it might be the same location of the genome was sequenced as the two BESs. To be more confident with the interpretation that the matches are novel repeat sequences, we required two BES matches (i.e. three similar sequences) to regard them as repeat sequences. We describe the conditions as below more rigorously. When a region (from x1 to x2 in base position, x1 < x2) in the query sequence hits with the region in the subject (another BES) sequence (from y1 to y2) with plus direction, then the hit is represented as h(s1, s2), where, s1 = (x1, y1), s2 = (x2, y2). Here s1 is a vector to represent the first base of the hit, and s2 is a vector to represent the last base of the hit. Let us suppose there is another hit h(t1, t2) for the same query sequence but with another BES with the same (plus) direction, where t1 = (x′1, y′1), t2 = (x′2, y′2), and x′1 < x′2. y′1 < y′2. The region distance is defined as d = √|x1 – x′1|2 + |x2 – x′2|2, which indicates the degree of difference between the positions of the edges of the two hits along the query. Let x1,min = min(x1, x′1), x1,max = max(x1, x′1), x2,min = min(x2, x′2), x2,max = max(x2, x′2), U = {x1,min, x1,min +1,···, x2,max}, V = {x1,max, x1,max +1,···, x2,min}. Then, U is the union of the two hits in the query, and the V is the overlapped region between the two hits in the query. Then the overlap rate is defined as r = |V|/|U|. A set of selected hits were called multi-hits if all the hits have e-values ≤ 0.01, and region distance d (described above) ≤ 20 bp and overlap rate r (described above) ≥ 0.5. The multi-hits portions were masked from BESs data set before comparative BES mapping.
BESs were mapped onto the D. melanogaster genome sequence (Release 3) taken from the UCSC WWW site (http://genome.ucsc.edu/, version of January 2003), which was provided by the Berkeley Drosophila Genome Project (BDGP). BLASTN was performed with the parameter set “-r 1 -q -1 -G 1 -E 2 -U T -F “m” -e 1e-2” (called ‘arranged set’), as well as the default parameter set. The parameters indicate match score, mismatch penalty, gap initial penalty, gap extension penalty, set lower cases non-seeds, filtration, and maximum e-value to report. The parameter was based on the suitable set for genome comparison where target identity of a match is around 75%, while the default parameters assume that the target identity of a match is about 90% (Korf et al., 2003). If the BLAST hits were split, a chaining rule was applied to select and combine consistent split hits for a query BES, as PIPMaker does (Schwartz et al., 2000). The BLAST hits with heterochromatic sequences were not considered.
Matched regions were extracted and clustered by the blastclust program in BLAST suite with parameters “-p F –L 0.9 –S 99 –b F” (Korf et al., 2003). This clusters sequences that are expected to be nearly identical (99 percent identity, 90 percent coverage). For the top 30 clusters for each species, we aligned the sequences for each cluster by CLUSTALW version 1.83 (Thompson et al., 1994) and constructed a consensus by the prophecy program in EMBOSS package (Olson, 2002) for each cluster. Then BLASTN with default parameters was performed against the nt database (non-redundant nucleotide database in NCBI, ftp://ftp.ncbi.nlm.nih.gov/blast/db/).
When a pair of BESs of a BAC clone were mapped to the genome with significant hits, the BESs or BAC clone was categorized as either valid or invalid, according to the following definition. If the two BES hits on the D. melanogaster genome are inward indirection and their distance Lin on the D. melanogaster genome is reasonable (Lin ≤ 300 kbp), the BESs were called valid. In the ‘unmapped’ BESs, if each BES was mapped on the wrong direction or on different chromosomes or Lin > 300 kbp, the BESs were called invalid. The invalid BESs imply that the BESs are correlated with any chromosomal rearrangements, such as long insertion and deletion, while valid BESs are not. The valid BESs for which the insert sequences overlap each other can be assembled to be a contig. The invalid BESs were clustered when their locations are close and the directions are the same. The difficulty arises when there are contradictions among valid and invalid BESs due to any artefacts, or shortness of BESs. There would be artefacts if there are duplications and subsequent mutations in any of the species, which make it difficult to identify real syntenic regions. To reduce artefacts, we selected the regions with breakpoint candidates around which we have more than one supports of both valid and invalid BESs.
We constructed BAC libraries composed of between 8,000 and 12,500 clones for the five species. The clone numbers of BAC libraries and BESs are shown in Table 1 (Initial total BAC libraries and Initial total BESs). Base-calling was performed by the Phred program (Ewing and Green, 1998; Ewing et al., 1998), which assigns a quality value (QV) to each base. When a base has a Phred QV ≥ 20, the error rate is ≤ 1%. Such a base is called a “Q20 base.” To assess the quality of a sequence, the “Q20 length” was defined (Zhao et al., 2000, 2001). The Q20 length is the number of Q20 bases in each sequence. Note that it is not a real length but the count of high quality bases, though it has the term “length.” Fig. 2 shows the cumulative Q20 length distribution. The Q20 length for other BAC end sequencing projects is around 400–500 bp (Larkin et al., 2003; Zhao et al., 2000, 2001). The “Q20 lengths” of the five species were very long (ranged from 588 bp to 614 bp). A BES with Q20 bases equal or longer than 100 bp was regarded as successful. The rate of the number of the successful sequences to the total number of BESs was called the “success rate.” The success rate is also a common indicator for quality assessment of sequencing (Zhao et al., 2001). The success rate for each species is shown in Table 1 (Success rate). They are as high as 88% to 93%, while the typical rates of other projects are 79% with the condition of high quality length > 200 bp (Larkin et al., 2003), and 81% with the condition of read length > 100 bp (Zhao et al., 2001). The average Q20 lengths ranged from 565 bp to 598 bp (Table 1, Average length). Total numbers of succeeded bases were shown in Table 1 (Total (bp)). A BES with Q20 length longer than 100 bp was regarded as a ‘successful’ sequence. The number of such BESs ranged from 14,341 to 20,355 (Table 1, Q20-HQ BESs). For most of the BAC libraries, both BESs are ‘succeessful’, although some BAC libraries have only one-side ‘succeessful’ BES (Table 1, BESs one-end, BESs mate-pairs). For D. auraria as an example, 977 BAC clones ‘succeeded’ for one-side BESs, and 7,304 BAC clones ‘succeessful’ for both BESs.
![]() View Details | Table 1 Statistics of the BAC end sequences |
![]() View Details | Fig. 2 Q20 Length Distribution of Each Species. The X-axis represents the Q20 length of each sequence. The Y-axis represents the accumulated fractions of BAC end sequences of the species. The Q20 length of about half of all the BESs is about 600 bp. For the species DAN1, about the half of the sequences are relatively low quality (a little upper than the others). Only 10% of the sequences were low quality, when we set the threshold of Q20 length ≤ 100 bp (X = 100). |
After filtering vector sequences and bacterial sequences, a portion of the BESs were extracted in such a way that each sequence consists of continuous (at least 50 bp) high QV bases (QV ≥ 20). The condition was fulfilled by the multi-block setting method (see Materials and Methods). As the high quality sequences, 77,165 BESs were extracted for all the species (Table 1, Q20-passed BESs). From the extracted high quality sequences, repetitive sequences were identified by the RepeatMasker program (version 2004/03/06 with RepBase 8.12) with options “-xsmall –species drosophila.” Detailed results are available at the WWW sites (http://fly.gsc.riken.jp/). The fraction of repetitive sequences ranged from 10% to 16% (Table 1, ‘Masked rate, b’). In the count, a BES that does not contain non-repetitive regions longer than 100 bp was designated repeat, because the BES was dominated by repeat sequences.
In addition, we performed intra-species self-comparison of the BESs with the NCBI-BLAST (sequence similarity search) program to identify putative repetitive sequences yet unknown and to remove them from the BES. A homologous region found by BLAST is called a hit here; this is called a high-scoring segment pair or HSP in BLAST theory. If a region in a BES has multiple hits with many different BESs in the self-comparison, they are very likely to be novel repetitive sequences or duplicated regions. However, since it is not easy to distinguish the former from the latter, we set a criterion to identify novel repetitive sequences, which are designated multi-hits (see Methods). We masked the multi-hits in the BESs data and selected BESs which contained continuous non-masked regions longer than 100 bp. Eventually, BESs ranging from 10,092 to 15,354 were used for further analysis for each species (Table 1, Passed BESs by MM) and designated unique-sequences. The BESs not selected at this stage are designated self-matches.
The unique-sequences were compared with the D. melanogaster release 3 genome sequences to identify which locations the BESs correspond to. BLASTN was performed with both the default parameter set and an arranged parameter set (see Methods). The BLAST with the arranged parameter set output 14 and 17% more significant matches than that with the ‘default’ parameter set for BESs of DAU1 and DAN1, respectively, which are the two distant species from the mapped species D. melanogaster. For two closely related species and D. melanogaster itself, there are no significant differences between the two parameter sets in the numbers of significant matches. As the result, the number of significant hits (at least 50 bp, e-value1e-5) ranged from 5,546 to 14,177 for each species (total 53,701, Table 1, BESs with significant hit). If the two BESs hits of a BAC clone are inward in direction and the distance Lin of them on the public D. melanogaster genome is reasonable (Lin ≤ 300 kbp), the BESs were called pair-end mapped (PEM). A cluster of PEMs suggests that the region is a part of a synteny block or a long homologous region. A snapshot of the pair-end mapped BAC libraries is shown in Fig. 3. The statistics of the mapped BES are shown in Table 2 in accordance with chromosome arms of mapped DME1, and the corresponding Muller’s elements. All the statistics concerning base pairs were calculated on the mapped D. melanogaster genome. Most of the region of the D. melanogaster genome was covered by the mapped BAC clones of the three species DME1 (94%), DSM1 (97%), and DSE1 (96%). On the other hand, the two phylogenetically distant species, DAU1 and DAN1, showed relatively low coverage (57% for DAU1, and 47% for DAN1); in particular, no PEMs were found on chromosome 4. The low coverage rates by DAU1 and DAN1 are reflected in the high sequence divergence between these two species and D. melanogaster rather than the quality of the BAC libraries. Here we assume that there are no strong biases to affect the statistics in construction of BAC libraries, such as unbalanced distribution of a restriction enzyme site used in the construction of BAC libraries. The coverage of chromosome 4 of DME1 is less (69%) than other autosomes (> 90%). This result could reflect that the chromosome 4 of D. melanogaster contains higher proportions of repetitive sequences than the other chromosomes. The chromosomes of D. melanogaster contain approximately seven times as much dispersed repetitive and nomadic DNA as those of the sibling species D. simulans (Dowsett and Young, 1982). The portion of the repetitive sequences detected by the RepeatMasker program was 27.3% for chromosome 4, while 5.6–9.4% for the others. The ratio of transposable elements in chromosome 4 is 4.2 times larger than that of the other chromosomes (Gonzalez et al., 2002). To take account of the repeat contents, we calculated the rate of total number of bases covered by the BAC inserts mapped in D. melanogaster genome against the number of bases of non-repetitive sequences in D. melanogaster genome for each chromosome arm. The rate (net coverage) should be around 1.0 or higher when the number of BACs is large enough to cover all the unique sequences. The rate for the chromosome 4 was 0.96 and the rates for the others ranged from 0.99 to 1.04. Thus the net coverage of chromosome 4 was at the same level as the others when we take account of the abundance of repetitive sequences. The total coverage of DSM1 and DSE1 were higher than that of DME1, because the clone numbers of BAC libraries of DSM1 and DSE1 were larger than the DME1 library. The rate of ‘unmapped BAC clones’ to ‘the all BAC clones with unique-sequences’ were 1.3% (DME1), 2.2% (DSM1) and 2.0% (DSE1), as shown in Table 3. More portions of BAC clones were mapped for DME1 than those of DSM1 and DSE1. When a paired BESs for a BAC clone significantly hits but the BESs are not PEM, the BAC clone was counted as ‘unmapped’.
![]() View Details | Fig. 3 Mapped BESs on the Public D. melanogaster Genome Sequences. The first line represents the positions on the genome. The short vertical lines indicate the position of BESs. The break lines connecting the two vertical lines (paired BESs) indicate the inserts of the individual BAC clones. Mapped BAC clones meet the following conditions, 1) both BESs of a BAC clone are inward direction, 2) the distance of the BESs are less than 300 kb. This is the case of BAC libraries of DAU1. The Fig. was produced using the UCSC genome browser. |
![]() View Details | Table 2 Statistics of Mapped BESs, and estimated coverages |
![]() View Details | Table 3 Mapped and Unmapped BAC libraries |
Fig. 4 shows the numbers of BESs categorized in the course of the analysis pipeline. The numbers of the self-match BESs (removed BESs at the step of self-comparison) of DAN1 are relatively larger than the others. This might be related to the fact that the number of the detected repetitive sequences in the BESs of DAN1 in Table 2 is smaller than for the other species. This implies that there are many unknown repetitive sequences in DAN1 that are not detected by the current RepeatMasker program.
![]() View Details | Fig. 4 Cumulative Numbers of BESs in Each Category for Each Species. The Y-axis represents the number of BESs. X-axis represents species of the BESs analyzed. |
We performed in situ hybridization using three BAC clones to identify what locations the BAC inserts came from in the species D. ananassae. The signals of the three BAC clones (DNB1-008L04, DNB1-016M14, and DNB1-001A14) were located at the cytogenetic regions of 33C, 52B, and 49D on salivary gland chromosomes, respectively, in the D. ananassae genome. On the other hand, these BAC clones were mapped onto the D. melanogaster genome in silico to the cytogenetic regions of 86D1–D3, 77D1–77E2, and 67F3–68A1. These results indicate that the homology of the chromosomal elements (Muller, 1940) between D. melanogaster and those of D. ananassae have been confirmed by in situ hybridization with the three BAC clones.
We calculated the ratio of the number of BESs that were mapped on exons to the total number of BESs that were mapped on the D. melanogaster genome. We used the annotation described in the UCSC genome browser (Kent et al., 2002) in which mRNA sequences in RefSeq database (Pruitt and Maglott, 2001) were mapped. The rates were between 12% and 16% for all the five species. This indicates that most of the mapped regions were non-exons. Thus we could perform comparative mapping successfully even for non-coding region for relatively distant species such as D. melanogaster and D. ananassae, which diverged approximately 44 million years ago (Tamura et al., 2004).
We examined the significant BLASTN hits (HSP) of PEM-BESs to explore the evolution of the species’ genome sequences. The features of the sequence alignment (percent identity, length, and negative logarithm of E-value) of BLAST hits give us different aspects of the alignments, although they are not independent of each other. The averages of the features are summarized in Table 4. The length of BLAST hits can be an indicator of homology, given that the average lengths of the BESs were almost the same among the species. The phylogenetic distances between D. sechellia and D. melanogaster, and between D. simulans and D. melanogaster, were not resolved well here. However, the BLAST analysis gives some clues. Even the lengths of BESs of DSM1 were slightly shorter than those of DSE1 on average, all the three features suggest that the mapped genomic sequences of D. melanogaster are more homologous to those of D. simulans than those of D. sechellia (Table 4). Statistically, Wilcoxon rank sum test showed p-values of 1.6 × 10–29, 4.2 × 10–2, and 1.9 × 10–4, for percent identity, length, and e-value, respectively. This suggests that D. sechellia has undergone more divergence than D. simulans after the speciation of the ancestors. For a 57 kb regions (56F10–16), it was suggested that the noncoding sequences have been shortened by deletion in the genomes of D. sechellia and D. simulans (Kawahara et al., 2004). The number of the deletions were observed to be 7.5 times the number of insertions in D. sechellia (Kawahara et al., 2004). Therefore, the main reason of the divergence might be the deletions in the intergenic regions in D. sechellia.
![]() View Details | Table 4 Statistics of BLAST hits |
We have estimated the insert size of the BAC libraries as follows. Let Lin be the distance of two regions on the D. melanogaster genome that are the most homologous regions of the two BESs of a BAC clone. Here we consider only PEM BAC clones. Although the actual insert size of the BAC libraries could be different from Lin to some extent, the Lin can be regarded as an estimate of the insert size of the BAC libraries. Most of the insert size ranged from 100 to 150 kb, which are typical lengths for the BAC inserts. For some portion of the BAC libraries, the insert size is below 50 kb. We measured the insert size by electrophoresis for some species. Since the insert sizes of 20% of the clones were actually below 50 kb, the BAC clone distribution obtained from in silico comparative BES mapping can be considered to be appropriate.
As shown in Fig. 4, the BESs categorized as ‘self-match’, which are intra-species sequence matches in self-comparison of BESs, in D. ananassae (DAN1) are relatively large. This may be related to the fact that chromosome 4 of DAN1 are almost exclusively composed of heterochromatin, and the heterochromatin part is much larger than those of other species. Besides, from ‘Masked rate’, Table 1, slightly small portion of the BESs of D. ananassae (DAN1, 10.7%) was identified as repetitive sequences by the RepeatMasker program. Therefore the BESs categorized as ‘self-match’ in D. ananassae might contain lots of novel repetitive sequences, which do not exist in the database of RepeatMasker. To explore the possibility, we extracted the sequences categorized as ‘self-match’, and clustered them by sequence similarity, and constructed a consensus sequence for each cluster. The consensus sequences were compared with the non-redundant nucleotide database, called ‘nt’ provided by NCBI (ftp://ftp.ncbi.nlm.nih.gov/blast/db/). Most of the consensus sequences matched best with un-annotated regions in D. melanogaster genome. They are likely to be novel repetitive regions. Parts of consensus sequences matched best with genes of different species of Drosophila with percent identity between 80% and 90%. This suggests that these BESs are parts of genes homologous to the matched genes. Most of the homologues may comprise a family with other paralogues in Drosophila, which are often found in multiple regions in the genome. Another possibility is that they are parts of pseudogenes. In fact, the two clusters of DAU1 matched with the same transposon mariner transposase pseudogene of Bactrocera neohumeralis (AF348438). Both matches are longer than 129 bp and similarity (percent identity) was higher than 86%.
By using the comparative BES mapping, we predicted breakpoints as boundaries of synteny blocks that were generated between one species and the D. melanogaster genome. When a pair of BESs of a BAC clone matched significantly to D. melanogaster genome, the BESs were categorized as either valid or invalid (See Materials and Methods, and Fig. 5). The invalid BESs imply that the BESs are involved in any chromosomal rearrangements. Using such BESs, we predicted the boundaries of genome alignments or syntenic regions, where some chromosomal rearrangements presumably have occurred. We predicted them with only BES-genome map, not using two genome sequences.
![]() View Details | Fig. 5 Chromosomal Rearrangements and the Dot plot for Confirmation of Breakpoints. (a). The upper three tandem thick arrows (open, shaded, and solid) indicate the genomic sequences of species S (Genome S), from which BAC clones were constructed. Below them, the three thick arrows represent the reference genome sequences of species T (Genome T = D. melanogaster), which is used to map BESs. The shaded thick arrows indicate the syntenic regions at which an inversion occurred in either species. A pair of small arrows and a line indicate BAC end sequences and an insert, respectively. BESs with straight lines indicate valid BESs, while BESs with dotted lines indicate invalid BESs. When there is an inversion between S and T, there exist two breakpoints (vertical broken lines). Given the form of the BES-genome map, the existence and location of the inversion can be inferred. The two regions in X-axis (BP region.1 and region.2) were magnified in (b) and (c). The same contigs were mapped in the two regions, indicating that the two breakpoints were confirmed by the contigs. (b). The ''dot plot'' using the contigs of D. simulans (DSM1) and the genome sequences of D. melanogaster (DME1) corresponding to the predicted breakpoint region ''BP region 1'' (chromosome arm 3R, 3,856,802–3,896,705 in base positions; approximately 40 kb in length including 10 kb margins on both upstream and downstream). The Y-axes represent the base positions of alignments of the several contigs. The plot points indicate the endpoints of the BLAST alignments between contigs and the target genome sequences. The white arrow shows the breakpoint at around 18,000 bp in X-axis. (c). The ''dot plot'' using the contigs of D. simulans and the genome sequences of D. melanogaster corresponding to the predicted breakpoint region ''BP region 2'' (chromosome arm 3R, 17,546,012–17,585,421 in base positions; approximately 40 kb in length including 10 kb margins only on upstream). The white arrow shows the breakpoint at around 15,000 bp in X-axis. |
As the result, 111 estimated breakpoints (3, 2, 47, and 59 for DSM1, DSE1, DAU1, and DAN1, respectively) were found in the species examined (Appendix A, Table A). Between D. melanogaster and D. simulans, two major inversions are long known, one in chromosome 3R (Horton, 1939), and the other in chromosome 4 (Horton, 1939; Podemski et al., 2001). The two breakpoints found in this study are of the inversion in chromosome arm 3R. We did not detect any breakpoints in chromosome 4, because of few invalid BES hits in this chromosome. Genomic contigs of D. simulans sequence were taken from the WWW site of D. simulans genome sequencing project (http://www.genome.wustl.edu/genome_group.cgi?GROUP=6), and compared to D. melanogaster genome. There were long contigs of D. simulans covering the region in question. The dot plots of contigs from D. simulans against two sequences from D. melanogaster are shown in Fig. 5. This Fig. indicates the presence of an inversion and distantly located two breakpoints that correspond well to the cytological breakpoints, 84E and 93E/F.
![]() View Details | Table A Predicted Breakpoints |
We thank Ehime University for providing Drosophila samples to construct BAC libraries. We also thank C. Kawagoe, T. Katayama, X. Son, and all technical staff of the Human Genome Research Group and the Genome Core Technology Facility, RIKEN Genomic Sciences Center for their excellent work and support. This work was supported in-part by a grant of the National Bio Resource Project from the Ministry of Education, Science, Sports, and Culture of Japan and a Special Fund for the RIKEN Genomic Sciences Center.
|