2014 Volume 3 Issue 1 Pages A0030
A new peak detection method has been developed for rapid selection of peptide and fragment ion peaks for protein identification using tandem mass spectrometry. The algorithm classifies the peak intensities present in a defined mass range to determine the noise level. A threshold derived from the noise level determined in each mass range is then used to select ion peaks. The algorithm was initially designed for peak detection in low-resolution peptide mass spectra, such as matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectra, but it can also be applied to other types of mass spectra. The method achieves a good ratio of real ion peaks to noise peaks even for poorly fragmented peptide spectra. Peak lists generated by this method yield improved protein scores in database search results, and the reliability of protein identifications is increased because more peptides are identified. The software tool is freely available at the Mass++ home page (http://www.first-ms3d.jp/english/achievement/software/).
The first step of protein identification by mass spectrometry after data acquisition is to extract ion peaks from the mass spectra. In protein analysis methods,1–3) database search tools evaluate the match between the ions observed in the experiment and the theoretical ions calculated from sequences in a protein sequence database. In de novo sequencing software, ion features accumulated from the spectra are used to build ion series candidates.4,5) Various scoring schemes6–10) are usually used to rank or select the candidates in these database search tools and de novo sequencing software. This technology has also been used to detect protein biomarkers11,12) in cancer diagnostics. Reliable scores from these protein identification methods require high-quality peak lists.
Steady progress in peak detection for mass spectrometry has led to various types of ion peak identification software. However, as new techniques are applied in mass spectrometers, new peak detection methods are required to cope with the signal and noise features of the resulting spectra. When a mass spectrometer measures a sample at the protein/peptide level, it produces a spectrum whose signal peaks represent intact protein or peptide ions. Furthermore, in the identification of proteins using tandem mass spectrometry, there is a greater need to select accurate peaks of the fragment ions from the peptide.
Combining liquid chromatography (LC) with a MALDI (matrix-assisted laser desorption/ionization) MS system provides a high-throughput platform for proteome analysis by MALDI mass spectrometry. LC-MALDI-TOF (TOF: time-of-flight ion detection) is a commonly used instrument type that measures ions from digested protein samples. Peak lists generated from LC-MALDI spectral data should make it possible to achieve high sequence coverage for the detected proteins.
A high-quality peak detection tool maximizes the number of ion signal peaks while keeping noise peaks to a minimum, since excluding all noise is very difficult. Various peak detection methods13–19) have been proposed and implemented in computer software. The approaches considered in these algorithms include filtering out noise peaks by signal-to-noise ratio, fitting models built from specific shape functions to peak shapes or isotopic clusters to find the corresponding ion peaks, using a model based on estimated parameters20) to distinguish peptide ions from background peaks, and applying a continuous wavelet transform (CWT) to localize a signal. An evaluation21) of publicly available peak detection software showed that the tools based on CWT18) performed best on the selected MS spectral data. The CWT method detects peaks by finding ridges in the wavelet transform space, but besides requiring a suitable mother wavelet function, it is also parameter dependent. When characteristics of a mass spectrum, such as peak shape and width, vary across the spectral range, it becomes more difficult to optimize several parameters so that all possible signal peaks are identified. As the scale factor increases, longer computing time may be required to refine the peak parameters. The method that uses Bayesian estimation20) of the parameters for modeling spectral peaks was reported to outperform the wavelet method in another test of identifying peptide ions in MS spectra, but its parametric model did not include the isotopic pattern of ions. Because the isotopic pattern is usually a key factor in identifying peptide fragment ions, an extra step is needed to find the mono-isotopic fragment ions after the initial detection.
The aim of this work is to find a suitable peak detection method and to build an appropriate computer program around it. It was initially intended to address the low resolution of MALDI-TOF-TOF mass spectra. Since peaks in these spectra can be distorted from their symmetrical shapes and the isotopic peaks are often poorly resolved, peak detection methods based on model fitting are difficult to apply. Therefore, a method using intensity classification is proposed, in which the shape of the peak is less significant. The noise level is determined for each selected range of mass (m) over charge (z), m/z, and determining an accurate noise level in the data allows ion peaks to be selected more robustly. We implemented a new algorithm based on this classification in a computer program, MWD (Multi Window Detection), in which peak detection relies on the intensity distribution in each selected m/z range.
As observed in the spectra, the noise level may vary between mass ranges of a peptide spectrum. To find the noise level more accurately, the mass spectrum from peptide dissociation is divided into a number of ranges according to the precursor ion mass. An interval ΔM (in the range 100–300 Da) is selected and used to step through the spectrum from the low mass end to the high mass end, where the high mass end is close to the precursor ion mass. In this paper, a ΔM value of 120 Da is chosen because it is close to the average mass of all amino acid residues. The noise level in each range is determined from the data points in that range. Unlike other peak detection programs, which simply filter peaks with some form of signal-to-noise ratio threshold, MWD uses the noise level detected in each mass range to determine how peaks are selected.
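The windowing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MWD implementation (which is written in C#.Net): the function names `mass_windows` and `assign_to_windows` are hypothetical, and the handling of the last window (shortened so that it ends at the precursor ion mass) is an assumption, since the text does not specify it.

```python
from typing import List, Tuple

Peak = Tuple[float, float]      # (m/z, intensity)
Window = Tuple[float, float]    # (lower m/z, upper m/z)


def mass_windows(low_mz: float, precursor_mz: float,
                 delta_m: float = 120.0) -> List[Window]:
    """Split [low_mz, precursor_mz] into consecutive windows of width delta_m.

    Assumption: the final window is simply cut off at the precursor mass.
    """
    if delta_m <= 0:
        raise ValueError("delta_m must be positive")
    windows = []
    start = low_mz
    while start < precursor_mz:
        end = min(start + delta_m, precursor_mz)
        windows.append((start, end))
        start = end
    return windows


def assign_to_windows(peaks: List[Peak],
                      windows: List[Window]) -> List[List[Peak]]:
    """Group peaks by window; peaks outside every window are dropped."""
    groups: List[List[Peak]] = [[] for _ in windows]
    for mz, intensity in peaks:
        for k, (lo, hi) in enumerate(windows):
            is_last = k == len(windows) - 1
            # Half-open windows, except that the last one includes its upper edge.
            if lo <= mz < hi or (is_last and mz == hi):
                groups[k].append((mz, intensity))
                break
    return groups
```

Each group returned by `assign_to_windows` would then be processed on its own, so that the noise level is estimated separately for every mass range.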
Figure 1(a) shows a screenshot of a spectrum in a selected mass range, and Fig. 1(b) shows the intensity classification of the data points in that range. Determining the noise level of a selected mass range requires classifying the data points into three discrete groups, on the presumption that three classes of peak intensities, A, B, and C, can be found, as shown in Fig. 1(b).
Strong signal peaks are classified in A, but these are usually few in number. C contains the majority of the data points, which are mainly noise. The points in B may include some signal peaks with lower intensities than those in A; these peaks lie at the noise level boundary and are the most difficult to decide on in peak selection. Some of the peaks from A and B are marked in the spectrum of Fig. 1(a). Together with Fig. 1(b), this example depicts how peak intensities are classified in the selected mass range.
In most cases, there is no simple method to distinguish the data points in B from those in the other two classes. Two steps, Determination of Noise Level and Refinement of Noise Level, were therefore designed into the MWD process. The ion peaks in A are explicitly extracted in the first step for a given mass range. The second step may identify more ion peaks in B. Details are described in the following sections.
Determination of noise level
If cluster C is regarded as the main class in this method, the points in A can be considered “outliers.” Therefore, an outlier detection method such as the Z-score22) can be applied to determine whether any points in A can be distinguished. The points found in A are then temporarily removed, and the noise level is determined from the points in C and perhaps some from B, because B may contain low-intensity signal peaks mixed with noise peaks.
To implement this idea in a computer program, all the data points in a selected mass range are first sorted in ascending order. A Z-score, Zi, is then calculated for each point with intensity value Ii in the range. The calculation is based on the formula:
![]() | (1) |
![]() | (2) |
![]() | (3) |
The Zi value of each point reflects how far the measured value Ii is from the mean/median value. A larger Z corresponds to an intensity that stands out more from the others in the class. The criterion for the Z value was determined by learning from a reference data group in which the ion peaks in the spectra had been well identified. A value of 3.0 is set as the initial criterion for the Z value to decide whether a point is classified as class A. The mean intensity, Imean, of all data points left in the region after removing the points in A is then calculated. Another variable, Ri, is defined as:
![]() | (4) |
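The separation of class A by Z-score can be illustrated with a short Python sketch. It rests on an assumption: because the exact form of Eqs. (1)–(3) is not reproduced here, the classic Z-score (deviation from the mean divided by the population standard deviation) is used as a stand-in, and the function name `split_class_a` is hypothetical. The cutoff of 3.0 is the initial criterion given in the text.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def split_class_a(intensities: Sequence[float],
                  z_cutoff: float = 3.0) -> Tuple[List[float], List[float]]:
    """Return (class_a, rest) for the peak intensities of one mass window.

    Assumption: Z_i = (I_i - mean) / population standard deviation.
    Points with Z_i > z_cutoff are treated as strong signal peaks (class A).
    """
    mu = mean(intensities)
    sigma = pstdev(intensities)
    if sigma == 0:
        # All intensities are identical, so nothing stands out.
        return [], list(intensities)
    class_a: List[float] = []
    rest: List[float] = []
    for value in sorted(intensities):  # ascending order, as in the text
        z = (value - mu) / sigma
        (class_a if z > z_cutoff else rest).append(value)
    return class_a, rest


# Twenty low-intensity points plus one strong peak.
noise = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10] * 2
strong, remaining = split_class_a(noise + [200])
i_mean = mean(remaining)  # mean intensity after class A is removed
```

With this definition, a single value among n points can reach a Z-score of at most sqrt(n - 1), so a window needs at least 11 points before any point can exceed the cutoff of 3.0.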
All properties related to the defined parameters rest on the assumption that noise points are symmetrically distributed around the mean value, i.e., follow a Gaussian distribution. This symmetric distribution is easier to find when noise statistics are taken from spectral raw data. However, this method uses peak intensities instead of raw data points. The distribution of peak intensities may no longer be normal because the majority of data points in the low intensity region have been cut off. The distributions of intensity values from seven selected mass ranges of a spectrum are displayed as histograms in Fig. 2(a). The data points are still distributed around the mean, although not perfectly symmetrically. When some ion peaks are comparable to the noise level, they form a tail on the right side, as in ranges 1, 6, and 7. The procedure aims to identify these ion peaks. Among the data points in each selected mass range, the peak with the lowest intensity is the one most certain to be a noise peak. An initial noise level is therefore derived from this peak, as follows.
First, a value di is calculated as:
![]() | (5) |
Peak detection in this new computational approach relies on the noise level found in each defined mass range. Since the noise level varies from window to window and the detection relies on the distribution of peak intensities within each window, ion peaks with low intensities can be selected. With the noise level determined, a number of peaks are identified as ion peaks; these are selected by a threshold derived from the noise level. Figure 3 shows the selected ion (♦) and noise (✳) peaks from a spectrum in seven mass ranges. The selected peaks always lie at the top when the d value is plotted. If no peaks are found above the threshold, as in the fourth mass range, no peaks are selected.
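The effect of a per-window threshold can be shown with a small Python sketch. It is illustrative only: the text does not state how the threshold is derived from the noise level, so the multiplicative `factor` and the function name `select_ion_peaks` are hypothetical, and the noise levels are supplied by hand rather than computed.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def select_ion_peaks(window_peaks: List[Peak], noise_level: float,
                     factor: float = 2.0) -> List[Peak]:
    """Keep peaks whose intensity exceeds factor * noise_level.

    Assumption: the threshold is a simple multiple of the window's noise level.
    """
    threshold = factor * noise_level
    return [peak for peak in window_peaks if peak[1] > threshold]


# Two windows of the same spectrum with different (hand-picked) noise levels.
quiet_window = [(250.1, 30.0), (260.4, 8.0)]
noisy_window = [(850.3, 30.0), (870.9, 35.0)]

picked_quiet = select_ion_peaks(quiet_window, noise_level=10.0)
picked_noisy = select_ion_peaks(noisy_window, noise_level=20.0)
```

A peak of intensity 30 is kept in the quiet window but rejected in the noisy one, and a window with nothing above its threshold contributes no peaks, which matches the behavior described for the fourth mass range.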
In calculating the Z-score values, the peak intensities in the selected mass range are all converted to a scale that expresses how far each peak deviates from the mean or median value. A further advantage is that it quickly shows how far the peaks at the extremes differ from the rest. Besides identifying the significantly high peaks in the range, this value can also be used to find peaks with very low intensities. The minimum Z-score in each mass range, Zmin, is used to examine how far the lowest peak in the range deviates from the others, i.e., whether it has a large magnitude in the negative direction. If a number of such points with very low intensity are included in the calculation, the resulting noise level may be lower than the real one. A scheme that combines a comparison of the median position with the Ri value, as mentioned in the last section, and the Zmin value in the range is used to further adjust and optimize the noise level obtained from the preceding calculation.
In adjusting the noise level, a criterion on Zmin can be applied. Conversely, if more data points with relatively high intensities are included in the noise calculation, the derived noise level may be too high. In this case, the derived level is reduced so that it represents the real noise level. As a consequence of this optimization and adjustment, a few peaks may be added to or removed from the final peak list.
These parameters and variables are involved in the calculation of the noise level, but they are determined from the features detected in a spectrum and do not require manual intervention. Therefore, unlike other peak detection software, only a few parameters need to be selected by the user in the provided interface. These are mainly the type of instrument used to acquire the spectral data, the optional methods provided with the program for pre-processing the spectral raw data, and a parameter that adds a few peaks to the final peak list if the list produced by the program's default selection is too short.
Program implementation
Spectral raw data are used as input to the method to ensure that the number of points in the defined mass range is sufficient for calculating the Z-score function. Several pre-processing steps are therefore required to get the best result from the method. These may include smoothing and peak centroiding24,25) and, if necessary, baseline subtraction. Mono-isotopic ion peaks are selected by a procedure26) in the program if the isotopic clusters are resolvable in the spectra. MWD was developed in C#.Net. It can be run as a standalone application, and all of its functions are also packaged as components that can be integrated into other analysis platforms. The flow chart in Fig. 4 shows the basic workflow of the MWD program. The proposed method consists mainly of four steps: Divide mass range into n windows, Determine noise level, Refine noise level, and Select ion peaks. These steps are repeated until all divided mass ranges have been processed. An extra step, Precursor ion correction, is used to obtain an accurate precursor ion mass from the MS spectra. At present, MWD is implemented in the freely available software Mass++,27) which can be downloaded from the web site: http://www.first-ms3d.jp/english/achievement/software.
To evaluate the performance of the proposed method in peak detection, the test starts by examining the true/false positive (TP/FP) rates found in the peak lists. The test spectra are all MS/MS spectra (LC-MALDI-TOF-TOF, AXIMA Performance, Shimadzu/Kratos) of known peptide sequences. The raw spectral data first went through a pre-processing procedure, and all peaks above a low intensity threshold were recorded in a peak list. All possible theoretical ions from the sequence were also calculated. The peak list was then matched against the theoretical ions within the mass range that the instrument is able to measure. The matched ion peaks were recorded as true peaks in the spectrum and the remaining peaks in the list as false peaks.
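The labeling of detected peaks as true or false can be sketched as a tolerance match against the theoretical ion masses. This is a minimal illustration, not the evaluation code used in the study: the function name `label_peaks` is hypothetical, and the matching tolerance of 0.5 Da is an assumed value because the text does not give the one used for this comparison.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple


def label_peaks(peak_mzs: Sequence[float], theoretical_mzs: Sequence[float],
                tolerance: float = 0.5) -> Tuple[List[float], List[float]]:
    """Split detected peaks into (true_peaks, false_peaks).

    A peak counts as true if some theoretical ion lies within +/- tolerance.
    Assumption: tolerance is an absolute m/z window in Da.
    """
    theo = sorted(theoretical_mzs)
    true_peaks: List[float] = []
    false_peaks: List[float] = []
    for mz in peak_mzs:
        pos = bisect_left(theo, mz)
        # Only the nearest theoretical ions on either side need checking.
        neighbours = theo[max(pos - 1, 0):pos + 1]
        if any(abs(mz - t) <= tolerance for t in neighbours):
            true_peaks.append(mz)
        else:
            false_peaks.append(mz)
    return true_peaks, false_peaks


# Made-up masses, for illustration only.
tp, fp = label_peaks([100.3, 150.0, 299.6, 300.6], [100.0, 200.0, 300.0])
```

The counts `len(tp)` and `len(fp)` are the quantities that enter the TP/FP statistics discussed in the text.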
The sample used for the tests was bovine serum albumin (BSA) digested with trypsin. Different experimental conditions, including sample concentration, acquisition time scale, and laser strength, were applied during collection of the spectra from the instrument. A sequence number identifies the separate data sets acquired under the different conditions, e.g., BSA1 to BSA6.
First, the precision of every peak list from the test spectra was examined. It is calculated as:
$$\mathrm{Precision} = \frac{N_{\mathrm{TP}}}{N_{\mathrm{TP}} + N_{\mathrm{FP}}} \tag{6}$$
Precision is the fraction of peaks in the peak list that are correct ion peaks, where NTP is the number of true ion peaks and NFP is the number of false ion peaks. A lower precision means that more false positive peaks are included in the detection. If all detected peaks in the peak list are correct ion peaks, the precision is 1.0. The other score commonly used to represent the rate of correct results is sensitivity, which is defined as:
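$$\mathrm{Sensitivity} = \frac{N_{\mathrm{TP}}}{N_{\mathrm{TP}} + N_{\mathrm{FN}}} \tag{7}$$
where NFN is the number of true ion peaks that are missing from the peak list (false negatives).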
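The two scores can be computed directly from the peak counts. The sketch below uses the standard definitions, precision = TP / (TP + FP) and sensitivity = TP / (TP + FN); that the study's sensitivity takes exactly this standard form is an assumption, and the function names and the zero-denominator convention are choices made here for illustration.

```python
def precision(n_tp: int, n_fp: int) -> float:
    """Fraction of peaks in the peak list that are correct ion peaks."""
    total = n_tp + n_fp
    return n_tp / total if total else 0.0  # empty peak list: defined as 0.0 here


def sensitivity(n_tp: int, n_fn: int) -> float:
    """Fraction of the real ion peaks in the spectrum that were detected.

    n_fn is the number of real ion peaks missing from the peak list.
    """
    total = n_tp + n_fn
    return n_tp / total if total else 0.0


# Made-up counts: 3 true peaks, 1 false peak, 3 real ions missed.
p = precision(3, 1)
s = sensitivity(3, 3)
```

A short list of only strong peaks can push precision toward 1.0 while sensitivity stays low, which is why both scores are reported.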
In parallel, peak lists for the same spectra were also generated with the Distiller peak detection tool from Matrix Science (http://www.matrixscience.com), which in numerous internal tests usually produced better peak lists for MALDI-TOF-TOF spectra than other programs. Several parameters in Distiller can be selected to control the length of the peak lists. Different parameter settings may also affect the quality of the peak list and result in different Mascot database search scores. However, all peak lists were generated by Distiller with a default set of parameters optimized by Matrix Science, because it is very difficult to find the best set of parameters for the many spectra from this instrument. The peak lists generated by Distiller are used as the benchmark and compared with all the peak lists derived from MWD.
Figure 5 shows the precision (a), the sensitivity (b), and a descriptive statistical analysis of the peak lists obtained by Distiller and MWD, respectively, from a set of spectral data for BSA. All the raw MS and MS/MS spectra were input into the peak detection programs to produce a peak list for each peptide spectrum. MS spectra were used to correct the corresponding precursor ions in this program. The median precision for BSA1–BSA6 ranges from 62 to 85% for the MWD peak lists and from 61 to 91% for the Distiller peak lists [Fig. 5(a)], although the numbers of identified peptides in the two data sets differ. Distiller usually produces shorter peak lists with the given peak processing parameters, but more peptides can be identified using the MWD peak lists. Details of the identification from the database search are discussed in a later part.
Further attention was paid to the MWD peak lists with lower precision. For instance, in BSA4 the minimum precision is ∼28%, which corresponds to a spectrum acquired from the precursor ion at 2045 Da. Inspection of the spectrum showed that the fragmentation was poor. Most ion peaks, except a few in the low mass range, have low intensities and sit at the noise level boundary. Nevertheless, the peak list from this spectrum still retrieved a correct peptide hit, RHPYFYAPELLYYANK, from protein BSA with the Mascot search engine. The Distiller peak list for this peptide did not yield the correct result, although its precision was higher (∼31%).
These plots only show how accurately real ion peaks are selected by the proposed method. If only strong ion peaks are recorded in a peak list, the precision of the peak detection can easily approach 1.0 because all the peaks may be correct ions, as in the peak lists derived from the Distiller method in this test. The other statistical measure, sensitivity, which reflects the probability of a positive test given that the peak is an ion in the spectrum, may be better suited to comparing the results of these different peak detection methods. Figure 5(b) displays the sensitivity obtained for the peak lists from the two peak detection tools. To make the results comparable, some peptide spectra were removed from the MWD peak lists so that only the peptide spectra that could also be identified from the Distiller peak lists remained. On average, higher values are obtained from the MWD peak lists. This further confirms that more fragment ions can be identified and picked by MWD peak detection.
Figures 6(a) and (b) depict scatter plots of TP against FP rates for two selected spectral data sets in which a larger number of peptides could be identified. They are presented here to demonstrate the distribution of correct detections by the two detection tools. Plot (a) shows that the Distiller peak lists had a low FP rate, but the TP rate was also low. In plot (b), although Distiller provided some good TP rates, low TP rates also occurred. In contrast, most peak lists from MWD have TP rates in the upper range with an FP rate of <0.2. The conclusion from the comparison over all data is that, in general, although the Distiller peak lists give a lower false positive rate, the true positive rate is usually also low. This indicates that in most cases Distiller selected only the ion peaks with a significant height. The peak lists from MWD reach a higher true positive rate with a reasonably low false positive rate. The two examples in Fig. 6 are typical of the relationship between TP and FP rates in the peak detection results of these two tools, and they show the general trend that MWD derives more consistent results in peak detection. Other tests were applied to further evaluate the performance of the peak lists in protein identification.
The performance of the method in selecting ion peaks was also investigated for spectra of different fragmentation quality. Three peptide spectra with different levels of fragmentation quality were randomly selected for testing. They were categorized as high (named SpecH), where a sufficient number of strong fragment ions was found in the spectrum; low (named SpecL), where only a few fragment ions were produced and the spectrum consisted mostly of noise peaks; and medium (named SpecM), which lay between the two. Details of these peptides and the test results are listed in Table 1.
| Spectrum | Peptide | MWD Score a) | MWD No. ion b) | Distiller Score | Distiller No. ion |
|---|---|---|---|---|---|
| SpecH | YNGVFQECCQAEDK | 75 | 52 | 115 | 39 |
| SpecM | SLHTLFGDELCK | 50 | 34 | 49 | 14 |
| SpecL | DDPHACYSTVFDK | 34 | 24 | 14 | 8 |

a) “Score” is the MS/MS Mascot search score. b) “No. ion” is the number of ions in the peak list that match the expected peptide fragment ions.
The peak lists for each raw spectrum were generated by this peak detection program (MWD) and by Distiller. The peak lists were then used to search the Swiss-Prot protein database with the Mascot search engine. The same search parameters were used for both sets of data: tolerance values (peptide mass tolerance: 0.6 Da, fragment mass tolerance: 1.5 Da), modifications (fixed modification: Carbamidomethyl (C), variable modification: Oxidation (M)), instrument type (MALDI-TOF-TOF), and so on. Thus, the Mascot scores resulting from the peak lists of the two programs can be compared. A large fragment mass tolerance was chosen to ensure that the ions in the poor quality spectrum could be used. As shown in Table 1, for the spectrum of medium quality (SpecM), both peak lists reach a very similar score [50 (MWD), 49 (Distiller)], but the MWD list yields more matched ions (34) than the Distiller list (19). For the other two spectra, compared with the MWD lists, the Distiller list obtained a higher search score for SpecH [115 (Distiller), 75 (MWD)] but a lower score for SpecL [14 (Distiller), 34 (MWD)]. Since Distiller favors high peaks, its peak list easily achieves a higher Mascot search score when the fragmentation quality is good and a number of strong ion peaks can be found in the spectrum. In SpecH, for example, 39 strong fragment ions are identified by Distiller, while 52 fragment ions are found in the MWD peak list. For spectra with poor quality fragmentation, however, this capability is reduced. By comparison, MWD performs well on both high and low quality spectra. In general, MWD peak lists contain a high coverage of real ion peaks, which is another important factor in finding reliable peptide hits when MS/MS data are searched against a database for protein identification.
Comparison of database search results
In protein identification by searching MS/MS spectra against the peptides in a protein database, a key point is how reliable28) the matches derived from the given peak lists are. This includes reliable matches of fragment ions from the individual MS/MS spectra to the proposed peptide ions in the database. A larger number of reliably matched MS/MS spectra increases the peptide coverage of the protein found, and therefore greater confidence in the identification is achieved. Since the test data come from standard protein samples, BSA, lysozyme (LZM), and alcohol dehydrogenase (ADH), two Mascot scores29) are simply used to evaluate the performance of the peak lists from the developed peak detection tool when the Mascot search engine searches the database for the expected proteins. The peak lists from Distiller were run with the same search parameters for comparison. The expected proteins are ranked as the top hit with a significant score, except for the Distiller list in BSA8. Table 2 shows the complete test results for all MS/MS spectra acquired from each protein sample. In most cases, the MWD lists yield higher total protein scores than the Distiller lists. This is because the peak lists generated by MWD validate more MS/MS spectra; that is, a greater number of MS/MS spectra, from the whole range of LC retention times, are usable for identifying peptides than with the Distiller lists. This consequently increases the peptide coverage in the protein identification.
| Experiment | Mascot search score (MWD list) | Mascot search score (Distiller list) |
|---|---|---|
| BSA1 | 955 | 689 |
| BSA2 | 890 | 473 |
| BSA3 | 746 | 338 |
| BSA4 | 440 | 234 |
| BSA5 | 650 | 378 |
| BSA6 | 823 | 438 |
| BSA7 | 362 | 82 |
| BSA8 | 165 | N/A *) |
| BSA9 | 325 | 175 |
| ADH1 | 330 | 209 |
| ADH2 | 157 | 160 |
| LZM1 | 113 | 82 |
| LZM2 | 118 | 83 |
| LZM3 | 141 | 122 |
| LZM4 | 175 | 143 |

*) N/A indicates that no hit was found for the correct protein.
The Mascot search score directly reflects the quality of the match between the experimental MS/MS spectrum and the proposed peptide. The highest ion scores of the distinctly matched peptides are summed to give the protein score. It should be noted that the Mascot score is based on a calculated probability; therefore, the total number of matches between the experimental peaks and the theoretical ions is not a key factor in calculating the ion score. This implies that a peak list containing a small number of peaks with strong intensities may yield a higher ion score. The ion coverage of a peak list has been expressed in the TP/FP rate plots and the precision study given above. The following results show how the peak lists performed in finding the expected protein matches with a general protein identification method.
Table 3 lists the Mascot scores in detail and shows how the MS/MS spectra in sample BSA3 match the peptides in the BSA sequence from the database. In this result, 22 spectra were used to identify 18 peptides from the BSA sequence, giving a protein sequence coverage of 33%, whereas the peak lists from Distiller identified only 10 peptides from 12 spectra, with 16% protein sequence coverage. This particular example demonstrates that protein identification using peak lists from MWD can more readily provide reliable results, and similar results, with higher protein scores in the database search from MWD peak lists, are common in the test data.
| Well # | Precursor ion (Da) | Peptide | Mascot search score (MWD list) | Mascot search score (Distiller list) |
|---|---|---|---|---|
| 17 | 1072.648 | SHCIAEVEK | 24 | X b) |
| 23 | 1443.798 | YICDNQDTISSK | 68 | X |
| 30 | 1674.062 | QEPERNECFLSHK | 19 | X |
| 35 | 1927.895 | CCAADDKEACFAVEGPK | 36 | X |
| 40 | 1554.788 | DDPHACYSTVFDK | 23 | 16 |
| 43 | 1305.760 | HLVDEPQNLIK | 31 | X |
| 43 | 1576.932 | LKPDPNTLCDEFK | 18 | X |
| 44 | 927.463 | YLYEIAR | 50 | 48 |
| 45 | 1640.114 | KVPQVSTPTLVEVSR | 69 | 69 |
| 46 | 1512.013 | VPQVSTPTLVEVSR | 54 | X |
| 49 | 1881.019 | RPCFSALTPDETYVPK | 41 | 19 |
| 50 | 927.436 | YLYEIAR | [32] a) | [31] |
| 51 | 1163.495 | LVNELTEFAK | 59 | 46 |
| 53 | 1439.835 | RHPEYAVSVLLR | 57 | 30 |
| 56 | 1283.899 | HPEYAVSVLLR | 47 | 23 |
| 56 | 1419.878 | SLHTLFGDELCK | 66 | 71 |
| 59 | 927.557 | YLYEIAR | [23] | X |
| 59 | 1479.976 | LGEYGFQNALIVR | 31 | X |
| 63 | 927.698 | YLYEIAR | [27] | [20] |
| 63 | 1419.950 | SLHTLFGDELCK | [44] | [12] |
| 70 | 2045.571 | RHPYFYAPELLYYANK | 11 | X |
| 70 | 1480.156 | LGEYGFQNALIVR | X | 17 |
| 72 | 1567.923 | DAFLGSFLYEYSR | 42 | X |
| Total score (Coverage) | | | 746 (33%) | 338 (16%) |

a) A score in brackets represents a duplicate peptide in the search results. b) The symbol X indicates that no match for the peptide was found from the peak list.
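As a worked check of how the total in Table 3 arises, the MWD column can be summed in Python by keeping only the highest ion score for each distinct peptide, as the text describes. This sketch mirrors that description only; it is not Mascot's own protein scoring code, and the function name `protein_score` is hypothetical. The scores are copied from the MWD column of Table 3, with bracketed duplicates included as ordinary entries.

```python
from typing import Dict, List, Tuple

# (peptide, MWD ion score) for every matched spectrum of BSA3, in table order.
mwd_matches: List[Tuple[str, int]] = [
    ("SHCIAEVEK", 24), ("YICDNQDTISSK", 68), ("QEPERNECFLSHK", 19),
    ("CCAADDKEACFAVEGPK", 36), ("DDPHACYSTVFDK", 23), ("HLVDEPQNLIK", 31),
    ("LKPDPNTLCDEFK", 18), ("YLYEIAR", 50), ("KVPQVSTPTLVEVSR", 69),
    ("VPQVSTPTLVEVSR", 54), ("RPCFSALTPDETYVPK", 41), ("YLYEIAR", 32),
    ("LVNELTEFAK", 59), ("RHPEYAVSVLLR", 57), ("HPEYAVSVLLR", 47),
    ("SLHTLFGDELCK", 66), ("YLYEIAR", 23), ("LGEYGFQNALIVR", 31),
    ("YLYEIAR", 27), ("SLHTLFGDELCK", 44), ("RHPYFYAPELLYYANK", 11),
    ("DAFLGSFLYEYSR", 42),
]


def protein_score(matches: List[Tuple[str, int]]) -> Tuple[int, int]:
    """Return (summed score, number of distinct peptides).

    Only the highest ion score of each distinct peptide is counted.
    """
    best: Dict[str, int] = {}
    for peptide, score in matches:
        best[peptide] = max(best.get(peptide, score), score)
    return sum(best.values()), len(best)


total, n_peptides = protein_score(mwd_matches)
```

For the MWD list this gives a total of 746 from 18 distinct peptides across 22 spectra, in agreement with the table and the text.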
A computer program has been developed to detect ion peaks in mass spectra. The method determines the noise level from the data points within a selected mass range. Thus, many of the parameters normally required to control peak detection are not necessary in this program. The method is effective in selecting ion peaks, particularly for spectra with low resolution. The test results confirm that this peak detection not only achieves high true positive rates for mass spectra with high quality fragmentation, but also gives reasonable true/false positive rates for the peak lists of spectra that do not contain many strong fragment ions. Therefore, a greater number of MS/MS spectra from high-throughput experiments can be used for finding peptide sequences, which leads to higher protein sequence coverage. Although the initial intention was to develop an approach for detecting ion peaks in low-resolution spectra, the method can also be applied to mass spectra with higher resolution.
This research was funded by the Japan Society for the Promotion of Science (JSPS) through the “Funding Program for World-Leading Innovative R&D on Science and Technology (FIRST Program),” initiated by the Council for Science and Technology Policy (CSTP).