2025 Volume 14 Issue 1 Pages A0174
In this study, we propose an effective summarization method for mass spectrometry imaging (MSI) data and demonstrate its efficacy. The MSI data used in this study were obtained from thoracic tissue sections of mice, including the thymus. The thymus is a multi-lobed organ composed of cortical and medullary areas, playing a crucial role in T-cell differentiation. By applying MSI to the thoracic region, including the thymus, this study aims to comprehensively visualize changes in molecular localization and metabolic patterns across thoracic organs. MSI data are highly information-rich, making effective summarization and organization challenging. Therefore, we explored a method to organize and visualize the data based on either spatial or m/z values. Specifically, we employed Uniform Manifold Approximation and Projection (UMAP) to project m/z data into 3-dimensional space, followed by k-means clustering to divide it into multiple clusters. This approach enables detailed and comprehensive representation of diverse features. The objective of this study is to identify molecular localizations and patterns that conventional methods may overlook. Furthermore, experimental results demonstrated that the pseudo-color images generated using UMAP highlighted specific m/z values that significantly influence image characteristics. When focusing on thoracic data, spatial segmentation resulted in clearer color differentiation; however, molecular localizations corresponding to blood vessels were not observed. This finding confirms that m/z segmentation is more effective than spatial segmentation in discovering new molecular localizations.
In recent years, mass spectrometry imaging (MSI) has emerged as a powerful technique for detailed visualization of molecular distributions within tissues and has been widely applied in various fields such as biomedicine, pharmacology, and materials science.1–3) Since Geladi’s pioneering work prior to 1990 on multivariate image analysis, numerous approaches have been developed to handle the complex data generated by spectroscopic imaging techniques.4–10) In the specific context of MSI, early applications using secondary ion mass spectrometry (MS),11,12) laser ablation inductively coupled plasma MS (LA-ICP-MS),13–15) and matrix-assisted laser desorption/ionization (MALDI)16–18) have demonstrated the technique’s versatility across different sample types and analytical questions. The MSI technology possesses high spatial resolution and molecular specificity, allowing the simultaneous detection of various molecules within tissues. However, a single sample typically generates thousands of molecular signals, making the effective visualization of this vast amount of information a significant challenge.
Fonville et al. have proposed pseudo-color techniques for visualizing MSI data.19) Additionally, dimensionality reduction methods such as t-distributed stochastic neighbor embedding (t-SNE)20) and uniform manifold approximation and projection (UMAP)21) have been shown to be effective means of extracting important information from complex datasets.22–25) Various clustering methods have also been proposed as summarization techniques for MSI data.26–28) Despite these advancements, comprehensively visualizing MSI data as a single image remains challenging, and many important features may be overlooked. Therefore, more effective data analysis methods are needed.
In this study, we propose a new approach for analyzing MSI data. We use UMAP and k-means clustering to partition the MSI data along the m/z direction and generate a pseudo-color image for each partition. This method aims to improve the efficiency of MSI data analysis by providing a detailed and comprehensive representation of diverse features and enabling more accurate analysis.
The MSI data used in this study were obtained from thoracic tissue sections of mice, including the thymus.29) The thymus is a multi-lobed organ composed of cortical and medullary areas, playing a crucial role in T-cell differentiation. During T-cell development, thymocytes traverse specific thymic compartments and interact with the microenvironments of the cortex and medulla.
The significance of this study lies in proposing a new method for summarizing and visualizing MSI data and demonstrating its effectiveness with actual biological samples. By doing so, we aim to uncover new molecular localizations that conventional methods may overlook.
Four-week-old female ICR mice were administered 5 mg/kg of dexamethasone intraperitoneally. The experimental data from these mice have been detailed in a previous publication.29) In this study, we reused the MSI data obtained from these experiments to validate the proposed method.
For the acquisition of MSI data in negative ion mode, the tissue sections were coated with a 9-aminoacridine matrix. MSI was performed using an atmospheric pressure MALDI-quadrupole ion trap-time-of-flight-MS (Shimadzu Corporation, Kyoto, Japan).
The acquired MSI data were normalized using the total ion count, and 3000 peaks were extracted in order of intensity from the average spectrum to create a data matrix. IMAGEREVEAL MS Ver.1.30 (Shimadzu, Kyoto, Japan) was used for processing. This data matrix was analyzed using Python (3.10) (Supplementary Table 1).
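The preprocessing described above was performed in IMAGEREVEAL MS, so the following is only a minimal, dependency-free sketch of the two steps (total-ion-count normalization and selection of the most intense peaks of the average spectrum). The function names `tic_normalize` and `top_peaks`, the toy intensities, and the choice to average after normalization are assumptions made for illustration, not the software's implementation.

```python
def tic_normalize(spectrum):
    """Divide every intensity by the spectrum's total ion count (TIC)."""
    total = sum(spectrum)
    if total == 0:
        # An empty pixel stays all-zero instead of raising a division error.
        return [0.0 for _ in spectrum]
    return [value / total for value in spectrum]


def top_peaks(spectra, n_peaks):
    """Return the column indices of the n_peaks most intense average-spectrum peaks."""
    n_pixels = len(spectra)
    n_channels = len(spectra[0])
    average = [sum(row[j] for row in spectra) / n_pixels for j in range(n_channels)]
    ranked = sorted(range(n_channels), key=lambda j: average[j], reverse=True)
    # Keep the selected channels in their original (m/z) order.
    return sorted(ranked[:n_peaks])


# Toy data: 3 pixels x 4 m/z channels (hypothetical intensities).
raw = [
    [10.0, 0.0, 30.0, 60.0],
    [5.0, 5.0, 0.0, 10.0],
    [0.0, 2.0, 2.0, 4.0],
]
normalized = [tic_normalize(row) for row in raw]
selected = top_peaks(normalized, n_peaks=2)  # the study used 3000 peaks
data_matrix = [[row[j] for j in selected] for row in normalized]
```

In this toy case the last two channels are retained, and each row of `data_matrix` corresponds to one pixel.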
2.1 Data processing and analysis
The overall workflow for MSI data processing and analysis is depicted in Fig. 1. IMAGEREVEAL MS Ver.1.30 and Python (3.10) were combined to meet the advanced analysis requirements of this study. IMAGEREVEAL MS Ver.1.30 provides numerous useful features for processing MSI data, such as creating data matrices, spatial projection of pixels, and generating pseudo-color images. However, it lacks the capability to project m/z data into 3 dimensions using UMAP and to apply k-means clustering for data partitioning. Therefore, Python (3.10), which offers these algorithms, was integrated into the analysis process.
Specifically, the data matrix created using IMAGEREVEAL MS Ver.1.30 was transferred to the Python environment, where UMAP was used to project the m/z data into 3-dimensional (3D) space. Subsequently, k-means clustering was applied to the resulting 3D data to classify the data. This process enabled the identification of molecular localizations and patterns that were difficult to detect using conventional methods.
The parameters used for these algorithms are detailed in Supplementary Table 1. For UMAP, we used the umap-learn library (version 0.5.6) with 3 components for dimensionality reduction, 15 neighbors to balance local and global structure preservation, and a minimum distance of 0.1 to prevent excessive crowding of points. The Euclidean distance metric was selected as it is appropriate for the continuous nature of our intensity data. For k-means clustering, the scikit-learn library (version 1.4.2) was employed with 50 clusters for the pixel-direction analysis to capture the diversity of molecular patterns without over-partitioning. We used the k-means++ initialization method with 10 initializations to obtain stable results, and a maximum of 300 iterations to allow the algorithm to converge.
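The settings listed above could be passed to umap-learn and scikit-learn roughly as follows. This is a sketch rather than the authors' script: the function name `embed_and_cluster`, the (pixels × m/z) orientation of the input, and the `random_state` seed are assumptions, and Supplementary Table 1 remains the authoritative parameter list.

```python
# Settings quoted in the text (Supplementary Table 1 is the authoritative source).
UMAP_PARAMS = {
    "n_components": 3,
    "n_neighbors": 15,
    "min_dist": 0.1,
    "metric": "euclidean",
}
KMEANS_PARAMS = {
    "n_clusters": 50,
    "init": "k-means++",
    "n_init": 10,
    "max_iter": 300,
}


def embed_and_cluster(data_matrix, random_state=0):
    """Project the rows of data_matrix to 3-D with UMAP, then cluster the embedding.

    data_matrix: array-like of shape (n_pixels, n_mz) for the pixel-direction
    analysis (assumed orientation). random_state is an assumption added here
    for repeatability; the source does not state a seed.
    """
    # Imported lazily so the parameter dictionaries can be inspected
    # even where umap-learn or scikit-learn is not installed.
    import umap
    from sklearn.cluster import KMeans

    reducer = umap.UMAP(random_state=random_state, **UMAP_PARAMS)
    embedding = reducer.fit_transform(data_matrix)
    kmeans = KMeans(random_state=random_state, **KMEANS_PARAMS)
    labels = kmeans.fit_predict(embedding)
    return embedding, labels, kmeans.cluster_centers_
```

Fixing a seed is one way to make repeated runs comparable, since both UMAP and k-means involve random initialization.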
For pseudo-color image generation, the 3-dimensional UMAP coordinates were directly mapped to RGB values (0–255), and images were constructed with each pixel displaying the color of its corresponding cluster centroid. The PIL library was used to generate the final visualization.
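The centroid-based coloring can be illustrated with a small dependency-free sketch. It assumes that each pixel already has an RGB triplet and a cluster label, and that the centroid color is the rounded mean of the member colors. The names `centroid_colors` and `paint` and the toy values are hypothetical.

```python
def centroid_colors(rgb_points, labels):
    """Average the RGB triplets of each cluster and return {label: (r, g, b)}."""
    sums = {}
    counts = {}
    for color, label in zip(rgb_points, labels):
        acc = sums.setdefault(label, [0, 0, 0])
        for channel in range(3):
            acc[channel] += color[channel]
        counts[label] = counts.get(label, 0) + 1
    return {
        label: tuple(round(total / counts[label]) for total in acc)
        for label, acc in sums.items()
    }


def paint(labels, colors, width):
    """Arrange the per-pixel centroid colors into image rows of the given width."""
    flat = [colors[label] for label in labels]
    return [flat[i:i + width] for i in range(0, len(flat), width)]


# Toy example: 4 pixels in a 2 x 2 image, 2 clusters (hypothetical values).
rgb_points = [(0, 0, 0), (10, 20, 30), (250, 250, 250), (254, 246, 250)]
labels = [0, 0, 1, 1]
colors = centroid_colors(rgb_points, labels)
image_rows = paint(labels, colors, width=2)
```

The resulting rows of RGB tuples could then be handed to an imaging library such as PIL to write the final picture.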
The number of clusters (50) in this study merely determines the number of colors for visual representation and is not itself a subject of biological significance or analysis. This value was selected based on visual evaluation, as too few clusters result in insufficient color differentiation, while too many lead to difficulty in distinguishing similar colors. The value of 50 was empirically determined as the optimal number of colors for visual identification of spatial distribution patterns.
For the comparative principal component analysis (PCA), IMAGEREVEAL MS Ver.1.30 was used with 2 principal components and the Pareto scaling method.
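For readers unfamiliar with Pareto scaling, a minimal sketch is given here: each variable is mean-centered and divided by the square root of its standard deviation, which reduces the dominance of intense peaks less aggressively than unit-variance scaling. The function name `pareto_scale` and the use of the sample standard deviation are assumptions; the exact convention used inside IMAGEREVEAL MS is not stated in the text.

```python
import math


def pareto_scale(column):
    """Mean-center a variable and divide by the square root of its standard deviation."""
    n = len(column)
    mean = sum(column) / n
    # Sample variance (n - 1 denominator) is an assumption; needs n >= 2.
    variance = sum((x - mean) ** 2 for x in column) / (n - 1)
    scale = math.sqrt(math.sqrt(variance))  # sqrt of the standard deviation
    if scale == 0:
        # A constant variable carries no information after centering.
        return [0.0 for _ in column]
    return [(x - mean) / scale for x in column]


scaled = pareto_scale([1.0, 3.0, 5.0, 7.0])
```

After Pareto scaling, the variance of a variable equals its original standard deviation, so intense peaks keep more weight than under unit-variance scaling but less than in the raw data.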
Python (3.10) was chosen for its stability and compatibility. By combining the visual analysis capabilities of IMAGEREVEAL MS Ver.1.30 with the advanced numerical processing and statistical modeling provided by Python, we achieved efficient and precise analysis.
2.2 Data visualization
For the visualization of data in the pixel direction, the dimensionality in the m/z direction was reduced to 3 dimensions using UMAP, and the position of each pixel in the 3D space was converted to RGB color coding to represent the pixel’s color.22,23) This conversion was performed by directly mapping the 3 UMAP dimensions to the R, G, and B color channels, respectively. Specifically, the 1st UMAP dimension (dim_1) was mapped to the red channel, the 2nd dimension (dim_2) to the green channel, and the 3rd dimension (dim_3) to the blue channel. Each dimension was scaled to the 0–255 range for standard RGB color representation using the formula: RGB_value = 255 × (UMAP_value − min_value)/(max_value − min_value), with higher values in each dimension resulting in more intense red, green, or blue components in the final color. Because each dimension is scaled linearly, this direct mapping keeps points that are close in the UMAP space close in color, allowing similar molecular profiles to appear as similar colors in the visualization.
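The scaling formula can be sketched in a few lines of plain Python. The function name `scale_to_rgb`, the rounding to the nearest integer, and the handling of a constant dimension are assumptions added for illustration; the toy embedding values are hypothetical.

```python
def scale_to_rgb(embedding):
    """Min-max scale each of the three embedding dimensions to integers in 0-255."""
    columns = list(zip(*embedding))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    colors = []
    for point in embedding:
        channels = []
        for value, low, high in zip(point, lows, highs):
            span = high - low
            # A constant dimension carries no contrast; map it to 0 (assumption).
            channels.append(0 if span == 0 else round(255 * (value - low) / span))
        colors.append(tuple(channels))
    return colors


# Three pixels with (dim_1, dim_2, dim_3) coordinates.
embedding = [(-1.0, 2.0, 0.5), (0.0, 4.0, 0.5), (4.0, 2.5, 0.5)]
colors = scale_to_rgb(embedding)
```

Each tuple in `colors` is the (R, G, B) value of one pixel, with the minimum of every dimension mapped to 0 and the maximum to 255.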
Subsequently, k-means clustering was used to divide the pixels into 50 clusters, and each cluster was assigned the color of its centroid to construct a pseudo-color image.
For partitioning in the m/z direction, we adopted a 3-stage binary division method. We chose this approach primarily because dividing into many clusters at once could mix clusters that cannot be distinguished in the overall UMAP projection. The binary division method separates the m/z values into 2 groups based on the most significant differences at each stage, providing stable clustering results even without a clear objective indicator for choosing a larger number of divisions. Furthermore, the 3-stage hierarchical process enabled us to observe the separation of molecular patterns step by step, allowing detailed tracking of how each division reveals different molecular localizations.
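The bookkeeping of a 3-stage binary division can be sketched as follows. The splitting step itself is passed in as a function, and the `median_split` used here is only a stand-in for the actual two-group clustering of m/z values. The names `binary_division` and `median_split`, and the labeling rule in which parent i at one stage yields children 2i−1 and 2i at the next, are assumptions made for illustration.

```python
def binary_division(items, split, stages=3):
    """Return {label: members} for segment "0" and every "stage-index" segment."""
    segments = {"0": list(items)}
    current = [list(items)]
    for stage in range(1, stages + 1):
        children = []
        for parent in current:
            left, right = split(parent)
            children.extend([left, right])
        for index, members in enumerate(children, start=1):
            segments[f"{stage}-{index}"] = members
        current = children
    return segments


def median_split(values):
    """Stand-in splitter: lower half versus upper half of the sorted values."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return ordered[:half], ordered[half:]


# 16 hypothetical m/z indices divided in 3 stages: 1 + 2 + 4 + 8 = 15 segments.
segments = binary_division(range(16), median_split)
```

With three stages, the procedure yields one unsegmented set, 2 first-stage segments, 4 second-stage segments, and 8 final segments, and each final segment can be traced back through its parents.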
As an example of spatial partitioning, the thymus was isolated, and the m/z data were reduced to 3 dimensions using UMAP to create a pseudo-color image. Additionally, for comparison, regions of interest (ROIs) were set for each part, and PCA was performed based on the average spectrum of each ROI. All experiments were run on an Intel Xeon CPU W-2123 3.60 GHz machine with 64 GB RAM.
A series of pseudo-color images were obtained by segmenting the m/z data according to the hierarchical structure shown in Fig. 2. Figure 3A–3O shows the 15 pseudo-color images corresponding to each segment. Additionally, Fig. 4 shows the results of projecting the data into 3 dimensions using UMAP. In panels A–O of Fig. 4, the position of each measurement point (pixel) is represented by RGB values, while panel P displays the distribution of m/z values themselves, color-coded according to the 8 final segments from the 3rd division stage. Furthermore, a pseudo-color image limited to the thymus region was also created (Fig. 5). Note that the 1-mm scale bar shown in Fig. 3A applies to Figs. 3 and 4A–4O, but Fig. 5 is displayed at a different scale.
When visualizing the data without segmenting the m/z values (Fig. 3A, corresponding to segment “0” in Fig. 2), we confirmed that peripheral areas such as skin and muscle were clearly distinguishable from internal organs. After segmenting the m/z data in 3 stages, certain pseudo-color images (Fig. 3C, 3F, and 3M, corresponding to segments “1-2,” “2-3,” and “3-6” in Fig. 2, respectively) were observed to be similar to the original unsegmented image.
Segment “3-6” (Fig. 3M) contains only 218 m/z values (Supplementary Table 2) and is represented in red in Fig. 4P. The visual similarity between Figs. 3M and 3A suggests that the m/z values included in segment “3-6” strongly contribute to the characteristics of the original image.
The pseudo-color images obtained after segmentation revealed interesting localizations that were not observed before segmentation. For example, white spots were observed in the thymus in Fig. 3B (segment “1-1”), which were confirmed to be blood vessels by histological examination shown in Fig. 3P. Additionally, magenta spots were observed near the center in Fig. 3H (segment “3-1”).
To further investigate these magenta spots, we utilized the similar image extraction function of IMAGEREVEAL MS shown in Supplementary Fig. 4. By using the magenta region in segment “3-1” as a teacher image for partial least squares regression analysis, we identified m/z 474.34 as strongly associated with localization within the esophagus. Preliminary identification by METLIN30) suggested that this m/z value corresponds to the [M–H2O–H] ion of C25H52NO6P.
In the pseudo-color image where the spatial data were limited to the thymus without segmenting the m/z data (Fig. 5), the color differentiation became clearer compared to Fig. 3A; however, the localization corresponding to blood vessels observed in Fig. 3B was not detected.
This study demonstrates that pseudo-color images generated using UMAP are determined by a small number of characteristic m/z values. Specifically, we confirmed that 218 m/z values (approximately 7% of the total 3000) belonging to segment “3-6” determine the main features of the original unsegmented image.
The PCA loading plot in Supplementary Fig. 2B further supports this observation. The m/z value group corresponding to segment “3-6” (shown in red in Fig. 4P) exhibits large variance in the PCA, confirming its strong influence on the overall data characteristics. Consequently, conventional visualization methods may systematically overlook other important molecular information due to the influence of these dominant characteristic m/z values.
The generation of multiple similar images in our analysis (as seen in Figs. 3A, 3C, 3F, and 3M) is a direct consequence of this data characteristic. These similarities arise because certain m/z value clusters share dominant influence on the overall molecular distribution pattern. This repetition of similar visualization patterns across different segmentation steps further validates our finding that a small subset of m/z values strongly determines the image characteristics.
In our approach, after dimensionality reduction using UMAP, we employed k-means clustering to divide the m/z values into multiple clusters. The essence of this method lies in grouping m/z values that exhibit similar behavior and visualizing each group independently. The k-means segmentation allows us to separate other m/z value groups from the influence of the few m/z values that dominate the overall data characteristics, thereby revealing the unique molecular distribution patterns of each group.
4.1.1 Mathematical considerations for minor peak detection
From a mathematical perspective, UMAP alone struggles to highlight minor ion peaks because it preserves both the local and global structure of the high-dimensional data. When projecting the entire dataset, m/z values with high variance and abundance naturally dominate the projection, while minor peaks with limited spatial distribution become effectively “hidden” in the dimensionality reduction process, despite being potentially biologically significant.
K-means clustering proves effective for selecting these minor peaks because it partitions the m/z space based on similarities in spatial distribution patterns rather than absolute intensity. Within each cluster, m/z values share similar distribution patterns regardless of their absolute abundance. This separation enables visualization of molecular distributions that would otherwise be overshadowed by dominant m/z values. Mathematically, this works because k-means minimizes within-cluster variance while maximizing between-cluster variance, effectively isolating groups of m/z values with distinct spatial behaviors even when they represent a small fraction of the total signal.
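The variance statement can be made explicit with the standard sum-of-squares decomposition. In the notation assumed here, $x_i$ are the $n$ points being clustered (for example, the 3D UMAP coordinates of the m/z values), $\bar{x}$ is their overall mean, and $\mu_j$ is the centroid of cluster $C_j$:

```latex
\underbrace{\sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2}_{\text{total}}
=
\underbrace{\sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2}_{\text{within-cluster}}
+
\underbrace{\sum_{j=1}^{k} \lvert C_j \rvert \, \lVert \mu_j - \bar{x} \rVert^2}_{\text{between-cluster}}
```

Because the total is fixed for a given dataset, minimizing the within-cluster term is equivalent to maximizing the between-cluster term. In practice, the standard k-means iteration reaches a local minimum of the within-cluster term, which is one reason multiple initializations are commonly used.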
The fact that blood vessels within the thymus and specific localizations within the esophagus were detectable only after m/z segmentation concretely demonstrates the effectiveness of this approach. These structures were clearly present in the data yet completely invisible in the pre-segmentation visualization.
4.1.2 Guidelines for selecting informative images
When selecting the most informative images from multiple k-means clustering results, we recommend the following guidelines:
Our results clearly demonstrate that m/z segmentation is more effective than spatial segmentation for discovering new molecular localizations. When we limited the analysis to the thymus region (spatial segmentation), although the overall color differentiation improved, the blood vessel localization detected in the m/z-segmented image was not observed.
This phenomenon can be explained by our observation that pseudo-color images are determined by a small number of dominant m/z values. In spatial segmentation, the relative influence relationships of m/z values within the selected region remain unchanged, thus limiting the ability to reveal subtle molecular patterns.
On the other hand, m/z segmentation using the k-means method classifies the 3000 m/z values in the dataset into multiple groups based on similarity and visualizes the spatial distribution of each group independently. This reveals molecular patterns that were previously obscured by the few dominant m/z values, enabling the visualization of microregions such as blood vessels.
4.3 Molecular identification limitations
Regarding the localization observed in the esophagus (Fig. 3H), preliminary identification based on m/z value 474.34 suggested C25H52NO6P. However, definitive molecular identification would require confirmation by tandem MS. As this study utilized archived MSI data from previous experiments, and the original samples are no longer available for additional analysis, our identification remains tentative and based solely on accurate mass matching.
4.4 Applications and future directions
The m/z segmentation approach proposed in this study has potential applications in various fields:
Future work should focus on:
This study demonstrates that pseudo-color images generated using UMAP are predominantly determined by a small number of characteristic m/z values. Our analysis revealed that merely 218 m/z values (approximately 7% of the total 3000) strongly influence the characteristics of unsegmented MSI data visualizations. This finding has important implications for MSI data analysis, suggesting that conventional visualization methods may systematically overlook significant molecular features due to the dominance of these few m/z values.
Additionally, the pseudo-color images obtained through m/z segmentation using k-means clustering revealed molecular localizations that were completely overlooked by conventional methods. Specifically, blood vessels within the thymus and specific localizations within the esophagus were detected only after segmentation, despite being clearly present in the data. The identification of m/z 474.34 as a marker for esophageal localization through our targeted analysis further demonstrates the value of exploring segmented data.
On the other hand, when spatial data were limited to the thymus, although the color differentiation became clearer, the blood vessel localization observed in the m/z-segmented image was not detected. This observation confirms that spatial segmentation is less effective than m/z segmentation in discovering new molecular localizations, as it does not alter the relative influence of dominant m/z values within the selected region.
From these results, we conclude that m/z segmentation using k-means clustering after UMAP dimensionality reduction is essential for extracting more detailed and meaningful information from MSI data. By separating the influence of dominant m/z values and independently analyzing similar m/z groups, this approach enables the visualization of molecular patterns that would otherwise remain hidden. We expect this method to benefit future work in fields including biomedical research, pharmaceutical studies, and pathology, providing a foundation for deeper biological understanding and potential clinical applications.
We deeply appreciate Assistant Professor Yudai Tsuji from Fujita Health University for his assistance in preparing the samples and obtaining the data.
Mass Spectrom (Tokyo) 2025; 14(1): A0174