Genes & Genetic Systems
Online ISSN : 1880-5779
Print ISSN : 1341-7568
ISSN-L : 1341-7568
Full papers
A four-gene-based methylation signature associated with lymph node metastasis predicts overall survival in lung squamous cell carcinoma
Yufei DengLifeng LiuXia XiaoYin Zhao
Author information
JOURNAL OPEN ACCESS FULL-TEXT HTML
Supplementary material

2023 Volume 98 Issue 5 Pages 209-219

Details
ABSTRACT

We aimed to identify prognostic methylation genes associated with lymph node metastasis (LNM) in lung squamous cell carcinoma (LUSC). Bioinformatics methods were used to obtain optimal prognostic genes for risk model construction using data from the Cancer Genome Atlas database. ROC curves were adopted to predict the prognostic value of the risk model. Multivariate regression was carried out to identify independent prognostic factors and construct a prognostic nomogram. The differences in overall survival, gene mutation and pathways between high- and low-risk groups were analyzed. Finally, the expression and methylation level of the optimal prognostic genes among different LNM stages were analyzed. FGA, GPR39, RRAD and TINAGL1 were identified as the optimal prognostic genes and were applied to establish a prognostic risk model. Significant differences were found among the different LNM stages. The risk model could predict overall survival, showing a moderate performance with AUC of 0.64–0.68. The model possessed independent prognostic value, and could accurately predict 1-, 3- and 5-year survival. Patients with a high risk score showed poorer survival. Lower gene mutation frequencies and enrichment of leukocyte transendothelial migration and the VEGF signaling pathway in the high-risk group may lead to the poor prognosis. This study identified several specific methylation markers associated with LNM in LUSC and generated a prognostic model to predict overall survival for LUSC patients.

INTRODUCTION

As a type of malignant tumor, lung cancer causes serious morbidity and mortality worldwide (Bray et al., 2018). Lung squamous cell carcinoma (LUSC) is a typical subtype of non-small cell lung cancer (NSCLC) (Piperdi et al., 2014). Lymph node metastasis (LNM) is the main form of metastasis in lung cancer and can severely affect survival and prognosis of patients (Dong et al., 2019). The 5-year overall survival of patients with LNM remains dissatisfactory (Goldstraw et al., 2016). There is still a lack of effective biomarkers for predicting survival of patients with LNM in LUSC. It is therefore essential to explore effective prognostic factors associated with LNM in LUSC to improve predictive ability and therapy decisions in the clinic.

DNA methylation is the most prominent modification in epigenetics. Selective methylation and demethylation of genes can influence the development of specific tissue types by mediating gene expression (Klutstein et al., 2016). The regulation of methylation is thought to be one of the pivotal factors contributing to the development of various solid tumors (Kulis and Esteller, 2010). In addition, DNA methylation state possesses an effective potential as a biomarker of prognosis (Zhou et al., 2014; Fu et al., 2020). An 11-DNA methylation gene signature (Zhang et al., 2020) and the AKAP13 methylation gene (Wang et al., 2020a) have been demonstrated to have significant sensitivity and specificity in predicting patient survival in LUSC. DNA hypermethylation is common in lung cancer and many results have demonstrated that methylation is correlated with LNM (Xie et al., 2013; Min et al., 2018). Aberrant promoter methylation of microRNA-137 is associated with LNM (Min et al., 2018) in NSCLC. Aberrant hypermethylation of disabled-2 reduced β-catenin expression and promoted the development of lung cancer (Xie et al., 2013). The methylation level of CDO1 was found to be significantly different between genders and at different LNM and tumor node metastasis stages (Wang et al., 2020c). Despite these observations, metastasis-related methylation genes involved in LUSC remain absent.

To explore potential prognostic markers specifically related to LNM in LUSC, we aimed to identify several methylation-related genes and construct a prognostic prediction model that exhibited accurate predictive value in survival and prognosis in LNM for LUSC.

RESULTS

Identification of methylation genes related to LNM

We grouped N0 as the non-LNM group and N1–N3 as the LNM groups in the Cancer Genome Atlas (TCGA) database. The LNM and non-LNM groups included 176 and 319 samples, respectively. A total of 276 differentially expressed genes (DEGs) between the two groups were screened, most of which were downregulated (Fig. 1A). Meanwhile, 10,241 differentially methylated genes (DMGs) between the two groups were screened. Furthermore, we intersected the DEGs and DMGs using Venn analysis and identified 163 overlapping genes as methylation genes related to LNM (Fig. 1B). A heat map of the expression level of these 163 overlapping genes is shown in Supplementary Fig. S1.

Fig. 1. Overlapping genes and functional analysis. (A) Volcano plot showing differentially expressed genes (DEGs) of patients with or without LNM. (B) Venn diagram identifying 163 overlapping genes among DEGs and differentially methylated genes (DMGs). (C) Enrichment analysis of biological process (BP), cellular component (CC) and molecular function (MF) (P < 0.05). (D) Pathway enrichment map.

GO and KEGG functional enrichment analyses

After Gene Ontology (GO) functional annotation and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment, a total of 19 significantly related biological processes (BPs), 12 cellular components (CCs), seven molecular functions (MFs) and five KEGG signaling pathways were filtered out. The top six most significant BPs, CCs and MFs, as well as the five KEGG pathways, are shown in Fig. 1C and 1D. The BP category was mainly associated with digestion, positive regulation of antibacterial peptide production, sodium ion transport, epidermis development, sodium ion transmembrane transport and excretion. The CC category was mainly associated with extracellular function. The MF category was mainly associated with structural molecule activity, calcium ion binding and microtubule motor activity. Furthermore, these genes were mainly associated with tumor regulation and metabolism.

Construction of a prognostic risk model

A univariate Cox regression analysis was carried out to analyze the association of gene expression and survival in LUSC, and 35 genes were significantly related to prognosis (P < 0.05). To optimize the prognostic genes, a least absolute shrinkage and selection operator (LASSO) Cox analysis on the 35 genes was performed. We generated four optimal prognostic genes, FGA, GPR39, RRAD and TINAGL1 (Fig. 2A, Table 1), whose hazard ratio (HR) values of > 1 suggested poor prognostic ability. The prognostic risk model was then constructed as follows:

  
Risk score = (0 .0227) * Exp  FGA  + (0.0156)  * Exp  GPR39  + (0 .0104) * Exp  RRAD  + (0 .0265) * Exp  TINAGL1 .

Fig. 2. Construction of a prognostic risk model. (A) Left, curves represent regularization paths of least absolute shrinkage and selection operator (LASSO) coefficients. Right, partial likelihood deviance as a function of regularization parameter λ. (B) Risk score distribution, Kaplan–Meier (KM) curve and receiver operating characteristic (ROC) curve analysis of the TCGA training dataset. (C) Risk score distribution, KM curve and ROC curve analysis of the GEO validation dataset. (D) Prognosis KM curve of FGA. (E) Prognosis KM curve of GPR39. (F) Prognosis KM curve of RRAD. (G) Prognosis KM curve of TINAGL1.

Table 1. Cox and LASSO analysis of four-gene prognostic signature components

GeneUnivariate Cox regression analysisLASSO coefficient
BetaHR95% CIP-value
FGA0.0742631.0771(1.0320–1.1241)0.0007***0.0227
GPR390.1047091.1104(1.0380–1.1878)0.0023**0.0156
RRAD0.0998181.1049(1.0312–1.1840)0.0046**0.0104
TINAGL10.1154451.1224(1.0362–1.2157)0.0046**0.0265

Note: HR, hazard ratio. **, P < 0.01; ***, P < 0.001.

Evaluation of the prognostic risk model

The risk scores of all samples in the TCGA and Gene Expression Omnibus (GEO) datasets were calculated, and the samples were grouped into high- and low-risk groups (Fig. 2B and 2C left). In both the TCGA and GEO datasets, patients with high risk scores possessed remarkably poorer overall survival than those with low risk scores (Fig. 2B and 2C middle, P < 0.05). The area under the receiver operating characteristic (ROC) curve (AUC) values for survival were all 0.65 at 1, 3 and 5 years in the TCGA database. The AUCs for survival were 0.67, 0.64 and 0.68 at 1, 3 and 5 years, respectively, in the GEO database. These results revealed that our prognostic model possessed moderate predictive value (Fig. 2B and 2C right). For the four prognostic genes, survival for high-expression patients was significantly lower than for low-expression patients (Fig. 2D–2G, P < 0.05).

Prognostic nomogram for overall survival

Stage and risk score remained as independent prognostic factors in the TCGA while only risk score was regarded as an independent prognostic factor in the GEO (Table 2). These findings implied that the risk model possessed independent prognostic value. A nomogram that involved the independent prognostic factors was visualized (Fig. 3A). The calibration figure exhibited excellent agreement between the predicted and actual situation for 1-, 3- and 5-year overall survival, with C-index values of 0.902, 0.792 and 0.715, respectively (Fig. 3B, 3C, 3D).

Table 2. Independent prognostic analysis of risk score

Data sourceCharacteristicUnivariate Cox analysisMultivariate Cox analysis
HR95% CIP-valueHR95% CIP-value
TCGA datasetAge1.01450.9978–1.03140.08851.01440.9973–1.03180.0997
Gender1.17000.8480–1.61430.33911.23600.8925–1.71170.2022
Stage1.25721.0639–1.48560.0072**1.29171.0958–1.52270.0023**
Risk score2.42683.7383–3.4928< 0.001***2.42941.3396–3.80750.0015**
GEO datasetAge1.00750.9800–1.03570.59621.01590.9855–1.04730.3075
Gender1.70640.9843–2.95820.05681.61070.9212–2.08170.0945
Stage1.31010.9088–1.88860.14781.38190.9577–1.99400.0839
Risk score2.49051.8657–3.32430.0017**1.92821.8410–2.54610.0237*

Note: HR, hazard ratio. *, P < 0.05; **, P < 0.01; ***, P < 0.001.

Fig. 3. Construction and evaluation of a nomogram model. (A) Nomogram to predict 1-, 3- and 5-year overall survival. (B, C, D) Calibration plots of the nomogram to predict overall survival at 1, 3 and 5 years.

Gene mutation and GSEA analysis

The mutation frequency and tumor mutation burden (TMB, the number of somatic nonsynonymous mutations per megabase pair) of each gene were calculated in the included LUSC samples. The top 20 mutation frequencies in the LUSC tumor samples are shown in Fig. 4A and 4B. Missense mutations accounted for the majority of mutations. Mutated genes in the high- and low-risk score groups were not completely in agreement. For example, the mutation frequency of TP53 in the high-risk group tended to be higher than that in the low-risk group. TMB exhibited no significant differences between the high and low risk scores (Fig. 4C).

Fig. 4. Mutation patterns and performance of gene set enrichment analysis (GSEA). (A) Top 20 gene mutation frequencies in the high-risk group. (B) Top 20 gene mutation frequencies in the low-risk group. (C) Tumor mutation burden (TMB) in the high- and low-risk group. (D) GSEA analysis reveals significantly enriched pathways in the high-risk group.

For investigating the potential biological processes and pathways between the high- and low-risk groups, we identified the genes in all samples between the high- and low-risk groups. Twenty-two signaling pathways differed significantly between the two groups. Gene set enrichment analysis (GSEA) revealed that several pathways related to tumor progression were enriched in the high-risk group, such as apoptosis and the VEGF signaling pathway (Fig. 4D).

Differences of biomarkers between non-LNM and LNM groups

The associations of four prognostic signature genes (FGA, GPR39, RRAD and TINAGL1) with their corresponding methylation sites and copy numbers were analyzed (Supplementary Fig. S1). For the four signature genes, no significant correlations were found between gene expression and copy number (Supplementary Fig. S2). The expression levels of FGA, GPR39, RRAD and TINAGL1 were significantly higher in the LNM group than in the non-LNM group based on the training dataset GSE74777 (Fig. 5).

Fig. 5. Expression levels of the four genes (FGA, GPR39, RRAD and TINAGL1) between non-metastasis (NM) and metastasis (M).

DISCUSSION

Gene therapy is gaining widespread attention due to its excellent therapeutic efficacy (Park et al., 2019). As effective biomarkers, DNA methylation markers have been identified and used in the clinic (Gu et al., 2021). Biomarkers of methylation genes related to LNM in NSCLC and lung adenocarcinoma have been established (Meng et al., 2013; Chen et al., 2020), but little is known about these genes. Furthermore, effective therapeutic targets for adenocarcinoma may not be effective for LUSC (Li et al., 2020). In this study, we aimed to identify several methylation genes associated with LNM and construct a prognostic prediction model that could be used to predict survival and prognosis in LUSC. FGA, GPR39, RRAD and TINAGL1 were selected as prognostic genes, and were the basis of the prognostic risk model. Patients in the high-risk group showed significantly poor overall survival. The ROC curves for predicting 1/3/5-year survival demonstrated that the prognostic risk model exhibited moderate predictive power. Multivariate Cox regression showed that the prognostic model had independent prognostic value. Finally, a nomogram based on the risk score was demonstrated to have an accurate ability to predict survival (C-index > 0.7).

We identified four optimal prognostic methylation genes related to LNM (FGA, GPR39, RRAD and TINAGL1) in LUSC. High expression of all four genes correlated with poor prognosis. Fibrinogen is the central blood clotting agent in wound healing and its polypeptide chains are encoded by the genes FGA along with FGB and FGG (Henschen et al., 1983). FGA knockout stimulates tumor growth and metastasis in lung cancer (Wang et al., 2020b). DNA methylation decreases FGA promoter activity (Vorjohann et al., 2013), and high expression of methylated FGA correlates with worse prognosis in gallbladder carcinoma (Yang et al., 2021). GPR39 is a G protein-coupled receptor (GPCR). “Cancer driver” GPCRs play a pivotal role in cancer progression and metastasis (Cotton and Claing, 2009; Bar-Shavit et al., 2016). GPR39 has been found to be a potential therapeutic target for oral squamous cell carcinoma (OSCC) (Jiang et al., 2020) and gastric adenocarcinoma (Alén et al., 2016). High GPR39 expression in OSCC promotes yes-associated protein expression and is associated with poor survival (Jiang et al., 2020). RRAD is a small Ras-related GTPase and epigenetic silencing of RRAD is frequent in lung cancer (Suzuki et al., 2007). RRAD inhibition can dramatically increase invasion and metastasis of hepatoma cells (Shang et al., 2016). RRAD methylation is correlated with prognosis in lung cancer, esophageal cancer, prostate cancer, nasopharyngeal carcinoma and breast cancers (Suzuki et al., 2007; Mo et al., 2012; Jin et al., 2013). TINAGL1 (tubulointerstitial nephritis antigen-like 1) is a newly discovered matricellular protein and is involved in cancer progression (Shan et al., 2021). TINAGL1 is generally dysregulated in highly metastatic tumors (Naba et al., 2014). Umeyama et al. (2014) have found that methylation-related TINAGL1 is a potential therapy target with a suppressive role in NSCLC metastasis. In conclusion, these four genes play crucial roles in tumor progression. We found significantly different expression and methylation levels of FGA, GPR39, RRAD and TINAGL1 in the LNM and non-LNM groups in our results, suggesting their involvement in the development of LUSC. In terms of carcinogenesis and the close relationship with prognosis of the four genes in LUSC, the four-gene signature possesses the ability to be a novel biomarker and have potential in prognosis and informing clinical decisions.

A prognostic model was developed based on the four optimal genes. The model predicted that samples in the high-risk group showed significantly poorer overall survival for LUSC patients. This prediction may be related to the changed biological functions arising from the gene expression pattern in carcinogenesis. Generally, LNM is driven by a higher tumor burden, which predicts a poor overall survival (Chen et al., 2020). The underlying mechanisms may be related to neoantigens produced by tumors which activate the immune response (Samstein et al., 2019). However, TMB showed no significant difference between the high- and low-risk groups in the present study, in disagreement with many previous reports (Jiang et al., 2019; Samstein et al., 2019; Xu et al., 2020; Yan and Chen, 2021). The mean TMB value differed among different tumors and was associated with age (Chalmers et al., 2017). Another study did not support utilization of TMB as a biomarker for therapy in any cancer type (McGrail et al., 2021). Although no difference was found in TMB between the two risk groups, a lower gene mutation frequency in the high-risk group was found in the present study. Cancer-related DNA methylation changes participate in maintaining epithelial phenotype, apoptosis and DNA repair (Yang et al., 2018). The mutation frequency of TP53, TTN and CSMD3 tends to be the highest of all the mutated genes, and they are the most common mutated genes in human lung cancer. Mutations in TP53, TTN and CSMD3 genes are closely associated with development of cancer and LNM (Jia et al., 2019; Cai et al., 2020; Kong et al., 2020) and may induce different changes between different risk groups. Enrichment of tumor hallmarks, such as apoptosis and the VEGF signaling pathway, in the high-risk group was found in the present study, and may account for the poor survival in the high-risk group. Overall, we provide a methylation-related prognostic model for LNM patients in LUSC, which is expected to be useful in the clinic.

A four-gene signature (FGA, GPR39, RRAD and TINAGL1) was constructed and demonstrated to be a novel biomarker that can guide clinical decisions. The independent prognostic model is plausible after verification in the GEO database. However, our study has two limitations. First, our prognostic model was not validated by clinical application; further validation of the four-gene signature is necessary. Second, the underlying mechanisms of the four-gene signature in cancer progression should be further explored.

In conclusion, a prognostic model was developed based on four methylation genes (FGA, GPR39, RRAD and TINAGL1) related to LNM for LUSC. Moreover, the results suggested that the model could independently predict the survival of LUSC patients.

MATERIALS AND METHODS

Data collection

We collected data from two independent databases, TCGA (https://portal.gdc.cancer.gov/) and GEO (https://www.ncbi.nlm.nih.gov/), for the following analysis. mRNA expression, clinical, mutation and methylation data for LUSC patients were derived from the TCGA database. The types of expression, mutation and clinical data were FPKM (standardized log2 expression level data), MuTect2 (exon mutation data) and clinical, respectively. Methylation data were obtained from the TCGA-LUSC dataset: DNA methylation - Illumina Human Methylation 450k. A total of 495 samples were retained with N0 as the non-LNM group and N1–N3 as the LNM group. After deleting samples lacking survival time data and samples with a survival time of 0, a total of 488 samples were retained for subsequent analysis. Two datasets, GSE37745 (Botling et al., 2013) and GSE50081 (Der et al., 2014), which were generated on the GPL570 [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array, were retrieved from the GEO database. After deleting samples with missing survival and zero times, the LUSC samples in these two GEO datasets were selected with 66 and 43 LUSC samples in GSE37745 and GSE50081, respectively. The 109 samples were then merged into a combined dataset, which was used as a validation dataset. Furthermore, 105 samples with LNM information in the GSE74777 (Noro et al., 2017) dataset were used to investigate gene expression between the non-LNM group and the LNM group. GSE74777 was generated on the GPL17586 [HTA-2_0] Affymetrix Human Transcriptome Array 2.0 [transcript (gene) version]. The process of this study is shown in Fig. 6.

Fig. 6. Study design. TCGA, the Cancer Genome Atlas; LUSC, lung squamous cell carcinoma; LNM, lymph node metastasis.

Identification of methylation genes related to LNM

Samples were grouped into LNM and non-LNM, and DEGs were extracted using the LIMMA package (Ritchie et al., 2015) in R (P < 0.05 and |log2FC| > 0.5 was used as the threshold). DMGs were extracted using the ChAMP package (Tian et al., 2017) in R (P < 0.05 was the threshold). Genes overlapping between DEGs and DMGs were identified by Venn analysis. Heat maps of overlapping genes were obtained using the pheatmap package (Wang et al., 2014) in R.

Biological process and pathway analysis

The functions of these overlapping genes were evaluated using GO enrichment, which contained BP, CC and MF. The KEGG database was used to explore potential biological pathways of the overlapping genes. The above analyses were undertaken in DAVID (Huang et al., 2009).

Construction and evaluation of the prognostic risk model

A univariate Cox model (Wang et al., 2016) was applied to filter out prognostic genes from overlapping genes. Among these, genes with P < 0.05 were incorporated into a LASSO Cox regression (Tibshirani, 1997) in the LARS package to obtain the optimal prognostic signature. Subsequently, the prognostic risk model was defined as follows:

  
Risk score =  Coef gene × Exp gene ,

where Coefgene and Expgene are the LASSO coefficient and expression level of the corresponding genes, respectively.

The risk score values of each sample in the TCGA training set and the GEO validation set were calculated. All samples in the training set and the validation set were grouped into high- and low-risk groups with the median value as the threshold. The Kaplan–Meier curve of survival package (Wang et al., 2016) was adopted to explore the relationship between risk score and patients’ survival. The performance of the prognostic risk model was evaluated by calculating the ROC AUC at 1, 3 and 5 years using the timeROC package.

Construction of an independent prognostic nomogram

Univariate and multivariate Cox regression analyses were performed to screen independent clinical factors for survival. Log-rank P-value < 0.05 was set as the significant correlation threshold. To evaluate the prognostic independence of the clinical prognostic factors and risk score, a nomogram model to predict 1-, 3- and 5-year survival was constructed using the rms package (Tibshirani, 1997), combining the independent prognostic factors screened with the risk information identified by the prognostic risk model. The C-index coefficient of the nomogram prognosis model was calculated to measure the predictive ability of the model (Shan et al., 2019).

Mutation pattern and GSEA

Mutation data from the TCGA were analyzed using the maftools package (Zhang et al., 2019) to calculate the mutation frequency of each gene in the LUSC samples. The TMB was also calculated. The differences of mutation frequency and TMB between high- and low-risk groups were investigated. GSEA was performed to analyze significant enrichment pathways between the high- and low-risk groups (P < 0.05 was set as the cut-off threshold standard).

Methylation analysis and validation of the prognostic signature

The association of each prognostic signature gene with its corresponding methylation sites and copy number was analyzed using MEXPRESS (https://mexpress.be/). The differential expression of each signature gene between LNM and non-LNM groups was verified by the inter-group T test using expression data from the GSE74777 dataset. Finally, differences of expression and methylation among different stages of lymph node metastasis were investigated.

DECLARATIONS

Funding: This research did not receive any specific grant from funding agencies in the public, commercial or not-for-profit sectors.

Competing interests: The authors declare that they have no competing interests.

Data availability statement: All data generated or analyzed during this study are included in this published article.

Authors’ contributions: Yufei Deng carried out the conception and design of the research, and Xia Xiao participated in the acquisition of data. Lifeng Liu carried out the analysis and interpretation of data. Yin Zhao participated in the design of the study and performed the statistical analysis. Yufei Deng and Yin Zhao conceived the study, participated in its design and coordination, and helped to draft the manuscript and revise it for important intellectual content. All authors read and approved the final manuscript.

REFERENCES
 
© 2023 The Author(s).

This is an open access article distributed under the terms of the Creative Commons BY 4.0 International (Attribution) License (https://creativecommons.org/licenses/by/4.0/legalcode), which permits the unrestricted distribution, reproduction and use of the article provided the original source and authors are credited.
https://creativecommons.org/licenses/by/4.0/legalcode
feedback
Top