Chem-Bio Informatics Journal
Online ISSN : 1347-0442
Print ISSN : 1347-6297
ISSN-L : 1347-0442
Original
A Novel Over-Sampling Method and its Application to Cancer Classification from Gene Expression Data
Xuan Tho DangOsamu HiroseDuong Hung BuiThammakorn SaethangVu Anh TranLan Anh T. NguyenTu Kien T. LeMamoru KuboYoichi YamadaKenji Satou
Author information
JOURNAL FREE ACCESS

2013 Volume 13 Pages 19-29

Details
Abstract

One of the most critical and frequent problems in biomedical data classification is imbalanced class distribution, where samples from the majority class significantly outnumber the minority class. SMOTE is a well-known general over-sampling method used to address this problem; however, in some cases it cannot improve or even reduces classification performance. To address these issues, we have developed a novel minority over-sampling method named safe-SMOTE. Experimental results from two gene expression datasets for cancer classification (i.e., colon-cancer and leukemia) and six imbalanced benchmark datasets from the UCI Machine Learning Repository showed that our method achieved better sensitivity and G-mean values than both the control method (i.e., no over-sampling) and SMOTE. For example, in the colon-cancer dataset, although the sensitivity and specificity achieved by SMOTE (81.36% and 88.63%) were lower than for the control method (81.59% and 89.50%), safe-SMOTE in contrast had these values increase (81.82% and 90.50%). Similarly, the G-mean value of the control (85.45%) decreased to 84.91% when SMOTE was employed, but increased to 86.04% when using safe-SMOTE. In the leukemia dataset, SMOTE was able to improve the sensitivity and G-mean values with respect to the control; however, safe-SMOTE achieved noticeable, even greater improvements for both of these criteria.

Content from these authors
2013 Chem-Bio Informatics Society
Previous article Next article
feedback
Top