Abstract
In this paper, we present models for predicting the five-year survival of breast cancer patients and compare their performance. Materials and Methods: This study examined 37,256 patients who were diagnosed with breast cancer and registered in the SEER program from 1992 to 1997, with follow-up through 2002. We implemented seven common algorithms to develop the prediction models: the most widely used statistical method (Logistic Regression) and six others, namely Artificial Neural Network (ANN), Naive Bayes, Bayes Net, Decision Trees with Naive Bayes, Decision Trees (ID3), and Decision Trees (J48). Results: The accuracy was 85.8±0.2%, 84.5±1.4%, 83.9±0.2%, 83.9±0.2%, 84.2±0.2%, 82.3±0.2%, and 85.6±0.2% for Logistic Regression, ANN, Naive Bayes, Bayes Net, Decision Trees with Naive Bayes, ID3, and J48, respectively. Conclusion: In this study, the Logistic Regression model showed the highest accuracy. J48 had the highest sensitivity and the ANN had the highest specificity. The Decision Tree models tended to show high sensitivity, and the Bayesian models tended to show higher accuracy. These findings suggest that the optimal algorithm may differ depending on the prediction target and the dataset.
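The three measures compared above (accuracy, sensitivity, and specificity) can all be derived from a binary confusion matrix. The following is a minimal sketch, not the study's own code. It assumes that "survived five years" is treated as the positive class, which the abstract does not state, and the class name `ConfusionCounts` and the counts used are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from a binary classifier (hypothetical helper type).

    Assumption: the positive class is "survived five years".
    """
    tp: int  # survivors correctly predicted as survivors
    fn: int  # survivors wrongly predicted as non-survivors
    tn: int  # non-survivors correctly predicted as non-survivors
    fp: int  # non-survivors wrongly predicted as survivors


def accuracy(c: ConfusionCounts) -> float:
    """Fraction of all patients classified correctly."""
    return (c.tp + c.tn) / (c.tp + c.fn + c.tn + c.fp)


def sensitivity(c: ConfusionCounts) -> float:
    """Fraction of actual positives that were predicted positive."""
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    """Fraction of actual negatives that were predicted negative."""
    return c.tn / (c.tn + c.fp)


# Illustrative counts only; these are NOT results from the study.
counts = ConfusionCounts(tp=80, fn=20, tn=60, fp=40)

print(f"accuracy:    {accuracy(counts):.3f}")
print(f"sensitivity: {sensitivity(counts):.3f}")
print(f"specificity: {specificity(counts):.3f}")
```

Because the three measures weight the two kinds of error differently, a model can lead on one measure without leading on the others, which is consistent with different algorithms ranking first on accuracy, sensitivity, and specificity.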