Breast Cancer Detection Using a Dataset Balancing Approach
Z. Abbasi
Keywords: Imbalanced datasets, automated disease diagnosis, oversampling
Abstract:
Imbalanced datasets are a major challenge in the automated diagnosis of diseases. Class imbalance leads to diagnostic failures, which can be particularly dangerous for diseases such as breast cancer. This study proposes a modified version of ReliefF, a feature selection algorithm, adapted to select and balance instances effectively. The proposed algorithm weights and ranks the instances, then balances the dataset using the proposed oversampling method based on the instance weights. The algorithm was applied to two breast cancer datasets: the Wisconsin Breast Cancer Dataset (WBCD) and the Wisconsin Diagnostic Breast Cancer Dataset (WDBCD). The balanced datasets were then classified using various classification algorithms. The performance evaluation metrics improved compared with classification of the original data. The best results were Accuracy = 98.04% and G-Mean = 98.00% on WBCD, and Accuracy = 98.31% and G-Mean = 98.35% on WDBCD. These results indicate the effectiveness of the proposed algorithm for breast cancer diagnosis.
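To make the weight, rank, and oversample pipeline concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a Relief-style instance weight (distance to the nearest instance of the other class minus distance to the nearest instance of the same class), binary labels, and oversampling by duplicating the highest-ranked minority instances. The function names `instance_weights`, `oversample_by_weight`, and `g_mean` are hypothetical. G-Mean is computed with its usual binary definition, the square root of sensitivity times specificity.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def instance_weights(X, y):
    """Assumed Relief-style weight: nearest-miss distance minus nearest-hit distance."""
    weights = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        hits = [euclidean(xi, xj)
                for j, (xj, yj) in enumerate(zip(X, y)) if j != i and yj == yi]
        misses = [euclidean(xi, xj) for xj, yj in zip(X, y) if yj != yi]
        nearest_hit = min(hits) if hits else 0.0
        nearest_miss = min(misses) if misses else 0.0
        weights.append(nearest_miss - nearest_hit)
    return weights


def oversample_by_weight(X, y):
    """Duplicate minority instances in descending weight order until both classes match."""
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    deficit = max(counts.values()) - counts[minority]
    w = instance_weights(X, y)
    # Rank minority instances from highest to lowest weight (assumed ranking rule).
    ranked = sorted((i for i, label in enumerate(y) if label == minority),
                    key=lambda i: w[i], reverse=True)
    X_bal, y_bal = list(X), list(y)
    for k in range(deficit):
        i = ranked[k % len(ranked)]  # cycle through the ranking if needed
        X_bal.append(X[i])
        y_bal.append(minority)
    return X_bal, y_bal


def g_mean(y_true, y_pred, positive):
    """Geometric mean of sensitivity and specificity for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


# Toy data (illustrative only, not from WBCD or WDBCD): four majority, two minority instances.
X = [[0.0], [1.0], [2.0], [3.0], [10.0], [11.0]]
y = [0, 0, 0, 0, 1, 1]
X_bal, y_bal = oversample_by_weight(X, y)
```

In this sketch the balanced set contains four instances of each class. Duplication is the simplest weight-driven choice. A method that synthesizes new instances would replace the loop body in `oversample_by_weight`.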