Breast Cancer Detection Using a Dataset Balancing Approach

Submitted date: 2025-03-16 Accepted date: 2025-07-28

Keywords: Imbalanced datasets, automated disease diagnosis, oversampling

Abstract:

Imbalanced datasets are one of the major challenges in automated disease diagnosis. Class imbalance leads to diagnostic failures, which can be particularly dangerous for diseases such as breast cancer. In this study, a modified version of ReliefF, a feature selection algorithm, is proposed. The modifications enable the algorithm to select and balance instances effectively. The proposed algorithm balances the number of instances in breast cancer datasets to improve diagnosis. Instances are first weighted and ranked; the dataset is then balanced using the proposed oversampling method, which is based on the instance weights. The algorithm was applied to two breast cancer datasets: the Wisconsin Breast Cancer Dataset (WBCD) and the Wisconsin Diagnostic Breast Cancer Dataset (WDBCD). The balanced datasets were then classified using various classification algorithms. The classification results show that the performance evaluation metrics improve compared with classification of the original data. The best results are Accuracy = 98.04% and G-Mean = 98.00% on WBCD, and Accuracy = 98.31% and G-Mean = 98.35% on WDBCD. These results indicate the effectiveness of the proposed algorithm in breast cancer diagnosis.
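
The abstract does not give the details of the modified ReliefF weighting or of the oversampling rule, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes that each instance is scored in ReliefF fashion (mean distance to its nearest other-class neighbours minus mean distance to its nearest same-class neighbours), that minority instances are duplicated with probability proportional to that score, and that every class has at least two instances. The function names `instance_weights`, `oversample_by_weight`, and `g_mean` are hypothetical. G-Mean is computed with its usual binary definition, the square root of sensitivity times specificity.

```python
import numpy as np


def instance_weights(X, y, k=5):
    """ReliefF-style score per instance (illustrative assumption):
    mean distance to the k nearest misses minus mean distance to the
    k nearest hits, on min-max scaled features."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0  # avoid division by zero for constant features
    Z = (X - X.min(axis=0)) / span
    # Pairwise Manhattan distances between all instances.
    D = np.abs(Z[:, None, :] - Z[None, :, :]).sum(axis=2)
    w = np.empty(len(y))
    for i in range(len(y)):
        same = y == y[i]
        same[i] = False  # an instance is not its own hit
        other = y != y[i]
        hits = np.sort(D[i, same])[:k]
        misses = np.sort(D[i, other])[:k]
        w[i] = misses.mean() - hits.mean()
    return w


def oversample_by_weight(X, y, weights, seed=0):
    """Duplicate instances of smaller classes until every class matches
    the largest one; higher-weighted instances are drawn more often
    (assumed rule, not taken from the paper)."""
    X, y, weights = np.asarray(X), np.asarray(y), np.asarray(weights)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [X], [y]
    for c, n in zip(classes, counts):
        if n == target:
            continue
        idx = np.flatnonzero(y == c)
        p = weights[idx] - weights[idx].min() + 1e-12  # shift to positive
        p = p / p.sum()
        chosen = rng.choice(idx, size=target - n, replace=True, p=p)
        X_parts.append(X[chosen])
        y_parts.append(y[chosen])
    return np.concatenate(X_parts), np.concatenate(y_parts)


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity (binary case)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pos = y_true == positive
    sensitivity = np.mean(y_pred[pos] == positive)
    specificity = np.mean(y_pred[~pos] != positive)
    return float(np.sqrt(sensitivity * specificity))
```

A typical use would be to call `instance_weights` on the training split only, pass the result to `oversample_by_weight`, train a classifier on the balanced data, and report `g_mean` on an untouched test split, so that duplicated instances never leak into evaluation.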
