Examining the Performance of the Bagging Method in Breast Cancer Classification
DOI: https://doi.org/10.5281/zenodo.13335497
Keywords: Breast cancer, classification, bagging, machine learning
Abstract
This study aims to classify breast cancer using the Bagging classifier, an ensemble method. The breast cancer dataset available on Kaggle was used. The dataset consists of 569 observations and 32 variables, with 212 (37.3 %) being benign and 357 (62.7 %) malignant. First, the gain ratio feature selection method was used to identify the important variables. The method's performance was then examined under 2-fold, 5-fold, and 10-fold cross-validation for different numbers of variables. The analyses were performed in WEKA. Both with all variables included and after removing the insignificant variables, the performance metrics were as follows: accuracy was 95.0791 %, precision, recall, and F-measure were each 0.951, and the ROC area was 0.988. Performance with all variables and with the insignificant variables removed was similar except for the time taken, and both configurations performed better than the other numbers of variables tested. In addition, 2-fold cross-validation gave slightly better classification performance on all metrics except the ROC area. The Bagging method is recommended for the classification of other diseases.
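The evaluation procedure described in the abstract (a bagged classifier scored under 2-fold, 5-fold, and 10-fold cross-validation) can be illustrated with a minimal sketch. It is not the study's WEKA setup. It is written in plain Python and uses decision stumps as a stand-in base learner, synthetic two-feature data instead of the Kaggle dataset, and unstratified folds. It omits the gain ratio feature selection step. All function names (`fit_stump`, `fit_bagging`, `k_fold_accuracy`, `make_toy_data`) and parameter values are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels):
    """Fit a one-split decision stump that minimizes training errors."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not right:  # threshold at the maximum value: no real split
                continue
            l_lab, r_lab = majority(left), majority(right)
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # every row identical: always predict the majority class
        m = majority(labels)
        return (0, float("inf"), m, m)
    return best[1:]


def predict_stump(stump, row):
    """Predict the label on whichever side of the threshold the row falls."""
    f, t, l_lab, r_lab = stump
    return l_lab if row[f] <= t else r_lab


def fit_bagging(rows, labels, n_estimators, rng):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    n = len(rows)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return models


def predict_bagging(models, row):
    """Aggregate the stumps by majority vote."""
    return majority([predict_stump(m, row) for m in models])


def k_fold_accuracy(rows, labels, k, n_estimators=10, seed=0):
    """Pooled accuracy of the bagged model over k cross-validation folds."""
    rng = random.Random(seed)
    order = list(range(len(rows)))
    rng.shuffle(order)
    correct = 0
    for fold in range(k):
        test_idx = set(order[fold::k])  # every k-th shuffled index forms a fold
        train = [i for i in order if i not in test_idx]
        models = fit_bagging([rows[i] for i in train],
                             [labels[i] for i in train], n_estimators, rng)
        correct += sum(predict_bagging(models, rows[i]) == labels[i]
                       for i in test_idx)
    return correct / len(rows)


def make_toy_data(n_per_class=60, seed=1):
    """Synthetic two-class data: two Gaussian features, class means 0 and 3."""
    rng = random.Random(seed)
    rows, labels = [], []
    for label, mean in (("B", 0.0), ("M", 3.0)):
        for _ in range(n_per_class):
            rows.append((rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)))
            labels.append(label)
    return rows, labels


if __name__ == "__main__":
    rows, labels = make_toy_data()
    for k in (2, 5, 10):
        print(f"{k}-fold accuracy: {k_fold_accuracy(rows, labels, k):.4f}")
```

Because the data and the base learner are stand-ins, the printed accuracies say nothing about the reported results. The sketch only shows how bootstrap resampling, majority voting, and k-fold evaluation fit together.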
License
Copyright (c) 2024 MAS Journal of Applied Sciences
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.