A Comparative Evaluation of Predictive Models for Lung Cancer: Insights from Logistic Regression, Naive Bayes, and Random Forest

Muhammad Hafiz Kurniawan; Misinem Misinem

doi:10.58723/ijaaiml.v2i1.378

Issue

Vol. 2 No. 1 (2025): International Journal of Advances in Artificial Intelligence and Machine Learning (March)

Published: Mar 15, 2025

Keywords:

Logistic Regression,
Lung Cancer,
Machine Learning,
Naive Bayes,
Random Forest

Muhammad Hafiz Kurniawan

Universitas Sriwijaya, Palembang

https://orcid.org/0009-0007-5069-0811

Misinem Misinem

Universitas Bina Darma, Palembang

https://orcid.org/0000-0002-7946-4582

Abstract

This study evaluates three machine learning models (Logistic Regression, Naive Bayes, and Random Forest) for predicting lung cancer on a publicly available Kaggle dataset. The data include demographic information, risk factors, and diagnostic imaging features, and show a significant class imbalance between benign and malignant cases. To address this imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was applied. Principal Component Analysis (PCA) was used for dimensionality reduction and Recursive Feature Elimination (RFE) for feature selection, both with the aim of improving model performance. Random Forest, especially when combined with PCA, outperformed the other models, reaching the highest accuracy (96.77%) and a balanced F1 score of 0.50 for the minority class. Logistic Regression also achieved high accuracy but was less effective at predicting the minority class, especially when combined with RFE. Naive Bayes showed moderate performance and was limited by its assumption of feature independence. Applying SMOTE significantly improved the models' ability to handle class imbalance, and PCA proved more effective than RFE in improving model performance. The study highlights the importance of selecting appropriate machine learning models and preprocessing techniques for lung cancer prediction. Random Forest, with its ability to model complex relationships and handle imbalanced data, emerged as the most effective model for this task. These findings underscore the potential of machine learning in medical diagnostics and offer guidance for future research.
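The abstract reports a high overall accuracy (96.77%) alongside a much lower minority-class F1 score (0.50). The short Python sketch below illustrates how the two figures can coexist on an imbalanced test set. The confusion-matrix counts are hypothetical: they were chosen only because they reproduce both reported values, and they are not taken from the paper. The class name `ConfusionCounts` is likewise an illustrative assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts, treating the minority class as 'positive'."""
    tp: int  # minority cases correctly predicted
    fp: int  # majority cases wrongly predicted as minority
    fn: int  # minority cases missed
    tn: int  # majority cases correctly predicted

    @property
    def accuracy(self) -> float:
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total

    @property
    def precision(self) -> float:
        predicted_pos = self.tp + self.fp
        return self.tp / predicted_pos if predicted_pos else 0.0

    @property
    def recall(self) -> float:
        actual_pos = self.tp + self.fn
        return self.tp / actual_pos if actual_pos else 0.0

    @property
    def f1(self) -> float:
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts (NOT from the paper): 62 test samples, of which
# only 2 belong to the minority class. One minority case is found,
# one is missed, and one majority case is flagged by mistake.
example = ConfusionCounts(tp=1, fp=1, fn=1, tn=59)

print(f"accuracy    = {example.accuracy:.2%}")  # 60 correct out of 62
print(f"minority F1 = {example.f1:.2f}")        # precision = recall = 0.5
```

Under these assumed counts, accuracy is dominated by the 59 correctly classified majority cases, while the F1 score depends only on the handful of minority cases. This is why the minority-class F1 score, not accuracy alone, is the more informative figure when classes are imbalanced.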

How to Cite

Hafiz Kurniawan, M., & Misinem, M. (2025). A Comparative Evaluation of Predictive Models for Lung Cancer: Insights from Logistic Regression, Naive Bayes, and Random Forest. International Journal of Advances in Artificial Intelligence and Machine Learning, 2(1), 10–17. https://doi.org/10.58723/ijaaiml.v2i1.378

Section

Articles

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
