A Comparative Evaluation of Predictive Models for Lung Cancer: Insights from Logistic Regression, Naive Bayes, and Random Forest
Abstract
This study evaluates the performance of three machine learning models (Logistic Regression, Naive Bayes, and Random Forest) in predicting lung cancer using a publicly available dataset from Kaggle. The data include demographic information, risk factors, and diagnostic imaging features, with a significant class imbalance between benign and malignant cases. To address this imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was applied. In addition, Principal Component Analysis (PCA) was used for dimensionality reduction and Recursive Feature Elimination (RFE) for feature selection, with the aim of improving model performance. Random Forest, especially when combined with PCA, outperformed the other models, achieving the highest accuracy (96.77%) and an F1 score of 0.50 for the minority class. Although Logistic Regression achieved high accuracy, it was less effective at predicting the minority class, especially when combined with RFE. Naive Bayes showed moderate performance but was limited by its assumption of feature independence. Applying SMOTE significantly improved the models' ability to handle class imbalance, and PCA proved more effective than RFE in improving model performance. This study highlights the importance of selecting appropriate machine learning models and preprocessing techniques for lung cancer prediction. Random Forest, with its ability to model complex relationships and handle imbalanced data, emerged as the most effective model for this task. These findings underscore the potential of machine learning in medical diagnostics and provide valuable insights for future research.
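The gap between the reported accuracy (96.77%) and the minority-class F1 score (0.50) is easier to see with a small worked example. The sketch below computes both metrics from a confusion matrix in plain Python. The counts are hypothetical: they were chosen only because they reproduce the two reported figures, and the study's actual confusion matrix is not given in this abstract. The function name `binary_metrics` is likewise an illustrative assumption, and the minority class is treated as the positive class.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy plus precision, recall, and F1 for the positive class.

    Here the positive class stands for the minority class.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts (NOT taken from the study): 62 test cases,
# 2 of which belong to the minority class.
hypothetical = binary_metrics(tp=1, fp=1, fn=1, tn=59)
print(f"accuracy    = {hypothetical['accuracy']:.4f}")  # 0.9677
print(f"minority F1 = {hypothetical['f1']:.2f}")        # 0.50

# Baseline that always predicts the majority class on the same 62 cases.
always_majority = binary_metrics(tp=0, fp=0, fn=2, tn=60)
print(f"baseline accuracy    = {always_majority['accuracy']:.4f}")  # 0.9677
print(f"baseline minority F1 = {always_majority['f1']:.2f}")        # 0.00
```

Under these assumed counts, a classifier that never predicts the minority class reaches the same accuracy with an F1 of 0, which is why the minority-class F1 is the more informative figure on imbalanced data.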
Article Details
Copyright (c) 2025 Muhammad Hafiz Kurniawan, Misinem Misinem

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.