A Comparative Study of Convolutional Neural Networks and Vision Transformers for Fruit Classification


  Malik Jawarneh
  Arief Marwanto
  Dedy Syamsuar
  Maivi Kusnandar

Abstract

Background of study: Accurate fruit classification is vital for agricultural automation, yet traditional methods are often subjective and inefficient. Convolutional Neural Networks (CNNs) are effective but struggle to capture global context in fine-grained tasks. Vision Transformers (ViTs), adapted from natural language processing models, offer global attention mechanisms that may improve classification in complex scenarios.
Aims and scope of paper: This study compares the performance of EfficientNet-B0 (a CNN model) and ViT-B/16 (a Transformer model) on a fruit classification task involving five fruit types. The goal is to evaluate their strengths and weaknesses under controlled experimental conditions using a moderately sized dataset.
Methods: A dataset of 10,000 fruit images was preprocessed with standard augmentation techniques and split into training and validation sets. Both models were fine-tuned using pretrained weights. Performance was evaluated using accuracy, precision, recall, F1-score, and confusion matrices.
Results: EfficientNet-B0 achieved higher overall accuracy (94%) than ViT-B/16 (92%). The CNN model performed consistently across all classes, excelling on the banana and strawberry classes. ViT-B/16 showed superior results for strawberries but struggled with apples. Confusion matrices revealed class-specific strengths and weaknesses.
Conclusion: EfficientNet-B0 is better suited for general fruit classification due to its balanced performance, while ViT-B/16 excels in capturing fine-grained visual features. A hybrid approach may leverage both models’ strengths for enhanced performance in real-world applications.
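The evaluation protocol described above (accuracy, per-class precision, recall, F1-score, and confusion matrices) can be sketched in plain Python. This is an illustrative sketch only: the class names and the tiny dummy label set below are assumptions for demonstration, not the study's actual five fruit types or data.

```python
# Illustrative sketch of the paper's evaluation metrics.
# CLASSES and the sample labels are assumed, not taken from the study.
CLASSES = ["apple", "banana", "grape", "orange", "strawberry"]

def confusion_matrix(y_true, y_pred, classes):
    """Rows = true class, columns = predicted class."""
    idx = {c: i for i, c in enumerate(classes)}
    m = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        m[idx[t]][idx[p]] += 1
    return m

def per_class_metrics(m):
    """Precision, recall, and F1 for each class from a confusion matrix."""
    n = len(m)
    out = []
    for i in range(n):
        tp = m[i][i]
        fp = sum(m[r][i] for r in range(n)) - tp  # predicted as i, actually other
        fn = sum(m[i]) - tp                       # actually i, predicted as other
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        out.append((prec, rec, f1))
    return out

def accuracy(m):
    total = sum(sum(row) for row in m)
    return sum(m[i][i] for i in range(len(m))) / total

# Dummy validation labels, for demonstration only.
y_true = ["apple", "apple", "banana", "grape", "orange", "strawberry"]
y_pred = ["apple", "banana", "banana", "grape", "orange", "strawberry"]
cm = confusion_matrix(y_true, y_pred, CLASSES)
print(f"accuracy = {accuracy(cm):.2f}")
for name, (p, r, f) in zip(CLASSES, per_class_metrics(cm)):
    print(f"{name:10s} P={p:.2f} R={r:.2f} F1={f:.2f}")
```

Reading the confusion matrix this way is how class-specific weaknesses (such as ViT-B/16's difficulty with apples) become visible: off-diagonal mass in a class's row shows which classes it is mistaken for.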

Article Details

How to Cite
Jawarneh, M., Marwanto, A., Syamsuar, D., & Kusnandar, M. (2025). A Comparative Study of Convolutional Neural Networks and Vision Transformers for Fruit Classification. International Journal of Advances in Artificial Intelligence and Machine Learning, 2(2), 104–112. https://doi.org/10.58723/ijaaiml.v2i2.435
