A Comparative Study of Convolutional Neural Networks and Vision Transformers for Fruit Classification


  Malik Jawarneh
  Arief Marwanto
  Dedy Syamsuar
  Maivi Kusnandar

Abstract

Background of study: Accurate fruit classification is vital for agricultural automation, yet traditional methods are often subjective and inefficient. Convolutional Neural Networks (CNNs) are effective but struggle to capture global context in fine-grained tasks. Vision Transformers (ViTs), adapted from natural language processing models, offer global attention mechanisms that may improve classification in complex scenarios.
Aims and scope of paper: This study compares the performance of EfficientNet-B0 (a CNN model) and ViT-B/16 (a Transformer model) on a fruit classification task involving five fruit types. The goal is to evaluate their strengths and weaknesses under controlled experimental conditions using a moderately sized dataset.
Methods: A dataset of 10,000 fruit images was preprocessed with standard augmentation techniques and split into training and validation sets. Both models were fine-tuned using pretrained weights. Performance was evaluated using accuracy, precision, recall, F1-score, and confusion matrices.
Results: EfficientNet-B0 achieved higher overall accuracy (94%) than ViT-B/16 (92%). The CNN model performed consistently across all classes, excelling on the banana and strawberry classes. ViT-B/16 showed superior results for strawberries but struggled with apples. Confusion matrices revealed class-specific strengths and weaknesses.
Conclusion: EfficientNet-B0 is better suited for general fruit classification due to its balanced performance, while ViT-B/16 excels in capturing fine-grained visual features. A hybrid approach may leverage both models’ strengths for enhanced performance in real-world applications.
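The evaluation protocol described above (accuracy, per-class precision, recall, F1-score, and confusion matrices) can be sketched in plain Python. This is an illustrative sketch only: the class names and the tiny dummy label set below are assumptions for demonstration, not the study's actual five fruit types or data.

```python
# Illustrative sketch of the paper's evaluation metrics.
# CLASSES and the sample labels are assumed, not taken from the study.
CLASSES = ["apple", "banana", "grape", "orange", "strawberry"]

def confusion_matrix(y_true, y_pred, classes):
    """Rows = true class, columns = predicted class."""
    idx = {c: i for i, c in enumerate(classes)}
    m = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        m[idx[t]][idx[p]] += 1
    return m

def per_class_metrics(m):
    """Precision, recall, and F1 for each class from a confusion matrix."""
    n = len(m)
    out = []
    for i in range(n):
        tp = m[i][i]
        fp = sum(m[r][i] for r in range(n)) - tp  # predicted as i, actually other
        fn = sum(m[i]) - tp                       # actually i, predicted as other
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        out.append((prec, rec, f1))
    return out

def accuracy(m):
    total = sum(sum(row) for row in m)
    return sum(m[i][i] for i in range(len(m))) / total

# Dummy validation labels, for demonstration only.
y_true = ["apple", "apple", "banana", "grape", "orange", "strawberry"]
y_pred = ["apple", "banana", "banana", "grape", "orange", "strawberry"]
cm = confusion_matrix(y_true, y_pred, CLASSES)
print(f"accuracy = {accuracy(cm):.2f}")
for name, (p, r, f) in zip(CLASSES, per_class_metrics(cm)):
    print(f"{name:10s} P={p:.2f} R={r:.2f} F1={f:.2f}")
```

Reading the confusion matrix this way is how class-specific weaknesses (such as ViT-B/16's difficulty with apples) become visible: off-diagonal mass in a class's row shows which classes it is mistaken for.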

Article Details

How to Cite
Jawarneh, M., Marwanto, A., Syamsuar, D., & Kusnandar, M. (2025). A Comparative Study of Convolutional Neural Networks and Vision Transformers for Fruit Classification. International Journal of Advances in Artificial Intelligence and Machine Learning, 2(2), 104–112. https://doi.org/10.58723/ijaaiml.v2i2.435
