Comparative Study of CNN and Vision Transformers on Indonesian Traditional Cakes Classification
Abstract
Background of study: Food image classification is a challenging task in computer vision, particularly for traditional food items that exhibit subtle visual variations. While Convolutional Neural Networks (CNNs) have long been the standard for image recognition, their limited ability to capture long-range dependencies has led to the emergence of Vision Transformers (ViTs). In this context, the classification of Indonesian traditional cakes offers a culturally rich yet complex problem for automated image recognition systems.
Aims and scope of paper: This study aims to conduct a comparative analysis between EfficientNet-B0 (CNN-based) and ViT-B/16 (Transformer-based) architectures in classifying eight categories of Indonesian traditional cakes. The research evaluates not only classification accuracy but also the strengths and limitations of each model in handling fine-grained visual distinctions.
Methods: Both models were fine-tuned using the “Kue Indonesia” dataset from Kaggle. The methodology includes image preprocessing, model training with consistent parameters, and evaluation using accuracy, precision, recall, and F1-score. A confusion matrix was also used to visualize misclassifications and analyze per-class performance.
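The evaluation protocol described above can be sketched as follows: build a confusion matrix from true and predicted labels, then derive accuracy and macro-averaged precision, recall, and F1-score. This is a minimal illustrative sketch; the class names and label vectors are hypothetical toy data, not drawn from the "Kue Indonesia" dataset.

```python
# Toy sketch of the evaluation protocol: confusion matrix plus
# accuracy, macro precision, macro recall, and macro F1-score.

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows = true class, columns = predicted class."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def macro_metrics(cm):
    """Return accuracy and macro-averaged precision, recall, F1."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    precisions, recalls, f1s = [], [], []
    for i in range(n):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(n)) - tp  # predicted as i, but wrong
        fn = sum(cm[i]) - tp                       # truly i, but missed
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return (correct / total,
            sum(precisions) / n, sum(recalls) / n, sum(f1s) / n)

# Hypothetical labels for three cake classes (0=lapis, 1=dadar gulung, 2=klepon)
y_true = [0, 0, 1, 1, 2, 2, 2, 0]
y_pred = [0, 1, 1, 1, 2, 2, 0, 0]
acc, prec, rec, f1 = macro_metrics(confusion_matrix(y_true, y_pred, 3))
print(round(acc, 3))  # 0.75
```

Inspecting the off-diagonal cells of the confusion matrix is what reveals the per-class misclassifications the study analyzes (e.g., one class frequently confused with a visually similar one).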
Results: ViT-B/16 achieved slightly higher accuracy (96.25%) than EfficientNet-B0 (95.62%). ViT performed better on classes with subtle visual variations, such as kue lapis and kue dadar gulung, while EfficientNet-B0 offered superior computational efficiency and high accuracy on visually distinct cakes.
Conclusion: Both CNN and ViT models demonstrate strong performance in traditional food classification. ViT is more robust in fine-grained visual analysis, whereas EfficientNet-B0 is preferable for resource-constrained environments. This study highlights the role of AI in supporting digital preservation of culinary heritage.
Article Details
Copyright (c) 2025 Dedi Trisnawarman, Adolf Asih Supriyanton, Viny Christanti Mawardi, Ugochi A Okengwu

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.