Comparative Study of CNN and Vision Transformers on Indonesian Traditional Cakes Classification
Abstract
Background of study: Food image classification is a challenging task in computer vision, particularly for traditional food items that exhibit subtle visual variations. While Convolutional Neural Networks (CNNs) have long been the standard for image recognition, their limited ability to capture long-range dependencies has led to the emergence of Vision Transformers (ViTs). In this context, the classification of Indonesian traditional cakes offers a culturally rich yet complex problem for automated image recognition systems.
Aims and scope of paper: This study aims to conduct a comparative analysis between EfficientNet-B0 (CNN-based) and ViT-B/16 (Transformer-based) architectures in classifying eight categories of Indonesian traditional cakes. The research evaluates not only classification accuracy but also the strengths and limitations of each model in handling fine-grained visual distinctions.
Methods: Both models were fine-tuned using the “Kue Indonesia” dataset from Kaggle. The methodology includes image preprocessing, model training with consistent parameters, and evaluation using accuracy, precision, recall, and F1-score. A confusion matrix was also used to visualize misclassifications and analyze per-class performance.
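The evaluation protocol described above can be sketched as follows: build a confusion matrix from true and predicted labels, then derive accuracy and macro-averaged precision, recall, and F1-score. This is a minimal illustrative sketch; the class names and label vectors are hypothetical toy data, not drawn from the "Kue Indonesia" dataset.

```python
# Toy sketch of the evaluation protocol: confusion matrix plus
# accuracy, macro precision, macro recall, and macro F1-score.

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows = true class, columns = predicted class."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def macro_metrics(cm):
    """Return accuracy and macro-averaged precision, recall, F1."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    precisions, recalls, f1s = [], [], []
    for i in range(n):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(n)) - tp  # predicted as i, but wrong
        fn = sum(cm[i]) - tp                       # truly i, but missed
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return (correct / total,
            sum(precisions) / n, sum(recalls) / n, sum(f1s) / n)

# Hypothetical labels for three cake classes (0=lapis, 1=dadar gulung, 2=klepon)
y_true = [0, 0, 1, 1, 2, 2, 2, 0]
y_pred = [0, 1, 1, 1, 2, 2, 0, 0]
acc, prec, rec, f1 = macro_metrics(confusion_matrix(y_true, y_pred, 3))
print(round(acc, 3))  # 0.75
```

Inspecting the off-diagonal cells of the confusion matrix is what reveals the per-class misclassifications the study analyzes (e.g., one class frequently confused with a visually similar one).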
Results: ViT-B/16 achieved slightly higher accuracy (96.25%) than EfficientNet-B0 (95.62%). ViT performed better on classes with subtle visual variations, such as kue lapis and kue dadar gulung, while EfficientNet-B0 offered superior computational efficiency and high accuracy on visually distinct cakes.
Conclusion: Both CNN and ViT models demonstrate strong performance in traditional food classification. ViT is more robust in fine-grained visual analysis, whereas EfficientNet-B0 is preferable for resource-constrained environments. This study highlights the role of AI in supporting digital preservation of culinary heritage.
Article Details
Copyright (c) 2025 Dedi Trisnawarman, Adolf Asih Supriyanton, Viny Christanti Mawardi, Ugochi A Okengwu

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.