Comparative Study of CNN and Vision Transformers on Indonesian Traditional Cakes Classification

Main Article Content

  Dedi Trisnawarman
  Adolf Asih Supriyanton
  Viny Christanti Mawardi
  Ugochi A Okengwu

Abstract

Background of the study: Food image classification is a challenging task in computer vision, particularly when dealing with traditional food items that exhibit subtle visual variations. While Convolutional Neural Networks (CNNs) have long been the standard for image recognition, their limited ability to capture long-range dependencies has led to the emergence of Vision Transformers (ViTs). In this context, the classification of Indonesian traditional cakes presents a culturally rich yet visually complex problem for automated image recognition systems.
Aims and scope of paper: This study aims to conduct a comparative analysis between EfficientNet-B0 (CNN-based) and ViT-B/16 (Transformer-based) architectures in classifying eight categories of Indonesian traditional cakes. The research evaluates not only classification accuracy but also the strengths and limitations of each model in handling fine-grained visual distinctions.
Methods: Both models were fine-tuned on the “Kue Indonesia” dataset from Kaggle. The methodology includes image preprocessing, model training with consistent parameters, and evaluation using accuracy, precision, recall, and F1-score. A confusion matrix was also used to visualize misclassifications and analyze per-class performance. A minimal sketch of this pipeline is shown after the abstract.
Results: ViT-B/16 achieved slightly higher accuracy (96.25%) than EfficientNet-B0 (95.62%). ViT performed better on classes with subtle visual variations, such as kue lapis and kue dadar gulung, while EfficientNet-B0 showed superior efficiency and high accuracy on visually distinct cakes.
Conclusion: Both CNN and ViT models demonstrate strong performance in traditional food classification. ViT is more robust in fine-grained visual analysis, whereas EfficientNet-B0 is preferable for resource-constrained environments. This study highlights the role of AI in supporting digital preservation of culinary heritage.
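
The abstract does not report the exact training configuration, so the sketch below only illustrates the pipeline described in Methods: fine-tuning ImageNet-pretrained EfficientNet-B0 and ViT-B/16 backbones on an eight-class image-folder dataset and evaluating them with accuracy, precision, recall, F1-score, and a confusion matrix. The dataset paths, 224x224 input size, batch size, learning rate, and epoch count are illustrative assumptions, not the authors' reported settings.

```python
# Minimal fine-tuning/evaluation sketch (PyTorch + torchvision); paths and
# hyperparameters are assumptions, not the study's exact configuration.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms
from sklearn.metrics import classification_report, confusion_matrix

NUM_CLASSES = 8  # eight Indonesian traditional cake categories
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Preprocessing: resize to the 224x224 input both backbones expect and
# normalize with ImageNet statistics (assumed, since both models start
# from ImageNet-pretrained weights).
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

# Placeholder paths for the Kaggle "Kue Indonesia" dataset, arranged as
# one folder per class (ImageFolder layout).
train_ds = datasets.ImageFolder("kue_indonesia/train", transform=preprocess)
test_ds = datasets.ImageFolder("kue_indonesia/test", transform=preprocess)
train_dl = DataLoader(train_ds, batch_size=32, shuffle=True)
test_dl = DataLoader(test_ds, batch_size=32)

def build_model(name: str) -> nn.Module:
    """Load an ImageNet-pretrained backbone and replace its classifier head."""
    if name == "efficientnet_b0":
        model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.DEFAULT)
        model.classifier[1] = nn.Linear(model.classifier[1].in_features, NUM_CLASSES)
    elif name == "vit_b_16":
        model = models.vit_b_16(weights=models.ViT_B_16_Weights.DEFAULT)
        model.heads.head = nn.Linear(model.heads.head.in_features, NUM_CLASSES)
    else:
        raise ValueError(f"unknown architecture: {name}")
    return model.to(DEVICE)

def fine_tune(model: nn.Module, epochs: int = 10, lr: float = 1e-4) -> None:
    """Train with identical settings for both architectures (values assumed)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in train_dl:
            images, labels = images.to(DEVICE), labels.to(DEVICE)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

@torch.no_grad()
def evaluate(model: nn.Module) -> None:
    """Report per-class precision/recall/F1, accuracy, and the confusion matrix."""
    model.eval()
    y_true, y_pred = [], []
    for images, labels in test_dl:
        logits = model(images.to(DEVICE))
        y_pred.extend(logits.argmax(dim=1).cpu().tolist())
        y_true.extend(labels.tolist())
    print(classification_report(y_true, y_pred, target_names=test_ds.classes))
    print(confusion_matrix(y_true, y_pred))

for arch in ("efficientnet_b0", "vit_b_16"):
    net = build_model(arch)
    fine_tune(net)
    evaluate(net)
```

Swapping only the classification head while sharing a single training and evaluation loop keeps the two architectures under consistent parameters, which is the comparison setup the Methods paragraph describes.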

Article Details

How to Cite
Trisnawarman, D., Supriyanton, A. A., Mawardi, V. C., & Okengwu, U. A. (2025). Comparative Study of CNN and Vision Transformers on Indonesian Traditional Cakes Classification. International Journal of Advances in Artificial Intelligence and Machine Learning, 2(2), 86–94. https://doi.org/10.58723/ijaaiml.v2i2.405
Section
Articles
