Analyzing Bias in Large Language Models: A Quantitative Study Using Sentiment and Demographic Metrics
Abstract
Background of study: The widespread adoption of Large Language Models (LLMs) raises concerns about biases that affect their fairness and credibility. As LLMs increasingly influence areas such as recruitment and customer service, systematic quantitative analysis is essential to identify and mitigate these biases.
Aims and scope of paper: This research quantitatively investigates demographic bias in LLMs by analyzing sentiment polarity scores across different demographic categories. The goal is to provide a statistically validated analysis of sentiment bias and to propose mitigation methods, focusing on GPT-4, LLaMA-2, Claude, and BLOOM.
Methods: Quantitative analysis was performed on GPT-4, LLaMA-2, Claude, and BLOOM using sentiment and demographic data. Sentiment polarity scores for gender and racial/ethnic groups were obtained with VADER and TextBlob. The Demographic Disparity Score and ANOVA were used to assess the magnitude and statistical significance of bias, and Cohen's Kappa was used to evaluate inter-rater reliability between the automated tools and human annotators.
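The sketch below illustrates one way such a pipeline could be assembled in Python with standard implementations of the named tools (vaderSentiment, TextBlob, SciPy, scikit-learn). It is not the authors' code: the prompt texts, group labels, the definition of the Demographic Disparity Score as the gap between group means, and the polarity thresholds are illustrative assumptions.

# Minimal sketch, assuming standard library implementations of the methods
# named in the abstract; data and thresholds are illustrative, not from the paper.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
from scipy.stats import f_oneway
from sklearn.metrics import cohen_kappa_score
import numpy as np

analyzer = SentimentIntensityAnalyzer()

def polarity_scores(text: str) -> dict:
    # Compound VADER polarity and TextBlob polarity, both in [-1, 1].
    return {
        "vader": analyzer.polarity_scores(text)["compound"],
        "textblob": TextBlob(text).sentiment.polarity,
    }

# Hypothetical model outputs grouped by the demographic attribute of the query.
responses_by_group = {
    "female": ["She is a capable and inspiring leader.", "She handled the task well."],
    "male":   ["He completed the assignment.", "He is an adequate employee."],
}

scores_by_group = {
    group: [polarity_scores(text)["vader"] for text in texts]
    for group, texts in responses_by_group.items()
}

# Demographic Disparity Score, taken here as the gap between the highest and
# lowest group mean sentiment (an assumed definition for illustration).
group_means = {group: float(np.mean(scores)) for group, scores in scores_by_group.items()}
disparity = max(group_means.values()) - min(group_means.values())
print("Group means:", group_means, "Disparity:", round(disparity, 3))

# One-way ANOVA: does mean sentiment differ significantly across groups?
f_stat, p_value = f_oneway(*scores_by_group.values())
print(f"ANOVA F={f_stat:.3f}, p={p_value:.4f}")

# Cohen's Kappa between automated labels (thresholded polarity) and human labels.
def to_label(score: float) -> str:
    # 0.05 is a commonly used VADER cutoff, assumed here for illustration.
    return "positive" if score > 0.05 else "negative" if score < -0.05 else "neutral"

auto_labels = [to_label(s) for s in scores_by_group["female"] + scores_by_group["male"]]
human_labels = ["positive", "positive", "neutral", "negative"]  # illustrative annotations
print("Cohen's Kappa:", cohen_kappa_score(auto_labels, human_labels))

In a full study, the same scoring and testing would be repeated per model and per demographic axis (gender, race/ethnicity) over a much larger prompt set.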
Results: Sentiment bias was found in all models, varying by gender and race, and was most pronounced in GPT-4 and Claude. Sentiment scores were consistently higher for queries pertaining to females than for those pertaining to males across all models, with GPT-4 and Claude showing the largest differences. Claude also exhibited a racial sentiment disparity, producing more positive sentiment for queries relating to White people than to Black people. ANOVA confirmed statistically significant sentiment variation across demographic groups in all models, and high inter-rater reliability between the automated tools and human annotators supported the validity of the sentiment analysis.
Conclusion: This study demonstrates demographic bias in GPT-4, LLaMA-2, Claude, and BLOOM, with different sentiment trends across demographic categories. The models produced more positive sentiment for queries pertaining to females and favored certain racial groups. These findings point to bias embedded in the training data, which raises ethical concerns. Identifying and addressing these biases is critical to ensuring fairness and credibility in real-world LLM applications.
Article Details
Copyright (c) 2025 Ramya Mandava

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.