Pythinsearch: A Simple Web Search Engine

Annam Rupa; Sadhu Swathi Priya; G. Sumana; M. Navya Sri; N. Chandana

doi:10.58723/ijaaiml.v3i1.457

Issue

Vol. 3 No. 1 (2026): International Journal of Advances in Artificial Intelligence and Machine Learning

Published: Mar 30, 2026

Keywords:

Applied Machine Learning,
Anchor Text Analysis,
Information Retrieval,
Lightweight AI Systems,
HITS Algorithm

Annam Rupa

Jawaharlal Nehru Technological,

https://orcid.org/0009-0007-6001-3363

Sadhu Swathi Priya

Vignan's Institute of Management and Technology for Women,

https://orcid.org/0009-0007-9969-1372

G. Sumana

Vignan's Institute of Management and Technology for Women,

https://orcid.org/0009-0008-0133-9731

M. Navya Sri

Vignan's Institute of Management and Technology for Women,

https://orcid.org/0009-0006-3750-8301

N. Chandana

Vignan's Institute of Management and Technology for Women,

https://orcid.org/0009-0002-2583-9075

Abstract

Background: The rapid growth of web content has increased the complexity of retrieving relevant and high-quality information, especially in resource-constrained environments. Traditional keyword-based search engines often fail to capture semantic relationships and structural importance within web documents, leading to suboptimal retrieval performance.
Aims: This study aims to develop a lightweight and modular web search engine, PyThinSearch, that integrates content-based and link-based ranking techniques to improve retrieval effectiveness and efficiency in low-resource and domain-specific environments.
Method: The proposed system uses a hybrid ranking approach that combines TF-IDF, PageRank, and HITS with anchor text analysis to improve contextual relevance. It is built as a modular pipeline with five stages: data crawling, text preprocessing, inverted-index construction, ranking, and query processing. Performance is evaluated with standard information retrieval metrics: Precision, Recall, F1-score, Mean Average Precision (MAP), Normalized Discounted Cumulative Gain (NDCG), and response time.
Result: The experimental results show that the hybrid ranking model consistently outperforms the individual methods, reaching a Precision of 0.78, Recall of 0.75, MAP of 0.77, and NDCG of 0.80. In addition, anchor text analysis significantly improves performance on ambiguous queries, and the inverted index keeps query response times efficient for small- to medium-scale datasets.
Conclusion: PyThinSearch provides an effective and efficient solution for information retrieval by integrating textual relevance and structural importance within a lightweight and modular framework. The proposed system is well-suited for deployment in resource-constrained environments, although future work should focus on incorporating advanced NLP techniques and scalable architectures to improve performance in large-scale applications.
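
The hybrid ranking idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only TF-IDF over an inverted index and PageRank, leaves out HITS and anchor text analysis, and every name in it (`build_index`, `tfidf_scores`, `pagerank`, `hybrid_search`), the weight `alpha`, the damping factor, and the toy corpus are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def tfidf_scores(query, index, n_docs):
    """Sum tf * idf over the query terms for every matching document."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        # Assumed idf variant: log(N / df) + 1, so common terms still count.
        idf = math.log(n_docs / len(postings)) + 1.0
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return scores


def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank; `links` maps each page to its out-links."""
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, outs in links.items():
            # A page without out-links spreads its rank over all pages.
            targets = outs if outs else nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


def normalize(scores):
    """Scale scores so that the largest value becomes 1.0."""
    top = max(scores.values(), default=0.0)
    if top <= 0:
        return dict(scores)
    return {key: value / top for key, value in scores.items()}


def hybrid_search(query, docs, links, alpha=0.7):
    """Rank matching documents by alpha * text score + (1 - alpha) * link score."""
    index = build_index(docs)
    text = normalize(tfidf_scores(query, index, len(docs)))
    link = normalize(pagerank(links))
    combined = {
        doc_id: alpha * score + (1.0 - alpha) * link.get(doc_id, 0.0)
        for doc_id, score in text.items()
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Toy corpus and link graph (hypothetical data).
docs = {
    "a": "python search engine",
    "b": "python web crawler",
    "c": "cooking recipes",
}
links = {"a": ["b"], "b": ["a"], "c": ["a"]}

for doc_id, score in hybrid_search("python search", docs, links):
    print(doc_id, round(score, 3))
```

In this sketch only documents that match at least one query term are ranked, and the link score acts as a tie-breaker weighted by `1 - alpha`. A fuller version along the lines of the abstract would add HITS authority scores and anchor text as further weighted terms.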

How to Cite

Rupa, A., Priya, S. S., Sumana, G., Sri, M. N., & Chandana, N. (2026). Pythinsearch: A Simple Web Search Engine. International Journal of Advances in Artificial Intelligence and Machine Learning, 3(1), 66–76. https://doi.org/10.58723/ijaaiml.v3i1.457

Section

Articles

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
