PyThinSearch: A Simple Web Search Engine
Abstract
Background: The rapid growth of web content has increased the complexity of retrieving relevant and high-quality information, especially in resource-constrained environments. Traditional keyword-based search engines often fail to capture semantic relationships and structural importance within web documents, leading to suboptimal retrieval performance.
Aims: This study aims to develop a lightweight and modular web search engine, PyThinSearch, that integrates content-based and link-based ranking techniques to improve retrieval effectiveness and efficiency in low-resource and domain-specific environments.
Method: The proposed system uses a hybrid ranking approach that combines TF-IDF, PageRank, and HITS with anchor text analysis to improve contextual relevance. It is built as a modular pipeline of five stages: data crawling, text preprocessing, indexing with an inverted index, ranking, and query processing. Performance is evaluated with standard information retrieval metrics: Precision, Recall, F1-score, Mean Average Precision (MAP), Normalized Discounted Cumulative Gain (NDCG), and response time.
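The indexing and ranking stages described above can be illustrated with a minimal sketch. It assumes a toy three-page corpus, a plain weighted sum of max-normalized TF-IDF and PageRank scores, and a blending weight `alpha`; none of these come from the article, which does not state how the scores are combined. HITS and anchor text analysis are left out to keep the example short, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: page id -> (text, outgoing links). Not from the article.
PAGES = {
    "a": ("python search engine tutorial", ["b", "c"]),
    "b": ("search engine ranking with pagerank", ["c"]),
    "c": ("cooking pasta at home", ["a"]),
}


def build_inverted_index(pages):
    """Map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, (text, _links) in pages.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def tfidf_scores(query, index, n_docs):
    """Sum tf * idf over query terms, using idf = log(N / df)."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return scores


def pagerank(pages, damping=0.85, iterations=50):
    """Power iteration; assumes every page has at least one out-link."""
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, (_text, links) in pages.items():
            share = damping * rank[p] / len(links)
            for q in links:
                new_rank[q] += share
        rank = new_rank
    return rank


def hybrid_rank(query, pages, alpha=0.7):
    """Blend normalized TF-IDF with normalized PageRank (alpha is an assumed weight)."""
    index = build_inverted_index(pages)
    text = tfidf_scores(query, index, len(pages))
    link = pagerank(pages)
    max_text = max(text.values(), default=0.0) or 1.0
    max_link = max(link.values())
    combined = {
        d: alpha * text[d] / max_text + (1 - alpha) * link[d] / max_link
        for d in text  # only documents matching at least one query term
    }
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for doc_id, score in hybrid_rank("search engine", PAGES):
        print(doc_id, round(score, 3))
```

In this toy graph, pages "a" and "b" match the query equally well on text alone, so the link-based score breaks the tie in favor of "a", which receives more inbound rank. Only documents found through the inverted index are scored, which is what keeps query time low on small collections.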
Results: In the experiments, the hybrid ranking model consistently outperforms each individual method, reaching a Precision of 0.78, Recall of 0.75, MAP of 0.77, and NDCG of 0.80. Anchor text analysis significantly improves performance on ambiguous queries, and the inverted index keeps query response times low enough for small- to medium-scale datasets.
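For readers unfamiliar with the reported metrics, the following is a minimal sketch of how Precision, Recall, F1-score, MAP, and NDCG are commonly computed for ranked results. The function names and the sample document ids are hypothetical, and the NDCG variant shown (linear gain with a log2 discount) is one common choice; the article does not state which variant or cutoff it used.

```python
import math


def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 for one query."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if hits else 0.0
    return precision, recall, f1


def average_precision(ranked, relevant):
    """Sum precision@k at each rank k holding a relevant doc, divided by |relevant|."""
    relevant_set = set(relevant)
    hits, total = 0, 0.0
    for k, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant_set:
            hits += 1
            total += hits / k
    return total / len(relevant_set) if relevant_set else 0.0


def mean_average_precision(runs):
    """MAP over (ranked, relevant) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def ndcg(ranked, gains, k=None):
    """NDCG@k with linear gain and log2 discount; gains maps doc_id -> graded relevance."""
    k = len(ranked) if k is None else k
    dcg = sum(
        gains.get(d, 0) / math.log2(i + 1)
        for i, d in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    ranked = ["d1", "d2", "d3", "d4"]  # hypothetical system output
    relevant = ["d1", "d3"]            # hypothetical relevance judgments
    print(precision_recall_f1(ranked, relevant))
    print(average_precision(ranked, relevant))
    print(ndcg(ranked, {"d1": 1, "d3": 1}))
```

For the sample ranking, two of the four retrieved documents are relevant, so precision is 0.5 and recall is 1.0, while average precision is (1/1 + 2/3) / 2 = 5/6 because the relevant documents sit at ranks 1 and 3.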
Conclusion: PyThinSearch provides an effective and efficient solution for information retrieval by integrating textual relevance and structural importance within a lightweight and modular framework. The proposed system is well-suited for deployment in resource-constrained environments, although future work should focus on incorporating advanced NLP techniques and scalable architectures to improve performance in large-scale applications.
Article Details
Copyright (c) 2026 Annam Rupa, Sadhu Swathi Priya, G. Sumana, M. Navya Sri, N. Chandana

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.