Lightweight String Similarity Approaches for Duplicate Detection in Academic Titles

Fahrudin Mukti Wibowo; Muhammad Zidny Nafan; Muhamad Azrino Gustalika; Harinda Fernando; Muhammad Hussain; Nur Afiqah Sahadun

doi:10.33093/jiwe.2025.4.3.25

Published: 14 October 2025

Keywords:

Duplicate Detection, String Similarity, TF-IDF, Lightweight NLP, Hybrid Models

Fahrudin Mukti Wibowo

Telkom University, Indonesia

https://orcid.org/0000-0001-7681-5255

Muhammad Zidny Nafan

Telkom University, Indonesia

Muhamad Azrino Gustalika

Telkom University, Indonesia

Harinda Fernando

Sri Lanka Institute of Information Technology, Sri Lanka

Muhammad Hussain

University of Sindh, Pakistan

Nur Afiqah Sahadun

Universiti Tun Hussein Onn Malaysia, Malaysia

Abstract

This study addresses the challenge of detecting duplicate final year project (FYP) titles in academic institutions, where minor variations such as word reordering, synonym substitution, and paraphrasing often obscure plagiarism. We systematically evaluate four string similarity algorithms (Jaro-Winkler, Levenshtein Edit Distance, TF-IDF with Cosine Similarity, and Jaccard Similarity) on a synthetic dataset of 250 title pairs representing common duplication patterns. Our experiments show that the character-based methods (Jaro-Winkler and Edit Distance) achieve perfect detection (F1-score = 1.0) for literal matches, including typographical variations and phrase reordering, while TF-IDF demonstrates strong semantic capability (F1-score = 0.95), albeit with some false positives. Jaccard Similarity performs poorly (Recall = 0.40) because it cannot handle paraphrased content. The analysis of score distributions shows a clear separation between duplicates and non-duplicates for the character-based approaches, compared with significant overlap for the set-based methods. Based on these findings, we propose a practical two-stage screening framework: initial high-confidence filtering using Jaro-Winkler (threshold > 0.9), followed by semantic validation with TF-IDF (threshold > 0.8). The study's contribution is to show how existing string similarity techniques can be orchestrated into a lightweight framework tailored to academic title duplication, balancing accuracy with computational efficiency and deployment feasibility in institutional settings. Future work should explore multilingual extensions and validation with real-world title datasets to further enhance the practical applicability of these findings.
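The two-stage framework described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the lowercasing and whitespace tokenization, the smoothed IDF weighting, and the way the stages are combined (a pair is flagged as soon as Jaro-Winkler exceeds 0.9, and only the remaining pairs are checked against the TF-IDF threshold of 0.8) are all assumptions made for illustration. Only the two thresholds come from the abstract.

```python
import math
from collections import Counter


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the common prefix (up to 4)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def tfidf_vectors(titles):
    """Sparse TF-IDF vectors (dicts) using a smoothed IDF; weighting is an assumption."""
    docs = [t.lower().split() for t in titles]
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}
    return [{term: tf * idf[term] for term, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def screen(new_title, existing_titles, jw_threshold=0.9, tfidf_threshold=0.8):
    """Return (title, stage, score) for existing titles flagged as likely duplicates."""
    vectors = tfidf_vectors([new_title] + list(existing_titles))
    new_vec, old_vecs = vectors[0], vectors[1:]
    flagged = []
    for title, vec in zip(existing_titles, old_vecs):
        # Stage 1: high-confidence character-level filter.
        jw = jaro_winkler(new_title.lower(), title.lower())
        if jw > jw_threshold:
            flagged.append((title, "jaro-winkler", jw))
            continue
        # Stage 2: term-level check for pairs that stage 1 did not flag.
        cos = cosine(new_vec, vec)
        if cos > tfidf_threshold:
            flagged.append((title, "tf-idf", cos))
    return flagged


if __name__ == "__main__":
    # Hypothetical titles, invented for this example.
    registry = [
        "Web Based Library Management Sytem",
        "Library Management System Web Based",
        "Deep Learning for Crop Disease Detection",
    ]
    for title, stage, score in screen("Web Based Library Management System", registry):
        print(f"{stage:12s} {score:.3f}  {title}")
```

In this sketch the IDF weights are computed over the submitted title together with the existing registry, so the TF-IDF scores depend on which titles are already stored. An institution adopting a similar scheme would likely want to tune both thresholds on its own data.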

How to Cite

Mukti Wibowo, F., Nafan, M. Z., Gustalika, M. A., Fernando, H., Hussain, M., & Sahadun, N. A. (2025). Lightweight String Similarity Approaches for Duplicate Detection in Academic Titles. Journal of Informatics and Web Engineering, 4(3), 416–426. https://doi.org/10.33093/jiwe.2025.4.3.25

Issue

Vol. 4 No. 3 (2025): October 2025

Section

Thematic (AI-Enhanced Computing and Digital Transformation)

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

All articles published in JIWE are licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) License. Readers are allowed to share (copy and redistribute the material in any medium or format) under the following conditions:

Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
NonCommercial: You may not use the material for commercial purposes.
NoDerivatives: If you remix, transform, or build upon the material, you may not distribute the modified material.
