Lightweight String Similarity Approaches for Duplicate Detection in Academic Titles
Abstract
This study addresses the challenge of detecting duplicate final year project (FYP) titles in academic institutions, where minor variations such as reordering, synonyms, and paraphrasing often obscure plagiarism. We systematically evaluate four string similarity algorithms (Jaro-Winkler, Levenshtein Edit Distance, TF-IDF with Cosine Similarity, and Jaccard Similarity) on a synthetic dataset of 250 title pairs representing common duplication patterns. Our experiments show that the character-based methods (Jaro-Winkler and Edit Distance) achieve perfect detection (F1-score = 1.0) for literal matches, including typographical variations and phrase reordering, while TF-IDF demonstrates strong semantic capability (F1-score = 0.95), albeit with some false positives. Jaccard Similarity performs poorly (Recall = 0.40) because it cannot handle paraphrased content. The analysis of score distributions shows a clear separation between duplicates and non-duplicates for the character-based approaches, compared with significant overlap for the set-based methods. Based on these findings, we propose a practical two-stage screening framework: initial high-confidence filtering using Jaro-Winkler (threshold > 0.9), followed by semantic validation with TF-IDF (threshold > 0.8). This hybrid approach offers institutions an effective balance between accuracy and computational efficiency for title screening. The study's contribution is to demonstrate how existing string similarity techniques can be orchestrated into a lightweight, two-stage screening framework tailored to academic title duplication, one that remains feasible to deploy in institutional settings. Future work should explore multilingual extensions and validation with real-world title datasets to further enhance the practical applicability of these findings.
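The two-stage screening idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several choices here are assumptions made only for illustration: the function names (`jaro_winkler`, `tfidf_vectors`, `screen_title`), lowercasing and whitespace tokenization, a smoothed IDF formula, and the reading of "two-stage" as "flag a pair if Jaro-Winkler exceeds 0.9, otherwise fall back to TF-IDF cosine similarity above 0.8". Only the two thresholds come from the abstract.

```python
import math
from collections import Counter


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        # Look for an unmatched equal character inside the match window.
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as
    # half a transposition each.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted by a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1 - j)


def tfidf_vectors(titles):
    """Sparse TF-IDF vectors (dicts); smoothed IDF is an assumption."""
    docs = [t.lower().split() for t in titles]
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}
    return [{term: tf * idf[term] for term, tf in Counter(doc).items()}
            for doc in docs]


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def screen_title(new_title, existing_titles,
                 jw_threshold=0.9, tfidf_threshold=0.8):
    """Return (existing_title, stage, score) for every flagged title."""
    vectors = tfidf_vectors([new_title] + list(existing_titles))
    flagged = []
    for title, vec in zip(existing_titles, vectors[1:]):
        # Stage 1: character-level check for near-literal duplicates.
        jw = jaro_winkler(new_title.lower(), title.lower())
        if jw > jw_threshold:
            flagged.append((title, "jaro-winkler", jw))
            continue
        # Stage 2 (assumed fallback): term-level check on remaining pairs.
        cos = cosine(vectors[0], vec)
        if cos > tfidf_threshold:
            flagged.append((title, "tf-idf", cos))
    return flagged


existing = ["Smart Attendance System Using Face Recognition",
            "Blockchain Voting Platform"]
for hit in screen_title("Smart Attendence System Using Face Recognition",
                        existing):
    print(hit)
```

Running the cheap character-level check first means the term-weighting step is computed only for pairs that are not already near-literal matches. Whether the framework in the paper applies the second stage as a fallback or as a confirmation of first-stage hits is not stated in the abstract, so the control flow above should be read as one plausible arrangement.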
Article Details

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) License, as are all articles published in JIWE. Readers may share the material (copy and redistribute it in any medium or format) under the following conditions:
- Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- NonCommercial: You may not use the material for commercial purposes.
- NoDerivatives: If you remix, transform, or build upon the material, you may not distribute the modified material.