Knowledge-based Word Tokenization System for Urdu

Asif Khan
Khairullah Khan
Wahab Khan
Sadiq Nawaz Khan
Rafiul Haq


Word tokenization, a foundational step in natural language processing (NLP), is critical for tasks such as part-of-speech tagging, named entity recognition, and parsing, as well as for various standalone NLP applications. The exponential growth of textual data on the World Wide Web demands sophisticated tools for effective processing. Urdu, spoken widely across the globe, is experiencing a surge in speakers, emphasizing the urgent need for advanced Urdu Language Processing (ULP) systems. However, Urdu, labeled as a low-resource language, presents unique challenges due to its distinct writing style, the absence of capitalization, and the prevalence of compound words. This study introduces a knowledge-based word tokenization system tailored for Urdu. Central to this system is a maximum matching model with forward and reverse variants, which sets it apart from conventional approaches. The novelty of the system lies in its holistic approach, which integrates knowledge-based techniques, dual-variant maximum matching, and heightened adaptability to low-resource language challenges compared to traditional machine learning (ML) approaches. Notably, the system eliminates the need for a features file and pre-labelled datasets, streamlining the tokenization process. To evaluate the proposed model, an analysis was conducted on a dataset comprising 100 sentences with 5,000 Urdu words, yielding an accuracy of 97%. This research contributes to Urdu language processing by offering a solution to the tokenization difficulties posed by Urdu's linguistic attributes.
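
The forward and reverse maximum matching variants named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy Latin-script lexicon, the `max_len` parameter, and the single-character fallback for unmatched text are all assumptions made for illustration. A real system would use an Urdu lexicon and its own handling of unknown words.

```python
# Toy lexicon (hypothetical); a real system would load an Urdu word list.
LEXICON = {"the", "them", "theme", "end", "mend"}


def forward_max_match(text, lexicon, max_len):
    """Greedy left-to-right segmentation: at each position take the
    longest lexicon entry, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # size == 1 is the assumed fallback for unknown characters
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


def reverse_max_match(text, lexicon, max_len):
    """Greedy right-to-left segmentation: at each position take the
    longest lexicon entry ending there, then restore reading order."""
    tokens = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                j -= size
                break
    tokens.reverse()
    return tokens


if __name__ == "__main__":
    sample = "themend"  # unsegmented input, as with space omission
    print(forward_max_match(sample, LEXICON, 5))
    print(reverse_max_match(sample, LEXICON, 5))
```

On this toy input the two directions disagree: the forward pass commits early to "theme" and leaves unmatched characters, while the reverse pass recovers "the" and "mend". Running both variants and comparing their outputs is one common way such disagreements are detected; how the proposed system chooses between them is not stated in the abstract.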

Article Details

How to Cite
Khan, A., Khan, K., Khan, W., Khan, S. N., & Haq, R. (2024). Knowledge-based Word Tokenization System for Urdu. Journal of Informatics and Web Engineering, 3(2), 86–97.
Regular issue

