Optimised Data Integration using Transformer Model and Resource Description Framework
Abstract
Organizations increasingly rely on data sources that span structured, semi-structured, and unstructured data types. The repositories that hold these data allow large-scale storage for faster ingestion and analytics but pose substantial integration challenges owing to schema and contextual differences. Traditional data integration methods, such as the ontology-based Resource Description Framework (RDF), are often inadequate for these challenges. They struggle in particular with the dynamic evolution of source schemas, context-aware interpretation, and interoperability across heterogeneous data sources. This paper presents an integrated system that overcomes these weaknesses by augmenting RDF knowledge with token embeddings produced by the attention mechanism of a transformer model with relative positional encoding. Data from unstructured sources are encoded as embeddings, whereas structured data are mapped into RDF. The embeddings are then integrated into the RDF using hasEmbedding. Virtual transformations handle schema alignment, and cosine similarity is used to merge similar entities into a unified data view. The model thus integrates contextual knowledge explicitly within RDF triples, thereby improving the semantic representation. The proposed system uses the SPARQL Protocol and RDF Query Language (SPARQL) to query the RDF knowledge efficiently, thus enhancing interoperability across domains. The proposed model attains a schema mapping accuracy of 97.82%, enabling more accurate and meaningful linking of heterogeneous datasets.
Empirical trials on use cases in human activity analysis and flood risk management demonstrate the system’s robustness, scalability, and effectiveness for knowledge discovery, and show that it supports cross-domain integration of heterogeneous data in complex scenarios. The results show that incorporating embeddings into RDF reduces dependence on strict, predefined ontologies, simplifies on-demand schema alignment, and allows unified querying without the need to curate the integrated data into a traditional data warehouse.
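The entity-merging step summarised in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that triples are plain Python tuples, that hasEmbedding acts as a predicate linking an entity to its vector, and that entities are merged when their cosine similarity reaches a threshold. The `ex:` prefix, the function names, the sample vectors, and the threshold of 0.95 are all hypothetical choices made for illustration.

```python
import math

# Assumed predicate name (from the abstract) under a hypothetical "ex:" prefix.
HAS_EMBEDDING = "ex:hasEmbedding"


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def attach_embeddings(triples, embeddings):
    """Add one (entity, hasEmbedding, vector) triple per embedded entity."""
    extra = [(entity, HAS_EMBEDDING, tuple(vec)) for entity, vec in embeddings.items()]
    return list(triples) + extra


def merge_similar_entities(triples, threshold=0.95):
    """Map each entity to a representative whose embedding is close enough,
    then rewrite all triples to use the representatives."""
    vectors = {s: o for s, p, o in triples if p == HAS_EMBEDDING}
    canonical = {}
    for entity in sorted(vectors):
        representatives = sorted(set(canonical.values()))
        match = next(
            (rep for rep in representatives
             if cosine_similarity(vectors[entity], vectors[rep]) >= threshold),
            None,
        )
        canonical[entity] = match if match is not None else entity

    merged = set()
    for s, p, o in triples:
        new_s = canonical.get(s, s)
        new_o = canonical.get(o, o) if isinstance(o, str) else o
        if p == HAS_EMBEDDING and new_s != s:
            continue  # drop the embedding triple of an entity that was merged away
        merged.add((new_s, p, new_o))
    return merged, canonical


# Hypothetical data: two names for the same sensor and one unrelated entity.
triples = [
    ("ex:sensor_1", "ex:locatedIn", "ex:RiverA"),
    ("ex:sensor_one", "ex:measures", "ex:WaterLevel"),
]
embeddings = {
    "ex:sensor_1": [1.0, 0.0, 0.1],
    "ex:sensor_one": [0.98, 0.02, 0.12],
    "ex:RiverA": [0.0, 1.0, 0.0],
}
graph = attach_embeddings(triples, embeddings)
merged, canonical = merge_similar_entities(graph)
```

In this toy example the two sensor names have nearly parallel vectors, so both sets of facts end up under a single subject, while the river entity stays separate. A real system would choose the threshold empirically and would typically store the graph in an RDF library or triple store rather than in tuples.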
Article Details

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
All articles published in JIWE are licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) License. Readers are allowed to
- Share — copy and redistribute the material in any medium or format under the following conditions:
- Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use;
- NonCommercial — You may not use the material for commercial purposes;
- NoDerivatives — If you remix, transform, or build upon the material, you may not distribute the modified material.