Deep learning models for solving the problem of comparing unstructured textual information
DOI:
https://doi.org/10.17721/1812-5409.2025/1.21
Keywords:
deep learning, universal vector representations of sentences, financial documents, fiscal code, regular expressions
Abstract
This study applies modern deep learning methods, specifically universal multilingual sentence vector representations, to analyzing and comparing the similarity of unstructured text documents. The paper develops an algorithm for determining the similarity of multilingual texts using deep neural networks. The authors propose an unsupervised method for fast analysis and comparison of complex documents on similar topics written in different countries and in different languages. The methodology involves pre-processing the documents, reformatting them for comparison, and applying deep learning methods to detect textual similarities. The algorithm is tested on a comparative analysis of the Tax Code of Ukraine and the German fiscal code. The Ukrainian fiscal code, available in HTML format, was processed using an HTML parsing library and regular expressions, while the German fiscal code, available in PDF format, required special analysis of its unstructured content. During the study, the Ukrainian document was translated into English using the Google Translate API to improve accuracy and to overcome the identified limitations of universal multilingual vector representations. The task requires extensive data pre-processing, which was carried out in this study. The article provides a step-by-step algorithm for the proposed methodology, which results in a correlation table that quantifies the similarity between multilingual text documents on related topics. Additionally, the authors accompany every step with their reasoning on how to interpret intermediate results and how to decide whether the method should be adjusted to the needs or problems of specific documents, for example by applying machine translation or advanced pre-processing.
The authors propose an algorithm for applying this method to the fiscal codes of European Union member states or other countries. The study identifies and describes the limitations of this approach, demonstrates the verification of intermediate results, and supplements the method with additional capabilities to increase its reliability. Recommendations are given on possible applications of the proposed methodology, in particular the development of algorithms for standardizing certain documents of the European Union.
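The pipeline outlined in the abstract (extract text from HTML, split it with regular expressions, embed the fragments, and build a similarity table) can be illustrated with a minimal sketch. It is not the authors' implementation. The `Article N.` heading pattern, all function names, and the use of cosine similarity as the table entry are assumptions made for illustration, and `toy_embed` is a hashed bag-of-words stand-in for a multilingual sentence encoder, used only so the example runs with the standard library.

```python
import math
import re
import zlib
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text from HTML, skipping script and style content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Strip markup and collapse the whitespace runs it leaves behind."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


# Hypothetical heading pattern; a real legal code needs a pattern tuned to it.
ARTICLE_RE = re.compile(r"(?=Article\s+\d+\.)")


def split_articles(text):
    """Split the text in front of every 'Article N.' heading."""
    return [chunk.strip() for chunk in ARTICLE_RE.split(text) if chunk.strip()]


def toy_embed(text, dim=64):
    """Stand-in embedding (hashed bag of words), NOT a sentence encoder."""
    vec = [0.0] * dim
    for word in re.findall(r"\w+", text.lower()):
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    return vec


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_table(docs_a, docs_b, embed=toy_embed):
    """Rows follow docs_a, columns follow docs_b; each cell is a cosine score."""
    vecs_a = [embed(d) for d in docs_a]
    vecs_b = [embed(d) for d in docs_b]
    return [[cosine(u, v) for v in vecs_b] for u in vecs_a]


if __name__ == "__main__":
    html = "<h2>Article 1.</h2><p>Income tax.</p><h2>Article 2.</h2><p>Fees.</p>"
    articles = split_articles(html_to_text(html))
    for row in similarity_table(articles, ["Article 1. Income tax."]):
        print(row)
```

To compare documents across languages, a real multilingual sentence encoder would be passed as the `embed` argument in place of `toy_embed`; the rest of the sketch does not depend on how the vectors are produced.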
Pages of the article in the issue: 157 - 163
Language of the article: English
Copyright (c) 2025 Olena Skliarenko, Anatolii Pashko, Leonid Lytvynenko, Yanina Kolodinska

This work is licensed under a Creative Commons Attribution 4.0 International License.
