Review of approaches for paraphrase identification




natural language processing, paraphrase identification, machine learning


The article is devoted to a review of approaches to solving the problem of identifying paraphrases. This problem's relevance and use in tasks such as plagiarism detection, text simplification, and information search are described. Several classes of solutions were considered. The first approach is based on manual rules - it uses manually selected features based on the fundamental properties of paraphrases. The second approach is based on lexical similarity and various databases and ontologies. Machine learning-based approaches are also presented in this paper and describe different architectures that can be used to identify paraphrases. The last approach considered is based on deep learning and modern models of transformers.

Pages of the article in the issue: 71 - 78

Language of the article: Ukrainian


AMAZON WEB SERVICES, INC. (2019). Chatbots in Call Centers – Amazon Web Services (AWS). [online] Available at:

CORTES, C. and VAPNIK V. (1995). Support-vector networks. Machine learning, 20(3), pp.273–297.

YIN, W., KANN, K., YU, M., & SCHÜTZE, H. (2017). Comparative Study of CNN and RNN for Natural Language Processing.

DEVLIN, J., CHANG, M.-W., LEE, K., & TOUTANOVA, K. (2018). BERT: Pre- training of Deep Bidirectional Transformers for Language Understanding. arXiv.

RADFORD, A., & NARASIMHAN, K. (2018). Improving Language Understanding by Generative Pre-Training.

FELLBAUM, C. (1998). WordNet. An Electronic Lexical Database.

LANDAUER, T., FOLTZ, P., & LAHAM, D. (1998). An Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259–284.

BOONTHUM, C. (2004). iSTART: Paraphrase Recognition. Proceedings of the ACL Student Research Workshop, 31–36.

SOWA, J. F. (1992). Conceptual graphs as a universal knowledge representation. Computers & Mathematics with Applications, 23(2), 75–93.

SLEATOR, D., & TEMPERLEY, D. (1995). Parsing English with a Link Grammar. CoRR, abs/cmp-lg/9508004.

MIHALCEA, R., CORLEY, C., & STRAPPARAVA, C. (2006). Corpus-based and Knowledge-based Measures of Text Semantic Similarity. Proceedings of the National Conference on Artificial Intelligence, 1.

TURNEY, P. D. (2001). Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL. Proceedings of the 12th European Conference on Machine Learning, 491–502.

LEACOCK, C., CHODOROW, M., & MILLER, G. A. (1998). Using Corpus Statistics and WordNet Relations for Sense Identification. Computational Linguistics, 24(1), 147–165.

WU, Z., & PALMER, M. (1994). Verb Semantics and Lexical Selection. 32nd Annual Meeting of the Association for Computational Linguistics, 133–138.

DOLAN, B., QUIRK, C., & BROCKETT, C. (2004). Unsupervised Construction of Large Paraphrase Corpora: Exploiting Massively Parallel News Sources. COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics, 350–356.

MADNANI, N., TETREAULT, J., & CHODOROW, M. (2012). Re-examining Machine Translation Metrics for Paraphrase Identification. Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 182–190.

PAPINENI, K., ROUKOS, S., WARD, T., & ZHU, W.-J. (2002). Bleu: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318.

DODDINGTON, G. R. (2002). Automatic evaluation of machine translation quality using n-gram co-occurrence statistics.

SNOVER, M., DORR, B., SCHWARTZ, R., MICCIULLA, L., & MAKHOUL, J. (2006). A Study of Translation Edit Rate with Targeted Human Annotation. Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, 223–231.

AHA, D. W., KIBLER, D., & ALBERT, M. K. (1991). Instance-based learning algorithms. Machine Learning, 6(1), 37–66.

REIMERS, N., & GUREVYCH, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP- IJCNLP), 3982–3992.

GANITKEVITCH, J., VAN DURME, B., & CALLISON-BURCH, C. (2013). PPDB: The Paraphrase Database. Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 758–764.

MARELLI, M., BENTIVOGLI, L., BARONI, M., BERNARDI, R., MENINI, S., & ZAMPARELLI, R. (2014). SemEval-2014 Task 1: Evaluation of Compositional Distributional Semantic Models on Full Sentences through Semantic Relatedness and Textual Entailment. Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), 1–8.

KAGGLE. (2017) Quora Duplicate Questions [Online] – Available from:

WIETING, J., & GIMPEL, K. (2018). ParaNMT-50M: Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 451–462.




How to Cite

Vrublevskyi, V. N., & Marchenko, A. A. (2023). Review of approaches for paraphrase identification. Bulletin of Taras Shevchenko National University of Kyiv. Physical and Mathematical Sciences, (1), 71–78.



Computer Science and Informatics