Review of approaches for paraphrase identification

Authors

V. N. Vrublevskyi, A. A. Marchenko

DOI:

https://doi.org/10.17721/1812-5409.2023/1.10

Keywords:

natural language processing, paraphrase identification, machine learning

Abstract

This article reviews approaches to paraphrase identification. It describes the relevance of the problem and its use in tasks such as plagiarism detection, text simplification, and information retrieval. Several classes of solutions are considered. The first is rule-based: it relies on manually selected features derived from the fundamental properties of paraphrases. The second is based on lexical similarity and draws on various databases and ontologies. The article also presents machine learning-based approaches and describes the different architectures that can be used to identify paraphrases. The last class considered is based on deep learning and modern transformer models.

Pages of the article in the issue: 71 - 78

Language of the article: Ukrainian

Published

2023-07-13

How to Cite

Vrublevskyi, V. N., & Marchenko, A. A. (2023). Review of approaches for paraphrase identification. Bulletin of Taras Shevchenko National University of Kyiv. Physical and Mathematical Sciences, (1), 71–78. https://doi.org/10.17721/1812-5409.2023/1.10

Section

Computer Science and Informatics