Review of approaches for paraphrase identification
DOI:
https://doi.org/10.17721/1812-5409.2023/1.10
Keywords:
natural language processing, paraphrase identification, machine learning
Abstract
This article reviews approaches to the problem of paraphrase identification. It describes the relevance of the problem and its use in tasks such as plagiarism detection, text simplification, and information retrieval. Several classes of solutions are considered. The first is rule-based: it relies on manually selected features derived from the fundamental properties of paraphrases. The second is based on lexical similarity and draws on various databases and ontologies. The third covers machine learning-based approaches and the different architectures that can be used to identify paraphrases. The last is based on deep learning and modern transformer models.
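As a rough illustration of the lexical-similarity class of approaches named in the abstract, the sketch below scores two sentences by the Jaccard overlap of their word sets and applies a cut-off. It is not the method of the article or of any cited work: the function names `tokenize`, `jaccard`, and `is_paraphrase`, the simple regular-expression tokenizer, and the threshold of 0.5 are all assumptions chosen for the example.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens (illustrative tokenizer)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two token sets: |A & B| / |A | B|."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        # Two empty inputs are treated as identical by convention.
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def is_paraphrase(a: str, b: str, threshold: float = 0.5) -> bool:
    """Flag a pair as a paraphrase when the overlap reaches the threshold.

    The default threshold is a hypothetical value, not one taken from the article.
    """
    return jaccard(a, b) >= threshold
```

A pure word-overlap score of this kind misses paraphrases that share few surface words, which is one reason lexical approaches are typically combined with external resources or replaced by learned models.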
Pages of the article in the issue: 71 - 78
Language of the article: Ukrainian
License
Copyright (c) 2023 V. N. Vrublevskyi, A. A. Marchenko
This work is licensed under a Creative Commons Attribution 4.0 International License.