Paraphrase Scope and Typology. A Data-Driven Approach from Computational Linguistics
Marta Vila Rigat
Omega-S208 Campus Nord - UPC
Thu Jun 20, 2013
Paraphrasing is generally understood as approximate sameness of meaning between snippets of text with different wording. Paraphrases are omnipresent in natural language, demonstrating all aspects of its multifaceted nature. The pervasiveness of paraphrasing has made it a focus of several tasks in computational linguistics; its complexity, in turn, has left paraphrasing an unresolved challenge. Two basic issues, directly linked to the complex nature of paraphrasing, make its computational treatment particularly difficult: the absence of a precise and commonly accepted definition, and the lack of reference corpora for paraphrasing. Based on the assumption that linguistic knowledge should underlie computational-linguistics research, this thesis aims to take a step forward on these two questions: paraphrase characterization, and paraphrase-corpus building and annotation. The knowledge and resources created are then applied to natural language processing, specifically to automatic plagiarism detection, in order to empirically analyse their potential. This thesis is built as an article compendium comprising six core articles divided into three blocks: (i) paraphrase scope and typology, (ii) paraphrase-corpus creation and annotation, and (iii) paraphrasing in automatic plagiarism detection. In the first block, assuming that paraphrase boundaries are not fixed but depend on the field, task, and objectives, three borderline paraphrase cases are presented: paraphrases involving content loss, pragmatic knowledge, and certain grammatical features. The limits between paraphrasing and related phenomena such as coreference are also analysed. Paraphrase characterization takes on a new dimension if we look at it in extensional terms. We have built a general and linguistically grounded paraphrase typology in line with this approach.
The third issue addressed in this block is paraphrase representation, which we consider essential in order to capture paraphrasing formally. In the second block, the Wikipedia-based Relational Paraphrase Acquisition method (WRPA) is presented. It allows for the automatic extraction of paraphrases expressing a concrete relation from Wikipedia. Using this method, the WRPA corpus, covering different relations and two languages (English and Spanish), was built. A subset of the Spanish WRPA corpus and paraphrases from two English paraphrase corpora that differ in nature were annotated using a new annotation scheme derived from our paraphrase typology. These annotations were validated using the Inter-annotator Agreement for Paraphrase-Type Annotation measures (IAPTA), also developed in the framework of this thesis. In the third and final block, our typology is applied to the field of automatic plagiarism detection, demonstrating that more complex paraphrase phenomena and a high density of paraphrase mechanisms make plagiarism detection more difficult, and that lexical substitutions and text-snippet additions/deletions are the most widely used paraphrase mechanisms when plagiarizing. This provides insights for future research in automatic plagiarism detection and demonstrates, through a concrete example, the value of the knowledge and data provided in this thesis to computational-linguistics research.