10/02/2014 - Eva Martínez - Document Level Statistical Machine Translation

Title Document Level Statistical Machine Translation
Speaker Eva Martínez
Room Omega-S208 Campus Nord - UPC
Date Mon Feb 10, 2014
Time 12:00h

Machine Translation (MT) is one of the earliest problems in Natural Language Processing and Artificial Intelligence, and it has gained a lot of attention from industry and the research community in the last decade. There are many kinds of MT systems and services, which differ in their usage, linguistic analysis or architecture. Some of them are used every day by millions of users for a variety of purposes. However, most current MT systems are designed in a sentence-level fashion: they translate a document assuming independence among sentences, totally ignoring discourse information. This simplified view has an impact on the quality of the resulting translations, which sometimes show poor cohesion and coherence at the document level. Following the path of some recent works (Tiedemann, 2010; Nagard-Koehn, 2010; Hardmeier et al., 2010; Xiao et al., 2011; Hardmeier et al., 2012), in this research we aim to study the translation problem at the document level, taking cohesion and coherence into account in order to improve statistical MT quality. Several phenomena will be studied, with special attention to lexical semantic and topic cohesion, coreference, agreement, discourse structure, etc. In a complementary direction, we will study how to take document-level aspects of quality into account in current automatic MT evaluation measures.

07/02/2014 - Audi Primadhanty and Pranava Swaroop Madhyastha - Probabilistic Inference for Weakly-Supervised Entity-Relation and Learning Word Embeddings for Language Modelling


Title Probabilistic Inference for Weakly-Supervised Entity-Relation
Title Learning Word Embeddings for Language Modelling

Speaker Audi Primadhanty and Pranava Swaroop Madhyastha
Room Omega-S208 Campus Nord - UPC
Date Fri Feb 07, 2014
Time 12:00h - 14:00h

Abstract for Probabilistic Inference for Weakly-Supervised Entity-Relation

We investigate the task of extracting entities and relations from text documents given only a few examples of the desired entities and relations. The task is relevant for information extraction in new, open domains where annotated corpora are scarce or expensive to obtain. We begin with the task of named entity classification by proposing a probabilistic generative model that uses hidden states. The purpose of the hidden states is to capture commonalities of the contexts in which entities of different types appear. Our hope is that this model will be more robust when it comes to recognizing unseen entities. Our aim is to further extend such techniques to extract relations in any domain for specific target entities and relations in a large unlabeled corpus, requiring only a few examples for each entity and relation type.

Abstract for Learning Word Embeddings for Language Modelling

In Natural Language Processing, state-of-the-art systems for tasks such as parsing, semantic role labeling, word-sense disambiguation, etc. make use of lexical features. Most of these systems are trained on annotated corpora, which are used to gather statistics about each lexical item and its linguistic relations. However, even in large annotated corpora, one is unlikely to observe each lexical item in the context of all its possible relations. In this setting, one would like to exploit a notion of word similarity and assume that similar words behave similarly. The focus of this thesis proposal is to formulate statistical models that improve performance on linguistic prediction tasks by making use of distributional word space representations. In particular, we are interested in designing computationally efficient and robust learning algorithms for lexical embeddings that combine supervised training methods with unsupervised training methods that use a large text corpus to induce a distributional representation. We present preliminary experiments that assess the usefulness of the proposed approach and serve as a proof of concept.

27/11/2013 - Alma Delia Cuevas - Using Frames to converting texts to ontologies. Solutions can be combined?

Title Using Frames to converting texts to ontologies. Solutions can be combined?
Speaker Alma Delia Cuevas (CIC-IPN and UAEM, México)
Room Omega-S208 Campus Nord - UPC
Date Wed Nov 27, 2013
Time 11:00h

Automatically extracting knowledge from natural language documents is an interesting task, since it yields information in a simple manner, without the need for human interpretation, which often consumes large amounts of time. But for a computer to “understand” a document is non-trivial, since natural language is ambiguous and full of synonyms, idioms, anaphora, word inflections, analogies… which people resolve not only through context, but also with previous knowledge, real-world experience and common sense. None of these are salient features of a computer.

Obtaining knowledge automatically from any text (prose, poetry, news, event descriptions, textbooks, cooking recipes, descriptive documents, etc.) and transforming it into a representation that a computer can understand and process is still far from reality. Nevertheless, progress towards this goal is being made with natural language processing (NLP), information retrieval and knowledge acquisition tools.

Given the difficulty of interpreting all types of text, SERCDD (System for Extracting and Representing Knowledge from Descriptive Documents) restricts its scope to descriptive texts: documents describing tools, plants, geographic places, etc.

This talk presents an analysis method that starts with text that has undergone a semantic analysis, specifically tagging and lemmatization. Using the structure of the sentences found in the text, the analysis then tries to identify the relations present in it. For instance, a text describing a carpentry tool will usually contain its definition, a description, its common uses, the parts and materials that form it, and the classification of the tool. The result is a formal representation of the extracted knowledge, embodied in an ontology written in the OM Language. To produce it, the entities (concepts), relations and properties described in the original text must be identified.


