Machine Translation Resources

EPPS Word Alignment Trial and Test Data

The bilingual texts have been extracted from the Final Edition of the European Parliament Proceedings, available from the European Parliament's website.

For our reference corpus, 500 sentences of at most 100 words were selected at random from the English-Spanish training corpus used for the March 2004 TC-STAR evaluation (data from July 1999 to September 2004). This collection contains 14691 English words and 15458 Spanish words. To facilitate comparisons between partners, the data set has been split into a development corpus of 100 sentence pairs and a test corpus of 400 sentence pairs.

Details of the annotation procedure are available in the publication cited below. First, three annotators each manually aligned the first 50 sentence pairs of the reference corpus. The alignments were then compared manually, and cases of disagreement or doubt were discussed to refine the guidelines (without necessarily trying to reach an agreement). With the refined guidelines, the three annotators aligned the rest of the reference corpus. Again, the alignments were compared manually to detect inadvertent mistakes. At this stage, differences between the annotators' alignments were considered distinct valid options. These options were merged automatically to build the final reference alignment, as follows. An S (Sure) link, a P (Possible) link and the absence of a link were given scores of 1, 0 and -1 respectively. For each possible link, the annotators' scores were summed. The merged link was set to Sure or absent when the sum was, respectively, strictly greater or strictly less than half the number of annotators. In all other cases, it was set to Possible. For example, if the three annotations were S, S and P, the sum was 2, which is greater than 1.5, so the final link was set to S.
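
The merging rule can be sketched in a few lines of Python. This is only an illustration, not the script used to build the reference; the function name merge_link and the use of None for "no link" are hypothetical. It also makes one assumption the text does not state: the threshold for an absent link is read as minus half the number of annotators, since with the scores above a literal "less than half" would leave no sum that yields Possible for three annotators.

```python
from typing import Optional, Sequence

# Scores from the text: S link = 1, P link = 0, no link = -1.
# None stands for "no link" (a choice made for this sketch only).
SCORES = {"S": 1, "P": 0, None: -1}


def merge_link(annotations: Sequence[Optional[str]]) -> Optional[str]:
    """Merge the labels given by all annotators to one candidate link."""
    n = len(annotations)
    total = sum(SCORES[label] for label in annotations)
    if total > n / 2:
        return "S"
    # Assumption: "absent" requires a sum strictly below -n/2.
    if total < -n / 2:
        return None
    return "P"


if __name__ == "__main__":
    print(merge_link(["S", "S", "P"]))  # the example from the text
```

Under this reading, three annotations S, S and "no link" sum to 1 and merge to P, while two "no link" votes plus one P sum to -2 and merge to no link.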

 

These data have been adapted to new versions of the corpus by automatically detecting the corresponding sentence pairs in the new corpus and updating the links accordingly. Links that could not be updated automatically were updated manually.

Tools

  • To reorder a GIZA++ .A3.final alignment file so that it follows the order of the alignment reference: orderA3as_alignref-2.0.pl (to reorder a file in another format, you may use the Lingua::AlignmentSet toolkit, as explained below).
  • To evaluate automatic alignments against these test data, or to visualise or manipulate alignments, you can use the Lingua::AlignmentSet toolkit, a toolkit for managing sets of aligned corpora, available under the GNU General Public License.
  • To edit and/or visualise the data manually, you may use alpaco_sp.pl, a version of the Alpaco editor modified to support S and P links (note: because of a bug in alpaco_sp.pl, some links do not disappear from the screen when you undo them, but they are removed from the file as expected when you select "Save an Alpaco File").
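
For readers who want to inspect a .A3.final file without the Perl tools, the following Python sketch reads the usual GIZA++ layout of three lines per sentence pair: a "# Sentence pair (N) ..." header, one sentence, and the other sentence annotated with "({ positions })" groups. It is not a port of orderA3as_alignref-2.0.pl; the function names parse_a3 and reorder are hypothetical, and the sketch assumes the NULL token always comes first on the third line.

```python
import re
from typing import Dict, List, Tuple

Link = Tuple[int, int]

HEADER = re.compile(r"# Sentence pair \((\d+)\)")
# One annotated word, e.g. "house ({ 2 3 })" or "NULL ({ })".
TOKEN = re.compile(r"(\S+) \(\{ ((?:\d+ )*)\}\)")


def parse_a3(lines: List[str]) -> Dict[int, List[Link]]:
    """Map each sentence number to its links.

    A link is (position of the word on the third line,
    position in the sentence on the second line), both 1-based.
    """
    records: Dict[int, List[Link]] = {}
    for i in range(0, len(lines) - 2, 3):
        match = HEADER.match(lines[i])
        if match is None:
            raise ValueError(f"unexpected header on line {i + 1}: {lines[i]!r}")
        links: List[Link] = []
        for pos, (_, targets) in enumerate(TOKEN.findall(lines[i + 2])):
            if pos == 0:
                continue  # assumed to be the NULL token: skip unaligned words
            links.extend((pos, int(t)) for t in targets.split())
        records[int(match.group(1))] = links
    return records


def reorder(records: Dict[int, List[Link]], order: List[int]) -> List[List[Link]]:
    """Return the link lists in the sentence order given by `order`."""
    return [records[number] for number in order]
```

Here `order` would be the list of sentence numbers in the order of the reference; how those numbers are obtained depends on your corpus and is left open.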


How to perform the evaluation

  • If your corpus sentences do not correspond exactly to the reference (for example, because of different preprocessing), adapt the reference to your corpus.
  • Extract the reference sentences from your alignment file, in the same order as in the reference.
  • Run the evaluation tool.
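
This page does not list the measures reported by the evaluation tool. As an illustration only, the Python sketch below computes the measures commonly used with references containing S and P links: precision against the Possible links, recall against the Sure links, and alignment error rate (AER). It is not the Lingua::AlignmentSet implementation; the function name evaluate and the representation of links as (source position, target position) pairs are assumptions made for this sketch.

```python
from typing import Dict, Set, Tuple

Link = Tuple[int, int]


def evaluate(hyp: Set[Link], sure: Set[Link], possible: Set[Link]) -> Dict[str, float]:
    """Score a set of hypothesis links against Sure and Possible reference links."""
    # By the usual convention, every Sure link also counts as Possible.
    possible = possible | sure
    in_sure = len(hyp & sure)
    in_possible = len(hyp & possible)
    precision = in_possible / len(hyp) if hyp else 0.0
    recall = in_sure / len(sure) if sure else 0.0
    denominator = len(hyp) + len(sure)
    aer = 1.0 - (in_sure + in_possible) / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "aer": aer}


if __name__ == "__main__":
    scores = evaluate(
        hyp={(1, 1), (2, 2), (3, 4)},
        sure={(1, 1), (2, 2)},
        possible={(3, 3)},
    )
    print(scores)
```

For corpus-level figures, the counts are normally accumulated over all sentence pairs before dividing, rather than averaging per-sentence scores.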


Citation

 

Please cite the following publication if you use this reference alignment in your work.

P. Lambert, A. de Gispert, R. Banchs and J.B. Mariño. 2005. Guidelines for Word Alignment Evaluation and Manual Alignment. Language Resources and Evaluation, 39 (4). pp. 267-285. Springer.


Download Database

  • epps-enes-alignref.v2005-12-05.tgz: this data set corresponds to the "tagged EPPS corpus" distributed for the TC-STAR 2006 evaluation as well as in Openlab 2006.

Additional information