The bilingual texts have been extracted from the Final Edition of the European Parliament Proceedings, available from the European Parliament's website.
For our reference corpus, 500 sentences of at most 100 words were selected at random from the English-Spanish training corpus used for the March 2004 TC-STAR evaluation (data from July 1999 to September 2004). This collection contains 14691 English words and 15458 Spanish words. To facilitate comparisons between partners, the data set has been split into a development corpus of 100 sentence pairs and a test corpus of 400 sentence pairs.
Details of the annotation procedure are available in the publication cited below. First, each of the three annotators manually aligned the first 50 sentence pairs of the reference corpus. The alignments were then compared manually, and cases of disagreement or doubt were discussed to refine the guidelines (without necessarily trying to reach an agreement). With the refined guidelines, the three annotators aligned the rest of the reference corpus. Again, the alignments were compared manually to detect inadvertent mistakes. At this stage, differences between the annotators' alignments were considered distinct valid options.
These different options were merged automatically to build the final reference alignment, as follows. Scores of 1, 0 and -1 were given to an S (Sure) link, a P (Possible) link and the absence of a link, respectively. For each possible link, the annotators' scores were summed. The merged link was set to Sure or absent when the sum was, respectively, strictly greater or strictly less than half the number of annotators. In the other cases, it was set to Possible. For example, if the three annotations were S, S and P, the sum was 2, which is greater than 1.5, so the final link was set to S.
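The merging rule described above can be sketched in a few lines of Python. This is only an illustration, not the script actually used to build the reference: the function names, the dictionary representation of an alignment (word-position pairs mapped to "S" or "P") and the use of None for an absent link are all hypothetical. The sketch also assumes that the threshold for an absent link is minus half the number of annotators, since with a single threshold at plus half an integer sum over three annotators could never yield a Possible link.

```python
# Score given to each annotation of a link: Sure, Possible, or absent (None).
SCORES = {"S": 1, "P": 0, None: -1}


def merge_link(labels):
    """Merge the labels given by all annotators to one candidate link.

    Returns "S", "P", or None (no link in the merged alignment).
    """
    n = len(labels)
    total = sum(SCORES[label] for label in labels)
    # Compare 2 * total with n to avoid fractional thresholds.
    if 2 * total > n:
        return "S"
    # Assumption: the "absent" threshold is MINUS half the number of annotators.
    if 2 * total < -n:
        return None
    return "P"


def merge_alignments(annotations):
    """Merge one sentence pair's alignments from several annotators.

    Each alignment is a dict mapping (source_index, target_index) to "S" or "P";
    a link missing from a dict counts as absent for that annotator.
    """
    candidate_links = set().union(*annotations)
    merged = {}
    for link in sorted(candidate_links):
        label = merge_link([a.get(link) for a in annotations])
        if label is not None:
            merged[link] = label
    return merged
```

With the example from the text, `merge_link(["S", "S", "P"])` sums to 2, which exceeds 1.5, and returns "S". Links that no annotator marked need not be enumerated, because their sum is always below the lower threshold.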
These data have been adapted to new versions of the corpus by automatically detecting the corresponding sentence pairs in the new corpus and updating the links accordingly. Links that could not be updated automatically were updated manually.
Please cite the following publication if you use this reference alignment in your work.
P. Lambert, A. de Gispert, R. Banchs and J.B. Mariño. 2005. Guidelines for Word Alignment Evaluation and Manual Alignment. Language Resources and Evaluation, 39 (4), pp. 267-285. Springer.