Machine Translation Tools

Alignment set - Library and command-line utilities to manage sets of sentence pairs

The Lingua-AlignmentSet distribution is a Perl library (with accompanying command-line utilities) for handling an Alignment Set, i.e. a set of sentence pairs aligned at the word (or phrase) level. It provides methods to display the links, to apply a function to each alignment of the set, to evaluate the alignments against a reference, and more. One objective of the module is to let the user perform all these operations without having to deal with the particular physical format of the Alignment Set. It also provides format conversion methods.
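
To make the notion of a word-level alignment concrete, here is a minimal sketch in Python. It is independent of the Perl library and does not reproduce its API or file formats. The `AlignedPair` class, the function names, the 1-based positions and the use of position 0 for NULL are all assumptions made for illustration. It shows one sentence pair rendered both as a link enumeration and as a matrix.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class AlignedPair:
    """One sentence pair with word-level links (hypothetical structure)."""
    source: List[str]
    target: List[str]
    # Each link is (source position, target position), 1-based.
    # Assumption for this sketch: position 0 stands for NULL.
    links: Set[Tuple[int, int]] = field(default_factory=set)


def enumerate_links(pair: AlignedPair) -> List[str]:
    """Return the links as 'source <-> target' strings, sorted by position."""
    def word(tokens: List[str], pos: int) -> str:
        return "NULL" if pos == 0 else tokens[pos - 1]

    return [
        f"{word(pair.source, s)} <-> {word(pair.target, t)}"
        for s, t in sorted(pair.links)
    ]


def link_matrix(pair: AlignedPair) -> List[str]:
    """Return a text matrix: one row per source word, one column per target word."""
    rows = []
    for s in range(1, len(pair.source) + 1):
        cells = "".join(
            "x" if (s, t) in pair.links else "."
            for t in range(1, len(pair.target) + 1)
        )
        rows.append(f"{cells} {pair.source[s - 1]}")
    return rows


pair = AlignedPair(
    source=["the", "white", "house"],
    target=["la", "casa", "blanca"],
    links={(1, 1), (2, 3), (3, 2)},
)

if __name__ == "__main__":
    print("\n".join(enumerate_links(pair)))
    print("\n".join(link_matrix(pair)))
```

Storing links as a set of position pairs keeps the representation independent of any on-disk format, which is the separation the module aims for.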

 

Available Tools

Nearly all tools have options to specify the input and output formats, the treatment of NULL links, and the range of input lines to be processed. For up-to-date documentation, run the tool with the '-man' option.

  • visualise_alSet-1.1.pl   Visualisation tool: displays the aligned sentence pairs as a link enumeration or as a matrix.
  • evaluate_alSet-1.1.pl   Evaluation tool: calculates Precision, Recall, F-measure and AER (Alignment Error Rate).
  • processAlignment_alSet-1.1.pl   Processing tool: applies a function to each alignment of the Alignment Set. Implemented functions include (see more with the '-man' option):
    • regexpReplace: replaces, on one side of the corpus, a string (defined by a regular expression) with another and updates the links accordingly. Note: this function is based on "Algorithm::Diff", which in some cases does not find the minimal set of links to be changed. To avoid this, use the "replaceWords" function.
    • replaceWords: replaces, on one side of the corpus, a string (of words separated by white space) with another and updates the links accordingly.
    • intersect, getUnion: take, respectively, the intersection and the union of the source-to-target and target-to-source alignments.
    • joined2ManyToMany, manyToMany2joined: respectively remove or introduce underscores between the links of many-to-many groups in the source-to-target alignment.
    • internal functions, such as splice or getAlClusters (which returns the alignment as clusters of positions aligned together). These are intended for use within the AlignmentSet Perl library.
  • adaptAlSetToBilCorpus.pl   Adapts the links of an Alignment Set to a slightly different bilingual corpus. For example, use it if you have a manual alignment reference that you need to adapt to a different version of the corpus (e.g. with different tokenization).
  • orderAlSetAsBilCorpus.pl   Places the sentence pairs of a secondary corpus at the head of the Alignment Set, in the same order.
  • chFormat_alSet-1.1.pl   Processing tool: performs format conversions.
  • symmetrise_alSet-1.1.pl   Detects multi-words based on asymmetries between source-target and target-source alignments (see reference below).
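
As an illustration of what the intersection, union and evaluation steps listed above compute, the following Python sketch works on a single sentence pair. It is not the tools' implementation. It assumes the usual definitions with sure (S) and possible (P) reference links: Precision = |A∩P|/|A|, Recall = |A∩S|/|S|, AER = 1 − (|A∩S| + |A∩P|)/(|A| + |S|). The function name `evaluate`, the link representation and the handling of empty sets are assumptions; the actual tools may differ, for instance in how NULL links are treated.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

Link = Tuple[int, int]  # (source position, target position)


def evaluate(
    hypothesis: Iterable[Link],
    sure: Iterable[Link],
    possible: Optional[Iterable[Link]] = None,
) -> Dict[str, float]:
    """Score one set of links against a reference with sure/possible links."""
    a: Set[Link] = set(hypothesis)
    s: Set[Link] = set(sure)
    # Sure links are also possible links, so P always contains S.
    p: Set[Link] = set(possible or ()) | s

    a_and_p = len(a & p)
    a_and_s = len(a & s)

    # Assumption: an empty denominator yields 0.0 rather than an error.
    precision = a_and_p / len(a) if a else 0.0
    recall = a_and_s / len(s) if s else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    aer = 1 - (a_and_s + a_and_p) / (len(a) + len(s)) if (a or s) else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "aer": aer,
    }


# Two directional alignments of the same sentence pair, both already
# expressed as (source position, target position) links.
src2tgt = {(1, 1), (2, 3), (3, 2), (3, 3)}
tgt2src = {(1, 1), (2, 3), (3, 2)}

intersection = src2tgt & tgt2src  # links found in both directions
union = src2tgt | tgt2src         # links found in either direction

# Hypothetical manual reference.
sure_links = {(1, 1), (3, 2)}
possible_links = {(2, 3)}

if __name__ == "__main__":
    for name, links in (("intersection", intersection), ("union", union)):
        print(name, evaluate(links, sure_links, possible_links))
```

In this toy example the union contains one link that is absent from the reference, so it scores lower precision than the intersection.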

Source code

version 1.1 (recommended)

version 1.0

Acknowledgements

This work has been partially supported by the Spanish government, under grant TIC-2002-04447-C02 (Aliado project), and by the European Union, under grant FP6-506738 (TC-STAR project).

References

For more details about the symmetrize method of the AlignmentSet.pm module, see:
  • Patrik Lambert and Núria Castell. 2004. Alignment of parallel corpora exploiting asymmetrically aligned phrases. In Proc. of the LREC 2004 Workshop on the Amazing Utility of Parallel and Comparable Corpora, Lisbon, Portugal, May 25.

Additional information