[CWB] Alignment format
Stefan Evert
stefanML at collocations.de
Thu Feb 4 23:34:09 CET 2010
>>> I was looking to the encode tutorial but it misses the alignment
>>> part :)
Yes, I'm afraid that part is one of the things on my todo list that
just don't ever seem to get done. :-(
I'll try to catch up on this ASAP.
>>> I would like to know how is alignment encoded. Is it as a common
>>> attribute?
It's a special type of attribute, similar to a structural attribute,
which aligns regions of the source and target corpus (not individual
words). Alignment attributes are usually employed for sentence
alignment, but you can also align at some other level (e.g. clauses or
paragraphs). Regions that are much smaller than sentences won't be
very useful because of the limitations of CQP, and you can only have a
single alignment attribute for any pair of corpora.
cwb-align-encode expects the start/end corpus positions of aligned
regions as its input format. Such files can most easily be generated
with cwb-align. If you already have aligned data, you have to use a
trick to convert them into the appropriate format with cwb-align (or
work out the correct corpus positions yourself).
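Working out the corpus positions yourself mostly amounts to adding up
sentence lengths. Here is a rough Python sketch, assuming you know the
token count of every sentence in both corpora and which sentences are
aligned with each other. The function names and the "beads" input are
made up for illustration and are not part of CWB; the sketch only
computes the numbers (zero-based, inclusive start/end positions) and
does not write the exact file layout that the CWB tools expect.

```python
from itertools import accumulate


def sentence_regions(lengths):
    """Turn per-sentence token counts into inclusive (start, end) corpus positions.

    Assumes corpus positions are zero-based token offsets and that every
    sentence contains at least one token.
    """
    ends = list(accumulate(lengths))
    return [(end - n, end - 1) for n, end in zip(lengths, ends)]


def aligned_regions(src_lengths, tgt_lengths, beads):
    """Compute (src_start, src_end, tgt_start, tgt_end) for each alignment bead.

    `beads` is a hypothetical input: a list of pairs
    (source sentence indices, target sentence indices), where each side is a
    non-empty, contiguous run of sentence numbers.
    """
    src = sentence_regions(src_lengths)
    tgt = sentence_regions(tgt_lengths)
    rows = []
    for s_idx, t_idx in beads:
        rows.append((
            src[s_idx[0]][0],   # first token of first source sentence
            src[s_idx[-1]][1],  # last token of last source sentence
            tgt[t_idx[0]][0],   # first token of first target sentence
            tgt[t_idx[-1]][1],  # last token of last target sentence
        ))
    return rows
```

For example, with three source sentences of 3, 2 and 4 tokens, two
target sentences of 4 and 5 tokens, and beads [([0], [0]), ([1, 2], [1])],
the second bead covers source positions 3 to 8 and target positions
4 to 8.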
If you have the latest version of the CWB/Perl packages installed,
there is a new script, cwb-align-import, for encoding pre-existing
alignments. You still need to provide input files in a specific
format, but these should be relatively easy to create. Unfortunately,
the script isn't documented yet ...
Please keep nagging if I haven't got round to writing the
documentation by next week! :-)
Best
Stefan