[CWB] Alignment format

Stefan Evert stefanML at collocations.de
Thu Feb 4 23:34:09 CET 2010


>>> I was looking to the encode tutorial but it misses the alignment  
>>> part :)

Yes, I'm afraid that part is one of the things on my todo list that  
just don't ever seem to get done. :-(

I'll try to catch up on this ASAP.

>>> I would like to know how is alignment encoded. Is it as a common
>>> attribute?

It's a special type of attribute, similar to a structural attribute,  
which aligns regions of the source and target corpus (not individual  
words).  Alignment attributes are usually employed for sentence  
alignment, but you can also align at some other level (e.g. clauses or  
paragraphs).  Regions that are much smaller than sentences won't be  
very useful because of the limitations of CQP, and you can only have a  
single alignment attribute for any pair of corpora.

cwb-align-encode expects the start/end corpus positions of aligned
regions as input format.  Such files can most easily be generated with
cwb-align.  If you already have aligned data, you have to use a trick
to convert them into the appropriate format with cwb-align (or work
out the correct corpus positions yourself).
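
To illustrate the "work out the corpus positions yourself" option, here
is a minimal Python sketch.  It assumes you already know the start/end
corpus positions of every sentence in both corpora and that your
existing alignment is given as pairs of sentence indices.  All function
names and the sample data are made up for illustration, and the
tab-separated output is only a stand-in, not the exact file layout the
CWB tools expect.

```python
# Sketch only: names, sample data and output layout are assumptions,
# not the documented CWB alignment file format.

def region_span(regions, indices):
    """Return (start, end) corpus positions covering consecutive regions."""
    first, last = indices[0], indices[-1]
    return regions[first][0], regions[last][1]


def beads_to_positions(src_regions, tgt_regions, beads):
    """Convert alignment beads of sentence indices into corpus positions.

    src_regions / tgt_regions: lists of (start, end) corpus positions,
    one entry per sentence, in corpus order.
    beads: list of (src_indices, tgt_indices) pairs, each a non-empty
    list of consecutive sentence indices.
    """
    rows = []
    for src_idx, tgt_idx in beads:
        s_start, s_end = region_span(src_regions, src_idx)
        t_start, t_end = region_span(tgt_regions, tgt_idx)
        rows.append((s_start, s_end, t_start, t_end))
    return rows


# Hypothetical sentence boundaries (corpus positions) and alignment beads.
src_sentences = [(0, 4), (5, 11), (12, 15)]
tgt_sentences = [(0, 5), (6, 9), (10, 13), (14, 18)]
beads = [([0], [0]), ([1], [1, 2]), ([2], [3])]  # 1:1, 1:2, 1:1

rows = beads_to_positions(src_sentences, tgt_sentences, beads)
for row in rows:
    print("\t".join(str(n) for n in row))
```

The second bead shows why regions rather than words are aligned: one
source sentence maps to the span covering two target sentences.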

If you have the latest version of the CWB/Perl packages installed,  
there is a new script cwb-align-import for encoding pre-existing  
alignments.  You still need to provide input files in a specific  
format, but these should be relatively easy to create.  Unfortunately,  
the script isn't documented yet ...

Please keep nagging if I haven't got round to writing the  
documentation by next week! :-)

Best
Stefan




