[CWB] Aligning parallel corpora

Graham Ranger -- UAPV graham.ranger at univ-avignon.fr
Tue May 7 20:44:20 CEST 2019


Many, many thanks for all this, Andrew, as ever... A fully signposted 
route, which I hope will get me where I want to go!
Best,
Graham.

Le 07/05/2019 à 16:53, Hardie, Andrew a écrit :
>
> Hi Graham,
>
> >> I have not been able to find the English and German Holmes files used 
> in Stefan Evert's tutorial
>
> They aren’t currently in the release package (which I believe is 
> probably in need of an update!) but they are in the SVN tree: 
> https://sourceforge.net/p/cwb/code/HEAD/tree/doc/corpora/encoding_tutorial_data/
>
> >> what exactly is the required input format for the cwb-align command?
>
> cwb-align is an aligner; you don’t need to use it if you are dealing 
> with pre-aligned data. Its *output* format is what you need to 
> generate (to then input into cwb-align-encode, the program that 
> actually creates the a-attribute). This means that in the tutorial, 
> you can skip to sec 8.4.
>
> The format necessary (called “.align file”) is described in the 
> section “OUTPUT FORMAT” of *man cwb-align*. Pasted below.
>
> You would need to output the begin/end points of the s attributes in 
> the first corpus (using cwb-s-decode), and then combine them with the 
> begin/end points of the s attributes in the second corpus, to get the 
> 4-tuples needed by cwb-align-encode. You’d then need to add *1:1* at 
> the end of each line. That five-column file, preceded by the header 
> line described in the man page excerpt below, defines the alignment.
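
The procedure above can be sketched in Python. This is only a minimal sketch, not part of CWB: it assumes that the cwb-s-decode output for each corpus has been saved with one region per line as start<TAB>end (optionally followed by an annotation value), and that the sentences of the two corpora correspond one to one, in order. The function names and the corpus IDs CORP1 and CORP2 are hypothetical; the header layout follows the man page excerpt quoted further down.

```python
def parse_ranges(lines):
    """Turn 'start<TAB>end[<TAB>value]' lines into (start, end) integer pairs.

    Assumption: this is the layout of the saved cwb-s-decode output.
    """
    ranges = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        start, end = line.split("\t")[:2]
        ranges.append((int(start), int(end)))
    return ranges


def build_align_lines(source_id, target_id, s_att, source_ranges, target_ranges):
    """Return the lines of an .align file pairing the n-th regions as 1:1."""
    if len(source_ranges) != len(target_ranges):
        raise ValueError(
            f"{len(source_ranges)} source regions but "
            f"{len(target_ranges)} target regions: not prealigned 1:1"
        )
    # Header: source corpus, s-attribute, target corpus, s-attribute (repeated).
    out = ["\t".join([source_id, s_att, target_id, s_att])]
    for (s1, e1), (s2, e2) in zip(source_ranges, target_ranges):
        out.append(f"{s1}\t{e1}\t{s2}\t{e2}\t1:1")
    return out


# Usage with made-up corpus positions:
source = parse_ranges(["0\t5", "6\t9"])
target = parse_ranges(["0\t4", "5\t11"])
for row in build_align_lines("CORP1", "CORP2", "s", source, target):
    print(row)
```

The length check is there because a single missing or extra sentence in either file would silently shift every later pair.
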
>
> >> If I have .vrt files created in two languages with treetagger, and if 
> I have prealigned these, in such a way that the first sentence of one 
> file corresponds to the first sentence of the other, the second 
> sentence to the second, etc. then is that enough?
>
> Yes, that is enough if you use the method above to create the input 
> file for cwb-align-encode.
>
> >> Or should my files also include numerical information with all 
> sentences numbered?
>
> If your ranges do actually have unique ID attributes <s id=".."> or 
> maybe <s n="..">, you can optionally use *cwb-align-import* instead of 
> cwb-align-encode: see section 8.5. This has a different input format 
> than cwb-align-encode: it identifies matching segments by some ID 
> attribute, rather than by spelling out the corpus token position 
> ranges, as in cwb-align-encode. This is called a *bead file* whereas 
> the other is called an *.align file* (not the clearest terminology, I 
> know).
>
> Your sentences would have to be numbered consecutively through your 
> whole corpus, in which case your bead file could look like this:
>
> CORP1 CORP2    s     {n}
> 1 1
> 2 2
> 3 3
> […]
>
> which might be easier to auto-generate than the cwb-align-encode 
> format based on raw corpus position numbers.
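
Auto-generating such a bead file takes only a few lines of Python. This is a sketch under assumptions: it supposes the columns are separated by tabs (the example above does not show which whitespace is used, so check man cwb-align-import), that both corpora carry <s n=".."> numbered from 1, and that sentence i in one corpus matches sentence i in the other. The function name and the corpus IDs are hypothetical.

```python
def build_bead_lines(source_id, target_id, s_att, key_pattern, n_sentences):
    """Return bead-file lines pairing sentence i with sentence i, for i = 1..n.

    Assumption: fields are tab-separated; verify against man cwb-align-import.
    """
    header = "\t".join([source_id, target_id, s_att, key_pattern])
    pairs = [f"{i}\t{i}" for i in range(1, n_sentences + 1)]
    return [header] + pairs


# Usage: three prealigned sentences, matched on the n attribute.
for row in build_bead_lines("CORP1", "CORP2", "s", "{n}", 3):
    print(row)
```
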
>
> Finally, whatever method you use, don’t forget the need to update the 
> registry file with the new a-attribute.
>
> best
>
> Andrew.
>
> =====================
>
> From man cwb-align:
>
> OUTPUT FORMAT
>
> cwb-align's output file uses CWB's ".align" file format. ".align" 
> files are ASCII text files (although they may contain characters from 
> another encoding if the corpus IDs include non-ASCII characters), 
> formatted as follows.
>
> The first line is a header line, which contains the following four 
> elements, separated by tabs:
>
> ·   The ID of the source corpus
> ·   The ID of the aligned s-attribute (the grid attribute - see above)
> ·   The ID of the target corpus
> ·   The ID of the aligned s-attribute (repeated)
>
> Following the header, each individual line represents a single pair of 
> aligned regions in the corpus.  This is specified by six fields of 
> information, separated by tabs. The six fields are as follows:
>
> ·   The beginning of the range in the source corpus (expressed as a 
>     cpos, i.e. a token number)
> ·   The end of the range in the source corpus (expressed as a cpos)
> ·   The beginning of the range in the target corpus (expressed as a cpos)
> ·   The end of the range in the target corpus (expressed as a cpos)
> ·   The type of alignment: 1:1, 2:1, 1:2 or 2:2
> ·   The quality of the alignment: a score calculated by the alignment 
>     engine
>
> For example,
>
> 140    169    137    180    1:2
>
> means that corpus position ranges [140,169] and [137,180] form a 1:2 
> alignment pair.
>
> (The final field, the quality, is optional in this file format, and is 
> absent in the example above; however, cwb-align will always provide it.)
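
A hand-generated .align file can be sanity-checked against this description before it is passed to cwb-align-encode. The following Python sketch is not a CWB tool: it only checks one data line against the fields listed in the excerpt (four corpus positions, an alignment type, and an optional quality score). The class and function names are hypothetical, and accepting only the four alignment types named in the excerpt is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# The alignment types listed in the man page excerpt.
ALIGNMENT_TYPES = {"1:1", "2:1", "1:2", "2:2"}


@dataclass
class AlignedPair:
    source_start: int
    source_end: int
    target_start: int
    target_end: int
    align_type: str
    quality: Optional[str] = None  # kept as text; its format is not specified


def parse_align_line(line):
    """Parse one tab-separated data line of an .align file (not the header)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (5, 6):
        raise ValueError(f"expected 5 or 6 fields, got {len(fields)}")
    s1, e1, s2, e2 = (int(f) for f in fields[:4])
    if s1 > e1 or s2 > e2:
        raise ValueError("a range ends before it begins")
    if fields[4] not in ALIGNMENT_TYPES:
        raise ValueError(f"unexpected alignment type {fields[4]!r}")
    quality = fields[5] if len(fields) == 6 else None
    return AlignedPair(s1, e1, s2, e2, fields[4], quality)


# Usage, with the example line from the man page:
print(parse_align_line("140\t169\t137\t180\t1:2"))
```
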
>
> *From:* cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> 
> *On Behalf Of* Graham Ranger -- UAPV
> *Sent:* 07 May 2019 13:57
> *To:* Open source development of the Corpus WorkBench 
> <cwb at sslmit.unibo.it>
> *Subject:* [CWB] Aligning parallel corpora
>
> Hello to all,
>
> I have set up a parallel corpus on cqpweb using s-attributes for the 
> visualisation of translations but I would like to be able to do the 
> same thing more cleanly, using alignment attributes. However, try as I 
> might, I cannot seem to follow the instructions in the encoding 
> tutorial. I have not been able to find the English and German Holmes 
> files used in Stefan Evert's tutorial for illustration. Now, what I 
> would like to know is: what exactly is the required input format for 
> the cwb-align command? If I have .vrt files created in two languages 
> with treetagger, and if I have prealigned these, in such a way that 
> the first sentence of one file corresponds to the first sentence of 
> the other, the second sentence to the second, etc. then is that 
> enough? Or should my files also include numerical information with 
> all sentences numbered? I suspect this is a very naive question, but 
> it's one that I do not seem to be able to find my way around without help!
>
> Best,
>
> Graham.
>
>
>
