[CWB] Aligning parallel corpora
Graham Ranger -- UAPV
graham.ranger at univ-avignon.fr
Tue May 7 20:44:20 CEST 2019
Many, many thanks for all this, Andrew, as ever... A fully signposted
route, which I hope will get me where I want to go!
Best,
Graham.
On 07/05/2019 at 16:53, Hardie, Andrew wrote:
>
> Hi Graham,
>
> >> I have not been able to find the English and German Holmes files used
> in Stefan Evert's tutorial
>
> They aren’t currently in the release package (which I believe is
> probably in need of an update!) but they are in the SVN tree:
> https://sourceforge.net/p/cwb/code/HEAD/tree/doc/corpora/encoding_tutorial_data/
>
> >> what exactly is the required input format for the cwb-align command?
>
> cwb-align is an aligner; you don’t need to use it if you are dealing
> with pre-aligned data. Its *output* format is what you need to
> generate (to then input into cwb-align-encode, the program that
> actually creates the a-attribute). This means that, in the tutorial,
> you can skip to sec 8.4.
>
> The format necessary (called “.align file”) is described in the
> section “OUTPUT FORMAT” of *man cwb-align*. Pasted below.
>
> You would need to output the begin/end points of the s-attribute in
> the first corpus (using cwb-s-decode), and then combine that with the
> begin/end points of the s-attribute in the second corpus, to get the
> 4-tuples needed by cwb-align-encode. And then you’d need to add *1:1*
> at the end of each line. That 5-column file defines the alignment.
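
A minimal sketch of that combination step in Python. It assumes the cwb-s-decode output for each corpus has been saved as text with one "start<TAB>end" pair per sentence, and that both corpora contain the same number of sentences in the same order. The function names and the corpus IDs in the usage note are illustrative and not part of CWB; the header line follows the OUTPUT FORMAT description quoted further down.

```python
def parse_s_decode(text):
    """Read (start, end) cpos pairs from saved cwb-s-decode output.

    Assumes tab-separated lines whose first two columns are the start
    and end corpus positions; any further columns are ignored.
    """
    ranges = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ranges.append((int(fields[0]), int(fields[1])))
    return ranges


def build_align_lines(src_ranges, tgt_ranges, src_id, tgt_id, s_attr="s"):
    """Return the lines of a .align file for strictly 1:1 pre-aligned sentences."""
    if len(src_ranges) != len(tgt_ranges):
        raise ValueError("corpora must have the same number of sentences")
    # Header: source corpus ID, s-attribute, target corpus ID, s-attribute again.
    lines = ["\t".join([src_id, s_attr, tgt_id, s_attr])]
    for (s0, s1), (t0, t1) in zip(src_ranges, tgt_ranges):
        # Four cpos columns plus the alignment type; the quality field is omitted.
        lines.append(f"{s0}\t{s1}\t{t0}\t{t1}\t1:1")
    return lines
```

Writing "\n".join(build_align_lines(src, tgt, "corp1", "corp2")) to a file would then give the input for cwb-align-encode; "corp1" and "corp2" are placeholders for the real corpus IDs.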
>
> >> If I have .vrt files created in two languages with treetagger, and if
> I have prealigned these, in such a way that the first sentence of one
> file corresponds to the first sentence of the other, the second
> sentence to the second, etc. then is that enough?
>
> Yes, that is enough if you use the method above to create the input
> file for cwb-align-encode.
>
> >> Or should my files also include numerical information with all
> sentences numbered?
>
> If your ranges do actually have unique ID attributes <s id=".."> or
> maybe <s n="..">, you can optionally use *cwb-align-import* instead of
> cwb-align-encode: see section 8.5. This has a different input format
> from cwb-align-encode: it identifies matching segments by some ID
> attribute, rather than by spelling out the corpus token position
> ranges, as in cwb-align-encode. This is called a *bead file* whereas
> the other is called an *.align file* (not the clearest terminology, I
> know).
>
> Your sentences would have to be numbered through your whole corpus. In
> which case, your bead file could look like this:
>
> CORP1 CORP2 s {n}
> 1 1
> 2 2
> 3 3
> […]
>
> which might be easier to auto-generate than the cwb-align-encode
> format based on raw corpus position numbers.
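
As a rough illustration of how such a bead file could be auto-generated in Python: the sketch below reproduces the layout of the example above for sentences numbered 1 to N in both corpora. Using tabs as the column separator is an assumption here (the exact syntax should be checked against man cwb-align-import), and the function name is purely illustrative.

```python
def build_bead_lines(src_id, tgt_id, n_sentences, s_attr="s", key="{n}"):
    """Return bead-file lines pairing sentence i with sentence i.

    Assumes sentences carry a running number (e.g. <s n="..">) from 1 to
    n_sentences in both corpora, and that columns are tab-separated.
    """
    header = "\t".join([src_id, tgt_id, s_attr, key])
    beads = [f"{i}\t{i}" for i in range(1, n_sentences + 1)]
    return [header] + beads
```

For three sentences this gives the header line followed by "1 1", "2 2" and "3 3", matching the example above.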
>
> Finally, whatever method you use, don’t forget the need to update the
> registry file with the new a-attribute.
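
A hedged sketch of that registry update in Python, assuming the a-attribute is declared by a line of the form "ALIGNED <target corpus ID>" in the source corpus's registry file (as in the encoding tutorial). The helper name is illustrative; it works on the registry text only, so reading and writing the actual file is left to the caller.

```python
def add_alignment_declaration(registry_text, target_id):
    """Return registry text with an 'ALIGNED <target_id>' line added if missing."""
    for line in registry_text.splitlines():
        if line.split() == ["ALIGNED", target_id]:
            return registry_text  # already declared, leave the text untouched
    if registry_text and not registry_text.endswith("\n"):
        registry_text += "\n"
    return registry_text + f"ALIGNED {target_id}\n"
```

Calling it twice with the same target leaves the text unchanged the second time, so it is safe to rerun after re-encoding.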
>
> best
>
> Andrew.
>
> =====================
>
> From man cwb-align:
>
> OUTPUT FORMAT
>
> cwb-align's output file uses CWB's ".align" file format. ".align"
> files are ASCII text files (although they may contain characters from
> another encoding if the corpus IDs include non-ASCII characters),
> formatted as follows.
>
> The first line is a header line, which contains the following four
> elements, separated by tabs:
>
> · The ID of the source corpus
> · The ID of the aligned s-attribute (the grid attribute - see above)
> · The ID of the target corpus
> · The ID of the aligned s-attribute (repeated)
>
> Following the header, each individual line represents a single pair of
> aligned regions in the corpus. This is specified by six fields of
> information, separated by tabs. The six fields are as follows:
>
> · The beginning of the range in the source corpus (expressed as a
>   cpos, i.e. a token number)
> · The end of the range in the source corpus (expressed as a cpos)
> · The beginning of the range in the target corpus (expressed as a cpos)
> · The end of the range in the target corpus (expressed as a cpos)
> · The type of alignment: 1:1, 2:1, 1:2 or 2:2
> · The quality of the alignment: a score calculated by the alignment
>   engine
>
> For example,
>
> 140 169 137 180 1:2
>
> means that corpus position ranges [140,169] and [137,180] form a 1:2
> alignment pair.
>
> (The final field, the quality, is optional in this file format, and is
> absent in the example above; however, cwb-align will always provide it.)
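
To make the field layout concrete, here is a small Python sketch that parses one data line of a .align file as described in the excerpt above. The AlignPair type and function name are illustrative only, and the quality score is kept as a raw string since the excerpt does not say what numeric form it takes.

```python
from typing import NamedTuple, Optional


class AlignPair(NamedTuple):
    src_start: int
    src_end: int
    tgt_start: int
    tgt_end: int
    kind: str                # alignment type, e.g. "1:1" or "1:2"
    quality: Optional[str]   # optional sixth field, left unparsed


def parse_align_line(line):
    """Parse one tab-separated data line (not the header) of a .align file."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (5, 6):
        raise ValueError(f"expected 5 or 6 tab-separated fields, got {len(fields)}")
    s0, s1, t0, t1 = (int(f) for f in fields[:4])
    quality = fields[5] if len(fields) == 6 else None
    return AlignPair(s0, s1, t0, t1, fields[4], quality)
```

Applied to the man page's example line (with tabs between the fields), this yields source range 140 to 169, target range 137 to 180, type "1:2" and no quality value.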
>
> From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it>
> On Behalf Of Graham Ranger -- UAPV
> Sent: 07 May 2019 13:57
> To: Open source development of the Corpus WorkBench
> <cwb at sslmit.unibo.it>
> Subject: [CWB] Aligning parallel corpora
>
> Hello to all,
>
> I have set up a parallel corpus on cqpweb using s-attributes for the
> visualisation of translations but I would like to be able to do the
> same thing more cleanly, using alignment attributes. However, try as I
> might, I cannot seem to follow the instructions in the encoding
> tutorial. I have not been able to find the English and German Holmes
> files used in Stefan Evert's tutorial for illustration. Now, what I
> would like to know is: what exactly is the required input format for
> the cwb-align command? If I have .vrt files created in two languages
> with treetagger, and if I have prealigned these, in such a way that
> the first sentence of one file corresponds to the first sentence of
> the other, the second sentence to the second, etc. then is that
> enough? Or should my files also include numerical information with
> all sentences numbered? I suspect this is a very naive question, but
> it's one that I do not seem to be able to find my way around without help!
>
> Best,
>
> Graham.