[CWB] Querying parallel corpora

Stefan Evert stefanML at collocations.de
Thu Nov 12 14:22:22 CET 2015


> On 12 Nov 2015, at 12:23, José Manuel Martínez Martínez <chozelinek at gmail.com> wrote:
> 
> So, if I want to see the aligned sentences corresponding to the matches I just type this:
> 
> show +tdc-tt-fl
> 
> And then my query:
> 
> [word="catch-the-eye"];
> 
> Can one tabulate or save the alignments somehow corresponding to the matches? If yes, how?

Well, you can always redirect the "cat" output to a file:

	cat > "output_with_alignment.txt";

If you want more control via "tabulate" and you are running a recent beta version of CQP (v3.4.7 or newer), you can also "translate" the query result to the target language. Note that this is an experimental feature, so no guarantees …

Let me give you an example based on the Europarl corpus:

[no corpus]> EUROPARL-EN
EUROPARL-EN> Law = "German" "law";
EUROPARL-EN> cat Law 0 2;
  1317230:  It is a funny saga of the mishaps and adventures of this group of men , who live beyond the margins of German society in the shadowy areas outside <German law> .
  2366610:  And the <German law> on energy saving is clearly supported by the proposal for a directive .
  4145616:  An example would be the <German law> on non-medical practitioners .

# now we use the new from … to … command to "translate" the query results to the aligned regions
EUROPARL-EN> Gesetz = from Law to EUROPARL-DE;
EUROPARL-EN> tabulate EUROPARL-DE:Gesetz 0 2 match .. matchend word;
Es ist eine humorvolle Erzählung von Mißgeschicken und Abenteuern , die eine Gruppe von Männern erlebt , die am Rande der deutschen Gesellschaft im Graubereich außerhalb der deutschen Gesetze leben .
Auch das deutsche Stromeinspeisegesetz wird mit dem Richtlinienvorschlag klar unterstützt .
Ein Beispiel ist das deutsche Heilpraktikergesetz .

# one problem is that matches without an alignment to the target language are silently dropped; the same happens for multiple matches within the same alignment bead;
# this is why the translated query result has only 35 lines rather than 38
EUROPARL-EN> show named;
Named Query Results:
   m-*  EUROPARL-DE:Gesetz [35]
   m-*  EUROPARL-EN:Law [38]

# if you want corresponding regions for both languages (which you probably do), you can translate back into English;
# of course, the actual matches are no longer marked, and there is no easy workaround for this
EUROPARL-EN> Law2 = from EUROPARL-DE:Gesetz to EUROPARL-EN;
EUROPARL-EN> show named;
Named Query Results:
   m-*  EUROPARL-EN:Law2 [35]
   m-*  EUROPARL-DE:Gesetz [35]
   m-*  EUROPARL-EN:Law [38]
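
One possible way to get both languages side by side is to write each result to its own file with tabulate, e.g.

	tabulate Law2 match .. matchend word > "law2_en.txt";
	tabulate EUROPARL-DE:Gesetz match .. matchend word > "gesetz_de.txt";

and then join the two files line by line outside CQP. The Python sketch below does that. It assumes that both files list their regions in the same order, so that row i of one file is aligned to row i of the other; this is an assumption that should be checked on your own data. The file names, the function name pair_regions and the UTF-8 encoding are made up for illustration.

```python
from pathlib import Path


def pair_regions(src_path, tgt_path, out_path, encoding="utf-8"):
    """Join two single-column tabulate outputs into one tab-separated file.

    Assumption: line i of src_path and line i of tgt_path describe
    aligned regions (same order, same number of rows).
    Returns the number of pairs written.
    """
    src = Path(src_path).read_text(encoding=encoding).splitlines()
    tgt = Path(tgt_path).read_text(encoding=encoding).splitlines()

    # Refuse to pair the files if the row counts differ (e.g. 38 vs. 35),
    # because the rows would then be shifted against each other.
    if len(src) != len(tgt):
        raise ValueError(
            f"row counts differ: {len(src)} in source, {len(tgt)} in target"
        )

    with open(out_path, "w", encoding=encoding, newline="") as out:
        for s, t in zip(src, tgt):
            out.write(f"{s}\t{t}\n")
    return len(src)


if __name__ == "__main__":
    # Hypothetical file names matching the tabulate commands above.
    n = pair_regions("law2_en.txt", "gesetz_de.txt", "law_pairs.tsv")
    print(f"wrote {n} aligned pairs")
```

Pairing Law2 with Gesetz (35 rows each) rather than the original Law (38 rows) is deliberate: the row-count check would reject the latter, since the dropped matches would shift the rows against each other.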

Hope this helps,
Stefan


