[CWB] Does cwb-align-encode support utf8?

Stefan Evert stefanML at collocations.de
Thu Jul 12 16:27:09 CEST 2012


> Hmmm. Ray’s initial problem was solved, but there does seem to be an underlying problem – arising from the English charset defaulting to Latin1 when unspecified, and the aligned Chinese data thus being treated as Latin1 for output.  (Though I don’t understand why the Chinese data, once output, wasn’t treated as UTF8 by the terminal...)

I suspect that this happens when data are viewed interactively in CQP?  Then it's an issue with the "less" pager used as a backend.  CQP needs to set the environment variable LESSCHARSET in order to tell "less" whether the input is in LatinX or UTF-8 encoding.  This is done based on the currently active corpus and cannot be switched across different parts of the output.
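To illustrate the mechanism (this is only a rough sketch in Python, not CQP's actual code, which is written in C): the pager's environment is prepared once from the charset of the active corpus, so everything piped into "less" afterwards is interpreted in that single encoding. The function names and the charset labels "utf8" / "latin1" are assumptions made for the example, and mapping every non-UTF-8 charset to "latin1" is a deliberate simplification.

```python
import os


def less_charset_for(corpus_charset):
    """Map a corpus charset label (hypothetical labels) to a LESSCHARSET value.

    "utf-8" and "latin1" are both values that less accepts; every
    non-UTF-8 corpus is lumped into "latin1" here for simplicity.
    """
    if corpus_charset.lower() in ("utf8", "utf-8"):
        return "utf-8"
    return "latin1"


def pager_environment(corpus_charset, base_env=None):
    """Return a copy of the environment with LESSCHARSET set for the pager."""
    env = dict(os.environ if base_env is None else base_env)
    env["LESSCHARSET"] = less_charset_for(corpus_charset)
    return env


# The environment is fixed before the pager starts, e.g.
#   subprocess.Popen(["less"], stdin=subprocess.PIPE,
#                    env=pager_environment("latin1"))
# so aligned UTF-8 lines in the same output cannot get a different setting.
```

Because the variable is read when the pager starts, there is no way to tell it that some lines of the same output are in another encoding.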

> 
> There is some kind of issue here, however I’m not quite sure what the answer is.
>  
> * Clearly, it would be advantageous to allow alignment to be declared between two corpora that are in different charsets.
> * However, that creates problems for display, since...
> * ...it’s equally clearly undesirable for CQP to be outputting two charsets in the same chunk of output.

Exactly.  The alignments should already be encoded correctly, but they cannot be displayed sensibly.

The only reasonable solution would be to allow CQP to re-encode text (and queries), so all corpora can be queried in UTF-8, Latin1, ... regardless of which encoding they're actually stored in.  But this probably opens several cans of worms ...
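A minimal sketch of what such re-encoding would involve, again in Python and purely illustrative (CQP has no such function as far as this message says; the name reencode is made up for the example): stored bytes are decoded from the corpus encoding and encoded into the encoding the user works in. The second call shows one of the worms: text that has no representation in the target charset.

```python
def reencode(data, stored, target):
    """Convert bytes from the stored encoding to the target encoding.

    Raises UnicodeDecodeError if the bytes are not valid in `stored`,
    and UnicodeEncodeError if a character does not exist in `target`.
    """
    return data.decode(stored).encode(target)


# A Latin1 corpus shown to a UTF-8 user: always possible.
utf8_bytes = reencode(b"caf\xe9", "latin-1", "utf-8")  # b"caf\xc3\xa9"

# A UTF-8 (Chinese) corpus shown to a Latin1 user: not representable.
try:
    reencode("中文".encode("utf-8"), "utf-8", "latin-1")
except UnicodeEncodeError:
    print("cannot display this text in Latin1")
```

Going from any LatinX encoding to UTF-8 always works; the reverse direction does not, so a real implementation would have to decide what to do with unrepresentable characters in both output and queries.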

In the long run, wouldn't it make much more sense to switch all corpora to UTF-8?

Cheers,
Stefan



