[CWB] Install BNC in utf8

Stefan Evert stefanML at collocations.de
Fri Aug 19 20:46:58 CEST 2022


> Hi there, I'm trying to install BNC corpora into an existing CQPweb
> installation, the BNCweb encoder is set to index in latin1, but I have
> seen that the BNCencoder (BNC_encoder-0.9.2) is set to index in utf8
> (as default).

Yes, that's because BNCweb uses a special mixed Latin1 / HTML encoding, while sensible people will want their BNC corpus to be indexed in UTF-8.

> I have tried to index with BNCweb_encoder in utf8 changing
> line "$Encoder->charset("latin1");"
> to    "$Encoder->charset("utf8");"

That's not going to work because the BNCweb encoder Perl script is hardwired to generate the Latin1/HTML encoding expected by BNCweb.  If you try to index this as UTF-8, CWB will necessarily stumble over invalid byte sequences.
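
To illustrate the byte-level problem (a minimal sketch in Python, not taken from the encoder scripts; the sample word "café" and the helper is_valid_utf8 are made up for the example): a non-ASCII character written as Latin1 is a single byte that does not form a valid UTF-8 sequence, so declaring such data as UTF-8 cannot work.

```python
# Illustration only: "café" stands in for any corpus text containing
# a non-ASCII character; it is not taken from the BNC.
text = "café"

latin1_bytes = text.encode("latin-1")  # é becomes the single byte 0xE9
utf8_bytes = text.encode("utf-8")      # é becomes the two bytes 0xC3 0xA9


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(latin1_bytes))  # Latin1 output mislabelled as UTF-8
print(is_valid_utf8(utf8_bytes))    # genuine UTF-8 output
```

The first check fails and the second succeeds, which is why changing only the declared charset is not enough: the script has to produce UTF-8 bytes in the first place.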

> line "$Encoder->charset("utf8");"
> to   "$Encoder->charset(("utf8") ? "utf8" : "latin1");"

This doesn't make sense to me (the first line doesn't seem to exist in the source code).  If you've been trying to change something in "EncodeBNC.perl", there is no need to: it has a command-line option to select --encoding utf8.

However, I'm surprised by your statement that the BNC encoder would default to UTF-8 encoding because the master version in the SVN repository doesn't!  What modified version of the encoder are you using?

> First question, is it possible to achieve the indexing in utf8? If so,
> what else should I do? or what should I do instead?

Yes, simply use the official BNC encoder with --encoding utf8.

Best,
Stefan


