[CWB] Install BNC in utf8

Andrés Chandía andres.chandia at upf.edu
Fri Aug 19 16:50:18 CEST 2022


Hi there,

I'm trying to install the BNC corpora into an existing CQPweb
installation. The BNCweb encoder is set to index in latin1, but I have
seen that BNC_encoder (BNC_encoder-0.9.2) indexes in utf8 by default.

I have tried to index with BNCweb_encoder in utf8 by changing the line

    $Encoder->charset("latin1");
to
    $Encoder->charset("utf8");

and the line

    $Encoder->charset("utf8");
to
    $Encoder->charset(("utf8") ? "utf8" : "latin1");

This does not work; the indexing fails with:
Encoding error: an invalid byte or byte sequence for charset "utf8"
was encountered.
[location of error: input line #4]

I had no such issue with BNC_encoder-0.9.2 for interactive use.
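
One way to narrow down an "invalid byte sequence" error like the one
quoted above is to check whether the bytes being indexed are valid UTF-8
at all. The following is only a minimal sketch in Python, not part of
BNCweb_encoder or BNC_encoder. It assumes (unconfirmed) that the failure
comes from latin1 bytes reaching an encoder that has been told to expect
utf8. The function names and the sample data are hypothetical.

```python
def first_invalid_utf8_line(data: bytes):
    """Return (line_number, raw_line) for the first line that is not
    valid UTF-8, or None if every line decodes cleanly."""
    # Splitting on the newline byte is safe: 0x0A never occurs inside
    # a multi-byte UTF-8 sequence.
    for number, raw in enumerate(data.split(b"\n"), start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            return number, raw
    return None


def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode bytes that are assumed to be latin1 as UTF-8."""
    return data.decode("latin-1").encode("utf-8")


if __name__ == "__main__":
    # Hypothetical sample: an accented character on line 4, stored as latin1.
    sample = "a\nb\nc\ncaf\u00e9\n".encode("latin-1")
    print(first_invalid_utf8_line(sample))                  # (4, b'caf\xe9')
    print(first_invalid_utf8_line(latin1_to_utf8(sample)))  # None
```

If such a check reports an invalid line in the data handed to the
encoder, the charset declaration and the actual encoding of the input
disagree, and changing the declaration alone would not be enough.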

My question: is it possible to index in utf8 with BNCweb_encoder? If
so, what else should I do, or what should I do instead?

Thanks. Regards.
-- 
Andrés Chandía
Unitat de Traducció i Ciències del Llenguatge
Roc Boronat 138, C. P.: 08018, Barcelona
Tel.: 935 055 722 - mail:andres.chandia at upf.edu

