[CWB] Install BNC in utf8
Andrés Chandía
andres.chandia at upf.edu
Fri Aug 19 21:37:31 CEST 2022
Hi Stephanie,
Thanks for your answer.
> regarding your last question... What modified version of the encoder are
> you using?
>
BNC_encoder-0.9.2
Maybe I misunderstood the documentation, or I misinterpreted the following line:
$Encoder->charset(($Opt_Encoding eq "utf8") ? "utf8" : "latin1");
Just two more questions. Which do you mean by the official version of the
encoder: the "BNCweb-distribution" or the "BNC_encoder-0.9.2"?
And is it really possible to index the BNC on an existing CQPweb installation?
Is there any documentation on that?
Thanks again, Stephanie.
Message from Stefan Evert <stefanML at collocations.de> on Fri, 19 Aug 2022
at 20:52:
> > Hi there, I'm trying to install BNC corpora into an existing CQPweb
> > installation, the BNCweb encoder is set to index in latin1, but I have
> > seen that the BNCencoder (BNC_encoder-0.9.2) is set to index in utf8
> > (by default).
>
> Yes, that's because BNCweb uses a special mixed Latin1 / HTML encoding,
> while sensible persons will want their BNC corpus to be indexed in UTF-8.
>
> > I have tried to index with BNCweb_encoder in utf8 changing
> > line "$Encoder->charset("latin1");"
> > to "$Encoder->charset("utf8");"
>
> That's not going to work because the BNCweb encoder Perl script is
> hardwired to generate the Latin1/HTML encoding expected by BNCweb. If you
> try to index this as UTF-8, CWB will necessarily stumble over invalid byte
> sequences.
>
> > line "$Encoder->charset("utf8");"
> > to "$Encoder->charset(("utf8") ? "utf8" : "latin1");"
>
> This doesn't make sense to me (the first line doesn't seem to exist in the
> source code). If you've been trying to change something in
> "EncodeBNC.perl", there is no need to: it has a command-line option to
> select --encoding utf8.
>
> However, I'm surprised by your statement that the BNC encoder would
> default to UTF-8 encoding because the master version in the SVN repository
> doesn't! What modified version of the encoder are you using?
>
> > First question, is it possible to achieve the indexing in utf8? If so,
> > what else should I do? or what should I do instead?
>
> Yes, simply use the official BNC encoder with --encoding utf8
>
> Best,
> Stephanie
>
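To illustrate the point quoted above about invalid byte sequences, here is a minimal sketch in Python (not part of either encoder). The sample word is an assumption chosen for illustration, and the sketch covers only the plain Latin1 part of the problem, not BNCweb's HTML escapes. It shows that bytes written as Latin1 are generally not valid UTF-8, which is why changing only the declared charset cannot work.

```python
# Illustration only: the sample text is an assumption, not actual BNC data
# or output of the BNCweb encoder.
sample = "café"

latin1_bytes = sample.encode("latin-1")  # 'é' becomes the single byte 0xE9
utf8_bytes = sample.encode("utf-8")      # 'é' becomes the two bytes 0xC3 0xA9


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string can be decoded as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Latin1 output declared as UTF-8 is rejected: 0xE9 starts a multi-byte
# sequence that is never completed.
print(is_valid_utf8(latin1_bytes))
print(is_valid_utf8(utf8_bytes))

# The data itself has to be re-encoded, not just relabelled.
print(latin1_bytes.decode("latin-1").encode("utf-8") == utf8_bytes)
```

In other words, the charset declaration has to match the bytes the encoder script actually writes, so the encoding must be selected when the data is generated.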
--
*Andrés Chandía*
Unitat de Traducció i Ciències del Llenguatge
Roc Boronat 138, C. P.: 08018, Barcelona
Tel.: 935 055 722 - mail: andres.chandia at upf.edu