<div dir="ltr"><div>Hi Stephanie,</div><div>thanks for your answer...</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>regarding your last question... What modified version of the encoder are you using?<span class="gmail-im"><br></span></div></blockquote><div><br></div><div>BNC_encoder-0.9.2</div><div>Maybe I misunderstood the documentation, or I misinterpreted the following line:</div><div>$Encoder->charset(($Opt_Encoding eq "utf8") ? "utf8" : "latin1");</div><div><br></div><div>Just two more questions. Which one do you mean by the official version of the encoder:</div><div>the "BNCweb-distribution" or the "BNC_encoder-0.9.2"?</div><div>And is it really possible to index the BNC on an existing CQPweb installation? Is there any documentation on that?</div><div>Thanks again Stephanie...<br></div><div><br></div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">Message from Stefan Evert &lt;<a href="mailto:stefanML@collocations.de">stefanML@collocations.de</a>&gt; on Fri, 19 Aug 2022 at 20:52:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">> Hi there, I'm trying to install BNC corpora into an existing CQPweb<br>
> installation, the BNCweb encoder is set to index in latin1, but I have<br>
> seen that the BNCencoder (BNC_encoder-0.9.2) is set to index in utf8<br>
> (as default).<br>
<br>
Yes, that's because BNCweb uses a special mixed Latin1 / HTML encoding, while sensible persons will want their BNC corpus to be indexed in UTF-8.<br>
<br>
> I have tried to index with BNCweb_encoder in utf8 changing<br>
> line "$Encoder->charset("latin1");"<br>
> to "$Encoder->charset("utf8");"<br>
<br>
That's not going to work because the BNCweb encoder Perl script is hardwired to generate the Latin1/HTML encoding expected by BNCweb. If you try to index this as UTF-8, CWB will necessarily stumble over invalid byte sequences.<br>
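<br>
To illustrate the point about invalid byte sequences, here is a minimal Python sketch. It is not part of CWB or the BNC encoder; the sample word and the helper name <code>is_valid_utf8</code> are made up for this example. It only shows that bytes written as Latin1 usually do not decode as UTF-8:<br>
<pre>
```python
# Illustrative assumption: a single accented word stands in for corpus text.
text = "caf\u00e9"

# The same word stored in the two encodings under discussion.
latin1_bytes = text.encode("latin-1")  # one byte (0xE9) for the accented letter
utf8_bytes = text.encode("utf-8")      # two bytes (0xC3 0xA9) for the accented letter


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Latin1 bytes read as UTF-8 are rejected; real UTF-8 bytes are accepted.
print(is_valid_utf8(latin1_bytes))
print(is_valid_utf8(utf8_bytes))
```
</pre>
The first check fails because the Latin1 byte 0xE9 is not a complete UTF-8 sequence on its own, which is the kind of input that trips up an index declared as UTF-8.<br>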
<br>
> line "$Encoder->charset("utf8");"<br>
> to "$Encoder->charset(("utf8") ? "utf8" : "latin1");"<br>
<br>
This doesn't make sense to me (the first line doesn't seem to exist in the source code). If you've been trying to change something in "EncodeBNC.perl", there is no need to: it has a command-line option to select --encoding utf8.<br>
<br>
However, I'm surprised by your statement that the BNC encoder would default to UTF-8 encoding because the master version in the SVN repository doesn't! What modified version of the encoder are you using?<br>
<br>
> First question, is it possible to achieve the indexing in utf8? If so,<br>
> what else should I do? or what should I do instead?<br>
<br>
Yes, simply use the official BNC encoder with --encoding utf8<br>
<br>
Best,<br>
Stephanie<br>
<br>
_______________________________________________<br>
CWB mailing list<br>
<a href="mailto:CWB@sslmit.unibo.it" target="_blank">CWB@sslmit.unibo.it</a><br>
<a href="http://liste.sslmit.unibo.it/mailman/listinfo/cwb" rel="noreferrer" target="_blank">http://liste.sslmit.unibo.it/mailman/listinfo/cwb</a><br>
</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr"><table width="100%" cellspacing="2" cellpadding="2" border="0">
<tbody>
<tr>
<td width="230"><span></span><span><img src="https://www.upf.edu/documents/10193/6775906/signatura-correu.png"></span></td>
<td><span style="font-family:Verdana,sans-serif;font-size:x-small"><b>Andrés Chandía</b></span><br><span style="font-family:Verdana,sans-serif;font-size:x-small">Unitat de Traducció i Ciències del Llenguatge<br>Roc Boronat 138, C. P.: 08018, Barcelona<br>Tel.: 935 055 722 - <a href="mailto:andres.chandia@upf.edu" rel="noopener" target="_blank">mail:andres.chandia@upf.edu</a></span></td>
</tr>
</tbody>
</table></div></div>