[CWB] Accents and codification

Daniel Renau alphak87 at gmail.com
Wed Dec 16 21:34:31 CET 2015


Thank you Stefan and Andrew,

I just changed this line in the registry file:
##:: charset = "latin1"    # character encoding of corpus data
to
##:: charset = "utf8"    # character encoding of corpus data

and it works well: http://imgur.com/GvEnB11
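
For the record, a small Python sketch of what seems to have been going on while the registry still said latin1. It only illustrates the re-coding Andrew describes below (UTF-8 bytes treated as latin1 and converted again); it is not CQPweb's actual code, and the function name is made up for the example.

```python
def recode_as_latin1(text: str) -> str:
    """Simulate UTF-8 corpus data that is wrongly declared as latin1."""
    raw = text.encode("utf-8")        # bytes as they are stored in the corpus
    misread = raw.decode("latin-1")   # every byte is taken as one latin1 character
    return misread                    # this is what reaches the browser after conversion to utf8


# An accented letter turns into two characters.
print(recode_as_latin1("canción"))         # canciÃ³n

# U+2019 RIGHT SINGLE QUOTATION MARK is three bytes in UTF-8;
# the last two are the <80><99> that less highlights.
print("\u2019".encode("utf-8").hex(" "))   # e2 80 99
```

With charset = "utf8" in the registry no such re-coding is needed, which fits with the display being correct now.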

As I said a few days ago, they use this script to encode the corpus:
http://pastebin.com/xTEHaDdm
When the script calls CWB::Encoder, it doesn't tell it to use utf8, so the
encoder uses latin1 by default:
 -c <charset> specify corpus character set (instead of the default latin1)
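
Since Stefan asked below whether "cwb-encode -c utf8" accepts the input as well-formed UTF-8, a quick check could be run on the input before re-encoding. This is a minimal sketch in Python and is independent of CWB; the function name is hypothetical and the sample bytes are invented for the demonstration.

```python
def first_invalid_utf8(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None if the data is valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start   # offset where decoding failed
    return None


# Proper UTF-8, including RIGHT SINGLE QUOTATION MARK, passes.
print(first_invalid_utf8("l\u2019été".encode("utf-8")))   # None

# A latin1-encoded "é" (byte 0xE9) inside otherwise ASCII text is reported.
print(first_invalid_utf8(b"caf\xe9 solo"))                 # 3
```

For a real corpus file, the bytes would be read with open(path, "rb").read() and passed to the same function.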

Now my questions are:
1- Is it better to modify the script so that it calls the encoder with "-c utf8"?
2- How do I install a new corpus from scratch, both ways: in CQP and in CQPweb?
If the texts are in "word pos lemma" format, how do I fill in the fields?
http://i.imgur.com/NC7ua7n.png
I don't know how to pass the different attributes.
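
To make the question concrete, here is a rough Python sketch of the command line I have in mind for the CQP side. It assumes the input is one token per line with tab-separated columns (word, pos, lemma): the first column is the default "word" attribute, so only the extra columns are declared with -P. The corpus name, paths and function name are hypothetical, and the sketch only builds the command, it does not run it.

```python
import shlex


def cwb_encode_command(corpus_id, data_dir, registry_dir, vrt_file,
                       extra_attributes=("pos", "lemma"), charset="utf8"):
    """Build a cwb-encode call for a 'word pos lemma' corpus (all paths are placeholders)."""
    cmd = [
        "cwb-encode",
        "-c", charset,                                  # instead of the default latin1
        "-d", data_dir,                                 # directory for the binary corpus data
        "-f", vrt_file,                                 # tab-separated input file
        "-R", f"{registry_dir}/{corpus_id.lower()}",    # registry file to create
    ]
    for att in extra_attributes:
        cmd += ["-P", att]                              # one -P per column after "word"
    return cmd


print(shlex.join(cwb_encode_command(
    "MYCORPUS", "/corpora/data/mycorpus", "/corpora/registry", "mycorpus.vrt")))
# cwb-encode -c utf8 -d /corpora/data/mycorpus -f mycorpus.vrt -R /corpora/registry/mycorpus -P pos -P lemma
```

After encoding, the indexes still have to be built (cwb-make or cwb-makeall). For CQPweb I assume the same two extra columns are entered as additional annotations in the form, but that is a guess on my part.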

Thank you so much.

2015-12-16 13:14 GMT+01:00 Hardie, Andrew <a.hardie at lancaster.ac.uk>:

> The mess via SSH is because
>
> - less highlights the numeric representation of bytes it can't represent
> - it emits the ESC-[7m control sequence to do this
> - ssh is not interpreting this correctly when it comes in the middle of a
> utf8 sequence
> - this may have something to do with how the receiving console is
> configured
> - but the vm terminal does handle the ESC-[7m correctly
> - but in either case, the action of less in introducing those highlight
> codes stops the UTF-8 sequence resolving as it does for the accented
> characters.
>
> Daniel: For the browser, something to check is what presentation encoding
> it is automatically set to when visiting any page on your CQPweb server;
> this will tell you what charset your HTTP server is emitting in the
> Content-Type header. It ought to be emitting UTF-8. (And check your browser
> has auto-detect encoding switched on, or the equivalent; that's optional in
> Chrome but I don't know about other browsers).
>
> CQPweb always tries to set the HTTP Content-Type header to assert utf8 as
> the charset. It's possible, however, that your server might block this in
> some way (I don't know how, but anything's possible I suppose).
>
> If neither browser nor HTTP server is the culprit, the next most likely
> explanation may be that you accidentally ticked the "Tick here if the
> corpus is encoded in Latin1 (iso-8859-1)" box in CQPweb's New Corpus form.
> If so, then the charset will be declared as latin1 in the registry file
> (instead of utf8 which is what it should be).
>
> If you did this, then CQPweb will re-code the data from (what it thinks
> is) latin1 to utf8 every time. If the underlying data is really UTF-8,
> that re-coding would produce the effect you are seeing.
>
> best
>
> Andrew.
>
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On
> Behalf Of Stefan Evert
> Sent: 16 December 2015 09:02
> To: CWBdev Mailing List
> Subject: Re: [CWB] Accents and codification
>
>
> > On 16 Dec 2015, at 01:28, Daniel Renau <alphak87 at gmail.com> wrote:
> >
> > Is there a way to solve this problem with accents and apostrophes?
> >
> > Pic related: http://i.imgur.com/OAVDzuG.png
> >
> > In cqp via the command line, the accents show OK (ssh connection or
> > local terminal).
> > In CQPweb the accents are not displayed correctly.
> > The apostrophe -> ' <- isn't shown properly anywhere; it shows as <80><99>.
> > In the UTF-8 table it is named "RIGHT SINGLE QUOTATION MARK".
>
> After taking a very close look at the screenshot, it would appear that
> your corpus is mostly encoded in UTF-8, but you have set CQPweb and/or your
> browser to interpret it as latin1.  If you change these settings to be
> consistent with your actual corpus encoding, the text should display fine.
>
> The encoding of RIGHT SINGLE QUOTATION MARK might actually be broken in
> your input data, since it appears as a sequence of three bytes in your VM
> terminal and especially given the absolute mess of characters (and control
> codes??) showing up via ssh.
>
> Did "cwb-encode -c utf8" actually accept this input as well-formed UTF-8?
>
> Best,
> Stefan
>
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
>



-- 
Regards, Dani.