[CWB] Accents and codification

Wed Dec 16 22:23:01 CET 2015

That script says it was written by Marco Baroni in 2005, so it is clearly many versions out of date. CWB did not handle any charsets other than Latin1 back then, and CQPweb did not exist. You are inevitably going to run into problems attempting to use that script with a current version of CWB.

If Marco is reading this he might be kind enough to point you to a more recent version. Otherwise, I don’t understand why you don’t just use the normal tools i.e. cwb-encode/cwb-make. That’s why we document them: http://cwb.sourceforge.net/files/CWB_Encoding_Tutorial/

>>2- How to install a new corpus from zero, the 2 ways... in CQP and in cqpWeb

That’s what the tutorial is for.

>> If the texts are like "word pos lemma", how do I fill in the fields?

You put the information for “pos” in the first row, and the information for “lemma” in the second row. If a column doesn’t apply, leave it blank. Don’t tick “feature set”. It’s recommended to specify “pos” as primary.

OR, use the built-in annotation template called “POS plus lemma (TreeTagger format)”, which covers exactly this configuration of p-attributes.
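
For the command-line route, here is a minimal sketch of what a cwb-encode call for "word pos lemma" input might look like, written as a Python helper that only builds the argument list and runs nothing. The corpus name, the paths and the function name are hypothetical; the flags (-c, -d, -f, -R, -P) are the ones described in the encoding tutorial, which remains the reference.

```python
from pathlib import PurePosixPath


def cwb_encode_args(corpus, data_dir, registry_dir, vrt_file, charset="utf8"):
    """Build an argv list for cwb-encode; nothing is executed here.

    The first column (word) is implicit, so only the extra
    p-attributes "pos" and "lemma" are declared with -P.
    """
    # By convention the registry file is named after the lowercased corpus ID.
    registry_file = PurePosixPath(registry_dir) / corpus.lower()
    return [
        "cwb-encode",
        "-c", charset,             # declare the corpus charset explicitly
        "-d", str(data_dir),       # directory for the binary corpus data
        "-f", str(vrt_file),       # one-token-per-line input file
        "-R", str(registry_file),  # registry file to create
        "-P", "pos",               # second column
        "-P", "lemma",             # third column
    ]


# Hypothetical corpus name and paths, for illustration only.
args = cwb_encode_args(
    "MYCORPUS", "/corpora/data/mycorpus", "/corpora/registry", "mycorpus.vrt"
)
print(" ".join(args))
```

The resulting list could be passed to subprocess.run; cwb-make would then be run on the corpus as the tutorial describes.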

best

Andrew.

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Daniel Renau
Sent: 16 December 2015 20:35
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Accents and codification

Thank you Stefan and Andrew,
I just changed this line in the registry file:
##:: charset = "latin1"    # character encoding of corpus data
to
##:: charset = "utf8"    # character encoding of corpus data
and it works well: http://imgur.com/GvEnB11
As I said a few days ago, they use this script to encode the corpus: http://pastebin.com/xTEHaDdm
When the script calls CWB::Encoder, it doesn't tell it to use utf8, so the encoder uses Latin1 by default:
 -c <charset> specify corpus character set (instead of the default latin1)
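
To see which charset a corpus currently declares, the registry line quoted above can be read back with a few lines of Python. This is only a sketch: the function name and the regular expression are assumptions based on the single registry line shown in this message, not on a full description of the registry format.

```python
import re

# Matches a registry line such as:  ##:: charset = "utf8"   # comment
CHARSET_LINE = re.compile(r'^##::\s*charset\s*=\s*"([^"]+)"', re.MULTILINE)


def declared_charset(registry_text):
    """Return the charset named in the registry text, or None if absent."""
    match = CHARSET_LINE.search(registry_text)
    return match.group(1) if match else None


sample = '##:: charset = "latin1"    # character encoding of corpus data\n'
print(declared_charset(sample))
```

If this reports latin1 for data that is really UTF-8, the declaration and the data disagree, which is the situation described in this thread.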

Now my questions are...
1- Is it better to modify the script so that it calls the encoder with "-c utf8"?
2- How do I install a new corpus from scratch, both ways... in CQP and in CQPweb?
If the texts are like "word pos lemma", how do I fill in the fields?
http://i.imgur.com/NC7ua7n.png
I don't know how to pass the different attributes.
Thank you so much.

2015-12-16 13:14 GMT+01:00 Hardie, Andrew <a.hardie at lancaster.ac.uk>:
The mess via SSH is because

- less highlights the numeric representation of bytes it can't represent
- it emits the ESC-[7m control sequence to do this
- ssh is not interpreting this correctly when it comes in the middle of a utf8 sequence
- this may have something to do with how the receiving console is configured
- but the vm terminal does handle the ESC-[7m correctly
- but in either case, the action of less in introducing those highlight codes stops the UTF-8 sequence resolving as it does for the accented characters.

Daniel: For the browser, something to check is what presentation encoding it is automatically set to when visiting any page on your CQPweb server; this will tell you what charset your HTTP server is emitting in the Content-Type header. It ought to be emitting UTF-8. (And check your browser has auto-detect encoding switched on, or the equivalent; that's optional in Chrome but I don't know about other browsers).

CQPweb always tries to set the HTTP Content-Type header to assert utf8 as the charset. It's possible, however, that your server might block this in some way (I don't know how, but anything's possible I suppose).

If neither browser nor HTTP server is the culprit, the next most likely explanation may be that you accidentally ticked the "Tick here if the corpus is encoded in Latin1 (iso-8859-1)" box in CQPweb's New Corpus form. If so, then the charset will be declared as latin1 in the registry file (instead of utf8 which is what it should be).

If you did this, then CQPweb will re-code the data from (what it thinks is) latin1 to utf8 every time. If the underlying data is really UTF-8, that re-coding would produce the effect you are seeing.
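
The effect of such a re-coding can be reproduced in a few lines of Python. This is only an illustration of the general latin1-to-utf8 mix-up under the assumption that the data is already UTF-8; it is not CQPweb's actual code, and the sample string is made up.

```python
# "café’s" written with escapes: U+00E9 is e-acute, U+2019 is the curly apostrophe.
original = "caf\u00e9\u2019s"

# What is really stored in the corpus: UTF-8 bytes.
stored = original.encode("utf-8")

# The mistaken step: treat those bytes as latin1, then convert "to utf8".
misread = stored.decode("latin-1")
recoded = misread.encode("utf-8")

# Each non-ASCII character has now turned into two or three characters.
print(len(original), len(misread))
print(repr(misread))

# The damage is reversible as long as nothing else touched the bytes.
repaired = recoded.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired == original)
```

Declaring the correct charset in the first place avoids the round trip entirely.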

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Stefan Evert
Sent: 16 December 2015 09:02
To: CWBdev Mailing List
Subject: Re: [CWB] Accents and codification

> On 16 Dec 2015, at 01:28, Daniel Renau <alphak87 at gmail.com> wrote:
>
> Is there a way to solve this problem with accents and the apostrophe?
>
> Pic related: http://i.imgur.com/OAVDzuG.png
>
> In cqp on the command line, the accents display correctly (ssh connection or local terminal).
> In CQPweb the accents are not displayed correctly.
> The apostrophe -> ' <- isn't shown properly anywhere; it shows <80><99>
> In the UTF-8 table it is named "RIGHT SINGLE QUOTATION MARK".

After taking a very close look at the screenshot, it would appear that your corpus is mostly encoded in UTF-8, but you have set CQPweb and/or your browser to interpret it as latin1. If you change these settings to be consistent with your actual corpus encoding, the text should display fine.

The encoding of RIGHT SINGLE QUOTATION MARK might actually be broken in your input data, since it appears as a sequence of three bytes in your VM terminal and especially given the absolute mess of characters (and control codes??) showing up via ssh.

Did "cwb-encode -c utf8" actually accept this input as well-formed UTF-8?
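
One way to answer that question independently of cwb-encode is to test the raw input bytes directly. The sketch below uses a hypothetical helper name and made-up sample bytes. It also shows that U+2019 is three bytes in UTF-8 (E2 80 99), whose last two bytes match the <80><99> that less displays.

```python
def first_invalid_utf8(data):
    """Return the byte offset of the first malformed UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


# RIGHT SINGLE QUOTATION MARK (U+2019) as UTF-8: three bytes.
quote = "\u2019".encode("utf-8")
print(quote.hex(" "))

# Well-formed UTF-8 input passes the check.
print(first_invalid_utf8(b"don\xe2\x80\x99t"))

# The same apostrophe as a single Windows-1252 byte (0x92) is not valid UTF-8.
print(first_invalid_utf8(b"don\x92t"))
```

For a real file, the bytes would come from open(path, "rb").read(); the path is whatever input file was given to the encoder.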

Best,
Stefan

_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/cwb

--
Regards, Dani.