[CWB] Accents and codification

Hardie, Andrew a.hardie at lancaster.ac.uk
Fri Dec 18 16:58:07 CET 2015


I direct your attention to the actual error message that comes before the file/line reference you underlined: “Encoding error …”

Since it is falling over on the very first line that contains a non-ASCII character, the obvious possibility is that the file isn’t actually UTF-8 encoded.
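
One way to test that possibility is to try decoding the file as UTF-8, line by line, and report the first line that fails. The following is only a minimal sketch in Python; the function name first_invalid_utf8_line is made up for illustration, and the file to check is whatever path you pass on the command line.

```python
import sys


def first_invalid_utf8_line(path):
    """Return (line_number, raw_bytes) for the first line that is not
    valid UTF-8, or None if the whole file decodes cleanly.

    Hypothetical helper for illustration; it is not part of CWB.
    """
    # Read raw bytes so that Python does not decode anything for us.
    with open(path, "rb") as handle:
        for line_number, raw in enumerate(handle, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                return line_number, raw
    return None


# Only runs when a file path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    result = first_invalid_utf8_line(sys.argv[1])
    if result is None:
        print("File is valid UTF-8")
    else:
        line_number, raw = result
        print("Line %d is not valid UTF-8: %r" % (line_number, raw))
```

If this reports a line containing accented characters, the file is probably in a legacy encoding such as Latin-1 and would need to be converted to UTF-8 before encoding the corpus.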

best

Andrew.


From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Daniel Renau
Sent: 17 December 2015 22:45
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Accents and codification

Ok guys, thank you for the info. I will tell them not to use the script anymore.
Maybe I will write a script later, when I have a little time to sit down and read up on Perl, but for now I need to encode the corpus.

What did they tell me they have been doing so far?
- They take some texts and translations
- Put them in Déjàvu to align them
- Use a Perl script to separate them into two different texts (original and translation)
- Use TreeTagger to tag them
- Join all the texts together to make just one file (cat original* > original.txt)

What do I have?
- About 10 .txt files in 2 languages (original and translation).
I'm at the step just before joining the originals and translations.
What did I do?
- Join the texts. http://i.imgur.com/KN9Z7cY.png
- Look at the tags in them: <s> <text> <unknown> http://i.imgur.com/pTisyew.png (I had to change <unknown> to unknown)
- Fill the form like this. http://i.imgur.com/lmDvLTy.png
- Get the error: http://i.imgur.com/RoY8sRu.png

Is this the best way? Are we doing something wrong?
Thank you all.

2015-12-17 8:25 GMT+01:00 Stefan Evert <stefanML at collocations.de>:

> On 16 Dec 2015, at 21:34, Daniel Renau <alphak87 at gmail.com> wrote:
>
> Now my questions are...
> 1- Would it be better to modify the script so that it calls the encoder with "-c utf8"?

Don't use the script from the command line, but rather write a small Perl script using the CWB::Encoder module.  The command-line script you're running is basically the same thing, and just sets some parameters from command-line flags, others to immutable default values.

With your own Perl script, you can then use the ->charset() method to encode a UTF-8 corpus.  If you know a little Perl, it would also be easy to change the command-line script so that it accepts a new flag for setting the charset.

Best,
Stefan
_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/cwb



--
Regards, Dani.