[CWB] Accents and codification

Daniel Renau alphak87 at gmail.com
Thu Dec 17 23:44:58 CET 2015


OK guys, thank you for the info. I will tell them not to use the script anymore.

Maybe I will write a script later, when I have a little time to sit down
and read up on Perl, but for now I need to encode the corpus.

What did they tell me they have been doing so far?
- They take some texts and their translations
- Put them in Déjàvu to align them
- Use a Perl script to separate them into two different texts (original and
translation)
- Use TreeTagger to tag them
- Join all the texts together into a single file (cat original* > original.txt)
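
The join step could be sketched as a small shell function. One thing to watch with `cat original* > original.txt` is that, on a second run, the output file itself matches `original*`. This is only a sketch: the name `join_texts` and the `.txt` suffix are my assumptions, not what the existing script does.

```shell
#!/bin/sh
# join_texts PREFIX OUTFILE
# Concatenate every PREFIX*.txt into OUTFILE (hypothetical helper).
join_texts() {
    prefix=$1
    out=$2
    tmp=$(mktemp) || return 1
    # Collect everything in a temp file first, so OUTFILE is never
    # read and written at the same time.
    for f in "$prefix"*.txt; do
        [ -e "$f" ] || continue          # glob matched nothing
        [ "$f" = "$out" ] && continue    # skip a previous result file
        cat "$f" >> "$tmp"
    done
    mv "$tmp" "$out"
}
```

It would be called as `join_texts original original.txt`. The skip only works if OUTFILE is written the same way the glob expands it, so giving the output a name that does not start with the prefix is probably the safer choice.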

What do I have?
- About 10 .txt files in 2 languages (original and translation).
I'm at the step just before joining the originals and the translations.
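
Since the plan is to encode with "-c utf8", it may be worth checking that every file really is valid UTF-8 before joining them. A minimal sketch, assuming `iconv` is installed; the function name `check_utf8` is made up for illustration.

```shell
#!/bin/sh
# check_utf8 FILE...
# Report which files are not valid UTF-8 (hypothetical helper).
# Returns 0 only if every file passes.
check_utf8() {
    status=0
    for f in "$@"; do
        # Converting UTF-8 to UTF-8 fails on the first invalid byte sequence.
        if iconv -f UTF-8 -t UTF-8 "$f" > /dev/null 2>&1; then
            echo "ok       $f"
        else
            echo "NOT utf8 $f"
            status=1
        fi
    done
    return $status
}
```

For example `check_utf8 original*.txt translation*.txt` (file names assumed). A file flagged here is probably in another encoding, such as Latin-1, and would need converting first.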

What did I do?
- Joined the texts. http://i.imgur.com/KN9Z7cY.png
- Looked at the tags in them: <s> <text> <unknown> http://i.imgur.com/pTisyew.png
(I had to change <unknown> to unknown)
- Filled in the form like this. http://i.imgur.com/lmDvLTy.png
- Got this error: http://i.imgur.com/RoY8sRu.png
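
The <unknown> change could be done like this instead of by hand. This is a sketch that assumes the tagged files are in the usual tab-separated TreeTagger layout (word, tag, lemma) with <unknown> appearing in the lemma column; the function name `fix_unknown` is hypothetical.

```shell
#!/bin/sh
# fix_unknown FILE
# Print FILE with a lemma of "<unknown>" rewritten as "unknown".
# Assumes three tab-separated columns: word, POS tag, lemma.
fix_unknown() {
    awk -F '\t' 'BEGIN { OFS = "\t" }
        # Only touch token lines; tag lines such as <s> or <text>
        # have a single field and are printed unchanged.
        NF >= 3 && $3 == "<unknown>" { $3 = "unknown" }
        { print }' "$1"
}
```

Used as `fix_unknown original.txt > original.fixed.txt` (output name assumed). Matching only the third column leaves the real <s> and <text> tags alone, which a global search-and-replace would also do here, but this way a token that happens to be the literal text "<unknown>" is not altered.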

Is this the best way? Are we doing something wrong?

Thank you all.

2015-12-17 8:25 GMT+01:00 Stefan Evert <stefanML at collocations.de>:

>
> > On 16 Dec 2015, at 21:34, Daniel Renau <alphak87 at gmail.com> wrote:
> >
> > Now my questions are...
> > 1- Is it better to modify the script to call the encoder with "-c utf8"?
>
> Don't use the script from the command line, but rather write a small Perl
> script using the CWB::Encoder module.  The command-line script you're
> running is basically the same thing, and just sets some parameters from
> command-line flags, others to immutable default values.
>
> With your own Perl script, you can then use the ->charset() method to
> encode a UTF-8 corpus.  If you know a little Perl, it would also be easy to
> change the command-line script so that it accepts a new flag for setting
> the charset.
>
> Best,
> Stefan
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
>



-- 
Regards, Dani.