[CWB] change charset to latin1

Hardie, Andrew a.hardie at lancaster.ac.uk
Tue Mar 9 18:19:12 CET 2010


Hi,

I'm not sure what the issue with the registry is, but I may be able to explain the problem with %d.

The charset property doesn't currently affect accent folding (or, erm, anything else really), though it will once we get the Unicode features up and running, soon-ish!

Diacritic insensitivity (%d) is implemented by treating the byte string as Latin1, regardless of either (a) the charset declaration or (b) the actual character encoding.

If you do either case (%c) or diacritic (%d) folding on data that contains accented letters encoded as UTF8, all it will do is scramble them, unfortunately.
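To see why the folding scrambles UTF8 data, here is a small Python sketch. It does not reproduce CWB's actual folding code or tables: the helper functions are hypothetical stand-ins that simply treat each byte as one Latin1 character, which is the behaviour described above.

```python
import unicodedata


def fold_case_as_latin1(data: bytes) -> bytes:
    """Lowercase a byte string, treating every byte as one Latin1 character.

    Hypothetical stand-in for %c; not CWB's real implementation.
    """
    return data.decode("latin-1").lower().encode("latin-1")


def fold_diacritics_as_latin1(data: bytes) -> bytes:
    """Strip accents, treating every byte as one Latin1 character.

    Hypothetical stand-in for %d; not CWB's real implementation.
    """
    decomposed = unicodedata.normalize("NFD", data.decode("latin-1"))
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.encode("latin-1")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes can be decoded as UTF8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


latin1_e_acute = "\u00e9".encode("latin-1")  # e-acute as one byte: E9
utf8_e_acute = "\u00e9".encode("utf-8")      # e-acute as two bytes: C3 A9

# On real Latin1 data the folding behaves as intended.
print(fold_diacritics_as_latin1(latin1_e_acute))        # b'e'

# On UTF8 data, C3 is read as a Latin1 A-tilde and A9 as a copyright sign,
# so folding rewrites only the first byte of the two-byte sequence.
print(fold_case_as_latin1(utf8_e_acute))                 # b'\xe3\xa9'
print(fold_diacritics_as_latin1(utf8_e_acute))           # b'A\xa9'
print(is_valid_utf8(fold_case_as_latin1(utf8_e_acute)))  # False
```

In both UTF8 cases the output is neither the original character nor its folded form, so a query using %c or %d cannot match it.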

If your data really is UTF8, you have two options: re-code it to Latin1 and then reindex, or wait a few weeks for a candidate release of CWB with proper UTF8 charset support...
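For the re-coding step, something along these lines in Python should do. This is only a sketch: the function names and the file names in the usage comment are made up for illustration, and reindexing the corpus afterwards is a separate step that is not shown.

```python
from pathlib import Path


def recode_utf8_to_latin1(data: bytes) -> bytes:
    """Convert UTF8 bytes to Latin1 bytes.

    Raises UnicodeDecodeError if the input is not valid UTF8, and
    UnicodeEncodeError if it contains a character that Latin1 cannot
    represent (for example the euro sign), so nothing is lost silently.
    """
    # "utf-8-sig" behaves like "utf-8" but also drops a leading byte
    # order mark, which Latin1 has no way to encode.
    text = data.decode("utf-8-sig")
    return text.encode("latin-1")


def recode_file(src: Path, dst: Path) -> None:
    """Re-code one file, working on raw bytes so line endings are untouched."""
    dst.write_bytes(recode_utf8_to_latin1(src.read_bytes()))


# Hypothetical usage (file names are placeholders):
# recode_file(Path("corpus_utf8.vrt"), Path("corpus_latin1.vrt"))
```

Failing loudly on unrepresentable characters is a deliberate choice here: replacing them with "?" would hide exactly the kind of damage you are trying to avoid.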

On the other hand, the XML declaration attribute encoding="ISO-8859-1" does state that the data is encoded in Latin1. But depending on the provenance of your XML files, and on what happened during any preprocessing you did, that declaration may or may not be correct. The only way to be sure is to check the actual bytes used to encode an accented character.
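One way to do that check is a short Python sketch like the one below. It is a heuristic, not a definitive test, and the function names are invented for this example: data with non-ASCII bytes that decodes cleanly as UTF8 is almost certainly UTF8, while data that fails to decode is some 8-bit encoding, possibly (but not provably) Latin1.

```python
def guess_encoding(data: bytes) -> str:
    """Classify raw bytes as 'ascii', 'utf8' or 'not-utf8'.

    'not-utf8' is consistent with Latin1 but does not prove it; other
    8-bit encodings would give the same answer.
    """
    if all(b < 0x80 for b in data):
        return "ascii"  # no accented characters, so nothing to check
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf8"
    return "utf8"


def first_non_ascii(data: bytes, context: int = 8) -> bytes:
    """Return the bytes around the first non-ASCII byte, for inspection."""
    for i, b in enumerate(data):
        if b >= 0x80:
            return data[max(0, i - context): i + context]
    return b""


print(guess_encoding("caf\u00e9".encode("latin-1")))  # not-utf8
print(guess_encoding("caf\u00e9".encode("utf-8")))    # utf8
print(first_non_ascii(b"caf\xe9 au lait", context=3))  # b'caf\xe9 a'
```

An accented letter stored as a single byte (such as E9 for e-acute) points to Latin1; the same letter stored as two bytes (C3 A9) points to UTF8. To apply this to a corpus file, pass in its raw bytes read in binary mode.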

best

Andrew.


-----Original Message-----
From: cwb-bounces at sslmit.unibo.it on behalf of Albert Albert
Sent: Tue 09/03/2010 17:03
To: Open source development of the Corpus WorkBench
Subject: [CWB] change charset to latin1
 
Hi all,
I have a corpus and the diacritic flag (%d) doesn't work. I think
that my charset is UTF8 because I see this commented-out line in the
registry:
##:: charset  = "latin1" # character encoding of corpus data
I want to change it by uncommenting the line:
:: charset = "latin1"
or
charset = "latin1"

But then the registry doesn't work: it returns undefined when I call it.

How do I specify the charset of my corpus?
In the header of my .vrt files I have:
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes" ?>

Thanks for all!
cheers
