[CWB] invalid UTF8 string passed to cl_string_canonical...

Hardie, Andrew a.hardie at lancaster.ac.uk
Thu May 12 19:43:45 CEST 2016


So we now know (thanks Andrés!) that the answer was (a): the problem was the Unicode tables compiled into Glib, and recompiling fixes it.

(The reason this emerges with cwb-align, and not cwb-encode, is that the two programs use different Glib functions to produce the "good/bad utf8" message.)

This is the downside of statically linking Glib, of course: without regular recompilation, the internal Unicode data will gradually fall behind the Unicode standard.

Perhaps there should be a warning in the docs somewhere that if you work with data that is at all likely to include any characters added to Unicode in recent revisions, you ought to rebuild CWB every few months.
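
As a rough way to see whether a given text contains characters that a particular set of Unicode tables does not know about, something like the following Python sketch could be used. Note the assumptions: it consults the tables bundled with the running Python interpreter (reported by unicodedata.unidata_version), not the ones compiled into Glib or CWB, so it only illustrates the principle; the helper name unassigned_chars is made up for this example.

```python
import unicodedata


def unassigned_chars(text):
    """Return (index, 'U+XXXX') pairs for code points that the Unicode
    tables of THIS Python build classify as unassigned (category 'Cn').

    Assumption: this mirrors the idea of stale tables in a statically
    linked Glib, but it does not inspect Glib or CWB in any way.
    """
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cn"
    ]


if __name__ == "__main__":
    # Which Unicode version the local tables were built from.
    print("Unicode tables:", unicodedata.unidata_version)
    # U+0378 is a reserved (unassigned) code point in the Greek block.
    print(unassigned_chars("a\u0378b"))
```

A character added to Unicode after the local tables were built would show up in this list, even though the text itself is perfectly valid UTF-8.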

Andrew.


-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Stefan Evert
Sent: 12 May 2016 16:07
To: CWBdev Mailing List
Subject: Re: [CWB] invalid UTF8 string passed to cl_string_canonical...

Hi Andrew,

you're absolutely right.  I'm sorry I jumped to your hypothesis without checking the source code.

From what I can see, it has always been implemented in this way in CWB 3.4, so the only explanations for the errors seem to be that

a) the corpus contains some weird characters (or character sequences) that cl_string_canonical() or the underlying Glib routines don't handle; or

b) the corpus isn't valid UTF-8 after all; or

c) the source corpus is UTF-8, but the target corpus has a different encoding. cwb-align expects both corpora to have the same encoding, but it doesn't actually check this and simply uses the declared encoding of the first corpus.
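
Explanations (b) and (c) can be ruled out independently of CWB by checking the raw input bytes. The following Python sketch is only an illustration under stated assumptions: it uses Python's own strict UTF-8 decoder rather than the Glib routines behind cl_string_canonical(), and the helper name first_invalid_utf8_offset is hypothetical.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence in
    `data`, or None if the whole buffer decodes cleanly.

    Assumption: Python's strict UTF-8 decoder stands in for whatever
    validation CWB/Glib performs; the two may differ in edge cases.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


if __name__ == "__main__":
    good = "Ελληνικά".encode("utf-8")
    bad = "café".encode("latin-1")  # Latin-1 bytes mistaken for UTF-8
    print(first_invalid_utf8_offset(good))
    print(first_invalid_utf8_offset(bad))
```

Running the same check over the input files of both the source and the target corpus would show whether one of them is in a different encoding than declared.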

I guess I should update the bug ticket …

Best,
Stefan


> On 12 May 2016, at 03:56, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:
> 
> @ Andrés - while considering your original message again, I noticed your error message....
> 
> " CL: major error, invalid UTF8 string passed to cl_string_canonical... "
> 
> ... is actually out of date. I changed it in March last year, to be more specific. Your version lacks this and a bag of other changes I made at that time.
> 
> Can you recompile with an up to date copy of the code (and also make sure your copy of Glib is as up to date as possible), and recheck to see if you still get the error messages? It's just possible the error will go away on its own.
> 
> The previous report of this error, from Ruprecht, also went away upon recompilation, incidentally (because of newer Unicode tables, or so we thought at the time).
> 
> best
> 
> Andrew.
> 
> PS - @ Stefan, here's what I mean : https://sourceforge.net/p/cwb/code/624/
> 
> And here's the last time we ran into this: http://devel.sslmit.unibo.it/pipermail/cwb/2015-March/thread.html (the thread called "[CWB] unicode problems with Greek and OCS", from Ruprecht von Waldenfels). Start at the top and go down.
> 
> If our memories were only a bit better....
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
