[CWB] invalid UTF8 string passed to cl_string_canonical...
stefanML at collocations.de
Thu May 12 17:06:48 CEST 2016
You're absolutely right. I'm sorry I jumped to your hypothesis without checking the source code.
From what I can see, it has always been implemented in this way in CWB 3.4, so the only explanations for the errors seem to be that
a) the corpus contains some weird characters (or character sequences) that cl_string_canonical() or the underlying Glib routines don't handle; or
b) the corpus isn't valid UTF-8 after all; or
c) the source corpus is UTF-8, but the target corpus has a different encoding. cwb-align expects both corpora to have the same encoding, but it doesn't actually check this and simply uses the declared encoding of the first corpus.
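To rule out (b) before digging further, it may help to scan the input text for byte sequences that are not valid UTF-8. The following is only a rough sketch in Python, not part of CWB: the function names find_invalid_utf8 and check_file are made up for illustration, and it assumes the corpus input is available as a plain text file (e.g. the verticalised file fed to cwb-encode). It relies on Python's strict UTF-8 decoder, which need not agree in every detail with the Glib routines that cl_string_canonical() uses.

```python
def find_invalid_utf8(lines):
    """Yield (line_number, byte_offset, reason) for each line that is not valid UTF-8.

    `lines` is any iterable of bytes objects, e.g. a file opened in binary mode.
    Only the first offending byte sequence of each line is reported.
    """
    for lineno, raw in enumerate(lines, start=1):
        try:
            raw.decode("utf-8")  # strict decoding: raises on any malformed sequence
        except UnicodeDecodeError as err:
            yield lineno, err.start, err.reason


def check_file(path):
    """Return a list of (line_number, byte_offset, reason) for one file (hypothetical helper)."""
    with open(path, "rb") as handle:
        return list(find_invalid_utf8(handle))


if __name__ == "__main__":
    import sys

    # Usage: python check_utf8.py FILE [FILE ...]
    for name in sys.argv[1:]:
        for lineno, offset, reason in check_file(name):
            print(f"{name}:{lineno}: byte {offset}: {reason}")
```

A clean run would only speak against (b). It says nothing about (a), where the text is valid UTF-8 but contains characters the canonicalisation code doesn't handle, and for (c) the same check would have to be run on the target corpus as well.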
I guess I should update the bug ticket …
> On 12 May 2016, at 03:56, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:
> @ Andrés - while considering your original message again, I noticed your error message....
> " CL: major error, invalid UTF8 string passed to cl_string_canonical... "
> ... is actually out of date. I changed it in March last year, to be more specific. Your version lacks this and a bag of other changes I made at that time.
> Can you recompile with an up to date copy of the code (and also make sure your copy of Glib is as up to date as possible), and recheck to see if you still get the error messages? It's just possible the error will go away on its own.
> The previous report of this error, from Ruprecht, *Also* went away upon recompilation, incidentally (because of newer Unicode tables, or so we thought at the time)
> PS - @ Stefan, here's what I mean : https://sourceforge.net/p/cwb/code/624/
> And here's the last time we ran into this: http://devel.sslmit.unibo.it/pipermail/cwb/2015-March/thread.html thread called "[CWB] unicode problems with Greek and OCS Ruprecht von Waldenfels" - start at the top and go down.
> If our memories were only a bit better....