[CWB] invalid UTF8 string passed to cl_string_canonical...
Hardie, Andrew
a.hardie at lancaster.ac.uk
Mon May 9 17:38:31 CEST 2016
Well, if the correct output is produced, you can just ignore it. (And I can’t currently think why it wouldn’t be: I have now remembered why the aligner uses that function. It is for accent-insensitive character comparison, so the fact that some of the comparanda terminate halfway through a character should only mean that those comparisons are of no use in detecting parallels.)
If not, you can use the configuration flags to specify that the alignment should not use letter n-grams as a feature, thus avoiding that branch of the code. See man cwb-align, in particular the flags -1, -2, etc.
Fixing this bug is something that needs to be done but is going to be a right royal pain in the neck because it will mean fairly complex checking of the byte sequences – so not something I am going to have time for in the near future I’m afraid.
best
Andrew.
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of "Andrés Chandía"
Sent: 09 May 2016 15:37
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] invalid UTF8 string passed to cl_string_canonical...
yes I use: cwb-encode -c utf8
so, what should I do?
On Mon, 9 May 2016, 15:55, Hardie, Andrew wrote:
Is the corpus declared as UTF-8?
If so, the problem is likely to be that, in testing letter n-grams, the aligner is slicing up UTF-8 characters. (I’m not quite sure why this causes an error with cl_string_canonical, as I wasn’t aware that the aligner used that function, but possibly I’ve just forgotten.)
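The slicing effect described here can be reproduced outside CWB. The following Python sketch is an illustration only, with hypothetical function names, and it assumes (as the message suggests) that the n-grams are taken over bytes rather than characters. It shows that byte-wise 3-grams of a word containing a two-byte character include chunks that are not valid UTF-8, while character-wise 3-grams do not.

```python
from __future__ import annotations


def byte_ngrams(word: str, n: int) -> list[bytes]:
    """All n-grams taken over the UTF-8 bytes of `word`."""
    data = word.encode("utf-8")
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def char_ngrams(word: str, n: int) -> list[str]:
    """All n-grams taken over the characters (code points) of `word`."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def is_valid_utf8(chunk: bytes) -> bool:
    """True if `chunk` decodes as UTF-8 without error."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    word = "später"  # 6 characters, 7 bytes: 'ä' is encoded as C3 A4
    for gram in byte_ngrams(word, 3):
        status = "ok" if is_valid_utf8(gram) else "INVALID"
        print(gram.hex(" "), status)
    print(char_ngrams(word, 3))
```

For this word, two of the five byte 3-grams are invalid: one ends after the first byte of 'ä', and one begins with its second byte.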
best
Andrew.
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of "Andrés Chandía"
Sent: 09 May 2016 14:31
To: Open source development of the Corpus WorkBench
Subject: [CWB] invalid UTF8 string passed to cl_string_canonical...
I'm getting this error message when aligning, but I don't know how to deal with it. I found just one comment about it, but it didn't help me. Thanks.
OPENING btcataladeutsch_ca [205899 tokens, 7733 regions]
OPENING btcataladeutsch_de [112264 tokens, 4951 regions]
LEXICON SIZE: 24709 / 19889
FEATURE: character count, weight=1 ... [1]
FEATURE: Shared words, threshold=40.0%, weight=50 ... [6]
FEATURE: 3-grams, weight=3 ... CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
[21952]
FEATURE: 4-grams, weight=4 ... CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
[614656]
[636615 features allocated]
[290636 entries in source text feature map]
[296034 entries in target text feature map]
PASS 2: Setting character count weight.
PASS 2: Processing shared words (th=40.0%).
PASS 2: Processing 3-grams.
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
PASS 2: Processing 4-grams.
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
PASS 2: Creating character counts.
_______________________
andrés chandía
[chandia.net]<http://www.chandia.net> <https://twitter.com/andreschandia>
administrator of:
parles.upf<http://parles.upf.edu> | delingua<http://www.delingua.es> | amind terapia<http://amindterapia.com> | mapuche koyaktu<http://koyaktumapuche.net> | mail ong mapuche koyaktu<http://mail.corporacionkoyaktu.net> | mail psicoaching<http://mail.psicoaching.net> |
Do not print unnecessarily. Protect the environment!