[CWB] invalid UTF8 string passed to cl_string_canonical...

Hardie, Andrew a.hardie at lancaster.ac.uk
Mon May 9 15:55:25 CEST 2016


Is the corpus declared as UTF-8?

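If so, the problem is likely that, in testing letter n-grams, the aligner is slicing up UTF-8 characters: presumably it cuts the strings byte by byte, so an n-gram can begin or end in the middle of a multi-byte character and is then no longer valid UTF-8. (I'm not quite sure why this causes an error with cl_string_canonical, as I wasn't aware that the aligner used that function... but possibly I've just forgotten.)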
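
To illustrate the kind of slicing I mean, here is a minimal Python sketch. It is not the aligner's actual code, and the helper names (byte_ngrams, char_ngrams, is_valid_utf8) are made up for the example. It only assumes that n-grams are cut byte-wise from a UTF-8 string and shows that some of them then fail UTF-8 validation, whereas n-grams cut per character do not.

```python
def byte_ngrams(token, n):
    """Cut n-grams byte by byte from the UTF-8 encoding of token (hypothetical helper)."""
    data = token.encode("utf-8")
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def char_ngrams(token, n):
    """Cut n-grams character by character, so no character is ever split."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def is_valid_utf8(chunk):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


token = "català"  # the final 'à' occupies two bytes in UTF-8 (0xC3 0xA0)

# Byte-wise 3-grams: the one that ends on the first byte of 'à' is invalid.
broken = [g for g in byte_ngrams(token, 3) if not is_valid_utf8(g)]

# Character-wise 3-grams: every one re-encodes to valid UTF-8.
safe = [g.encode("utf-8") for g in char_ngrams(token, 3)]
all_safe_valid = all(is_valid_utf8(g) for g in safe)
```

With this token, one of the five byte-wise 3-grams stops after the lead byte of the final character. A function that validates its input as UTF-8 would reject a string like that, which would match the repeated messages in the log below if the aligner does work this way.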

best

Andrew.

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of "Andrés Chandía"
Sent: 09 May 2016 14:31
To: Open source development of the Corpus WorkBench
Subject: [CWB] invalid UTF8 string passed to cl_string_canonical...

I'm getting this error message when aligning, but I don't know how to deal with it. I found just one comment about it, and it didn't help me. Thanks.

OPENING btcataladeutsch_ca [205899 tokens, 7733 <s_id> regions]
OPENING btcataladeutsch_de [112264 tokens, 4951 <s_id> regions]
LEXICON SIZE: 24709 / 19889
FEATURE: character count, weight=1 ... [1]
FEATURE: Shared words, threshold=40.0%, weight=50 ... [6]
FEATURE: 3-grams, weight=3 ... CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
[21952]
FEATURE: 4-grams, weight=4 ... CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
[614656]
[636615 features allocated]
[290636 entries in source text feature map]
[296034 entries in target text feature map]
PASS 2: Setting character count weight.
PASS 2: Processing shared words (th=40.0%).
PASS 2: Processing 3-grams.
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
PASS 2: Processing 4-grams.
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
PASS 2: Creating character counts.

_______________________
            andrés chandía