[CWB] invalid UTF8 string passed to cl_string_canonical...

"Andrés Chandía" andres at chandia.net
Mon May 9 16:36:58 CEST 2016



Yes, I use: cwb-encode -c utf8
So, what should I do?
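
One way to rule out bad input data is to check that the text given to cwb-encode actually decodes as UTF-8. The sketch below is only an illustration under that assumption; the function name and the file name in the usage note are hypothetical and are not part of CWB.

```python
def invalid_utf8_lines(path):
    """Return the 1-based numbers of lines in `path` that are not valid UTF-8."""
    bad = []
    # Read raw bytes so that decoding errors are ours to detect, line by line.
    with open(path, "rb") as handle:
        for lineno, raw in enumerate(handle, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                bad.append(lineno)
    return bad
```

For example, `invalid_utf8_lines("corpus_input.txt")` (a hypothetical file name) returns an empty list when every line decodes cleanly. If it does, the input itself is probably fine and the errors more likely come from how the strings are processed later.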

On Mon, 9 May 2016, 15:55, Hardie, Andrew wrote:
  

Is the corpus declared as UTF-8?

If so, the problem is likely to be that, in testing letter n-grams, the aligner is slicing up UTF-8 characters. (I'm not quite sure why this causes an error with cl_string_canonical, as I wasn't aware that the aligner used that function... but possibly I've just forgotten).
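
To illustrate the suspected failure mode (a minimal sketch, not the aligner's actual code): if n-grams are cut by byte offset rather than by character, a multi-byte UTF-8 character can be split, and the resulting fragment is no longer valid UTF-8. All function names below are hypothetical.

```python
def byte_ngrams(text, n):
    """Slide a window of n bytes over the UTF-8 encoding of text (naive approach)."""
    data = text.encode("utf-8")
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def char_ngrams(text, n):
    """Slide a window of n characters (code points) over text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def is_valid_utf8(chunk):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


word = "català"  # the final 'à' occupies two bytes in UTF-8
# Byte-based 3-grams that cut through the two-byte character:
broken = [g for g in byte_ngrams(word, 3) if not is_valid_utf8(g)]
# Character-based 3-grams never split a character:
safe = char_ngrams(word, 3)
```

Here `broken` contains the fragment that ends halfway through "à", whereas every element of `safe` is a well-formed string. Passing fragments like the former to a function that expects valid UTF-8 would be consistent with the error message reported.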
 
best
 
Andrew.
 
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of "Andrés Chandía"
Sent: 09 May 2016 14:31
To: Open source development of the Corpus WorkBench
Subject: [CWB] invalid UTF8 string passed to cl_string_canonical...
 
I'm getting this error message when aligning, but I don't know how to deal with it. I found only one comment about it, and it didn't help me. Thanks.
 

OPENING btcataladeutsch_ca [205899 tokens, 7733  regions]
OPENING btcataladeutsch_de [112264 tokens, 4951  regions]
LEXICON SIZE: 24709 / 19889
FEATURE: character count, weight=1 ... [1]
FEATURE: Shared words, threshold=40.0%, weight=50 ... [6]
FEATURE: 3-grams, weight=3 ... CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
[21952]
FEATURE: 4-grams, weight=4 ... CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
[614656]
[636615 features allocated]
[290636 entries in source text feature map]
[296034 entries in target text feature map]
PASS 2: Setting character count weight.
PASS 2: Processing shared words (th=40.0%).
PASS 2: Processing 3-grams.
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
PASS 2: Processing 4-grams.
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
PASS 2: Creating character counts.
 

_______________________
andrés chandía

administrator of:
parles.upf | delingua | amind terapia | mapuche koyaktu | mail ong mapuche koyaktu | mail psicoaching |
Please do not print unnecessarily. Take care of the environment!