[CWB] Encoding problem in CWB version 3.2.4

Tue Sep 28 14:04:45 CEST 2010

Hi,
thanks a lot for the help - and sorry I opened such a can of worms :)
Indeed, if I simply run cqp -e then the problem goes away; ok, I can't 
see the UTF-8 characters on the screen correctly, but that could easily 
be down to my terminal settings. Or maybe to the fact that info does 
show that cqp thinks the corpus is latin1.
I have to admit I wasn't even aware that the registry file can (should) 
declare the character set encoding, which just goes to show I have some 
more reading to do. But in my defence, 
http://cwb.sourceforge.net/files/CWB_Encoding_Tutorial.pdf
only mentions the registry in the appendix and doesn't list the 
declarations it can contain. So any hints on where the registry format 
and the correct utf8 setting are explained are very welcome.
All the best,
Tomaž

On 28.9.2010 10:54, Stefan Evert wrote:
> Ah, I see the problem now, and can reproduce Tomaz's problem if I run
> cqp without "-e" (something one should _never_ do, of course ;-).
> With command-line editing enabled, the newline is stripped before
> each command is passed to CQP.
>
> In my book, cl_string_validate_encoding() should only be called on
> "object" strings and regular expressions passed to the CL, not on
> complete queries and input lines.
>
> I recall adding the TAB-exception to INVALID_CTRL specifically so we
> could run it on the full input lines in cwb-encode, so I'm clearly
> not true to myself.  However, I don't think I had problems with the
> -C option like those reported by Peter, so perhaps his files have
> Windows line endings that don't get stripped properly on a Unix
> platform?
>
> The easy solution is to make INVALID_CTRL more lenient so that it
> allows CR and LF characters as well -- that should cover everything
> that's allowed to pop up in CQP queries, right?
>
> I think it would still be better to do the validation only for those
> strings that are actually used by CL functions.  I'm a little bit
> queasy about the use of QueryBuffer -- IIRC, that's a bad hack I
> designed for my macro expansion technique, and I've never been
> confident that the whole thing works correctly :-).
>
> One negative side-effect is that filenames read and written from
> within CQP must be in the current corpus encoding.  If anything, we
> should force people to use ASCII filenames, but not e.g. Latin1 on a
> Mac OS X computer.  Any chance of improving the validation method in
> the mid to long run, or is that just too complicated?
>
> @Tomaz: If you get these error messages, that means your corpus
> encoding is not properly declared as "utf8", but rather as "latin1"
> (which may be implicitly set for an undeclared encoding?). Can you
> check your registry file or the output of CQP's "info" command?  Did
> you automatically generate the registry file with cwb-encode's -R option?
> This is very much recommended to ensure consistency of the charset
> declaration.
>
> Cheers, Stefan
>
> On 28 Sep 2010, at 10:18, Hardie, Andrew wrote:
>
>> @Stefan - it's probably the INVALID_CTRL macro in special-chars.
>> It treats anything under 0x20 as invalid, which I think you put in
>> because you can't have \n or \r in CWB tokens.
>>
>> #define INVALID_CTRL(c) (c < 0x20 && c != 0x09)
>>
>> BUT when that same function is used to validate query strings it
>> will fail - they have a \n on the end because the user just pressed
>> ENTER. I *think* that's it, anyway. The parser code is a source of
>> immense and continuing mystery to me.
>>
>> (Also, I believe I made cwb-encode run the validation check on
>> entire input lines, not on individual tokens - so that may also now
>> be a bug if the line break is not trimmed off before that
>> happens.)
>
> _______________________________________________ CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
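
PS: To make the INVALID_CTRL discussion quoted above concrete, here is a
minimal sketch of the "more lenient" variant Stefan suggests. It is only
an illustration, not the actual CWB code: the name INVALID_CTRL_LENIENT
and the two line_ok_* helpers are made up for this example, and the
strict macro is just the quoted definition with extra parentheses.

```c
#include <stddef.h>

/* Strict check as quoted in the thread: any byte below 0x20 is
 * invalid, except TAB (0x09). */
#define INVALID_CTRL(c) ((c) < 0x20 && (c) != 0x09)

/* Hypothetical lenient variant (name invented for this sketch):
 * additionally allows LF (0x0A) and CR (0x0D), so a query line that
 * still carries its line ending is not rejected. */
#define INVALID_CTRL_LENIENT(c) \
  ((c) < 0x20 && (c) != 0x09 && (c) != 0x0A && (c) != 0x0D)

/* Hypothetical helper: returns 1 if no byte of s is rejected by the
 * strict macro, 0 otherwise. Bytes are read as unsigned char so that
 * UTF-8 bytes >= 0x80 are not mistaken for negative values. */
int line_ok_strict(const char *s)
{
  const unsigned char *p;
  for (p = (const unsigned char *)s; *p != '\0'; p++)
    if (INVALID_CTRL(*p))
      return 0;
  return 1;
}

/* Same scan, using the lenient macro instead. */
int line_ok_lenient(const char *s)
{
  const unsigned char *p;
  for (p = (const unsigned char *)s; *p != '\0'; p++)
    if (INVALID_CTRL_LENIENT(*p))
      return 0;
  return 1;
}
```

With this, a query line ending in "\n" (or "\r\n" from a file with
Windows line endings) fails the strict check but passes the lenient one,
while other control characters are still rejected by both. It only checks
control characters and says nothing about whether the bytes are valid in
the corpus encoding.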

-- 
Tomaž Erjavec
Dept. of Knowledge Technologies
Jožef Stefan Institute, Ljubljana
WWW: http://nl.ijs.si/et/