[CWB] Encoding problem in CWB version 3.2.4

Tue Sep 28 10:54:41 CEST 2010

Ah, I see the problem now, and can reproduce Tomaz's problem if I run cqp without "-e" (something one should _never_ do, of course ;-).  With command-line editing enabled, the newline is stripped before each command is passed to CQP.

In my book, cl_string_validate_encoding() should only be called on "object" strings and regular expressions passed to the CL, not on complete queries and input lines.

I recall adding the TAB-exception to INVALID_CTRL specifically so we could run it on the full input lines in cwb-encode, so I'm clearly not true to myself.  However, I don't think I had problems with the -C option like those reported by Peter, so perhaps his files have Windows line endings that don't get stripped properly on a Unix platform?
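
For the line-ending case, a minimal sketch of trimming trailing CR and LF from an input line before it goes to validation might look like the following. The helper name chomp_line() is made up for illustration and is not taken from the CWB source; it only assumes a writable, NUL-terminated line buffer.

```c
#include <string.h>

/* Hypothetical helper (not part of the CWB source): strip any trailing
 * LF and CR characters from a NUL-terminated line, in place.
 * Handles Unix ("\n") and Windows ("\r\n") line endings alike.
 * Returns the length of the trimmed string. */
static size_t chomp_line(char *line)
{
  size_t len = strlen(line);

  while (len > 0 && (line[len - 1] == '\n' || line[len - 1] == '\r'))
    line[--len] = '\0';

  return len;
}
```

With something like this applied first, a check that rejects everything below 0x20 except TAB would no longer trip over the line terminator itself.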

The easy solution is to make INVALID_CTRL more lenient so that it allows CR and LF characters as well -- that should cover everything that's allowed to pop up in CQP queries, right?
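
As a rough sketch of that lenient variant: the names INVALID_CTRL_LENIENT and has_no_invalid_ctrl() below are invented for this example and do not exist in the CL. The macro argument is parenthesised, and bytes are read as unsigned char so that UTF-8 bytes above 0x7F are not mistaken for control characters on platforms where char is signed.

```c
/* Hypothetical lenient variant of INVALID_CTRL (name invented here):
 * control characters below 0x20 are invalid, except TAB (0x09),
 * LF (0x0A) and CR (0x0D). */
#define INVALID_CTRL_LENIENT(c) \
  ((c) < 0x20 && (c) != 0x09 && (c) != 0x0A && (c) != 0x0D)

/* Hypothetical helper: returns 1 if the NUL-terminated string s contains
 * no disallowed control character, 0 otherwise. */
static int has_no_invalid_ctrl(const char *s)
{
  const unsigned char *p = (const unsigned char *)s;

  for (; *p; p++) {
    if (INVALID_CTRL_LENIENT(*p))
      return 0;
  }
  return 1;
}
```

A query line that still carries its terminating newline would then pass, while other control characters (e.g. 0x01) would still be rejected.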

I think it would still be better to do the validation only for those strings that are actually used by CL functions.  I'm a little bit queasy about the use of QueryBuffer -- IIRC, that's a bad hack I designed for my macro expansion technique, and I've never been confident that the whole thing works correctly :-).

One negative side-effect of validating complete input lines is that filenames read and written from within CQP must be in the current corpus encoding.  If anything, we should force people to use ASCII filenames, but not e.g. Latin1 on a Mac OS X computer.  Any chance of improving the validation method in the mid to long run, or is that just too complicated?

@Tomaz: If you get these error messages, that means your corpus encoding is not properly declared as "utf8", but rather as "latin1" (which may be implicitly set for an undeclared encoding?). Can you check your registry file or the output of CQP's "info" command?  Did you automatically generate the registry file with cwb-encode's -R option?  This is very much recommended to ensure consistency of the charset declaration.

Cheers,
Stefan

On 28 Sep 2010, at 10:18, Hardie, Andrew wrote:

> @Stefan - it's probably the INVALID_CTRL macro in special-chars. It
> treats anything under 0x20 as invalid, which I think you put in because
> you can't have \n or \r in CWB tokens. 
> 
> #define INVALID_CTRL(c) (c < 0x20 && c != 0x09)
> 
> BUT when that same function is used to validate query strings it will
> fail - they have a \n on the end because the user just pressed ENTER. I
> *think* that's it, anyway. The parser code is a source of immense and
> continuing mystery to me.
> 
> (Also, I believe I made cwb-encode run the validation check on entire
> input lines, not on individual tokens - so that may also now be a bug if
> the line break is not trimmed off before that happens.)