[CWB] INVALID_CTRL marking \n wrongly? (schtepf)

Stefan Evert stefanML at collocations.de
Thu Jan 6 08:52:57 CET 2011


> Yes, known issue. Stefan and I were actually talking about precisely this at the start of Oct, when term hit and we suddenly had no more time for programming.
> 
> The current situation is clearly wrong, BUT there are certain implications regarding parity of treatment of C0 control chars in Latin-1 vs UTF-8, so it's not obvious what the Right Thing is.

The conclusion at the time was that Andrew is right and I had just been too much of a pedantic bureaucrat when I added that code.  I had been firmly convinced that the change had been reverted in the meantime (that's why it didn't occur to me that this could be your problem), but apparently both of us stopped coding as soon as we'd reached the agreement.

@Andrew: I'm in favour of a separate "cleanup" flag for cwb-encode, which deletes or rewrites all invalid and control characters, so careless users don't run into nasty surprises.  (The underlying function should probably take a flag that decides whether TABs and newlines are invalid or not.)
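
To make the idea concrete, here is a minimal sketch in Python of what such a cleanup function might look like. This is not CWB code: the name `cleanup` and the parameters `keep_tab_newline` and `replacement` are hypothetical, and it assumes "control characters" means the C0 range plus DEL and that invalid input means bytes that cannot be decoded in the corpus encoding.

```python
def cleanup(raw: bytes, encoding: str = "utf-8",
            keep_tab_newline: bool = True, replacement: str = "") -> str:
    """Delete or rewrite invalid bytes and control characters.

    Hypothetical illustration only, not the cwb-encode implementation.
    """
    # Undecodable bytes become U+FFFD, so they can be handled like controls.
    # (A genuine U+FFFD already present in the input is treated the same way.)
    text = raw.decode(encoding, errors="replace")
    out = []
    for ch in text:
        code = ord(ch)
        is_control = code < 0x20 or code == 0x7F  # C0 range plus DEL
        if ch in "\t\n" and keep_tab_newline:
            out.append(ch)            # the flag decides whether these are valid
        elif is_control or ch == "\ufffd":
            out.append(replacement)   # "" deletes, anything else rewrites
        else:
            out.append(ch)
    return "".join(out)


print(repr(cleanup(b"a\x01b\tc\n")))                      # TAB and newline kept
print(repr(cleanup(b"a\tb\n", keep_tab_newline=False)))   # TAB and newline dropped
```

With `keep_tab_newline=True` the function suits raw input lines, where TAB and newline are structural; with `False` it suits individual attribute values, where they must not occur.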

Cheers & thanks for the quick fix, Andrew,
Stefan


