[CWB] Problem encoding corpus with POS tags

Stefan Evert stefanML at collocations.de
Tue Nov 6 17:09:18 CET 2012


>  This is a common enough gotcha that we should probably give cwb-encode the ability to spot CR on POSIX and raise the alarm.

Two ideas off the top of my head:

 - We could extend -B to remove all whitespace characters around tokens, not just blanks.

 - We should probably change line #46 of cwb-encode.c to 

	#define FIELDSEPS  "\t\n\r"

If I read the manpage correctly, this should solve the problem in future.
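
A rough sketch of what the two ideas would do, not the actual cwb-encode code. It assumes the fields are split with something strtok()-like. FIELDSEPS_OLD is only my guess at a definition without "\r", and split_fields() and trim_whitespace() are made-up helper names for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* FIELDSEPS_NEW is the definition proposed above; FIELDSEPS_OLD is an
   assumed stand-in for a definition that lacks "\r". */
#define FIELDSEPS_OLD "\t\n"
#define FIELDSEPS_NEW "\t\n\r"

/* Split `line` in place on any character in `seps`. Stores up to `max`
   field pointers in `fields` and returns the number of fields stored.
   Uses strtok(), so empty fields are skipped. */
int split_fields(char *line, const char *seps, char **fields, int max)
{
    int n = 0;
    char *tok;

    assert(line != NULL && seps != NULL && fields != NULL);
    tok = strtok(line, seps);
    while (tok != NULL && n < max) {
        fields[n++] = tok;
        tok = strtok(NULL, seps);
    }
    return n;
}

/* Hypothetical helper for the first idea: strip every whitespace
   character (not just blanks) from both ends of a token, in place.
   Returns a pointer to the first non-whitespace character. */
char *trim_whitespace(char *s)
{
    char *end;

    while (*s != '\0' && isspace((unsigned char)*s))
        s++;
    end = s + strlen(s);
    while (end > s && isspace((unsigned char)end[-1]))
        end--;
    *end = '\0';
    return s;
}
```

With a DOS-style input line such as "house\tNN\r\n", splitting on FIELDSEPS_OLD leaves the CR glued to the last field ("NN\r"), whereas FIELDSEPS_NEW yields a clean "NN". The trimming helper gets the same result from the token side, whatever the separators are.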

Whatcha think?
Stefan



