[CWB] Problem encoding corpus with POS tags
Stefan Evert
stefanML at collocations.de
Tue Nov 6 17:09:18 CET 2012
> This is a common enough gotcha that we should probably give cwb-encode the ability to spot CR on POSIX and raise the alarm.
Two ideas off the top of my head:
- We could extend -B to remove all whitespace characters around tokens, not just blanks.
- We should probably change line #46 of cwb-encode.c to
#define FIELDSEPS "\t\n\r"
If I read the manpage correctly, this should solve the problem in future.
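
A minimal sketch of both ideas, for illustration only. It assumes the field splitting behaves like plain strtok(), which may not match what cwb-encode.c actually does. FIELDSEPS_OLD is a guess at the current definition, and split_fields() and strip_whitespace() are hypothetical helpers, not functions from the CWB source.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Assumed current separator set vs. the proposed one (both illustrative). */
#define FIELDSEPS_OLD "\t\n"
#define FIELDSEPS_NEW "\t\n\r"

/* Split `line` in place into at most `max_fields` fields with strtok().
   Stores pointers to the fields in `fields` and returns how many were found.
   Note that strtok() skips empty fields (runs of separators count as one). */
static int split_fields(char *line, const char *seps, char **fields, int max_fields)
{
    int n = 0;
    char *tok = strtok(line, seps);
    while (tok != NULL && n < max_fields) {
        fields[n++] = tok;
        tok = strtok(NULL, seps);
    }
    return n;
}

/* Sketch of an extended -B: trim any leading and trailing whitespace
   (as classified by isspace(), so including CR) from a token, in place.
   Returns a pointer to the first non-whitespace character. */
static char *strip_whitespace(char *s)
{
    char *end;
    while (*s != '\0' && isspace((unsigned char)*s))
        s++;
    end = s + strlen(s);
    while (end > s && isspace((unsigned char)end[-1]))
        end--;
    *end = '\0';
    return s;
}
```

With an input line such as "dog\tNN\r\n" (CRLF line ending), the old separator set leaves the CR attached to the last field ("NN\r"), while the new set yields a clean "NN". The strip_whitespace() variant would catch the same CR at the token level, along with any other stray whitespace.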
What do you think?
Stefan