[CWB] [ cwb-Bugs-3391655 ] cwb-scan-corpus: -C option doesn't respect encoding

SourceForge.net noreply at sourceforge.net
Mon Aug 15 01:47:30 CEST 2011


Bugs item #3391655, was opened at 2011-08-14 23:47
Message generated for change (Tracker Item Submitted) made by andrewhardie
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=722303&aid=3391655&group_id=131809

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Command-line utilities
Group: TODO-3.5
Status: Open
Resolution: None
Priority: 7
Private: No
Submitted By: Andrew Hardie (andrewhardie)
Assigned to: Andrew Hardie (andrewhardie)
Summary: cwb-scan-corpus: -C option doesn't respect encoding

Initial Comment:
The -C ("cleanup") option checks whether each token value is "regular", that is, whether it consists only of letters, numbers and hyphens.

But this check is done solely on the basis of Latin1. It will fail badly with UTF-8 and may fail with legacy ISO encodings.
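
To illustrate the failure mode, here is a minimal sketch of a byte-wise, Latin1-only check. It is not the code from cwb-scan-corpus.c: the names latin1_is_letter and latin1_is_regular and the exact byte ranges are assumptions made for this example, chosen only to show why a per-byte test cannot work on multi-byte UTF-8 input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical Latin1-only letter test (not the real CWB function):
 * ASCII letters, plus the accented letters at 0xC0-0xFF, excluding
 * the multiplication sign (0xD7) and the division sign (0xF7). */
static int latin1_is_letter(unsigned char c)
{
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
        return 1;
    return c >= 0xC0 && c != 0xD7 && c != 0xF7;
}

/* Hypothetical byte-wise "regular token" test: every byte must be a
 * Latin1 letter, an ASCII digit, or a hyphen. Each byte is treated as
 * one character, which is only valid for single-byte encodings. */
static int latin1_is_regular(const char *token)
{
    const unsigned char *p;

    assert(token != NULL);
    for (p = (const unsigned char *)token; *p != '\0'; p++) {
        if (!(latin1_is_letter(*p) || (*p >= '0' && *p <= '9') || *p == '-'))
            return 0;
    }
    return 1;
}
```

With these definitions, latin1_is_regular("caf\xE9") (the word "café" encoded in Latin1) returns 1, while latin1_is_regular("caf\xC3\xA9") (the same word encoded in UTF-8) returns 0, because the continuation byte 0xA9 is judged on its own and is not a letter in Latin1. Under this sketch every UTF-8 token containing a non-ASCII character would be rejected in the same way.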

The culpable functions are is_regular and is_letter in cwb-scan-corpus.c. It would probably be better to provide this as CL functionality in special-chars.c (which can call GLib if necessary).

In the short term, the -C behaviour should NOT be relied on except for Latin1 corpora.

----------------------------------------------------------------------
