[CWB] [ cwb-Bugs-3391655 ] cwb-scan-corpus: -C option doesn't respect encoding

SourceForge.net noreply at sourceforge.net
Tue Jan 17 23:13:09 CET 2012


Bugs item #3391655, was opened at 2011-08-14 16:47
Message generated for change (Comment added) made by andrewhardie
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=722303&aid=3391655&group_id=131809

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Command-line utilities
Group: TODO-3.5
>Status: Closed
>Resolution: Fixed
Priority: 7
Private: No
Submitted By: Andrew Hardie (andrewhardie)
Assigned to: Andrew Hardie (andrewhardie)
Summary: cwb-scan-corpus: -C option doesn't respect encoding

Initial Comment:
The -C "cleanup" option checks whether each token value is "regular", that is, whether it consists only of letters, digits and hyphens.

But the check assumes Latin1 throughout. It will fail badly with UTF-8 and may fail with other legacy ISO encodings.
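To illustrate the failure mode, here is a minimal sketch of a byte-wise Latin1 check. The names latin1_is_letter(), latin1_is_regular() and demo_latin1_check() are hypothetical stand-ins, not the actual code in cwb-scan-corpus.c, and the letter ranges are an assumption about what a Latin1-only test would accept. Every UTF-8 continuation byte lies in 0x80-0xBF, which such a test never treats as a letter, so any token containing a non-ASCII character is rejected.

```c
#include <assert.h>

/* Hypothetical Latin1-only letter test (NOT the actual CWB code):
   ASCII letters plus the accented letters at 0xC0-0xFF, excluding
   the multiplication sign (0xD7) and the division sign (0xF7). */
int latin1_is_letter(unsigned char c)
{
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
        return 1;
    return c >= 0xC0 && c != 0xD7 && c != 0xF7;
}

/* A token is "regular" if every BYTE is a letter, a digit or a hyphen.
   Working byte by byte is only valid for single-byte encodings. */
int latin1_is_regular(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    for (; *p != '\0'; p++) {
        if (!(latin1_is_letter(*p) || (*p >= '0' && *p <= '9') || *p == '-'))
            return 0;
    }
    return 1;
}

/* The same word, "cafe" with an acute accent on the e, in two encodings. */
void demo_latin1_check(void)
{
    /* Latin1: the accented e is the single byte 0xE9, accepted. */
    assert(latin1_is_regular("caf\xE9"));
    /* UTF-8: the accented e is 0xC3 0xA9; the byte 0xA9 is the copyright
       sign in Latin1, so the whole token is wrongly rejected. */
    assert(!latin1_is_regular("caf\xC3\xA9"));
}
```

Calling demo_latin1_check() from a main() runs both cases; pure ASCII tokens such as "well-known" are unaffected, which is why the problem only shows up on corpora with non-ASCII material.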

The functions at fault are is_regular() and is_letter() in cwb-scan-corpus.c. It would probably be better to implement this as CL functionality in special-chars.c (which can call GLib if necessary).
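One possible shape for an encoding-aware check on UTF-8 data is to decode each character to a code point first and classify the code point rather than the byte. The sketch below is an illustration only: utf8_decode(), cp_is_letter() and utf8_is_regular() are hypothetical names, not CL functions, and the letter test is a toy that covers only ASCII and the Latin-1 Supplement block. A real implementation would need a full Unicode property lookup, which is presumably where GLib would come in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence starting at s. On success, store the code
   point in *cp and return the number of bytes consumed (1-4). Return 0
   for malformed input: stray continuation bytes, truncated sequences,
   overlong forms, surrogates, or values above U+10FFFF. */
size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    size_t len, i;
    uint32_t c, min;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; min = 0x10000; }
    else return 0;

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)      /* also stops at a premature NUL */
            return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return len;
}

/* Toy letter test on CODE POINTS (assumption: ASCII and Latin-1
   Supplement only; everything else is treated as a non-letter). */
int cp_is_letter(uint32_t cp)
{
    if ((cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z'))
        return 1;
    return cp >= 0xC0 && cp <= 0xFF && cp != 0xD7 && cp != 0xF7;
}

/* A UTF-8 token is "regular" if it is well-formed and every decoded
   character is a letter, a digit or a hyphen. */
int utf8_is_regular(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    while (*p != '\0') {
        uint32_t cp;
        size_t n = utf8_decode(p, &cp);

        if (n == 0)
            return 0;
        if (!(cp_is_letter(cp) || (cp >= '0' && cp <= '9') || cp == '-'))
            return 0;
        p += n;
    }
    return 1;
}
```

With this approach, utf8_is_regular("caf\xC3\xA9") accepts the UTF-8 form of the accented word, while a Latin1 byte string such as "caf\xE9" is rejected as malformed UTF-8 rather than silently misclassified. The caller would still have to pick the right check based on the corpus's declared encoding.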

In the short term, the -C behaviour should NOT be relied on except for Latin1 corpora.

----------------------------------------------------------------------

>Comment By: Andrew Hardie (andrewhardie)
Date: 2012-01-17 14:13

Message:
Fixed in v3.4.2: is_regular() now uses the CL regex engine.

Please test for correct behaviour!

----------------------------------------------------------------------
