[CWB] [ cwb-Bugs-2917570 ] CQPweb: inappropriate collation for German & other languages

SourceForge.net noreply at sourceforge.net
Sat Dec 19 11:43:41 CET 2009


Bugs item #2917570, was opened at 2009-12-19 11:43
Message generated for change (Tracker Item Submitted) made by schtepf
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=722303&aid=2917570&group_id=131809

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: CQPweb
Group: None
Status: Open
Resolution: None
Priority: 9
Private: No
Submitted By: Stefan Evert (schtepf)
Assigned to: Andrew Hardie (andrewhardie)
Summary: CQPweb: inappropriate collation for German & other languages

Initial Comment:
CQPweb creates all MySQL tables with the collation utf8_general_ci. This is inappropriate for German, and presumably for many other languages, because it ignores accents and other diacritics in addition to folding case. The result is bogus entries in collocation tables, frequency lists and frequency breakdowns.

Examples: The German verbs "fallen" and "fällen" are collapsed into a single entry whose label is chosen arbitrarily, which may mislead German users into thinking that the frequent verb "fallen" does not occur in a corpus. Verbs and nouns are also often collapsed, e.g. "treffen" and "Treffen".
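
The collapsing effect can be reproduced outside MySQL with a rough sketch. The function fold_general_ci below is a hypothetical stand-in that strips combining marks and lowercases; it only approximates what an accent- and case-insensitive collation does and is not MySQL's actual algorithm. The token list is made up for illustration.

```python
import unicodedata
from collections import Counter


def fold_general_ci(s: str) -> str:
    """Approximate an accent- and case-insensitive comparison key.

    Assumption: this mimics the *effect* of utf8_general_ci on the
    examples in this report; it is not MySQL's real collation code.
    """
    decomposed = unicodedata.normalize("NFD", s)
    # Drop combining marks such as the diaeresis in "ä" (a + U+0308).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


# Made-up token sample, not taken from any real corpus.
tokens = ["fallen", "fällen", "fallen", "treffen", "Treffen"]

# Binary-style grouping: every distinct string keeps its own entry.
binary_counts = Counter(tokens)

# Folded grouping: accents and case are ignored, so entries merge.
folded_counts = Counter(fold_general_ci(t) for t in tokens)

print(dict(binary_counts))
print(dict(folded_counts))
```

With binary grouping the four word forms stay separate; with the folded key, "fällen" is counted under "fallen" and "Treffen" under "treffen".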

For such languages, there should be a per-corpus option to use binary collation (utf8_bin) in all MySQL tables, so that collocations and frequency lists are sensible. As things stand, these features are unusable for German corpora.
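
One way such a per-corpus option might be wired up is sketched below. This is only an illustration: the settings dictionary, the corpus names and the "freq_" table prefix are hypothetical and do not reflect CQPweb's actual schema or configuration. The generated statement uses MySQL's ALTER TABLE ... CONVERT TO CHARACTER SET ... COLLATE syntax.

```python
import re


def collation_statement(table: str, binary: bool) -> str:
    """Build an ALTER TABLE statement selecting the collation for one table.

    Assumption: table names are plain identifiers; anything else is
    rejected rather than quoted, to keep the sketch simple and safe.
    """
    if not re.fullmatch(r"[A-Za-z0-9_]+", table):
        raise ValueError(f"unexpected table name: {table!r}")
    collation = "utf8_bin" if binary else "utf8_general_ci"
    return (
        f"ALTER TABLE `{table}` "
        f"CONVERT TO CHARACTER SET utf8 COLLATE {collation};"
    )


# Hypothetical per-corpus setting: True means "use binary collation".
corpus_settings = {"german_corpus": True, "english_corpus": False}

# Hypothetical naming scheme: one frequency table per corpus.
statements = [
    collation_statement(f"freq_{name}", binary)
    for name, binary in corpus_settings.items()
]

for stmt in statements:
    print(stmt)
```

Converting existing tables this way changes how their string columns compare and sort, so any cached frequency or collocation data built under the old collation would presumably need to be regenerated.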

(Note: the most appropriate collation for German seems to be latin1_german2_ci, but this would require multi-charset support throughout CQPweb.)

----------------------------------------------------------------------
