[CWB] problems with Cqpweb and frequency lists

Hardie, Andrew a.hardie at lancaster.ac.uk
Sat Jun 20 15:13:23 CEST 2015


Hi Stefania,

This is a known problem. It arises because the available MySQL collations, which decide which strings count as “equal” (and are therefore merged) when sorting and grouping, do not match the case/diacritic folding in CWB.

There are two choices made possible by the available collations: the behaviour you have currently, in which all accents and case distinctions are ignored when collating; OR, a collation which doesn’t merge anything, i.e. it treats accented characters as distinct, but also treats case distinctions as significant.
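The difference between the two behaviours can be illustrated with a minimal Python sketch. This is not CQPweb or MySQL code: the function names are hypothetical, the folding rule (strip combining marks, then casefold) is only a rough stand-in for an accent- and case-insensitive collation, and the choice of the first form seen as the group label is an assumption made for the illustration.

```python
import unicodedata
from collections import Counter

def fold_accents_and_case(token):
    """Rough stand-in for an accent- and case-insensitive collation."""
    decomposed = unicodedata.normalize("NFD", token)
    # Drop combining marks (accents), then ignore case.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

def frequency_list(tokens, key=lambda t: t):
    """Count tokens, merging all forms that `key` maps to the same value."""
    counts = Counter()
    label = {}
    for tok in tokens:
        k = key(tok)
        label.setdefault(k, tok)  # assumption: first form seen labels the group
        counts[k] += 1
    return {label[k]: n for k, n in counts.items()}

tokens = ["è", "e", "E", "è", "e", "perché"]

# Mode 1: accents and case ignored, so "è", "e" and "E" collapse into one row.
insensitive = frequency_list(tokens, key=fold_accents_and_case)

# Mode 2: nothing merged, so "e" and "E" are also kept apart.
sensitive = frequency_list(tokens)

print(insensitive)
print(sensitive)
```

In the first mode the unaccented "e" disappears as a separate entry, which is the symptom described in the original message; in the second mode every distinct form gets its own row, including forms that differ only in case.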

You can engage the latter mode under the “Corpus settings” option in the main screen menu. If you set “Corpus requires case-sensitive collation for string comparison and searches” to “yes”, you will switch the collation over to the case/diacritic-sensitive mode. Please take careful note of the warning about the need to rebuild all frequency tables.

I have a long-term idea for a solution to this, but unfortunately (a) I don’t know yet whether it will work, and (b) even if it does, it will take a long time to implement.

The solution in question involves MySQL custom collations, and the big open question is the impact they have on performance. If anyone has experience with custom collations, your input here would be welcome.

best

Andrew.

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Stefania Spina
Sent: 20 June 2015 10:21
To: cwb
Subject: [CWB] problems with Cqpweb and frequency lists

Hello,
I have an Italian corpus indexed in Cqpweb (v3.1.13); the corpus is encoded in iso-8859-1.
When I use frequency lists, it seems that accented and non-accented characters are not properly distinguished. For example, in the word frequency list, the word "è" combines the frequency values of "è" and "e", and the unaccented word "e" is not included in the frequency list.
This does not happen in the queries, where accented and unaccented characters are perfectly distinguished.
Is there a way I can solve this problem?
Thank you for your help,
Stefania

--
Stefania Spina
Università per Stranieri di Perugia
Dipartimento di Scienze Umane e Sociali
stefania.spina at unistrapg.it
https://unistrapg.academia.edu/StefaniaSpina