[CWB] problems with Cqpweb and frequency lists

Stefania Spina stefania.spina at unistrapg.it
Sun Jun 21 13:10:27 CEST 2015


Thank you Andrew.
Or, as a third choice, prevent users from viewing frequency lists. Is this
possible in CQPweb? I mean, setting privileges so that users can run queries
and use all the CQP functions, but cannot view frequency lists?
Thank you again,
Stefania

2015-06-20 15:14 GMT+02:00 Hardie, Andrew <a.hardie at lancaster.ac.uk>:

>  Hi Stefania,
>
>
>
> This is a known problem. It arises because the available MySQL collations,
> which decide which strings are sorted together and merged as “equal” even
> when they are not identical, do not match the case/diacritic folding in CWB.
>
>
>
> There are two choices made possible by the available collations: the
> behaviour you have currently, in which all accents and case distinctions
> are ignored when collating; OR, a collation which doesn’t merge *anything*,
> i.e. it treats accented characters as distinct, but also treats case
> distinctions as significant.
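
To illustrate the two behaviours described above, here is a minimal Python sketch. It is only a simulation under stated assumptions: `fold` and `frequency_list` are hypothetical helpers, Unicode decomposition stands in for an accent- and case-insensitive collation, and none of this is MySQL's actual collation algorithm or CQPweb's code.

```python
import unicodedata
from collections import Counter

E_GRAVE = "\u00e8"  # the Italian word "è" (precomposed e with grave accent)


def fold(token: str) -> str:
    """Rough stand-in for a case- and accent-insensitive collation key."""
    decomposed = unicodedata.normalize("NFD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def frequency_list(tokens, key=lambda t: t):
    """Count tokens, merging all tokens whose keys compare equal.

    Assumption: the first surface form seen for a key labels the merged row.
    """
    counts = Counter()
    labels = {}
    for tok in tokens:
        k = key(tok)
        labels.setdefault(k, tok)
        counts[k] += 1
    return {labels[k]: n for k, n in counts.items()}


tokens = [E_GRAVE, "e", "E", E_GRAVE, "e", "caff" + E_GRAVE]

# Insensitive mode: "è", "e" and "E" collapse into a single row.
insensitive = frequency_list(tokens, key=fold)

# Sensitive mode: nothing is merged, so case differences also split rows.
sensitive = frequency_list(tokens)

print(insensitive)  # {'è': 5, 'caffè': 1}
print(sensitive)    # {'è': 2, 'e': 2, 'E': 1, 'caffè': 1}
```

In the insensitive run the row labelled "è" absorbs the counts for "e", which matches the symptom reported in the original message; the sensitive run keeps them apart but also separates "e" from "E".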
>
>
>
> You can engage the latter mode under the “Corpus settings” option in the
> main screen menu. If you set “Corpus requires case-sensitive collation
> for string comparison and searches” to “yes”, you will switch the
> collation over to the case/diacritic-sensitive mode. Please note well the
> warning about the need to rebuild all frequency tables.
>
>
>
> I have a long-term idea for a solution to this but unfortunately (a) I
> don’t know yet whether it will work, and (b) even if it does, it will take a
> long time to implement.
>
>
>
> The solution in question involves MySQL custom collations and the big open
> question is the impact they have on performance. If anyone has experience
> with custom collations, your input here would be welcome.
>
>
>
> best
>
>
>
> Andrew.
>
>
>
> *From:* cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] *On
> Behalf Of *Stefania Spina
> *Sent:* 20 June 2015 10:21
> *To:* cwb
> *Subject:* [CWB] problems with Cqpweb and frequency lists
>
>
>
> Hello,
>
> I have an Italian corpus indexed in Cqpweb (v3.1.13); the corpus is
> encoded in iso-8859-1.
>
> When I use frequency lists, it seems that accented and non-accented
> characters are not properly distinguished. For example, in the word
> frequency list, the word "è" combines the frequency values of "è" and "e",
> and the unaccented word "e" is not included in the frequency list.
>
> This does not happen in queries, where accented and non-accented
> characters are correctly distinguished.
>
> Is there a way I can solve this problem?
>
> Thank you for your help,
>
> Stefania
>
>
>
> --
>
> Stefania Spina
> Università per Stranieri di Perugia
> Dipartimento di Scienze Umane e Sociali
> stefania.spina at unistrapg.it
> https://unistrapg.academia.edu/StefaniaSpina
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
>
>


-- 
Stefania Spina
Università per Stranieri di Perugia
Dipartimento di Scienze Umane e Sociali
stefania.spina at unistrapg.it
https://unistrapg.academia.edu/StefaniaSpina