[CWB] Strange issue with character encoding (?) in frequency lists

Scott Sadowsky ssadowsky at gmail.com
Tue May 28 06:16:22 CEST 2019


On Mon, May 27, 2019 at 3:37 PM Hardie, Andrew <a.hardie at lancaster.ac.uk>
wrote:

Hi Andrew,

> There is a known issue in frequency tables using CI collations, which is
> that although all diacritics are folded together, the version that
> *appears* is the first version that is seen. Using English, if you have
> [...] then what will appear is naive with no *diaeresis*. (The same happens
> with case.)
>
>
>
> This issue would seem to be behind the cases you report like “mi” and
> “mí”. The frequency list rolls them together, so it is just luck of the
> draw which shows up. It’s expected behaviour.
>
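
To make the behaviour described in the quoted paragraph concrete, here is a minimal Python sketch. It is only an approximation under stated assumptions: the `fold` and `frequency_list` names are hypothetical, and stripping combining marks with `unicodedata` stands in for a case- and accent-insensitive collation; it is not MySQL's actual collation algorithm or CQPweb's code.

```python
import unicodedata

def fold(s):
    """Rough stand-in (assumption) for a case- and accent-insensitive key."""
    decomposed = unicodedata.normalize("NFD", s)
    # Drop combining marks (accents, diaereses), then ignore case.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

def frequency_list(tokens):
    """Count tokens by folded key; display the first variant seen per key."""
    counts = {}
    shown = {}
    for tok in tokens:
        key = fold(tok)
        counts[key] = counts.get(key, 0) + 1
        shown.setdefault(key, tok)  # only the first variant is remembered
    return [(shown[k], counts[k]) for k in counts]

# Same tokens, different order: the displayed form changes, the count does not.
print(frequency_list(["mí", "mi", "mi"]))
print(frequency_list(["mi", "mí", "mi"]))
```

In this sketch the row label depends only on which variant happens to be encountered first, which would be consistent with the full corpus and an identical subcorpus showing different spurious accents but the same frequencies.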

I understand that behavior occurring when the various forms are treated as
equivalents (e.g. "naive" and "naïve" with %cd). But in my case, "mi" and
"mí" actually correspond to different lemmas ("mi" and "mí", respectively).
Would this *still* be the expected behavior in this case?

If so, does this mean that lemmas that differ only by diacritical marks are
treated as one single lemma by one or more parts of CWB/CQPweb?


> I think the best thing would be, for one word, to use CQP queries to see
> exactly what is in the index, i.e. run the following:
>
>
>
> [lemma="que"]
>
> [lemma="qúe"]
>
> [lemma="qùe"]
>

Okay. Here are the results:

   1. Your query “[lemma="que"]” returned 314,658 matches in 4,701
   different texts
   2. Your query “[lemma="qúe"]” returned 4 matches in 4 different texts
   3. Your query “[lemma="qué"]” returned 12,924 matches in 2,723 different
   texts
   4. (“[lemma="qùe"]” has 0 results)

1 and 3 are both actual words in Spanish, so both forms are expected. 2
shows that I've got four of these typos in the corpus, which isn't
unexpected since this is a speech corpus transcribed by fallible humans. 4
wasn't actually a word I saw in frequency lists, so 0 hits would also be
expected here.
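
As a rough cross-check, the following Python sketch works out what a diacritic-insensitive table would report if it merged the lemmas above into one row. The counts are the CQP results listed above; the `merge_counts` helper and the accent-stripping rule are assumptions made for illustration, not what MySQL or CQPweb actually runs.

```python
import unicodedata
from collections import defaultdict

# Match counts reported by the CQP queries above.
cqp_counts = {"que": 314658, "qúe": 4, "qué": 12924, "qùe": 0}

def merge_counts(counts):
    """Sum counts of lemmas that are identical once accents are removed."""
    merged = defaultdict(int)
    for lemma, n in counts.items():
        decomposed = unicodedata.normalize("NFD", lemma)
        key = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        merged[key] += n
    return dict(merged)

print(merge_counts(cqp_counts))  # {'que': 327586}
```

If the frequency list shows a single "que"-like row with a count near 327,586 rather than separate rows of 314,658 and 12,924, that would suggest the folding happens in the frequency table and not in the CWB index.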


> That will narrow down the problem, i.e. determine whether it is a CWB issue or
> a MySQL issue.
>

Hope this provides enough info to figure that out. If not, just let me know.

Best wishes,
Scott



>
>
> *From:* cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> *On
> Behalf Of *Scott Sadowsky
> *Sent:* 25 May 2019 21:39
> *To:* Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it
> >
> *Cc:* Open source development of the Corpus WorkBench <
> CWB at liste.sslmit.unibo.it>
> *Subject:* Re: [CWB] Strange issue with character encoding (?) in
> frequency lists
>
>
>
> On Sat, May 25, 2019 at 2:20 PM Hardie, Andrew <a.hardie at lancaster.ac.uk>
> wrote:
>
>
>
> Hi Andrew,
>
>
>
> One possibility is that the wrong charset/collation is being activated for
> the frequency tables. Could you check this?
>
> If you run  show create table freq_corpus_*nameofyrcorpus*_word;  at the mysql
> command prompt, then the character set / collation should be stated either
> for the table as a whole, or for the “item” column.
>
>
>
> That shows "ENGINE=InnoDB DEFAULT CHARSET=utf8". All my source texts are
> UTF8, and the database is created as that too, by the way.
>
>
>
> Cheers,
>
> Scott
>
>
>
>
>
>
>
> *From:* cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> *On
> Behalf Of *Scott Sadowsky
> *Sent:* 25 May 2019 13:45
> *To:* Open source development of the Corpus WorkBench <
> CWB at liste.sslmit.unibo.it>
> *Subject:* [CWB] Strange issue with character encoding (?) in frequency
> lists
>
>
>
> I've run into a strange issue that might have to do with character
> encoding (or it might not).
>
>
>
> When I go to Corpus Queries > Frequency lists, select my full corpus,
> choose to view a list based on lemmas, and then hit Show Frequency List, I
> get a list of lemmas in which quite a few have phantom accent marks and
> other diacriticals, e.g. "sì", "còmo", "èn", "sú", "ïgual", "cúando"
> (obviously, this is a Spanish corpus).
>
>
>
> However, when I click on the links for these words and go to the
> concordance, not a single word has these marks. When I further click
> through and go to the source texts, the marks also aren't there.
>
>
>
> I've grepped through my tagger's dictionary files (FreeLing), and none of
> these forms exist as lemmas or lexemes. I've also grepped through the *.vrt
> files that the corpus was compiled from, and none of these forms are
> present.
>
>
>
> I've run into an additional strange issue that is probably related. When I
> make a subcorpus that is an exact copy of the source corpus, the same
> problem occurs, but most of the spurious accents and such are *different* (e.g.
> "qúe", "nó", "á", "sì", "én").
>
>
>
> I'm attaching an edited screenshot that shows the top of the frequency
> list based on the full corpus on the left and the subcorpus that contains
> the full corpus on the right, with errors in red boxes.
>
>
>
> [image: Lemmas.png]
>
> Of course, in some cases both of the highlighted forms exist in Spanish
> (e.g. #28, "mi" and "mí"), but in spite of being different they have the
> same frequencies in the corpus and the subcorpus, which further suggests
> that it's not the underlying data that's causing this.
>
>
>
> Best,
>
> Scott
>
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb
>

-- 
Dr. Scott Sadowsky
Profesor Asistente de Lingüística
Pontificia Universidad Católica de Chile

ssadowsky gmail com
scsadowsky uc cl
http://sadowsky.cl/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.png
Type: image/png
Size: 102909 bytes
Desc: not available
URL: <http://liste.sslmit.unibo.it/pipermail/cwb/attachments/20190528/79a2dfb2/attachment-0001.png>