[CWB] Strange issue with character encoding (?) in frequency lists
Scott Sadowsky
ssadowsky at gmail.com
Sat May 25 14:45:09 CEST 2019
I've run into a strange issue that might have to do with character encoding
(or it might not).
When I go to Corpus Queries > Frequency lists, select my full corpus,
choose to view a list based on lemmas, and then hit Show Frequency List, I
get a list of lemmas in which quite a few have phantom accent marks and
other diacriticals, e.g. "sì", "còmo", "èn", "sú", "ïgual", "cúando"
(obviously, this is a Spanish corpus).
However, when I click on the links for these words and go to the
concordance, not a single word has these marks. When I further click
through and go to the source texts, the marks also aren't there.
I've grepped through my tagger's dictionary files (FreeLing), and none of
these forms exist as lemmas or lexemes. I've also grepped through the *.vrt
files that the corpus was compiled from, and none of these forms are
present.
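The check on the *.vrt files can be sketched roughly as follows. This is only an illustration, not the exact command that was run: the function name `find_suspects` and the assumption that the lemma is the third tab-separated column (`lemma_col=2`) are hypothetical and depend on how the corpus was encoded. The forms are normalized to NFC before comparison, because an accented letter stored as base letter plus combining mark would otherwise be missed by a plain string match.

```python
import unicodedata

# Spurious forms reported in the frequency list.
SUSPECT_FORMS = {"sì", "còmo", "èn", "sú", "ïgual", "cúando"}


def nfc(text):
    """Return text in Unicode NFC (precomposed) form."""
    return unicodedata.normalize("NFC", text)


def find_suspects(lines, suspects=SUSPECT_FORMS, lemma_col=2):
    """Return (line number, lemma) for every token line whose lemma is a suspect form.

    lemma_col is an assumption: it is the 0-based index of the lemma
    attribute in the tab-separated token lines.
    """
    wanted = {nfc(s) for s in suspects}
    hits = []
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Skip empty lines and XML structure lines such as <text> or </s>.
        # Caveat: a token line that itself starts with "<" is skipped too.
        if not line or line.startswith("<"):
            continue
        cols = line.split("\t")
        if lemma_col < len(cols) and nfc(cols[lemma_col]) in wanted:
            hits.append((lineno, cols[lemma_col]))
    return hits
```

For a real file, the function can be given an open file handle, e.g. `with open(path, encoding="utf-8") as f: print(find_suspects(f))`, where `path` is one of the *.vrt files. An empty result is consistent with what grep reported.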
I've run into an additional strange issue that is probably related. When I
make a subcorpus that is an exact copy of the source corpus, the same
problem occurs, but most of the spurious accents and such are *different* (e.g.
"qúe", "nó", "á", "sì", "én").
I'm attaching an edited screenshot that shows the top of the frequency list
based on the full corpus on the left and the subcorpus that contains the
full corpus on the right, with errors in red boxes.
[image: Lemmas.png]
Of course, in some cases both of the highlighted forms exist in Spanish
(e.g. #28, "mi" and "mí"), but in spite of being different they have the
same frequencies in the corpus and the subcorpus, which further suggests
that it's not the underlying data that's causing this.
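The comparison between the two lists can be made mechanical with a small sketch like the one below. It takes two frequency lists as (lemma, frequency) pairs in rank order and reports the ranks where the forms differ only in diacritics while the frequency is identical. The function names and the sample numbers in the usage note are hypothetical, chosen only for illustration; they are not taken from the actual corpus.

```python
import unicodedata


def strip_marks(form):
    """Remove combining marks, so that "mí" and "mi" compare equal.

    Caveat: this also turns "ñ" into "n", which conflates distinct
    Spanish words; it is used here only to spot suspicious pairs.
    """
    decomposed = unicodedata.normalize("NFD", form)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def diacritic_only_differences(list_a, list_b):
    """Compare two rank-ordered lists of (lemma, frequency) pairs.

    Returns (rank, form_a, form_b, frequency) for each rank where the
    two forms differ, the frequencies match, and the forms are equal
    once diacritics are stripped.
    """
    diffs = []
    for rank, (entry_a, entry_b) in enumerate(zip(list_a, list_b), start=1):
        form_a = unicodedata.normalize("NFC", entry_a[0])
        form_b = unicodedata.normalize("NFC", entry_b[0])
        same_freq = entry_a[1] == entry_b[1]
        if form_a != form_b and same_freq and strip_marks(form_a) == strip_marks(form_b):
            diffs.append((rank, form_a, form_b, entry_a[1]))
    return diffs
```

With made-up input such as `[("que", 100), ("mi", 50)]` for the full corpus and `[("qúe", 100), ("mí", 50)]` for the subcorpus, both ranks would be reported, which is the pattern visible in the screenshot.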
Best,
Scott
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Lemmas.png
Type: image/png
Size: 48821 bytes
Desc: not available
URL: <http://liste.sslmit.unibo.it/pipermail/cwb/attachments/20190525/450128fe/attachment-0001.png>