[CWB] Difference in token number between CQP and CQPweb
Hannah Kermes
h.kermes at mx.uni-saarland.de
Fri Feb 14 16:26:09 CET 2014
Hi Andrew,
It seems that for 10 (out of 310) texts, the word count is wrong.
I simply looked for all tokens ("[]") and made a frequency distribution
across texts.
The result was:
Your query "[]" returned 2,076,963 matches in 310 different texts (in
1,961,752 words [310 texts]; frequency: 1058728.63 instances per million
words).
So all tokens are basically there.
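
The normalised figure already hints at the problem: a rate above 1,000,000 per million words is only possible if the query matches more tokens than the word count used as the denominator. A minimal sketch of that arithmetic follows, assuming the figure is simply hits divided by the word count, scaled to one million. That assumption matches the number reported above but is not taken from the CQPweb source, and the variable names are purely illustrative.

```python
# Figures copied from the CQPweb result line quoted above.
hits = 2_076_963    # matches for the query "[]"
words = 1_961_752   # corpus size as counted by CQPweb

# Assumed formula: instances per million words = hits / words * 1,000,000
per_million = hits / words * 1_000_000

print(f"{per_million:.2f} instances per million words")
print("hits exceed word count:", hits > words)
```

Under this assumption the result agrees with the 1058728.63 shown by CQPweb.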
The frequency distribution showed that in 10 texts the word count (second
column) is lower than the number of hits (third column, which shows the
correct word count).
So all tokens are basically assigned to the correct texts, but the word
count somehow misses some of them.
Hope this helps with debugging
Best
Hannah
Text              Word count   Hits     Per million words
1696_Tryon        4,446        15,937   3584570.4
1563_Gale         12,082       36,168   2993544.12
1539_Moulton      4,446        9,167    2061853.35
1698_Colbatch     11,341       23,070   2034212.15
1700_Salmon       12,167       24,623   2023752.77
1612_Guillemeau   11,928       24,103   2020707.58
1596_Clowes       12,295       24,789   2016185.44
1652_Fioravanti   11,754       23,614   2009018.21
1652_Culpeper     12,014       23,682   1971200.27
1659_Culpeper     3,343        5,874    1757104.4

Links (replace <text ID> with the ID in the first column):
Text metadata:
<https://fedora.clarin-d.uni-saarland.de/cqpweb/EMEMC/textmeta.php?text=<text ID>&uT=y>
Hits in text:
<https://fedora.clarin-d.uni-saarland.de/cqpweb/EMEMC/concordance.php?qname=edjb6tznqc&newPostP=text&newPostP_textTargetId=<text ID>&uT=y>
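
As a cross-check on the figures above, the per-text shortfall (hits minus word count) can be summed and compared with the corpus-level gap between the two token counts. The following is a minimal sketch in Python with the numbers copied by hand from this message. The variable names are made up for illustration, and none of this is CQP or CQPweb code.

```python
# text ID -> (word count reported by CQPweb, hits for the query "[]")
# Values copied from the frequency distribution in this message.
texts = {
    "1696_Tryon": (4446, 15937),
    "1563_Gale": (12082, 36168),
    "1539_Moulton": (4446, 9167),
    "1698_Colbatch": (11341, 23070),
    "1700_Salmon": (12167, 24623),
    "1612_Guillemeau": (11928, 24103),
    "1596_Clowes": (12295, 24789),
    "1652_Fioravanti": (11754, 23614),
    "1652_Culpeper": (12014, 23682),
    "1659_Culpeper": (3343, 5874),
}

cqp_tokens = 2076963     # matches for "[]", i.e. the CQP token count
cqpweb_tokens = 1961752  # corpus size according to CQPweb

# Tokens missing from the word count of each affected text.
shortfall = {text_id: hits - words for text_id, (words, hits) in texts.items()}
total_shortfall = sum(shortfall.values())
corpus_gap = cqp_tokens - cqpweb_tokens

for text_id, missing in shortfall.items():
    print(f"{text_id:<16} {missing:>8,}")
print(f"{'sum':<16} {total_shortfall:>8,}")
print(f"{'corpus gap':<16} {corpus_gap:>8,}")
```

Both totals come out at 115,211, so the undercount in these 10 texts accounts for the entire difference between the CQP and CQPweb token numbers.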
On 14.02.2014 at 15:04, Hardie, Andrew wrote:
> Hannah & Stefan,
>
> Can you tell me (a) which function you used to get the CQP word count (b) where you got the CQPweb wordcount (corpus metadata, or concordance infobar)?
>
> The most obvious explanation is that there are tokens outside <text> elements, since CQPweb calculates the size of the corpus by summing the tokens in each individual text. This in turn is based on calculating cpos differences.
>
> But I would like to investigate on my own server first.
>
> best
>
> Andrew.
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Stefan Evert
> Sent: 14 February 2014 12:49
> To: Open source development of the Corpus WorkBench
> Subject: Re: [CWB] Difference in token number between CQP and CQPweb
>
>
> On 14 Feb 2014, at 12:07, Hannah Kermes <h.kermes at mx.uni-saarland.de> wrote:
>
>> I just realized a difference in the token numbers between CQP and CQPweb.
>> The encoded corpus in CQPweb is a copy of the CQP corpus. The encoding has been performed with CQP on the command line and has been installed in CQPweb as an encoded corpus.
>>
>> Token numbers: 1,961,752 (CQPweb); 2,076,963 (CQP)
>>
>> The difference is also present if you look at subcorpora.
> Interesting. I see the same discrepancy on my local copy of CQPweb (v3.0.7) for _one_ of the corpora I installed. Everything else is fine.
>
> Andrew, is it possible that this may be caused by some particular corpus settings, e.g. if it's not in UTF-8 encoding?
>
> Otherwise, the only explanation I can think of is that you may have re-encoded the CWB corpus, changing its size, and forgot to re-install it in CQPweb (so CQPweb still has the old frequency information etc. and all subcorpora and distributions will be totally messed up)?
>
> Cheers,
> Stefan
>
>
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
--
Dr. Hannah Kermes
Dept. of Applied Linguistics, Interpreting and Translation (FR 4.6)
Universität des Saarlandes
Campus, Building A2.2, Room 1.07
D-66123 Saarbrücken
phone: +49-(0)681-302-70077