[CWB] Difference in token number between CQP and CQPweb

Hannah Kermes h.kermes at mx.uni-saarland.de
Fri Feb 14 16:26:09 CET 2014


Hi Andrew,

It seems that the word count is wrong for 10 of the 310 texts.
I searched for all tokens ("[]") and ran a frequency distribution 
across texts.
The result was:
Your query "[]" returned 2,076,963 matches in 310 different texts (in 
1,961,752 words [310 texts]; frequency: 1058728.63 instances per million 
words).
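
For reference, the per-million figure in that message appears to be plain hits divided by word count, scaled to a million. That formula is an assumption about how CQPweb derives it, but it reproduces the reported number. A minimal sketch, using only the two totals quoted above:

```python
# Totals reported by CQPweb for the query "[]"
hits = 2_076_963   # matches returned by the query (all tokens)
words = 1_961_752  # corpus size according to CQPweb's word count

# Assumed formula: hits per million words of the stored word count
per_million = hits / words * 1_000_000
print(f"{per_million:.2f} instances per million words")

# Tokens that exist in the corpus but are missing from the word count
print(f"{hits - words:,} tokens not covered by the word count")
```

A value above 1,000,000 per million for "[]" is only possible if the stored word count is smaller than the real number of tokens.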

So all tokens are there.
The frequency distribution shows that in 10 texts the word count (second 
column) is lower than the number of hits (third column, which gives the 
correct word count).

So the tokens are assigned to the correct texts, but the word count 
stored for those 10 texts somehow leaves some of them out.
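
As a cross-check, here is a small sketch that takes the word counts and hit counts for the 10 texts listed at the end of this message and sums the shortfall per text. The variable names are just for illustration; the numbers are copied from the frequency distribution.

```python
# (text ID, CQPweb word count, hits for "[]") from the frequency distribution
rows = [
    ("1696_Tryon", 4446, 15937),
    ("1563_Gale", 12082, 36168),
    ("1539_Moulton", 4446, 9167),
    ("1698_Colbatch", 11341, 23070),
    ("1700_Salmon", 12167, 24623),
    ("1612_Guillemeau", 11928, 24103),
    ("1596_Clowes", 12295, 24789),
    ("1652_Fioravanti", 11754, 23614),
    ("1652_Culpeper", 12014, 23682),
    ("1659_Culpeper", 3343, 5874),
]

# Tokens missing from the stored word count, per text
shortfalls = {text: hits - words for text, words, hits in rows}
total_shortfall = sum(shortfalls.values())

# Difference between the CQP and CQPweb corpus sizes
corpus_gap = 2_076_963 - 1_961_752

for text, missing in shortfalls.items():
    print(f"{text:<16} {missing:>7,}")
print(f"total shortfall: {total_shortfall:,}  corpus gap: {corpus_gap:,}")
```

The shortfalls of these 10 texts add up to 115,211, which is exactly the difference between the two corpus sizes, so the remaining 300 texts appear to be counted correctly.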

Hope this helps with debugging
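
To make the mechanism Andrew describes in the quoted message below more concrete (corpus size as the sum of per-text sizes, each derived from cpos differences), here is a rough sketch. It is not CQPweb's actual code: the function name, the toy regions, and the assumption that region end positions are inclusive are all illustrative.

```python
from collections import defaultdict

def text_sizes(regions):
    """Sum region lengths per text ID.

    regions: iterable of (text_id, start_cpos, end_cpos), end inclusive (assumed).
    """
    sizes = defaultdict(int)
    for text_id, start, end in regions:
        sizes[text_id] += end - start + 1
    return dict(sizes)

# Hypothetical toy corpus of 20 tokens (cpos 0..19)
corpus_size = 20
regions = [
    ("text_a", 0, 7),
    ("text_b", 10, 19),  # cpos 8 and 9 lie outside any <text> region
]

sizes = text_sizes(regions)
summed = sum(sizes.values())
print(sizes)
print(f"sum of text sizes: {summed}, tokens outside <text>: {corpus_size - summed}")
```

If the summed text sizes fall short of the corpus size, the gap is the number of tokens that such a per-text calculation never sees.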

Best
Hannah

Text               Word count     Hits   Hits per million words
1696_Tryon              4,446   15,937                3584570.4
1563_Gale              12,082   36,168                2993544.12
1539_Moulton            4,446    9,167                2061853.35
1698_Colbatch          11,341   23,070                2034212.15
1700_Salmon            12,167   24,623                2023752.77
1612_Guillemeau        11,928   24,103                2020707.58
1596_Clowes            12,295   24,789                2016185.44
1652_Fioravanti        11,754   23,614                2009018.21
1652_Culpeper          12,014   23,682                1971200.27
1659_Culpeper           3,343    5,874                1757104.4



Am 14.02.2014 15:04, schrieb Hardie, Andrew:
> Hannah & Stefan,
>
> Can you tell me (a) which function you used to get the CQP word count, and (b) where you got the CQPweb word count (corpus metadata, or concordance infobar)?
>
> The most obvious explanation is that there are tokens outside <text> elements, since CQPweb calculates the size of the corpus by summing the tokens in each individual text. This in turn is based on calculating cpos differences.
>
> But I would like to investigate on my own server first.
>
> best
>
> Andrew.
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Stefan Evert
> Sent: 14 February 2014 12:49
> To: Open source development of the Corpus WorkBench
> Subject: Re: [CWB] Difference in token number between CQP and CQPweb
>
>
> On 14 Feb 2014, at 12:07, Hannah Kermes <h.kermes at mx.uni-saarland.de> wrote:
>
>> I just realized a difference in the token numbers between CQP and CQPweb.
>> The encoded corpus in CQPweb is a copy of the CQP corpus. The encoding has been performed with CQP on the command line and has been installed in CQPweb as an encoded corpus.
>>
>> Token numbers: 1,961,752 (CQPweb); 2,076,963 (CQP)
>>
>> The difference is also present if you look at subcorpora.
> Interesting. I see the same discrepancy on my local copy of CQPweb (v3.0.7) for _one_ of the corpora I installed.   Everything else is fine.
>
> Andrew, is it possible that this may be caused by some particular corpus settings, e.g. if it's not in UTF-8 encoding?
>
> Otherwise, the only explanation I can think of is that you may have re-encoded the CWB corpus, changing its size, and forgot to re-install it in CQPweb (so CQPweb still has the old frequency information etc. and all subcorpora and distributions will be totally messed up)?
>
> Cheers,
> Stefan
>
>
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb

-- 
Dr. Hannah Kermes
Dept. of Applied Linguistics, Interpreting and Translation (FR 4.6)
Universität des Saarlandes
Campus, Building A2.2, Room 1.07
D-66123 Saarbrücken
phone: +49-(0)681-302-70077
