[CWB] Difference in token number between CQP and CQPweb

Hannah Kermes h.kermes at mx.uni-saarland.de
Fri Feb 14 14:10:45 CET 2014


On 14.02.2014 13:49, Stefan Evert wrote:
> On 14 Feb 2014, at 12:07, Hannah Kermes <h.kermes at mx.uni-saarland.de> wrote:
>
>> I just realized a difference in the token numbers between CQP and CQPweb.
>> The encoded corpus in CQPweb is a copy of the CQP corpus. The encoding has been performed with CQP on the command line and has been installed in CQPweb as an encoded corpus.
>>
>> Token numbers: 1,961,752 (CQPweb); 2,076,963 (CQP)
>>
>> The difference is also present if you look at subcorpora.
> Interesting. I see the same discrepancy on my local copy of CQPweb (v3.0.7) for _one_ of the corpora I installed.   Everything else is fine.
I don't know whether this is the case for our other corpora as well. If 
it would help, I can check.
>
> Andrew, is it possible that this may be caused by some particular corpus settings, e.g. if it's not in UTF-8 encoding?
The corpus is in latin1.
>
> Otherwise, the only explanation I can think of is that you may have re-encoded the CWB corpus, changing its size, and forgot to re-install it in CQPweb (so CQPweb still has the old frequency information etc. and all subcorpora and distributions will be totally messed up)?
I double-checked that in advance (re-installed the whole corpus twice); 
the difference in token number remained.

By the way, another question: if I want to add further structural 
annotations, or have changed only one particular structural annotation, 
is it sufficient to add the respective CQP files and update the registry, 
or do I have to re-install the corpus from scratch (as I have done up to 
now)?

Thanks
Hannah
>
> Cheers,
> Stefan
>
>
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb

-- 
Dr. Hannah Kermes
Dept. of Applied Linguistics, Interpreting and Translation (FR 4.6)
Universität des Saarlandes
Campus, Building A2.2, Room 1.07
D-66123 Saarbrücken
phone: +49-(0)681-302-70077


