[CWB] Incorrect total words count in a Traditional Chinese corpus on CQPweb

Hardie, Andrew a.hardie at lancaster.ac.uk
Mon Jun 25 10:57:27 CEST 2018


OK, so the problem is other than what I thought it was.

When you get that blank page, is there a PHP error in the httpd log? If so, can you copy-paste it in a reply? Thanks.
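One way to look for such an error is to filter the web server's error log for PHP messages. The following is only a sketch: the function name `php_errors_in_log` is made up for illustration, and the log location depends on the installation (common defaults are /var/log/apache2/error.log on Debian-style systems and /var/log/httpd/error_log on Red Hat-style systems, neither of which is confirmed for this setup).

```shell
#!/bin/sh
# Hypothetical helper (not part of CQPweb): print the most recent
# PHP-related lines from a web server error log.
# Usage: php_errors_in_log LOGFILE [MAX_LINES]
php_errors_in_log() {
    log_file="$1"
    max_lines="${2:-20}"
    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 1
    fi
    # PHP messages are usually logged as "PHP Fatal error", "PHP Warning", etc.
    grep 'PHP ' "$log_file" | tail -n "$max_lines"
}
```

Reloading the blank page and then running, for example, `php_errors_in_log /var/log/apache2/error.log 5` (path assumed) should show whatever PHP logged for that request, if anything.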

best

Andrew.

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Hermann Lai
Sent: 25 June 2018 05:05
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: Re: [CWB] Incorrect total words count in a Traditional Chinese corpus on CQPweb

I am sorry. The new version does not help.

After I replaced the old version with the new version, I went to "Corpus frequency list controls" and tried "Update CWB text-position records". It redirects me to

http://localhost/cqpweb/canton1/execute.php?function=populate_corpus_cqp_positions&args=canton1&locationAfter=index.php%3FthisQ%3DmanageFreqLists%26uT%3Dy&uT=y

with a blank page.

However, "Recreate CWB frequency table" and "Recreate frequency tables" redirect me to "Corpus frequency list controls".

I tried reinstalling the corpus, but the blank page problem is still there.

Regards,
Lai

2018-06-21 21:10 GMT+08:00 Hardie, Andrew <a.hardie at lancaster.ac.uk>:
Hmm. I think this is a bug already fixed after 3.2.11.

The culprit is, if I recall correctly, the function “update_corpus_size”, to be found in the file metadata.inc.php.

Old version:


function update_corpus_size($corpus = NULL)
{
        $corpus = safe_specified_or_global_corpus($corpus);
        $result = do_mysql_query("select sum(words), count(*) from text_metadata_for_$corpus");
        list($ntok, $ntext) = mysql_fetch_row($result);
        do_mysql_query("update corpus_info set size_tokens = $ntok, size_texts = $ntext where corpus = '$corpus'");
}


New version:


function update_corpus_size($corpus = NULL)
{
        $corpus = safe_specified_or_global_corpus($corpus);

        /* the number of texts still comes from the text metadata table */
        $result = do_mysql_query("select count(*) from text_metadata_for_$corpus");
        list($ntext) = mysql_fetch_row($result);

        /* the token count is now read from the CWB index via CQP, not summed from the metadata table */
        $info = get_corpus_info($corpus);
        global $cqp;
        if (empty($cqp))
               connect_global_cqp();
        $cqp->set_corpus($info->cqp_name);
        $ntok = $cqp->get_corpus_tokens();

        do_mysql_query("update corpus_info set size_tokens = $ntok, size_texts = $ntext where corpus = '$corpus'");
}

Can I suggest that you replace the “old version” in your code with the “new version”, and redo frequency list setup?

That may well fix the issue. But if not, let me know…

best

Andrew.



From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Hermann Lai
Sent: 19 June 2018 20:10
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: Re: [CWB] Incorrect total words count in a Traditional Chinese corpus on CQPweb

No, I didn't get any messages when I used the frequency list controls.

I am using CQPwebinabox Esmeralda (CQPweb 3.2.11) and CWB 3.4.8 (checked by using "cqp -v").

Regards,
Lai

2018-06-19 23:06 GMT+08:00 Hardie, Andrew <a.hardie at lancaster.ac.uk>:
Did you get any odd messages when you ran the frequency-list setup on CQPweb?

If not – what version of the code do you have?

best

Andrew.

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Hermann Lai
Sent: 19 June 2018 11:32
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: Re: [CWB] Incorrect total words count in a Traditional Chinese corpus on CQPweb

part of the output of "cwb-decode -C CANTON1 -ALL | less"

<s>
<text>
<text_id T01>
中環    N       中環
保育    V       保育
奇觀    N       奇觀
:      PU      :
孫中山  N       孫中山
史蹟    N       史蹟
徑      N       徑
至      CONJ    至
大館    N       大館
</text_id>
</text>
</s>


part of the output of "cwb-describe-corpus -s CANTON1"

============================================================
Corpus: CANTON1
============================================================

description:
registry file:  /usr/local/share/cwb/registry/canton1
home directory: /usr/local/corpora/data/canton1/
info file:      /usr/local/corpora/data/canton1/.info
size (tokens):  23

  3 positional attributes
  3 structural attributes
  0 alignment  attributes

p-ATT word                     23 tokens,       22 types
p-ATT pos                      23 tokens,        8 types
p-ATT lemma                    23 tokens,       22 types
s-ATT s                         2 regions
s-ATT text                      2 regions
s-ATT text_id                   2 regions (with annotations)


It seems that CWB can recognize the number of words but CQPweb doesn't.

Regards,
Lai

2018-06-19 15:43 GMT+08:00 Stefan Evert <stefanML at collocations.de>:
What does the corpus look like if you decode it from the CWB index with the following command?

        cwb-decode -C CANTON1 -ALL | less

Can you show us part of the output?  It would also be useful to see the output of

        cwb-describe-corpus -s CANTON1


One possibility I can think of is that your linebreaks are messed up so that CWB treats everything within the text region as a single long line.
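A quick way to test that hypothesis on the input file is to count its LF-terminated lines and the lines that contain a carriage return. This is a minimal sketch using only standard tools; `linebreak_report` is a name invented here, not a CWB utility.

```shell
#!/bin/sh
# Hypothetical helper: report line-ending statistics for a text file.
# A file with CR-only line endings shows very few LF-terminated lines;
# a file with Windows (CRLF) endings shows a CR on almost every line.
linebreak_report() {
    file="$1"
    cr=$(printf '\r')                              # a literal carriage return
    lf_lines=$(wc -l < "$file" | tr -d ' ')        # number of newline characters
    cr_lines=$(grep -c "$cr" "$file" || true)      # lines containing a CR
    echo "LF-terminated lines: $lf_lines"
    echo "lines containing CR: $cr_lines"
}
```

If the first figure is far smaller than the expected number of tokens plus tags, or the second figure is non-zero, the line endings of the .vrt file are worth normalising before encoding.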

Best,
Stefan


> On 19 Jun 2018, at 09:26, Hermann Lai <halflifelai at gmail.com> wrote:
>
> I am using CQPwebinabox and I have indexed a Traditional Chinese corpus called "canton1" using two commands:
>
> sudo cwb-encode -d /usr/local/corpora/data/canton1 -f /home/user/Desktop/corpora/canton1/canton1.vrt -R /usr/local/share/cwb/registry/canton1 -c utf8 -xsB -P pos -P lemma -S s:0 -S text:0+id
>
> sudo cwb-make -V CANTON1
>
> After that, I installed the corpus onto CQPweb. Most things are correct. However, the total number of corpus texts is the same as the total number of words in all corpus texts.
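As a sanity check on counts like these, the number of token lines in the .vrt input can be compared with the size that CWB and CQPweb report. The sketch below assumes that every non-blank line not starting with "<" is one token, and that every line starting with "<" is a tag declared to cwb-encode. That is an approximation, and `count_vrt_tokens` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: count token lines in a verticalised (.vrt) file.
# Blank lines are dropped first, then every line that does not start
# with "<" (ignoring leading whitespace) is counted as one token.
count_vrt_tokens() {
    grep -v '^[[:space:]]*$' "$1" | grep -vc '^[[:space:]]*<' || true
}
```

For the file in this thread the call would be `count_vrt_tokens /home/user/Desktop/corpora/canton1/canton1.vrt`. If the result matches the CWB token count but not the figure shown in CQPweb, the discrepancy lies on the CQPweb side rather than in the encoding step.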

_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://liste.sslmit.unibo.it/mailman/listinfo/cwb



--
Gaspard Germannson
