[CWB] Error #1300 generating word frequency lists

Hardie, Andrew a.hardie at lancaster.ac.uk
Mon Aug 6 15:01:48 CEST 2018


>> is it possible to add a new corpus from the command line?

Not yet.

>> I've seen a create-corpus.php script but it says //TODO

Precisely!

From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of José Manuel Martínez Martínez
Sent: 06 August 2018 13:30
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: Re: [CWB] Error #1300 generating word frequency lists

Hi again,

Last question: is it possible to add a new corpus from the command line, and not only to generate the frequency lists? I've seen a create-corpus.php script, but it says //TODO ;-)

And just in case it helps, this is what I see regarding my MySQL config:


mysql> show VARIABLES like '%collation%';
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | utf8_general_ci   |
| collation_database   | utf8_general_ci   |
| collation_server     | latin1_swedish_ci |
+----------------------+-------------------+
3 rows in set (0.00 sec)



mysql> show variables like '%character%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

SHOW FULL COLUMNS FROM __tempfreq_spanish;
+----------+------------------+-----------------+------+-----+---------+-------+---------------------------------+---------+
| Field    | Type             | Collation       | Null | Key | Default | Extra | Privileges                      | Comment |
+----------+------------------+-----------------+------+-----+---------+-------+---------------------------------+---------+
| freq     | int(11) unsigned | NULL            | YES  |     | NULL    |       | select,insert,update,references |         |
| word     | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
| dep      | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
| ent_type | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
| is_alpha | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
| is_digit | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
| is_oov   | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
| lemma    | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
| lower    | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
| pos      | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
| tag      | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
+----------+------------------+-----------------+------+-----+---------+-------+---------------------------------+---------+
11 rows in set (0.00 sec)


--
José Manuel Martínez Martínez
https://chozelinek.github.io

On Mon, Aug 6, 2018 at 1:31 PM, José Manuel Martínez Martínez <chozelinek at gmail.com> wrote:
Hi Andrew,

thanks for the pointers. I didn't mention it, but I'm installing the new corpora from already indexed corpora. Just in case this might be relevant.

I'll check with iconv and also with the generation of the frequency lists.

Cheers,

--
José Manuel Martínez Martínez
https://chozelinek.github.io

On Mon, Aug 6, 2018 at 12:03 PM, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:
A record is kept of the messages retrieved during indexing. Run this MySQL query to see it:

SELECT indexing_notes FROM corpus_info WHERE corpus="lowercase corpus handle here";

And you will see all the messages that cwb-encode & friends emitted during indexing.

>> Would be there a way to run from the command line the command to generate the frequency lists?

Yes, see Admin manual section 5.10 (p. 48 in the version on the website: http://cwb.sourceforge.net/files/CQPwebAdminManual.pdf).

That’s just the freqlist. To encode offline, use the cwb binaries.

But actually, it might be easier to run iconv(1) on your files with UTF-8 as input encoding, and see whether/where it chokes.
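Where iconv is not at hand, roughly the same check can be sketched in Python. This is only an illustration of the idea: the function names `first_invalid_utf8` and `check_files` are made up for this example and are not part of CWB or CQPweb. It reports the byte offset of the first sequence that fails strict UTF-8 decoding, which is approximately where iconv would stop.

```python
from pathlib import Path
from typing import Iterable, Iterator, Optional, Tuple


def first_invalid_utf8(data: bytes) -> Optional[Tuple[int, bytes]]:
    """Return (byte offset, offending bytes) for the first invalid UTF-8
    sequence in data, or None if data decodes cleanly."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as err:
        # err.start/err.end delimit the bytes the decoder could not accept.
        return err.start, data[err.start:err.end]
    return None


def check_files(paths: Iterable[str]) -> Iterator[Tuple[str, int, bytes]]:
    """Yield (path, offset, bad bytes) for every file that is not valid UTF-8."""
    for path in paths:
        result = first_invalid_utf8(Path(path).read_bytes())
        if result is not None:
            yield str(path), result[0], result[1]
```

Calling `check_files` on the list of input files and printing whatever it yields should narrow the problem down to a file and a position; a Latin-1 byte such as `\xf1` (ñ) in otherwise plain text is a typical hit.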

best

Andrew.


From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of José Manuel Martínez Martínez
Sent: 06 August 2018 10:44
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: Re: [CWB] Error #1300 generating word frequency lists

Hi Andrew,

thank you very much for your quick reply.

CQPweb v3.2.31
CWB v3.4.14

The underlying data should be UTF-8.

I cannot remember right now whether I got an encoding error at the encoding stage.

I'll re-encode the corpus and let you know if I get any error in that regard.

Would there be a way to run the command that generates the frequency lists from the command line? I think I can leave a script encoding all the texts in my corpus incrementally, to find out at least which file is causing problems.

Cheers,


--
José Manuel Martínez Martínez
https://chozelinek.github.io

On Mon, Aug 6, 2018 at 10:10 AM, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:
The key bit of the error message is this:

Error # 1300: Invalid utf8 character string: ''

(It is unfortunate that the actual bad string can’t be identified from this.)

This suggests that there is a bad string in the CWB index, and that it is caught by the MySQL database during frequency-list setup. Recent versions of CWB (recent meaning the last several years), however, should not permit the indexing of badly encoded strings. You should have had an error at the encoding stage if there was an encoding error in your data.

What’s your CWB version (and your CQPweb version)? Also, is the underlying data UTF-8 or Latin-1?
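Since the error message does not show the bad string, one way to look for it is to scan the dump that LOAD DATA reads. The following is a minimal sketch, assuming the temporary file named in the query (`/data/cqpweb/tmp/______tempfreq_spanish.tbl`) still exists and holds one row per line; the function name `suspicious_lines` is hypothetical. Because MySQL's `utf8` character set stores at most three bytes per character, the sketch also flags valid characters above U+FFFF; whether such characters trigger this particular error is an assumption, not something confirmed in this thread.

```python
from typing import Iterator, Tuple


def suspicious_lines(raw: bytes) -> Iterator[Tuple[int, str, bytes]]:
    """Yield (1-based line number, reason, raw line) for lines that a
    3-byte `utf8` column might reject."""
    # Splitting on b"\n" is safe: byte 0x0A never occurs inside a
    # multi-byte UTF-8 sequence.
    for number, line in enumerate(raw.split(b"\n"), start=1):
        try:
            text = line.decode("utf-8")
        except UnicodeDecodeError:
            yield number, "invalid UTF-8", line
            continue
        # Assumption: 4-byte characters (outside the BMP) may also be refused.
        if any(ord(ch) > 0xFFFF for ch in text):
            yield number, "character outside the BMP", line
```

Reading the file in binary mode (`open(path, "rb").read()`) and passing the bytes to `suspicious_lines` gives the line numbers and raw bytes of candidate rows, which can then be traced back to the corpus text.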

best

Andrew.



From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of José Manuel Martínez Martínez
Sent: 06 August 2018 08:18
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: [CWB] Error #1300 generating word frequency lists

Good morning!

Trying to run collocations on a corpus in Spanish, I've got an error.

Somehow, the word frequency list wasn't generated.

I tried to generate it again but the process fails and I get the traceback that I copy/paste below.

Is this a CQPweb issue or should I check some settings of the MySQL database?

Cheers,

jmm

--- TRACEBACK ---

CQPweb encountered an error and could not continue.

A MySQL query did not run successfully!

Original query: LOAD DATA LOCAL INFILE '/data/cqpweb/tmp/______tempfreq_spanish.tbl' INTO TABLE `__tempfreq_spanish` FIELDS ESCAPED BY '' /* from User: datamaran | Function: corpus_make_freqtables() | 2018-Aug-03 12:41:27 */

Error # 1300: Invalid utf8 character string: ''


PHP debugging backtrace
array(6) {
  [1]=>
  array(4) {
    ["file"]=>
    string(40) "/var/www/html/cqpweb/lib/library.inc.php"
    ["line"]=>
    int(286)
    ["function"]=>
    string(20) "exiterror_mysqlquery"
    ["args"]=>
    array(3) {
      [0]=>
      int(1300)
      [1]=>
      string(33) "Invalid utf8 character string: ''"
      [2]=>
      string(210) "LOAD DATA LOCAL INFILE '/data/cqpweb/tmp/______tempfreq_spanish.tbl' INTO TABLE `__tempfreq_spanish` FIELDS ESCAPED BY ''
            /* from User: datamaran | Function: corpus_make_freqtables() | 2018-Aug-03 12:41:27 */"
    }
  }
  [2]=>
  array(4) {
    ["file"]=>
    string(40) "/var/www/html/cqpweb/lib/library.inc.php"
    ["line"]=>
    int(410)
    ["function"]=>
    string(14) "do_mysql_query"
    ["args"]=>
    array(1) {
      [0]=>
      &string(210) "LOAD DATA LOCAL INFILE '/data/cqpweb/tmp/______tempfreq_spanish.tbl' INTO TABLE `__tempfreq_spanish` FIELDS ESCAPED BY ''
            /* from User: datamaran | Function: corpus_make_freqtables() | 2018-Aug-03 12:41:27 */"
    }
  }
  [3]=>
  array(4) {
    ["file"]=>
    string(42) "/var/www/html/cqpweb/lib/freqtable.inc.php"
    ["line"]=>
    int(124)
    ["function"]=>
    string(21) "do_mysql_infile_query"
    ["args"]=>
    array(3) {
      [0]=>
      string(18) "__tempfreq_spanish"
      [1]=>
      string(43) "/data/cqpweb/tmp/______tempfreq_spanish.tbl"
      [2]=>
      bool(true)
    }
  }
  [4]=>
  array(4) {
    ["file"]=>
    string(42) "/var/www/html/cqpweb/lib/admin-lib.inc.php"
    ["line"]=>
    int(838)
    ["function"]=>
    string(22) "corpus_make_freqtables"
    ["args"]=>
    array(1) {
      [0]=>
      string(7) "spanish"
    }
  }
  [5]=>
  array(4) {
    ["file"]=>
    string(47) "/var/www/html/cqpweb/lib/metadata-admin.inc.php"
    ["line"]=>
    int(179)
    ["function"]=>
    string(40) "create_text_metadata_auto_freqlist_calls"
    ["args"]=>
    array(1) {
      [0]=>
      string(7) "spanish"
    }
  }
  [6]=>
  array(4) {
    ["file"]=>
    string(43) "/var/www/html/cqpweb/exe/metadata-admin.php"
    ["line"]=>
    int(3)
    ["args"]=>
    array(1) {
      [0]=>
      string(47) "/var/www/html/cqpweb/lib/metadata-admin.inc.php"
    }
    ["function"]=>
    string(7) "require"
  }
}

--
José Manuel Martínez Martínez
https://chozelinek.github.io

_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://liste.sslmit.unibo.it/mailman/listinfo/cwb

