[CWB] Error #1300 generating word frequency lists

José Manuel Martínez Martínez chozelinek at gmail.com
Mon Aug 6 14:29:58 CEST 2018


Hi again,

One last question: is it possible to add a new corpus from the command
line, and not only generate the frequency lists? I've seen a
create-corpus.php script, but it says //TODO ;-)

And just in case it helps, this is what I see regarding my MySQL config:

mysql> show VARIABLES like '%collation%';
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | utf8_general_ci   |
| collation_database   | utf8_general_ci   |
| collation_server     | latin1_swedish_ci |
+----------------------+-------------------+
3 rows in set (0.00 sec)


mysql> show variables like '%character%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

mysql> SHOW FULL COLUMNS FROM __tempfreq_spanish;
+----------+------------------+-----------------+------+-----+---------+-------+---------------------------------+---------+
| Field    | Type             | Collation       | Null | Key | Default | Extra | Privileges                      | Comment |
+----------+------------------+-----------------+------+-----+---------+-------+---------------------------------+---------+
| freq     | int(11) unsigned | NULL            | YES  |     | NULL    |       | select,insert,update,references |         |
| word     | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
| dep      | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
| ent_type | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
| is_alpha | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
| is_digit | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
| is_oov   | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
| lemma    | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
| lower    | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
| pos      | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
| tag      | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
+----------+------------------+-----------------+------+-----+---------+-------+---------------------------------+---------+
11 rows in set (0.00 sec)
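
In case it is useful for the iconv check suggested in the quoted message
below, here is a minimal Python sketch of the same idea: it reads a file as
raw bytes and reports the lines that do not decode as UTF-8. It also flags
characters outside the Basic Multilingual Plane, on the assumption (mine,
not confirmed anywhere in this thread) that MySQL's utf8 charset, which
stores at most three bytes per character, could reject those as well. The
function name find_suspect_lines is made up for illustration.

```python
def find_suspect_lines(path):
    """Return (line_number, reason) pairs for lines that may upset MySQL's utf8.

    Hypothetical helper, not part of CWB or CQPweb.
    """
    suspects = []
    # Open in binary mode so that decoding happens under our control.
    with open(path, "rb") as handle:
        for number, raw in enumerate(handle, start=1):
            try:
                text = raw.decode("utf-8")
            except UnicodeDecodeError as err:
                # err.start is the offset of the first bad byte within the line.
                suspects.append((number, f"invalid UTF-8 at byte offset {err.start}"))
                continue
            # Assumption: 4-byte characters (code points above U+FFFF) are
            # also worth reporting, since MySQL's utf8 is a 3-byte charset.
            if any(ord(ch) > 0xFFFF for ch in text):
                suspects.append((number, "4-byte character (outside the BMP)"))
    return suspects
```

It could be pointed at the temporary file named in the failing LOAD DATA
query (/data/cqpweb/tmp/______tempfreq_spanish.tbl), if that file still
exists, or at the original input texts one by one.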


--
José Manuel Martínez Martínez
https://chozelinek.github.io

On Mon, Aug 6, 2018 at 1:31 PM, José Manuel Martínez Martínez <
chozelinek at gmail.com> wrote:

> Hi Andrew,
>
> thanks for the pointers. I didn't mention it, but I'm installing the new
> corpora from already indexed corpora. Just in case this might be relevant.
>
> I'll check with iconv and also with the generation of the frequency lists.
>
> Cheers,
>
> --
> José Manuel Martínez Martínez
> https://chozelinek.github.io
>
> On Mon, Aug 6, 2018 at 12:03 PM, Hardie, Andrew <a.hardie at lancaster.ac.uk>
> wrote:
>
>> A record is kept of the messages retrieved during indexing. Run this
>> MySQL query to see it:
>>
>> SELECT indexing_notes FROM corpus_info WHERE corpus="lowercase corpus handle here";
>>
>> And you will see all the messages that cwb-encode & friends emitted
>> during indexing.
>>
>> >> Would there be a way to run the command that generates the frequency
>> >> lists from the command line?
>>
>> Yes, see Admin manual section 5.10 (p 48 in the version on the website
>> <http://cwb.sourceforge.net/files/CQPwebAdminManual.pdf>)
>>
>> That’s just the freqlist. To encode offline, use the cwb binaries.
>>
>> But actually, it might be easier to run iconv(1) on your files with UTF-8
>> as input encoding, and see whether/where it chokes.
>>
>> best
>>
>> Andrew.
>>
>> *From:* cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> *On
>> Behalf Of *José Manuel Martínez Martínez
>> *Sent:* 06 August 2018 10:44
>> *To:* Open source development of the Corpus WorkBench <
>> cwb at sslmit.unibo.it>
>> *Subject:* Re: [CWB] Error #1300 generating word frequency lists
>>
>> Hi Andrew,
>>
>> thank you very much for your quick reply.
>>
>> CQPweb v3.2.31
>> CWB v3.4.14
>>
>> The underlying data should be UTF-8.
>>
>> I cannot remember right now whether I got any encoding errors at the
>> encoding stage.
>>
>> I'll re-encode the corpus and let you know if I get any errors in that
>> regard.
>>
>> Would there be a way to run the command that generates the frequency
>> lists from the command line? I think I could leave a script running that
>> incrementally encodes all the texts in my corpus, to find out at least
>> which file is causing problems.
>>
>> Cheers,
>>
>> --
>> José Manuel Martínez Martínez
>> https://chozelinek.github.io
>>
>> On Mon, Aug 6, 2018 at 10:10 AM, Hardie, Andrew <a.hardie at lancaster.ac.uk>
>> wrote:
>>
>> The key bit of the error message is this:
>>
>> Error # 1300: Invalid utf8 character string: ''
>>
>> (unfortunate that the actual bad string can’t be identified from this.)
>>
>> This suggests that there is a bad string in the CWB index, and it is
>> caught by the MySQL db on freq list setup. Recent versions of CWB however
>> should not permit the indexing of badly-encoded strings (recent meaning,
>> last several years). You should have had an error at the encoding stage if
>> there was an encoding error in your data.
>>
>> What’s your CWB version? (also your CQPweb version) Also, is the
>> underlying data UTF-8 or Latin-1?
>>
>> best
>>
>> Andrew.
>>
>> *From:* cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> *On
>> Behalf Of *José Manuel Martínez Martínez
>> *Sent:* 06 August 2018 08:18
>> *To:* Open source development of the Corpus WorkBench <
>> cwb at sslmit.unibo.it>
>> *Subject:* [CWB] Error #1300 generating word frequency lists
>>
>> Good morning!
>>
>> Trying to run collocations on a corpus in Spanish, I got an error.
>>
>> Somehow, the word frequency list wasn't generated.
>>
>> I tried to generate it again but the process fails and I get the
>> traceback that I copy/paste below.
>>
>> Is this a CQPweb issue or should I check some settings of the MySQL
>> database?
>>
>> Cheers,
>>
>> jmm
>>
>> --- TRACEBACK ---
>>
>> CQPweb encountered an error and could not continue.
>>
>> A MySQL query did not run successfully!
>>
>> Original query: LOAD DATA LOCAL INFILE '/data/cqpweb/tmp/______tempfreq_spanish.tbl' INTO TABLE `__tempfreq_spanish` FIELDS ESCAPED BY '' /* from User: datamaran | Function: corpus_make_freqtables() | 2018-Aug-03 12:41:27 */
>>
>> Error # 1300: Invalid utf8 character string: ''
>>
>> PHP debugging backtrace
>> array(6) {
>>   [1]=>
>>   array(4) {
>>     ["file"]=>
>>     string(40) "/var/www/html/cqpweb/lib/library.inc.php"
>>     ["line"]=>
>>     int(286)
>>     ["function"]=>
>>     string(20) "exiterror_mysqlquery"
>>     ["args"]=>
>>     array(3) {
>>       [0]=>
>>       int(1300)
>>       [1]=>
>>       string(33) "Invalid utf8 character string: ''"
>>       [2]=>
>>       string(210) "LOAD DATA LOCAL INFILE '/data/cqpweb/tmp/______tempfreq_spanish.tbl' INTO TABLE `__tempfreq_spanish` FIELDS ESCAPED BY ''
>>             /* from User: datamaran | Function: corpus_make_freqtables() | 2018-Aug-03 12:41:27 */"
>>     }
>>   }
>>   [2]=>
>>   array(4) {
>>     ["file"]=>
>>     string(40) "/var/www/html/cqpweb/lib/library.inc.php"
>>     ["line"]=>
>>     int(410)
>>     ["function"]=>
>>     string(14) "do_mysql_query"
>>     ["args"]=>
>>     array(1) {
>>       [0]=>
>>       &string(210) "LOAD DATA LOCAL INFILE '/data/cqpweb/tmp/______tempfreq_spanish.tbl' INTO TABLE `__tempfreq_spanish` FIELDS ESCAPED BY ''
>>             /* from User: datamaran | Function: corpus_make_freqtables() | 2018-Aug-03 12:41:27 */"
>>     }
>>   }
>>   [3]=>
>>   array(4) {
>>     ["file"]=>
>>     string(42) "/var/www/html/cqpweb/lib/freqtable.inc.php"
>>     ["line"]=>
>>     int(124)
>>     ["function"]=>
>>     string(21) "do_mysql_infile_query"
>>     ["args"]=>
>>     array(3) {
>>       [0]=>
>>       string(18) "__tempfreq_spanish"
>>       [1]=>
>>       string(43) "/data/cqpweb/tmp/______tempfreq_spanish.tbl"
>>       [2]=>
>>       bool(true)
>>     }
>>   }
>>   [4]=>
>>   array(4) {
>>     ["file"]=>
>>     string(42) "/var/www/html/cqpweb/lib/admin-lib.inc.php"
>>     ["line"]=>
>>     int(838)
>>     ["function"]=>
>>     string(22) "corpus_make_freqtables"
>>     ["args"]=>
>>     array(1) {
>>       [0]=>
>>       string(7) "spanish"
>>     }
>>   }
>>   [5]=>
>>   array(4) {
>>     ["file"]=>
>>     string(47) "/var/www/html/cqpweb/lib/metadata-admin.inc.php"
>>     ["line"]=>
>>     int(179)
>>     ["function"]=>
>>     string(40) "create_text_metadata_auto_freqlist_calls"
>>     ["args"]=>
>>     array(1) {
>>       [0]=>
>>       string(7) "spanish"
>>     }
>>   }
>>   [6]=>
>>   array(4) {
>>     ["file"]=>
>>     string(43) "/var/www/html/cqpweb/exe/metadata-admin.php"
>>     ["line"]=>
>>     int(3)
>>     ["args"]=>
>>     array(1) {
>>       [0]=>
>>       string(47) "/var/www/html/cqpweb/lib/metadata-admin.inc.php"
>>     }
>>     ["function"]=>
>>     string(7) "require"
>>   }
>> }
>>
>> --
>>
>> José Manuel Martínez Martínez
>>
>> https://chozelinek.github.io
>>
>>
>> _______________________________________________
>> CWB mailing list
>> CWB at sslmit.unibo.it
>> http://liste.sslmit.unibo.it/mailman/listinfo/cwb
>