[CWB] Error #1300 generating word frequency lists

José Manuel Martínez Martínez chozelinek at gmail.com
Wed Aug 8 10:04:49 CEST 2018


Hi, Andrew,

I think I found the root of the problem. My VRT files contain characters
that are valid UTF-8; however, MySQL's utf8 character set (and its
collations) covers only a subset of full UTF-8, namely the characters
encoded in at most 3 bytes. To get the full character set, one needs to use
utf8mb4.

After some testing I found the files causing the problems, and I think that
all of them contained some character outside the subset supported by MySQL's
utf8.
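
To locate such characters before loading, a scan along these lines may help.
It is only a sketch: it assumes the files are valid UTF-8 and that the
offending characters are the ones above U+FFFF, which need 4 bytes in UTF-8.
The function names `find_non_bmp` and `scan_file` are made up for
illustration.

```python
def find_non_bmp(text):
    """Return (line, column, character) for every character above U+FFFF."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            # Code points above U+FFFF take 4 bytes in UTF-8, which
            # MySQL's 3-byte utf8 character set cannot store.
            if ord(ch) > 0xFFFF:
                hits.append((line_no, col, ch))
    return hits


def scan_file(path):
    """Read one file as UTF-8 and report its 4-byte characters."""
    with open(path, encoding="utf-8") as handle:
        return find_non_bmp(handle.read())
```

Calling `scan_file` on each VRT file and printing the non-empty results
should point at the files (and positions) worth a closer look.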

We need to make sure that the database uses utf8mb4 instead of utf8 as its
character set, with a matching collation. See
<https://mathiasbynens.be/notes/mysql-utf8mb4> or the first answer here
<https://stackoverflow.com/questions/22572558/how-to-set-character-set-database-and-collation-database-to-utf8-in-my-ini>;
the third answer here could be relevant from the Python side
<https://stackoverflow.com/questions/26532722/how-to-encode-utf8mb4-in-python>.

If we wanted the tables themselves to use the new encoding and collation, we
would need to change CQPweb's code. However, a change from utf8 to utf8mb4 is
not trivial because the maximum lengths of `char`, `varchar` and `handles`
are affected (as up to 4 bytes are reserved for every character instead of 3,
the maximum size in characters of those types is reduced). Having said that,
I did not need to touch the tables; it was enough to change some global
configuration and the charset and collation of the database.
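
To illustrate the length issue with rough arithmetic: the sketch below
assumes the 767-byte index key prefix limit of the older InnoDB row formats
(newer setups may allow more), and all names in it are hypothetical. Under
that assumption, an indexed `varchar(255)` column like those in the frequency
tables fits with utf8 but not with utf8mb4.

```python
# Assumption: 767-byte index key prefix limit (older InnoDB row formats).
INDEX_PREFIX_LIMIT_BYTES = 767

# Maximum bytes MySQL reserves per character in each character set.
MAX_BYTES_PER_CHAR = {"utf8": 3, "utf8mb4": 4}


def max_indexable_chars(charset, limit=INDEX_PREFIX_LIMIT_BYTES):
    """Longest column, in characters, whose full value fits in an index key."""
    return limit // MAX_BYTES_PER_CHAR[charset]


def fits_in_index(length, charset, limit=INDEX_PREFIX_LIMIT_BYTES):
    """True if a column of `length` characters stays within the key limit."""
    return length * MAX_BYTES_PER_CHAR[charset] <= limit


for charset in ("utf8", "utf8mb4"):
    print(charset, max_indexable_chars(charset), fits_in_index(255, charset))
```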

My issue was solved just by doing the following.

In the MySQL configuration file `/etc/mysql/my.cnf` I wrote:

```ini
[client]
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4

[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
```

And then I also modified the character set and the collation for cqpweb_db:

```sql
ALTER DATABASE cqpweb_db CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
```

Check the result in the MySQL console with:

```sql
SHOW VARIABLES
WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
```
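
The output of that query can also be checked programmatically. A small
sketch, assuming the rows have already been fetched into a dict (no database
connection is made here; `not_utf8mb4` and `IGNORED` are hypothetical names).
`character_set_system` and `character_set_filesystem` are skipped because
MySQL normally reports `utf8` and `binary` for them regardless of this
configuration.

```python
# Variables that are not expected to switch to utf8mb4.
IGNORED = {
    "character_set_filesystem",
    "character_set_system",
    "character_sets_dir",
}


def not_utf8mb4(variables):
    """Return the variables whose value is not a utf8mb4 charset or collation."""
    problems = {}
    for name, value in variables.items():
        if name in IGNORED:
            continue
        # Both the charset name and its collations start with "utf8mb4".
        if not value.startswith("utf8mb4"):
            problems[name] = value
    return problems


# Values as reported earlier in this thread, before the change.
before = {
    "character_set_client": "utf8",
    "character_set_server": "latin1",
    "character_set_system": "utf8",
    "collation_server": "latin1_swedish_ci",
}
print(not_utf8mb4(before))
```

An empty result after restarting MySQL would suggest the new settings are in
effect.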

If one is creating the database from scratch, one could use:

```sql
CREATE DATABASE cqpweb_db2 DEFAULT CHARSET utf8mb4 COLLATE utf8mb4_general_ci;
```

After modifying the MySQL configuration file and changing the character set
and collation for the database (I did not change anything for the tables),
CQPweb was able to generate the frequency lists without problems.

I couldn't say whether this is a critical issue. I never had this problem
before, because I used to normalize characters. Now I'm working with very
heterogeneous data. I can foresee problems if someone is working with emojis
and the like (tweets, etc.).

Cheers,

jmm


--
José Manuel Martínez Martínez
https://chozelinek.github.io

On Mon, Aug 6, 2018 at 3:01 PM, Hardie, Andrew <a.hardie at lancaster.ac.uk>
wrote:

> >> is it possible to add a new corpus from the command line?
>
>
>
> Not yet.
>
>
>
> >> I've seen a create-corpus.php script but it says //TODO
>
>
>
> Precisely!
>
>
>
> *From:* cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> *On
> Behalf Of *José Manuel Martínez Martínez
> *Sent:* 06 August 2018 13:30
>
> *To:* Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it
> >
> *Subject:* Re: [CWB] Error #1300 generating word frequency lists
>
>
>
> Hi again,
>
>
>
> last question, is it possible to add a new corpus from the command line?
> Not only the generation of the frequency lists? I've seen a
> create-corpus.php script but it says //TODO ;-)
>
>
>
> And just in case it helps, this is what I see regarding my MySQL config
>
>
>
> mysql> show VARIABLES like '%collation%';
> +----------------------+-------------------+
> | Variable_name        | Value             |
> +----------------------+-------------------+
> | collation_connection | utf8_general_ci   |
> | collation_database   | utf8_general_ci   |
> | collation_server     | latin1_swedish_ci |
> +----------------------+-------------------+
> 3 rows in set (0.00 sec)
>
> mysql> show variables like '%character%';
> +--------------------------+----------------------------+
> | Variable_name            | Value                      |
> +--------------------------+----------------------------+
> | character_set_client     | utf8                       |
> | character_set_connection | utf8                       |
> | character_set_database   | utf8                       |
> | character_set_filesystem | binary                     |
> | character_set_results    | utf8                       |
> | character_set_server     | latin1                     |
> | character_set_system     | utf8                       |
> | character_sets_dir       | /usr/share/mysql/charsets/ |
> +--------------------------+----------------------------+
> 8 rows in set (0.00 sec)
>
> SHOW FULL COLUMNS FROM __tempfreq_spanish;
> +----------+------------------+-----------------+------+-----+---------+-------+---------------------------------+---------+
> | Field    | Type             | Collation       | Null | Key | Default | Extra | Privileges                      | Comment |
> +----------+------------------+-----------------+------+-----+---------+-------+---------------------------------+---------+
> | freq     | int(11) unsigned | NULL            | YES  |     | NULL    |       | select,insert,update,references |         |
> | word     | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
> | dep      | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
> | ent_type | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
> | is_alpha | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
> | is_digit | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
> | is_oov   | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
> | lemma    | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
> | lower    | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
> | pos      | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
> | tag      | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
> +----------+------------------+-----------------+------+-----+---------+-------+---------------------------------+---------+
> 11 rows in set (0.00 sec)
>
>
>
>
> --
>
> José Manuel Martínez Martínez
>
> https://chozelinek.github.io
>
>
>
> On Mon, Aug 6, 2018 at 1:31 PM, José Manuel Martínez Martínez <
> chozelinek at gmail.com> wrote:
>
> Hi Andrew,
>
>
>
> thanks for the pointers. I didn't mention it, but I'm installing the new
> corpora from already indexed corpora. Just in case this might be relevant.
>
>
>
> I'll check with iconv and also with the generation of the frequency lists.
>
>
>
> Cheers,
>
>
> --
>
> José Manuel Martínez Martínez
>
> https://chozelinek.github.io
>
>
>
> On Mon, Aug 6, 2018 at 12:03 PM, Hardie, Andrew <a.hardie at lancaster.ac.uk>
> wrote:
>
> A record is kept of the messages retrieved during indexing. Run this MySQL
> query to see it:
>
>
>
> SELECT indexing_notes FROM corpus_info WHERE corpus="lowercase corpus
> handle here";
>
>
>
> And you will see all the messages that cwb-encode & friends emitted during
> indexing.
>
>
>
> >> Would there be a way to run the command to generate the frequency
> lists from the command line?
>
>
>
> Yes, see Admin manual section 5.10 (p 48 in the version on the website
> <http://cwb.sourceforge.net/files/CQPwebAdminManual.pdf>)
>
>
>
> That’s just the freqlist. To encode offline, use the cwb binaries.
>
>
>
> But actually, it might be easier to run iconv(1) on your files with UTF-8
> as input encoding, and see whether/where it chokes.
>
>
>
> best
>
>
>
> Andrew.
>
>
>
>
>
> *From:* cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> *On
> Behalf Of *José Manuel Martínez Martínez
> *Sent:* 06 August 2018 10:44
> *To:* Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it
> >
> *Subject:* Re: [CWB] Error #1300 generating word frequency lists
>
>
>
> Hi Andrew,
>
>
>
> thank you very much for your quick reply.
>
>
>
> CQPweb v3.2.31
>
> CWB v3.4.14
>
>
>
> The underlying data should be UTF-8.
>
>
>
> I cannot remember right now whether I had an encoding error at the
> encoding stage.
>
>
>
> I'll re-encode the corpus and let you know if I get any error in that
> regard.
>
>
>
> Would there be a way to run the command to generate the frequency lists
> from the command line? I think I can leave a script incrementally encoding
> all the texts in my corpus, to find out, at least, which file is causing
> problems.
>
>
>
> Cheers,
>
>
>
>
> --
>
> José Manuel Martínez Martínez
>
> https://chozelinek.github.io
>
>
>
> On Mon, Aug 6, 2018 at 10:10 AM, Hardie, Andrew <a.hardie at lancaster.ac.uk>
> wrote:
>
> The key bit of the error message is this:
>
>
>
> Error # 1300: Invalid utf8 character string: ''
>
>
>
> (unfortunate that the actual bad string can’t be identified from this.)
>
>
>
> This suggests that there is a bad string in the CWB index, and it is
> caught by the MySql db on freq list setup. Recent versions of CWB however
> should not permit the indexing of badly-encoded strings (recent meaning,
> last several years). You should have had an error at the encoding stage if
> there was an encoding error in your data.
>
>
>
> What’s your CWB version? (also your CQPweb version) Also, is the
> underlying data UTF-8 or Latin-1?
>
>
>
> best
>
>
>
> Andrew.
>
>
>
>
>
>
>
> *From:* cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> *On
> Behalf Of *José Manuel Martínez Martínez
> *Sent:* 06 August 2018 08:18
> *To:* Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it
> >
> *Subject:* [CWB] Error #1300 generating word frequency lists
>
>
>
> Good morning!
>
>
>
> Trying to run collocations on a corpus in Spanish, I've got an error.
>
>
>
> Somehow, the word frequency list wasn't generated.
>
>
>
> I tried to generate it again but the process fails and I get the traceback
> that I copy/paste below.
>
>
>
> Is this a CQPweb issue or should I check some settings of the MySQL
> database?
>
>
>
> Cheers,
>
>
>
> jmm
>
>
>
> --- TRACEBACK ---
>
>
>
> CQPweb encountered an error and could not continue.
>
>
>
>
>
> A MySQL query did not run successfully!
>
>
>
>
>
>
>
>
>
>
>
> Original query: LOAD DATA LOCAL INFILE '/data/cqpweb/tmp/______tempfreq_spanish.tbl'
> INTO TABLE `__tempfreq_spanish` FIELDS ESCAPED BY '' /* from User:
> datamaran | Function: corpus_make_freqtables() | 2018-Aug-03 12:41:27 */
>
>
>
>
>
>
>
>
>
>
>
> Error # 1300: Invalid utf8 character string: ''
>
>
>
>
>
>
>
> PHP debugging backtrace
>
> array(6) {
>
>   [1]=>
>
>   array(4) {
>
>     ["file"]=>
>
>     string(40) "/var/www/html/cqpweb/lib/library.inc.php"
>
>     ["line"]=>
>
>     int(286)
>
>     ["function"]=>
>
>     string(20) "exiterror_mysqlquery"
>
>     ["args"]=>
>
>     array(3) {
>
>       [0]=>
>
>       int(1300)
>
>       [1]=>
>
>       string(33) "Invalid utf8 character string: ''"
>
>       [2]=>
>
>       string(210) "LOAD DATA LOCAL INFILE '/data/cqpweb/tmp/______tempfreq_spanish.tbl'
> INTO TABLE `__tempfreq_spanish` FIELDS ESCAPED BY ''
>
>             /* from User: datamaran | Function: corpus_make_freqtables() |
> 2018-Aug-03 12:41:27 */"
>
>     }
>
>   }
>
>   [2]=>
>
>   array(4) {
>
>     ["file"]=>
>
>     string(40) "/var/www/html/cqpweb/lib/library.inc.php"
>
>     ["line"]=>
>
>     int(410)
>
>     ["function"]=>
>
>     string(14) "do_mysql_query"
>
>     ["args"]=>
>
>     array(1) {
>
>       [0]=>
>
>       &string(210) "LOAD DATA LOCAL INFILE '/data/cqpweb/tmp/______tempfreq_spanish.tbl'
> INTO TABLE `__tempfreq_spanish` FIELDS ESCAPED BY ''
>
>             /* from User: datamaran | Function: corpus_make_freqtables() |
> 2018-Aug-03 12:41:27 */"
>
>     }
>
>   }
>
>   [3]=>
>
>   array(4) {
>
>     ["file"]=>
>
>     string(42) "/var/www/html/cqpweb/lib/freqtable.inc.php"
>
>     ["line"]=>
>
>     int(124)
>
>     ["function"]=>
>
>     string(21) "do_mysql_infile_query"
>
>     ["args"]=>
>
>     array(3) {
>
>       [0]=>
>
>       string(18) "__tempfreq_spanish"
>
>       [1]=>
>
>       string(43) "/data/cqpweb/tmp/______tempfreq_spanish.tbl"
>
>       [2]=>
>
>       bool(true)
>
>     }
>
>   }
>
>   [4]=>
>
>   array(4) {
>
>     ["file"]=>
>
>     string(42) "/var/www/html/cqpweb/lib/admin-lib.inc.php"
>
>     ["line"]=>
>
>     int(838)
>
>     ["function"]=>
>
>     string(22) "corpus_make_freqtables"
>
>     ["args"]=>
>
>     array(1) {
>
>       [0]=>
>
>       string(7) "spanish"
>
>     }
>
>   }
>
>   [5]=>
>
>   array(4) {
>
>     ["file"]=>
>
>     string(47) "/var/www/html/cqpweb/lib/metadata-admin.inc.php"
>
>     ["line"]=>
>
>     int(179)
>
>     ["function"]=>
>
>     string(40) "create_text_metadata_auto_freqlist_calls"
>
>     ["args"]=>
>
>     array(1) {
>
>       [0]=>
>
>       string(7) "spanish"
>
>     }
>
>   }
>
>   [6]=>
>
>   array(4) {
>
>     ["file"]=>
>
>     string(43) "/var/www/html/cqpweb/exe/metadata-admin.php"
>
>     ["line"]=>
>
>     int(3)
>
>     ["args"]=>
>
>     array(1) {
>
>       [0]=>
>
>       string(47) "/var/www/html/cqpweb/lib/metadata-admin.inc.php"
>
>     }
>
>     ["function"]=>
>
>     string(7) "require"
>
>   }
>
> }
>
>
>
> --
>
> José Manuel Martínez Martínez
>
> https://chozelinek.github.io
>
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb