[CWB] Error #1300 generating word frequency lists
José Manuel Martínez Martínez
chozelinek at gmail.com
Mon Aug 6 14:29:58 CEST 2018
Hi again,
one last question: is it possible to add a new corpus from the command line,
and not only to generate the frequency lists? I've seen a
create-corpus.php script, but it says //TODO ;-)
And just in case it helps, this is what I see in my MySQL configuration:
mysql> show VARIABLES like '%collation%';
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | utf8_general_ci   |
| collation_database   | utf8_general_ci   |
| collation_server     | latin1_swedish_ci |
+----------------------+-------------------+
3 rows in set (0.00 sec)
mysql> show variables like '%character%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
SHOW FULL COLUMNS FROM __tempfreq_spanish;
+----------+------------------+-----------------+------+-----+---------+-------+---------------------------------+---------+
| Field    | Type             | Collation       | Null | Key | Default | Extra | Privileges                      | Comment |
+----------+------------------+-----------------+------+-----+---------+-------+---------------------------------+---------+
| freq     | int(11) unsigned | NULL            | YES  |     | NULL    |       | select,insert,update,references |         |
| word     | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
| dep      | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
| ent_type | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
| is_alpha | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
| is_digit | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
| is_oov   | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
| lemma    | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
| lower    | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
| pos      | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
| tag      | varchar(255)     | utf8_general_ci | NO   | MUL | NULL    |       | select,insert,update,references |         |
+----------+------------------+-----------------+------+-----+---------+-------+---------------------------------+---------+
11 rows in set (0.00 sec)
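
Since every text column above is utf8_general_ci, one thing that may be worth ruling out is the dump file itself. What follows is only a sketch and not part of CQPweb: it scans the raw bytes of a table dump line by line and reports lines that are not valid UTF-8, as well as lines containing characters above U+FFFF. The second check rests on an assumption, namely that the failure could be related to MySQL's utf8 character set storing at most three bytes per character. The function name scan_table_dump is made up for illustration.

```python
from typing import Iterator, Tuple


def scan_table_dump(data: bytes) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, problem) for each suspicious line of a dump.

    Hypothetical helper, not a CQPweb function.
    """
    for lineno, raw in enumerate(data.split(b"\n"), start=1):
        try:
            text = raw.decode("utf-8")  # strict decoding: raises on bad bytes
        except UnicodeDecodeError as err:
            yield lineno, f"invalid UTF-8 at byte {err.start}: {err.reason}"
            continue
        # Assumption: characters outside the BMP need four bytes in UTF-8,
        # which MySQL's three-byte utf8 character set cannot store.
        wide = [ch for ch in text if ord(ch) > 0xFFFF]
        if wide:
            yield lineno, f"4-byte character(s): {wide!r}"
```

To try it on the real file, the bytes can be read with open('/data/cqpweb/tmp/______tempfreq_spanish.tbl', 'rb').read() and passed to the function, provided the temporary file still exists after the failed run.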
--
José Manuel Martínez Martínez
https://chozelinek.github.io
On Mon, Aug 6, 2018 at 1:31 PM, José Manuel Martínez Martínez <
chozelinek at gmail.com> wrote:
> Hi Andrew,
>
> thanks for the pointers. I didn't mention it, but I'm installing the new
> corpora from already indexed corpora. Just in case this might be relevant.
>
> I'll check with iconv and also with the generation of the frequency lists.
>
> Cheers,
>
> --
> José Manuel Martínez Martínez
> https://chozelinek.github.io
>
> On Mon, Aug 6, 2018 at 12:03 PM, Hardie, Andrew <a.hardie at lancaster.ac.uk>
> wrote:
>
>> A record is kept of the messages retrieved during indexing. Run this
>> MySQL query to see it:
>>
>>
>>
>> SELECT indexing_notes FROM corpus_info WHERE corpus="lowercase corpus handle here";
>>
>>
>>
>> And you will see all the messages that cwb-encode & friends emitted
>> during indexing.
>>
>>
>>
>> >> Would there be a way to run the command that generates the frequency
>> lists from the command line?
>>
>>
>>
>> Yes, see Admin manual section 5.10 (p 48 in the version on the website
>> <http://cwb.sourceforge.net/files/CQPwebAdminManual.pdf>)
>>
>>
>>
>> That covers only the frequency lists. To encode a corpus offline, use the
>> CWB binaries (cwb-encode & friends).
>>
>>
>>
>> But actually, it might be easier to run iconv(1) on your files with UTF-8
>> as input encoding, and see whether/where it chokes.
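
Where iconv is not at hand, roughly the same check can be sketched in Python: decode each file strictly as UTF-8 and report the line and byte offset of the first sequence that fails. This is an illustration only. The function names and the *.vrt file pattern are assumptions and are not taken from CWB or CQPweb.

```python
from pathlib import Path
from typing import Dict, Optional, Tuple


def first_bad_byte(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (line_number, byte_offset) of the first invalid UTF-8
    sequence in data, or None if everything decodes cleanly."""
    try:
        data.decode("utf-8")  # strict mode, comparable to iconv -f UTF-8
    except UnicodeDecodeError as err:
        # Count the newlines before the failing byte to get a 1-based line.
        line = data.count(b"\n", 0, err.start) + 1
        return line, err.start
    return None


def check_tree(root: str, pattern: str = "*.vrt") -> Dict[str, Tuple[int, int]]:
    """Check every file under root that matches pattern.

    The default pattern is a guess at how the input files are named.
    Returns a mapping from file path to (line_number, byte_offset).
    """
    bad = {}
    for path in sorted(Path(root).rglob(pattern)):
        hit = first_bad_byte(path.read_bytes())
        if hit is not None:
            bad[str(path)] = hit
    return bad
```

Running check_tree on the directory holding the corpus source files would list only the files that fail, which also answers the question of which file is causing the trouble without re-encoding them one by one.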
>>
>>
>>
>> best
>>
>>
>>
>> Andrew.
>>
>>
>>
>>
>>
>> *From:* cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> *On
>> Behalf Of *José Manuel Martínez Martínez
>> *Sent:* 06 August 2018 10:44
>> *To:* Open source development of the Corpus WorkBench <
>> cwb at sslmit.unibo.it>
>> *Subject:* Re: [CWB] Error #1300 generating word frequency lists
>>
>>
>>
>> Hi Andrew,
>>
>>
>>
>> thank you very much for your quick reply.
>>
>>
>>
>> CQPweb v3.2.31
>>
>> CWB v3.4.14
>>
>>
>>
>> The underlying data should be UTF-8.
>>
>>
>>
>> I cannot remember right now whether I got an encoding error at the
>> encoding stage.
>>
>>
>>
>> I'll re-encode the corpus and let you know if I get any errors in that
>> regard.
>>
>>
>>
>> Would there be a way to run the command that generates the frequency lists
>> from the command line? I think I could leave a script running that encodes
>> the texts in my corpus incrementally, to find out at least which file is
>> causing the problem.
>>
>>
>>
>> Cheers,
>>
>>
>>
>>
>> --
>>
>> José Manuel Martínez Martínez
>>
>> https://chozelinek.github.io
>>
>>
>>
>> On Mon, Aug 6, 2018 at 10:10 AM, Hardie, Andrew <a.hardie at lancaster.ac.uk>
>> wrote:
>>
>> The key bit of the error message is this:
>>
>>
>>
>> Error # 1300: Invalid utf8 character string: ''
>>
>>
>>
>> (Unfortunately, the actual bad string can't be identified from this.)
>>
>>
>>
>> This suggests that there is a bad string in the CWB index, which is
>> caught by the MySQL database when the frequency list is set up. However,
>> recent versions of CWB (meaning those from the last several years) should
>> not permit badly encoded strings to be indexed. If there were an encoding
>> error in your data, you should have had an error at the encoding stage.
>>
>>
>>
>> What's your CWB version, and your CQPweb version? Also, is the
>> underlying data UTF-8 or Latin-1?
>>
>>
>>
>> best
>>
>>
>>
>> Andrew.
>>
>>
>>
>>
>>
>>
>>
>> *From:* cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> *On
>> Behalf Of *José Manuel Martínez Martínez
>> *Sent:* 06 August 2018 08:18
>> *To:* Open source development of the Corpus WorkBench <
>> cwb at sslmit.unibo.it>
>> *Subject:* [CWB] Error #1300 generating word frequency lists
>>
>>
>>
>> Good morning!
>>
>>
>>
>> Trying to run collocations on a corpus in Spanish, I've got an error.
>>
>>
>>
>> Somehow, the word frequency list wasn't generated.
>>
>>
>>
>> I tried to generate it again but the process fails and I get the
>> traceback that I copy/paste below.
>>
>>
>>
>> Is this a CQPweb issue or should I check some settings of the MySQL
>> database?
>>
>>
>>
>> Cheers,
>>
>>
>>
>> jmm
>>
>>
>>
>> --- TRACEBACK ---
>>
>> CQPweb encountered an error and could not continue.
>>
>> A MySQL query did not run successfully!
>>
>> Original query: LOAD DATA LOCAL INFILE '/data/cqpweb/tmp/______tempfreq_spanish.tbl'
>> INTO TABLE `__tempfreq_spanish` FIELDS ESCAPED BY '' /* from User:
>> datamaran | Function: corpus_make_freqtables() | 2018-Aug-03 12:41:27 */
>>
>> Error # 1300: Invalid utf8 character string: ''
>>
>> PHP debugging backtrace
>>
>> array(6) {
>>   [1]=>
>>   array(4) {
>>     ["file"]=>
>>     string(40) "/var/www/html/cqpweb/lib/library.inc.php"
>>     ["line"]=>
>>     int(286)
>>     ["function"]=>
>>     string(20) "exiterror_mysqlquery"
>>     ["args"]=>
>>     array(3) {
>>       [0]=>
>>       int(1300)
>>       [1]=>
>>       string(33) "Invalid utf8 character string: ''"
>>       [2]=>
>>       string(210) "LOAD DATA LOCAL INFILE '/data/cqpweb/tmp/______tempfreq_spanish.tbl'
>> INTO TABLE `__tempfreq_spanish` FIELDS ESCAPED BY ''
>> /* from User: datamaran | Function: corpus_make_freqtables()
>> | 2018-Aug-03 12:41:27 */"
>>     }
>>   }
>>   [2]=>
>>   array(4) {
>>     ["file"]=>
>>     string(40) "/var/www/html/cqpweb/lib/library.inc.php"
>>     ["line"]=>
>>     int(410)
>>     ["function"]=>
>>     string(14) "do_mysql_query"
>>     ["args"]=>
>>     array(1) {
>>       [0]=>
>>       &string(210) "LOAD DATA LOCAL INFILE '/data/cqpweb/tmp/______tempfreq_spanish.tbl'
>> INTO TABLE `__tempfreq_spanish` FIELDS ESCAPED BY ''
>> /* from User: datamaran | Function: corpus_make_freqtables()
>> | 2018-Aug-03 12:41:27 */"
>>     }
>>   }
>>   [3]=>
>>   array(4) {
>>     ["file"]=>
>>     string(42) "/var/www/html/cqpweb/lib/freqtable.inc.php"
>>     ["line"]=>
>>     int(124)
>>     ["function"]=>
>>     string(21) "do_mysql_infile_query"
>>     ["args"]=>
>>     array(3) {
>>       [0]=>
>>       string(18) "__tempfreq_spanish"
>>       [1]=>
>>       string(43) "/data/cqpweb/tmp/______tempfreq_spanish.tbl"
>>       [2]=>
>>       bool(true)
>>     }
>>   }
>>   [4]=>
>>   array(4) {
>>     ["file"]=>
>>     string(42) "/var/www/html/cqpweb/lib/admin-lib.inc.php"
>>     ["line"]=>
>>     int(838)
>>     ["function"]=>
>>     string(22) "corpus_make_freqtables"
>>     ["args"]=>
>>     array(1) {
>>       [0]=>
>>       string(7) "spanish"
>>     }
>>   }
>>   [5]=>
>>   array(4) {
>>     ["file"]=>
>>     string(47) "/var/www/html/cqpweb/lib/metadata-admin.inc.php"
>>     ["line"]=>
>>     int(179)
>>     ["function"]=>
>>     string(40) "create_text_metadata_auto_freqlist_calls"
>>     ["args"]=>
>>     array(1) {
>>       [0]=>
>>       string(7) "spanish"
>>     }
>>   }
>>   [6]=>
>>   array(4) {
>>     ["file"]=>
>>     string(43) "/var/www/html/cqpweb/exe/metadata-admin.php"
>>     ["line"]=>
>>     int(3)
>>     ["args"]=>
>>     array(1) {
>>       [0]=>
>>       string(47) "/var/www/html/cqpweb/lib/metadata-admin.inc.php"
>>     }
>>     ["function"]=>
>>     string(7) "require"
>>   }
>> }
>>
>> --
>>
>> José Manuel Martínez Martínez
>>
>> https://chozelinek.github.io
>>
>>
>> _______________________________________________
>> CWB mailing list
>> CWB at sslmit.unibo.it
>> http://liste.sslmit.unibo.it/mailman/listinfo/cwb
>