[CWB] Error #1300 generating word frequency lists

José Manuel Martínez Martínez chozelinek at gmail.com
Mon Aug 6 13:31:37 CEST 2018


Hi Andrew,

Thanks for the pointers. I didn't mention it before, but I'm installing the
new corpora from already-indexed corpora, in case that is relevant.

I'll check with iconv and also with the generation of the frequency lists.
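
For reference, the iconv check can be approximated in Python. This is only a
minimal sketch: `scan_file` and `BadLine` are hypothetical names, not part of
CWB or CQPweb. It reads a file line by line as bytes and reports each line that
does not decode as UTF-8. It also flags characters above U+FFFF, on the
unconfirmed assumption that the target table uses MySQL's legacy 3-byte `utf8`
charset, which would reject such characters even though they are valid UTF-8.

```python
from pathlib import Path
from typing import Iterator, NamedTuple


class BadLine(NamedTuple):
    """One suspicious line found by scan_file (hypothetical helper type)."""
    path: str
    line_no: int   # 1-based line number within the file
    reason: str    # why the line was flagged
    raw: bytes     # the undecoded bytes of the line


def scan_file(path: Path) -> Iterator[BadLine]:
    """Yield lines that are not valid UTF-8, or that contain a character
    outside the Basic Multilingual Plane (4 bytes in UTF-8)."""
    with open(path, "rb") as handle:
        for line_no, raw in enumerate(handle, start=1):
            try:
                text = raw.decode("utf-8")
            except UnicodeDecodeError as err:
                # err.start is the byte offset of the bad sequence in this line
                yield BadLine(str(path), line_no,
                              f"invalid UTF-8 at byte {err.start}", raw)
                continue
            # Assumption: a 3-byte `utf8` column would reject these characters.
            if any(ord(ch) > 0xFFFF for ch in text):
                yield BadLine(str(path), line_no, "4-byte character", raw)


def report(paths: list[str]) -> int:
    """Print every flagged line for the given files; return how many were found."""
    count = 0
    for name in paths:
        for bad in scan_file(Path(name)):
            print(f"{bad.path}:{bad.line_no}: {bad.reason}: {bad.raw!r}")
            count += 1
    return count
```

Calling `report([...])` with the corpus source files, or with the temporary
`.tbl` file named in the traceback if it still exists after the failed import,
would list candidate lines with their line numbers, which the MySQL error
message itself does not give.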

Cheers,

--
José Manuel Martínez Martínez
https://chozelinek.github.io

On Mon, Aug 6, 2018 at 12:03 PM, Hardie, Andrew <a.hardie at lancaster.ac.uk>
wrote:

> A record is kept of the messages retrieved during indexing. Run this MySQL
> query to see it:
>
>
>
> SELECT indexing_notes FROM corpus_info WHERE corpus="lowercase corpus handle here";
>
>
>
> And you will see all the messages that cwb-encode & friends emitted during
> indexing.
>
>
>
> >> Would there be a way to generate the frequency lists from the
> >> command line?
>
>
>
> Yes, see Admin manual section 5.10 (p 48 in the version on the website
> <http://cwb.sourceforge.net/files/CQPwebAdminManual.pdf>)
>
>
>
> That covers only the frequency list. To encode offline, use the CWB binaries.
>
>
>
> But actually, it might be easier to run iconv(1) on your files with UTF-8
> as input encoding, and see whether/where it chokes.
>
>
>
> best
>
>
>
> Andrew.
>
>
>
>
>
> *From:* cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> *On
> Behalf Of *José Manuel Martínez Martínez
> *Sent:* 06 August 2018 10:44
> *To:* Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
> *Subject:* Re: [CWB] Error #1300 generating word frequency lists
>
>
>
> Hi Andrew,
>
>
>
> thank you very much for your quick reply.
>
>
>
> CQPweb v3.2.31
>
> CWB v3.4.14
>
>
>
> The underlying data should be UTF-8.
>
>
>
> I cannot remember right now whether I got an encoding error at the
> encoding stage.
>
>
>
> I'll re-encode the corpus and let you know if I get any error in that
> regard.
>
>
>
> Would there be a way to generate the frequency lists from the command
> line? I think I could leave a script running that encodes the texts in my
> corpus incrementally, to find out at least which file is causing the
> problem.
>
>
>
> Cheers,
>
>
>
>
> --
>
> José Manuel Martínez Martínez
>
> https://chozelinek.github.io
>
>
>
> On Mon, Aug 6, 2018 at 10:10 AM, Hardie, Andrew <a.hardie at lancaster.ac.uk>
> wrote:
>
> The key bit of the error message is this:
>
>
>
> Error # 1300: Invalid utf8 character string: ''
>
>
>
> (Unfortunately, the actual bad string can't be identified from this.)
>
>
>
> This suggests that there is a bad string in the CWB index, and that it is
> caught by the MySQL database when the frequency list is set up. Recent
> versions of CWB, however, should not permit the indexing of badly-encoded
> strings (recent meaning the last several years). You should have had an
> error at the encoding stage if there was an encoding error in your data.
>
>
>
> What's your CWB version, and your CQPweb version? Also, is the underlying
> data UTF-8 or Latin-1?
>
>
>
> best
>
>
>
> Andrew.
>
>
>
>
>
>
>
> *From:* cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> *On
> Behalf Of *José Manuel Martínez Martínez
> *Sent:* 06 August 2018 08:18
> *To:* Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
> *Subject:* [CWB] Error #1300 generating word frequency lists
>
>
>
> Good morning!
>
>
>
> Trying to run collocations on a Spanish corpus, I got an error.
>
>
>
> Somehow, the word frequency list wasn't generated.
>
>
>
> I tried to generate it again, but the process fails and I get the traceback
> pasted below.
>
>
>
> Is this a CQPweb issue or should I check some settings of the MySQL
> database?
>
>
>
> Cheers,
>
>
>
> jmm
>
>
>
> --- TRACEBACK ---
>
>
>
> CQPweb encountered an error and could not continue.
>
> A MySQL query did not run successfully!
>
> Original query: LOAD DATA LOCAL INFILE '/data/cqpweb/tmp/______tempfreq_spanish.tbl'
> INTO TABLE `__tempfreq_spanish` FIELDS ESCAPED BY '' /* from User:
> datamaran | Function: corpus_make_freqtables() | 2018-Aug-03 12:41:27 */
>
> Error # 1300: Invalid utf8 character string: ''
>
> PHP debugging backtrace
>
> array(6) {
>   [1]=>
>   array(4) {
>     ["file"]=>
>     string(40) "/var/www/html/cqpweb/lib/library.inc.php"
>     ["line"]=>
>     int(286)
>     ["function"]=>
>     string(20) "exiterror_mysqlquery"
>     ["args"]=>
>     array(3) {
>       [0]=>
>       int(1300)
>       [1]=>
>       string(33) "Invalid utf8 character string: ''"
>       [2]=>
>       string(210) "LOAD DATA LOCAL INFILE '/data/cqpweb/tmp/______tempfreq_spanish.tbl'
> INTO TABLE `__tempfreq_spanish` FIELDS ESCAPED BY ''
>             /* from User: datamaran | Function: corpus_make_freqtables() |
> 2018-Aug-03 12:41:27 */"
>     }
>   }
>   [2]=>
>   array(4) {
>     ["file"]=>
>     string(40) "/var/www/html/cqpweb/lib/library.inc.php"
>     ["line"]=>
>     int(410)
>     ["function"]=>
>     string(14) "do_mysql_query"
>     ["args"]=>
>     array(1) {
>       [0]=>
>       &string(210) "LOAD DATA LOCAL INFILE '/data/cqpweb/tmp/______tempfreq_spanish.tbl'
> INTO TABLE `__tempfreq_spanish` FIELDS ESCAPED BY ''
>             /* from User: datamaran | Function: corpus_make_freqtables() |
> 2018-Aug-03 12:41:27 */"
>     }
>   }
>   [3]=>
>   array(4) {
>     ["file"]=>
>     string(42) "/var/www/html/cqpweb/lib/freqtable.inc.php"
>     ["line"]=>
>     int(124)
>     ["function"]=>
>     string(21) "do_mysql_infile_query"
>     ["args"]=>
>     array(3) {
>       [0]=>
>       string(18) "__tempfreq_spanish"
>       [1]=>
>       string(43) "/data/cqpweb/tmp/______tempfreq_spanish.tbl"
>       [2]=>
>       bool(true)
>     }
>   }
>   [4]=>
>   array(4) {
>     ["file"]=>
>     string(42) "/var/www/html/cqpweb/lib/admin-lib.inc.php"
>     ["line"]=>
>     int(838)
>     ["function"]=>
>     string(22) "corpus_make_freqtables"
>     ["args"]=>
>     array(1) {
>       [0]=>
>       string(7) "spanish"
>     }
>   }
>   [5]=>
>   array(4) {
>     ["file"]=>
>     string(47) "/var/www/html/cqpweb/lib/metadata-admin.inc.php"
>     ["line"]=>
>     int(179)
>     ["function"]=>
>     string(40) "create_text_metadata_auto_freqlist_calls"
>     ["args"]=>
>     array(1) {
>       [0]=>
>       string(7) "spanish"
>     }
>   }
>   [6]=>
>   array(4) {
>     ["file"]=>
>     string(43) "/var/www/html/cqpweb/exe/metadata-admin.php"
>     ["line"]=>
>     int(3)
>     ["args"]=>
>     array(1) {
>       [0]=>
>       string(47) "/var/www/html/cqpweb/lib/metadata-admin.inc.php"
>     }
>     ["function"]=>
>     string(7) "require"
>   }
> }
>
>
>
> --
>
> José Manuel Martínez Martínez
>
> https://chozelinek.github.io
>
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb