[CWB] Follow up to CWB Digest, Vol 139, Issue 14: Error #1300 generating word frequency lists

Hardie, Andrew a.hardie at lancaster.ac.uk
Fri Dec 21 09:58:02 CET 2018


It took me a while to find the thread in question, since the archive pages can't be looked up by digest issue number. The thread is the first one here:

http://liste.sslmit.unibo.it/pipermail/cwb/2018-August/thread.html

I am planning the change to utf8mb4 for v3.3.0. I hope this will follow v3.2.32, which is the next upcoming version. 3.2.32 will be a feature upgrade (the time needed to write the new features is why there has not been a release for so very long) that also partially implements some of the big restructuring that I've known for ages is needed; it can be considered the release candidate for 3.3.

My hope is that no more than one or two bug-fix versions will be needed before I can branch 3.2 off and move to 3.3, which will do nothing except the mb4 changeover.

In the meantime, Gerhard, the PHP CLI script below will scrub 4-byte UTF-8 characters out of a file, replacing each one with U+FFFD (the replacement character, usually rendered as a question mark in a little box). Call it with the input file as its first argument.

best

Andrew.

<?php
// Replace every 4-byte UTF-8 sequence (i.e. any character outside the BMP,
// which MySQL's 3-byte "utf8" charset cannot store) with U+FFFD.
if (empty($argv[1])) exit("Please specify an input file.\n");
$src = fopen($argv[1], 'r');
$dst = fopen("{$argv[1]}.mod", 'w');
// A 4-byte sequence: lead byte 0xF0-0xF4 followed by three continuation bytes 0x80-0xBF.
while (false !== ($line = fgets($src)))
    fputs($dst, preg_replace("/[\xf0-\xf4][\x80-\xbf]{3}/", "\xef\xbf\xbd", $line));
fclose($src);
fclose($dst);
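For example (the file names here are only placeholders, not anything from the thread), save the script as scrub4byte.php and run:

    php scrub4byte.php mycorpus.vrt

The cleaned copy is written alongside the original as mycorpus.vrt.mod, and that file can then be used for indexing in place of the original.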

From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of Gerhard Rampl
Sent: 17 December 2018 11:14
To: cwb at sslmit.unibo.it
Subject: [CWB] Follow up to CWB Digest, Vol 139, Issue 14: Error #1300 generating word frequency lists


Hi Andrew and everybody,
this is a follow-up question to CWB Digest, Vol 139, Issue 14. I am running into the same Error #1300 when trying to build the frequency list of a rather large corpus of tweets in CQPweb (corpus indexed previously with CWB; using CQPweb v3.2.31). The problem also seems to be characters that don't fit into MySQL's UTF-8 encoding (which apparently covers only a subset of full UTF-8).
Since I am not a programmer, I would rather not try the solution proposed in the mentioned CWB Digest (it seems rather delicate, and Andrew wrote that he would fix the problem in one of the next releases anyway). So my question is: in the meantime, is there a way to identify (and replace) the characters responsible for Error #1300 in the .vrt files?
Thanks for any help,
gerhard
--
University of Innsbruck
Institute for Languages and Literatures: Linguistics
Dr. Gerhard Rampl

