<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /></head><body style='font-size: 10pt; font-family: Verdana,Geneva,sans-serif'>
<p>This is a known shortcoming of the database (MySQL/MariaDB). With the 'utf8' character set named in the error message, it can only handle characters in the Basic Multilingual Plane (BMP). Characters outside that plane, with code points >= 0x10000, cause this error. Given the provenance of your data I suspect emojis, but that area also contains additional Chinese characters, mathematical and alchemical symbols, and many historic scripts.</p>
<p>You can write a script that searches for all characters with such code points and replaces them with a placeholder string, so that you can carry on.</p>
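<p>A minimal sketch of such a script in Python, assuming it is run on the corpus text before indexing. The names find_non_bmp and replace_non_bmp and the "[EMOJI]" placeholder are illustrative only and not part of CQPweb. The first function reports the line number of each offending character, the second replaces them.</p>

```python
import re

# Matches any character outside the Basic Multilingual Plane
# (code points U+10000 through U+10FFFF).
NON_BMP = re.compile("[\U00010000-\U0010FFFF]")


def find_non_bmp(lines):
    """Yield (line_number, character, code_point_label) for each non-BMP character."""
    for line_number, line in enumerate(lines, start=1):
        for match in NON_BMP.finditer(line):
            char = match.group()
            yield line_number, char, "U+{:04X}".format(ord(char))


def replace_non_bmp(text, replacement="\uFFFD"):
    """Return text with every non-BMP character replaced by `replacement`."""
    return NON_BMP.sub(replacement, text)


if __name__ == "__main__":
    # Hypothetical sample data; in practice, iterate over the opened corpus file.
    sample = ["plain ASCII line", "Gr\u00fc\u00dfe \U0001F600 vom Hof"]
    for hit in find_non_bmp(sample):
        print(hit)
    print(replace_non_bmp(sample[1], "[EMOJI]"))
```

<p>Passing a file object opened with encoding="utf-8" to find_non_bmp works the same way as the list above, since it only needs an iterable of lines. The default replacement is U+FFFD, which lies inside the BMP.</p>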
<p><br /></p>
<p>--Jörg Knappen</p>
<p><br /></p>
<p id="reply-intro">On 2021-11-18 16:36, Thilo Wiertz wrote:</p>
<blockquote type="cite" style="padding: 0 0.4em; border-left: #1010ff 2px solid; margin: 0">
<div id="replybody1">
<div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">Dear all,
<div> </div>
<div>I fear I might be lost in encoding hell: I am trying to install a corpus on CQPweb, but get the following error message when creating word and annotation frequency tables (the last step of generating frequency lists):</div>
<div> </div>
<blockquote style="margin: 0 0 0 40px; border: none; padding: 0px;">
<div><span style="font-size: 14px;">An SQL query did not run successfully!</span></div>
<div><span style="font-size: 14px;"> </span></div>
<div><span style="font-size: 14px;">Original query: LOAD DATA LOCAL INFILE '/var/cqpdata/temp/______tempfreq_topagrar_v4.tbl' INTO TABLE `__tempfreq_topagrar_v4` FIELDS ESCAPED BY '' /* from User: thilo | Function: corpus_make_freqtables() | 2021-Nov-18 16:05 */</span></div>
<div><span style="font-size: 14px;"> </span></div>
<div><span style="font-size: 14px;">Error # 1300: Invalid utf8 character string: '</span><span style="font-size: 14px;">'</span></div>
<div> </div>
</blockquote>
The corpus contains texts parsed from a web blog. I write an XML file using Python lxml and run the result through TreeTagger before installing it on CQPweb. It sounds like an encoding problem, although I am doing my best to remove anything potentially broken in Python (e.g. running all strings through bytes(string, 'utf-8').decode('utf-8', 'ignore')).
<div><br />
<div>Checking for invalid UTF-8 characters in the input XML file using grep (grep -axv '.*' file.txt) yields no results. Converting the file with iconv -f utf-8 -t utf-8 -c file.xml > newfile.xml makes no difference.
<div> </div>
<div>Any suggestions on how to solve or narrow down the problem (e.g. finding the line or text ID causing the issue)?
<div> </div>
<div>Thanks a lot!</div>
<div>Thilo</div>
</div>
<div> </div>
<div>Server Setup:</div>
<div>OS: Ubuntu 18.04</div>
<div>DB: MariaDB 10.1</div>
<div>CQPweb v3.2.43</div>
<div>PHP: 7.2</div>
<div> </div>
<div>PHP debugging backtrace:</div>
<blockquote style="margin: 0 0 0 40px; border: none; padding: 0px;">
<div>
<div><span style="font-size: 14px;">array(6) {</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> [1]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> array(4) {</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> ["file"]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> string(43) "/var/www/html/diskurs/lib/exiterror-lib.php"</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> ["line"]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> int(367)</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> ["function"]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> string(9) "exiterror"</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> ["args"]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> array(3) {</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> [0]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> array(3) {</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> [0]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> string(38) "An SQL query did not run successfully!"</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> [1]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> string(232) "Original query: </span></div>
</div>
<div>
<div><span style="font-size: 14px;"> </span></div>
</div>
<div>
<div><span style="font-size: 14px;">LOAD DATA LOCAL INFILE '/var/cqpdata/temp/______tempfreq_topagrar_v4.tbl' INTO TABLE `__tempfreq_topagrar_v4` FIELDS ESCAPED BY '' </span></div>
</div>
<div>
<div><span style="font-size: 14px;"><span class="v1Apple-tab-span" style="white-space: pre;"> </span>/* from User: thilo | Function: corpus_make_freqtables() | 2021-Nov-18 16:05 */</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> </span></div>
</div>
<div>
<div><span style="font-size: 14px;">"</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> [2]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> string(48) "Error # 1300: Invalid utf8 character string: '' "</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> }</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> [1]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> NULL</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> [2]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> NULL</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> }</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> }</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> [2]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> array(4) {</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> ["file"]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> string(37) "/var/www/html/diskurs/lib/sql-lib.php"</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> ["line"]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> int(216)</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> ["function"]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> string(18) "exiterror_sqlquery"</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> ["args"]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> array(3) {</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> [0]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> int(1300)</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> [1]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> string(33) "Invalid utf8 character string: ''"</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> [2]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> string(212) "LOAD DATA LOCAL INFILE '/var/cqpdata/temp/______tempfreq_topagrar_v4.tbl' INTO TABLE `__tempfreq_topagrar_v4` FIELDS ESCAPED BY '' </span></div>
</div>
<div>
<div><span style="font-size: 14px;"><span class="v1Apple-tab-span" style="white-space: pre;"> </span>/* from User: thilo | Function: corpus_make_freqtables() | 2021-Nov-18 16:05 */"</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> }</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> }</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> [3]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> array(4) {</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> ["file"]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> string(37) "/var/www/html/diskurs/lib/sql-lib.php"</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> ["line"]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> int(350)</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> ["function"]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> string(12) "do_sql_query"</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> ["args"]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> array(1) {</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> [0]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> string(212) "LOAD DATA LOCAL INFILE '/var/cqpdata/temp/______tempfreq_topagrar_v4.tbl' INTO TABLE `__tempfreq_topagrar_v4` FIELDS ESCAPED BY '' </span></div>
</div>
<div>
<div><span style="font-size: 14px;"><span class="v1Apple-tab-span" style="white-space: pre;"> </span>/* from User: thilo | Function: corpus_make_freqtables() | 2021-Nov-18 16:05 */"</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> }</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> }</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> [4]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> array(4) {</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> ["file"]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> string(43) "/var/www/html/diskurs/lib/freqtable-lib.php"</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> ["line"]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> int(127)</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> ["function"]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> string(19) "do_sql_infile_query"</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> ["args"]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> array(3) {</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> [0]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> string(22) "__tempfreq_topagrar_v4"</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> [1]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> string(48) "/var/cqpdata/temp/______tempfreq_topagrar_v4.tbl"</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> [2]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> bool(true)</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> }</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> }</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> [5]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> array(4) {</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> ["file"]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> string(37) "/var/www/html/diskurs/lib/execute.php"</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> ["line"]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> int(196)</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> ["function"]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> string(22) "corpus_make_freqtables"</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> ["args"]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> array(1) {</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> [0]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> string(11) "topagrar_v4"</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> }</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> }</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> [6]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> array(4) {</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> ["file"]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> string(37) "/var/www/html/diskurs/exe/execute.php"</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> ["line"]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> int(1)</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> ["args"]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> array(1) {</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> [0]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> string(37) "/var/www/html/diskurs/lib/execute.php"</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> }</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> ["function"]=></span></div>
</div>
<div>
<div><span style="font-size: 14px;"> string(7) "require"</span></div>
</div>
<div>
<div><span style="font-size: 14px;"> }</span></div>
</div>
<div>
<div><span style="font-size: 14px;">}</span></div>
</div>
</blockquote>
</div>
</div>
</div>
</div>
<br />
<div class="pre" style="margin: 0; padding: 0; font-family: monospace">_______________________________________________<br />CWB mailing list<br /><a href="mailto:CWB@sslmit.unibo.it">CWB@sslmit.unibo.it</a><br /><a href="http://liste.sslmit.unibo.it/mailman/listinfo/cwb" target="_blank" rel="noopener noreferrer">http://liste.sslmit.unibo.it/mailman/listinfo/cwb</a></div>
</blockquote>
<p><br /></p>
</body></html>