<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">Thanks Jörg, this saved my week! <div class="">Thanks Stephanie, although I did not need it for this corpus, that patch is likely to come in very handy soon!<div class=""><br class=""></div><div class="">If anyone is interested in the solution, a simple regex does the trick:</div><div class=""><br class=""></div><div class="">import re<br class="">re_pattern = re.compile(u'[^\u0000-\uFFFF]', re.UNICODE)<br class="">return re_pattern.sub(u'\uFFFD', str_var_with_wicked_chars)</div><div class=""><br class=""></div><div class="">(This replaces every Unicode code point outside the range 0000-FFFF with the replacement character U+FFFD, <span style="caret-color: rgb(32, 45, 74); color: rgb(32, 45, 74); font-family: 'helvetica neue', Arial, Helvetica, Geneva, sans-serif; font-size: 16.639999389648438px; text-align: center;" class="">�)</span><div class=""><div><br class=""></div><div>Best,</div><div>Thilo</div><div><br class=""><blockquote type="cite" class=""><div class="">On 18 Nov 2021, at 20:04, Stefan Evert &lt;<a href="mailto:stefanML@collocations.de" class="">stefanML@collocations.de</a>&gt; wrote:</div><br class="Apple-interchange-newline"><div class=""><div class="">The CQPweb v3.2 server running on my laptop has the following patch around line #347 of file lib/sql-lib.php, in function do_sql_infile_query():<br class=""><br class=""><span class="Apple-tab-span" style="white-space:pre">        </span><span class="Apple-tab-span" style="white-space:pre">        </span><br class=""><span class="Apple-tab-span" style="white-space:pre">        </span><span class="Apple-tab-span" style="white-space:pre">        </span>$sql = "{$Config->mysql_LOAD_DATA_INFILE_command} '$filepath' INTO TABLE `$table`";<br class=""><br class=""><span class="Apple-tab-span" style="white-space:pre">        </span><span class="Apple-tab-span" style="white-space:pre">        </span>$sql .= " 
CHARACTER SET utf8mb4"; /* PATCH to handle characters outside BMP */<br class=""><span class="Apple-tab-span" style="white-space:pre">        </span><span class="Apple-tab-span" style="white-space:pre">        </span>if ($no_escapes)<br class=""><span class="Apple-tab-span" style="white-space:pre">        </span><span class="Apple-tab-span" style="white-space:pre">        </span><span class="Apple-tab-span" style="white-space:pre">        </span>$sql .= ' FIELDS ESCAPED BY \'\'';<br class=""><span class="Apple-tab-span" style="white-space:pre">        </span><span class="Apple-tab-span" style="white-space:pre">        </span><br class=""><span class="Apple-tab-span" style="white-space:pre">        </span><span class="Apple-tab-span" style="white-space:pre">        </span>return do_sql_query($sql);<br class=""><br class="">This has helped me get Twitter corpora into CQPweb, but I don't know if it is sufficient for your data.<br class=""><br class="">Best,<br class="">Stephanie<br class=""><br class=""><br class=""><blockquote type="cite" class="">On 18 Nov 2021, at 17:47, Jörg Knappen &lt;<a href="mailto:j.knappen@mx.uni-saarland.de" class="">j.knappen@mx.uni-saarland.de</a>&gt; wrote:<br class=""><br class="">This is a known shortcoming of the database (MySQL/MariaDB). It can only handle characters in the Basic Multilingual Plane (BMP). I suspect emojis because of the provenance of your data, but that area also contains additional Chinese characters, mathematical and alchemical symbols, and many historic scripts. All of these lie outside the BMP, with code points &gt;= 0x10000, so they cause this error.<br class=""><br class="">You can write a script that searches for all characters with such code points and replaces them with some replacement string so that you can carry on.<br class=""><br class=""><br class=""></blockquote><br class=""></div></div></blockquote></div><br class=""></div></div></div></body></html>