[CWB] Accents and codification

Stefan Evert stefanML at collocations.de
Wed Dec 16 10:01:58 CET 2015


> On 16 Dec 2015, at 01:28, Daniel Renau <alphak87 at gmail.com> wrote:
> 
> Is there a way to solve this problem with accents and apostrophes?
> 
> Pic related: http://i.imgur.com/OAVDzuG.png
> 
> In cqp on the command line, the accents display correctly (ssh connection or local terminal).
> In CQPweb the accents are not displayed correctly.
> The apostrophe -> ' <- isn't shown properly anywhere; it shows up as <80><99>.
> In the UTF-8 table it is named "RIGHT SINGLE QUOTATION MARK".

Looking closely at the screenshot, it appears that your corpus is mostly encoded in UTF-8, but CQPweb and/or your browser is set to interpret it as latin1. If you change these settings to match your actual corpus encoding, the text should display fine.
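
As a rough illustration of this kind of mismatch, here is a minimal Python sketch. It is not how CQPweb or the browser work internally, and the sample word is a made-up example rather than text from the corpus; it only shows what UTF-8 bytes look like when read as latin1.

```python
# Hypothetical sample word, not taken from the corpus in question.
word = "canción"

# The bytes as they would be stored in a UTF-8 encoded corpus.
utf8_bytes = word.encode("utf-8")

# What a reader sees if those bytes are interpreted as latin1:
# the accented letter turns into two unrelated characters.
misread = utf8_bytes.decode("latin-1")
print(misread)

# Decoding with the matching encoding gives the original text back.
restored = utf8_bytes.decode("utf-8")
print(restored)
```

If the garbled forms in the web interface follow this pattern (one accented letter replaced by two characters), an encoding setting that does not match the data is the likely cause.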

The encoding of RIGHT SINGLE QUOTATION MARK might actually be broken in your input data, since it appears as a sequence of three bytes in your VM terminal, and especially given the absolute mess of characters (and control codes?) showing up via ssh.

Did "cwb-encode -c utf8" actually accept this input as well-formed UTF-8?
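
Independently of cwb-encode, the input data can be checked for well-formed UTF-8 with a few lines of Python. This is only a sketch: the helper name `first_invalid_utf8` and the two sample byte strings are assumptions made for illustration, and it says nothing about what cwb-encode itself checks.

```python
def first_invalid_utf8(data: bytes):
    """Return the byte offset of the first malformed UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


# U+2019 RIGHT SINGLE QUOTATION MARK is the three bytes E2 80 99 in UTF-8.
good = "it\u2019s".encode("utf-8")

# Hypothetical damaged input: the lead byte E2 is missing, so only the
# continuation bytes 80 99 remain.
bad = b"it\x80\x99s"

print(first_invalid_utf8(good))
print(first_invalid_utf8(bad))
```

To check a real file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes to the helper; a non-None result points to the first position where the data is not valid UTF-8.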

Best,
Stefan



