[CWB] Different character encoding problems

Stefan Evert stefanML at collocations.de
Sat Aug 18 23:20:18 CEST 2012


> Hi, after managing to install the latest version of CWB in order to solve the display problems I had with UTF-8 encoded corpora, I have just stumbled upon another encoding "problem".
> 
> I installed an old corpus that was encoded in Latin-1, and when I do a search I see the typical symbols that appear in place of accented characters when the character encoding parameters are not set properly. I have looked in the CQP manual but have not been able to find anything relevant. The question is: is there any command or parameter in CQP that will allow me to display the characters properly?

CQP doesn't automatically re-encode character sets (yet -- Andrew and I differ on whether it should).
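To illustrate why the garbage symbols appear, here is a minimal Python sketch (not part of CWB; the helper name `render` and the sample word are made up for illustration). It assumes the terminal simply decodes whatever bytes it receives using its own character set, which is roughly what happens when Latin1 corpus data is shown unchanged in a UTF-8 terminal:

```python
# "señor" stored in Latin1: the ñ is the single byte 0xF1.
# (Written with an escape so the example does not depend on the file's own encoding.)
latin1_bytes = "se\u00f1or".encode("latin-1")


def render(data: bytes, terminal_encoding: str) -> str:
    """Decode bytes the way a terminal using `terminal_encoding` would.

    Byte sequences that are invalid in that encoding become U+FFFD,
    the replacement character many terminals show as a question mark or box.
    """
    return data.decode(terminal_encoding, errors="replace")


# UTF-8 terminal: 0xF1 is not a valid UTF-8 sequence here, so the ñ is lost.
print(render(latin1_bytes, "utf-8"))

# Latin1 terminal: every byte maps directly to the intended character.
print(render(latin1_bytes, "latin-1"))
```

The data is identical in both cases; only the interpretation on the display side changes, which is why switching the terminal's character set is enough.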

If you want to query an old Latin1-encoded corpus, you have to open a Terminal window set to the Latin1 character set and run CQP there.  This should also enable you to enter accented characters in CQP queries.

In case you've never done this before, in the Terminal app you need to do the following:

  - open preferences (Cmd-,)
  - Settings tab
  - click "+" to create a new Terminal preset (I recommend naming the new preset "Latin 1" or so)
  - for this preset, go to "Advanced" sub-tab, then set Character Encoding to "Western (ISO Latin 1)"
  - I prefer to change background and/or font colour slightly for this preset, so it's easier to see whether you've opened a Unicode or a Latin1 terminal
  - now you can select "New Window | Latin 1" (or whatever you called the new preset) to open a Terminal session for Latin1-encoded corpora

NB: these instructions are for Terminal.app on OS X 10.7 "Lion". If I recall correctly, the menu and preferences structure was somewhat different on Leopard and Snow Leopard, but you should be able to locate the equivalent options.


Of course, we strongly recommend encoding all new corpora in UTF-8 once you've switched to CWB 3.4 / 3.5.
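If you still have the original source text of an old corpus, converting it to UTF-8 before re-indexing is straightforward. The following is only a rough Python sketch, not a CWB tool: the function name `latin1_to_utf8` and the file names are hypothetical, and it assumes the input really is Latin1 throughout. The converted file would still have to be indexed again with the usual CWB tools, which this sketch does not cover.

```python
from pathlib import Path


def latin1_to_utf8(src: Path, dst: Path) -> None:
    """Re-encode a Latin1 text file as UTF-8.

    Reads and writes raw bytes so that line endings are left untouched.
    Every byte value is valid Latin1, so decoding never fails; this also
    means a file that is NOT really Latin1 will be converted silently
    into wrong characters, so check the result before indexing it.
    """
    text = src.read_bytes().decode("latin-1")
    dst.write_bytes(text.encode("utf-8"))


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    source = Path("oldcorpus.latin1.txt")
    if source.exists():
        latin1_to_utf8(source, Path("oldcorpus.utf8.txt"))
```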

Best,
Stefan


