[CWB] Different character encoding problems

Josep M. Fontana josepm.fontana at upf.edu
Sun Aug 19 04:33:08 CEST 2012


OK. Thanks. Yes, all the new corpora we are creating are UTF-8 encoded, 
but this is one we had a lot of problems re-encoding with 'iconv', so 
we left it as it was.
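
For reference, the re-encoding step itself can be sketched in a few lines 
of Python as an alternative to 'iconv'. This is only a minimal sketch: it 
assumes the whole source file really is ISO-8859-1 (Latin-1), the function 
name and file names are made up for illustration, and the corpus would 
presumably still have to be re-indexed in CWB afterwards.

```python
from pathlib import Path


def reencode_latin1_to_utf8(src: str, dst: str) -> None:
    """Copy src to dst, converting the bytes from ISO-8859-1 to UTF-8.

    Assumption: src is Latin-1 throughout. Every byte value is valid
    Latin-1, so decoding never raises an error; a file in some other
    encoding would be converted silently and come out wrong.
    """
    raw = Path(src).read_bytes()          # work on bytes so line endings stay untouched
    text = raw.decode("latin-1")          # bytes -> Unicode text
    Path(dst).write_bytes(text.encode("utf-8"))


# Hypothetical usage (file names are placeholders):
# reencode_latin1_to_utf8("corpus_latin1.vrt", "corpus_utf8.vrt")
```

Because Latin-1 decoding cannot fail, it is worth spot-checking a few 
accented words in the output before rebuilding the corpus.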

I use iTerm 2 (highly recommended: http://www.iterm2.com/) instead of 
the default Terminal, but I'll figure out how to change its settings so 
that it works with Latin-1.
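
The garbled symbols are what one would expect when Latin-1 bytes reach a 
terminal that decodes its input as UTF-8. The small Python sketch below 
illustrates the effect; it is not CQP-specific, and the sample word is 
made up for illustration.

```python
# Hypothetical sample word; any accented Latin-1 text behaves the same way.
word = "se\u00f1or"  # "señor"

# The bytes as they would be stored in a Latin-1 encoded corpus.
latin1_bytes = word.encode("latin-1")

# Roughly what a UTF-8 terminal does with those bytes: the lone 0xF1 byte
# is not valid UTF-8, so it is shown as a replacement character.
seen_in_utf8_terminal = latin1_bytes.decode("utf-8", errors="replace")

# What a Latin-1 terminal does: every byte maps directly to one character.
seen_in_latin1_terminal = latin1_bytes.decode("latin-1")

print(repr(seen_in_utf8_terminal))
print(repr(seen_in_latin1_terminal))
```

The stored bytes are the same in both cases; only the terminal's 
interpretation of them differs.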

JM
>> Hi, after managing to install the latest version of CWB in order to solve the visualization problems I had with UTF-8 encoded corpora, I just stumbled upon another encoding "problem".
>>
>> I installed an old corpus that was encoded in Latin-1, and when I do a search I see the typical symbols that appear in place of accented characters when the character encoding parameters are not set properly. I have tried to find information on this in the CQP manual but have not been able to find anything relevant. The question is: is there any command or parameter in CQP that will allow me to display the characters properly?
> CQP doesn't automatically re-encode character sets (yet -- Andrew and I have different opinions whether it should).
>
> If you want to query an old Latin1-encoded corpus, you have to open a Terminal window with Latin1 character set and run CQP there.  This should also enable you to enter accented characters in CQP queries.
>
> In case you've never done this before, in the Terminal app you need to do the following:
>
>    - open preferences (Cmd-,)
>    - Settings tab
>    - click "+" to create a new Terminal preset (I recommend naming the new preset "Latin 1" or so)
>    - for this preset, go to "Advanced" sub-tab, then set Character Encoding to "Western (ISO Latin 1)"
>    - I prefer to change background and/or font colour slightly for this preset, so it's easier to see whether you've opened a Unicode or a Latin1 terminal
>    - now you can select "New Window | Latin 1" (or whatever you called the new preset) to open a Terminal session for Latin1-encoded corpora
>
> NB: these instructions are for Terminal.app on OS X 10.7 "Lion". If I recall correctly, the menu and preferences structure was somewhat different on Leopard and Snow Leopard, but you should be able to locate the equivalent options.
>
>
> Of course, we strongly recommend encoding all new corpora in UTF-8 once you've switched to CWB 3.4 / 3.5.
>
> Best,
> Stefan
