[CWB] Character encoding problems when transferring corpus

Sun Aug 12 19:28:37 CEST 2012

Hi Andrew,

Thanks again for the prompt response.

Something changes, but the basic problem remains. The results after "set 
Paging no;" look like this:

2    pesta?feres  [#2-#3]
1    pesta?fer  [#0]
1    pesta?fera  [#1]
1    pesta?ffera  [#4]

as opposed to this (without "set Paging no;"):

2       pesta<AD>feres  [#2-#3]
1       pesta<AD>fer  [#0]
1       pesta<AD>fera  [#1]
1       pesta<AD>ffera  [#4]

These should all be variants of "pestífera"; the 'í' is not displayed 
properly.
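
As a side check, the hex codes in angle brackets can be compared with the UTF-8 bytes of the characters that should appear. The following is only a minimal sketch in Python, not anything provided by CWB; the helper name utf8_hex is made up for illustration.

```python
# Illustrative sketch: list the UTF-8 bytes of the expected characters so
# they can be compared with the hex codes in the garbled output.
# The helper name utf8_hex is hypothetical, not part of CWB or less.

def utf8_hex(ch):
    """Return the UTF-8 bytes of a character as uppercase hex strings."""
    return [f"{b:02X}" for b in ch.encode("utf-8")]

# i-acute, middle dot, o-acute (written as escapes to keep the source ASCII)
for ch in ("\u00ed", "\u00b7", "\u00f3"):
    print(f"U+{ord(ch):04X}", utf8_hex(ch))

# U+00ED ['C3', 'AD']
# U+00B7 ['C2', 'B7']
# U+00F3 ['C3', 'B3']
```

In each case the second byte (AD, B7, B3) is the code shown in angle brackets, which suggests that the corpus data itself is valid UTF-8 and that only the first byte of each two-byte sequence is being shown as "a".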

One thing that I see, though, is that the results of the query vary 
considerably after issuing the "set Paging no;" command. I get a 
drastically reduced list of results after issuing this command. What 
does this command do exactly?

The other thing that I had not realized before I sent my initial message, 
and that I think might be important for identifying the problem, is that 
the problems with the display of accented characters occur only when 
viewing the results of 'count'. That is, the problems arise only with:

 > count Last by word %cd on match;

If I do a regular query such as:

 > [(pos="A.*")&(word="pest.*")];

all the accented characters are displayed properly whether I do "set 
Paging no;" or not.

JM

> This looks like an issue with less. It seems to be "eating" the first half of the utf8 sequence (converting it to an accentless "a") leaving the second half to appear as a bare binary character (thus the hex codes in angle brackets).
>
> You can check this by turning off the use of a pager for query output:
>
> set Paging no;
>
> If queries print OK with this setting, then it is definitely an issue with less. If not, then the problem is somewhere else.
>
> Best
>
> Andrew.
>
>> -----Original Message-----
>> From: cwb-bounces at sslmit.unibo.it
>> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Josep M. Fontana
>> Sent: 12 August 2012 17:11
>> To: Open source development of the Corpus WorkBench
>> Subject: [CWB] Character encoding problems when transferring corpus
>>
>> Hi,
>>
>> I'm not sure this is really a CWB problem (in fact I'm pretty
>> sure it is not) but since there might be other users that have
>> CWB running on a Mac perhaps I can get some help in this list.
>>
>> I installed CWB on my Mac and then transferred all the
>> relevant directories and registry files from the CWB
>> installation we have running on a LAMP server. Everything
>> seems to be working fine except that the results of a query
>> on the terminal come out like this (the words as they should
>> be displayed are within parentheses)
>>
>>
>> 60      ila<B7>lustra<AD>ssim  [#41507-#41566]  (--> 'il·lustríssim')
>> 58      fama<B3>s  [#24851-#24908] ( --> 'famós')
>>
>> The corpus is encoded as UTF-8 and my terminal (iTerm) is set
>> up properly to view UTF-8 encoded texts. I have no problems
>> viewing other UTF-8 encoded texts on this computer and I don't
>> have these problems when accessing the corpus remotely.
>>
>> Why would UTF-8 encoded texts from the transferred corpus not
>> be displayed properly? Is there any way to fix this? Any help
>> would be greatly appreciated.
>>
>> Josep M.
>> _______________________________________________
>> CWB mailing list
>> CWB at sslmit.unibo.it
>> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
>>