[CWB] Character encoding problems when transferring corpus

Hardie, Andrew a.hardie at lancaster.ac.uk
Sun Aug 12 19:41:11 CEST 2012


Ah, this begins to make sense.

"set Paging no" turns off the use of less to view results. When Paging is off, results are piped directly to the terminal. When Paging is on, a less process is started up and the results are piped to that.

The fact that the bytes are still wrong without the pager makes things more transparent, and I think I know what is wrong. Because you used the %d flag (in %cd) with "count", accent folding is applied; but it appears that ISO-8859-1 accent folding is being used, byte by byte. This breaks the UTF-8 sequences: the first byte of 'í' (0xC3, which is 'Ã' in ISO-8859-1) is folded to a plain "a", and the second byte (0xAD) is left stranded. It looks as though the count command is using some unauthorised character handling instead of calling the proper CL string functions.
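
To illustrate the suspected mechanism, here is a minimal Python sketch. It is not CWB's code (CWB is written in C, and I have not located the relevant routine yet); the function names are made up for the illustration, and the folding rule is an assumption that simply treats every byte as an ISO-8859-1 character. It reproduces the byte patterns in your output:

```python
import unicodedata


def fold_latin1_byte(b: int) -> int:
    """Case- and accent-fold one byte, treating it as an ISO-8859-1 character.

    Hypothetical helper for illustration only; not a CWB function.
    """
    ch = bytes([b]).decode("latin-1").lower()
    # NFD splits an accented letter into base letter + combining mark.
    base = unicodedata.normalize("NFD", ch)[0]
    # Keep the ASCII base letter if there is one; otherwise keep the
    # (lowercased) Latin-1 character unchanged.
    return ord(base) if ord(base) < 128 else ord(ch)


def fold_utf8_bytes_as_latin1(data: bytes) -> bytes:
    """The suspected bug: Latin-1 folding applied to each byte of UTF-8 data."""
    return bytes(fold_latin1_byte(b) for b in data)


def fold_unicode(text: str) -> str:
    """Folding done on decoded characters, which leaves valid text."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


if __name__ == "__main__":
    for word in ("pestífera", "il·lustríssim", "famós"):
        broken = fold_utf8_bytes_as_latin1(word.encode("utf-8"))
        print(word, "->", broken, "| character-level folding:", fold_unicode(word))
        try:
            broken.decode("utf-8")
        except UnicodeDecodeError:
            print("  byte-wise result is no longer valid UTF-8")
```

The lead byte of each two-byte sequence (0xC3 or 0xC2) becomes "a", and the continuation byte (0xAD, 0xB3, 0xB7) survives on its own, which is what less shows as <AD>, <B3> and <B7> and what the terminal shows as "?".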

What's puzzling, however, is that if this were a bug, it should have been present on Linux as well as OS X. Did you actually use the "count" command on Linux? If so, and it worked OK, can you check which versions of CWB you have on both platforms (cqp -v)? If the version you have installed on OS X is an older one, then the solution is to upgrade to 3.4.

*If*, on the other hand, this really is a bug in the latest version, then I need to find the relevant count code and fix it. I can't see where it is at a glance, so a hunt will be needed. (Stefan, if you happen to be reading this, and can say quickly where I should look, that would be a help!)

Best

Andrew.
 

> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it 
> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Josep M. Fontana
> Sent: 12 August 2012 18:29
> To: Open source development of the Corpus WorkBench
> Subject: Re: [CWB] Character encoding problems when transferring corpus
> 
> Hi Andrew,
> 
> Thanks again for the prompt response.
> 
> Something changes but the basic problem remains. The results 
> after "set Paging no;" look like this:
> 
> 2    pesta?feres  [#2-#3]
> 1    pesta?fer  [#0]
> 1    pesta?fera  [#1]
> 1    pesta?ffera  [#4]
> 
> as opposed to this (without "set Paging no;"):
> 
> 2       pesta<AD>feres  [#2-#3]
> 1       pesta<AD>fer  [#0]
> 1       pesta<AD>fera  [#1]
> 1       pesta<AD>ffera  [#4]
> 
> These should be all variants of "pestífera" where 'í' is not 
> displayed properly.
> 
> One thing that I see, though, is that the results of the 
> query vary considerably after issuing the "set Paging no;" 
> command. I get a drastically reduced list of results after 
> issuing this command. What does this do exactly?
> 
> The other thing that I had not realized before I sent my 
> initial message and that I think might be important to 
> identify the problem is that the problems with the display of 
> accented characters occur only when viewing the results of 
> 'count'. That is, the problems only arise with:
> 
>  > count Last by word %cd on match;
> 
> If I do a regular query such as:
> 
>  > [(pos="A.*")&(word="pest.*")];
> 
> all the accented characters are displayed properly whether I 
> do "set Paging no;" or not.
> 
> 
> JM
> 
> 
> 
> 
> > This looks like an issue with less. It seems to be "eating" the
> > first half of the utf8 sequence (converting it to an accentless
> > "a") leaving the second half to appear as a bare binary character
> > (thus the hex codes in angle brackets).
> >
> > You can check this by turning off the use of a pager for query
> > output:
> >
> > set Paging no;
> >
> > If queries print OK with this setting, then it is definitely an
> > issue with less. If not, then the problem is somewhere else.
> >
> > Best
> >
> > Andrew.
> >
> >> -----Original Message-----
> >> From: cwb-bounces at sslmit.unibo.it
> >> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Josep M. Fontana
> >> Sent: 12 August 2012 17:11
> >> To: Open source development of the Corpus WorkBench
> >> Subject: [CWB] Character encoding problems when transferring corpus
> >>
> >> Hi,
> >>
> >> I'm not sure this is really a CWB problem (in fact I'm pretty
> >> sure it is not) but since there might be other users that have
> >> CWB running on a Mac perhaps I can get some help in this list.
> >>
> >> I installed CWB on my Mac and then transferred all the
> >> relevant directories and registry files from the CWB
> >> installation we have running on a LAMP server. Everything
> >> seems to be working fine except that the results of a query
> >> on the terminal come out like this (the words as they should
> >> be displayed are within parentheses)
> >>
> >>
> >> 60      ila<B7>lustra<AD>ssim  [#41507-#41566]  (--> 'il·lustríssim')
> >> 58      fama<B3>s  [#24851-#24908] ( --> 'famós')
> >>
> >> The corpus is encoded as UTF-8 and my terminal (iTerm) is set
> >> up properly to view UTF-8 encoded texts. I have no problems
> >> viewing other UTF-8 encoded texts on this computer and I don't
> >> have these problems when accessing the corpus remotely.
> >>
> >> Why would UTF-8 encoded texts from the transferred corpus not
> >> be displayed properly? Is there any way to fix this? Any help
> >> would be greatly appreciated.
> >>
> >> Josep M.
> >> _______________________________________________
> >> CWB mailing list
> >> CWB at sslmit.unibo.it
> >> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
> >>

