[CWB] Character encoding problems when transferring corpus

Hardie, Andrew a.hardie at lancaster.ac.uk
Sun Aug 12 20:08:47 CEST 2012


Version 3.0 doesn't have any Unicode support at all, so that explains that! Yes, you should upgrade to 3.4. The easiest way is to check out the code via Subversion and then build it by following the instructions in the INSTALL file.

Instructions are here: http://cwb.sourceforge.net/developers.php#svn

You will probably need to set the CWB_LIVE_DANGEROUSLY environment variable to overwrite your existing installation.

I trust that the character set issue will disappear then, but let us know if not.
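
For anyone following the byte-level explanation in the quoted message below, here is a minimal Python sketch of what appears to happen when ISO-8859-1 case/accent folding is applied byte by byte to UTF-8 text. The folding function is a hypothetical stand-in built on Python's unicodedata module, not CWB's actual code or folding table, and show_like_less only imitates the <XX> notation seen in the pager output.

```python
import unicodedata


def latin1_fold_byte(b: int) -> int:
    """Fold one byte as if it were an ISO-8859-1 character.

    Hypothetical stand-in for %cd folding; not CWB's real table.
    """
    ch = bytes([b]).decode("latin-1")
    # Decompose, drop combining accents, then lowercase.
    decomposed = unicodedata.normalize("NFD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    folded = stripped.lower()
    try:
        out = folded.encode("latin-1")
    except UnicodeEncodeError:
        return b  # leave the byte alone if folding left Latin-1
    return out[0] if len(out) == 1 else b


def fold_as_latin1(data: bytes) -> bytes:
    """Apply the Latin-1 folding to every byte, ignoring UTF-8 structure."""
    return bytes(latin1_fold_byte(b) for b in data)


def show_like_less(data: bytes) -> str:
    """Show printable ASCII as-is and every other byte as <XX>."""
    return "".join(chr(b) if 0x20 <= b < 0x7F else f"<{b:02X}>" for b in data)


utf8 = "pestífera".encode("utf-8")   # 'í' is the two bytes C3 AD
broken = fold_as_latin1(utf8)        # C3 is 'Ã' in Latin-1 and folds to 'a'
print(show_like_less(broken))        # pesta<AD>fera
print(broken.decode("utf-8", errors="replace"))
```

Under these assumptions the sketch yields the same pesta<AD>fera form as the count output in the thread. The lead byte of each two-byte sequence is folded to "a", and the orphaned second byte is no longer valid UTF-8, so decoding puts a replacement character where the terminal printed "?".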

As for some results disappearing depending on whether the pager is on or off... that is more worrying. Whether or not less is used should indeed make no difference to the number of results. Can you confirm that the issue persists on v3.4 and, if it does, send a full example of the two outputs (full and reduced)? Thanks.

Best

Andrew. 

> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it 
> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Josep M. Fontana
> Sent: 12 August 2012 19:00
> To: Open source development of the Corpus WorkBench
> Subject: Re: [CWB] Character encoding problems when 
> transferring corpus
> 
> 
> > Ah, this begins to make sense.
> Glad to hear that!
> > "set Paging no" turns off the use of less to view results.
> > When Paging is off, results are piped directly to the
> > terminal. When Paging is on, a less process is started up and
> > the results are piped to that.
> Mmm, that's what I thought it did but what I would have
> expected is that the only change after issuing this command
> would be that all the results would appear in a single
> screen. But the change goes beyond that. Not only are the
> results dumped into a single screen or page but they are also
> considerably reduced. After "set Paging no;" the list of
> results is considerably shorter than before issuing this
> command. Is this supposed to happen? I thought less only
> affected how the results are displayed, not what results you get.
> > The fact that the bytes are still wrong without the pager
> > makes things more transparent, and I think I know what is
> > wrong. Due to your use of the %d flag with "count", accent
> > folding is applied; but it would appear to be the case that
> > ISO-8859-1 accent-folding is being used. This breaks the UTF-8
> > sequences. Clearly, the count command is using some
> > unauthorised character handling instead of calling the proper
> > CL string functions.
> >
> > What's puzzling, however, is that if this was a bug, it
> > should have been present on Linux as well as OS X. Did you
> > actually use the "count" command on Linux? If so, and it
> > worked OK, can you check which versions of CWB you have on
> > both platforms (cqp -v)? If the version you have installed on
> > OS X is an older one, then the solution is to upgrade to 3.4.
> OK. I think this is good news because probably this is not 
> caused by a bug in the current version. I thought the version 
> I was using on my Mac was the most current one as I 
> downloaded it from the official CWB site. 
> It turns out, though, that the version I'm running is 3.0.0 
> whereas the version in our server is 3.0.2.
> 
> How do I upgrade it? By simply downloading 3.4 (where do I 
> get it?) and repeating the installation process?
> 
> Josep M.
> 
> 
> 
> 
> > *If*, on the other hand, this really is a bug in the latest
> > version, then I need to find the relevant count code and fix
> > it. I can't see where it is at a glance, so a hunt will be
> > needed. (Stefan, if you happen to be reading this, and can
> > say quickly where I should look, that would be a help!)
> >
> > Best
> >
> > Andrew.
> >> -----Original Message-----
> >> From: cwb-bounces at sslmit.unibo.it
> >> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Josep M. Fontana
> >> Sent: 12 August 2012 18:29
> >> To: Open source development of the Corpus WorkBench
> >> Subject: Re: [CWB] Character encoding problems when transferring 
> >> corpus
> >>
> >> Hi Andrew,
> >>
> >> Thanks again for the prompt response.
> >>
> >> Something changes but the basic problem remains. The results after 
> >> "set Paging no;" look like this:
> >>
> >> 2    pesta?feres  [#2-#3]
> >> 1    pesta?fer  [#0]
> >> 1    pesta?fera  [#1]
> >> 1    pesta?ffera  [#4]
> >>
> >> as opposed to this (without "set Paging no;"):
> >>
> >> 2       pesta<AD>feres  [#2-#3]
> >> 1       pesta<AD>fer  [#0]
> >> 1       pesta<AD>fera  [#1]
> >> 1       pesta<AD>ffera  [#4]
> >>
> >> These should be all variants of "pestífera" where 'í' is not 
> >> displayed properly.
> >>
> >> One thing that I see, though, is that the results of the query vary
> >> considerably after issuing the "set Paging no;"
> >> command. I get a drastically reduced list of results after issuing
> >> this command. What does this do exactly?
> >>
> >> The other thing that I had not realized before I sent my initial
> >> message and that I think might be important to identify the problem
> >> is that the problems with the display of accented characters occur
> >> only when viewing the results of 'count'. That is, the problems only
> >> ensue with:
> >>
> >>   > count Last by word %cd on match;
> >>
> >> If I do a regular query such as:
> >>
> >>   > [(pos="A.*")&(word="pest.*")];
> >>
> >> all the accented characters are displayed properly whether I do "set
> >> Paging no;" or not.
> >>
> >>
> >> JM
> >>
> >>
> >>
> >>
> >>> This looks like an issue with less. It seems to be "eating"
> >>> the first half of the utf8 sequence (converting it to an accentless
> >>> "a") leaving the second half to appear as a bare binary character
> >>> (thus the hex codes in angle brackets).
> >>> You can check this by turning off the use of a pager for
> >>> query output:
> >>> set Paging no;
> >>>
> >>> If queries print OK with this setting, then it is
> >>> definitely an issue with less. If not, then the problem is somewhere
> >>> else.
> >>> Best
> >>>
> >>> Andrew.
> >>>
> >>>> -----Original Message-----
> >>>> From: cwb-bounces at sslmit.unibo.it
> >>>> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Josep M. Fontana
> >>>> Sent: 12 August 2012 17:11
> >>>> To: Open source development of the Corpus WorkBench
> >>>> Subject: [CWB] Character encoding problems when transferring corpus
> >>>>
> >>>> Hi,
> >>>>
> >>>> I'm not sure this is really a CWB problem (in fact I'm pretty sure
> >>>> it is not) but since there might be other users that have CWB
> >>>> running on a Mac perhaps I can get some help in this list.
> >>>>
> >>>> I installed CWB on my Mac and then transferred all the relevant 
> >>>> directories and registry files from the CWB installation we have 
> >>>> running on a LAMP server. Everything seems to be working fine 
> >>>> except that the results of a query on the terminal come out like 
> >>>> this (the words as they should be displayed are within parentheses)
> >>>>
> >>>>
> >>>> 60      ila<B7>lustra<AD>ssim  [#41507-#41566]  (--> 'il·lustríssim')
> >>>> 58      fama<B3>s  [#24851-#24908] ( --> 'famós')
> >>>>
> >>>> The corpus is encoded as UTF-8 and my terminal (iTerm) is set up 
> >>>> properly to view UTF-8 encoded texts. I have no problems viewing 
> >>>> other
> >>>> UTF-8 encoded texts on this computer and I don't have these 
> >>>> problems when accessing the corpus remotely.
> >>>>
> >>>> Why would UTF-8 encoded texts from the transferred corpus not be 
> >>>> displayed properly? Is there any way to fix this? Any help would be
> >>>> greatly appreciated.
> >>>>
> >>>> Josep M.
> >>>> _______________________________________________
> >>>> CWB mailing list
> >>>> CWB at sslmit.unibo.it
> >>>> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
> >>>>
> 

