[CWB] Character encoding problems when transferring corpus

Josep M. Fontana josepm.fontana at upf.edu
Sun Aug 12 20:13:19 CEST 2012


OK. I have to leave now, so I won't be able to answer for a while. I'll
check this as soon as possible and report back. Thanks a lot for your help!

JM
> 3.0 doesn't have any Unicode support at all - so that explains that! Yes, you should upgrade to 3.4. The easiest way is to check out the code via Subversion, then build following the instructions in the INSTALL file.
>
> Instructions are here: http://cwb.sourceforge.net/developers.php#svn
>
> You will probably need to set the CWB_LIVE_DANGEROUSLY environment variable to overwrite your existing installation.
>
> I trust that the character set issue will disappear then, but let us know if not.
>
> As for the disappearance of some results depending on whether the pager is on or off... That is more worrying. Using less / not using less ought indeed not to change the number of results. Can you confirm that the issue persists on v3.4, and if it does, send a full example of the two outputs (full and reduced)? Thanks.
>
> Best
>
> Andrew.
>
>> -----Original Message-----
>> From: cwb-bounces at sslmit.unibo.it
>> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Josep M. Fontana
>> Sent: 12 August 2012 19:00
>> To: Open source development of the Corpus WorkBench
>> Subject: Re: [CWB] Character encoding problems when
>> transferring corpus
>>
>>
>>> Ah, this begins to make sense.
>> Glad to hear that!
>>> "set Paging no" turns off the use of less to view results.
>>> When Paging is off, results are piped directly to the
>>> terminal. When Paging is on, a less process is started up and
>>> the results are piped to that.
>> Mmm, that's what I thought it did but what I would have
>> expected is that the only change after issuing this command
>> would be that all the results would appear in a single
>> screen. But the change goes beyond that. Not only are the
>> results dumped into a single screen or page but they are also
>> considerably reduced. After "set Paging no;" the list of
>> results is considerably shorter than before issuing this
>> command. Is this supposed to happen? I thought less only
>> affected how the results are displayed, not what results you get.
>>> The fact that the bytes are still wrong without the pager
>>> makes things more transparent, and I think I know what is
>>> wrong. Due to your use of the %d flag with "count", accent
>>> folding is applied; but it would appear to be the case that
>>> ISO-8859-1 accent-folding is being used. This breaks the UTF8
>>> sequences. Clearly, the count command is using some
>>> unauthorised character handling instead of calling the proper
>>> CL string functions.
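
The effect described above can be reproduced outside CWB. The following is a minimal Python sketch, not CWB's actual code: `latin1_fold` is a hypothetical helper that assumes the folding works byte by byte and treats every byte as an ISO-8859-1 character. Under that assumption the UTF-8 lead byte 0xC3 (which is 'Ã' in ISO-8859-1) is folded to a plain "a", while the continuation byte (0xAD for 'í') is left alone, which matches the "pesta<AD>fera" output reported in this thread.

```python
import unicodedata


def latin1_fold(data: bytes) -> bytes:
    """Fold case and accents byte by byte, treating each byte as ISO-8859-1.

    Hypothetical illustration only; this is not how CWB implements folding.
    """
    out = bytearray()
    for byte in data:
        char = chr(byte)  # ISO-8859-1 maps byte values directly to code points
        # Canonical decomposition puts the base letter first; keep only that.
        base = unicodedata.normalize("NFD", char)[0]
        folded = base.lower()
        # Keep the original byte if the folded form does not fit in one byte.
        if len(folded) == 1 and ord(folded) < 256:
            out.append(ord(folded))
        else:
            out.append(byte)
    return bytes(out)


for word in ("pestífera", "famós", "il·lustríssim"):
    # The input is correct UTF-8; the folding is what corrupts it.
    print(word, "->", latin1_fold(word.encode("utf-8")))
```

The folded byte strings are no longer valid UTF-8, because each multi-byte sequence has lost its lead byte. Folding the decoded text instead of the raw bytes would avoid this.
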
>>> What's puzzling, however, is that if this was a bug, it
>>> should have been present on Linux as well as OS X. Did you
>>> actually use the "count" command on Linux? If so, and it
>>> worked OK, can you check which versions of CWB you have on
>>> both platforms (cqp -v)? If the version you have installed on
>>> OS X is an older one, then the solution is to upgrade to 3.4.
>> OK. I think this is good news because probably this is not
>> caused by a bug in the current version. I thought the version
>> I was using on my Mac was the most current one as I
>> downloaded it from the official CWB site.
>> It turns out, though, that the version I'm running is 3.0.0
>> whereas the version in our server is 3.0.2.
>>
>> How do I upgrade it? By simply downloading 3.4 (where do I
>> get it?) and repeating the installation process?
>>
>> Josep M.
>>
>>
>>
>>
>>> *If*, on the other hand, this really is a bug in the latest
>>> version, then I need to find the relevant count code and fix
>>> it. I can't see where it is at a glance, so a hunt will be
>>> needed. (Stefan, if you happen to be reading this, and can
>>> say quickly where I should look, that would be a help!)
>>>
>>> Best
>>>
>>> Andrew.
>>>> -----Original Message-----
>>>> From: cwb-bounces at sslmit.unibo.it
>>>> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Josep M. Fontana
>>>> Sent: 12 August 2012 18:29
>>>> To: Open source development of the Corpus WorkBench
>>>> Subject: Re: [CWB] Character encoding problems when transferring
>>>> corpus
>>>>
>>>> Hi Andrew,
>>>>
>>>> Thanks again for the prompt response.
>>>>
>>>> Something changes but the basic problem remains. The results after
>>>> "set Paging no;" look like this:
>>>>
>>>> 2    pesta?feres  [#2-#3]
>>>> 1    pesta?fer  [#0]
>>>> 1    pesta?fera  [#1]
>>>> 1    pesta?ffera  [#4]
>>>>
>>>> as opposed to this (without "set Paging no;"):
>>>>
>>>> 2       pesta<AD>feres  [#2-#3]
>>>> 1       pesta<AD>fer  [#0]
>>>> 1       pesta<AD>fera  [#1]
>>>> 1       pesta<AD>ffera  [#4]
>>>>
>>>> These should be all variants of "pestífera" where 'í' is not
>>>> displayed properly.
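
The two listings above are consistent with the same invalid byte reaching the screen in both cases and only being rendered differently. The Python sketch below illustrates that reading. It assumes the raw output contains a stray 0xAD byte (inferred from the `<AD>` in the paged listing), and it only models the two display styles: `angle_hex` is a hypothetical error-handler name, and neither branch reproduces the real logic of less or of the terminal.

```python
import codecs


def _angle_hex(err):
    """Decode error handler: render each undecodable byte as <XX>."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(f"<{b:02X}>" for b in bad), err.end


# "angle_hex" is a made-up handler name used only for this illustration.
codecs.register_error("angle_hex", _angle_hex)

raw = b"pesta\xadfera"  # assumed bytes: "a", then a lone UTF-8 continuation byte

# Pager-style display: invalid bytes shown as hex codes in angle brackets.
pager_like = raw.decode("utf-8", errors="angle_hex")

# Terminal-style display: invalid bytes shown as a placeholder character.
terminal_like = raw.decode("utf-8", errors="replace").replace("\ufffd", "?")

print(pager_like)
print(terminal_like)
```

If this model holds, the pager is not what damages the text; the bytes are already broken before they are displayed.
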
>>>>
>>>> One thing that I see, though, is that the results of the query vary
>>>> considerably after issuing the "set Paging no;" command. I get a
>>>> drastically reduced list of results after issuing this command.
>>>> What does this do exactly?
>>>>
>>>> The other thing that I had not realized before I sent my initial
>>>> message, and that I think might be important for identifying the
>>>> problem, is that the problems with the display of accented
>>>> characters occur only when viewing the results of 'count'. That is,
>>>> the problems only occur with:
>>>>
>>>>    > count Last by word %cd on match;
>>>>
>>>> If I do a regular query such as:
>>>>
>>>>    > [(pos="A.*")&(word="pest.*")];
>>>>
>>>> all the accented characters are displayed properly whether I do
>>>> "set Paging no;" or not.
>>>>
>>>>
>>>> JM
>>>>
>>>>
>>>>
>>>>
>>>>> This looks like an issue with less. It seems to be "eating"
>>>>> the first half of the utf8 sequence (converting it to an accentless
>>>>> "a") leaving the second half to appear as a bare binary character
>>>>> (thus the hex codes in angle brackets).
>>>>>
>>>>> You can check this by turning off the use of a pager for
>>>>> query output:
>>>>>
>>>>> set Paging no;
>>>>>
>>>>> If queries print OK with this setting, then it is
>>>>> definitely an issue with less. If not, then the problem is somewhere
>>>>> else.
>>>>>
>>>>> Best
>>>>>
>>>>> Andrew.
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: cwb-bounces at sslmit.unibo.it
>>>>>> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Josep M. Fontana
>>>>>> Sent: 12 August 2012 17:11
>>>>>> To: Open source development of the Corpus WorkBench
>>>>>> Subject: [CWB] Character encoding problems when transferring corpus
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm not sure this is really a CWB problem (in fact I'm pretty sure
>>>>>> it is not) but since there might be other users that have CWB
>>>>>> running on a Mac perhaps I can get some help in this list.
>>>>>>
>>>>>> I installed CWB on my Mac and then transferred all the relevant
>>>>>> directories and registry files from the CWB installation we have
>>>>>> running on a LAMP server. Everything seems to be working fine
>>>>>> except that the results of a query on the terminal come out like
>>>>>> this (the words as they should be displayed are within parentheses):
>>>>>>
>>>>>> 60      ila<B7>lustra<AD>ssim  [#41507-#41566]  (--> 'il·lustríssim')
>>>>>> 58      fama<B3>s  [#24851-#24908]  (--> 'famós')
>>>>>>
>>>>>> The corpus is encoded as UTF-8 and my terminal (iTerm) is set up
>>>>>> properly to view UTF-8 encoded texts. I have no problems viewing
>>>>>> other UTF-8 encoded texts on this computer and I don't have these
>>>>>> problems when accessing the corpus remotely.
>>>>>>
>>>>>> Why would UTF-8 encoded texts from the transferred corpus not be
>>>>>> displayed properly? Is there any way to fix this? Any help would be
>>>>>> greatly appreciated.
>>>>>>
>>>>>> Josep M.
>>>>>> _______________________________________________
>>>>>> CWB mailing list
>>>>>> CWB at sslmit.unibo.it
>>>>>> http://devel.sslmit.unibo.it/mailman/listinfo/cwb


