[CWB] Character encoding revisited (plus question about query)

Fri Jun 27 19:51:26 CEST 2014

Damn. I thought we were running the latest and greatest :-(
> Oh that's easy then. The 3.0 series does not actually support UTF-8!
>
> Install 3.4 from the svn repo and let us know if the problem recurs.
>
> best
>
> Andrew.
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Josep M. Fontana
> Sent: 27 June 2014 17:42
> To: cwb at sslmit.unibo.it
> Subject: Re: [CWB] Character encoding revisited (plus question about query)
>
>
> Actually, I got the information you required. It turns out the version
> of CQP I'm using to do searches via terminal is newer:
>
> Compiled:  dc nov  7 14:28:35 CET 2012
> Version:   3.0.2
>
>
> JM
>> OK. Thanks for the prompt response and sorry about taking so long to
>> answer.
>>
>> I'm issuing the following two commands:
>>
>> $ PRE13 = [word="se"][pos="V.*"]::match.text_century="13";
>> $ count PRE13 by lema %cd match > "preverbal_se-13.txt";
>>
>> The version I see on our CQPweb page is CQPweb v3.0.16 (c) 2008-2013.
>> I'm working via terminal, however. I don't know whether the version of
>> CQPweb tells you something informative about the query engine itself.
>>
>> While I'm at it, let me ask potential readers of this message about
>> something else. As you can see, the goal of the combination of
>> commands I'm using is to get a list of the frequencies of verbs that
>> occur after the Spanish reflexive pronoun 'se' in a particular
>> century. I'm actually interested in finding out whether there were any
>> significant changes in the types of verbs that appear with this
>> pronoun across time.
>>
>> Since word order was rather flexible in medieval Spanish, I'm also
>> interested in finding out what verbs appear preceding SE so I don't
>> really care what the relative order is between the verb and the
>> pronoun. The problem is that since I need to get the frequencies and
>> I'm using 'count' to do this (is there a better way?), the only way I
>> know how to get the frequency of the verb is by specifying its
>> position via 'match' or 'matchend'. So, besides the combination of
>> commands above, I have to use the following combination and then merge
>> the results manually.
>>
>> $ POST13 = [pos="V.*"][word="se"]::match.text_century="13";
>> $ count POST13 by lema %cd match > "postverbal_se-13.txt";
>>
>>
>> I have the feeling this is a pretty crude and inefficient way of doing
>> this but I haven't been able to figure out any other way to do it by
>> looking at the CQP manual. Can anybody tell me how to get the total
>> frequencies independently of the relative order the verb occupies with
>> respect to SE if that is actually possible?
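
Until a single-query solution turns up, the manual merge of the two count files described above could be scripted. What follows is only a minimal sketch in Python, not a CQP feature. It assumes each line of the saved output starts with an integer frequency followed by the counted item, optionally ending in a range such as [#0-#11]; that layout is an assumption and should be checked against the actual files. The names parse_counts and merge_counts are hypothetical helpers made up for this illustration.

```python
import re
from collections import Counter

# Assumed line layout: "<frequency> <item> [#first-#last]" (the range is optional).
LINE_RE = re.compile(r"^\s*(\d+)\s+(.*?)\s*(?:\[#\d+-#\d+\])?\s*$")

def parse_counts(lines):
    """Turn lines of count output into a Counter mapping item -> frequency."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group(2):          # skip blank or unparseable lines
            counts[m.group(2)] += int(m.group(1))
    return counts

def merge_counts(*tables):
    """Add up several frequency tables, ignoring where each one came from."""
    total = Counter()
    for table in tables:
        total.update(table)           # Counter.update adds frequencies
    return total

if __name__ == "__main__":
    tables = []
    for path in ("preverbal_se-13.txt", "postverbal_se-13.txt"):
        # Read explicitly as UTF-8 so accented lemmas survive the merge.
        with open(path, encoding="utf-8") as fh:
            tables.append(parse_counts(fh))
    for item, freq in merge_counts(*tables).most_common():
        print(f"{freq}\t{item}")
```

Run in the directory holding the two files, this prints one combined frequency list, sorted from most to least frequent, regardless of whether the verb preceded or followed SE.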
>>
>>
>> Thanks in advance,
>>
>> JM
>>
>>
>>
>>> What particular output?
>>>
>>> e.g., concordance with context width defined in characters,
>>> concordance with context width defined in words, tabulation, group,
>>> ... ?
>>>
>>> Depending on which it is, the cause could be rather different.
>>>
>>> Also, where in the lines do the broken UTF-8 characters occur? At the
>>> beginning, at the end, in the middle, or a combination?
>>>
>>> Lastly, what version are you running?
>>>
>>> best
>>>
>>> Andrew.
>>> ________________________________________
>>> From: cwb-bounces at sslmit.unibo.it [cwb-bounces at sslmit.unibo.it] on
>>> behalf of Josep M. Fontana [josepm.fontana at upf.edu]
>>> Sent: 25 June 2014 17:41
>>> To: cwb at sslmit.unibo.it
>>> Subject: [CWB] Character encoding revisited
>>>
>>> Hi,
>>>
>>> Our corpus is encoded in UTF-8 but when I create a text file with the
>>> output of some search I get the typical odd characters one gets when the
>>> conversion has gone wrong. I used the 'file' command and I saw that the
>>> text files are sometimes encoded as ISO-8859 and at other times as
>>> ASCII. Is there any way to configure things so that the UTF-8 character
>>> set is maintained? Thanks.
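
One point about the 'file' report above: pure ASCII is a subset of UTF-8, so a file reported as ASCII is not necessarily damaged; it may simply contain no accented characters. Only the files reported as ISO-8859 point to a real conversion problem. The Python sketch below makes that three-way distinction explicit. It is an illustrative check, not part of CWB, and the function name classify_encoding is hypothetical.

```python
def classify_encoding(data: bytes) -> str:
    """Classify raw bytes as 'ascii', 'utf-8', or 'not-utf-8'."""
    try:
        data.decode("ascii")
        return "ascii"        # no byte above 0x7F: valid ASCII and valid UTF-8
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"        # non-ASCII bytes that form valid UTF-8 sequences
    except UnicodeDecodeError:
        return "not-utf-8"    # probably a legacy 8-bit encoding such as ISO-8859-1

if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:   # read bytes, do not decode yet
            print(path, classify_encoding(fh.read()))
```

Passing the output files as command-line arguments shows which of them actually contain bytes that are not valid UTF-8.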
>>>
>>>
>>> Josep M.
>>> _______________________________________________
>>> CWB mailing list
>>> CWB at sslmit.unibo.it
>>> http://devel.sslmit.unibo.it/mailman/listinfo/cwb