[CWB] Character encoding revisited (plus question about query)

Hardie, Andrew a.hardie at lancaster.ac.uk
Fri Jun 27 18:45:39 CEST 2014


Oh that's easy then. The 3.0 series does not actually support UTF-8!

Install 3.4 from the svn repo and let us know if the problem recurs.

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Josep M. Fontana
Sent: 27 June 2014 17:42
To: cwb at sslmit.unibo.it
Subject: Re: [CWB] Character encoding revisited (plus question about query)


Actually, I got the information you required. It turns out the version 
of CQP I'm using to do searches via terminal is newer:

Compiled:  dc nov  7 14:28:35 CET 2012
Version:   3.0.2


JM
>
> OK. Thanks for the prompt response and sorry about taking so long to 
> answer.
>
> I'm issuing the following two commands:
>
> $ PRE13 = [word="se"][pos="V.*"]::match.text_century="13";
> $ count PRE13 by lema %cd match > "preverbal_se-13.txt";
>
> The version I see on our CQPweb page is CQPweb v3.0.16 (c) 2008-2013. 
> I'm working via terminal, however. I don't know whether the version of 
> CQPweb tells you something informative about the query engine itself.
>
> While I'm at it, let me ask potential readers of this message about 
> something else. As you may notice, the goal of the combination of 
> commands I'm using is to get a list of the frequencies of verbs that 
> occur after the Spanish reflexive pronoun 'se' in a particular 
> century. I'm actually interested in finding out whether there were any 
> significant changes in the types of verbs that appear with this 
> pronoun across time.
>
> Since word order was rather flexible in medieval Spanish, I'm also 
> interested in finding out what verbs appear preceding SE so I don't 
> really care what the relative order is between the verb and the 
> pronoun. The problem is that since I need to get the frequencies and 
> I'm using 'count' to do this (is there a better way?), the only way I 
> know how to get the frequency of the verb is by specifying its 
> position via 'match' or 'matchend'. So, besides the combination of 
> commands above, I have to use the following combination and then merge 
> the results manually.
>
> $ POST13 = [pos="V.*"][word="se"]::match.text_century="13";
> $ count POST13 by lema %cd match > "postverbal_se-13.txt";
>
>
> I have the feeling this is a pretty crude and inefficient way of doing 
> this but I haven't been able to figure out any other way to do it by 
> looking at the CQP manual. Can anybody tell me how to get the total 
> frequencies independently of the relative order the verb occupies with 
> respect to SE if that is actually possible?
>
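For the manual merge of the two result files described above, a small script may help. The following is a minimal sketch in Python, not a CQP feature. It assumes that each line of the 'count' output starts with a frequency followed by the counted item, optionally followed by a range in square brackets such as [#0-#11]; it is worth checking a few lines of the actual files before relying on that. The function names (parse_counts, merge_counts, merge_files) are made up for illustration.

```python
import re
from collections import Counter

# ASSUMPTION: each line looks like "<frequency> <item>", optionally followed
# by a range such as "[#0-#11]". Adjust the pattern if your files differ.
LINE_RE = re.compile(r"^\s*(\d+)\s+(.*?)\s*(?:\[#\d+-#\d+\])?\s*$")


def parse_counts(lines):
    """Turn an iterable of text lines into a Counter mapping item -> frequency."""
    totals = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        # Skip lines that do not fit the assumed shape or have an empty item.
        if m and m.group(2):
            totals[m.group(2)] += int(m.group(1))
    return totals


def merge_counts(*tables):
    """Add up any number of frequency tables, item by item."""
    merged = Counter()
    for table in tables:
        merged.update(table)  # Counter.update adds counts instead of replacing
    return merged


def merge_files(*paths, encoding="utf-8"):
    """Read several count files (hypothetical helper) and return merged totals."""
    tables = []
    for path in paths:
        with open(path, encoding=encoding) as handle:
            tables.append(parse_counts(handle))
    return merge_counts(*tables)
```

Under the format assumption above, merge_files("preverbal_se-13.txt", "postverbal_se-13.txt").most_common(20) would list the 20 most frequent verb lemmas regardless of their position relative to SE.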
>
> Thanks in advance,
>
> JM
>
>
>
>> What particular output?
>>
>> e.g., concordance with context width defined in characters, 
>> concordance with context width defined in words, tabulation, group, 
>> ... ?
>>
>> Depending on which it is, the cause could be rather different.
>>
>> Also, where in the lines do the broken UTF-8 characters occur? At the 
>> beginning, at the end, in the middle, or a combination?
>>
>> Lastly, what version are you running?
>>
>> best
>>
>> Andrew.
>> ________________________________________
>> From: cwb-bounces at sslmit.unibo.it [cwb-bounces at sslmit.unibo.it] on 
>> behalf of Josep M. Fontana [josepm.fontana at upf.edu]
>> Sent: 25 June 2014 17:41
>> To: cwb at sslmit.unibo.it
>> Subject: [CWB] Character encoding revisited
>>
>> Hi,
>>
>> Our corpus is encoded in UTF-8 but when I create a text file with the
>> output of some search I get the typical odd characters one gets when the
>> conversion has gone wrong. I used the 'file' command and I saw that the
>> text files are sometimes encoded as ISO-8859 and some other times as
>> ASCII. Is there any way to configure things so that the UTF-8 character
>> set is maintained? Thanks.
>>
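One thing worth keeping in mind about the 'file' check: a file that contains only ASCII bytes is also valid UTF-8, so a report of "ASCII" does not by itself indicate a failed conversion. Only the ISO-8859 reports point to bytes that are not valid UTF-8. A rough way to check the bytes directly is sketched below in Python; the function name classify_encoding and its three labels are illustrative only.

```python
def classify_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8' or 'not-utf-8' for a chunk of raw bytes.

    Pure ASCII is reported separately because it is valid in both UTF-8
    and the ISO-8859 family, so it says nothing about a conversion error.
    """
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Typical for Latin-1 output: bytes such as 0xFA are not valid UTF-8.
        return "not-utf-8"


def classify_file(path: str) -> str:
    """Hypothetical convenience wrapper: classify the whole contents of a file."""
    with open(path, "rb") as handle:
        return classify_encoding(handle.read())
```

Running classify_file on each output file should separate the harmless all-ASCII cases from the files that really were written in a non-UTF-8 encoding.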
>>
>> Josep M.
>> _______________________________________________
>> CWB mailing list
>> CWB at sslmit.unibo.it
>> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
>
