[CWB] Character encoding revisited (plus question about query)

Josep M. Fontana josepm.fontana at upf.edu
Fri Jun 27 18:40:05 CEST 2014


OK. Thanks for the prompt response, and sorry for taking so long to
reply.

I'm issuing the following two commands:

$ PRE13 = [word="se"][pos="V.*"]::match.text_century="13";
$ count PRE13 by lema %cd match > "preverbal_se-13.txt";

The version shown on our CQPweb page is CQPweb v3.0.16 © 2008-2013. I'm
working in a terminal, however, so I don't know whether the CQPweb
version tells you anything useful about the query engine itself.

While I'm at it, let me ask readers of this list about something else.
As you can see, the goal of the two commands above is to get a
frequency list of the verbs that occur after the Spanish reflexive
pronoun 'se' in a particular century. What I actually want to find out
is whether the types of verbs that appear with this pronoun changed
significantly over time.

Since word order was rather flexible in medieval Spanish, I'm also
interested in the verbs that precede SE, so I don't really care about
the relative order of the verb and the pronoun. The problem is that I
need the frequencies and I'm using 'count' to get them (is there a
better way?), and the only way I know to count the verb is to specify
its position via 'match' or 'matchend'. So, besides the commands above,
I have to run the following pair and then merge the results manually.

$ POST13 = [pos="V.*"][word="se"]::match.text_century="13";
$ count POST13 by lema %cd match > "postverbal_se-13.txt";


I have the feeling this is a pretty crude and inefficient way of doing
this, but I haven't been able to find an alternative in the CQP manual.
Can anybody tell me how to get the total frequencies regardless of the
position of the verb relative to SE, if that is possible at all?


Thanks in advance,

JM



> What particular output?
>
> e.g., concordance with context width defined in characters, concordance with context width defined in words, tabulation, group, ... ?
>
> Depending on which it is, the cause could be rather different.
>
> Also, where in the lines do the broken UTF-8 characters occur? At the beginning, at the end, in the middle, or a combination?
>
> Lastly, what version are you running?
>
> best
>
> Andrew.
> ________________________________________
> From: cwb-bounces at sslmit.unibo.it [cwb-bounces at sslmit.unibo.it] on behalf of Josep M. Fontana [josepm.fontana at upf.edu]
> Sent: 25 June 2014 17:41
> To: cwb at sslmit.unibo.it
> Subject: [CWB] Character encoding revisited
>
> Hi,
>
> Our corpus is encoded in UTF-8, but when I create a text file with the
> output of a search I get the typical odd characters one gets when the
> conversion has gone wrong. I used the 'file' command and saw that the
> text files are sometimes reported as ISO-8859 and at other times as
> ASCII. Is there any way to configure things so that the UTF-8 character
> set is maintained? Thanks.
>
>
> Josep M.
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb


