[CWB] Character encoding revisited (plus question about query)
Josep M. Fontana
josepm.fontana at upf.edu
Fri Jun 27 18:40:05 CEST 2014
OK. Thanks for the prompt response and sorry about taking so long to
answer.
I'm issuing the following two commands:
$ PRE13 = [word="se"][pos="V.*"]::match.text_century="13";
$ count PRE13 by lema %cd match > "preverbal_se-13.txt";
The version I see on our CQPweb page is CQPweb v3.0.16 © 2008-2013. I'm
working via terminal, however. I don't know whether the version of
CQPweb tells you something informative about the query engine itself.
While I'm at it, let me ask readers of this message about something
else. As you may have noticed, the goal of the two commands above is to
get a frequency list of the verbs that occur after the Spanish reflexive
pronoun 'se' in a particular century. I'm actually interested in finding
out whether there were any significant changes over time in the types of
verbs that appear with this pronoun.
Since word order was rather flexible in medieval Spanish, I'm also
interested in finding out what verbs appear before SE, so the relative
order of the verb and the pronoun doesn't really matter to me.
The problem is that since I need to get the frequencies and I'm using
'count' to do this (is there a better way?), the only way I know how to
get the frequency of the verb is by specifying its position via 'match'
or 'matchend'. So, besides the combination of commands above, I have to
use the following combination and then merge the results manually.
$ POST13 = [pos="V.*"][word="se"]::match.text_century="13";
$ count POST13 by lema %cd match > "postverbal_se-13.txt";
I have the feeling this is a pretty crude and inefficient way of doing
this, but I haven't been able to figure out any other way from the CQP
manual. Can anybody tell me how to get the total frequencies regardless
of the position the verb occupies relative to SE, if that is actually
possible?
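
Until a single-query solution turns up, the manual merge of the two
frequency lists could at least be scripted. The following is only a
minimal Python sketch, not a CQP feature. It assumes that each line
written by 'count' starts with an integer frequency followed by
whitespace and the counted item, optionally followed by a bracketed
range such as [#0-#11]; that layout is an assumption and should be
checked against the real files. The names parse_counts and
merge_count_files are hypothetical helpers.

```python
import re
from collections import Counter

# ASSUMED line layout (not verified against real CQP output):
#   <frequency> <whitespace> <item> [optional "[#start-#end]" range]
LINE = re.compile(r"^\s*(\d+)\s+(.*?)\s*(?:\[#\d+-#\d+\])?\s*$")


def parse_counts(lines):
    """Turn an iterable of frequency-list lines into a Counter."""
    counts = Counter()
    for line in lines:
        m = LINE.match(line)
        # Skip blank lines and anything that does not fit the assumed layout.
        if m and m.group(2):
            counts[m.group(2)] += int(m.group(1))
    return counts


def merge_count_files(paths, encoding="utf-8"):
    """Sum the frequencies of identical items across several files."""
    total = Counter()
    for path in paths:
        with open(path, encoding=encoding) as fh:
            total.update(parse_counts(fh))
    return total


if __name__ == "__main__":
    merged = merge_count_files(
        ["preverbal_se-13.txt", "postverbal_se-13.txt"]
    )
    # Print the merged list, most frequent verb first.
    for item, freq in merged.most_common():
        print(f"{freq}\t{item}")
```

Reading the files explicitly as UTF-8 also means that a file which came
out in a different encoding will most likely raise an error instead of
silently producing odd characters.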
Thanks in advance,
JM
> What particular output?
>
> e.g., concordance with context width defined in characters, concordance with context width defined in words, tabulation, group, ... ?
>
> Depending on which it is, the cause could be rather different.
>
> Also, where in the lines do the broken UTF-8 characters occur? At the beginning, at the end, in the middle, or a combination?
>
> Lastly, what version are you running?
>
> best
>
> Andrew.
> ________________________________________
> From: cwb-bounces at sslmit.unibo.it [cwb-bounces at sslmit.unibo.it] on behalf of Josep M. Fontana [josepm.fontana at upf.edu]
> Sent: 25 June 2014 17:41
> To: cwb at sslmit.unibo.it
> Subject: [CWB] Character encoding revisited
>
> Hi,
>
> Our corpus is encoded in UTF-8 but when I create a text file with the
> output of some search I get the typical odd characters one gets when the
> conversion has gone wrong. I used the 'file' command and I saw that the
> text files are sometimes encoded as ISO-8859 and some other times as
> ASCII. Is there any way to configure things so that the UTF-8 character
> set is maintained? Thanks.
>
>
> Josep M.
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb