[CWB] encoding BNC-BABY

Stefan Evert stefanML at collocations.de
Mon Jan 14 08:14:02 CET 2019


Hi John,

the BNC encoder script is designed to work with the full BNC (XML edition).  If BNC-BABY (or the BNC sampler) were just subsets of the full corpus, the script should work properly; but apparently there are also some format differences.

Unfortunately, I don't have either of these corpora so I wasn't able to test and adapt the script.

The error message suggests that BNC-BABY writes its catRef codes in lowercase (alltim3), whereas they are uppercase in the full BNC (e.g. ALLTIM3) and hardcoded as such in the encoder script.  In principle, this should be fixable, but I won't be able to look at it anytime soon.

If this is the only difference, you might try changing the hardcoded BNC metadata tables at the end of the file lib/BNC/Meta.pm.
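
To check whether letter case really is the only difference before touching Meta.pm, one could list the catRef codes that occur in the BNC-BABY files. The following Python sketch is only an illustration and not part of the CWB distribution. It assumes that the codes appear as a whitespace-separated list in a targets="..." attribute of <catRef> elements; the function names and the directory layout are hypothetical.

```python
import re
import sys
from pathlib import Path

# Assumption: classification codes are stored as
#   <catRef targets="CODE1 CODE2 ..."/>
# (the pattern also accepts target="..." in case the attribute name differs).
CATREF_RE = re.compile(r'<catRef\b[^>]*\btargets?="([^"]*)"')


def collect_catref_codes(xml_text):
    """Return the set of catRef codes found in one XML document."""
    codes = set()
    for match in CATREF_RE.finditer(xml_text):
        codes.update(match.group(1).split())
    return codes


def non_uppercase_codes(codes):
    """Return, sorted, the codes that differ from their uppercase form."""
    return sorted(code for code in codes if code != code.upper())


def scan_corpus(root):
    """Collect catRef codes from every .xml file below root."""
    codes = set()
    for path in Path(root).rglob("*.xml"):
        text = path.read_text(encoding="utf-8", errors="replace")
        codes |= collect_catref_codes(text)
    return codes


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python check_catref.py ../fromota
    found = scan_corpus(sys.argv[1])
    print("catRef codes not in uppercase:", non_uppercase_codes(found))
```

If every reported code matches a code of the full BNC once it is uppercased, the case difference is probably the only problem with the metadata; any other code would point to a further format difference.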

An easier solution might be to create a subcorpus from the full BNC.  I have a Perl script to do so, but haven't been able to write documentation yet, and you will need very up-to-date versions of CWB and the Perl modules.  Alternatively, you could just try running the encoder script on a subset of the BNC XML files (i.e. copy those into your data directory instead of the full BNC).
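
The second option (copying a subset of the full BNC files) could be scripted roughly as follows. This Python sketch is an untested illustration: it assumes that the BNC-BABY texts carry the same file names as the corresponding texts in the full BNC, and that the encoder is happy with a directory that keeps the full BNC's subdirectory layout. The function name and all paths are hypothetical.

```python
import shutil
from pathlib import Path


def copy_subset(full_texts_dir, baby_texts_dir, dest_dir):
    """Copy those full-BNC XML files whose names also occur in BNC-BABY.

    The relative path below full_texts_dir is preserved under dest_dir.
    Returns the sorted list of file names that were copied.
    """
    full_texts_dir = Path(full_texts_dir)
    dest_dir = Path(dest_dir)

    # File names (e.g. "A00.xml") of all texts in the BNC-BABY distribution.
    wanted = {p.name for p in Path(baby_texts_dir).rglob("*.xml")}

    copied = []
    for src in sorted(full_texts_dir.rglob("*.xml")):
        if src.name in wanted:
            target = dest_dir / src.relative_to(full_texts_dir)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, target)
            copied.append(src.name)
    return sorted(copied)
```

The resulting directory would then be passed to the encoder script in place of the full BNC data. Comparing the returned list with the BNC-BABY file names shows whether any texts could not be found in the full corpus.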

Best,
Stefan





> On 11 Jan 2019, at 17:33, John Hale <jthale at uga.edu> wrote:
> 
> CWB Gurus: I am about to teach a course module on corpus searching, naturally using the CWB. Since the server hardware is shared, I think it is better to use BNC-BABY rather than the full BNC so that queries finish quickly.  I successfully encoded the full BNC using Stefan Evert’s excellent script. But when I try to apply the same script to BNC-BABY, I get this error in red:
> 
> 
>> perl EncodeBNC.perl -f --name="BNC-BABY" /data/corpora/cwb/bncbaby  ../fromota
>> IMS Open Corpus Workbench:
>> Encoder for the British National Corpus (XML edition), version 0.9.2.
>> 
>> Converting source files to CWB format ...
>> BNC::Meta: Unknown catRef code 'alltim3' -- program aborted
> 
> In this command, “fromota” comes fresh from http://ota.ox.ac.uk/desc/2553
> and contains
> the file bncHdr.xml
> and 
> the directory Texts
> with subdirectories aca dem fic news… et cetera
> 
> 
> Is EncodeBNC actually meant to work with BNC-BABY? Or is there another way to get it encoded?
> 
> 
> grateful for any tips,
> -john
> 
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb


