[CWB] Encoded corpus shows hits for [word=".*"] but not for any real word---not even for [word="a.*"]

Hardie, Andrew a.hardie at lancaster.ac.uk
Wed May 7 10:51:01 CEST 2014


Can I suggest you take a look at the output of cwb-decode and/or cwb-lexdecode to try and see what is actually in there? 

One possibility is that there is some rogue whitespace or other non-printing character at the start of every word. This would explain why a query starting with any actual letter gets zero results, but a query starting with . (which matches any character) gets results.
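As a rough sketch of that check, the following Python flags word forms whose first character is whitespace or otherwise non-printing. It assumes the lexicon has already been dumped to plain text as UTF-8 with one word form per line and no other columns; the function names suspicious_prefix and scan are made up for this illustration and are not part of CWB.

```python
import unicodedata


def suspicious_prefix(token):
    """Describe the first character if it is whitespace or non-printing.

    Returns None when the token starts with an ordinary visible character.
    """
    if not token:
        return "empty token"
    first = token[0]
    category = unicodedata.category(first)
    # Unicode categories starting with Z are separators (spaces);
    # those starting with C are control, format (e.g. a BOM) and similar.
    if first.isspace() or category[0] in ("Z", "C"):
        return "U+%04X (%s)" % (ord(first), category)
    return None


def scan(lines):
    """Return (token, description) pairs for every suspicious word form.

    `lines` is any iterable of strings, one word form per line (assumption).
    Only the trailing newline is stripped, so leading characters survive.
    """
    flagged = []
    for line in lines:
        token = line.rstrip("\n")
        reason = suspicious_prefix(token)
        if reason:
            flagged.append((token, reason))
    return flagged
```

Calling scan() on an open text file of word forms returns the offending entries together with the code point of the leading character. If nearly every entry is flagged with the same code point, that points to a systematic problem in the input data rather than to isolated OCR noise.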

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Jörg Knappen
Sent: 07 May 2014 07:47
To: CWB at sslmit.unibo.it
Subject: [CWB] Encoded corpus shows hits for [word=".*"] but not for any real word---not even for [word="a.*"]


I have encoded a corpus with cqp-3.0 and found that the corpus query [word=".*"]; gives a lot of results, but any other query I tried gave 0 results.

I suspect that there is something in the raw data causing this behaviour, but I don't know what to look for. The data is not very clean: it comes from OCR, and not all OCR errors have been corrected. Encoding throws some warnings like

Malformed tag <, inserted literally (file lat2-vrt//0006752_lat2.vrt, line #6).

However, the same kind of warning occurred with a previous installment of the same corpus, and there the cqp query worked fine.

Any hints?

--Jörg Knappen

_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/cwb

