[CWB] Encoded corpus shows hits for [word=".*"] but not for any real word---not even for [word="a.*"]

Jörg Knappen j.knappen at mx.uni-saarland.de
Wed May 7 08:47:17 CEST 2014


I have encoded a corpus with cqp-3.0 and found that the corpus query
[word=".*"];
gives a lot of results, but every other query I tried gave 0 results.

I suspect that something in the raw data is causing this behaviour,
but I don't know what to look for. The data is not very clean: it
comes from OCR, and not all OCR errors have been corrected. Encoding
throws some warnings like

Malformed tag <, inserted literally (file lat2-vrt//0006752_lat2.vrt, line #6).

However, the same kind of warning occurred with a previous
installment of the same corpus, and there the CQP query worked fine.
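
One way to narrow down "what to look for" is to scan the input files
for lines that look odd. The following is only a rough sketch in
Python, not a description of what the encoder itself checks: the
function name, the two regular expressions and the notion of a
"suspicious" line are assumptions made up for illustration. It flags
lines that start with < but are not a complete tag on their own, and
token lines containing control characters such as a stray carriage
return.

```python
import re

# Heuristic only (an assumption, not the encoder's actual rule): a line counts
# as a structural tag if it is a single <name ...>, </name> or <name/> element
# with nothing else on the line.
TAG_LINE = re.compile(r"^</?[A-Za-z_][\w.-]*(\s[^<>]*)?/?>$")

# Control characters other than tab (the column separator) are unusual in a
# token line; this range includes the carriage return (\r).
CONTROL = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")


def suspicious_lines(lines):
    """Yield (line_number, reason, line) for lines worth a manual look."""
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if line.startswith("<") and not TAG_LINE.match(line):
            yield number, "tag-like line that is not a complete tag", line
        elif CONTROL.search(line):
            yield number, "control character in token line", line
```

To run it on a file, the lines could be read with something like
open(path, encoding=..., newline=""), where newline="" keeps any
carriage returns visible to the check; the path and the encoding
depend on the corpus and are not specified here.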

Any hints?

--Jörg Knappen


