[CWB] Encoded corpus shows hits for [word=".*"] but not for any real word---not even for [word="a.*"]

Jörg Knappen j.knappen at mx.uni-saarland.de
Wed May 7 11:02:30 CEST 2014


Here is more information I got by inspecting the files:

The query [word=".*"]; sometimes matches __UNDEF__. That string comes from cqp;
it is not in the corpus. At those positions the corpus has empty lines
containing only a <TAB> character.
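
A check along the following lines could locate such lines in the input files
before encoding. This is only a minimal sketch: it assumes the input is
tab-separated .vrt text with the word form in the first column, that lines
starting with "<" are markup, and that the files are readable as UTF-8. The
function names are hypothetical and not part of CWB.

```python
from pathlib import Path
from typing import Iterable, Iterator, Tuple


def find_empty_tokens(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, line) for every line whose first column is empty."""
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith("<"):
            # Assumption: lines starting with "<" are structural markup.
            continue
        # The word form is assumed to be the first tab-separated column.
        token = line.split("\t", 1)[0]
        if token.strip() == "":
            yield number, line


def scan_directory(directory: str) -> None:
    """Print every suspicious line found in the *.vrt files of a directory."""
    for path in sorted(Path(directory).glob("*.vrt")):
        # errors="replace" keeps the scan going on badly encoded OCR output.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in find_empty_tokens(handle):
                print(f"{path}:{number}: {line!r}")
```

Calling scan_directory("lat2-vrt") would list each file and line number where
the word column is empty, including lines that hold only a <TAB> character.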


An example hit looks like this:

   101: nim Zmtrzcknoici ep = 1 ' <__UNDEF__> " SM - J " * " " * eo W

(The " * " " * stuff comes from the corpus---maybe some decoration
found on a title page?)

--Jörg Knappen

Quoting Jörg Knappen <j.knappen at mx.uni-saarland.de>:

> I have encoded a corpus with cqp-3.0 and found that the corpus query
> [word=".*"];
> gives a lot of results, but any other query I tried gave 0 results.
>
> I suspect that there is something in the raw data causing this behaviour,
> but I don't know what to look for. The data is not very clean: it comes
> from OCR, and not all OCR errors have been corrected. Encoding throws some
> warnings like
>
> Malformed tag <, inserted literally (file lat2-vrt//0006752_lat2.vrt, line #6).
>
> However, the same kind of warning occurred with a previous installment of
> the same corpus, and there the cqp query worked fine.
>
> Any hints?
>
> --Jörg Knappen




