[CWB] Encoded corpus shows hits for [word=".*"] but not for any real word---not even for [word="a.*"]
    Hardie, Andrew 
    a.hardie at lancaster.ac.uk
       
    Wed May  7 10:51:01 CEST 2014
    
    
  
Can I suggest you take a look at the output of cwb-decode and/or cwb-lexdecode to try and see what is actually in there? 
One possibility is that there is some rogue whitespace or other non-printing character at the start of every word. That would explain why a query starting with any actual letter gets zero results, while a query starting with . gets results.
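As a rough illustration of that check, here is a minimal Python sketch that looks for a whitespace or non-printing character at the start of each word form. It assumes the lexicon has already been dumped to plain text with one word form per line (for instance from cwb-lexdecode output; the exact invocation is not shown here). The function names suspicious_prefix and scan_tokens are made up for this example and are not part of CWB.

```python
import unicodedata
from collections import Counter
from typing import Iterable, Optional


def suspicious_prefix(token: str) -> Optional[str]:
    """Describe the first character of token if it is whitespace or
    non-printing; return None if the token starts with a normal character."""
    if not token:
        return "empty token"
    first = token[0]
    category = unicodedata.category(first)
    # Z* = separators (spaces), C* = control, format (BOM, zero-width), unassigned
    if first.isspace() or category[0] in ("Z", "C"):
        name = unicodedata.name(first, "<unnamed>")
        return f"U+{ord(first):04X} {name} ({category})"
    return None


def scan_tokens(lines: Iterable[str]) -> Counter:
    """Count suspicious leading characters over lines holding one token each.

    Assumption: each line is exactly one word form followed by a newline.
    """
    counts: Counter = Counter()
    for line in lines:
        token = line.rstrip("\n")  # drop only the newline, keep other whitespace
        problem = suspicious_prefix(token)
        if problem is not None:
            counts[problem] += 1
    return counts
```

Passing an open file or sys.stdin to scan_tokens returns a count per offending character. If nearly every word form is reported with the same character, that would point to the problem described above.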
best
Andrew.
-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Jörg Knappen
Sent: 07 May 2014 07:47
To: CWB at sslmit.unibo.it
Subject: [CWB] Encoded corpus shows hits for [word=".*"] but not for any real word---not even for [word="a.*"]
I have encoded a corpus with cqp-3.0 and found that the corpus query [word=".*"]; gives a lot of results, but every other query I tried gave 0 results.
I suspect that there is something in the raw data causing this behaviour, but I don't know what to look for. The data is not very clean, it comes from OCR and not all OCR errors are corrected. Encoding throws some warnings like
Malformed tag <, inserted literally (file lat2-vrt//0006752_lat2.vrt, line #6).
However, the same kind of warning occurred with a previous installment of the same corpus, and there the cqp query worked fine.
Any hints?
--Jörg Knappen
_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/cwb
    
    