[CWB] Fwd: Ngrams in CWB

Stefan Evert stefanML at collocations.de
Fri Mar 16 19:02:18 CET 2018


[I'm blocked from posting to CWBdev once again, so here's a direct re-send]

> Begin forwarded message:
> 
> From: Stefan Evert <stefanML at collocations.de>
> Subject: Re: [CWB] Ngrams in CWB
> Date: 16 March 2018 at 18:24:07 CET
> To: CWBdev Mailing List <cwb at sslmit.unibo.it>
> 
>> in CWB, is there a way to detect ngrams (e.g. trigrams) with one defined lexical item but without defining its exact position? I could merge the results of, say,
>> 
>> "word"[][]
>> []"word"[]
>> [][]"word"
> 
> There is no convenient way of doing this: you'll have to run three queries (or three passes with cwb-scan-corpus) and merge the results.  You could formally write it as a single query
> 
> 	A = "word" [] [] | [] "word" [] | [] [] "word";
> 
> but that's horribly inefficient for larger corpora.
> 
> If "word" is relatively rare, the most efficient approach should be as follows. This assumes that "word" never occurs within two tokens of the start or end of the corpus; otherwise the result would contain additional bigram and unigram entries alongside the trigrams:
> 
> 	W = "word";
> 	Tri = W;
> 	set Tri matchend rightmost [] within right 2 words; # "word" [] []
> 	A = W;
> 	set A match leftmost [] within left 1 word;  # [] "word" []
> 	set A matchend rightmost [] within right 1 word;
> 	Tri = union Tri A;
> 	A = W;
> 	set A match leftmost [] within left 2 words; # [] [] "word"
> 	Tri = union Tri A;
> 
> I agree this isn't the most intuitive and convenient code …
> 
> Stefan
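
For readers who want to check what the merged result should contain, the union of the three windows can be sketched outside CQP. This is only an illustration in Python, not CWB code: the function name `trigram_spans` and the token list are hypothetical, and the clipping at the corpus boundaries is an assumption about how the anchors behave, chosen to mirror the extra bigram and unigram entries mentioned above.

```python
from typing import List, Set, Tuple


def trigram_spans(tokens: List[str], word: str) -> Set[Tuple[int, int]]:
    """Return inclusive (start, end) positions of every 3-token window
    that contains `word`, merged like `union` (duplicates collapse)."""
    last = len(tokens) - 1
    spans: Set[Tuple[int, int]] = set()
    for pos, tok in enumerate(tokens):
        if tok != word:
            continue
        # left = number of tokens before "word" inside the window:
        # 0 -> "word" [] [],  1 -> [] "word" [],  2 -> [] [] "word"
        for left in (0, 1, 2):
            start = max(pos - left, 0)         # assumed: clipped at corpus start
            end = min(pos + (2 - left), last)  # assumed: clipped at corpus end
            spans.add((start, end))
    return spans


if __name__ == "__main__":
    corpus = "the quick brown fox jumps over the lazy dog".split()
    for start, end in sorted(trigram_spans(corpus, "fox")):
        print(start, end, " ".join(corpus[start:end + 1]))
```

With "fox" at position 3 of the nine-token list, this yields the spans (1, 3), (2, 4) and (3, 5). If the word sits at the very start of the token list, the clipped windows shrink to a bigram and a unigram, which is the boundary effect described in the message.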
