[Sigwac] Domain-classified corpora from the web

Fri May 9 19:17:41 CEST 2008

All,

I'd like to automate the development of large, clean corpora with domain
and text-type classification.  I'm wondering about augmenting BootCaT
techniques with automatic document classification, as follows:

* start with seed words and general reference corpus
* BootCaT gives specialist corpus
* build (but don't overtrain) a classifier to distinguish specialist corpus
  docs from general docs
* use classifier to weed out off-target docs from specialist corpus, and to
  add on-target docs from reference corpus
* merge corpora, with appropriate headers, to give bigger corpus with
  specialist-domain subcorpora

Has anyone tried this sort of thing?  Opinions on whether it is likely to
work would also be appreciated.
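
To make the classifier steps concrete, here is a rough sketch.  It assumes
a simple add-one-smoothed naive Bayes word model in plain Python, chosen
only for illustration; the function names and the two thresholds
(drop_below, add_above) are hypothetical and would need tuning.  It also
scores the same documents it was trained on, which is optimistic; a
held-out or cross-validated variant would be a safer way to respect the
"don't overtrain" caveat.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real pipeline would do better.
    return text.lower().split()


def train(specialist_docs, general_docs):
    """Collect per-class word counts for a multinomial naive Bayes model."""
    counts = {"spec": Counter(), "gen": Counter()}
    for doc in specialist_docs:
        counts["spec"].update(tokenize(doc))
    for doc in general_docs:
        counts["gen"].update(tokenize(doc))
    vocab = set(counts["spec"]) | set(counts["gen"])
    return counts, vocab


def log_odds(doc, model):
    """Mean per-token log-odds of specialist vs general (add-one smoothing)."""
    counts, vocab = model
    v = len(vocab)
    n_spec = sum(counts["spec"].values())
    n_gen = sum(counts["gen"].values())
    tokens = [t for t in tokenize(doc) if t in vocab]
    if not tokens:
        return 0.0
    total = 0.0
    for t in tokens:
        p_spec = (counts["spec"][t] + 1) / (n_spec + v)
        p_gen = (counts["gen"][t] + 1) / (n_gen + v)
        total += math.log(p_spec / p_gen)
    return total / len(tokens)


def refine(specialist_docs, general_docs, drop_below=0.0, add_above=0.3):
    """Weed off-target docs out of the specialist corpus and pull on-target
    docs in from the reference corpus.  Thresholds are illustrative guesses."""
    model = train(specialist_docs, general_docs)
    kept = [d for d in specialist_docs if log_odds(d, model) >= drop_below]
    promoted = [d for d in general_docs if log_odds(d, model) > add_above]
    return kept + promoted
```

Calling refine(specialist_docs, general_docs) with two lists of document
strings returns the cleaned specialist subcorpus; the gap between the two
thresholds is meant to keep borderline documents where they already are.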

Best

Adam

-- 
================================================
Adam Kilgarriff http://www.kilgarriff.co.uk
Lexical Computing Ltd http://www.sketchengine.co.uk
Lexicography MasterClass Ltd http://www.lexmasterclass.com
Universities of Leeds and Sussex adam at lexmasterclass.com
================================================