[Sigwac] Domain-classified corpora from the web

Serge Sharoff S.Sharoff at leeds.ac.uk
Sun May 11 09:18:21 CEST 2008


Adam,
There should be no major problems with implementing this for domains (any reasonable keyword selection mechanism for machine learning will do).  However, you have to take care of the coverage of your reference corpus: if it doesn't contain the kinds of texts expected in your specialised corpus, then it could be of little help.  Consider world-affairs texts in the BNC, which are described by keywords from the Thatcher era.
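
To make the coverage point concrete, here is a minimal sketch of one possible keyword selection mechanism (log-likelihood keyness, chosen only for illustration) together with a crude coverage check.  The function names and the toy token lists are hypothetical, not something from an existing tool.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood keyness: a, b are the word's frequencies in the
    specialised and reference corpora; c, d are the corpus sizes in tokens."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the specialised corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(spec_tokens, ref_tokens, top_n=10):
    """Words over-represented in the specialised corpus, strongest first."""
    spec, ref = Counter(spec_tokens), Counter(ref_tokens)
    n_spec, n_ref = len(spec_tokens), len(ref_tokens)
    scored = []
    for word, a in spec.items():
        b = ref[word]
        if a / n_spec > b / n_ref:  # keep positive keywords only
            scored.append((log_likelihood(a, b, n_spec, n_ref), word))
    scored.sort(reverse=True)
    return [word for _, word in scored[:top_n]]

def reference_coverage(kws, ref_tokens):
    """Share of the keywords that occur at all in the reference corpus."""
    ref_vocab = set(ref_tokens)
    if not kws:
        return 0.0
    return sum(1 for w in kws if w in ref_vocab) / len(kws)

# Toy data, purely illustrative.
spec_tokens = "thatcher miners strike thatcher poll tax".split()
ref_tokens = "the cat sat on the mat the tax".split()
top = keywords(spec_tokens, ref_tokens, top_n=3)
coverage = reference_coverage(top, ref_tokens)
```

A low coverage value would suggest that the reference corpus has few texts of the relevant kind, which is the situation where it is of little help.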

As for text types, your procedure is less feasible: it's difficult to collect genre-specific collections using keyword queries.  What I did was classify a subset of my target corpus manually into the text types I want, and then apply the classes to the bigger corpus.  This gives me a genre-classified ukWac:
http://corpus.leeds.ac.uk/serge/webgenres/
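
As a rough illustration of the "label a subset by hand, then apply to the rest" idea, here is a minimal sketch.  It assumes a bag-of-words multinomial Naive Bayes classifier, which is only a stand-in (real genre classification may well need other features), and the class name, labels and toy documents are all hypothetical.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, tokens):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + vocab_size
            score = math.log(n_docs / total_docs)  # class prior
            for t in tokens:
                if t in self.vocab:  # words unseen in training are skipped
                    score += math.log((counts[t] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)

# Manually labelled subset (toy data, hypothetical text types).
labelled_docs = [
    "we report results of an experiment".split(),
    "buy now free delivery on all orders".split(),
]
labelled_types = ["academic", "shopping"]

model = NaiveBayes().fit(labelled_docs, labelled_types)

# Apply the classes to the (here: tiny) bigger corpus.
bigger_corpus = ["free delivery today".split(), "results of the experiment".split()]
predicted = [model.predict(doc) for doc in bigger_corpus]
```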

Cheers,
Serge


-----Original Message-----
From: sigwac-bounces at sslmit.unibo.it on behalf of Adam Kilgarriff
Sent: Fri 5/9/2008 6:17 PM
To: sigwac at sslmit.unibo.it; Marco Baroni; Pavel Rychly; Jan Pomikálek; Niels Ott
Subject: [Sigwac] Domain-classified corpora from the web
 
All,

I'd like to automate the development of big clean corpora with domain and
text type classification.  I'm wondering about augmenting BootCaT techniques
with automatic document classification, as follows:

* start with seed words and general reference corpus
* BootCaT gives specialist corpus
* build (but don't overtrain) a classifier to distinguish specialist corpus
docs from general docs
* use classifier to weed out off-target docs from specialist corpus, and to
add on-target docs from reference corpus
* merge corpora, with appropriate headers, to give bigger corpus with
specialist-domain subcorpora

Has anyone tried this sort of thing?  Opinions on whether it is likely to
work also appreciated,
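
The classifier steps in the list above could be sketched roughly as follows.  This is only an illustration under stated assumptions: a smoothed per-word log-odds score stands in for the classifier, a margin around zero is used as a crude guard against over-confident decisions, and all function names, the margin value and the toy documents are hypothetical.

```python
import math
from collections import Counter

def train_log_odds(specialist_docs, general_docs):
    """Per-word log-odds (specialist vs general) with add-one smoothing."""
    spec = Counter(t for d in specialist_docs for t in d)
    gen = Counter(t for d in general_docs for t in d)
    vocab = set(spec) | set(gen)
    n_spec = sum(spec.values()) + len(vocab)
    n_gen = sum(gen.values()) + len(vocab)
    return {w: math.log((spec[w] + 1) / n_spec) - math.log((gen[w] + 1) / n_gen)
            for w in vocab}

def score(doc, weights):
    """Mean log-odds of the known words; positive means 'looks specialist'."""
    known = [weights[t] for t in doc if t in weights]
    return sum(known) / len(known) if known else 0.0

def reclassify(specialist_docs, general_docs, margin=0.2):
    """Drop clearly off-target specialist docs and pick up clearly
    on-target reference docs; docs inside the margin stay where they are."""
    weights = train_log_odds(specialist_docs, general_docs)
    kept = [d for d in specialist_docs if score(d, weights) > -margin]
    added = [d for d in general_docs if score(d, weights) > margin]
    return kept, added

# Toy data: one off-target doc in the specialist corpus,
# one on-target doc hiding in the reference corpus.
specialist = [
    ["enzyme", "protein", "cell"],
    ["protein", "cell", "assay"],
    ["enzyme", "assay", "protein"],
    ["football", "match", "goal"],
]
general = [
    ["football", "goal", "league"],
    ["match", "league", "goal"],
    ["football", "match", "league"],
    ["election", "vote", "poll"],
    ["protein", "enzyme", "cell"],
]
kept, added = reclassify(specialist, general)
domain_subcorpus = kept + added  # to be merged back with appropriate headers
```

Since the training labels are themselves noisy (they come from the BootCaT output), the smoothing and the margin are meant to keep the classifier from simply memorising that noise; how well this works on real data is an open question.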

Best

Adam


-- 
================================================
Adam Kilgarriff http://www.kilgarriff.co.uk
Lexical Computing Ltd http://www.sketchengine.co.uk
Lexicography MasterClass Ltd http://www.lexmasterclass.com
Universities of Leeds and Sussex adam at lexmasterclass.com
================================================
_______________________________________________
Sigwac mailing list
Sigwac at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/sigwac


