[Sigwac] Domain-classified corpora from the web

Adam Kilgarriff adam at lexmasterclass.com
Mon May 12 12:22:30 CEST 2008


Thanks Serge.  And yes, I have other means in mind as starting points for
text-type corpora: we have the Oxford Children's Corpus, for example, for
the text type "writing for children".  So in this case it is a
"corpus-growing" exercise.

adam

2008/5/11 Serge Sharoff <S.Sharoff at leeds.ac.uk>:

> Adam,
> There should be no major problems with implementing this for domains (use
> any reasonable keyword selection mechanisms for machine learning).  However,
> you have to take care of the coverage of your reference corpus: if it
> doesn't contain texts expected in your specialised corpus, then it could be
> of little help.  Consider world-affairs texts in the BNC, described by
> keywords from the Thatcher era.
>
> As for text types, your procedure is less feasible: it's difficult to
> collect genre-specific collections using keyword queries.  What I did was
> classify a subset of my target corpus manually into the text types I want,
> and then apply the classes to the bigger corpus.  This gives me a
> genre-classified ukWaC:
> http://corpus.leeds.ac.uk/serge/webgenres/
>
> Cheers,
> Serge
>
>
> -----Original Message-----
> From: sigwac-bounces at sslmit.unibo.it on behalf of Adam Kilgarriff
> Sent: Fri 5/9/2008 6:17 PM
> To: sigwac at sslmit.unibo.it; Marco Baroni; Pavel Rychly; Jan Pomikálek;
> Niels Ott
> Subject: [Sigwac] Domain-classified corpora from the web
>
> All,
>
> I'd like to automate the development of big clean corpora with domain and
> text type classification.  I'm wondering about augmenting BootCaT techniques
> with automatic document classification, as follows:
>
> * start with seed words and general reference corpus
> * BootCaT gives specialist corpus
> * build (but don't overtrain) a classifier to distinguish specialist corpus
>   docs from general docs
> * use classifier to weed out off-target docs from specialist corpus, and to
>   add on-target docs from reference corpus
> * merge corpora, with appropriate headers, to give bigger corpus with
>   specialist-domain subcorpora
>
> Has anyone tried this sort of thing?  Opinions on whether it is likely to
> work also appreciated,
>
> Best
>
> Adam
>
>
> --
> ================================================
> Adam Kilgarriff http://www.kilgarriff.co.uk
> Lexical Computing Ltd http://www.sketchengine.co.uk
> Lexicography MasterClass Ltd http://www.lexmasterclass.com
> Universities of Leeds and Sussex adam at lexmasterclass.com
> ================================================
> _______________________________________________
> Sigwac mailing list
> Sigwac at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/sigwac
>
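
For illustration only, the classifier step proposed in the quoted message
(weed out off-target docs from the specialist corpus, add on-target docs from
the reference corpus) might be sketched as below.  This is a minimal sketch
under stated assumptions, not anything from BootCaT or from either poster:
the add-one-smoothed naive Bayes model, the whitespace tokeniser, the function
names (tokenize, train, log_score, margin, refine) and the toy documents are
all hypothetical.  It trains and scores on the same documents and relies on a
low-capacity model to avoid memorising the noisy labels; scoring each document
with a model that has not seen it would be a natural alternative.

```python
import math
from collections import Counter


def tokenize(text):
    # Crude whitespace tokeniser (assumption); real corpora need better.
    return text.lower().split()


def train(docs_by_label):
    """Collect per-label word counts for a multinomial naive Bayes model."""
    counts = {label: Counter() for label in docs_by_label}
    for label, docs in docs_by_label.items():
        for doc in docs:
            counts[label].update(tokenize(doc))
    vocab = set().union(*counts.values())
    n_docs = {label: len(docs) for label, docs in docs_by_label.items()}
    return counts, vocab, n_docs


def log_score(doc, label, model):
    """Log prior plus add-one-smoothed log likelihood of doc under label."""
    counts, vocab, n_docs = model
    denom = sum(counts[label].values()) + len(vocab)
    score = math.log(n_docs[label] / sum(n_docs.values()))
    for tok in tokenize(doc):
        if tok in vocab:  # tokens never seen in training are skipped
            score += math.log((counts[label][tok] + 1) / denom)
    return score


def margin(doc, model):
    # Positive means the doc looks more "specialist" than "general".
    return log_score(doc, "specialist", model) - log_score(doc, "general", model)


def refine(specialist, reference, threshold=0.0):
    """Split the specialist corpus into kept/removed docs and pick
    reference docs to add; ties count as off-target."""
    model = train({"specialist": specialist, "general": reference})
    kept = [d for d in specialist if margin(d, model) > threshold]
    removed = [d for d in specialist if margin(d, model) <= threshold]
    added = [d for d in reference if margin(d, model) > threshold]
    return kept, removed, added


# Toy data (invented): one off-target doc in the specialist corpus,
# one on-target doc hiding in the reference corpus.
specialist = [
    "heart surgery patient hospital",
    "patient heart disease treatment",
    "football match goal striker",
    "hospital treatment patient surgery",
]
reference = [
    "football goal match league",
    "football striker goal league",
    "patient hospital heart treatment",
    "election vote parliament minister",
]

kept, removed, added = refine(specialist, reference)
print("kept:", kept)
print("removed:", removed)
print("added:", added)
```

On this toy data the football document is removed from the specialist corpus
and the medical document from the reference corpus is added.  Raising the
threshold makes the filter stricter in both directions.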



-- 
================================================
Adam Kilgarriff http://www.kilgarriff.co.uk
Lexical Computing Ltd http://www.sketchengine.co.uk
Lexicography MasterClass Ltd http://www.lexmasterclass.com
Universities of Leeds and Sussex adam at lexmasterclass.com
================================================

