[Sigwac] Is the list alive?

Serge Sharoff s.sharoff at leeds.ac.uk
Mon Aug 14 10:15:58 CEST 2006


Hi Robert,

As you can see, there is an 'all', even if there's little discussion going
on on the list.  Apart from being on holiday, we're busy formulating
guidelines for corpus preparation, so that the people cleaning our
test and development sets produce comparable results.  The idea is that
we (Adam, Marco and I) clean small test sets ourselves, identify
potential problems and formulate rules for solving them.  'All' are
welcome to join.  The current set for English is:
http://corpus.leeds.ac.uk/serge/english-crawl.tgz

We're also committed to cleaning Chinese webpages, for two reasons:
     1. unlike English and other Latin-1 languages, many languages
        have a variety of encodings for their scripts, so encoding
        identification is an important part of the cleaning algorithm
     2. Chinese is an example of a language without explicit
        orthographic word boundaries, while many cleaning methods rely
        on word statistics, so it's interesting to see how they cope
        with this problem
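
To make the two points concrete, here is a rough sketch, not our
actual cleaning code.  It guesses an encoding by trial decoding and
then counts character bigrams, which is one possible stand-in for word
statistics when there are no spaces to split on.  The candidate list
and the function names are assumptions made up for this illustration.
Note that trial decoding is weak: bytes in one Chinese encoding often
decode without error in another, so a real tool would need a
statistical detector on top.

```python
from collections import Counter

# Hypothetical candidate list. Order matters: the first encoding that
# decodes without error wins, so stricter encodings should come first.
CANDIDATE_ENCODINGS = ["utf-8", "gb18030", "big5"]


def guess_encoding(raw, candidates=CANDIDATE_ENCODINGS):
    """Return the first candidate that decodes `raw` cleanly, else None."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


def char_bigrams(text):
    """Count adjacent character pairs, ignoring whitespace.

    This needs no word segmentation, so it works for scripts that do
    not mark word boundaries.
    """
    chars = [c for c in text if not c.isspace()]
    return Counter(a + b for a, b in zip(chars, chars[1:]))


if __name__ == "__main__":
    page = "中文网页".encode("gb18030")
    encoding = guess_encoding(page)
    print(encoding)
    if encoding is not None:
        print(char_bigrams(page.decode(encoding)).most_common(3))
```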


The downside is that at the moment we have no money for the
cleaning activity.  My application for internal funding at Leeds failed.
We'll be applying to EPSRC, but cannot guarantee success.  So any
contribution to data cleaning is appreciated.

Serge

On Sat, 2006-08-12 at 18:14 +1000, Robert Dale wrote:
> Hi all [I'm hoping there's an 'all' to say Hi to!]
> 
> I don't seem to have received any mail on the SIGWAC list.  Has there really
> been no discussion?
> 
> R
> 
> _______________________________________________
> Sigwac mailing list
> Sigwac at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/sigwac

