[Sigwac] Is the list alive?
Serge Sharoff
s.sharoff at leeds.ac.uk
Mon Aug 14 10:15:58 CEST 2006
Hi Robert,
as you see, there's an 'all', even if there's little discussion going on
on the list. Apart from being on holidays, we're busy with formulating
guidelines for corpus preparation to ensure that people cleaning our
test and development sets produce comparable results. The idea is that
we clean small test sets ourselves (we: Adam, Marco and me), identify
potential problems and formulate rules for solving them. 'All' is
welcome to join. The current set for English is:
http://corpus.leeds.ac.uk/serge/english-crawl.tgz
We're also committed to cleaning Chinese webpages for two reasons:
1. unlike English and other Latin1 languages, many other languages
have a variety of encodings for their scripts, so encoding
identification is an important part of the cleaning algorithm
2. Chinese is an example of a language without explicit
orthographic word boundaries, while many cleaning methods rely
on word statistics, so it's interesting to see how they can cope
with this problem
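
To make the two points above concrete, here is a minimal Python sketch. It is not the cleaning procedure discussed on the list: the function names, the candidate encoding list and the sample strings are assumptions chosen for illustration, and trial decoding is only a crude stand-in for real encoding identification.

```python
# Illustrative sketch only; names and the candidate list are hypothetical.

def decode_with_fallback(raw, candidates=("utf-8", "gb18030", "big5")):
    """Return (text, encoding) for the first candidate that decodes raw cleanly.

    Trial decoding is a crude heuristic: a wrong encoding can still decode
    without error, so the order of candidates matters.
    """
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this never fails.
    return raw.decode("latin-1"), "latin-1"


def whitespace_tokens(text):
    """Split on whitespace, the way many word-statistics methods tokenize."""
    return text.split()


english = "we clean small test sets"
chinese = "我们清理小测试集"

# The same Chinese text arrives as different bytes depending on the encoding.
text_gb, enc_gb = decode_with_fallback(chinese.encode("gb18030"))
text_u8, enc_u8 = decode_with_fallback(chinese.encode("utf-8"))

# Whitespace splitting finds word boundaries in English but not in Chinese.
n_english = len(whitespace_tokens(english))
n_chinese = len(whitespace_tokens(chinese))
```

With whitespace splitting the English sample falls into separate words while the Chinese sample stays a single token, which is why methods built on word statistics need a segmentation step (or character-level statistics) first.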
The negative side is that at the moment we don't have money for the
cleaning activity. My application for internal funding in Leeds failed.
We'll be applying to EPSRC, but cannot guarantee success. So any
contribution to data cleaning is appreciated.
Serge
On Sat, 2006-08-12 at 18:14 +1000, Robert Dale wrote:
> Hi all [I'm hoping there's an 'all' to say Hi to!]
>
> I don't seem to have received any mail on the SIGWAC list. Has there really
> been no discussion?
>
> R
>
> _______________________________________________
> Sigwac mailing list
> Sigwac at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/sigwac