[Sigwac] orthographic errors in web pages paper

Serge Sharoff s.sharoff at leeds.ac.uk
Thu Oct 12 17:52:35 CEST 2006


Hi Marco, 
thanks again.

I think you noticed the following passage:
--
     To achieve a more realistic scenario we randomly generated
quintuples, each collecting five terms of the 10,000 top frequent
German words. We used Google to retrieve 10 pages per query
(quintuple) until we obtained 1,000 pages. A considerable number of
the URLs were found to be inactive. After conversion to ASCII and a
preliminary analysis of error rates with methods described below,
some of the remaining pages were found to contain very large lists
of general keywords, including many orthographic errors.
--
I'm amazed that they managed to collect a reasonable corpus at all.
Zipf's law naturally drives such searches towards finding "very large
lists of general keywords".
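For anyone curious, the tuple-generation step described in the quoted
passage (random five-word combinations drawn from a frequency list and
sent off as search queries) can be sketched roughly as follows; the
word list, function name, and parameters here are placeholders, not
the authors' actual code:

```python
import random

def make_quintuples(wordlist, n_queries, seed=0):
    """Return n_queries random five-word search queries, each drawn
    without replacement from wordlist (a sketch of the procedure in
    Ringlstetter et al., not their implementation)."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return [" ".join(rng.sample(wordlist, 5)) for _ in range(n_queries)]

# Toy frequency list standing in for the real top-10,000 German words.
top_words = ["und", "die", "der", "das", "ist", "nicht", "ein", "sie"]
queries = make_quintuples(top_words, n_queries=3, seed=42)
for q in queries:
    print(q)
```

Each query would then be submitted to the search engine and the first
10 result pages collected, repeating until the target of 1,000 pages
is reached.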

S

On Thu, 2006-10-12 at 17:36 +0200, Marco Baroni wrote:
> Dear All,
> 
> Perhaps you already noticed, but, in case you didn't, the September issue 
> of Computational Linguistics features a paper that seems very relevant to 
> what some of us are doing:
> 
> Christoph Ringlstetter, Klaus U. Schulz and Stoyan Mihov: Orthographic 
> Errors in Web Pages - Towards Cleaner Web Corpora. CL 32(3): 295-340.
> 
> Regards,
> 
> Marco
> _______________________________________________
> Sigwac mailing list
> Sigwac at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/sigwac