[Sigwac] Today's bit of interesting trivia ...

Serge Sharoff S.Sharoff at leeds.ac.uk
Tue Sep 9 22:13:43 CEST 2008


My guess is that the n-gram frequency is the total count taken before duplicate detection, so it includes every copy of repeated pages such as library lists:
http://www.mountsihighschool.com/library/AR4.0-4.4.htm
while the query output filters out the majority of nearly identical lists.
Still, I didn't find the n-gram database terribly useful for my tasks.
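
A minimal sketch of the guess above, with made-up data: the same list page is counted once per copy when no duplicate detection is applied, and once in total when exact duplicates are dropped first. The page texts, the function names and the exact-match deduplication are assumptions for illustration only; nothing here reflects how Google actually builds its n-gram database or filters query results.

```python
from collections import Counter


def ngrams(tokens, n=5):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(pages, n=5, dedup=False):
    """Count n-grams over pages, optionally skipping exact duplicate pages.

    Exact matching is a stand-in for near-duplicate detection, which is
    fuzzier in practice.
    """
    counts = Counter()
    seen = set()
    for text in pages:
        tokens = text.split()
        if dedup:
            key = tuple(tokens)
            if key in seen:
                continue
            seen.add(key)
        counts.update(ngrams(tokens, n))
    return counts


# Hypothetical toy "web": three copies of one list page plus one other page.
pages = ["Healing Time of Hickeys The Next Title"] * 3
pages.append("A different page with other words entirely")

target = ("Healing", "Time", "of", "Hickeys", "The")
raw = count_ngrams(pages)
deduped = count_ngrams(pages, dedup=True)
print(raw[target], deduped[target])
```

With this toy input the raw count for the 5-gram is 3 and the deduplicated count is 1, which is the same kind of gap as 3915 versus roughly 30, only smaller.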
Serge

-----Original Message-----
From: sigwac-bounces at sslmit.unibo.it on behalf of Stefan Evert
Sent: Tue 09/09/2008 21:03
To: SIGWAC Mailing List
Subject: [Sigwac] Today's bit of interesting trivia ...
 
Here's an entry from Google's 5-gram database -- you may remember how
enthusiastic people were on and off the corpora mailing list about its  
release two years ago:

	Healing Time of Hickeys The	3915

Now, if I type that into Google today:

	"Healing Time of Hickeys The"

I get approximately 30 hits (at least in Germany, perhaps that's the  
name of a terribly subversive group that the Chinese government  
doesn't like at all, so Google removed all references from its servers  
in order to get better business opportunities over in Beijing).

Talk about reliability and stability of Web counts ...

:o)
Stefan


