[Sigwac] Today's bit of interesting trivia ...
Serge Sharoff
S.Sharoff at leeds.ac.uk
Tue Sep 9 22:13:43 CEST 2008
My guess is that the n-gram frequency reflects the total count before duplicate detection, which includes near-duplicate pages such as library lists:
http://www.mountsihighschool.com/library/AR4.0-4.4.htm
while the query output filters out the majority of nearly identical lists.
Still, I didn't find the n-gram database terribly useful for my tasks.
Serge
-----Original Message-----
From: sigwac-bounces at sslmit.unibo.it on behalf of Stefan Evert
Sent: Tue 09/09/2008 21:03
To: SIGWAC Mailing List
Subject: [Sigwac] Today's bit of interesting trivia ...
Here's an entry from Google's 5-gram database -- you may remember how
enthusiastic people were on and off the corpora mailing list about its
release two years ago:
Healing Time of Hickeys The 3915
Now, if I type that into Google today:
"Healing Time of Hickeys The"
I get approximately 30 hits (at least in Germany, perhaps that's the
name of a terribly subversive group that the Chinese government
doesn't like at all, so Google removed all references from its servers
in order to get better business opportunities over in Beijing).
Talk about reliability and stability of Web counts ...
:o)
Stefan
_______________________________________________
Sigwac mailing list
Sigwac at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/sigwac