[Sigwac] Call for discussion: The SIGWAC crisis (instead of an announcement of WAC-XI)

Roland Schäfer roland.schaefer at fu-berlin.de
Tue Aug 1 11:45:04 CEST 2017


Hello again Silvia, Miloš, and everyone,

this is a bit off topic, but still...

>> Now for our take on the issues currently on the table. We like to
>> think (or hope) that the BootCaT era is not completely over. As you
>> will probably remember, what originally got many of us interested in
>> the WaC approach was the idea (chimera?) of being one day able to
>> build a linguist's search engine, a free alternative to Google for
>> building corpora from the web and/or for conducting web-based
>> research steering clear of the pitfalls of Googleology (to use
>> Adam's term). Crucially, from our perspective as

Great to hear that there are still BootCaT (free version) users! I
really thought people had given up on it after the shutdown of the free
APIs by all major search engines.

As opposed to the standard BootCaT approach of ad-hoc corpus creation,
however, I have several problems with mere indices. Mainly:

1. In my corpus studies, I frequently run thousands of scripted
queries, automatically processing, filtering, and sampling from
hundreds of thousands of hits to obtain the final concordances. I do
not see how this would work with an index.
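
Just to illustrate the kind of scripted workflow I mean (the corpus
and the query function here are toy stand-ins, not any particular
tool; in practice each query would hit a corpus server):

```python
import random

# Toy stand-in for a corpus query API; a real study would send each
# query to a corpus server and collect concordance hits.
def run_query(corpus, pattern):
    return [line for line in corpus if pattern in line]

def scripted_study(corpus, patterns, keep, sample_size, seed=0):
    """Run many scripted queries, filter the hits, and draw a sample."""
    hits = []
    for p in patterns:
        hits.extend(run_query(corpus, p))
    filtered = [h for h in hits if keep(h)]
    rng = random.Random(seed)          # fixed seed -> reproducible sample
    k = min(sample_size, len(filtered))
    return rng.sample(filtered, k)

corpus = [
    "weil das Wetter gut ist",
    "weil es regnet",
    "obwohl es regnet",
    "denn es ist spaet",
]
sample = scripted_study(corpus, ["weil", "obwohl"],
                        keep=lambda h: "regnet" in h, sample_size=2)
```

The point is that every step (querying, filtering, sampling) is under
script control, which is exactly what a search-engine-style index does
not offer.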

2. Much of the metadata we are currently creating (such as grammatical
profiles of documents and topical classification), which should make
web corpora/web data more attractive, would have to be stored as some
kind of stand-off data. This really complicates matters beyond what
(most likely) anybody is willing to implement in a user-friendly way.
Also, generating this type of data (heck, even using the Stanford
parser!) either requires a true (not buzzword) big data infrastructure
(some Map-Reduce framework with many nodes constantly available) or a
LOT of time for off-line processing, even on traditional high
performance clusters. It would be difficult to implement this
effectively under an indexing approach.
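
To illustrate what I mean by stand-off data: metadata records living
apart from the indexed text, joined back via a document ID. The record
layout and field names below are entirely made up:

```python
import json

# Hypothetical stand-off record: the grammatical profile and topic
# label are stored separately from the document text and linked to it
# only through "doc_id". All field names are invented for illustration.
standoff = {
    "doc_id": "decow16-000042",
    "grammatical_profile": {"genitives_per_1k": 3.2, "passives_per_1k": 1.1},
    "topic": "science",
}

serialized = json.dumps(standoff, sort_keys=True)  # ship/store as JSON
restored = json.loads(serialized)                  # join back by doc_id
```

Keeping such records in sync with an ever-changing index is precisely
where the user-friendliness problem starts.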

3. More importantly, indices do not lead to reproducible results (which
was, AFAIR, one of Adam's main points in his seminal paper). Under the
current guidelines of the German Research Council (DFG, the main
third-party funding agency in Germany) on textual resources, for
example, mentioning in a grant application the planned use of results
obtained from web data through an index should theoretically prevent
the grant from being approved.

On 01.08.17 10:26, Miloš Jakubíček wrote:
> Dear Silvia,
> 
> thanks for raising the issue of BootCaT, some notes to that
> 
> 1) in Sketch Engine, Bing-based BootCaT (paid by us) is still free even for
> trial users. The pricing model of Bing was quite modest until the beginning
> of this year, when Microsoft changed their pricing plan, so now we pay for
> Bing queries about 10 times more than we used to, it still increases and I
> wonder where it will all end.
> 
> 2) we tried to switch to Google, but their policy is so strict that even
> with paying (a lot) there is still a very low hard upper bound on the
> number of queries, so Google search is basically out of the game

Wow! I was aware that you have this paid BootCaT service, but I wasn't
aware of Bing's new pricing policy. Do users really need the BootCaT
function, though? (If that is not confidential information.) Given the
size of the TenTen corpora, isn't there enough material in them for
everyone? (I don't really know how lexicographers work. From my
morpho-syntactic perspective, there is nothing I cannot find in the
COW16 corpora).

> An alternative to online-BootCaT might be something like offline-BootCaT
> where people build subcorpora from an existing large web corpus already
> crawled (yes this is already possible at the moment to some extent).
> But for that, we need to work on improving the crawling (which gets harder
> and harder as the web gets more javascript singlepage-based) and cleaning.
> Then still, one loses lots of benefits like getting up-to-date results
> scored by pagerank-based techniques etc. (hm, has anyone tried to
> calculate some sort of pagerank on the documents of a crawled web corpus?)

In my current project, I am doing similar things on DECOW16 and a German
web corpus corrected for crawling bias acquired by Random Walks
(RandyCOW). However, I am not allowed to disclose the results... Just
kidding, I'm simply not done yet.

Btw, we also published the full link data extracted from our massive
COW11, COW12, and COW14 crawls
(https://www.webcorpora.org/opendata/links/), even including our quality
and paragraph metrics for the document/paragraph containing each of the
links. So, anyone could have performed similar analyses. I am not aware,
however, that anyone ever downloaded these data sets.
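
Computing a pagerank from link data of this kind is conceptually
simple, by the way. A toy power-iteration sketch (the graph below is
made up; real input would be source/target document ID pairs as in the
published link data):

```python
# Minimal power-iteration PageRank over a document-level link graph.
# "links" is a list of (source_doc, target_doc) pairs; the example
# graph is invented purely for illustration.
def pagerank(links, damping=0.85, iterations=50):
    nodes = sorted({n for edge in links for n in edge})
    out = {n: [t for s, t in links if s == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = out[n]
            if targets:                  # distribute rank along out-links
                share = damping * rank[n] / len(targets)
                for t in targets:
                    new[t] += share
            else:                        # dangling node: spread rank evenly
                share = damping * rank[n] / len(nodes)
                for t in nodes:
                    new[t] += share
        rank = new
    return rank

ranks = pagerank([("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")])
```

On corpus-sized graphs one would of course use sparse adjacency
structures rather than the lists above, but the iteration itself is
the same.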

In any case, I can confirm that crawling is becoming less and less fun,
and I never really liked it to begin with. This is why we switched to
using CommonCrawl data (which also solved some other problems for us).
There will be a massive 'TenEleven' ENCOCO in 2017. For less popular
languages (say, with fewer than roughly 50 million speakers in developed
countries), this does not work, though. Still, moving towards large
curated raw crawl data might be the way to go for the future. I mean, if
only archive.org released their crawl data! Imagine what amazing
'historic' web corpora could be created from that. We contacted them
many times, however, asking for diverse types of collaboration under
some of their programmes, and they never even replied.

Best regards,
Roland

