[Sigwac] Call for discussion: The SIGWAC crisis (instead, of an announcement of WAC-XI)

Miloš Jakubíček milos.jakubicek at sketchengine.co.uk
Wed Aug 2 10:54:00 CEST 2017

Hi again,

On 1 August 2017 at 11:45, Roland Schäfer <roland.schaefer at fu-berlin.de> wrote:

> Do users really need the BootCaT
> function, though? (If that is not confidential information.) Given the
> size of the TenTen corpora, isn't there enough material in them for
> everyone? (I don't really know how lexicographers work. From my
> morpho-syntactic perspective, there is nothing I cannot find in the
> COW16 corpora).

Well, they certainly do use it -- a brief check of our logs shows almost
200,000 queries issued to Bing and 2,500 corpora being built (many of them
iteratively) in the past six months (since the beginning of February).

Now, whether they need to use it: I know you like saying Sketch Engine
is mainly for lexicographers, but as a matter of fact the vast majority of
our users are not lexicographers; they are just linguists, researchers,
students, translators, terminologists, and copywriters. Each of them has
quite a different workflow and goals.
While I can imagine that for finding domain texts (defined in easy ways,
such as "biology", to make our world simple ;), creating a subcorpus from a
web corpus would/should work just fine, there are many cases where this does
not work, such as when people look for current topics and trends (not yet
covered in the web corpus; say, recent articles on Trump) or where their
domain of interest is very narrow and specific (that's often the case for
translators and terminologists). I simply believe the search engines still
do -- in terms of searching the internet -- a much better job than we do ;)

> Btw, we also published the full link data extracted from our massive
> COW11, COW12, and COW14 crawls
> (https://www.webcorpora.org/opendata/links/), even including our quality
> and paragraph metrics for the document/paragraph containing each of the
> links. So, anyone could have performed similar analyses. I am not aware,
> however, that anyone ever downloaded these data sets.

Great, I didn't know -- you will get some downloads shortly ;)
I will definitely have somebody compare them -- it might be interesting to
see in which corners of the web our crawlers differ, actually.

