[Sigwac] Call for discussion: The SIGWAC crisis (instead of an announcement of WAC-XI)

Miloš Jakubíček milos.jakubicek at sketchengine.co.uk
Tue Aug 1 12:22:41 CEST 2017


Hi Roland,

On 1 August 2017 at 11:45, Roland Schäfer <roland.schaefer at fu-berlin.de>
wrote:

>
> As opposed to the standard BootCaT approach of ad-hoc corpus creation, I
> have many problems with mere indices, however. Mainly:
>
> 1. In my corpus studies, I frequently run thousands of scripted queries,
> automatically processing, filtering, and sampling-from hundreds of
> thousands of hits to obtain the final concordances. With an index, I do
> not see this happen.
>

Sorry, I do not follow. What do you mean by index here? Can you please
explain?


> 2. Many of the meta data we are currently creating (such as grammatical
> profiles of documents and topical classification), and which should make
> web corpora/web data more attractive, would have to be stored as some
> kind of stand-off data. This really complicates matters beyond what
> (most likely) anybody is willing to implement in a user-friendly way.
> Also, generating this type of data (heck, even using the Stanford
> parser!) either requires a true (not buzzword) big data infrastructure
> (some Map-Reduce framework with many nodes constantly available) or a
> LOT of time for off-line processing, even on traditional high
> performance clusters. It would be difficult to implement this
> effectively under an indexing approach.
>

In NoSketch Engine (which is, just to repeat, free and open source, and
will remain so :) and, by the way, I think also in CQP workbench, the
indices are fairly independent of one another. The text data and metadata
indices are completely independent, so you can move metadata index files
around without harming the corpus text.
I am not sure whether that is what you meant, though.


> 3. More importantly, indices do not lead to reproducible results (which
> was AFAIR one of Adam's main points in his seminal paper). Under the
> current guidelines of the German Research Council (DFG, the main
> third-party funding agency in DE) on textual resources, for example,
> mentioning the planned use of results obtained from web data using an
> index in a grant application should theoretically stand in the way of
> approving the grant.
>

I think I still don't get what the indices stand for here. Probably not an
index in the computer/database sense?
(At least I don't understand why that would stand in the way of any
funding...)

You make me wonder! ;)

Best
Milos

