[Sigwac] Call for discussion: The SIGWAC crisis (instead of an announcement of WAC-XI)

Tue Aug 1 10:26:31 CEST 2017

Dear Silvia,

thanks for raising the issue of BootCaT. Some notes on that:

1) In Sketch Engine, Bing-based BootCaT (paid for by us) is still free, even
for trial users. Bing's pricing was quite modest until the beginning of this
year, when Microsoft changed their pricing plan. We now pay about 10 times
more for Bing queries than we used to, the price is still increasing, and I
wonder where it will all end.

2) We tried to switch to Google, but their policy is so strict that even
when paying (a lot) there is still a very low hard upper bound on the
number of queries, so Google search is basically out of the game.

While, business-wise, I understand both the Google policy (they do not want
someone else to do what they do) and the M$ pricing (the service was underpriced
for years), you are definitely right that the current situation does not
look very stable.
Current BootCaT is basically a hostage of Bing, which might just pull the
plug at any point.
While there are some other services, I'm not aware of any providing the
same kind of API and quality and speed etc.

An alternative to online-BootCaT might be something like offline-BootCaT,
where people build subcorpora from an existing large web corpus that has
already been crawled (yes, this is already possible to some extent).
But for that, we need to work on improving the crawling (which gets harder
and harder as the web becomes more JavaScript- and single-page-based) and
the cleaning. Even then, one loses lots of benefits, like getting up-to-date
results scored by PageRank-based techniques etc. (Hm, has anyone tried to
calculate some sort of PageRank on the documents of a crawled web corpus?)
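
(To make the PageRank question concrete, here is a minimal sketch. It assumes
the crawler keeps each document's outgoing links, and that links pointing
outside the corpus are simply dropped. The names `pagerank` and `links` and
the toy URLs are made up for illustration; this is plain power iteration, not
a description of what any existing tool does.)

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over {doc_id: [linked doc_ids]}.

    Assumptions of this sketch: links to documents outside the corpus
    are ignored, and documents without in-corpus out-links (dangling
    nodes) spread their rank uniformly over all documents.
    """
    docs = set(links)
    n = len(docs)
    if n == 0:
        return {}
    # Keep only in-corpus targets; drop self-links and duplicates.
    out = {d: {t for t in links[d] if t in docs and t != d} for d in docs}
    rank = {d: 1.0 / n for d in docs}
    for _ in range(iterations):
        # Rank mass held by dangling documents is redistributed evenly.
        dangling = sum(rank[d] for d in docs if not out[d])
        new = {d: (1.0 - damping) / n + damping * dangling / n for d in docs}
        for d in docs:
            if out[d]:
                share = damping * rank[d] / len(out[d])
                for t in out[d]:
                    new[t] += share
        rank = new
    return rank


if __name__ == "__main__":
    # Toy link graph; the external URL is not part of the corpus.
    example = {
        "a": ["b", "c"],
        "b": ["c"],
        "c": ["a"],
        "d": ["c", "http://example.org/outside"],
    }
    for doc, score in sorted(pagerank(example).items(), key=lambda x: -x[1]):
        print(doc, round(score, 4))
```

Whether such scores are useful would presumably depend on how much of the
link graph survives inside the corpus after deduplication and cleaning.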

As for YaCy -- technologically I like it, but I'm a bit afraid that at the
moment its index might be too small and therefore too biased -- this might
be a built-in issue: if a big corporation starts indexing their data with
YaCy, will it be able to skew the results?

Best
Milos

Milos Jakubicek

CEO, Lexical Computing
Brno, CZ | Brighton UK
http://www.lexicalcomputing.com
http://www.sketchengine.co.uk

On 31 July 2017 at 11:39, Silvia Bernardini <silvia.bernardini at unibo.it>
wrote:

> Dear all,
>
> apologies for this late and possibly slightly off-topic reply from the
> Forlì-Bologna group. We have been following the exchange so far, and
> believe it is in itself a very positive thing, obliging us to reflect on
> our interests (past, present and future) and the time we can devote to them.
> So we are grateful to Roland in the first place for taking the initiative,
> and to everyone else for sharing their thoughts.
>
> Now for our take on the issues currently on the table. We like to think
> (or hope) that the BootCaT era is not completely over. As you will probably
> remember, what originally got many of us interested in the WaC approach was
> the idea (chimera?) of being one day able to build a linguist's search
> engine, a free alternative to Google for building corpora from the web
> and/or for conducting web-based research steering clear of the pitfalls
> of Googleology (to use Adam's term). Crucially, from our perspective as
> translators/terminologists and translator trainers (as well as corpus
> linguists), what mattered the most were not so much the very large WaC
> corpora (however useful) but the small DIY corpora that single users could
> build for themselves. Hence BootCaT.
>
> BootCaT still has a large and keen user community, which is currently
> worrying (as we are) about the future of the tool, now that Bing has
> virtually stopped giving out free search APIs. We have been thinking for
> some time about what to do next: give up on BootCaT completely, or go back
> to where we began, the linguist's search engine.
>
> In the past couple of months we have been experimenting with this tool:
> http://yacy.net/en/index.html, and while
> it is too early to say whether anything good will come out of it (or indeed
> of similar tools we don't know about), the whole idea of a shared search
> engine for corpus linguists seems fascinating (and not only, we hope, due
> to nostalgia for our age of innocence). It could bring back interest from
> the corpus linguistics community, as well as help us to reach out to
> academic communities of informed users (applied linguists, language
> professionals, discourse and media studies people), that we in Forlì also
> belong to.
>
> I am not saying of course that this strand would be enough to
> single-handedly revive interest in WaC, nor that all the members of the WaC
> community would warm to it. But it might be worth taking it on board,
> together with the other research topics mentioned in previous emails.
>
> As I said, possibly off-topic, but hopefully not completely irrelevant.
>
> silvia (and the Forlì group)
>
>
> > On 26 Jul 2017, at 13:34, Roland Schäfer <roland.schaefer at fu-berlin.de> wrote:
> >
> > Dear Chris,
> >
> > thanks a lot (to you and everyone) for contributing to the discussion!
> >
> > On 26.07.17 12:00, sigwac-request at sslmit.unibo.it wrote:
> >>
> >> Today's Topics:
> >>
> >>   1. Re: Call for discussion: The SIGWAC crisis (instead of an
> >>      announcement of WAC-XI) (chris brew)
> >>
> >> It makes complete sense for someone to study the problem of devising
> >> web-based corpora that are useful for scientific investigations that go
> >> beyond the purely technological. The Google Books collection and the
> >> various instances of BNC and ANC are excellent examples of what is
> >> needed. The existence of CQP and SketchEngine is a wonderful thing.
> >>
> >> What is readily available now, but was not in the early days of SIGWAC,
> >> is large volumes of lightly-curated text suitable for use in building
> >> word vectors or for training language models that have good perplexity.
> >> The ACL community loves and uses these (rightly, in my view) but they
> >> are not at all the same thing as carefully thought out and designed
> >> corpora like the ones I mentioned in the previous paragraph.
> >
> > Actually, this is what roughly 50% of COW users seem to be using them
> > for, even though we do not even advertise our corpora among CL people.
> > However, these users (as I said in my original post) just download the
> > data and disappear forever. (I am also pretty sure that most of them
> > start by discarding all the levels of linguistic annotation that we
> > spend most of our time working on. Maybe except sentence splitting and
> > POS tags.) They do not usually contribute to improving the quality of
> > the data because they never provide feedback and would most surely never
> > consider attending WAC events (except maybe if their paper got rejected
> > at the main conference). Funnily enough, they either never publish their
> > results, or they provide only very sparse feedback about their
> > publications. (We ask users to notify us of their publications based on
> > our data such that we can obtain funding and make sure our corpora
> > remain "curated resources".)
> >
> > (This is not true of those CL researchers/groups with whom/which we
> > collaborate more closely, for example Stefan Evert or Sabine Schulte im
> > Walde's group at IMS.)
> >
> > As a potential SIGWAC community, CL people who use web data are thus as
> > irrelevant as our purely linguistic users who do not care about corpus
> > design or the lexicographers who use SketchEngine but do not submit
> > anything to WAC (as in 2015). While I agree with contributors to this
> > discussion that focusing on linguistics might be a sub-optimal and even
> > a risky solution, it is just one of two (or many) sub-optimal/risky
> > solutions.
> >
> >> I'm not sure about continuing to co-locate with ACL. The proportion of
> >> regular attendees at ACL who have deep background in any form of
> >> linguistics continues to decline, and the proportion with an
> >> understanding of corpus linguistics has never been high. I suspect that
> >> the number of young attendees who have even heard of the BNC is very low
> >
> > And I'm almost 100% sure that none of our linguistic users would
> > consider traveling to ACL (not even those few who have heard about it).
> > Pragmatically speaking, how should they even get their papers accepted
> > given the technocratic turn that CL has taken? It's simply impossible
> > for 99.9% of (corpus) linguists to keep up with the developments in CL
> > because it would take up too much time, and because the results would
> > not matter enough for their normal research.
> >
> >> indeed. So the number of ACL people who would be drawn to WAC is
> >> probably fairly small. Added to which, the conference has fee schedules
> >> that are not really compatible with attendance by researchers who do
> >> not have the luxury (or, to an extent, burden) of large-money
> >> engineering-style grants.
> >
> > I agree. But the situation is even worse: Co-locating with ACL is out of
> > the question if the ACL continues to have the member survey as part of
> > the workshop selection process. It is highly unlikely to be due to chance
> > that, for 2017, basically nobody expressed their interest in attending WAC.
> >
> > The fact that we had a large number of participants and talks in Berlin
> > (https://www.sigwac.org.uk/wiki/WAC-X) was, I think, due to the
> > following reasons:
> >
> > – most importantly, we had the EmpiriST shared task which brought in its
> > own (German) community; I am sure that a shared task on tokenisation and
> > POS tagging of German would not have attracted most regular ACL members
> >
> > – because it took place in Berlin and we advertised it to (corpus)
> > linguists, some (German) linguists (such as Krause or Würschinger et
> > al.) showed up (and unanimously complained about the absurd fees, by
> > the way); these contributors might NOT have traveled to LREC in Turkey
> > or ACL in the US, etc.
> >
> > – some of the regular WAC contributors actually contributed (Adrien,
> > Serge, Felix and I), possibly because they were going to ACL anyway (I
> > don't know, of course, whether the co-location played a crucial role for
> > them) or because they were among the organisers
> >
> >> The overlap between SIGWAC and the ACL community was stronger when there
> >> were many carefully curated annotated corpora being built by NLP teams.
> >
> > And SIGWAC was strong when a lot of linguists did the opposite and just
> > experimented with web data (BootCaT era).
> >
> >> This is not happening so much now. To the extent that I understand what
> >> the people who did this are now doing, it seems to me that crowdsourcing
> >> has risen, which usually implies shallower annotation. At the same time,
> >> some of those people are doing more with transformations and re-use of
> >> existing annotated corpora, as well as pushing towards methods that
> >> learn everything from raw text. This doesn't mesh well with SIGWAC's
> >> mission. The synergy is less than it was.
> >
> > I couldn't agree more. As I said, the average CL researcher is a user of
> > web corpora at best.
> >
> >> So I think the primary task is to identify a large enough community and
> >> co-locate with conferences that are compatible with that community.
> >
> > Agreed. The planned CleanerEval ST on text quality evaluation at
> > different levels (paragraph and text, maybe even sentence) would be a
> > way to search for a community. However, CL researchers might simply see
> > it as a way to test some machinery that is en vogue in CL, which is
> > AFAIU the purpose of shared tasks. Whether they would attend subsequent
> > WAC workshops would remain to be seen.
> >
> > However, the problem is IMO not the machinery, but the definition and
> > operationalisation of notions such as "text quality". This is something
> > I think should be discussed by linguists and computational linguists.
> > While I think that operationalisations like "text is good if it can be
> > used successfully to [solve some CL task that is en vogue]" are highly
> > useful and should by all means be used to evaluate resources, linguists
> > might be interested in additional levels of evaluation. (For example, a
> > few colleagues and I are currently writing a series of papers where we
> > correlate COW-derived models of grammatical alternation phenomena in
> > German with experimental findings under a cognitive linguistic
> > perspective. This was the kind of research I had in mind for the
> > linguistic part of WAC-XI.) Ideally, the two approaches should converge,
> > of course. If they don't, finding out why would be highly stimulating.
> >
> > At the same time I have to admit that (corpus) linguists expressed their
> > lack of interest by not submitting anything to WAC-XI, where CleanerEval
> > was supposed to be discussed.
> >
> > Well, I don't really have a perfect solution to offer, which is why I
> > want to thank everybody who has contributed to the discussion so far and
> > encourage anybody to continue the discussion. My strategy of choice
> > would be to try to talk to a smaller community of corpus linguists whom
> > we (= anyone who is interested) know personally in order to see whether
> > a substantial interest in CleanerEval could be raised. After all, a
> > similar approach worked for EmpiriST. Whether such a shared task, even
> > if successful, would lead to a sustained interest in WAC is impossible
> > to tell, of course.
> >
> > Best,
> > Roland
> > _______________________________________________
> > Sigwac mailing list
> > Sigwac at sslmit.unibo.it
> > http://liste.sslmit.unibo.it/mailman/listinfo/sigwac