[Sigwac] Call for discussion: The SIGWAC crisis (instead of an announcement of WAC-XI)

Wed Jul 26 13:34:19 CEST 2017

Dear Chris,

thanks a lot (to you and everyone) for contributing to the discussion!

On 26.07.17 12:00, sigwac-request at sslmit.unibo.it wrote:
>
> Today's Topics:
> 
>    1. Re: Call for discussion: The SIGWAC crisis (instead of an
>       announcement of WAC-XI) (chris brew)
> 
> It makes complete sense for someone to study the problem of devising
> web-based corpora that are useful for scientific investigations that go
> beyond the purely technological. The Google Books collection and the
> various instances of BNC and ANC are excellent examples of what is needed.
> The existence of CQP and SketchEngine is a wonderful thing.
> 
> What is readily available now, but was not in the early days of SIGWAC, is
> large volumes of lightly-curated text suitable for use in building word
> vectors or for training language models that have good perplexity. The ACL
> community loves and uses these (rightly, in my view) but they are not at
> all the same thing as carefully thought out and designed corpora like the
> ones I mentioned in the previous paragraph.

Actually, this is what roughly 50% of COW users seem to be using them
for, even though we do not even advertise our corpora among CL people.
However, these users (as I said in my original post) just download the
data and disappear forever. (I am also pretty sure that most of them
start by discarding all the levels of linguistic annotation that we
spend most of our time working on. Maybe except sentence splitting and
POS tags.) They do not usually contribute to improving the quality of
the data because they never provide feedback and would most surely never
consider attending WAC events (except maybe if their paper got rejected
at the main conference). Funnily enough, they either never publish their
results, or they provide only very sparse feedback about their
publications. (We ask users to notify us of their publications based on
our data so that we can obtain funding and make sure our corpora
remain "curated resources".)

(This is not true of those CL researchers/groups with whom/which we
collaborate more closely, for example Stefan Evert or Sabine Schulte im
Walde's group at IMS.)

As a potential SIGWAC community, CL people who use web data are thus as
irrelevant as our purely linguistic users who do not care about corpus
design or the lexicographers who use SketchEngine but do not submit
anything to WAC (as in 2015). While I agree with contributors to this
discussion that focusing on linguistics might not be an optimal solution
and might even be a risky one, it is just one of two (or many)
sub-optimal/risky solutions.

> I'm not sure about continuing to co-locate with ACL. The proportion of
> regular attendees at ACL who have deep background in any form of
> linguistics continues to decline, and the proportion with an understanding
> of corpus linguistics has never been high. I suspect that
> the number of young attendees who have even heard of the BNC is very low

And I'm almost 100% sure that none of our linguistic users would
consider traveling to ACL (not even those few who have heard about it).
Pragmatically speaking, how should they even get their papers accepted
given the technocratic turn that CL has taken? It's simply impossible
for 99.9% of (corpus) linguists to keep up with the developments in CL
because it would take up too much time, and because the results would
not matter enough for their normal research.

> indeed. So the number of ACL people who would be drawn to WAC is probably
> fairly small. Added to which, the conference has fee schedules that are not
> really compatible with attendance by researchers who do not have the luxury
> (or, to an extent, burden) of large-money engineering-style grants.

I agree. But the situation is even worse: Co-locating with ACL is out of
the question if the ACL continues to have the member survey as part of
the workshop selection process. It is highly unlikely to be due to chance
that for 2017, basically nobody expressed an interest in attending WAC.

The fact that we had a large number of participants and talks in Berlin
(https://www.sigwac.org.uk/wiki/WAC-X) was, I think, due to the
following reasons:

– most importantly, we had the EmpiriST shared task which brought in its
own (German) community; I am sure that a shared task on tokenisation and
POS tagging of German would not have attracted most regular ACL members

– because it took place in Berlin and we advertised it to (corpus)
linguists, some (German) linguists (such as Krause or Würschinger et
al.) showed up (and unanimously complained about the absurd fees, by
the way); these contributors might NOT have traveled to LREC in Turkey
or ACL in the US, etc.

– some of the regular WAC contributors actually contributed (Adrien,
Serge, Felix and I), possibly because they were going to ACL anyway (I
don't know, of course, whether the co-location played a crucial role for
them) or because they were among the organisers

> The overlap between SIGWAC and the ACL community was stronger when there
> were many carefully curated annotated corpora being built by NLP teams.

And SIGWAC was strong when a lot of linguists did the opposite and just
experimented with web data (BootCaT era).

> This is not happening so much now. To the extent that I understand what the
> people who did this are now doing, it seems to me that crowdsourcing has
> risen, which usually implies shallower annotation. At the same time, some
> of those people are doing more with transformations and re-use of existing
> annotated corpora, as well as pushing towards methods that learn everything
> from raw text. This doesn't mesh well with SIGWAC's mission. The synergy is
> less than it was.

I couldn't agree more. As I said, the average CL researcher is a user of
web corpora at best.

> So I think the primary task is to identify a large enough community and
> co-locate with conferences that are compatible with that community.

Agreed. The planned CleanerEval ST on text quality evaluation at
different levels (paragraph and text, maybe even sentence) would be a
way to search for a community. However, CL researchers might simply see
it as a way to test some machinery that is en vogue in CL, which is
AFAIU the purpose of shared tasks. Whether they would attend subsequent
WAC workshops would remain to be seen.

However, the problem is IMO not the machinery, but the definition and
operationalisation of notions such as "text quality". This is something
I think should be discussed by linguists and computational linguists.
While I think that operationalisations like "text is good if it can be
used successfully to [solve some CL task that is en vogue]" are highly
useful and should by all means be used to evaluate resources, linguists
might be interested in additional levels of evaluation. (For example, a
few colleagues and I are currently writing a series of papers where we
correlate COW-derived models of grammatical alternation phenomena in
German with experimental findings from a cognitive linguistic
perspective. This was the kind of research I had in mind for the
linguistic part of WAC-XI.) Ideally, the two approaches should converge,
of course. If they don't, finding out why would be highly stimulating.

At the same time I have to admit that (corpus) linguists expressed their
lack of interest by not submitting anything to WAC-XI, where CleanerEval
was supposed to be discussed.

Well, I don't really have a perfect solution to offer, which is why I
want to thank everybody who has contributed to the discussion so far and
encourage everybody to continue the discussion. My strategy of choice
would be to try to talk to a smaller community of corpus linguists whom
we (= anyone who is interested) know personally in order to see whether
a substantial interest in CleanerEval could be raised. After all, a
similar approach worked for EmpiriST. Whether such a shared task, even
if successful, would lead to a sustained interest in WAC is impossible
to tell, of course.

Best,
Roland