[Sigwac] Call for discussion: The SIGWAC crisis (instead of an announcement of WAC-XI)

Mon Jul 31 11:39:06 CEST 2017

Dear all,

apologies for this late and possibly slightly off-topic reply from the Forlì-Bologna group. We have been following the exchange so far, and believe it is in itself a very positive thing, obliging us to reflect on our interests (past, present and future) and the time we can devote to them. So we are grateful to Roland in the first place for taking the initiative, and to everyone else for sharing their thoughts.

Now for our take on the issues currently on the table. We like to think (or hope) that the BootCaT era is not completely over. As you will probably remember, what originally got many of us interested in the WaC approach was the idea (chimera?) of being one day able to build a linguist's search engine, a free alternative to Google for building corpora from the web and/or for conducting web-based research, steering clear of the pitfalls of Googleology (to use Adam's term). Crucially, from our perspective as translators/terminologists and translator trainers (as well as corpus linguists), what mattered the most were not so much the very large WaC corpora (however useful) but the small DIY corpora that single users could build for themselves. Hence BootCaT.

BootCaT still has a large and keen user community, which is currently worrying (as we are) about the future of the tool, now that Bing has virtually stopped giving out free search APIs. We have been thinking for some time about what to do next: give up on BootCaT completely, or go back to where we began, the linguist's search engine.

In the past couple of months we have been experimenting with this tool: http://yacy.net/en/index.html, and while it is too early to say whether anything good will come out of it (or indeed of similar tools we don't know about), the whole idea of a shared search engine for corpus linguists seems fascinating (and not only, we hope, due to nostalgia for our age of innocence). It could bring back interest from the corpus linguistics community, as well as help us to reach out to academic communities of informed users (applied linguists, language professionals, discourse and media studies people) to which we in Forlì also belong.

I am not saying of course that this strand would be enough to single-handedly revive interest in WaC, nor that all the members of the WaC community would warm to it. But it might be worth taking it on board, together with the other research topics mentioned in previous emails.

As I said, possibly off-topic, but hopefully not completely irrelevant.

silvia (and the Forlì group)

> On 26 Jul 2017, at 13:34, Roland Schäfer <roland.schaefer at fu-berlin.de> wrote:
> 
> Dear Chris,
> 
> thanks a lot (to you and everyone) for contributing to the discussion!
> 
> On 26.07.17 12:00, sigwac-request at sslmit.unibo.it wrote:
>> 
>> Today's Topics:
>> 
>>   1. Re: Call for discussion: The SIGWAC crisis (instead of an
>>      announcement of WAC-XI) (chris brew)
>> 
>> It makes complete sense for someone to study the problem of devising
>> web-based corpora that are useful for scientific investigations that go
>> beyond the purely technological. The Google Books collection and the
>> various instances of BNC and ANC are excellent examples of what is needed.
>> The existence of CQP and SketchEngine is a wonderful thing.
>> 
>> What is readily available now, but was not in the early days of SIGWAC, is
>> large volumes of lightly-curated text suitable for use in building word
>> vectors or for training language models that have good perplexity. The ACL
>> community loves and uses these (rightly, in my view) but they are not at
>> all the same thing as carefully thought out and designed corpora like the
>> ones I mentioned in the previous paragraph.
> 
> Actually, this is what roughly 50% of COW users seem to be using them
> for, even though we do not even advertise our corpora among CL people.
> However, these users (as I said in my original post) just download the
> data and disappear forever. (I am also pretty sure that most of them
> start by discarding all the levels of linguistic annotation that we
> spend most of our time working on. Maybe except sentence splitting and
> POS tags.) They do not usually contribute to improving the quality of
> the data because they never provide feedback and would most surely never
> consider attending WAC events (except maybe if their paper got rejected
> at the main conference). Funnily enough, they either never publish their
> results, or they provide only very sparse feedback about their
> publications. (We ask users to notify us of their publications based on
> our data so that we can obtain funding and make sure our corpora
> remain "curated resources".)
> 
> (This is not true of those CL researchers/groups with whom/which we
> collaborate more closely, for example Stefan Evert or Sabine Schulte im
> Walde's group at IMS.)
> 
> As a potential SIGWAC community, CL people who use web data are thus as
> irrelevant as our purely linguistic users who do not care about corpus
> design or the lexicographers who use SketchEngine but do not submit
> anything to WAC (as in 2015). While I agree with contributors to this
> discussion that focusing on linguistics might not be an optimal solution,
> and might even be a risky one, it is just one of two (or many)
> sub-optimal/risky solutions.
> 
>> I'm not sure about continuing to co-locate with ACL. The proportion of
>> regular attendees at ACL who have deep background in any form of
>> linguistics continues to decline, and the proportion with an understanding
>> of corpus linguistics has never been high. I suspect that
>> the number of young attendees who have even heard of the BNC is very low
> 
> And I'm almost 100% sure that none of our linguistic users would
> consider traveling to ACL (not even those few who have heard about it).
> Pragmatically speaking, how should they even get their papers accepted
> given the technocratic turn that CL has taken? It's simply impossible
> for 99.9% of (corpus) linguists to keep up with the developments in CL
> because it would take up too much time, and because the results would
> not matter enough for their normal research.
> 
>> indeed. So the number of ACL people who would be drawn to WAC is probably
>> fairly small. Added to which, the conference has fee schedules that are not
>> really compatible with attendance by researchers who do not have the luxury
>> (or, to an extent, burden) of large-money engineering-style grants.
> 
> I agree. But the situation is even worse: Co-locating with ACL is out of
> the question if the ACL continues to have the member survey as part of
> the workshop selection process. It is highly unlikely due to chance that
> for 2017, basically nobody expressed their interest in attending WAC.
> 
> The fact that we had a large number of participants and talks in Berlin
> (https://www.sigwac.org.uk/wiki/WAC-X) was, I think, due to the
> following reasons:
> 
> – most importantly, we had the EmpiriST shared task which brought in its
> own (German) community; I am sure that a shared task on tokenisation and
> POS tagging of German would not have attracted most regular ACL members
> 
> – because it took place in Berlin and we advertised it to (corpus)
> linguists, some (German) linguists (such as Krause or Würschinger et
> al.) showed up (who unanimously complained about the absurd fees, by
> the way); these contributors might NOT have traveled to LREC in Turkey
> or ACL in the US, etc.
> 
> – some of the regular WAC contributors actually contributed (Adrien,
> Serge, Felix and I), possibly because they were going to ACL anyway (I
> don't know, of course, whether the co-location played a crucial role for
> them) or because they were among the organisers
> 
>> The overlap between SIGWAC and the ACL community was stronger when there
>> were many carefully curated annotated corpora being built by NLP teams.
> 
> And SIGWAC was strong when a lot of linguists did the opposite and just
> experimented with web data (BootCaT era).
> 
>> This is not happening so much now. To the extent that I understand what the
>> people who did this are now doing, it seems to me that crowdsourcing has
>> risen, which usually implies shallower annotation. At the same time, some
>> of those people are doing more with transformations and re-use of existing
>> annotated corpora, as well as pushing towards methods that learn everything
>> from raw text. This doesn't mesh well with SIGWAC's mission. The synergy is
>> less than it was.
> 
> I couldn't agree more. As I said, the average CL researcher is a user of
> web corpora at best.
> 
>> So I think the primary task is to identify a large enough community and
>> co-locate with conferences that are compatible with that community.
> 
> Agreed. The planned CleanerEval ST on text quality evaluation at
> different levels (paragraph and text, maybe even sentence) would be a
> way to search for a community. However, CL researchers might simply see
> it as a way to test some machinery that is en vogue in CL, which is
> AFAIU the purpose of shared tasks. Whether they would attend subsequent
> WAC workshops would remain to be seen.
> 
> However, the problem is IMO not the machinery, but the definition and
> operationalisation of notions such as "text quality". This is something
> I think should be discussed by linguists and computational linguists.
> While I think that operationalisations like "text is good if it can be
> used successfully to [solve some CL task that is en vogue]" are highly
> useful and should by all means be used to evaluate resources, linguists
> might be interested in additional levels of evaluation. (For example, a
> few colleagues and I are currently writing a series of papers where we
> correlate COW-derived models of grammatical alternation phenomena in
> German with experimental findings under a cognitive linguistic
> perspective. This was the kind of research I had in mind for the
> linguistic part of WAC-XI.) Ideally, the two approaches should converge,
> of course. If they don't, finding out why would be highly stimulating.
> 
> At the same time I have to admit that (corpus) linguists expressed their
> lack of interest by not submitting anything to WAC-XI, where CleanerEval
> was supposed to be discussed.
> 
> Well, I don't really have a perfect solution to offer, which is why I
> want to thank everybody who has contributed to the discussion so far and
> encourage anybody to continue the discussion. My strategy of choice
> would be to try to talk to a smaller community of corpus linguists whom
> we (= anyone who is interested) know personally in order to see whether
> a substantial interest in CleanerEval could be raised. After all, a
> similar approach worked for EmpiriST. Whether such a shared task, even
> if successful, would lead to a sustained interest in WAC is impossible
> to tell, of course.
> 
> Best,
> Roland