[Sigwac] Call for discussion: The SIGWAC crisis (instead of an announcement of WAC-XI)

Serge Sharoff S.Sharoff at leeds.ac.uk
Tue Aug 1 00:53:06 CEST 2017


Dear Roland et al,

It is very natural that we as the Wacky people have something in common with the
linguistic community, but it's equally natural that we're aligned with the CL
community. I don't like the idea of submitting to "bullying" by the CL people
(following Adam's metaphor quoted in my last message). However, turning away
from them is equally unattractive. For example, the methods to be used in the new
round of CleanerEval are likely to come from the CL community. In each of the
areas of interest for us (corpus collection, cleaning, processing) we benefit
from the results published at ACL or LREC. 

The opposite direction of influence is also important.  Chris mentions that a
very small minority of the CL people is likely to know about the BNC. That's
probably true. I agree that the ACL-related community overall pays little
attention to their data. However, this doesn't mean they shouldn't. I've just
looked into the Common Crawl corpus which is commonly used in Machine
Translation as a source of monolingual data:  
http://www.statmt.org/wmt17/translation-task.html#download

It has no document boundaries, no provenance indication, while it does contain
lots of spam. I can't resist quoting the very first lines from that corpus:

> by Lefty on Sep.29, 2010, under Free Porn Movies
> Paul Bunyan
> Comment added on 13:52 June 03, 2010 by Muriel
> Nothing villages also signaled into the fine next cell, power point viewer.
> Girls drinking left that students equality family like this should say to
> sweden, where the women are family and common!
> We present sexy twinks XXX movies!
> September 2009   (55)
> October 2008   (15)
> Default
> STL
> The Ultimate Joomla Collection
> What is the Torah?
> Though it contains laws and commands, the Torah is better understood as G-d's
> teaching and instructions on life rather than some divine municipal
> governance. The Torah teaches us how to do what is right and by doing so, find
> blessing.
> Silva Timber
> Western Red Cedar Rainscreen

It is reasonable to think that using this resource is nevertheless better than
using no monolingual data at all (that's why it's popular). However, this is
precisely where we have a line of communication with the CL community: we do
address this problem in the SIGWAC community, and we can demonstrate that
cleaner corpora can improve downstream tasks.

Therefore I tend to both agree and disagree with:
> > from raw text. This doesn't mesh well with SIGWAC's mission. The synergy is
> > less than it was.
> 
> I couldn't agree more. As I said, the average CL researcher is a user of
> web corpora at best.

Yes, they are the users, in this respect very much similar to people like
linguists and lexicographers.  At the same time, they are contributors of new
methods, e.g., sequence tagging (both for POS and for corpus cleaning).

Still this leaves the question of co-location of the next WAC events open. I
don't have an answer here. Yes, I don't think many people like overpriced
ACL/LREC events. However, many (in the CL community) commit themselves to going
there. As mentioned, the previous events co-located with those conferences never
failed because of the lack of submissions. 

Cheers,
Serge



On Mon, 2017-07-31 at 11:39 +0200, Silvia Bernardini wrote:
> Dear all,
> 
> apologies for this late and possibly slightly off-topic reply from the Forlì-
> Bologna group. We have been following the exchange so far, and believe it is
> in itself a very positive thing, obliging us to reflect on our interests (past,
> present and future) and the time we can devote to them. So we are grateful to
> Roland in the first place for taking the initiative, and to everyone else for
> sharing their thoughts.
> 
> Now for our take on the issues currently on the table. We like to think (or
> hope) that the BootCaT era is not completely over. As you will probably
> remember, what originally got many of us interested in the WaC approach was
> the idea (chimera?) of being one day able to build a linguist's search engine,
> a free alternative to Google for building corpora from the web and/or for
> conducting web-based research steering clear of the pitfalls of Googleology
> (to use Adam's term). Crucially, from our perspective as
> translators/terminologists and translator trainers (as well as corpus
> linguists), what mattered the most were not so much the very large WaC corpora
> (however useful) but the small DIY corpora that single users could build for
> themselves. Hence BootCaT. 
> 
> BootCaT still has a large and keen user community, which is currently worrying
> (as we are) about the future of the tool, now that Bing has virtually stopped
> giving out free search APIs. We have been thinking for some time about what to
> do next: give up on BootCaT completely, or go back to where we began, the
> linguist's search engine.
> 
> In the past couple of months we have been experimenting with this tool:
> http://yacy.net/en/index.html, and while it is too
> early to say whether anything good will come out of it (or indeed of similar
> tools we don't know about), the whole idea of a shared search engine for
> corpus linguists seems fascinating (and not only, we hope, due to nostalgia
> for our age of innocence). It could bring back interest from the corpus
> linguistics community, as well as help us to reach out to academic communities
> of informed users (applied linguists, language professionals, discourse and
> media studies people), that we in Forlì also belong to. 
> 
> I am not saying of course that this strand would be enough to single-handedly
> revive interest in WaC, nor that all the members of the WaC community would
> warm to it. But it might be worth taking it on board, together with the other
> research topics mentioned in previous emails.
> 
> As I said, possibly off-topic, but hopefully not completely irrelevant.
> 
> silvia (and the Forlì group)
> 
> 
> > 
> > On 26 Jul 2017, at 13:34, Roland Schäfer <roland.schaefer at fu-berlin.de>
> > wrote:
> > 
> > Dear Chris,
> > 
> > thanks a lot (to you and everyone) for contributing to the discussion!
> > 
> > On 26.07.17 12:00, sigwac-request at sslmit.unibo.it wrote:
> > > 
> > > 
> > > Today's Topics:
> > > 
> > >   1. Re: Call for discussion: The SIGWAC crisis (instead of an
> > >      announcement of WAC-XI) (chris brew)
> > > 
> > > It makes complete sense for someone to study the problem of devising
> > > web-based corpora that are useful for scientific investigations that go
> > > beyond the purely technological. The Google Books collection and the
> > > various instances of BNC and ANC are excellent examples of what is needed.
> > > The existence of CQP and SketchEngine is a wonderful thing.
> > > 
> > > What is readily available now, but was not in the early days of SIGWAC, is
> > > large volumes of lightly-curated text suitable for use in building word
> > > vectors or for training language models that have good perplexity. The ACL
> > > community loves and uses these (rightly, in my view) but they are not at
> > > all the same thing as carefully thought out and designed corpora like the
> > > ones I mentioned in the previous paragraph.
> > Actually, this is what roughly 50% of COW users seem to be using them
> > for, even though we do not even advertise our corpora among CL people.
> > However, these users (as I said in my original post) just download the
> > data and disappear forever. (I am also pretty sure that most of them
> > start by discarding all the levels of linguistic annotation that we
> > spend most of our time working on. Maybe except sentence splitting and
> > POS tags.) They do not usually contribute to improving the quality of
> > the data because they never provide feedback and would most surely never
> > consider attending WAC events (except maybe if their paper got rejected
> > at the main conference). Funnily enough, they either never publish their
> > results, or they provide only very sparse feedback about their
> > publications. (We ask users to notify us of their publications based on
> > our data such that we can obtain funding and make sure our corpora
> > remain "curated resources".)
> > 
> > (This is not true of those CL researchers/groups with whom/which we
> > collaborate more closely, for example Stefan Evert or Sabine Schulte im
> > Walde's group at IMS.)
> > 
> > As a potential SIGWAC community, CL people who use web data are thus as
> > irrelevant as our purely linguistic users who do not care about corpus
> > design or the lexicographers who use SketchEngine but do not submit
> > anything to WAC (as in 2015). While I agree with contributors to this
> > discussion that focusing on linguistics might not be an optimal and might
> > even be a risky solution, it is just one of two (or many) sub-optimal/risky
> > solutions.
> > 
> > > 
> > > I'm not sure about continuing to co-locate with ACL. The proportion of
> > > regular attendees at ACL who have deep background in any form of
> > > linguistics continues to decline, and the proportion with an understanding
> > > of corpus linguistics has never been high. I suspect that
> > > the number of young attendees who have even heard of the BNC is very low
> > And I'm almost 100% sure that none of our linguistic users would
> > consider traveling to ACL (not even those few who have heard about it).
> > Pragmatically speaking, how should they even get their papers accepted
> > given the technocratic turn that CL has taken? It's simply impossible
> > for 99.9% of (corpus) linguists to keep up with the developments in CL
> > because it would take up too much time, and because the results would
> > not matter enough for their normal research.
> > 
> > > 
> > > indeed. So the number of ACL people who would be drawn to WAC is probably
> > > fairly small. Added to which, the conference has fee schedules that are not
> > > really compatible with attendance by researchers who do not have the luxury
> > > (or, to an extent, burden) of large-money engineering-style grants.
> > I agree. But the situation is even worse: Co-locating with ACL is out of
> > the question if the ACL continues to have the member survey as part of
> > the workshop selection process. It is highly unlikely to be due to chance that
> > for 2017, basically nobody expressed their interest in attending WAC.
> > 
> > The fact that we had a large number of participants and talks in Berlin
> > (https://www.sigwac.org.uk/wiki/WAC-X) was, I think, due to the
> > following reasons:
> > 
> > – most importantly, we had the EmpiriST shared task which brought in its
> > own (German) community; I am sure that a shared task on tokenisation and
> > POS tagging of German would not have attracted most regular ACL members
> > 
> > – because it took place in Berlin and we advertised it to (corpus)
> > linguists, some (German) linguists (such as Krause or Würschinger et
> > al.) showed up (who unanimously complained about the absurd fees, by
> > the way); these contributors might NOT have traveled to LREC in Turkey
> > or ACL in the US, etc.
> > 
> > – some of the regular WAC contributors actually contributed (Adrien,
> > Serge, Felix and I), possibly because they were going to ACL anyway (I
> > don't know, of course, whether the co-location played a crucial role for
> > them) or because they were among the organisers
> > 
> > > 
> > > The overlap between SIGWAC and the ACL community was stronger when there
> > > were many carefully curated annotated corpora being built by NLP teams.
> > And SIGWAC was strong when a lot of linguists did the opposite and just
> > experimented with web data (BootCaT era).
> > 
> > > 
> > > This is not happening so much now. To the extent that I understand what the
> > > people who did this are now doing, it seems to me that crowdsourcing has
> > > risen, which usually implies shallower annotation. At the same time, some
> > > of those people are doing more with transformations and re-use of existing
> > > annotated corpora, as well as pushing towards methods that learn everything
> > > from raw text. This doesn't mesh well with SIGWAC's mission. The synergy is
> > > less than it was.
> > I couldn't agree more. As I said, the average CL researcher is a user of
> > web corpora at best.
> > 
> > > 
> > > So I think the primary task is to identify a large enough community and
> > > co-locate with conferences that are compatible with that community.
> > Agreed. The planned CleanerEval ST on text quality evaluation at
> > different levels (paragraph and text, maybe even sentence) would be a
> > way to search for a community. However, CL researchers might simply see
> > it as a way to test some machinery that is en vogue in CL, which is
> > AFAIU the purpose of shared tasks. Whether they would attend subsequent
> > WAC workshops would remain to be seen.
> > 
> > However, the problem is IMO not the machinery, but the definition and
> > operationalisation of notions such as "text quality". This is something
> > I think should be discussed by linguists and computational linguists.
> > While I think that operationalisations like "text is good if it can be
> > used successfully to [solve some CL task that is en vogue]" are highly
> > useful and should by all means be used to evaluate resources, linguists
> > might be interested in additional levels of evaluation. (For example, a
> > few colleagues and I are currently writing a series of papers where we
> > correlate COW-derived models of grammatical alternation phenomena in
> > German with experimental findings under a cognitive linguistic
> > perspective. This was the kind of research I had in mind for the
> > linguistic part of WAC-XI.) Ideally, the two approaches should converge,
> > of course. If they don't, finding out why would be highly stimulating.
> > 
> > At the same time I have to admit that (corpus) linguists expressed their
> > lack of interest by not submitting anything to WAC-XI, where CleanerEval
> > was supposed to be discussed.
> > 
> > Well, I don't really have a perfect solution to offer, which is why I
> > want to thank everybody who has contributed to the discussion so far and
> > encourage anybody to continue the discussion. My strategy of choice
> > would be to try to talk to a smaller community of corpus linguists whom
> > we (= anyone who is interested) know personally in order to see whether
> > a substantial interest in CleanerEval could be raised. After all, a
> > similar approach worked for EmpiriST. Whether such a shared task, even
> > if successful, would lead to a sustained interest in WAC is impossible
> > to tell, of course.
> > 
> > Best,
> > Roland
> > _______________________________________________
> > Sigwac mailing list
> > Sigwac at sslmit.unibo.it
> > http://liste.sslmit.unibo.it/mailman/listinfo/sigwac
