[Sigwac] Call for discussion: The SIGWAC crisis (instead of an announcement of WAC-XI)

Serge Sharoff S.Sharoff at leeds.ac.uk
Thu Jul 6 19:56:07 CEST 2017


Dear Roland,


your elaborate request deserves an elaborate response, while time is in short supply (as usual).


Just two topics to mention. One is the reference to research on genres on the web. I'm still interested in it. There was a presentation at the last WAC workshop. I have a position paper forthcoming:

http://corpus.leeds.ac.uk/serge/publications/2018-ftd.pdf


However, I agree that with the exception of a few other pockets of linguistic resistance:
https://scholar.google.co.uk/scholar?as_ylo=2011&q=corpus+text+classification+%22genre%22

the question of genres does not get the amount of attention it deserves in the computational community. Quite often in ACL papers the performance is measured via the Wall Street Journal corpus or a similarly limited resource. I feel it's my duty to educate. I point this out in every paper review when it's relevant. I'd like to see more people joining me in this genre crusade.


This brings me to the second topic. First, I agree with you that we need to maintain the balance between NLP and traditional linguistics. There was a nice reference to this in Adam Kilgarriff's posting to the Corpora list 10 years ago:

"computational linguistics, which has come into the field like a schoolyard bully, forcing everything that's not computational into submission, collusion or the margins."

http://mailman.uib.no/public/corpora/2007-January/003842.html


This suggests our duty to protect the non-computational aspect, which is being bullied at the moment. With the rise of neural approaches, the need to keep the less computational aspect visible has become even more relevant.


However, in terms of our SIGWAC activities, it looks like more computational events tend to bring more submissions and more attendees. When the WAC workshops were co-located with ACL, LREC or WWW, we had no shortage of submissions. Cancellations and small workshops tended to happen when they were co-located with less computational events. This might be just because the computational field is bigger and better funded, so computational people were looking for an opportunity to publish their research at a less competitive venue (during my time as SIGWAC chair, the acceptance rate was about 50-75%).


However, the greater popularity of computational events speaks against the claim that "it would be hard to make WAC attractive for computational linguists again".  I firmly believe in the first part of the same sentence: "mixed computational/corpus linguistic focus of SIGWAC was its strength". This is precisely what makes attending the WAC workshops attractive for me in the first instance.


In spite of this difference in opinions, thanks for raising this issue.  I fully agree that we need to discuss ways for making SIGWAC more popular.


Best wishes,

Serge


________________________________
From: sigwac-bounces at sslmit.unibo.it <sigwac-bounces at sslmit.unibo.it> on behalf of Roland Schäfer <roland.schaefer at fu-berlin.de>
Sent: 04 July 2017 10:43:56
To: list sigwac
Subject: [Sigwac] Call for discussion: The SIGWAC crisis (instead of an announcement of WAC-XI)

Dear SIGWAC members,

as the current chairman of SIGWAC, I am writing this email to stimulate
a discussion about the future of SIGWAC and the WAC workshops. Please
read this as a personal and subjective assessment and interpretation by
me. I welcome your comments.

We received only 5 submissions for this year's WAC-XI workshop in
Birmingham (where, by the way, the first WAC workshop took place 12
years ago). There will be a WAC-XI guest session as part of CMLC +
BigNLP instead of a full WAC-XI. Huge thanks to the organisers of CMLC +
BigNLP (especially Piotr Bański) as well as Stefan Evert for their help
and patience! However, such a low number of submissions is a clear sign
that SIGWAC is in trouble. WAC-10 at eLex had a similarly low number of
submissions and was cancelled, which is why we had WAC-X (instead of
WAC-10) a year later. I believe that a SIG can only survive if there is
an active community behind it. Please let me point out some problems and
possible solutions.


---  TL;DR  -----------------------------------------------------------

I suggest that the SIGWAC community has been eroding because (i) the
early BootCaT/WaCky era is definitely over and high-quality web corpora
are provided at a very professional level of service by research groups
and companies, not all of which contribute to SIGWAC, (ii) WAC has
become unattractive for computational linguists, and (iii) corpus
construction is generally unattractive for linguists. I argue that the
mixed computational/corpus linguistic focus of SIGWAC was its strength,
but that it would be hard to make WAC attractive for computational
linguists again. Also, doing so would likely mean cutting the cord with
the linguistics community although the interesting open questions for
WAC are, in my view, (corpus) linguistic questions.

In order to rescue SIGWAC, I propose to give SIGWAC a strong linguistic
turn towards the use and linguistic analysis of web corpora. However, I
also enumerate other (more computationally oriented) options.

In any case, if we want to keep SIGWAC alive, we need a discussion about
its future NOW, and we need to advertise it more strongly to our colleagues.

-----------------------------------------------------------------------


I. Is there a SIGWAC community?

1. Many (not all) members from the early days no longer participate in
WAC workshops. For example, the WaCky people have mostly moved on to
other areas, and the BootCaT era is over. Most sadly, of course, Adam's
death has robbed the community of its most important and charismatic
leading figure.

2. High-quality web corpora are provided by several research groups and
companies, and everybody can access them for free or by paying
relatively low fees. The involvement of individual corpus linguists in
the web corpus creation process, which was characteristic of the early
WAC days, is thus no longer necessary (but see 5 below). However, some of
the major web corpus projects do not contribute (steadily) to SIGWAC. As
a 'major web corpus project', I define any project which lasts longer
than 5 years and which develops its own technologies at least partially
(crawlers, annotation tools, interfaces, etc.). For example, the people
from the Leipzig Corpora Collection, the Darmstadt lab, WebCorp, and
certain transatlantic projects have occasionally appeared at WAC
workshops but tend to present their work elsewhere. For example, see the
WebCorp list of publications here:
http://www.webcorp.org.uk/live/publications.jsp

3. The huge community of lexicographers working with web data (mostly
thanks to SketchEngine) doesn't seem to be interested in presenting
their work at WAC workshops. This became very clear with the failed
WAC-10 co-located with eLex, where we did not receive a significant
number of submissions by lexicographers.

4. I say this as one of the two main contributors to the COW project: We
have two types of users. First, there are computer scientists who bulk
download the shuffle corpora and are never seen again. Second, there are
linguists who do not necessarily regard themselves as CORPUS linguists,
and who use COW to search for linguistically interesting patterns. Many
of them (in the specific case of our own linguistic user base: most of
them) have no idea how corpora are constructed, aren't interested in the
process, and would never ever consider traveling to Birmingham to a
conference called 'Corpus Linguistics', which they might not even have
heard of before. This is basically the same as 3, although maybe even
worse. Involvement in corpus construction simply doesn't further your
career in linguistics a lot.

5. Some (corpus) linguists still build their own corpora from web data
for specific purposes. Sometimes, they present at WAC, but they may
have no deeper commitment to the WAC community.

6. The Twitter/social media ('CMC') community has never really
participated in WAC. There were single contributions, but computational
linguists/computer scientists in particular who do interesting things
with Twitter and similar data present their work at higher-profile CL
conferences and workshops.

7. Somehow, WAC has neglected to make itself noticed strongly by
linguists except for lexicographers (again, Sketch Engine has done a
very good job in that area). It is remarkable that people who have never
contributed to WAC publish books targeting a linguistic audience and
entitled "Web as corpus: Theory and practice" with major publishers. The
void that this publication filled could and should have been filled by
the WAC community, but was not. While Felix and I have been planning to
write/edit such a book (and publish it as open access) for some time, we
have only been involved in WAC since 2012 and simply haven't found the
time yet.

8. The web as a source of data is no longer considered as exciting as it
was a decade ago, with Twitter/social media data taking its place. I
feel that most computer scientists and computational linguists consider
the specific problems of collecting and processing web data solved, and
linguists (probably including lexicographers) simply trust the existing
web corpora. Those corpus linguists who believe in balanced,
Biber-representative corpora, etc., on the other hand, will always
ignore web corpora.

9. What happened to the web genre scene?

10. The mixed computational/corpus linguistics orientation of WAC
workshops creates very practical incompatibilities. CL people expect
full paper submissions and high-quality proceedings publications,
whereas linguists expect to be asked for short abstracts and irrelevant
(or no) proceedings. With regard to WAC-XI, a CL colleague complained to
me that we did not ask for full paper submissions, whereas a colleague
from linguistics stated that asking for anything longer than a 500-word
abstract is considered a nuisance.

This is NOT an exhaustive list. Please contradict, discuss, or add
relevant points!


II. Why I think Web as Corpus is a "thing" – but no longer a
computational linguistics thing.

In (I), I have stated why I think the SIGWAC community is eroding. Now,
I would like to say why I think this is a truly bad thing, and why I
think WAC is important.

1. We (esp. established projects like Sketch Engine and COW) know very
well how to collect large amounts of textual data from the web, clean
it, and annotate it with standard or adapted tools. However, I firmly
believe that we still know very little about the place and value of web
data in the exploration of the human language faculty. We do not even
know anything substantial from a linguistic point of view about the
composition of the web or web corpora, even though such claims are made
sporadically.

2. That said, I am convinced that web corpora are unique and potentially
even better sources of data for (synchronic) linguistic research than
traditionally compiled corpora. They contain extreme amounts of
variation AND are very large, and thus have the potential of achieving
"representativeness" (for lack of a better word) in a much more
straightforward way than balanced corpora. However, a lot of linguistic
work with web corpora would still be required: careful assessment of
corpus composition, linguistically guided development of annotation
tools, and above all correlation of findings from web corpora with
experimental work – also in comparison to traditional corpora. By the
way, this is something the CMC and DigHum communities usually miss
because they are not cognitively oriented. The CfP for WAC-XI was an
attempt to attract research from this direction, and it failed.

3. In (2), I mentioned the linguistically guided development of
annotation tools. While this is, of course, a computational linguistics
thing, computational linguists will most likely find it unattractive.
From a linguistic point of view, the unsolved problems start at very low
levels, such as sentence splitting and tokenisation. Computational
linguists have long moved away from such tasks, considering them solved,
though the EmpiriST shared task (https://www.sigwac.org.uk/wiki/WAC-X)
was a welcome exception. Also, at least in our experience at COW, tools
often do not deliver the annotations that linguists would expect. The
collaboration of linguists and computational linguists/computer
scientists would be required to create tools that solve "solved" tasks
in new ways. Don't get me wrong, I'm not saying computational linguistic
approaches are the problem, but instead, linguists' low awareness of and
interest in the corpus creation and annotation process.

WAC has to become more of a substantially linguistic thing. I do not see
any other way of making WAC attractive again. As explained, WAC as a CL
thing might have become stale. Of course, computational linguists might
see this differently and I'd appreciate them weighing in.


III. Options

Given the problems mentioned in (I) and the fact that ACL now conducts
member surveys to help select workshops, ACL conferences will likely
never host WAC workshops again. WAC-XI came last in the
ACL 2017 survey with extremely low numbers of expressions of interest.
Also, as explained, linguists aren't interested (yet/anymore) in
contributing. Here is an incomplete list of things we could try in the
hope of gaining momentum again:

1. broaden SIGWAC's scope towards more (corpus) linguistic questions,
raising linguists' awareness of the potential of web data

2. turn SIGWAC into a purely computational thing, possibly broadening
its scope (SIGCMC, SIGND 'noisy data', etc.)

3. (worse than 2) merge with another SIG/workshop

4. (1 and 2) split SIGWAC into SIGWAC/CL and something like an
experimental WAC4Ling

In addition, I would like to propose that regardless of our decision,
the WAC community should come up with something that defines its vision
for the future of WAC, preferably a special issue of some journal or an
edited volume which makes it clear what WAC is and why it is important.
Also, it is obvious that, as SIGWAC members committed to keeping the SIG
alive, we have to actively advertise future workshops to colleagues, and
we have to contribute to them ourselves.

If it turns out that the SIGWAC members want to take it into a more
computational direction, I propose that we elect a chairwoman or
chairman with stronger ties to the computational linguistics community.
However, I am highly committed to SIGWAC because I believe in the
importance of Web as Corpus. Therefore, I do not intend to resign before
my term of office is over without a new candidate having been elected.

What do you think? Any contributions are greatly appreciated!

Best regards,
Roland Schäfer
Chairman of SIGWAC
_______________________________________________
Sigwac mailing list
Sigwac at sslmit.unibo.it
http://liste.sslmit.unibo.it/mailman/listinfo/sigwac

