[Sigwac] CLEANEVAL annotation guidelines: feedback request

Wed Oct 11 11:35:47 CEST 2006

Dear All,

As many of you know, SIGWAC is preparing a competitive task on automated
webpage cleaning named CLEANEVAL (http://cleaneval.sigwac.org.uk/). The
first CLEANEVAL challenge will take place in mid-August 2007.

Currently, we are planning English and Chinese tracks. However, if anybody
is able to provide test sets for other languages, please do get in touch
with us.

As part of the preparation, we need to develop guidelines for the
annotators who will create the manually cleaned sets of pages to be used as
test sets in the competition. This is a delicate task, as we have to strike
a good balance between our own (not always perfectly overlapping) ideas of
what a clean page should look like and the formulation of guidelines that
the average undergrad annotator will be able to follow.

After some collaborative experimentation, we came up with the preliminary
version of the guidelines attached to this mail.

We would greatly appreciate your feedback on this proposal.

Regards,

Marco Baroni, Serge Sharoff, Tony Hartley and Adam Kilgarriff

-------------- next part --------------
                     *****************************
                     * GUIDELINES FOR ANNOTATORS *
                     *****************************

INTRODUCTION
============

Your task is to "clean up" a set of webpages so that their contents
can be easily used for further linguistic processing and analysis. In
short, this involves:

1) removing all HTML/JavaScript code and "boilerplate" (headers,
copyright notices, link lists, materials repeated across most pages of
a site, etc.);

2) adding a basic encoding of the structure of the page using a
minimal set of symbols to mark the beginning of headers, paragraphs
and list elements.

INPUT
=====

You will start with a set of links to webpages and numbered text files
that contain preprocessed versions of those pages with some code
already removed.  Please open the webpage in your browser and the
corresponding text file in a text editor (preferably Notepad++
[http://notepad-plus.sourceforge.net] and ABSOLUTELY NOT MS Word) and
clean the text file paying attention to the formatting displayed in
the browser.

CODE REMOVAL
============

Despite the preliminary cleaning we perform, some code from HTML pages
might remain.  You can recognize it as text that does not appear on
the webpage as displayed in your browser. It will often look like this:

text-decoration:	none;
color:	#33F;
background:	#FFFFF5;

or 

start += "9";
var end = allcookies.indexOf(';', start);

Please remove such fragments if this text is not displayed on the
webpage itself.
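
For anyone who wants to pre-screen the text files before manual
cleaning, the following is a minimal sketch of a heuristic that flags
lines resembling the CSS and JavaScript fragments shown above. The two
patterns and the function names are assumptions made for illustration
and are not part of the CLEANEVAL tooling. The sketch only suggests
candidates: the deciding test remains whether the text is displayed on
the webpage.

```python
import re

# Hypothetical pattern for a CSS declaration such as "color: #33F;".
CSS_DECL = re.compile(r"^\s*[a-zA-Z-]+\s*:\s*[^;]+;\s*$")
# Hypothetical pattern for a JavaScript declaration or assignment ending in ";".
JS_STMT = re.compile(r"^\s*(var\s+\w+\s*=|\w+\s*[+\-*/]?=\s*).*;\s*$")


def looks_like_code(line: str) -> bool:
    """Return True if the line resembles a CSS declaration or a JS statement."""
    return bool(CSS_DECL.match(line) or JS_STMT.match(line))


def flag_code_lines(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for lines that look like leftover code."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if looks_like_code(line)
    ]
```

Ordinary prose that happens to contain a colon and end in a semicolon
would also be flagged, so every flagged line still needs to be checked
against the page in the browser.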

BOILERPLATE REMOVAL
===================

Most webpages contain what we call "boilerplate", i.e., textual
materials that, intuitively, are extraneous to the proper, coherent
contents of the page.

Boilerplate is often machine-generated, and includes (but is not
necessarily limited to):

- Navigation information

- Internal and external link lists

- Copyright notices and other legal information

- Standard header, footer and template materials that are repeated
  across (a subset of) the pages of the same site

- Advertisements

- Web-spam, such as automated postings by spammers to blogs

If you clean a webpage from a discussion forum, you may find replies
that quote substantial portions of other postings, either in boxes or
after '>'.  Delete such quoted fragments as well.

Boilerplate must be removed from all the pages in the corpus.
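
The rule about forum quotations marked with '>' can be sketched in a
few lines. This is only an illustration under the assumption that a
quoted line begins with '>' after optional leading whitespace; the
function name is hypothetical. Quotations displayed in boxes leave no
such marker in the text file and still have to be found by comparing
the file with the page in the browser.

```python
def drop_quoted_lines(text: str) -> str:
    """Remove lines that start with '>' after optional leading whitespace."""
    kept = [
        line
        for line in text.splitlines()
        if not line.lstrip().startswith(">")
    ]
    return "\n".join(kept)
```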

STRUCTURAL ANNOTATION
=====================

We would like to preserve some basic information about the structure
of the page.

In particular, you should use the symbol <h> at the beginning of each
segment that looks, on the original page viewed in your browser, like
a header of some sort or like other information "about" the text,
rather than part of the text itself (a title-like sequence at the
beginning of the document, the title of a section, the page author,
etc.).

Insert the symbol <p> before any paragraph in the document (a
paragraph might look like a traditional printed paragraph, or it might
be a textual/typographic unit more specific to the Web, such as a post
to a bulletin board or a comment on a blog entry).

Finally, use the symbol <l> for any element of a list.

An (artificially simple) annotated page might look like this:

*******************************************************************
<h>My blog

<p>Hi guys!

<p>Today, it's been a very productive day. In the morning, I did the
following three things:

<l>Brush teeth

<l>Take shower

<l>Shave

<h>Comments

<h>Becksy

<p>Great, man!
*******************************************************************

Do not worry about whitespace and newlines: we will normalize them
afterward, using your annotation as the only reliable source of
information about the structure of the page.
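
To illustrate why whitespace does not matter, here is a minimal sketch
of how an annotated file could be split into segments using only the
<h>, <p> and <l> symbols. The actual normalization procedure is not
described in these guidelines, so the function name and the exact
behaviour are assumptions. In this sketch any text before the first
symbol is ignored, and runs of spaces and newlines inside a segment
are collapsed to single spaces.

```python
import re

# The three annotation symbols defined in these guidelines.
TAG = re.compile(r"<(h|p|l)>")


def parse_annotated(text: str) -> list[tuple[str, str]]:
    """Split annotated text into (symbol, content) pairs with whitespace collapsed."""
    # With one capturing group, split() returns
    # [text_before_first_symbol, symbol1, content1, symbol2, content2, ...].
    parts = TAG.split(text)
    segments = []
    for symbol, content in zip(parts[1::2], parts[2::2]):
        segments.append((symbol, " ".join(content.split())))
    return segments
```

Applied to the beginning of the example page above, this would yield
pairs such as ("h", "My blog") and ("p", "Hi guys!"), however the
lines were wrapped in the text file.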

TROUBLESHOOTING
===============

LAUNDRY LIST OF DIRECTIONS ABOUT SPECIAL CASES