[Sigwac] CLEANEVAL corpus

tsmande at tycho.ncsc.mil
Wed Jun 30 19:12:39 CEST 2010


Hi,

I'm a researcher interested in using the CLEANEVAL corpus to test
template-removal algorithms.  To do so, however, I need to convert the
data into a format in which the text in the page is tagged, rather than
one in which the template has been removed.  This should have been
straightforward, but I ran into difficulty because, contrary to the
annotation guidelines, the annotators frequently added and replaced
characters in the text, and sometimes even classified alt-text (which
shouldn't even show up on the page) as text.  However, I had at least
partial success with the English files.
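
The conversion described above can be sketched as an alignment problem.
The following is only a minimal illustration, not the method actually
used: it assumes the original page and the cleaned file have both been
reduced to plain strings, and the function names and the [[ ]] markers
are hypothetical.  It uses difflib from the Python standard library to
find runs of the original that survive in the cleaned text, so that
added or replaced characters break a run instead of failing the match.

```python
from difflib import SequenceMatcher


def content_spans(original, cleaned, min_run=4):
    """Return (start, end) offsets in `original` whose text also occurs,
    in the same order, in `cleaned`. Runs shorter than `min_run`
    characters are ignored to avoid tagging accidental matches."""
    matcher = SequenceMatcher(None, original, cleaned, autojunk=False)
    spans = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_run:
            spans.append((block.a, block.a + block.size))
    return spans


def tag_content(original, cleaned, open_tag="[[", close_tag="]]", min_run=4):
    """Return `original` with every matched run wrapped in markers,
    leaving the template text in place instead of removing it."""
    out = []
    pos = 0
    for start, end in content_spans(original, cleaned, min_run):
        out.append(original[pos:start])
        out.append(open_tag + original[start:end] + close_tag)
        pos = end
    out.append(original[pos:])
    return "".join(out)


if __name__ == "__main__":
    page = "Home | About | The quick brown fox jumps. | Copyright"
    kept = "The quick brown fox jumps."
    print(tag_content(page, kept))
```

A character replaced by an annotator splits one tagged run into two, so
a real converter would probably also need to merge runs separated by
only a few characters.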

The Chinese files are proving much more difficult, because the
annotators appear to have replaced Chinese characters for no apparent
reason.  For example, in the third line of the second <p> tag of
11-cleaned.html, the annotators replaced 的新车型就多达, as it appears
in the original page, with 男鲁敌途投啻.  Though I don't know Chinese,
putting these two strings through an online translator gives me two
completely unrelated translations.

The problem is not limited to this web page: each page contains several
such cases, where short pieces of Chinese text are replaced by unrelated
text.  This occurs in both the "stripped" and the "cleaned" versions.
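
One possibility worth ruling out for replacements like the one in
11-cleaned.html is an encoding problem rather than an annotator edit.
The sketch below is only a diagnostic under assumptions that are not
stated in this message: it assumes the pages use a double-byte legacy
encoding such as GB2312, and the helper name shifted_decoding is
hypothetical.  It re-encodes a string, drops the first byte so that
every character boundary is off by one, and decodes what remains.

```python
def shifted_decoding(text, encoding="gb2312", offset=1):
    """Encode `text`, drop `offset` leading bytes, and decode the rest.

    This simulates what a reader sees when a double-byte stream is
    decoded starting in the middle of a character."""
    raw = text.encode(encoding)
    shifted = raw[offset:]
    # Keep only whole two-byte pairs so a dangling final byte is dropped.
    shifted = shifted[: len(shifted) - len(shifted) % 2]
    return shifted.decode(encoding, errors="replace")


if __name__ == "__main__":
    # String as it appears in the original page.
    print(shifted_decoding("的新车型就多达"))
```

If the output for a string from the original page matches the
replacement found in the cleaned file, the difference would point to a
byte-alignment problem somewhere in processing rather than to a
deliberate change by the annotators.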

Is there any explanation for this discrepancy?  I know it's been a while
since the competition, but do you still have the versions of the HTML
files that you ran your scripts on?

Any help would be appreciated.

Thanks,
--Travis
