[Sigwac] CLEANEVAL corpus
tsmande at tycho.ncsc.mil
Wed Jun 30 19:12:39 CEST 2010
Hi,
I'm a researcher interested in using the CLEANEVAL corpus to test
template-removal algorithms. To do so, I need to convert the data into
a format in which the text in the page is tagged rather than the
template being removed. This should have been straightforward, but I
ran into difficulty: contrary to the annotation guidelines, the
annotators frequently added and replaced characters in the text, and
sometimes even classified alt-text (which shouldn't even show up on the
page) as text. Still, I had at least partial success with the English
files.
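For illustration, here is a minimal sketch of the kind of alignment this conversion involves, assuming the original page and the cleaned file are both available as plain strings. It is not the script I actually used. The function names `kept_spans` and `mark_spans`, the `min_len` threshold, and the `<text>` tag are hypothetical and not part of CLEANEVAL. It uses Python's standard `difflib.SequenceMatcher` to find the stretches of the original that survive in the cleaned version, so that added or replaced characters only split a span instead of breaking the match.

```python
from difflib import SequenceMatcher


def kept_spans(original, cleaned, min_len=3):
    """Return (start, end) offsets in `original` of text also found in `cleaned`.

    `min_len` (an assumed threshold) drops very short chance matches.
    """
    matcher = SequenceMatcher(None, original, cleaned, autojunk=False)
    spans = []
    for block in matcher.get_matching_blocks():
        # The final block is a zero-length sentinel; the size check skips it.
        if block.size >= min_len:
            spans.append((block.a, block.a + block.size))
    return spans


def mark_spans(original, spans, open_tag="<text>", close_tag="</text>"):
    """Wrap each span of `original` in tags, leaving the template untouched."""
    pieces = []
    pos = 0
    for start, end in spans:
        pieces.append(original[pos:start])
        pieces.append(open_tag + original[start:end] + close_tag)
        pos = end
    pieces.append(original[pos:])
    return "".join(pieces)


if __name__ == "__main__":
    page = "Home | About\nThe quick brown fox\nCopyright 2010"
    # One character differs from the page, as with the annotators' edits.
    cleaned = "The quick brawn fox"
    print(mark_spans(page, kept_spans(page, cleaned)))
```

Character-level matching like this is quadratic in the worst case, so on full pages it may be more practical to align line by line or token by token first.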
The Chinese files are proving much more difficult, because the
annotators appear to have replaced Chinese characters for no apparent
reason. For example, in the third line of the second <p> tag of
11-cleaned.html, the annotators replaced 的新车型就多达, as it appears
in the original page, with 男鲁敌途投啻. Though I don't know Chinese,
putting these two strings through an online translator gives me two
completely unrelated translations.
The problem is not limited to this webpage: each page contains several
places where short pieces of Chinese text are replaced by unrelated
text. This occurs in both the "stripped" and the "cleaned" versions.
Is there any explanation for this discrepancy? I know it's been a while
since the competition, but do you still have the versions of the HTML
files that you ran your scripts on?
Any help would be appreciated.
Thanks,
--Travis