[Sigwac] Gold standard data set corrupted - missing 103.txt

Tom Morris tfmorris at gmail.com
Mon Mar 7 18:59:35 CET 2016


Thanks very much for digging that up, Miloš! If anyone is interested in the
patched corpus, there's a GitHub repo which has a mirror of it, and I've
just created a pull request to add the file that Miloš provided.
https://github.com/dkpro/dkpro-c4corpus/pull/16

That repo also has a Java re-implementation of the Python JusText
boilerplate removal program.

Tom



On Mon, Mar 7, 2016 at 11:25 AM, Miloš Jakubíček <
milos.jakubicek at sketchengine.co.uk> wrote:

> Ah, just found out my previous message was rejected because of the
> attachment.
>
> You can download the file here instead:
>
> https://downloads.sketchengine.co.uk/103.txt
>
> Best
> Milos
>
> 2016-03-04 15:08 GMT+01:00 Miloš Jakubíček <
> milos.jakubicek at sketchengine.co.uk>:
>
> > ...unbelievable but I think I did find the right file -- see attached.
> >
> > Can you confirm it looks like the right file to you as well -- I will
> > update the archive then.
> >
> > Best
> > Milos
> >
> >
> >
> > Milos Jakubicek
> >
> > CEO, Lexical Computing
> > Brighton, UK | Brno, CZ
> > http://www.lexicalcomputing.com
> > http://www.sketchengine.co.uk
> >
> > 2016-03-04 14:41 GMT+01:00 Miloš Jakubíček <
> > milos.jakubicek at sketchengine.co.uk>:
> >
> >> Hi Tom,
> >>
> >> I just checked and on the server we only have the archive -- I will try
> >> to see whether we have any old backups, but the file comes from 2007, so
> >> the chances are not very high :(
> >>
> >> Best
> >> Milos
> >>
> >> Milos Jakubicek
> >>
> >> CEO, Lexical Computing
> >> Brighton, UK | Brno, CZ
> >> http://www.lexicalcomputing.com
> >> http://www.sketchengine.co.uk
> >>
> >> 2016-03-03 19:47 GMT+01:00 Tom Morris <tfmorris at gmail.com>:
> >>
> >>> Does anyone have the 103.txt which is supposed to be in the Gold Standard
> >>> data set (http://cleaneval.sigwac.org.uk/GoldStandard.tar.gz)?
> >>>
> >>> The current 103.txt is, despite its name, actually a tar file made up of
> >>> all the other files. My guess is that someone typed:
> >>>
> >>>     $ tar cvf *.txt
> >>>
> >>> and the shell expanded that to
> >>>
> >>>     $ tar cvf 103.txt 104.txt 105.txt ...
> >>>
> >>> overwriting the original contents of the file with the tar containing all
> >>> the other files.
> >>>
> >>> If a corrected version of GoldStandard.tar.gz could be made available,
> >>> that would be great.
> >>>
> >>> Best regards,
> >>> Tom Morris
> >>> _______________________________________________
> >>> Sigwac mailing list
> >>> Sigwac at sslmit.unibo.it
> >>> http://devel.sslmit.unibo.it/mailman/listinfo/sigwac
> >>>
> >>
> >>
> >

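The problem described in this thread, a file named like plain text that is really a tar archive, can be checked for mechanically. What follows is only a minimal sketch using Python's standard tarfile module. The function name find_disguised_tars and the default "*.txt" pattern are made up for illustration and are not part of the CleanEval or dkpro-c4corpus tooling.

```python
import tarfile
from pathlib import Path


def find_disguised_tars(directory, pattern="*.txt"):
    """Return the sorted names of files matching `pattern` in `directory`
    that can actually be opened as tar archives.

    Hypothetical helper: the name and default pattern are assumptions.
    """
    suspects = []
    for path in Path(directory).glob(pattern):
        # tarfile.is_tarfile() tries to open the file as a tar archive and
        # returns False for ordinary text (or empty) files.
        if path.is_file() and tarfile.is_tarfile(str(path)):
            suspects.append(path.name)
    return sorted(suspects)
```

Calling find_disguised_tars on the directory where the archive was unpacked should list any file like the corrupted 103.txt, assuming the file is a complete tar archive as described above.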
