[Sigwac] Gold standard data set corrupted - missing 103.txt

Miloš Jakubíček milos.jakubicek at sketchengine.co.uk
Tue Mar 8 09:50:44 CET 2016


Good, I have updated the original archive as well.

Milos

Milos Jakubicek

CEO, Lexical Computing
Brighton, UK | Brno, CZ
http://www.lexicalcomputing.com
http://www.sketchengine.co.uk

2016-03-07 18:59 GMT+01:00 Tom Morris <tfmorris at gmail.com>:

> Thanks very much for digging that up, Miloš! If anyone is interested in the
> patched corpus, there's a GitHub repo which has a mirror of it and I've
> just created a pull request to add the file that Milos provided.
> https://github.com/dkpro/dkpro-c4corpus/pull/16
>
> That repo also has a Java re-implementation of the Python JusText
> boilerplate removal program.
>
> Tom
>
>
>
> On Mon, Mar 7, 2016 at 11:25 AM, Miloš Jakubíček <
> milos.jakubicek at sketchengine.co.uk> wrote:
>
> > Ah, just found out my previous message was rejected because of the
> > attachment.
> >
> > You can download the file here instead:
> >
> > https://downloads.sketchengine.co.uk/103.txt
> >
> > Best
> > Milos
> >
> > 2016-03-04 15:08 GMT+01:00 Miloš Jakubíček <
> > milos.jakubicek at sketchengine.co.uk>:
> >
> > > ...unbelievable but I think I did find the right file -- see attached.
> > >
> > > Can you confirm it looks like the right file to you as well -- I will
> > > update the archive then.
> > >
> > > Best
> > > Milos
> > >
> > >
> > >
> > > Milos Jakubicek
> > >
> > > CEO, Lexical Computing
> > > Brighton, UK | Brno, CZ
> > > http://www.lexicalcomputing.com
> > > http://www.sketchengine.co.uk
> > >
> > > 2016-03-04 14:41 GMT+01:00 Miloš Jakubíček <
> > > milos.jakubicek at sketchengine.co.uk>:
> > >
> > >> Hi Tom,
> > >>
> > > >> I just checked and on the server we only have the archive -- I will try
> > > >> to see whether we have any old backups, but the file comes from 2007, so
> > > >> the chances are not very high :(
> > >>
> > >> Best
> > >> Milos
> > >>
> > >> Milos Jakubicek
> > >>
> > >> CEO, Lexical Computing
> > >> Brighton, UK | Brno, CZ
> > >> http://www.lexicalcomputing.com
> > >> http://www.sketchengine.co.uk
> > >>
> > >> 2016-03-03 19:47 GMT+01:00 Tom Morris <tfmorris at gmail.com>:
> > >>
> > > >>> Does anyone have the 103.txt which is supposed to be in the Gold Standard
> > > >>> data set (http://cleaneval.sigwac.org.uk/GoldStandard.tar.gz)?
> > >>>
> > > >>> The current 103.txt is, despite its name, actually a tar file made up of
> > > >>> all the other files. My guess is that someone typed:
> > >>>
> > >>>     $ tar cvf *.txt
> > >>>
> > >>> and the shell expanded that to
> > >>>
> > >>>     $ tar cvf 103.txt 104.txt 105.txt ...
> > >>>
> > > >>> overwriting the original contents of the file with the tar containing all
> > > >>> the other files.
> > >>>
> > > >>> If a corrected version of GoldStandard.tar.gz could be made available,
> > > >>> that would be great.
> > >>>
> > >>> Best regards,
> > >>> Tom Morris
> > >>> _______________________________________________
> > >>> Sigwac mailing list
> > >>> Sigwac at sslmit.unibo.it
> > >>> http://devel.sslmit.unibo.it/mailman/listinfo/sigwac
> > >>>
> > >>
> > >>
> > >
> >
>
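
For anyone who wants to check an unpacked copy of the data set for the problem Tom describes (a .txt file that is really a tar archive), here is a minimal sketch in Python. It is not part of the original thread: the function name find_misnamed_tars is made up for illustration, and it assumes the archive has already been extracted to a local directory.

```python
import tarfile
from pathlib import Path


def find_misnamed_tars(directory):
    """Return names of *.txt files in `directory` that are really tar archives.

    Hypothetical helper for illustration only; not part of CleanEval or
    any tool mentioned in the thread.
    """
    suspects = []
    for path in sorted(Path(directory).glob("*.txt")):
        # tarfile.is_tarfile() tries to read the file as a tar archive,
        # so the check depends on the content, not on the extension.
        if path.is_file() and tarfile.is_tarfile(str(path)):
            suspects.append(path.name)
    return suspects
```

Calling find_misnamed_tars on the directory where GoldStandard.tar.gz was unpacked should list 103.txt for the corrupted archive and nothing for the patched one, assuming no other file is affected.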

