[Sigwac] 4.8 billion token Swedish web corpus available (SVCOW14)

Roland Schäfer roland.schaefer at fu-berlin.de
Sat Aug 30 14:04:53 CEST 2014


* Apologies for multiple postings *

As the culmination of more than two years of work on the next-generation
COW web corpora, a series of giga-token COWs in Dutch, English, French,
German, Spanish, and Swedish is now leaving the processing tool chain. The
Swedish corpus is the first to become available. It is a 4.8 billion
token sentence-shuffled corpus derived from an unshuffled 8.6 billion
token corpus. Next in line are (in this order) Dutch, English, and German.

Website:       http://hpsg.fu-berlin.de/cow/
Download:      http://hpsg.fu-berlin.de/cow/download/
Web interface: http://hpsg.fu-berlin.de/cow/colibri/

SVCOW14AX maintainer: Roland Schäfer <mail at rolandschaefer.net>
COW initiative 2011-2014: Felix Bildhauer, Roland Schäfer

Best regards,
Roland


===== SUMMARY OF SVCOW14AX CORPUS PROPERTIES =====

* freely available under a restrictive academic license
* crawled in 2012 and 2014 in the TLDs .se and .fi
* vertical format with token/POS/lemma columns in minimal XML
* ready for encoding in versions of CWB which have UTF-8 support
* processed with texrex (http://texrex.sourceforge.net/) for:

  + markup stripping
  + UTF-8 transcoding and checking
  + entity conversion
  + heuristic repairs of broken encodings
  + document quality assessment using frequencies of short words:
    Schäfer et al. (2013) [http://bit.ly/VSmK6M]
  + boilerplate status classification for text blocks:
    Schäfer (2014, draft) [http://bit.ly/VSmK6M]
  + document de-duplication using classic w-shingling:
    Schäfer & Bildhauer (2012) [http://bit.ly/1zJIqiT]

* run-together sentences fixed with rofl (included in texrex)
* hard-coded hyphenation removed with HyDRA (included in texrex)
* tokenization with ucto and custom scripts
* POS tagging with HunPos
* lemmatization with custom tools
* metadata encoded in the released version:

  + document ID
  + document URL
  + server geolocation from GeoLite by MaxMind (http://www.maxmind.com)
  + document quality score
  + boilerplate score
  + crawl date
  + last-modified (if available)
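
For readers who have not worked with vertical-format corpora before, the
following is a minimal sketch of how such a file might be read in Python.
It is not part of the COW tool chain. It assumes the usual CWB convention
of one token per line with tab-separated columns in the order
token/POS/lemma, and it assumes that sentences are wrapped in an <s>
element. The element names, the attribute values, and the POS tags in the
sample are illustrative guesses, so please check them against the actual
download before relying on this.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    form: str
    pos: str
    lemma: str


def read_sentences(
    lines: Iterable[str], sentence_tag: str = "s"
) -> Iterator[List[Token]]:
    """Yield one list of Token per sentence element.

    Assumption: structural lines are XML tags on their own line (no tab),
    and token lines have exactly three tab-separated fields.
    """
    close = f"</{sentence_tag}>"
    current: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        # A line that looks like a tag and contains no tab is markup.
        if line.startswith("<") and line.endswith(">") and "\t" not in line:
            if line == close and current:
                yield current
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # skip lines that do not match the assumed layout
        current.append(Token(*fields))
    if current:
        yield current  # trailing tokens without a closing sentence tag


# Hypothetical sample; the real attribute names and tags may differ.
sample = [
    '<doc id="hypothetical-id" url="http://example.se/">\n',
    "<s>\n",
    "Det\tPN\tdet\n",
    "regnar\tVB\tregna\n",
    ".\tMAD\t.\n",
    "</s>\n",
    "</doc>\n",
]

for sentence in read_sentences(sample):
    print(" ".join(token.form for token in sentence))
```

For a real file, passing an open file handle (opened with
encoding="utf-8") instead of the sample list keeps memory use flat, since
the function reads line by line.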

