[Sigwac] 4.7 billion token Dutch web corpus available (NLCOW14)

Roland Schäfer roland.schaefer at fu-berlin.de
Sun Sep 7 02:54:22 CEST 2014


* Apologies for multiple postings *

As the culmination of more than two years of work on the next-generation
COW web corpora, a series of giga-token COWs in Dutch, English, French,
German, Spanish, and Swedish is now leaving the processing tool chain.
The Dutch corpus is the second to become available. It is a 4.7 billion
token sentence-shuffled corpus derived from an unshuffled 6.9 billion
token corpus. English and German are next in line, in that order.

Website:              http://hpsg.fu-berlin.de/cow/
Download:             http://hpsg.fu-berlin.de/cow/download/
Simple web interface: http://hpsg.fu-berlin.de/cow/colibri/

NLCOW14AX maintainer: Enrique Manjavacas <enrique.manjavacas at gmail.com>
COW initiative 2011-2014: Felix Bildhauer, Roland Schäfer

Best regards,

Enrique
Roland


===== SUMMARY OF CORPUS PROPERTIES =====

* crawled in 2012 and 2014 in the TLDs .nl and .be
* processed with texrex (http://texrex.sourceforge.net/) for:

  + markup stripping
  + UTF-8 transcoding and checking
  + entity conversion
  + heuristic repairs of broken encodings
  + document quality assessment using frequencies of short words:
    Schäfer et al. (2013) http://bit.ly/VSmK6M
  + boilerplate status classification for text blocks:
    Schäfer (2014, draft) http://bit.ly/VSmK6M
  + document de-duplication using classic w-shingling:
    Schäfer & Bildhauer (2012) http://bit.ly/1zJIqiT

* run-together sentences fixed with rofl (included in texrex)
* hard-coded hyphenation removed with HyDRA (included in texrex)
* tokenization with ucto and custom scripts
* POS tagging and lemmatization with TreeTagger
* metadata in the released sentence-shuffled version:

  + document ID
  + document URL
  + server geolocation from GeoLite by MaxMind (http://www.maxmind.com)
  + document quality score
  + boilerplate score
  + crawldate
  + last-modified (if available)
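
For readers unfamiliar with w-shingling, the de-duplication step listed
above can be illustrated with a minimal Python sketch. This is not the
texrex implementation: the shingle width, the similarity threshold, the
whitespace tokenization, and all function names are assumptions made for
illustration only. Large-scale systems typically hash and sample the
shingles rather than compare full sets pairwise.

```python
def shingles(tokens, w=5):
    """Return the set of contiguous w-token shingles of a token list."""
    if len(tokens) < w:
        # Documents shorter than w yield a single (short) shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets (|A & B| / |A | B|)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, w=5, threshold=0.8):
    """Yield (id1, id2, score) for document pairs at or above threshold.

    docs maps a document ID to its text. The comparison is pairwise and
    therefore quadratic in the number of documents (illustration only).
    """
    ids = sorted(docs)
    # Hypothetical tokenization: plain whitespace splitting.
    sets = {d: shingles(docs[d].split(), w) for d in ids}
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            score = jaccard(sets[a], sets[b])
            if score >= threshold:
                yield a, b, score


if __name__ == "__main__":
    example = {
        "a": "de kat zit op de mat in de kamer",
        "b": "de kat zit op de mat in de keuken",
        "c": "dit is een heel andere zin over iets anders",
    }
    for pair in near_duplicates(example, w=3, threshold=0.7):
        print(pair)
```

In this toy example, documents "a" and "b" differ only in their final
token and are reported as near-duplicates, while "c" shares no shingles
with either of them.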


