[CWB] Experience encoding FreeLing-tagged corpora?

Stefan Evert stefanML at collocations.de
Mon Jul 18 08:41:19 CEST 2016


> The one thing I can't get rid of are the empty lines between each line of verticalized text. I'd banged my head against this issue before, until I finally realized that all the tools involved (sed and such) work on a line-by-line basis, and so you apparently can't process \n\n in order to convert it to just \n like this -- I'd have to write the file and then read it all into memory, perform that operation and then write it again. Big performance hit there!

As Graham suggested, for line-by-line processing you can recognize empty lines with the regexp /^$/  (in Perl, I often use /^\s*$/ so I don't stumble over a few stray blanks) and then simply skip printing them in the output.
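
A minimal sketch of that filter, written in Python rather than Perl, assuming the verticalized text arrives as an iterable of lines. The function name drop_blank_lines and the sample tokens are made up for illustration and are not taken from your script:

```python
import io
import re
import sys

# Same pattern as the Perl /^\s*$/ above: a line that is empty
# or contains nothing but whitespace.
BLANK = re.compile(r"^\s*$")

def drop_blank_lines(lines):
    """Yield only the lines that contain something other than whitespace."""
    for line in lines:
        if not BLANK.match(line):
            yield line

# Hypothetical sample input: one token per line, with stray blank lines.
sample = io.StringIO("El\tDA\n\ngato\tNC\n   \nduerme\tVM\n")

# Lines are handled one at a time, so the file is never held in memory.
sys.stdout.writelines(drop_blank_lines(sample))
```

Because the generator only ever looks at the current line, the same function should work unchanged on sys.stdin or an open file handle of any size.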

> Below is what my script is currently outputting. Is it valid CWB input text?

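Looks good to me. Just make sure to pass the -s and -B flags to cwb-encode: -s makes it skip empty lines, and -B strips leading and trailing blanks, so lines containing only stray whitespace are skipped as well.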

> By the way, while I'm here, what's the best and most up to date info (tutorials, manuals, etc.) on encoding with CWB?

The official manuals are the "tutorials" you can find at

	http://cwb.sourceforge.net/documentation.php

They are slightly out of date, but we haven't added that much in the meantime.  You can also download PDFs of the latest versions directly from the SVN repository:

	https://sourceforge.net/p/cwb/code/HEAD/tree/doc/tutorials/CQP_Tutorial.pdf?format=raw

	https://sourceforge.net/p/cwb/code/HEAD/tree/doc/tutorials/CWB_Encoding_Tutorial.pdf?format=raw

Best,
Stefan

