[CWB] Experience encoding FreeLing-tagged corpora?

Stefan Evert stefanML at collocations.de
Sun Jul 17 12:17:18 CEST 2016


The answer to both questions is: not directly, but it's easy to write a small pre-processing script.  I'm sure that many CWB users have written similar scripts over the years and someone may be willing to share a script that works with the FreeLing output format.


> 1. FreeLing's plain text vertical output separates sentences with a blank line, rather than enclosing them in any sort of tag (e.g. <s>...</s>). Can CWB be configured to recognize this type of sentence encoding?

Simply write a script that inserts a start tag <s> at the beginning of the corpus and then replaces every blank line with

	</s>
	<s>

(plus the final close tag </s> at the end of the text).
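
A minimal sketch of such a script in Python, in case it is useful as a starting point. The function name wrap_sentences is made up for this example, and it deviates slightly from the literal recipe above: it opens <s> at the first token line and closes it at the next blank line, so that runs of several blank lines or a trailing blank line do not produce empty <s></s> regions. Token lines are passed through unchanged.

```python
def wrap_sentences(lines):
    """Wrap each blank-line-delimited block of token lines in <s> ... </s>.

    `lines` is any iterable of strings (e.g. an open file or sys.stdin).
    """
    in_sentence = False
    for line in lines:
        line = line.rstrip("\n")
        if line.strip() == "":
            # A blank line ends the current sentence, if one is open.
            if in_sentence:
                yield "</s>"
                in_sentence = False
        else:
            # The first token line after a boundary opens a new sentence.
            if not in_sentence:
                yield "<s>"
                in_sentence = True
            yield line
    # Close the last sentence if the input does not end with a blank line.
    if in_sentence:
        yield "</s>"
```

To use it as a filter, iterate over wrap_sentences(sys.stdin) and print each item, then feed the result to the CWB encoder with s declared as an s-attribute.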

> 2. FreeLing's XML output looks a lot more complex than what I see in tutorials. It has more attributes, which shouldn't be a problem, but it also encodes each line in XML, as seen below. Can CWB be used with this?

This is a full XML format, not the one-token-per-line format with XML tags that CWB expects as input. The best strategy is perhaps to write a simple Perl or Python script that parses the <token> lines with a regular expression and prints the relevant information in TAB-delimited format, one token per line. CWB should then be able to handle the remaining structural XML tags as s-attributes.
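
A rough sketch of the regular-expression approach in Python. Everything specific here is an assumption rather than something taken from the FreeLing documentation: that each <token ...> start tag sits on a single line, that the wanted values are in attributes named form, lemma and tag, and that no attribute value contains a literal ">". The names convert, COLUMNS and MISSING are invented for this example; adjust COLUMNS to the attributes actually present in your files.

```python
import re

# Assumed attribute names; adjust to match the actual FreeLing output.
COLUMNS = ("form", "lemma", "tag")
# Placeholder printed when a token lacks one of the attributes
# (chosen for this sketch).
MISSING = "__UNDEF__"

TOKEN_RE = re.compile(r"<token\b([^>]*)>")
ATTR_RE = re.compile(r'([\w:-]+)="([^"]*)"')

def convert(lines):
    """Turn <token .../> lines into TAB-delimited rows; pass other tags through."""
    for line in lines:
        stripped = line.strip()
        match = TOKEN_RE.match(stripped)
        if match:
            # Collect all name="value" pairs from the start tag.
            attrs = dict(ATTR_RE.findall(match.group(1)))
            yield "\t".join(attrs.get(col, MISSING) for col in COLUMNS)
        elif not stripped or stripped.startswith("</token"):
            # Drop blank lines and token end tags.
            continue
        else:
            # Structural tags such as sentence boundaries are kept as-is.
            yield stripped
```

Attribute values are copied verbatim, so XML entities such as &amp; are not decoded here. Any line that is not a token tag is passed through unchanged, which means elements nested inside <token> (if the format has them) would need an extra rule.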

Best,
Stefan


