[CWB] Experience encoding FreeLing-tagged corpora?

Mon Jul 18 04:29:33 CEST 2016

Thanks, Stefan and Vladimír!

I'm dealing with about 1.3 million files, so efficiency is an issue here.
I've managed to write a bash script that both invokes the FreeLing analyzer
and edits the heck out of its output before writing it to disk, giving me
what I *think* is acceptable input for CWB.

The one thing I can't get rid of is the empty line after each line of
verticalized text. I'd banged my head against this issue before, until I
finally realized that all the tools involved (sed and such) work on a
line-by-line basis, so you apparently can't match \n\n and convert it to
just \n. It seems I'd have to write the file, read it all back into memory,
perform that operation and then write it again. Big performance hit there!
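
One possible way around this, sketched here under the assumption that the
unwanted lines are completely empty (or whitespace-only): a blank line is
itself a line, so a line-by-line filter can skip it without ever needing to
see \n\n as a unit. Standard one-liners such as sed '/^$/d' or grep -v '^$'
do this in a pipeline. The Python version below is only an illustration, and
the function name drop_blank_lines is made up for this example.

```python
def drop_blank_lines(lines):
    """Yield only the lines that contain something other than whitespace.

    Works on any iterable of lines (a list, an open file, sys.stdin) and
    holds one line at a time, so memory use does not grow with file size.
    """
    for line in lines:
        if line.strip():
            yield line


# Small demonstration on a fragment shaped like the output shown in this mail.
sample = [
    "<s>\n",
    "La el DA0FS0 DA determiner article\n",
    "\n",
    "abogada abogado NCFS000 NC noun common\n",
    "\n",
    "</s>\n",
]
cleaned = list(drop_blank_lines(sample))
print("".join(cleaned), end="")
```

As the last stage of a shell pipeline, the same generator could be driven
with sys.stdout.writelines(drop_blank_lines(sys.stdin)), which streams the
text instead of loading the whole file.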

Below is what my script is currently outputting. Is it valid CWB input text?

By the way, while I'm here, what's the best and most up-to-date info
(tutorials, manuals, etc.) on encoding with CWB?

Thanks!
Scott

<text corpus="test" label="PROF-ACAD-CCSS" mode="professional"
genre="academic" field="social sciences" source="misc">
<s>
La el DA0FS0 DA determiner article

abogada abogado NCFS000 NC noun common

y y CC CC conjunction coordinating

ex ex AQ0CN00 AQ adjective qualificative

fiscal fiscal NCCS000 NC noun common

</s>
<s>
La el DA0FS0 DA determiner article

secretaria secretario NCFS000 NC noun common

de de SP SP adposition preposition

Estado estado NCMS000 NC noun common

</s>

</text>

On Sun, Jul 17, 2016 at 6:17 AM, Stefan Evert <stefanML at collocations.de>
wrote:

> The answer to both questions is: not directly, but it's easy to write a
> small pre-processing script.  I'm sure that many CWB users have written
> similar scripts over the years and someone may be willing to share a script
> that works with the FreeLing output format.
>
>
> > 1. FreeLing's plain text vertical output separates sentences with a
> > blank line, rather than enclosing them in any sort of tag (e.g.
> > <s>...</s>). Can CWB be configured to recognize this type of sentence
> > encoding?
>
> Simply write a script that inserts a start tag <s> at the beginning of the
> corpus and then replaces every blank line with
>
>         </s>
>         <s>
>
> (plus the final close tag </s> at the end of the text).
>
> > 2. FreeLing's XML output looks a lot more complex than what I see in
> > tutorials. It has more attributes, which shouldn't be a problem, but it
> > also encodes each line in XML, as seen below. Can CWB be used with this?
>
> This is an XML format, not one-token-per-line with XML tags as in CWB's
> input format. The best strategy is perhaps to write a simple Perl or Python
> script that parses <token> lines with a regular expression and prints the
> relevant information in TAB-delimited format. CWB should then be able to
> handle the structural XML tags as s-attributes.
>
> Best,
> Stefan
>
>
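
The blank-line-to-<s> conversion described in the quoted answer to question 1
can be sketched roughly as follows. This is a variant rather than a literal
transcription: it opens <s> lazily at the first token of each sentence, so
consecutive or trailing blank lines do not produce empty <s></s> pairs. The
function name wrap_sentences is hypothetical, and the input is assumed to be
one token per line with blank lines between sentences.

```python
def wrap_sentences(lines):
    """Turn blank-line-separated sentences into <s> ... </s> regions.

    Yields output lines without trailing newlines. A start tag is emitted
    only when a token is seen, and a close tag only if a sentence is open.
    """
    in_sentence = False
    for line in lines:
        token = line.rstrip("\n")
        if token.strip():
            if not in_sentence:
                yield "<s>"
                in_sentence = True
            yield token
        elif in_sentence:
            # A blank line ends the current sentence.
            yield "</s>"
            in_sentence = False
    if in_sentence:
        # Close the last sentence if the input did not end with a blank line.
        yield "</s>"


# Small demonstration with two short sentences separated by a blank line.
sample = [
    "La el DA0FS0\n",
    "abogada abogado NCFS000\n",
    "\n",
    "La el DA0FS0\n",
    "secretaria secretario NCFS000\n",
]
for out_line in wrap_sentences(sample):
    print(out_line)
```

Because it is a generator over lines, it can be fed an open file or standard
input and never needs the whole text in memory.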