[CWB] Experience encoding FreeLing-tagged corpora?

Scott Sadowsky ssadowsky at gmail.com
Mon Jul 18 15:19:53 CEST 2016


Thanks Graham, and thanks again, Stefan. I just finished tagging a test
corpus with FreeLing and encoding it with CWB, and everything worked
splendidly.

FreeLing produces UTF-8 output, but as far as I can tell CWB 3.4.9 deals
with it just fine using the -c utf8 option. Are there any gotchas I should
know about with this encoding?

Finally, I'm encoding a fair number of s-attributes that describe the
source of the texts, the genre and so on, with the idea of making one big
corpus in which ad hoc sub-corpora can easily be queried (say, all
newspapers, or all forum posts, or a certain magazine, or whatever). I've
found a bit of info on page 26 of the CQP Query Language Tutorial that you
pointed me to. Is there anything else out there that might be of use for
this particular purpose?
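
For illustration, here is a minimal Python sketch of the setup described above: each text is wrapped in a <text> region whose XML attributes carry the metadata, and a matching -S declaration is built for cwb-encode. The helper names, the attribute names (genre, source) and all paths are hypothetical. The flags used (-d, -f, -R, -c, -s, -B, -P, -S) are existing cwb-encode options, but the exact combination is an assumption and should be checked against the encoding tutorial.

```python
from xml.sax.saxutils import quoteattr


def wrap_text(token_lines, **meta):
    """Wrap verticalized token lines in a <text ...> region carrying metadata.

    The attribute names passed in `meta` are hypothetical examples.
    """
    attrs = " ".join(
        f"{key}={quoteattr(str(value))}" for key, value in sorted(meta.items())
    )
    return [f"<text {attrs}>", *token_lines, "</text>"]


def encode_command(vrt_path, data_dir, registry_file, p_attrs, text_meta):
    """Build a cwb-encode argument list (assumed flag combination, not verified)."""
    cmd = [
        "cwb-encode",
        "-d", data_dir,        # directory for the encoded corpus data
        "-f", vrt_path,        # verticalized input file
        "-R", registry_file,   # registry entry to create
        "-c", "utf8",          # character encoding of the input
        "-s", "-B",            # skip blank lines, strip stray blanks
    ]
    for name in p_attrs:
        cmd += ["-P", name]    # one positional attribute per extra column
    # Declare <text> as an s-attribute whose XML attributes are stored as
    # text_genre, text_source, and so on.
    cmd += ["-S", "text:0+" + "+".join(sorted(text_meta))]
    return cmd


region = wrap_text(["El\tDA0MS0\tel"], genre="newspaper", source="El Mercurio")
command = encode_command(
    "corpus.vrt", "data/mycorpus", "registry/mycorpus",
    ["pos", "lemma"], ["genre", "source"],
)
```

With a declaration along these lines, a CQP query could then be restricted to one subset through a global constraint such as `:: match.text_genre = "newspaper"`, assuming the attributes were encoded under those names.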

Cheers,
Scott

On Mon, Jul 18, 2016 at 2:41 AM, Stefan Evert <stefanML at collocations.de>
wrote:

>
> > The one thing I can't get rid of are the empty lines between each
> > line of verticalized text. I'd banged my head against this issue
> > before, until I finally realized that all the tools involved (sed and
> > such) work on a line-by-line basis, and so you apparently can't
> > process \n\n in order to convert it to just \n like this -- I'd have
> > to write the file and then read it all into memory, perform that
> > operation and then write it again. Big performance hit there!
>
> As Graham suggested, for line-by-line processing you can recognize empty
> lines with the regexp /^$/  (in Perl, I often use /^\s*$/ so I don't
> stumble over a few stray blanks) and then simply skip printing them in the
> output.
>
> > Below is what my script is currently outputting. Is it valid CWB
> > input text?
>
> Looks good to me. Just make sure to pass the -s and -B flags to cwb-encode
> to make it skip blank lines.
>
> > By the way, while I'm here, what's the best and most up to date info
> > (tutorials, manuals, etc.) on encoding with CWB?
>
> The official manuals are the "tutorials" you can find at
>
>         http://cwb.sourceforge.net/documentation.php
>
> They are slightly out of date, but we haven't added that much in the
> meantime.  You can also download PDFs of the latest versions directly from
> the SVN repository:
>
>
> https://sourceforge.net/p/cwb/code/HEAD/tree/doc/tutorials/CQP_Tutorial.pdf?format=raw
>
>
> https://sourceforge.net/p/cwb/code/HEAD/tree/doc/tutorials/CWB_Encoding_Tutorial.pdf?format=raw
>
> Best,
> Stefan
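
The line-by-line filtering suggested in the quoted message can be sketched in a few lines of Python. This is only an illustration under the assumption that the tagger output arrives as a stream of lines; the function names are hypothetical. It uses the same /^\s*$/ test mentioned above and never holds more than one line in memory.

```python
import re

# Same pattern as in the quoted advice: a line that is empty or
# contains only whitespace.
BLANK = re.compile(r"^\s*$")


def drop_blank_lines(lines):
    """Yield only the lines that contain something other than whitespace."""
    for line in lines:
        if not BLANK.match(line):
            yield line


def filter_stream(src, dst):
    """Copy src to dst one line at a time, skipping blank lines."""
    dst.writelines(drop_blank_lines(src))
```

Calling filter_stream(sys.stdin, sys.stdout) would turn this into a pipe-friendly filter that can sit between the tagger and cwb-encode.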