[CWB] Experience encoding FreeLing-tagged corpora?

Graham Ranger -- UAPV graham.ranger at univ-avignon.fr
Mon Jul 18 06:44:06 CEST 2016


Hello Scott,
I have run into similar problems.
With sed, I think this ought to delete your empty lines:
sed '/^$/d' infile > outfile
(It matches lines where the line end comes immediately after the line 
start, i.e. empty lines, and deletes them.) sed can also look ahead: its 
N command appends the next line to the pattern space, so you can replace 
an expression only if another expression is on the following line.
Otherwise a Perl script would do it, slurping in the whole file and then 
replacing each pair of \n's with one.
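For what it's worth, removing the empty lines should not need the whole 
file in memory, since an empty line can be recognised on its own without 
looking at its neighbours. Below is a minimal Python sketch of the same 
filter as the sed command, in case it is easier to fold into a larger 
pipeline. The function name drop_blank_lines is my own, and unlike 
sed '/^$/d' it also drops lines that contain only whitespace, which I am 
assuming is harmless for verticalized text.

```python
def drop_blank_lines(lines):
    """Yield only the lines that contain something other than whitespace.

    Works on any iterable of lines (an open file, sys.stdin, a list),
    one line at a time, so memory use does not grow with file size.
    """
    for line in lines:
        if line.strip():
            yield line


# Small demonstration on an in-memory sample (tab-separated columns assumed).
sample = ["<s>\n", "La\tel\tDA0FS0\n", "\n", "abogada\tabogado\tNCFS000\n", "\n", "</s>\n"]
cleaned = list(drop_blank_lines(sample))
print("".join(cleaned), end="")
```

Passing sys.stdin to the function and writing the result with 
sys.stdout.writelines() would let it sit at the end of a shell pipeline.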
Best,
Graham.

On 18/07/2016 04:29, Scott Sadowsky wrote:
> Thanks, Stefan and Vladimír!
>
> I'm dealing with about 1.3 million files, so efficiency is an issue 
> here. I've managed to write a bash script that both invokes the 
> FreeLing analyzer and edits the heck out of its output before writing 
> it to disk, giving me what I /think/ is acceptable input for CWB.
>
> The one thing I can't get rid of is the empty lines between each line 
> of verticalized text. I'd banged my head against this issue before, 
> until I finally realized that all the tools involved (sed and such) 
> work on a line-by-line basis, and so you apparently can't process \n\n 
> in order to convert it to just \n like this -- I'd have to write the 
> file and then read it all into memory, perform that operation and then 
> write it again. Big performance hit there!
>
> Below is what my script is currently outputting. Is it valid CWB input 
> text?
>
> By the way, while I'm here, what's the best and most up-to-date info 
> (tutorials, manuals, etc.) on encoding with CWB?
>
> Thanks!
> Scott
>
> <text corpus="test" label="PROF-ACAD-CCSS" mode="professional" 
> genre="academic" field="social sciences" source="misc">
> <s>
> La	el	DA0FS0	DA	determiner	article
>
> abogada	abogado	NCFS000	NC	noun	common
>
> y	y	CC	CC	conjunction	coordinating
>
> ex	ex	AQ0CN00	AQ	adjective	qualificative
>
> fiscal	fiscal	NCCS000	NC	noun	common
>
>
> </s>
> <s>
> La	el	DA0FS0	DA	determiner	article
>
> secretaria	secretario	NCFS000	NC	noun	common
>
> de	de	SP	SP	adposition	preposition
>
> Estado	estado	NCMS000	NC	noun	common
>
> </s>
>
> </text>
>
>
>
>
> On Sun, Jul 17, 2016 at 6:17 AM, Stefan Evert 
> <stefanML at collocations.de> wrote:
>
>     The answer to both questions is: not directly, but it's easy to
>     write a small pre-processing script.  I'm sure that many CWB users
>     have written similar scripts over the years and someone may be
>     willing to share a script that works with the FreeLing output format.
>
>
>     > 1. FreeLing's plain text vertical output separates sentences
>     > with a blank line, rather than enclosing them in any sort of tag
>     > (e.g. <s>...</s>). Can CWB be configured to recognize this type of
>     > sentence encoding?
>
>     Simply write a script that inserts a start tag <s> at the
>     beginning of the corpus and then replaces every blank line with
>
>             </s>
>             <s>
>
>     (plus the final close tag </s> at the end of the text).
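
A rough Python sketch of this recipe, in case it helps. It is a slight 
variation rather than a literal transcription: instead of printing 
</s><s> for every blank line, it opens <s> at the first token of a block 
and closes it at the next blank line, so that runs of several blank 
lines do not produce empty <s></s> pairs. The name wrap_sentences is 
hypothetical, and the input is assumed to be one token per line.

```python
def wrap_sentences(lines):
    """Wrap each blank-line-delimited block of token lines in <s>...</s>.

    `lines` is any iterable of lines; output lines always end in "\n".
    """
    in_sentence = False
    for line in lines:
        token = line.rstrip("\r\n")
        if token.strip():
            if not in_sentence:
                # First token after a blank line (or at the start): open a sentence.
                yield "<s>\n"
                in_sentence = True
            yield token + "\n"
        elif in_sentence:
            # Blank line ends the current sentence; extra blank lines are ignored.
            yield "</s>\n"
            in_sentence = False
    if in_sentence:
        # Close the last sentence if the input did not end with a blank line.
        yield "</s>\n"


# Demonstration on a tiny in-memory sample with two sentences.
sample = ["La\tel\n", "abogada\tabogado\n", "\n", "\n", "de\tde\n"]
print("".join(wrap_sentences(sample)), end="")
```

Like the blank-line filter, this works line by line, so it can be 
applied to a stream without reading the whole file first.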
>
>     > 2. FreeLing's XML output looks a lot more complex than what I
>     > see in tutorials. It has more attributes, which shouldn't be a
>     > problem, but it also encodes each line in XML, as seen below. Can
>     > CWB be used with this?
>
>     This is an XML format, not one-token-per-line with XML tags as in
>     CWB's input format. The best strategy is perhaps to write a simple
>     Perl or Python script that parses <token> lines with a regular
>     expression and prints the relevant information in TAB-delimited
>     format. CWB should then be able to handle the structural XML tags
>     as s-attributes.
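
A possible shape for such a script in Python, offered as a sketch only. 
The XML sample is not included in this thread, so the attribute names 
form, lemma and tag are my assumptions about what a <token> line 
carries, and the "_" placeholder for a missing attribute is an arbitrary 
choice. The function name token_to_tsv is hypothetical too.

```python
import re

# Matches the inside of an opening <token ...> tag (assumes no ">" in values).
TOKEN_RE = re.compile(r'<token\b([^>]*)>')
# Matches name="value" pairs (assumes double-quoted values without '"' inside).
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

# Assumed attribute names, in the column order wanted in the output.
COLUMNS = ("form", "lemma", "tag")


def token_to_tsv(line, columns=COLUMNS):
    """Return a TAB-delimited line for a <token> line, or None for other lines."""
    match = TOKEN_RE.search(line)
    if match is None:
        return None
    attrs = dict(ATTR_RE.findall(match.group(1)))
    # Missing attributes become "_" so every row has the same number of columns.
    return "\t".join(attrs.get(name, "_") for name in columns)


# Demonstration on a made-up line; real FreeLing output may differ.
example = '<token id="t1.1" form="La" lemma="el" tag="DA0FS0" ctag="DA" >'
print(token_to_tsv(example))
```

Lines for which the function returns None (structural tags and the like) 
would need to be passed through or dropped separately, depending on 
which of them should become s-attributes.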
>
>     Best,
>     Stefan
>
>
>
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb


