[CWB] Experience encoding FreeLing-tagged corpora?
Graham Ranger -- UAPV
graham.ranger at univ-avignon.fr
Mon Jul 18 06:44:06 CEST 2016
Hello Scott,
I have run into similar problems.
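With sed, I think this ought to remove your empty lines:
    sed '/^$/d' infile > outfile
(It matches lines where the line end comes immediately after the line
beginning, i.e. empty lines, and deletes them.) Since sed works through
the file one line at a time, nothing has to be read into memory in one
go. Sed can also look ahead (the N command appends the next input line
to the pattern space), so you can replace an expression only when
another expression is on the following line.
Otherwise a Perl script would do it, slurping in the whole file and then
replacing each pair of \n's with one.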
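
If you would rather stay out of sed, here is a minimal Python sketch of
the same filter. It streams line by line, so memory use should not grow
with file size. The name drop_blank_lines is only illustrative, and
treating whitespace-only lines as empty is an assumption about what the
FreeLing output contains.

```python
def drop_blank_lines(lines):
    """Yield only the lines that contain something other than whitespace."""
    for line in lines:
        if line.strip():
            yield line


# Small in-memory demonstration; any iterable of lines works the same way.
sample = ["La\tel\tDA0FS0\n", "\n", "abogada\tabogado\tNCFS000\n", "\n"]
print("".join(drop_blank_lines(sample)), end="")
```

For real files, pass an open file object (or sys.stdin) instead of the
sample list and write the result with writelines().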
Best,
Graham.
Le 18/07/2016 04:29, Scott Sadowsky a écrit :
> Thanks, Stefan and Vladimír!
>
> I'm dealing with about 1.3 million files, so efficiency is an issue
> here. I've managed to write a bash script that both invokes the
> FreeLing analyzer and edits the heck out of its output before writing
> it to disk, giving me what I /think/ is acceptable input for CWB.
>
> The one thing I can't get rid of is the empty lines between each line
> of verticalized text. I'd banged my head against this issue before,
> until I finally realized that all the tools involved (sed and such)
> work on a line-by-line basis, and so you apparently can't process \n\n
> in order to convert it to just \n like this -- I'd have to write the
> file and then read it all into memory, perform that operation and then
> write it again. Big performance hit there!
>
> Below is what my script is currently outputting. Is it valid CWB input
> text?
>
> By the way, while I'm here, what's the best and most up to date info
> (tutorials, manuals, etc.) on encoding with CWB?
>
> Thanks!
> Scott
>
> <text corpus="test" label="PROF-ACAD-CCSS" mode="professional"
> genre="academic" field="social sciences" source="misc">
> <s>
> La	el	DA0FS0	DA	determiner	article
>
> abogada	abogado	NCFS000	NC	noun	common
>
> y	y	CC	CC	conjunction	coordinating
>
> ex	ex	AQ0CN00	AQ	adjective	qualificative
>
> fiscal	fiscal	NCCS000	NC	noun	common
>
>
> </s>
> <s>
> La	el	DA0FS0	DA	determiner	article
>
> secretaria	secretario	NCFS000	NC	noun	common
>
> de	de	SP	SP	adposition	preposition
>
> Estado	estado	NCMS000	NC	noun	common
>
> </s>
>
> </text>
>
>
>
>
> On Sun, Jul 17, 2016 at 6:17 AM, Stefan Evert
> <stefanML at collocations.de> wrote:
>
> The answer to both questions is: not directly, but it's easy to
> write a small pre-processing script. I'm sure that many CWB users
> have written similar scripts over the years and someone may be
> willing to share a script that works with the FreeLing output format.
>
>
> > 1. FreeLing's plain text vertical output separates sentences
> > with a blank line, rather than enclosing them in any sort of tag
> > (e.g. <s>...</s>). Can CWB be configured to recognize this type of
> > sentence encoding?
>
> Simply write a script that inserts a start tag <s> at the
> beginning of the corpus and then replaces every blank line with
>
> </s>
> <s>
>
> (plus the final close tag </s> at the end of the text).
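
A rough Python sketch of this recipe, in case it is useful. It is a
slight variation on the description above: <s> is only opened when a
token line actually arrives, so runs of several blank lines do not
produce empty <s></s> pairs. The function name wrap_sentences is
hypothetical, and the input is assumed to be one token per line.

```python
def wrap_sentences(lines):
    """Turn blank-line-separated token lines into <s>...</s> regions."""
    in_sentence = False
    for line in lines:
        token = line.rstrip("\r\n")
        if token.strip():
            if not in_sentence:
                # First token after a blank line (or at the start): open <s>.
                yield "<s>\n"
                in_sentence = True
            yield token + "\n"
        elif in_sentence:
            # A blank line ends the current sentence.
            yield "</s>\n"
            in_sentence = False
    if in_sentence:
        # Final close tag when the input does not end with a blank line.
        yield "</s>\n"


sample = ["La\tel\tDA0FS0\n", "abogada\tabogado\tNCFS000\n", "\n", "de\tde\tSP\n"]
print("".join(wrap_sentences(sample)), end="")
```

Like the sed approach, this reads one line at a time, so it can sit in a
pipe between the analyzer and the output file.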
>
> > 2. FreeLing's XML output looks a lot more complex than what I
> > see in tutorials. It has more attributes, which shouldn't be a
> > problem, but it also encodes each line in XML, as seen below. Can
> > CWB be used with this?
>
> This is an XML format, not one-token-per-line with XML tags as in
> CWB's input format. The best strategy is perhaps to write a simple
> Perl or Python script that parses <token> lines with a regular
> expression and prints the relevant information in TAB-delimited
> format. CWB should then be able to handle the structural XML tags
> as s-attributes.
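
A minimal sketch of such a script in Python, under stated assumptions:
the attribute names form, lemma and tag are my guess at what the
<token> elements carry, so check them against your actual FreeLing
output. The helper token_to_tsv is hypothetical, and the regular
expressions assume each <token ...> tag sits on one line with
double-quoted attribute values that contain no ">" character.

```python
import re

# Matches the opening <token ...> tag and captures its attribute string.
TOKEN_RE = re.compile(r"<token\b([^>]*)>")
# Matches name="value" pairs inside that attribute string.
ATTR_RE = re.compile(r'([\w:.-]+)="([^"]*)"')


def token_to_tsv(line, fields=("form", "lemma", "tag")):
    """Return the chosen attributes TAB-joined, or None for non-token lines."""
    match = TOKEN_RE.search(line)
    if match is None:
        return None
    attrs = dict(ATTR_RE.findall(match.group(1)))
    # Missing attributes become empty columns; XML entities are left as-is.
    return "\t".join(attrs.get(name, "") for name in fields)


# Assumed example line; the attribute set here is illustrative only.
line = '<token id="t1.1" form="La" lemma="el" tag="DA0FS0" >'
print(token_to_tsv(line))
```

Lines for which the function returns None (sentence tags and the like)
can be passed through or rewritten separately.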
>
> Best,
> Stefan
>
>
>
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb