[CWB] Trying to deal with tag problems when encoding

Scott Sadowsky ssadowsky at gmail.com
Fri May 19 21:28:45 CEST 2017


On Fri, May 19, 2017 at 7:15 AM, Stefan Evert <stefanML at collocations.de>
wrote:

Thanks again, Stefan.

> No, my logic was quite simple: If there's a missing </text> tag in one of
> your files, this region isn't closed and will extend to the very end of the
> corpus (unless there is a superfluous </text> tag or a damaged <text> in
> another file).  So there was a good chance that the last <text> region in
> the corpus would be the critical one.
>

Right on.


> What you want is
>
>         set PrintStructures text_id;
>

Extremely useful command, this!


> One of the advantages of -S text:0 is that it shows you there is a problem
> – with the "hotfix" solution, it's completely hidden.
>

Indeed. Fortunately, with the set PrintStructures command I was able to
ferret out what I hope is the last bad tag in the corpus, and I'm currently
re-encoding using -S text:0.
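
For anyone hitting the same problem, a pre-encoding check along these lines
might help find unbalanced tags before running cwb-encode. This is only a
rough sketch, not part of CWB: it assumes every <text> and </text> tag sits
alone on its own line of the input, and the function name
find_text_tag_problems is made up for illustration.

```python
import re

# Assumption: each <text ...> or </text> tag occupies a whole line by itself.
TEXT_TAG = re.compile(r"^\s*<(/?)text(\s[^>]*)?>\s*$")


def find_text_tag_problems(lines):
    """Return (line_number, message) pairs for unbalanced <text> tags."""
    problems = []
    open_line = None  # line number of the currently open <text>, if any
    for lineno, line in enumerate(lines, start=1):
        match = TEXT_TAG.match(line)
        if not match:
            continue  # ordinary token line, other tag, or a damaged tag
        if match.group(1) == "/":
            # Closing tag: only valid if a region is currently open.
            if open_line is None:
                problems.append((lineno, "</text> without an open <text>"))
            open_line = None
        else:
            # Opening tag: the previous region should already be closed.
            if open_line is not None:
                problems.append((open_line, "<text> not closed before next <text>"))
            open_line = lineno
    if open_line is not None:
        problems.append((open_line, "<text> still open at end of input"))
    return problems


if __name__ == "__main__":
    sample = ['<text id="a">', "word", "</text>", '<text id="b">', "word"]
    for lineno, message in find_text_tag_problems(sample):
        print(f"line {lineno}: {message}")
```

A damaged opening tag does not match the pattern, so it should show up
indirectly as a </text> without an open <text>.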

Thanks for all your help.

Cheers,
Scott