<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Fri, May 19, 2017 at 2:01 AM, Stefan Evert <span dir="ltr"><<a href="mailto:stefanML@collocations.de" target="_blank">stefanML@collocations.de</a>></span> wrote:</div><div class="gmail_quote"><br></div><div class="gmail_quote">Hi Stefan,</div><div class="gmail_quote"><br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="gmail-">1. Can a single missing closing tag really have this massive cascading effect on all texts processed after it?</span></blockquote><span class="gmail-">
<br>
</span>Yes, because you're encoding with -S text:0, telling cwb-encode that <text> elements can be nested (but any nested ones should be ignored). So if there's a single missing </text> in one of your files, the current <text> region will stay open until the end of the corpus and all following .vrt files will be nested in it (and hence dropped).<br></blockquote><div><br></div><div>Quite right. I've never been able to stop assuming that files are a meaningful unit to CWB. As a result, I was thinking that whatever the problem, it should not extend beyond a single file, which isn't the case, as you point out.</div><div><br></div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">This is also how you can find the culprit quite easily. Use cwb-s-decode to look for a huge <text> region (very probably the last one in the corpus), then check its id to locate the problematic file.<br>
<br>
cwb-s-decode CC-C -S text_id | tail -1<br>
<br>
should do the trick (but do check whether that's really a very large region).<br></blockquote><div><br></div><div>Excellent! That nailed one of the problematic cases right off the bat, and gave me the name of the file I needed to fix. Somehow, the final line in the file ended up being this:</div><div><br></div><div><font face="monospace, monospace">el el < </text></font></div><br>I assume the source file was truncated before tagging, as it ends not just in the middle of a phrase, but with an incomplete line -- there should be five columns in all, and there are only three plus the tag.</div><div class="gmail_quote"><br></div><div class="gmail_quote">By the way, what's the logic at work in this command? Are corpus contents ordered according to size?</div><div class="gmail_quote"><br></div><div class="gmail_quote"><br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Since the last line doesn't start with "<", it isn't recognized as an XML tag and is instead inserted as a literal token (without annotations, so lemma, syn, … will be __UNDEF__ at this point). If you have indexed the corpus completely, you can also use CWB to look for<br>
<br>
".*<.+>.*";<br></blockquote><div><br></div><div>Wonderful -- this found another two cases of bad tags! Unfortunately, this method shows only the corpus position of each hit, not the filename. Any suggestions for getting the name of the file from that? It's encoded as the unique text identifier, so the info should be there, but <font face="monospace, monospace">show +text_id</font> doesn't change what is shown.</div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">As a hotfix, just encode without the :0 nesting specifier, i.e.<br>
<br>
… -S text+id+corpus+tagger+label+<wbr>channel+audience+purpose+type+<wbr>medium+field+area+location+<wbr>source<br>
<br>
cwb-encode then assumes that <text> regions cannot be nested and will automatically insert a </text> end tag whenever it encounters a new region.<br></blockquote><div><br></div><div>I'll give this a shot if I can't manage to identify the two files I need to fix. </div><div><br></div><div>Thanks a ton!</div><div>Scott</div></div><div><br></div>
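<div>On the question of what the cwb-s-decode | tail -1 pipeline relies on: as far as I understand it, regions are listed in corpus order rather than by size, so the last line is simply the final region, which is the one a missing end tag leaves open until the end of the corpus. A minimal Python sketch for checking region sizes directly, assuming each output line has the form start, tab, end, tab, value. The names parse_regions and largest_regions are made up for illustration and are not part of CWB.</div>

```python
from typing import Iterable, List, Tuple

# (start position, end position, annotation value such as the text_id)
Region = Tuple[int, int, str]


def parse_regions(lines: Iterable[str]) -> List[Region]:
    """Parse lines assumed to look like 'start TAB end TAB value'."""
    regions: List[Region] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        start, end, value = line.split("\t", 2)
        regions.append((int(start), int(end), value))
    return regions


def largest_regions(regions: List[Region], top: int = 3) -> List[Region]:
    """Return the `top` regions with the most tokens, largest first."""
    return sorted(regions, key=lambda r: r[1] - r[0], reverse=True)[:top]
```

<div>Feeding parse_regions the output of cwb-s-decode CC-C -S text_id (for example by iterating over sys.stdin) and printing largest_regions should put a runaway region first, with its value naming the file to fix.</div>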
</div></div>
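<div>For mapping a corpus position reported by CQP back to a file, one option is to look the position up in the text_id regions. The sketch below assumes cwb-s-decode output lines of the form start, tab, end, tab, value, with regions sorted by start position and not overlapping. load_regions and find_text_id are hypothetical helper names, not part of CWB. CQP's own PrintStructures setting (set PrintStructures "text_id";) may be a more direct route if it behaves as expected on your corpus.</div>

```python
from bisect import bisect_right
from typing import Iterable, List, Optional, Tuple

# (start position, end position, text_id value)
Region = Tuple[int, int, str]


def load_regions(lines: Iterable[str]) -> List[Region]:
    """Parse lines assumed to look like 'start TAB end TAB value'."""
    regions: List[Region] = []
    for line in lines:
        line = line.rstrip("\n")
        if line:
            start, end, value = line.split("\t", 2)
            regions.append((int(start), int(end), value))
    return regions


def find_text_id(regions: List[Region], cpos: int) -> Optional[str]:
    """Return the value of the region containing corpus position `cpos`.

    Assumes `regions` is sorted by start position and non-overlapping.
    Returns None if the position falls outside every region.
    """
    starts = [r[0] for r in regions]
    i = bisect_right(starts, cpos) - 1  # last region starting at or before cpos
    if i >= 0 and regions[i][0] <= cpos <= regions[i][1]:
        return regions[i][2]
    return None
```

<div>Binary search keeps each lookup cheap even with many regions. Positions that fall inside a dropped (nested) region will map to the enclosing runaway region rather than to their own file.</div>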