<div dir="auto">Thanks very much, Peter and Andrew. I do indeed use the XML encoding through cwb-encode, and I knew that that processes tags correctly, but I didn&#39;t know how extensively it handles entities. All clear now.<div dir="auto"><br></div><div dir="auto">Best wishes,</div><div dir="auto">Scott<br><div dir="auto"><br><div class="gmail_quote" dir="auto"><div dir="ltr" class="gmail_attr">On Tue, Jan 21, 2020, 18:09 Hardie, Andrew &lt;<a href="mailto:a.hardie@lancaster.ac.uk">a.hardie@lancaster.ac.uk</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<div lang="EN-GB" link="#0563C1" vlink="#954F72">

<div class="m_3887399609630593060WordSection1">

<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1f497d">Generally, cwb-encode will attempt to parse any line that starts with a &lt; as if it were an XML tag. So yes, such characters need escaping.<u></u><u></u></span></p>

<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1f497d"><u></u> <u></u></span></p>

<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1f497d">&amp;lt; is the best way to do so, as Peter says. If you’re using cwb-encode directly, remember to use the -x option so that this will be properly interpreted. If you’re going via CQPweb, then -x is always switched on.<u></u><u></u></span></p>
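<p class="MsoNormal">As a rough illustration of the escaping step described above, here is a minimal Python sketch that rewrites &lt; as &amp;lt; on token lines of a verticalised file while leaving structural tags alone. The names TAG_LINE, escape_token_line and escape_vertical_text are hypothetical, and the regular expression is only a heuristic for "looks like a real tag", not the rule cwb-encode itself applies. It assumes the output is then encoded with -x so that the entities are decoded again.</p>

```python
import re

# Heuristic (an assumption, not cwb-encode's own rule): a structural line is
# exactly one start, end, or empty-element tag, with nothing after it.
TAG_LINE = re.compile(r"^</?[A-Za-z_][\w.:-]*(\s[^<>]*)?/?>\s*$")


def escape_token_line(line: str) -> str:
    """Replace '<' with '&lt;' unless the line looks like markup."""
    # Leave XML declarations and comments alone as well.
    if line.startswith(("<?", "<!")) or TAG_LINE.match(line):
        return line
    return line.replace("<", "&lt;")


def escape_vertical_text(text: str) -> str:
    """Apply escape_token_line to every line, preserving line endings."""
    return "".join(
        escape_token_line(line) for line in text.splitlines(keepends=True)
    )


if __name__ == "__main__":
    # The token line from the original question, with tab-separated columns.
    sample = "<s>\n<\t<\tFz\tFz\tF\toth\n</s>\n"
    print(escape_vertical_text(sample), end="")
```

<p class="MsoNormal">Applied to the token line from the original question, this turns the first two columns into &amp;lt; and keeps the remaining columns unchanged, while lines such as &lt;s&gt; pass through untouched. Because the change is undone by -x at encoding time, it should not alter the text content of the final corpus.</p>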

<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1f497d"><u></u> <u></u></span></p>

<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1f497d">best<u></u><u></u></span></p>

<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1f497d"><u></u> <u></u></span></p>

<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1f497d">Andrew<u></u><u></u></span></p>

<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#1f497d"><u></u> <u></u></span></p>

<div>

<div style="border:none;border-top:solid #e1e1e1 1.0pt;padding:3.0pt 0cm 0cm 0cm">

<p class="MsoNormal"><b><span lang="EN-US">From:</span></b><span lang="EN-US"> <a href="mailto:cwb-bounces@sslmit.unibo.it" target="_blank" rel="noreferrer">cwb-bounces@sslmit.unibo.it</a> &lt;<a href="mailto:cwb-bounces@sslmit.unibo.it" target="_blank" rel="noreferrer">cwb-bounces@sslmit.unibo.it</a>&gt;

<b>On Behalf Of </b>Uhrig, Peter<br>

<b>Sent:</b> 21 January 2020 16:42<br>

<b>To:</b> Open source development of the Corpus WorkBench &lt;<a href="mailto:cwb@sslmit.unibo.it" target="_blank" rel="noreferrer">cwb@sslmit.unibo.it</a>&gt;<br>

<b>Subject:</b> Re: [CWB] Dealing with &quot;malformed tag&quot; error<u></u><u></u></span></p>

</div>

</div>

<p class="MsoNormal"><u></u> <u></u></p>

<div>

<p class="MsoNormal"><span lang="DE">Hi Scott,<u></u><u></u></span></p>

<p class="MsoNormal"><span lang="DE"><u></u> <u></u></span></p>

<p class="MsoNormal"><span lang="EN-US">I recommend using &amp;lt; here as an XML entity.<u></u><u></u></span></p>

<p class="MsoNormal"><span lang="EN-US">See here:

</span><span lang="DE"><a href="http://liste.sslmit.unibo.it/pipermail/cwb/2018-February/003072.html" target="_blank" rel="noreferrer">http://liste.sslmit.unibo.it/pipermail/cwb/2018-February/003072.html</a></span><span lang="EN-US"><u></u><u></u></span></p>

<p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p>

<p class="MsoNormal"><span lang="EN-US">Best wishes,<u></u><u></u></span></p>

<p class="MsoNormal"><span lang="EN-US">Peter<u></u><u></u></span></p>

<p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p>

<p class="MsoNormal"><b><span lang="DE">From:</span></b><span lang="DE"> <a href="mailto:cwb-bounces@sslmit.unibo.it" target="_blank" rel="noreferrer">

cwb-bounces@sslmit.unibo.it</a> &lt;<a href="mailto:cwb-bounces@sslmit.unibo.it" target="_blank" rel="noreferrer">cwb-bounces@sslmit.unibo.it</a>&gt;

<b>On Behalf Of </b>Scott Sadowsky<br>

<b>Sent:</b> Tuesday, 21 January 2020 16:54<br>

<b>To:</b> CWBdev Mailing List &lt;<a href="mailto:cwb@sslmit.unibo.it" target="_blank" rel="noreferrer">cwb@sslmit.unibo.it</a>&gt;<br>

<b>Subject:</b> [CWB] Dealing with &quot;malformed tag&quot; error<u></u><u></u></span></p>

<p class="MsoNormal"><span lang="DE"><u></u> <u></u></span></p>

<div>

<div>

<p class="MsoNormal"><span lang="DE">I&#39;m trying to encode a very large corpus derived from very heterogeneous text files. I&#39;ve solved most of the problems (e.g. multiple character encodings and the like), but there&#39;s one I&#39;m not sure how to deal with.<u></u><u></u></span></p>

</div>

<div>

<p class="MsoNormal"><span lang="DE"><u></u> <u></u></span></p>

</div>

<div>

<p class="MsoNormal"><span lang="DE">After tagging the texts with FreeLing I end up with a certain number of lines that are as follows:<u></u><u></u></span></p>

</div>

<div>

<p class="MsoNormal"><span lang="DE"><u></u> <u></u></span></p>

</div>

<div>

<p class="MsoNormal"><span lang="DE" style="font-family:&quot;Courier New&quot;">&lt;     &lt;     Fz     Fz     F     oth</span><span lang="DE"><u></u><u></u></span></p>

</div>

<div>

<p class="MsoNormal"><span lang="DE"><u></u> <u></u></span></p>

</div>

<p class="MsoNormal"><span lang="DE">When compiling the corpus, CQP throws the following error for each such case:<u></u><u></u></span></p>

<div>

<p class="MsoNormal"><span lang="DE"><u></u> <u></u></span></p>

</div>

<div>

<p class="MsoNormal"><span lang="DE" style="font-family:&quot;Courier New&quot;">Malformed tag &lt; &lt;       Fz      Fz      F       oth, inserted literally (file ~/02-Tagged/0128716.xml, line #85)</span><span lang="DE"><u></u><u></u></span></p>

</div>

<div>

<p class="MsoNormal"><span lang="DE"><u></u> <u></u></span></p>

</div>

<div>

<p class="MsoNormal"><span lang="DE">These cases seem to be from when writers got unduly creative with symbols, rather than from mathematical uses, so they&#39;re probably mostly expendable.<u></u><u></u></span></p>

</div>

<div>

<p class="MsoNormal"><span lang="DE"><u></u> <u></u></span></p>

</div>

<div>

<p class="MsoNormal"><span lang="DE">What&#39;s the best way to handle cases like these? I could in theory eliminate them with a script before CQP tries to compile the corpus, but I&#39;m loath to make destructive changes to text contents. So it would be good to know what effect leaving them in will have on the final corpus: will they interfere with CQP&#39;s corpus compilation process? For example, will they cause it to incorrectly determine where actual tags begin and end? Or are they basically harmless?</span></p></div></div></div></div></div>

</blockquote></div></div></div></div>