<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"><!-- P {margin-top:0;margin-bottom:0;} --></style>
</head>
<body dir="ltr">
<div id="divtagdefaultwrapper" dir="ltr" style="font-size:12pt; color:rgb(0,0,0); font-family:Calibri,Helvetica,sans-serif,EmojiFont,'Apple Color Emoji','Segoe UI Emoji',NotoColorEmoji,'Segoe UI Symbol','Android Emoji',EmojiSymbols">
<p>Dear All,</p>
<p><br>
</p>
<p>I realise this question may not be a perfect fit for this mailing list, but I'm not sure who or where else to ask, so here goes: <span style="font-size:12pt">Have any of you ever worked with components from the
<a href="http://ice-corpora.net/ice/index.html" class="OWAAutoLink">International Corpus of English</a>? The XML-like annotations in the original files seem to be broken in many ways (e.g., inconsistent, unclosed and
open tags, invalid overlaps, reserved characters in content), so preparing them for CQP turned out to be quite
challenging (at least for me). It's not really that I got stuck on a specific
problem; I'm rather curious whether you have some general advice for correcting such ill-formed texts, perhaps from experience. I feel like regular expressions
can only go so far (though I may very well just not be sufficiently knowledgeable). There is an International Corpus of Learner English on the Lancaster CQPweb page. Is that similar by any chance?</span></p>
<p><span style="font-size:12pt"><br>
</span></p>
<p><span style="font-size:12pt">Best,</span></p>
<p><span style="font-size:12pt">Florian</span></p>
</div>
</body>
</html>