No subject


Fri Oct 19 03:52:45 CEST 2012


Unfortunately not, you need to re-index from scratch. A p-attribute
includes frequency data and a reverse index as well as the actual corpus
data, not to mention that the latter is compressed using codes dependent
on frequency: so adding anything means the whole p-att must change.

Best

Andrew.


Nik <cqplist at nikvdp.com> wrote:

> Hi all,
> I have a pretty simple question: is there any way to append text to an
> existing corpus?
>
> We're working on a corpus based on data collected from a webcrawler and
> would like to periodically update the corpus with new data from the
> crawler. From the documentation I found info on how to add annotations
> to existing corpora etc., but I can't find anything about simply
> appending new data to an existing corpus.
>
> Decoding the entire corpus, adding the new data to the generated file
> and re-encoding the new file is an option, but the server we're running
> on isn't exactly fast. Any way to save a few CPU cycles and directly
> insert the new data into the existing corpus? Perhaps there's some
> functionality to combine two corpora into one?
>
> Thanks,
> Nik

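As a rough illustration of the point in the reply about codes that depend on frequency, here is a toy Python sketch. It assumes a plain Huffman code over word types as a stand-in; this is not CWB's actual on-disk format, and the function name `huffman_code_lengths` and the sample sentences are made up for the example. It only shows that appending a little text can shift the code lengths of words that were already in the corpus, so the previously stored data would no longer decode with the new code table.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code_lengths(tokens):
    """Return {word type: Huffman code length in bits} for a token list.

    Toy model only: a hypothetical stand-in for a frequency-dependent
    compression scheme, not the real CWB data format.
    """
    freqs = Counter(tokens)
    if len(freqs) == 1:
        # A single type still needs one bit per token.
        return {next(iter(freqs)): 1}
    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(f, next(tiebreak), {t: 0}) for t, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every leaf in them one level deeper.
        merged = {t: depth + 1 for t, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


# Made-up sample data: an "existing corpus" and a small batch of new text.
old_corpus = "the cat sat on the mat the end".split()
new_data = "cat cat cat cat dog".split()

before = huffman_code_lengths(old_corpus)
after = huffman_code_lengths(old_corpus + new_data)

# Word types that were already present but now get a different code length.
changed = sorted(t for t in before if before[t] != after[t])

print("code lengths before:", before)
print("code lengths after: ", after)
print("existing types whose code length changed:", changed)
```

In this toy setup, the frequency table, the code table and the encoded token stream all depend on the full set of tokens, which is the same reason given above for rebuilding the whole p-attribute rather than appending to it.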

More information about the CWB mailing list