No subject
Fri Oct 19 03:52:45 CEST 2012
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body>
Unfortunately not: you need to re-index from scratch. A p-attribute includes frequency data and a reverse index as well as the actual corpus data, and the corpus data itself is compressed using codes that depend on frequency, so adding anything means the whole p-attribute must change.
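To make the point above concrete, here is a toy sketch in Python. It is not CWB's actual on-disk format: the Huffman-style coder and the names `huffman_codes` and `build_attribute` are assumptions made purely for illustration. It only shows why frequency-dependent codes force a full rebuild once new tokens arrive.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(freqs):
    """Toy Huffman coder: map each symbol to a bit string based on its frequency."""
    tiebreak = count()  # unique counter so heap entries never compare the dicts
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in sorted(freqs.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


def build_attribute(tokens):
    """Hypothetical stand-in for one p-attribute: frequencies, codes, stream, reverse index."""
    freqs = Counter(tokens)
    codes = huffman_codes(freqs)
    stream = "".join(codes[t] for t in tokens)  # compressed token stream
    index = {}
    for pos, tok in enumerate(tokens):  # reverse index: token -> corpus positions
        index.setdefault(tok, []).append(pos)
    return {"freqs": freqs, "codes": codes, "stream": stream, "index": index}


old_tokens = "the cat sat on the mat the end".split()
new_tokens = "cat cat cat cat dog".split()

before = build_attribute(old_tokens)
after = build_attribute(old_tokens + new_tokens)

# The appended text shifts the frequency table, so the code table changes too,
# and the old compressed stream is no longer a valid prefix of the new one.
print(before["codes"]["cat"], "->", after["codes"]["cat"])
print(after["stream"].startswith(before["stream"]))
```

In this sketch the code for an existing token changes as soon as the appended text alters the frequency table, so the previously encoded stream cannot simply be extended in place.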
<div><br>
</div>
<div>Best</div>
<div><br>
</div>
<div>Andrew.</div>
<br>
<br>
<br>
Nik &lt;cqplist at nikvdp.com&gt; wrote:<br>
<br>
<br>
<div>Hi all,
<div>I have a pretty simple question: is there any way to append text to an existing corpus?</div>
<div><br>
</div>
<div>We're working on a corpus based on data collected from a web crawler and would like to periodically update the corpus with new data from the crawler. In the documentation I found information on how to add annotations to existing corpora etc., but I can't find anything about simply appending new data to an existing corpus.</div>
<div><br>
</div>
<div>Decoding the entire corpus, adding the new data to the generated file and re-encoding the new file is an option, but the server we're running on isn't exactly fast. Is there any way to save a few CPU cycles and insert the new data directly into the existing corpus? Perhaps there's some functionality to combine two corpora into one?</div>
<div><br>
</div>
<div>Thanks,</div>
<div>Nik</div>
</div>
</body>
</html>