[CWB] [ cwb-Feature Requests-2891967 ] Read undump files without explicit line count

SourceForge.net noreply at sourceforge.net
Sat Nov 7 02:50:11 CET 2009


Feature Requests item #2891967, was opened at 2009-11-04 14:57
Message generated for change (Comment added) made by andrewhardie
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=722306&aid=2891967&group_id=131809

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: CWB engine
Group: None
Status: Closed
Priority: 6
Private: No
Submitted By: Stefan Evert (schtepf)
Assigned to: Stefan Evert (schtepf)
Summary: Read undump files without explicit line count

Initial Comment:
The "undump" command in CQP requires an explicit line count header in the first line of the undump file, so that arrays can be pre-allocated.  This is a major hassle for exchanging data with spreadsheets, SQL database engines, R, and other software that would otherwise work quite well with the TAB-delimited format of dump/undump files.  Without this restriction, it would also be possible to use dump files as a platform-independent serialization format for query results (unlike "save", which produces unportable binary files that even store the registry directory of the base corpus). 
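
For exporters that still have to target the header-based format, the line count can be prepended on the writing side. The following is only a minimal sketch: the function name `with_line_count` is hypothetical, and it assumes each data row is a TAB-separated pair of integer corpus positions (match, matchend), which this report does not spell out.

```python
def with_line_count(rows):
    """Serialize (start, end) pairs as TAB-delimited text with a count header.

    Assumption: one row per line, two integer columns, preceded by a single
    line that holds the number of rows.
    """
    body = "".join(f"{start}\t{end}\n" for start, end in rows)
    return f"{len(rows)}\n{body}"


if __name__ == "__main__":
    # Two hypothetical query matches.
    print(with_line_count([(10, 12), (20, 25)]), end="")
```

Tools such as spreadsheets or SQL engines typically cannot emit this extra first line themselves, which is the hassle the request is about.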

----------------------------------------------------------------------

>Comment By: Andrew Hardie (andrewhardie)
Date: 2009-11-07 01:50

Message:
CQPweb could take advantage of this and use the new format to save some
internal juggling. However, it strikes me that it may be faster to keep the
old format and use the pipe to cqp's stdin instead of a file. That would
avoid two disk operations (file write, file read). I have never measured,
but according to what I've read, file operations are always slower than
piping.

However, to do this it would, I think, be necessary to give *direct*
access to the pipes in the CQP object (or, alternatively, add a method or
methods for pipe plumbing to the object) -- it would not work through the
usual "execute" method. Question: how would the end of the file then be
marked if the pipe to the child stdin has to remain open after the undump?
Presumably once the requisite number of lines has been read, cqp will stop
undumping and go back to its normal line-reading-and-parsing mode; Stefan,
is this correct?

Andrew.
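
The behaviour presumed above (read the count, consume exactly that many lines, then resume normal input) can be illustrated with a small sketch. This is not CQP code and does not confirm how cqp behaves: `read_counted_block` is a hypothetical name, an in-memory stream stands in for the pipe, the rows are assumed to be TAB-separated integer pairs, and "next command;" is a placeholder rather than a real CQP command.

```python
import io


def read_counted_block(stream):
    """Read a count line, then exactly that many TAB-delimited rows.

    The stream is left positioned just after the last row, so the caller
    can keep reading whatever follows without any end-of-file marker.
    """
    n = int(stream.readline())
    rows = []
    for _ in range(n):
        fields = stream.readline().split("\t")
        rows.append((int(fields[0]), int(fields[1])))
    return rows


# Stand-in for the child's stdin: a counted block followed by more input.
pipe = io.StringIO("2\n10\t12\n20\t25\nnext command;\n")
rows = read_counted_block(pipe)
rest = pipe.readline()  # the reader has not consumed this line
```

With an explicit count, the pipe can stay open after the block because the reader knows in advance where the data ends.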

----------------------------------------------------------------------

Comment By: Stefan Evert (schtepf)
Date: 2009-11-04 15:00

Message:
Fixed in version 2.2.b101.

The header line is now optional if the undump is loaded from a regular
file.  CQP will automatically detect the new format and read the undump
file in two passes (first to determine the number of lines, then to read
the actual data).  The new format cannot be used when reading from a pipe
or from standard input (because pipes cannot be re-read).
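
The two-pass idea can be sketched as follows. This is an illustration under stated assumptions, not the CQP implementation: `read_undump` is a hypothetical name, rows are assumed to be TAB-separated integer pairs with no blank lines, and the detection rule used here (a first line without a TAB is treated as the count header) is a guess at one possible heuristic.

```python
import io


def read_undump(stream):
    """Read (start, end) pairs, with or without a leading line count."""
    first = stream.readline()
    if first == "":
        return []
    if "\t" not in first:
        # Old format: the first line is the explicit row count.
        n = int(first)
    else:
        # New format: count lines in a first pass, then rewind.
        # A pipe cannot be rewound, so this only works on seekable input.
        if not stream.seekable():
            raise ValueError("header-less undump requires a seekable file")
        stream.seek(0)
        n = sum(1 for _ in stream)
        stream.seek(0)
    rows = [(0, 0)] * n  # pre-allocated, as the count makes possible
    for i in range(n):
        fields = stream.readline().split("\t")
        rows[i] = (int(fields[0]), int(fields[1]))
    return rows


if __name__ == "__main__":
    print(read_undump(io.StringIO("2\n10\t12\n20\t25\n")))
    print(read_undump(io.StringIO("10\t12\n20\t25\n")))
```

Both calls return the same two pairs; only the second needs the extra counting pass, which is why it is limited to input that can be re-read.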

----------------------------------------------------------------------
