[CWB] Problem encoding corpus with POS tags

Hardie, Andrew a.hardie at lancaster.ac.uk
Tue Nov 6 15:02:48 CET 2012


Yes – the stray CR will have attached itself to the tag at the end of the line, so what you thought was tagged NN was actually tagged NN{CR}.

This is a common enough gotcha that we should probably give cwb-encode the ability to spot CR on POSIX and raise the alarm.
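
To illustrate, here is a minimal Python sketch of the effect described above. It is not part of CWB, and the helper names `split_vertical_line` and `find_cr_lines` are made up for this example. It assumes that only the LF terminator is removed before a line is split on tabs, so a CR left over from a CRLF ending stays attached to the last column.

```python
def split_vertical_line(line):
    """Split one token line on tabs, removing only the LF terminator.

    Assumption for illustration: a CR before the LF is NOT stripped,
    so it remains part of the last field.
    """
    return line.rstrip("\n").split("\t")


def find_cr_lines(lines):
    """Return the 1-based numbers of lines that contain a carriage return."""
    return [n for n, line in enumerate(lines, start=1) if "\r" in line]


# Two token lines: the first has a CRLF ending, the second a plain LF.
sample = ["dog\tNN\r\n", "cat\tNN\n"]

fields = split_vertical_line(sample[0])
print(repr(fields[-1]))       # the tag is 'NN\r', not 'NN'
print(find_cr_lines(sample))  # only line 1 is affected
```

A check like `find_cr_lines` could be run over the input files before encoding; any non-empty result means the tags on those lines will not match a query for the plain tag.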

Another thing for the TODO list!

best

Andrew.

From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Albert Gatt
Sent: 06 November 2012 13:44
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Problem encoding corpus with POS tags

I seem to have found a solution to the problem. It seems it was caused by line endings -- some of my files had CRLF endings instead of plain old unix-style ones. The search with POS works fine now.
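
For anyone hitting the same problem, the conversion can be sketched in a few lines of Python. This is only an illustration, not the exact procedure used here: the function name `normalise_line_endings` is hypothetical, and it assumes the files are in UTF-8 or another ASCII-compatible encoding, so that replacing the CRLF byte pair is safe.

```python
from pathlib import Path


def normalise_line_endings(path):
    """Rewrite a file in place with LF endings; return True if it changed.

    Works on raw bytes so that non-ASCII tokens are left untouched
    (assumes an ASCII-compatible encoding such as UTF-8).
    """
    p = Path(path)
    data = p.read_bytes()
    fixed = data.replace(b"\r\n", b"\n")
    if fixed != data:
        p.write_bytes(fixed)
        return True
    return False
```

The corpus then needs to be re-encoded from the converted files, since the tags stored in the existing index still carry the CR.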

Thanks for your help.
albert

On 5 November 2012 15:34, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:
Hi Albert,

You may need to check whether pos has been configured properly as primary annotation.

As a superuser, go to the main corpus search page, then select "Manage Annotation" from the menu. Check whether the "Primary annotation" slot has POS selected. If not, change it and update; then it should work.

If, on the other hand, pos *IS* properly selected on that screen, let me know, and I'll look into what else might be causing the problem.

(I am not sure why the primary annotation is sometimes not selected correctly at index time. A bug, of course, but not one I've managed to track down yet, as it seems to be intermittent. I'll work it out eventually.)

best

Andrew.

==========================
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Albert Gatt
Sent: 05 November 2012 14:09
To: cwb at sslmit.unibo.it
Subject: [CWB] Problem encoding corpus with POS tags

I'm trying to install a corpus which has word + POS, via CQPWeb. An example of the data is shown below:

<text id="lh1">
<s id="0">
Anqas   MV
għaraftek       VV
...     PUN
</s>
...
</text>

When I install, I leave the s-attributes as default (since "s" is the only structural attribute I have, apart from "text") and specify "pos" as the primary p-attribute.

The corpus installs without problems, and I can use CQPWeb's frequency list functionality to see a list of different parts of speech, as well as word tokens. I can successfully run queries for words. However, any query that involves POS gives me no results (e.g. "kien_VA" where "kien" is a word and "VA" is a tag).

I'm not sure where the problem lies.

thanks
albert


--
-----------------------------------------------------------------
Albert Gatt
Institute of Linguistics
Rm 22, Block A
Car Park 6
University of Malta
Tal-Qroqq Msida MSD2080
Malta

tel: (+356) 2340 2150
http://staff.um.edu.mt/albert.gatt/
_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/cwb



