[CWB] [cwb:support-requests] #4 raw un-tagged text corpus

Andrew Hardie andrewhardie at users.sourceforge.net
Wed Apr 10 10:39:31 CEST 2019


- **status**: open --> closed
- **assigned_to**: Andrew Hardie
- **Comment**:

You don't have to tag the text, but you *do* have to tokenise it (i.e. split it so there is one token per line) in order to index it in CWB. Tagging is the most convenient way to do this as most POS taggers also tokenise. But if you have access to a non-tagging tokeniser you can use that instead.
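
As a rough illustration of what "one token per line" means, here is a minimal Python sketch. It is not part of CWB: the regular expression and the function names `tokenise` and `to_vertical` are hypothetical and deliberately naive, so a proper tokeniser for your language will do a better job. The output is plain vertical text with a single column (the word form), which is the sort of input the CWB indexing tools expect.

```python
import re

# Hypothetical, naive tokeniser (an assumption for illustration only):
# a token is either a run of word characters, optionally with internal
# apostrophes or hyphens, or a single punctuation character.
TOKEN_RE = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")


def tokenise(text):
    """Split raw text into a list of tokens."""
    return TOKEN_RE.findall(text)


def to_vertical(text):
    """Return the text in vertical format: one token per line."""
    return "".join(token + "\n" for token in tokenise(text))


if __name__ == "__main__":
    sample = "The cat didn't sit. It slept!"
    # Prints each token on its own line, ready to be saved to a file.
    print(to_vertical(sample), end="")
```

The resulting file has no POS or lemma columns, so only the word form itself would be available for queries and frequency counts.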



---

**[support-requests:#4] raw un-tagged text corpus**

**Status:** closed
**Group:** v1.0_(example)
**Created:** Thu Apr 04, 2019 09:17 PM UTC by will lowder
**Last Updated:** Thu Apr 04, 2019 09:17 PM UTC
**Owner:** Andrew Hardie


Is it possible to input raw text as a corpus into CQP? I see that, according to the documentation, the standard input format is "vertical text" with each word tagged individually, but I only wish to process raw text (such as a news article) directly in CQP to view things like word frequency, n-grams, etc. Is it possible to use this text as a corpus in CWB without individually tagging each word?

