[CWB] decoding and encoding CWB corpora in R (polmineR and cwbtools)
Thomas Messerli
thomas.messerli at unibas.ch
Wed Jul 5 16:45:42 CEST 2023
Dear CWB members,
Maybe someone can help with this. I want to create a workflow to annotate, edit, etc. CWB corpora in R and I have some open issues.
What works so far:
1) polmineR: C.old <- decode([CORPUS], to = "data.table"), which works fine and creates a data.table of the token stream with p_attributes as well as s_attributes in columns.
- The CWB corpus contains the following s_attributes: "corpus", "text", "text_id", "s", "s_id", "s_polarity", "s_subjectivity"
- The decoded data.table C.old contains columns for all of these, with "corpus", "text", and "s" being empty
2) using cwbtools I do:
- C.new <- CorpusData$new()
- C.new$tokenstream <- C.old
- cpos_max_min <- function(x) list(cpos_left = min(x[["cpos"]]), cpos_right = max(x[["cpos"]]))
- C.new$metadata <- C.new$tokenstream[, cpos_max_min(.SD), by = text_id]
- C.new$tokenstream[, text_id := NULL]
- then I use C.new$encode(…)
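For illustration, the region table built in step 2 can be reproduced on a toy token stream with data.table alone. This is only a sketch: the toy values and the standalone object names `tokenstream` and `metadata` are made up for the example and stand in for `C.new$tokenstream` and `C.new$metadata`; no cwbtools call is involved.

```r
library(data.table)

# Toy token stream (made-up values) standing in for the decoded corpus.
tokenstream <- data.table(
  cpos    = 0:5,
  word    = c("A", "good", "film", "Not", "for", "me"),
  text_id = c("t1", "t1", "t1", "t2", "t2", "t2")
)

# Same helper as in step 2: first and last corpus position of a group.
cpos_max_min <- function(x) {
  list(cpos_left = min(x[["cpos"]]), cpos_right = max(x[["cpos"]]))
}

# One row per text, holding the region boundaries of that text.
metadata <- tokenstream[, cpos_max_min(.SD), by = text_id]
print(metadata)
```

Each row of `metadata` then describes one text region by its leftmost and rightmost corpus position, which is the only structural information that survives into the re-encoded corpus in this workflow.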
While this works in principle, the s_attributes section of the resulting registry file differs from the original (see excerpts below), and I'm not sure yet whether this might create problems. More importantly, I am unclear how I could use this approach while also keeping the corpus structured into sentences, including the annotations s_id, s_polarity, and s_subjectivity.
Does anyone have any pointers as to how I could re-encode a corpus in R so that the result is more similar, or even identical, to what I decoded?
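To make the sentence-level part of the problem concrete, the same grouping idea can be sketched for sentences. This is a minimal sketch with made-up toy data, not a confirmed cwbtools workflow: it assumes that s_id values are unique across the corpus and that s_polarity and s_subjectivity are constant within a sentence, and the name `s_regions` is hypothetical. It only derives the region table; how such a table would be handed to an encoding step is left open.

```r
library(data.table)

# Toy token stream (made-up values) with sentence-level annotation columns,
# mirroring the columns of the decoded data.table described above.
tokenstream <- data.table(
  cpos           = 0:5,
  word           = c("Great", "film", "Too", "long", "See", "it"),
  text_id        = c("t1", "t1", "t1", "t1", "t2", "t2"),
  s_id           = c("s1", "s1", "s2", "s2", "s3", "s3"),
  s_polarity     = c("0.8", "0.8", "-0.3", "-0.3", "0.0", "0.0"),
  s_subjectivity = c("0.9", "0.9", "0.6", "0.6", "0.1", "0.1")
)

# One row per sentence: region boundaries plus the annotation values.
# Assumption: the annotation values do not vary within a sentence, so
# grouping by all three columns yields exactly one row per s_id.
s_regions <- tokenstream[
  , .(cpos_left = min(cpos), cpos_right = max(cpos)),
  by = .(s_id, s_polarity, s_subjectivity)
]
print(s_regions)
```

The resulting table has the same shape as the text-level metadata in step 2 (one region per row), with the sentence annotations carried along as additional columns.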
Best,
Thomas
The s_attributes part of the original registry file looks like this:
##
## s-attributes (structural markup)
##
# <corpus> ... </corpus>
STRUCTURE corpus
# <text id=".."> ... </text>
# (no recursive embedding allowed)
STRUCTURE text
STRUCTURE text_id # [annotations]
# <s id=".." polarity=".." subjectivity=".."> ... </s>
# (no recursive embedding allowed)
STRUCTURE s
STRUCTURE s_id # [annotations]
STRUCTURE s_polarity # [annotations]
STRUCTURE s_subjectivity # [annotations]
The registry file for C.new is simply:
## s-attributes
##
STRUCTURE text_id
-------------------------------------------------------------------------------------
Dr. Thomas C. Messerli
Postdoctoral Teaching and Research Fellow (Oberassistent)
Department of Languages and Literatures, Universität Basel
Englisches Seminar
Nadelberg 6
CH-4051 Basel
Office 15
+41 61 207 27 82
http://www.thomasmesserli.org
thomas.messerli at unibas.ch