[CWB] decoding and encoding CWB corpora in R (polmineR and cwbtools)

Thomas Messerli thomas.messerli at unibas.ch
Wed Jul 5 16:45:42 CEST 2023


Dear CWB members,

Maybe someone can help with this. I want to create a workflow to annotate, edit, etc. CWB corpora in R and I have some open issues.

What works so far:

1) With polmineR: C.old <- decode([CORPUS], to = "data.table"). This works fine and creates a data.table of the token stream, with p_attributes as well as s_attributes in columns.
	- The CWB corpus contains the following s_attributes: "corpus", "text", "text_id", "s", "s_id", "s_polarity", "s_subjectivity"
	- The decoded data.table C.old contains columns for all of these, with "corpus", "text", and "s" being empty

2) Using cwbtools, I do:
	- C.new <- CorpusData$new()
	- C.new$tokenstream <- C.old
	- cpos_max_min <- function(x) list(cpos_left = min(x[["cpos"]]), cpos_right = max(x[["cpos"]]))
	- C.new$metadata <- C.new$tokenstream[, cpos_max_min(.SD), by = text_id]
	- C.new$tokenstream[, text_id := NULL]
	- then I use C.new$encode(…)
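
To make the region-table step concrete, here is a minimal sketch on a toy token stream that uses only data.table. The toy data (words, text ids, and the object names tokenstream and metadata) are made up for illustration and stand in for C.new$tokenstream and C.new$metadata; no cwbtools call is involved.

```r
library(data.table)

# Toy token stream standing in for the decoded corpus (all values made up).
tokenstream <- data.table(
  cpos    = 0:5,
  word    = c("A", "first", "text", "A", "second", "text"),
  text_id = c("t1", "t1", "t1", "t2", "t2", "t2")
)

# Same helper as above: first and last corpus position of a group of tokens.
cpos_max_min <- function(x) {
  list(cpos_left = min(x[["cpos"]]), cpos_right = max(x[["cpos"]]))
}

# One row per text: the annotation value plus the region it covers.
metadata <- tokenstream[, cpos_max_min(.SD), by = text_id]
print(metadata)

# The annotation column is then dropped from the token stream itself.
tokenstream[, text_id := NULL]
```

The resulting table has one row per text_id with its cpos_left and cpos_right, which is why only text_id ends up as an s_attribute in this workflow.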

While this works in principle, the s_attribute sections of the resulting registry files differ (see excerpts below), and I’m not sure yet whether this might create problems. More importantly, I am unclear how I could use this approach while also keeping the corpus structured into sentences, including the annotations s_id, s_polarity, and s_subjectivity.
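
To illustrate what I am after: the sentence-level counterpart of the metadata table would look roughly like the output of the sketch below. It uses only data.table on made-up values (the object names tokens and s_regions are hypothetical), and it assumes that s_id is unique within each text. It does not show how to hand such a table to cwbtools, which is the part I am unsure about.

```r
library(data.table)

# Toy token stream with sentence-level annotations (all values made up).
# Annotation values are kept as character strings.
tokens <- data.table(
  cpos           = 0:6,
  word           = c("Good", "film", ".", "It", "was", "long", "."),
  text_id        = "t1",
  s_id           = rep(c("s1", "s2"), times = c(3, 4)),
  s_polarity     = rep(c("0.7", "-0.1"), times = c(3, 4)),
  s_subjectivity = rep(c("0.6", "0.4"), times = c(3, 4))
)

# One row per sentence: region boundaries plus the sentence's annotations.
# Grouping by all annotation columns keeps their values in the result.
s_regions <- tokens[
  ,
  .(cpos_left = min(cpos), cpos_right = max(cpos)),
  by = .(text_id, s_id, s_polarity, s_subjectivity)
]
print(s_regions)
```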

Does anyone have any pointers as to how I could re-encode a corpus in R so that the result is more similar, or even identical, to what I decoded?

Best,
Thomas





The s_attributes part of the original registry file looks like this:

##
## s-attributes (structural markup)
##

# <corpus> ... </corpus>
STRUCTURE corpus

# <text id=".."> ... </text>
# (no recursive embedding allowed)
STRUCTURE text
STRUCTURE text_id              # [annotations]

# <s id=".." polarity=".." subjectivity=".."> ... </s>
# (no recursive embedding allowed)
STRUCTURE s
STRUCTURE s_id                 # [annotations]
STRUCTURE s_polarity           # [annotations]
STRUCTURE s_subjectivity       # [annotations]


The registry file for C.new is simply: 

## s-attributes
##

STRUCTURE text_id




-------------------------------------------------------------------------------------
Dr. Thomas C. Messerli
Postdoctoral Teaching and Research Fellow (Oberassistent)
Department of Languages and Literatures, Universität Basel
Englisches Seminar
Nadelberg 6
CH-4051 Basel

Office 15 
+41 61 207 27 82

http://www.thomasmesserli.org
thomas.messerli at unibas.ch


