Introducing the ‘polmineR’-package

Andreas Blätte (andreas.blaette@uni-due.de)

2017-10-03

Purpose

The purpose of the package polmineR is to facilitate the interactive analysis of corpora using R. Apart from performance and usability, key considerations for developing the package are:

The ‘polmineR’ package supplements R packages that are already widely used for text mining. The CRAN task view is a good place to learn about relevant packages. The polmineR package is intended to be an interface between the Corpus Workbench (CWB), an efficient system for storing and querying large corpora, and existing packages for mining text with advanced statistical methods.

Apart from the speed of text processing, the Corpus Query Processor (CQP) and the CQP syntax provide a powerful and widely used syntax to query corpora. This is not a unique idea. Using a combination of R and the CWB implies a software architecture you will also find in the TXM project, or with CQPweb. The ‘polmineR’ package offers a library with the grammar of corpus analysis below a graphical user interface (GUI). It is a toolset to perform simple tasks efficiently as well as to implement complex workflows.

Advanced users will need a good understanding of the Corpus Workbench. The Corpus Encoding Tutorial is an authoritative text for that. The vignette of the rcqp package includes an excellent explanation of the CWB data model.

The most important thing users need to know is the difference between “s” and “p” attributes. The CWB distinguishes structural attributes (s-attributes), which contain the metainformation that can be used to generate subcorpora, and positional attributes (p-attributes), which annotate each token. Typically, the p-attributes will be ‘word’, ‘pos’ (for part-of-speech) and ‘lemma’ (for the lemmatized word form).
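
The distinction can be pictured with a toy example in base R. The sketch below only illustrates the data model; it is not how the CWB stores data, and the object names and values (tokens, regions, the example sentences) are made up for the illustration.

```r
# p-attributes: one value per token, indexed by corpus position (cpos)
tokens <- data.frame(
  cpos  = 0:5,
  word  = c("We", "debated", "budgets", "They", "voted", "yesterday"),
  pos   = c("PRP", "VBD", "NNS", "PRP", "VBD", "NN"),
  lemma = c("we", "debate", "budget", "they", "vote", "yesterday"),
  stringsAsFactors = FALSE
)

# s-attributes: metadata attached to regions, i.e. ranges of corpus positions
regions <- data.frame(
  cpos_left  = c(0, 3),
  cpos_right = c(2, 5),
  text_year  = c("2005", "2006"),
  stringsAsFactors = FALSE
)

# A subcorpus is the set of tokens inside the regions that match a metadata value
selected <- regions[regions$text_year == "2006", ]
in_region <- vapply(
  tokens$cpos,
  function(i) any(i >= selected$cpos_left & i <= selected$cpos_right),
  logical(1)
)
subcorpus <- tokens[in_region, ]
subcorpus$word
```

Selecting by metadata and then looking at the token-level annotation is the pattern that the partition-based workflow described later builds on.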

Getting started

Check that the CORPUS_REGISTRY environment variable is set

The annex of the vignette includes a detailed explanation of how to install polmineR on Windows, MacOS, and Linux. Once you have installed polmineR, check that the environment variable CORPUS_REGISTRY is set.

Sys.getenv("CORPUS_REGISTRY")

The CORPUS_REGISTRY environment variable points to the directory with the registry files. These files describe where the CWB will find the files of an indexed corpus, and which s- and p-attributes the corpus has. See the annex for an explanation of how to set the CORPUS_REGISTRY environment variable for the current R session, or permanently.
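
A slightly more defensive check can be sketched in base R. The helper check_registry is hypothetical (it is not part of polmineR); it only reports whether the variable is set and whether it points to an existing directory.

```r
# Hypothetical helper, not part of polmineR: classify the state of the registry setting
check_registry <- function(registry = Sys.getenv("CORPUS_REGISTRY")) {
  if (!nzchar(registry)) return("unset")
  if (!dir.exists(registry)) return("directory not found")
  "ok"
}

status <- check_registry()
print(status)

# If the directory exists, list the registry files it contains
if (status == "ok") print(list.files(Sys.getenv("CORPUS_REGISTRY")))
```

If the status is anything other than "ok", set the variable as described in the annex before loading polmineR.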

Loading polmineR

If the CORPUS_REGISTRY variable is set correctly, i.e. pointing to the directory with the registry files describing the corpora, load polmineR.

library(polmineR)

Using and installing packaged corpora

If you want to use a CWB corpus packaged in an R data package, you can call ‘use’ with the name of the R package. To access the corpus in the data package, the CORPUS_REGISTRY environment variable will be reset. In the following examples, the CWB encoded English version of the Europarl corpus, wrapped into a data package, will be used.

use("europarl.en")

Note that the use-function will call the resetRegistry-function, which can also be used to restore the original path to the directory with registry files. If you want to use the English Europarl corpus, you can download it from a repository at the PolMine server.

install.corpus("europarl.en", repo = "http://polmine.sowi.uni-due.de/packages")

This package may serve as an example of how CWB indexed corpora can be shared using an adapted version of standard R functions. Data packages with corpora have a version number, which may be important for reproducing results. They can also include a vignette documenting the data, and functions to perform specialized tasks.

Setting the interface to the Corpus Workbench

The standard interface used by the polmineR package to extract information from CWB indexed corpora is the package ‘rcqp’. The interface is defined in a class called ‘CQI’. To check which interface is used:

class(polmineR:::CQI)
## [1] "CQI.rcqp"  "CQI.super" "R6"

If you see “CQI.perl” leading the character vector that is returned, something went wrong. Accessing the corpora using perl scripts incurs a severe performance loss. Reset the interface as follows:

unlockBinding(env = getNamespace("polmineR"), sym = "CQI")
assign("CQI", CQI.rcqp$new(), envir = getNamespace("polmineR"))
lockBinding(env = getNamespace("polmineR"), sym = "CQI")

An alternative interface is provided by the package ‘polmineR.Rcpp’, so far available at GitHub (https://github.com/PolMine/polmineR.Rcpp). This (experimental) package includes a few functions that speed up tasks such as counting terms, or preparing partitions. To install and use ‘polmineR.Rcpp’:

devtools::install_github("PolMine/polmineR.Rcpp")
setCorpusWorkbenchInterface("Rcpp")

To switch to the interface offered by polmineR.Rcpp, proceed as follows:

unlockBinding(env = getNamespace("polmineR"), sym = "CQI")
assign("CQI", CQI.Rcpp$new(), envir = getNamespace("polmineR"))
lockBinding(env = getNamespace("polmineR"), sym = "CQI")

The checks performed when submitting a package to CRAN issue a note when the ‘unlockBinding’ function appears in the source code of the package. This is why a function to reset the interface is not included in the package.

Checking that corpora are available

Use the corpus-method to check which corpora are accessible. It should be the EUROPARL-EN corpus in our case (the names of CWB corpora are always written in upper case).

corpus()
##        corpus     size template
## 1 EUROPARL-EN 39431862    FALSE

Session settings

Many functions in the polmineR package use settings that are stored in the general options settings. You can see these settings as follows:

options()[grep("polmineR", names(options()))]

Several methods (such as kwic, or cooccurrences) will use these settings if no other value is provided explicitly. Here are a few examples of how to change settings.

options("polmineR.left" = 15)
options("polmineR.right" = 15)
options("polmineR.mc" = FALSE)
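
When a setting should only be changed temporarily, base R offers a simple pattern: options() returns the previous values, which can be passed back to options() later. The sketch uses the option names shown above; the value 5 is arbitrary.

```r
# options() invisibly returns the previous values of the options it changes
old <- options("polmineR.left" = 5, "polmineR.right" = 5)

getOption("polmineR.left")  # the temporary value set above

# ... run the analysis that needs the narrower context window here ...

# Restore whatever was set before
options(old)
```

This keeps a temporary change from leaking into later steps of a session.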

To speed up computations, the polmineR package will sometimes try to use alternative, faster ways to access CWB corpora than the rcqp package. A few computations that are performance critical (for setting up partitions, or to count term frequencies) are implemented in the plugin package polmineR.Rcpp that has been mentioned before. It can be installed from GitHub. Note that it requires an installation of the Corpus Workbench to be present on your system (see installation instructions in the annex).

devtools::install_github("PolMine/polmineR.Rcpp")

When loading the polmineR package, it is checked whether that plugin is present.

getOption("polmineR.Rcpp")

If you want to suppress using the polmineR.Rcpp functionality:

options("polmineR.Rcpp" = FALSE)

Working with corpora: Core methods

Core analytical tasks are implemented as methods (S4 class system), i.e. the behaviour of the methods changes depending on the object that is supplied. Almost all methods can be applied to corpora as well as partitions (subcorpora). As an easy entry, methods applied to corpora are explained first.

Keyword-in-context (kwic)

The kwic method applied to the name of a corpus will return a KWIC object. The output will be shown in the viewer pane of RStudio. You can include metadata from the corpus using the ‘meta’ parameter.

kwic("EUROPARL-EN", "Islam")
kwic("EUROPARL-EN", "Islam", meta = c("text_date", "speaker_name"))
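
To make the idea of a keyword-in-context display concrete, here is a toy version in base R. It is an illustration only and not how polmineR implements kwic; the function toy_kwic and the example tokens are made up. The left and right arguments play the role of the context window.

```r
# Toy keyword-in-context display in base R (illustration only, not polmineR's kwic)
toy_kwic <- function(tokens, node, left = 3, right = 3) {
  hits <- which(tokens == node)
  rows <- lapply(hits, function(i) {
    # Positions of up to `left` tokens before and `right` tokens after the hit
    left_idx  <- tail(seq_len(i - 1), left)
    right_idx <- head(seq_len(length(tokens) - i) + i, right)
    data.frame(
      left  = paste(tokens[left_idx], collapse = " "),
      node  = tokens[i],
      right = paste(tokens[right_idx], collapse = " "),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

tokens <- c("The", "debate", "on", "Islam", "in", "Europe", "continues", "today")
toy_kwic(tokens, "Islam", left = 2, right = 2)
```

Each hit becomes one row with its left context, the node, and its right context.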

You can also use the CQP query syntax for formulating queries. That way, you can find multi-word expressions, or match in a manner you may know from using regular expressions.

kwic("EUROPARL-EN", '"Geneva" "Convention"')
kwic("EUROPARL-EN", '"[Ss]ocial" "justice"')

Explaining the CQP syntax goes beyond the scope of this vignette. Consult the CQP tutorial to learn more about the CQP syntax.
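
One property of CQP queries is worth keeping in mind: each quoted string is a regular expression that has to match a whole token, and a sequence of quoted strings matches consecutive tokens. The base R sketch below imitates this behaviour on a toy token vector. The function match_sequence and the example tokens are made up, and the sketch covers only this small part of the CQP syntax.

```r
# Base R imitation of a CQP query such as '"[Ss]ocial" "justice"' (illustration only)
match_sequence <- function(tokens, patterns) {
  n <- length(tokens)
  k <- length(patterns)
  if (k == 0 || n < k) return(integer(0))
  starts <- seq_len(n - k + 1)
  ok <- rep(TRUE, length(starts))
  for (j in seq_len(k)) {
    # Anchor the pattern so that it has to match the whole token
    anchored <- paste0("^(", patterns[j], ")$")
    ok <- ok & grepl(anchored, tokens[starts + j - 1])
  }
  starts[ok]
}

tokens <- c("for", "Social", "justice", "and", "antisocial", "justice",
            "or", "social", "justice")
# Start positions (1-based R indices) of the matching two-token sequences
match_sequence(tokens, c("[Ss]ocial", "justice"))
```

Because the pattern is anchored to the whole token, "antisocial justice" is not a match.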

Getting counts and frequencies

You can count one or several hits in a corpus.

count("EUROPARL-EN", "France")
##     query count         freq
## 1: France  5517 0.0001399122
count("EUROPARL-EN", c("France", "Germany", "Britain", "Spain", "Italy", "Denmark", "Poland"))
##      query count         freq
## 1:  France  5517 1.399122e-04
## 2: Germany  4196 1.064114e-04
## 3: Britain  1708 4.331523e-05
## 4:   Spain  3378 8.566676e-05
## 5:   Italy  3209 8.138089e-05
## 6: Denmark  1615 4.095673e-05
## 7:  Poland  1820 4.615557e-05
count("EUROPARL-EN", '"[pP]opulism"')
##            query count         freq
## 1: "[pP]opulism"   107 2.713542e-06
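
The numbers above are consistent with the freq column being the count divided by the size of the corpus. A quick check in base R, using the counts and the corpus size reported in this vignette:

```r
corpus_size <- 39431862                          # size of EUROPARL-EN as reported by corpus()
counts <- c(France = 5517, "[pP]opulism" = 107)  # counts taken from the output above

# Relative frequency: count divided by the number of tokens in the corpus
freq <- counts / corpus_size
freq
```

Relative frequencies make counts comparable across corpora or partitions of different sizes.
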

Dispersions

… get dispersions of counts across one (or two) dimensions …

pop <- dispersion("EUROPARL-EN", "populism", sAttribute = "text_year", progress = FALSE)
popRegex <- dispersion("EUROPARL-EN", '"[pP]opulism"', sAttribute = "text_year", cqp = TRUE, progress = FALSE)

Note that it is a data.table that is returned. Visualising the result as a barplot …

barplot(height = popRegex[,count], names.arg = popRegex[,text_year], las = 2)
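
Conceptually, a dispersion over one dimension is a count of hits per value of an s-attribute. The toy example below shows this idea in base R with made-up data; it does not reproduce the columns or the implementation of polmineR's dispersion method.

```r
# Toy data: one row per hit of a query, with the year of the speech (made-up values)
hits <- data.frame(
  match     = c("populism", "Populism", "populism", "populism", "Populism"),
  text_year = c("2004", "2004", "2005", "2006", "2006"),
  stringsAsFactors = FALSE
)

# Count hits per value of the s-attribute
per_year <- as.data.frame(table(text_year = hits$text_year), responseName = "count")
per_year
```

A table of this shape is what a barplot of hits per year is drawn from.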

Cooccurrences

… get cooccurrence statistics …

br <- cooccurrences("EUROPARL-EN", query = "Brussels")
eu <- cooccurrences("EUROPARL-EN", query = '"European" "Union"', left = 10, right = 10)
subset(eu, rank_ll <= 100)@stat[["word"]][1:15]
##  [1] "the"       ","         "of"        "I"         "to"       
##  [6] "."         "within"    "States"    "The"       "its"      
## [11] "countries" "in"        "we"        "this"      "that"

Working with subcorpora - partitions

Easily creating partitions (i.e. subcorpora) based on s-attributes is a strength of the ‘polmineR’ package. So if we want to work with the speeches given in the European Parliament in 2006:

ep2006 <- partition("EUROPARL-EN", text_year = "2006")

To get some basic information about the partition that has been set up, the ‘show’-method can be used. It is also called when you simply type the name of the partition object.

ep2006
## ** partition object **
## corpus:              EUROPARL-EN 
## name:                 
## sAttributes:         text_year = 2006 
## cpos:                46 pairs of corpus positions
## size:                3100529 tokens
## count:               not available

To evaluate s-attributes, regular expressions can be used.

barroso <- partition("EUROPARL-EN", speaker_name = "Barroso", regex = TRUE)
sAttributes(barroso, "speaker_name")
## [1] "Barroso"             "José Manuel Barroso"
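
The difference between exact and regular expression matching of attribute values can be illustrated in base R. The two Barroso variants are taken from the output above; the third name is added for contrast. How polmineR evaluates the expression internally is not covered by this sketch.

```r
# Values of a speaker_name attribute (third value added for contrast)
speaker_names <- c("Barroso", "Jos\u00e9 Manuel Barroso", "Schulz")

# Matching by regular expression finds both variants of the name
regex_hits <- grep("Barroso", speaker_names, value = TRUE)
regex_hits

# Exact matching finds only the literal value
exact_hits <- speaker_names[speaker_names == "Barroso"]
exact_hits
```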

If you work with a flat XML structure, the order of the provided s-attributes may be relevant for speeding up the setup of the partition. For a nested XML structure, it is important that the order moves from ancestors to children. For further information, see the documentation of the partition-function.

Cooccurrences

The cooccurrences-method can be applied to partition-objects.

ep2002 <- partition("EUROPARL-EN", text_year = "2006")
terror <- cooccurrences(ep2002, "terrorism", pAttribute = "lemma", left = 10, right = 10)

Note that it is possible to provide a query that uses the full CQP syntax. The statistical analysis of collocations to the query can be accessed as the slot “stat” of the context object.

terror@stat[1:10,][,.(lemma, count_partition, rank_ll)]
##             lemma count_partition rank_ll
##  1:         fight            1047       1
##  2:       against            2822       2
##  3:        combat             623       3
##  4:         crime             533       4
##  5:      organise             441       5
##  6:           war             449       6
##  7:        threat             445       7
##  8:           and           84531       8
##  9:   immigration             601       9
## 10: international            1876      10
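
The rank_ll column ranks cooccurring terms by a log-likelihood statistic. As background, the sketch below computes the log-likelihood ratio (G²) for a 2x2 contingency table in base R. The function name and the counts are made up for the illustration, and polmineR's own computation may differ in detail.

```r
# Log-likelihood ratio (G2) for a 2x2 table of observed counts:
#   o11: term inside the context window,   o12: other tokens inside the window
#   o21: term outside the context window,  o22: other tokens outside the window
log_likelihood <- function(o11, o12, o21, o22) {
  observed <- matrix(c(o11, o21, o12, o22), nrow = 2)
  # Expected counts under independence of rows and columns
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  # Cells with an observed count of zero contribute nothing to the sum
  terms <- ifelse(observed > 0, observed * log(observed / expected), 0)
  2 * sum(terms)
}

# Made-up counts: the term is strongly overrepresented in the window
log_likelihood(o11 = 50, o12 = 950, o21 = 100, o22 = 98900)
```

The statistic is zero when a term is exactly as frequent in the context window as elsewhere, and grows as the observed counts depart from that expectation.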

Distribution of queries

To understand the occurrence of a phenomenon, the distribution of query results across one or two dimensions will often be interesting. This is done via the ‘dispersion’ method. The query may use the CQP syntax.

# one query / one dimension
oneQuery <- dispersion(ep2002, query = 'terrorism', "text_date", progress = FALSE)

# multiple queries / one dimension
twoQueries <- dispersion(ep2002, query = c("war", "peace"), "text_date", progress = FALSE)

Getting features

To identify the specific vocabulary of a corpus of interest, a statistical test (chi-square or log-likelihood) can be performed.

ep2002 <- partition("EUROPARL-EN", text_year = "2002")
ep2002 <- enrich(ep2002, pAttribute = "word")

epPre911 <- partition("EUROPARL-EN", text_year = as.character(1997:2001))
epPre911 <- enrich(epPre911, pAttribute = "word")

F <- features(ep2002, epPre911, included = FALSE)
subset(F, rank_chisquare <= 50)@stat[["word"]]
##  [1] "2002"         "Johannesburg" "Seville"      "Barcelona"   
##  [5] "'s"           "2003"         "Copenhagen"   "terrorism"   
##  [9] "02"           "candidate"    "Spanish"      "2002/"       
## [13] "Palestinian"  "Kaliningrad"  ","            "Aznar"       
## [17] "enlargement"  "asbestos"     "Danish"       "Sharon"      
## [21] "Prestige"     "biofuels"     "2004"         "Convention"  
## [25] "2003."        "Galicia"      "2001/"        "137(1"       
## [29] "Summit"       "Iraq"         "Criminal"     "GM"          
## [33] "Sky"          "Presidency"   "competences"  "ICC"         
## [37] "Galileo"      "Cotonou"      "Haarder"      "Israel"      
## [41] "sustainable"  "however"      "cod"          "Valencia"    
## [45] "Arafat"       "mid-term"     "COM(2001"     "Afghanistan" 
## [49] "Explanation"  "medicinal"
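
To give an idea of what a chi-square test for a single term looks like, the sketch below compares the count of a term in a corpus of interest with its count in a reference corpus, using chisq.test from base R. All counts are made up, and the exact procedure used by the features method may differ.

```r
# Made-up counts for a single term in a corpus of interest (coi) and a reference corpus (ref)
count_coi <- 120; size_coi <- 100000
count_ref <- 150; size_ref <- 500000

observed <- matrix(
  c(count_coi, size_coi - count_coi,
    count_ref, size_ref - count_ref),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("coi", "ref"), c("term", "other"))
)

# Pearson chi-square statistic without continuity correction
test <- chisq.test(observed, correct = FALSE)
unname(test$statistic)
```

Computing such a statistic for every term and ranking the results gives a list of the vocabulary that is characteristic of the corpus of interest.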

Getting a tm TermDocumentMatrix

For many applications, term-document matrices are the point of departure. The tm class TermDocumentMatrix serves as an input to several R packages implementing advanced text mining techniques. Obtaining this input from a corpus imported to the CWB will usually involve setting up a partitionBundle and then applying a method to get the matrix.

speakers <- partitionBundle(
  "EUROPARL-EN", sAttribute = "speaker_id",
  progress = FALSE, verbose = FALSE
)
speakers <- enrich(speakers, pAttribute = "word")
tdm <- as.TermDocumentMatrix(speakers, col = "count")
class(tdm) # to see what it is
show(tdm)
m <- as.matrix(tdm) # turn it into an ordinary matrix
m[c("Barroso", "Schulz"),]
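
The structure of a term-document matrix can be shown with a toy example in base R, without the tm package. The speakers, words and counts are made up; terms are in the rows and documents (here: speakers) in the columns.

```r
# Toy counts per speaker (made-up values), in long format
counts <- data.frame(
  speaker = c("A", "A", "B", "B", "B"),
  word    = c("budget", "reform", "budget", "treaty", "reform"),
  count   = c(3, 1, 2, 5, 1),
  stringsAsFactors = FALSE
)

# Terms in rows, documents (here: speakers) in columns
m <- xtabs(count ~ word + speaker, data = counts)
m

# Counts of one term across all documents
m["budget", ]
```

Combinations that do not occur in the data, such as "treaty" for speaker A, are filled with zero.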

Moving on

The package includes many features that go beyond this vignette. It is a key aim in the project to develop respective documentation in the vignette and the man pages for the individual functions further. Feedback is very welcome!

Annex I: Installing polmineR

Windows (32 bit / i386)

At this stage, an easy way to install polmineR is available only for 32bit R. Usually, an R installation will include both 32bit and 64bit R. So if you want to keep things simple, make sure that you work with the 32bit version. If you work with RStudio (highly recommended), the menu Tools > Global Options will open a dialogue where you can choose 32bit R.

Before installing polmineR, the package ‘rcqp’ needs to be installed. In turn, rcqp requires plyr, which should be installed first.

install.packages("plyr")

To avoid compiling C code in a package, packages with compiled binaries are very handy. Windows binaries for the rcqp package are not available at CRAN, but can be installed from a repository of packages maintained at the server of the PolMine project:

install.packages("rcqp", repos = "http://polmine.sowi.uni-due.de/packages", type = "win.binary")

To explain: Compiling the C code in the rcqp package on a Windows machine is not yet possible. The package we offer uses a cross-compilation of these C libraries, i.e. binaries that have been prepared for Windows on a MacOS/Linux machine.

Before proceeding to install polmineR, we install dependencies that are not installed automatically.

install.packages(pkgs = c("htmltools", "htmlwidgets", "magrittr", "iterators", "NLP"))

The latest stable version of polmineR can now be installed from CRAN. Several other packages that polmineR depends on, or that dependencies depend on may be installed automatically.

install.packages("polmineR")

The development version of the package, which may include the most recent updates and features, can be installed from GitHub. The easiest way to do this is to use a mechanism offered by the package devtools.

install.packages("devtools")
devtools::install_github("PolMine/polmineR", ref = "dev")

The installation may throw warnings. There are three warnings you can ignore at this stage:

The configure script is for Linux/MacOS installation; its sole purpose is to pass tests for uploading the package to CRAN. As mentioned, Windows binaries are not yet available for 64bit R, so that warning can be ignored. The environment variable “CORPUS_REGISTRY” can be set as follows in R:

Sys.setenv(CORPUS_REGISTRY = "C:/PATH/TO/YOUR/REGISTRY")

To set the environment variable CORPUS_REGISTRY permanently, see the instructions R offers on how to find the file ‘.Renviron’ or ‘.Renviron.site’ in the help for the startup process (?Startup).

Two important notes concerning problems with the CORPUS_REGISTRY environment variable that may cause serious headaches:

Finally: polmineR is optimized for working with RStudio. If you work with 32bit R, you may have to check in the settings of RStudio that it will call 32bit R. To be sure, check the startup message.
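
Apart from reading the startup message, the architecture of the running R session can be checked from the console with base R:

```r
# Architecture string of the running R session, e.g. "i386" for 32bit R
R.version$arch

# Pointer size in bytes times 8 gives the number of bits: 32 or 64
bits <- 8 * .Machine$sizeof.pointer
bits
```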

If everything works, check whether polmineR can be loaded.

library(polmineR)
corpus() # to see corpora available at your system

Windows (64 bit / x64)

At this stage, 64 bit support is still experimental. Apart from an installation of 64 bit R, you will need to install Rtools, available here. Rtools is a collection of tools necessary to build and compile R packages on a Windows machine.

To interface to a core C library of the Corpus Workbench (CWB), you will need an installation of a 64 bit AND a 32 bit version of the CWB.

The “official” 32 bit version of the CWB is available here. Installation instructions are available at the CWB Website. The 32 bit version should be installed in the directory “C:Files”, with admin rights.

The 64 bit version, prepared by Andreas Blaette, is available here. Install this 64 bit CWB version to “C:Files (x86)”. In the unzipped downloaded zip file, you will find a bat file that will do the installation. Take care that you run the file with administrator rights. Without these rights, no files will be copied.

The interface to the Corpus Workbench is the package polmineR.Rcpp, available at GitHub. If you use git, you can clone that repository, otherwise, you can download a zip file.

The downloaded zip file needs to be unzipped again. Then, in the directory with the ‘polmineR.Rcpp’-directory, run:

R CMD build polmineR.Rcpp
R CMD INSTALL polmineR.Rcpp_0.1.0.tar.gz

If you read closely what is going on during the compilation, you will see a few warnings that libraries are not found. If creating the package is not aborted, nothing is wrong. R CMD build will look for the 64 bit files in the directory with the 32 bit dlls first and discover that they do not work for 64 bit, only then will it move to the correct location.

Once polmineR.Rcpp is installed, proceed with the instructions for installing polmineR in a 32 bit context. Future binary releases of the polmineR.Rcpp package may make things easier. In any case, this is a proof of concept that polmineR will work on a 64 bit Windows machine too.

Finally, you need to make sure that polmineR will interface to CWB indexed corpora using polmineR.Rcpp, and not with rcqp (the default). To set the interface accordingly:

setCorpusWorkbenchInterface("Rcpp")

To test whether corpora are available:

corpus()
##        corpus     size template
## 1 EUROPARL-EN 39431862    FALSE

MacOS

The following instructions for Mac users assume that R is installed on your system. Binaries are available from the Homepage of the R Project. An installation of RStudio is highly recommended. The Open Source License version of RStudio Desktop is what you need.

Installing ‘polmineR’

The latest release of polmineR can be installed from CRAN using the usual install.packages-function.

install.packages("polmineR")

The development version of polmineR can be installed using devtools:

install.packages("devtools") # unless devtools is already installed
devtools::install_github("PolMine/polmineR", ref = "dev")

Installing ‘rcqp’

The default interface of the polmineR package to access CWB indexed corpora is the package ‘rcqp’. Accessing corpora will not work before you have installed the interface.

Installing precompiled binary of rcqp from the PolMine server

The easiest way to get rcqp for Mac is to install a precompiled binary that is available at the PolMine server:

install.packages(
  "rcqp",
  repos = "http://polmine.sowi.uni-due.de/packages",
  type = "mac.binary"
)

Building rcqp from source

If you want to get rcqp from CRAN and/or if you want to compile the C code yourself, the procedure is as follows.

First, you will need an installation of Xcode, which you can get via the Mac App Store. You will also need the Command Line Tools for Xcode, which can be installed from a terminal with:

xcode-select --install

To compile the C code in the rcqp package, there are system requirements that need to be fulfilled. Using a package manager such as Homebrew or Macports makes things considerably easier.

Option 1: Using Homebrew

We recommend using ‘Homebrew’. To install Homebrew, follow the instructions on the Homebrew Homepage. The following commands will install the C libraries the rcqp package relies on:

brew -v install pkg-config
brew -v install glib --universal
brew -v install pcre --universal
brew -v install readline

Option 2: Using Macports

If you prefer using Macports, get it from https://www.macports.org/. After installing Macports, it is necessary to restart the computer. Next, an update of Macports is necessary.

sudo port -v selfupdate

Now we can install the libraries rcqp will require. Again, from the terminal.

sudo port install glib2
sudo port install pkgconfig
sudo port install pcre

Install dependencies and rcqp

Once the system requirements are there, the next steps can be done from R. Before installing rcqp, and then polmineR, we install a few packages. In the R console:

install.packages(pkgs = c("RUnit", "devtools", "plyr", "tm"))

Now rcqp can be installed, and then polmineR:

install.packages("rcqp")
install.packages("polmineR")

If you like to work with the development version, that can be installed from GitHub.

devtools::install_github("PolMine/polmineR", ref = "dev")

Linux

The pcre, glib and pkg-config libraries can be installed using apt-get.

sudo apt-get install libglib2.0-dev
sudo apt-get install libssl-dev
sudo apt-get install libcurl4-openssl-dev

The system requirements will now be fulfilled. From R, install dependencies for rcqp/polmineR first, and then rcqp and polmineR.

install.packages(c("RUnit", "devtools", "plyr", "tm")) # package names as one character vector
install.packages("rcqp")
install.packages("polmineR")
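
If some of the dependencies are already present, it can save time to install only what is missing. The helper missing_packages below is hypothetical (not part of polmineR) and uses base R only; the actual installation call is left commented out.

```r
# Hypothetical helper: which of the given packages are not installed yet?
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

to_install <- missing_packages(c("RUnit", "devtools", "plyr", "tm"))
to_install

# install.packages() expects the package names as a single character vector:
# install.packages(to_install)
```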

Annex II: CWB corpora and the CORPUS_REGISTRY environment variable

Indexed corpora can be stored in two different locations. The conventional way is to keep CWB corpora in a directory with two subdirectories, a ‘registry’ directory, and an ‘indexed_corpora’ directory. The files in the registry directory (‘registry’ in short) describe the main features of a corpus, and where it is stored. It is necessary to inform rcqp, the package used by polmineR to access corpora, about the registry directory. That is done using the CORPUS_REGISTRY environment variable. It needs to be defined before loading rcqp and polmineR. Note that you need to set the environment variable to the ‘registry’ folder, not to the files that are located in this directory.

The CORPUS_REGISTRY environment variable can be set manually from the R console:

Sys.setenv(CORPUS_REGISTRY = "/PATH/TO/YOUR/REGISTRY/DIRECTORY")

# For example the path could look like this:
# Sys.setenv(CORPUS_REGISTRY = "/Library/Frameworks/R.framework/Versions/3.3/Resources/library/plprbt/extdata/cwb/registry")

To check whether and how the environment variable is set:

Sys.getenv("CORPUS_REGISTRY")
## [1] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library/europarl.en/extdata/cwb/registry"

You can set the environment variable permanently to avoid having to set it each time before you want to use polmineR. A good way is to include the following line in the file .Renviron in your home directory:

CORPUS_REGISTRY="/PATH/TO/YOUR/REGISTRY/DIRECTORY"
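
To see how R interprets such a line without touching your real .Renviron, the line can be written to a temporary file and read with readRenviron from base R. This is a sketch only: the temporary file stands in for the real file, and the path is the placeholder used above.

```r
# Write the line to a throwaway file; the real file would be .Renviron in your home directory
renviron <- file.path(tempdir(), ".Renviron")
writeLines('CORPUS_REGISTRY="/PATH/TO/YOUR/REGISTRY/DIRECTORY"', renviron)

# readRenviron() processes the file in the same format R reads at startup
# (note: this changes CORPUS_REGISTRY for the current session)
readRenviron(renviron)
Sys.getenv("CORPUS_REGISTRY")
```

The surrounding quotes are stripped when the file is read, so the variable holds the plain path.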

There are a few other options to have environment variables set every time you launch R. To learn about these, use the help for the R startup procedure.

?Startup