R is rapidly increasing popularity in scientific research; it has several advantages over other programming languages: it is free, it is fairly powerful, there is an excellent user community, and it was designed with the statistical analysis of data in mind. The purpose of this vignette is to introduce some of the function in the SciencesPo package, and how these can be used in data analysis workflows and reporting results.
The SciencesPo package is meant to provide primarily functions and algorithms for analyzing political behavior data, including measures of political fragmentation and seat allocation methods. In addition, it also equip the environment with some functions to do descriptive statistics, cross-tabulation, tests, and to render pre set publication-quality plots that will require only a minimum amount of fiddling details. Although this package is available for the general public, it meets my personal needs and tastes. Yours may be different.
The development of this package began in July of 2011. I found myself spending a great deal of time manipulating several datasets with a basket of standalone functions, but a very little time actually analyzing data. So, I thought this package could help researchers with interests alike. Since then, I have diligently labored to maintain this R library updated. The SciencesPo is currently hosted by the The R Foundation, and distributed over the internet via CRAN. You can find the package source code on Github, and you are welcome to contribute code via pull requests, or bug reports.
This document is organized as follows. In the next section, I addressed the basic aspects of setting up the package. Following, I discuss some of the basic data manipulation functions. Next, I show some of the basic functions for creating tables and cross-tabulations. Then, I present basic functions for computing distributions. Next, I address the basic functions for computing measures of political behavior. I end by introducing several plotting functions, including themes and color palettes available. Throughout the paper, I try to keep the commentary to a minimum so the user can easily breeze through this without having to digest my witty banter.
This section introduces the basics of SciencesPo for conducting customary analysis; the package version I’m using is 1.4.1.
SciencesPo loads packages as needed and assumes that they are installed. Thus, the recomended form of installing the package to ensure that all the needed packages are also installed is by using the following statement:
install.packages("SciencesPo", dependencies = c("Depends", "Suggests"))
library('SciencesPo')
## Do things ...
detach('package:SciencesPo', unload = TRUE)
If you load it via loadNamespace('SciencesPo')
, you can call unloadNamespace('SciencesPo')
.
Whenever you load the package, it will setup its own environment, including plotting themes, summary table styles, etc. Thus, some objects may be printed or plotted differently as you would have seen before loading SciencesPo. Also, the package is also operator (%>%
) friendly, so much of the functions will work as expected in a chain like.
To see a list of existing vignettes, type:
library("SciencesPo")
vignette(package = "SciencesPo")
no vignettes found
To see the collection of data included in the package, type:
data(package = "SciencesPo")
Here are some examples that demonstrates the results of help.search(), or you can also use ??
to search for for all commands that have some strings related to the text within quotes. You can also restrict the search to a specific library as:
help.search("D'Hondt", package = 'SciencesPo')
The next example searches inside all R related sites for that “string”, and open the results in a web browser.
RSiteSearch("D'Hondt")
In this section, a samll selection of functions for conducting exploratory data analysis (EDA) is presented.
There is a specific document covering one-way, two-way, and multiway cross-tabulations with accompanying independent tests, so the following examples will only introduce this topic. To describe an entire data.frame, there is the Describe
function, which will produce a summary description table with the following features: variable names, labels, factor levels, and frequencies or summary statistics. For small datasets, the output can be used as the “table one” descriptive statistics in a data analysis publication report.
data("titanic")
Describe(titanic)
Summary table
Dataset: titanic
------------------------------------------------------
Variable Levels.or.Stats Freq Valid
---------- ----------------- --------------- ---------
1. 1. 1st 1: 325 (14.8%) 2201/2201
CLASS 2. 2nd 2: 285 (12.9%) (100.0%)
3. 3rd 3: 706 (32.1%)
4. crew 4: 885 (40.2%)
2. 1. adult 1: 2092 (95%) 2201/2201
AGE 2. child 2: 109 (5%) (100.0%)
3. 1. female 1: 470 (21.4%) 2201/2201
SEX 2. male 2: 1731 (78.6%) (100.0%)
4. 1. no 1: 1490 (67.7%) 2201/2201
SURVIVED 2. yes 2: 711 (32.3%) (100.0%)
------------------------------------------------------
To tabulate one variable responses, simply:
with(titanic, Crosstable(SURVIVED))
A more performant descriptive output can be obtained with the Freq
command, which resembles the SPSS output.
Freq(titanic, SURVIVED)
Frequency Valid Percent Cum Percent
no 1490 67.7 67.7
yes 711 32.3 100.0
--------------------------------------------------------
Total 2201
Warning: Statistics may not be meaningful for factors!
Mean Std dev
1.323035 0.4677422
Minimum Maximum
1 2
Valid cases Missing cases
2201 0
To add a second variable for a cross-tabulation:
with(titanic, Crosstable(SEX, CLASS))
=========================================
CLASS
---------------------------
SEX 1st 2nd 3rd crew Total
-----------------------------------------
female 145 106 196 23 470
31% 23% 42% 4.9% 100%
male 180 179 510 862 1731
10% 10% 29% 49.8% 100%
-----------------------------------------
Total 325 285 706 885 2201
15% 13% 32% 40.2% 100%
=========================================
To delete table entries that are less relevant, switch it to FALSE
and to TRUE
otherwise. For instance, to not show column proportions, switch column
to FALSE
.
Crosstable(titanic, SEX, CLASS, column = FALSE)
To add a third variable:
Crosstable(titanic, SEX, CLASS, SURVIVED)
The Crosstable()
function can produce tables up to 3 variables with some of the most common statistical tests.
with(titanic, Crosstable(SEX, SURVIVED, fisher=TRUE) )
===========================
SURVIVED
-------------
SEX no yes Total
---------------------------
female 126 344 470
27% 73% 100%
male 1364 367 1731
79% 21% 100%
---------------------------
Total 1490 711 2201
68% 32% 100%
===========================
Fisher's Exact Test for Count Data
data: x
p-value < 0.00000000000000022
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.07734453 0.12536590
sample estimates:
odds ratio
0.09869816
with(titanic, Crosstable(SEX, CLASS, SURVIVED, chisq = TRUE))
=================================
SURVIVED
-------------
SEX CLASS no yes Total
---------------------------------
female 1st 4 141 145
2.8% 97% 100%
2nd 13 93 106
12.3% 88% 100%
3rd 106 90 196
54.1% 46% 100%
crew 3 20 23
13.0% 87% 100%
--------------------------
Total 126 344 470
26.8% 73% 100%
---------------------------------
male 1st 118 62 180
66% 34% 100%
2nd 154 25 179
86% 14% 100%
3rd 422 88 510
83% 17% 100%
crew 670 192 862
78% 22% 100%
--------------------------
Total 1364 367 1731
79% 21% 100%
---------------------------------
Total 1st 122 203 325
38% 62% 100%
2nd 167 118 285
59% 41% 100%
3rd 528 178 706
75% 25% 100%
crew 673 212 885
76% 24% 100%
--------------------------
Total 1490 711 2201
68% 32% 100%
=================================
SEX : female
Number of cases in table: 470
Number of factors: 2
Test for independence of all factors:
Chisq = 130.69, df = 3, p-value = 0.0000000000000000000000000003837
SEX : male
Number of cases in table: 1731
Number of factors: 2
Test for independence of all factors:
Chisq = 29.852, df = 3, p-value = 0.000001483
To demonstrate the output of these functions, below are the US presidents and their main opponents’ heights (cm). There is claim that height matters in American presidential elections.
require("SciencesPo")
data(stature)
These functions perform three types of skewness and kurtosis assessments in conformity with the popular e1071 package (Meyer et al. 2015):
data(stature)
attach(stature)
# Type 1:
Skewness(winner.height, type = 1)
[1] -0.5589481
# Type 2
Skewness(winner.height, type = 2)
[1] -0.5741697
# Type 3, the default
Skewness(winner.height)
[1] -0.5443037
data(stature)
attach(stature)
# Type 1:
Kurtosis(winner.height, type = 1)
[1] -0.4117404
# Type 1:
Kurtosis(winner.height, type = 2)
[1] -0.3371491
# Type 3, the default
Kurtosis(winner.height)
[1] -0.5017599
x <- sample (10, replace = TRUE)
Mode(x)
[1] 10
data(stature)
attach(stature)
CI(winner.height, level=.95) # confidence interval
Lower Est. Mean Upper Std. Error
178.6336 180.5614 182.4892 0.9623
CI(winner.height, level=.95)@mean # get only the mean
[1] 180.5614
CI(opponent.height, level=.95, na.rm = TRUE) # confidence interval
Lower Est. Mean Upper Std. Error
177.6733 179.5000 181.3267 0.9119
SE(winner.height) # std. error
[1] 0.9623285
data(stature)
attach(stature)
Winsorize(winner.height)
[1] 180.5614
# replacing 35 outlier elements we get same stature values:
Winsorize(winner.height, k=35)
[1] 183
Winsorize(opponent.height, k=35)
[1] 183
data(stature)
attach(stature)
AAD(winner.height)
[1] 5.895968
data(religiosity)
Label(religiosity$Religiosity) <- "Religiosity index"
Used to recode values of a vector into new values.
require(SciencesPo)
table(iris$Species)
setosa versicolor virginica
50 50 50
iris$Species <- Recode(iris$Species, "versicolor", 2)
table(iris$Species)
setosa 2 virginica
50 50 50
By default, R converts character columns to factors. Instead of re-reading the data using , the function identifies which columns are currently factors, and convert them all to characters parsing the levels as strings.
require(SciencesPo)
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","2","virginica": 1 1 1 1 1 1 1 1 1 1 ...
iris_2 = Safechars(iris)
str(iris_2)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : chr "setosa" "setosa" "setosa" "setosa" ...
This function converts factor variables to numeric, much like the way Stata does.
require(SciencesPo)
# Simulate some data (12 respondents x 2 items)
df <- data.frame(replicate(2, sample(1:5, 12, replace=TRUE)))
df <- data.frame(lapply(df, factor, ordered=TRUE,
levels=1:5,
labels=c("Strongly disagree","Disagree", "Neutral","Agree","Strongly Agree")))
print(df)
X1 X2
1 Neutral Neutral
2 Neutral Strongly Agree
3 Disagree Disagree
4 Agree Agree
5 Disagree Agree
6 Neutral Disagree
7 Disagree Strongly disagree
8 Agree Disagree
9 Neutral Agree
10 Strongly disagree Strongly Agree
11 Disagree Strongly disagree
12 Neutral Strongly disagree
By Destring
it, we should get a numeric result with the same (un)order:
Destring(df, "X2")
X1 X2
1: Neutral 3
2: Neutral 5
3: Disagree 2
4: Agree 4
5: Disagree 4
6: Neutral 2
7: Disagree 1
8: Agree 2
9: Neutral 4
10: Strongly disagree 5
11: Disagree 1
12: Neutral 1
x <- sample(10)
# won't print normalized values by default
(y = Normalize(x) )
[,1]
[1,] 0.0000000
[2,] 0.2222222
[3,] 1.0000000
[4,] 0.8888889
[5,] 0.3333333
[6,] 0.1111111
[7,] 0.6666667
[8,] 0.7777778
[9,] 0.5555556
[10,] 0.4444444
attr(,"normalized:min")
[1] 1
attr(,"normalized:max")
[1] 10
# equals to:
(x-min(x))/(max(x)-min(x))
[1] 0.0000000 0.2222222 1.0000000 0.8888889 0.3333333 0.1111111 0.6666667
[8] 0.7777778 0.5555556 0.4444444
# Smithson and Verkuilen approach
(y = Normalize(x, method="SV"))
[,1]
[1,] 0.05
[2,] 0.25
[3,] 0.95
[4,] 0.85
[5,] 0.35
[6,] 0.15
[7,] 0.65
[8,] 0.75
[9,] 0.55
[10,] 0.45
attr(,"normalized:min")
[1] 1
attr(,"normalized:max")
[1] 10
Simple but useful. The Formatted
function rounds numbers to text and drops leading zeros in the process.
x <- as.double(c(0.1, 1, 10, 100, 1000, 10000))
Formatted(x)
[1] "$ 0.10" "$ 1.00" "$ 10.00" "$ 100.00" "$ 1000.00"
[6] "$ 10000.00"
Formatted(x, "BRL")
[1] "R$ 0,10" "R$ 1,00" "R$ 10,00" "R$ 100,00"
[5] "R$ 1.000,00" "R$ 10.000,00"
p = c(0.25, 25, 50)
Formatted(p, "Perc", flag="+")
[1] "+0.25%" "+25%" "+50%"
Formatted(p, "Perc", decimal.mark=",")
[1] "0,25%" "25%" "50%"
The following is an example data frame.
Z X Y
1 K 1 no
2 G 2 no
3 S 3 yes
4 N 5 no
5 L 4 no
Then, the anonymized result of the data frame above:
dt %>% Anonymize()
V1 V2 V3
1 O 0.3333333 X
2 G 0.6666667 X
3 M 1.0000000 C
4 R 1.6666667 X
5 B 1.3333333 X
Here are some implementation of standard statistical distributions for pedagogic use. Most of these distributions are available within the base R, however, I remaned them with more intuitive names–at least in my view.
normalpdf(x=1.2, mu=0, sigma=1)
normalcdf(lower=-1.96, upper=1.96, mu=0, sigma=1)
invnormal(area=0.35, mu=0, sigma=1)
The function will return a matrix with rows, each containing a single random number according to the supplied alpha vector or matrix.
alphas <- cbind(1:4, 1, 4:1);
rdirichlet(4, alphas );
[,1] [,2] [,3]
[1,] 0.2173234 0.06355659 0.71911997
[2,] 0.6197115 0.11290606 0.26738244
[3,] 0.8371398 0.00306902 0.15979119
[4,] 0.6147979 0.35498354 0.03021851
Consider the following example of usage: A Brazilian face-to-face poll conducted by Datafolha on Oct 03-04 (2014) with 18,116 insterviews, asked voters for their vote preference among presidential candidates.
# draw a sample from the posterior
set.seed(1234);
n <- 18116;
poll <- c(40,24,22,5,5,4) / 100 * n; # The data
mcmc <- 10000;
sims <- rdirichlet(mcmc, alpha = poll + 1);
Let’s see a descriptive summary of the simulated data:
Describe(sims)
Summary table
Dataset: sims
-----------------------------------------------------------------------------
Variable Levels.or.Stats Freq Valid
---------- ----------------------------------- ------------------ -----------
1. mean (sd) = 0.4 (0) 10000 distinct val 10000/10000
V1 min < med < max = 0.39 < 0.4 < 0.41 (100.0%)
IQR (CV) = 0 (0.01)
2. mean (sd) = 0.24 (0) 10000 distinct val 10000/10000
V2 min < med < max = 0.23 < 0.24 < (100.0%)
0.25
IQR (CV) = 0 (0.01)
3. mean (sd) = 0.22 (0) 10000 distinct val 10000/10000
V3 min < med < max = 0.21 < 0.22 < (100.0%)
0.23
IQR (CV) = 0 (0.01)
4. mean (sd) = 0.05 (0) 10000 distinct val 10000/10000
V4 min < med < max = 0.04 < 0.05 < (100.0%)
0.06
IQR (CV) = 0 (0.03)
5. mean (sd) = 0.05 (0) 10000 distinct val 10000/10000
V5 min < med < max = 0.04 < 0.05 < (100.0%)
0.06
IQR (CV) = 0 (0.03)
6. mean (sd) = 0.04 (0) 10000 distinct val 10000/10000
V6 min < med < max = 0.04 < 0.04 < (100.0%)
0.05
IQR (CV) = 0 (0.04)
-----------------------------------------------------------------------------
After obtained 10,000 simulated values, we look at the margins as of Aecio over Marina in the very last days of the campaign:
# compute the margins: Aecio minus Marina
margins <- sims[,2] - sims[,3];
# What is the mean of the margins
# posterior mean estimate:
mean(margins);
[1] 0.02003513
# posterior standard deviation:
sd(margins);
[1] 0.005006119
# 90% credible interval:
quantile(margins, probs = c(0.025, 0.975));
2.5% 97.5%
0.01031474 0.02998522
# posterior probability of a positive margin (Aécio over Marina):
mean(margins > 0);
[1] 1
There is a specific vignette for Indices, covering in great detail many of the political behaviour indices.
Let’s consider the 1980’s US presidential election (vote share) to compute the effective number of electoral parties (ENEP).
Democratic Republican Independent Libertarian Citizens Others
0.410 0.507 0.066 0.011 0.003 0.003
PoliticalDiversity(US1980, index = "laakso/taagepera");
[1] 2.328
PoliticalDiversity(US1980, index = "golosov");
[1] 2.597
Considers the following data.frame
with electoral results for the 1999 election in Helsinki, the seats were allocated using both Saint-Laguë and D’Hondt methods:
votes seats.SL seats.dH
1 68885 5 5
2 18343 1 1
3 86448 6 7
4 21982 1 1
5 51587 4 4
6 27227 2 2
7 8482 1 0
8 7250 0 0
9 365 0 0
10 2734 0 0
11 1925 0 0
12 475 0 0
13 1693 0 0
14 693 0 0
15 308 0 0
16 980 0 0
17 560 0 0
18 590 0 0
19 185 0 0
# ENEP defaults (laakso/taagepera) method
with(Helsinki, PoliticalDiversity(votes)); #ENEP Votes
[1] 5.453
with(Helsinki, PoliticalDiversity(seats.SL)); #ENP for Saint-Lague
[1] 4.762
with(Helsinki, PoliticalDiversity(seats.dH)); #ENP for D'Hondt
[1] 4.167
Let’s use the 2012 Quebec provincial election:
party votes pvotes seats pseats
1 PQ 1393703 31.95 54 43.2
2 Lib 1360968 31.20 50 40.0
3 CAQ 1180235 27.05 19 15.2
4 QS 263111 6.03 2 1.6
5 Option 82539 1.89 0 0.0
6 Green 43394 0.99 0 0.0
7 Others 38738 0.89 0 0.0
with(Quebec, Proportionality(pvotes, pseats,
index = "Rae") )
Rae's Index : 0.057
with(Quebec, Proportionality(pvotes, pseats,
index = "Loosemore-Hanby") )
Loosemore-Hanby's Index : 0.201
with(Quebec, Proportionality(pvotes, pseats,
index = "Gallagher") )
Gallagher's Index : 0.136
Now using data from 2014 Brazilian legislative elections, especifically from one district, let’s compare the results from D’Hondt, Saint-Lague, Hungtinton-Hill, and Imperiali methods.
PCdoB PDT PEN PMDB PRB PSB PSC PSTU PTdoB
187906 326841 132531 981096 2043217 15061 103679 109830 213988
PTC PTN
67145 278267
The basic imputs for this class of functions are: 1) a list of parties, 2) a list of positive votes, and 3) a constant value for the number of seats to be returned. A numeric value (0~1) for the threshold is optional.
HighestAverages(parties=names(Ceara), votes=Ceara,
seats = 42, method = "dh")
Method: d'Hondt
Divisors: 1 2 3 4 ...
ENP: 3.65 (After): 3.14
Gallagher Index: 47.01
-------------------------
Party Seats % Seats
------- ------- ---------
PRB 21 50
PMDB 10 23.8
PDT 3 7.1
PTN 2 4.8
PTdoB 2 4.8
PCdoB 1 2.4
PEN 1 2.4
PSC 1 2.4
PSTU 1 2.4
PSB 0 0
PTC 0 0
-------------------------
The d’Hondt method is only one way of allocating seats in party list systems. Other methods include the Saint-Lague, the modified Saint-Lague, the Danish version, Imperiali (do not to confuse with the Imperiali quota which is a Largest remainder method), and Hungtinton-Hill.
HighestAverages(parties=names(Ceara), votes=Ceara,
seats = 42, method = "sl")
Method: Sainte-Laguë
Divisors: 1 3 5 7 ...
ENP: 3.65 (After): 3.74
Gallagher Index: 43.99
-------------------------
Party Seats % Seats
------- ------- ---------
PRB 19 45.2
PMDB 9 21.4
PDT 3 7.1
PTN 3 7.1
PCdoB 2 4.8
PTdoB 2 4.8
PEN 1 2.4
PSC 1 2.4
PSTU 1 2.4
PTC 1 2.4
PSB 0 0
-------------------------
Let’s assume the electoral system has a 5% vote threshold. Meaning that parties must get at least 5% of the total unspoiled votes cast in order to participate in the distribution of seats. Parties PCdoB, PTdoB, PEN, PSC, PSTU, PSB, and PTC would then be elimiated from competition at the outset. If the d’Hondt method of seat allocation were employed, then party PRB would get 4 extra seats than otherwise, and party PMDB 3 additional seats.
HighestAverages(parties=names(Ceara), votes=Ceara,
seats = 42, method = "dh", threshold = 5/100)
Method: d'Hondt
Divisors: 1 2 3 4 ...
ENP: 3.65 (After): 2.39
Gallagher Index: 53.23
-------------------------
Party Seats % Seats
------- ------- ---------
PRB 24 57.1
PMDB 12 28.6
PDT 3 7.1
PTN 3 7.1
PCdoB 0 0
PEN 0 0
PSB 0 0
PSC 0 0
PSTU 0 0
PTC 0 0
PTdoB 0 0
-------------------------
Other methods divide the votes by a mathematically derived quota, such as the Droop quota, the Hare quota (or Hamilton/Vinton), or the Imperiali quota, see next.
LargestRemainders(parties=names(Ceara), votes=Ceara,
seats = 42, method = "hare")
LargestRemainders(parties=names(Ceara), votes=Ceara,
seats = 42, method = "droop")
The output produced by the highestAveragesof()
and largestRemainders()
functions is always a data.frame
; therefore, it’s very straightforward to use with other aplications. For instance, I like the idea of using the output with knitr (2016) and ggplot2 (2009) packages to produce publishable-quality tables and graphs.
mytable = HighestAverages(parties=names(Ceara), votes=Ceara,
seats = 42, method = "dh")
Method: d'Hondt
Divisors: 1 2 3 4 ...
ENP: 3.65 (After): 3.14
Gallagher Index: 47.01
library(knitr)
kable(mytable, align=c("l","c","c"))
Party | Seats | % Seats |
---|---|---|
PRB | 21 | 50.0 |
PMDB | 10 | 23.8 |
PDT | 3 | 7.1 |
PTN | 2 | 4.8 |
PTdoB | 2 | 4.8 |
PCdoB | 1 | 2.4 |
PEN | 1 | 2.4 |
PSC | 1 | 2.4 |
PSTU | 1 | 2.4 |
PSB | 0 | 0.0 |
PTC | 0 | 0.0 |
mytable = HighestAverages(parties=names(Ceara), votes=Ceara,
seats = 42, method = "dh")
Method: d'Hondt
Divisors: 1 2 3 4 ...
ENP: 3.65 (After): 3.14
Gallagher Index: 47.01
p <- ggplot(mytable, aes(x=reorder(Party, Seats), y=Seats)) +
geom_lollipop() + coord_flip() + labs(x=NULL, y="# Seats")
p + theme_538() + theme(panel.grid.major.y = element_blank())
2014 Legislative Election in Ceara (M=42)
$p
[1] 0.0000000 0.2264877 0.2303473 0.3608878 0.4116205 0.6213280 0.7130032
[8] 0.7581524 0.8769094 1.0000000
$L
[1] 0.0000000 0.1955209 0.1990112 0.3231465 0.3731350 0.5883759 0.6829786
[8] 0.7313230 0.8617789 1.0000000
$L.general
[1] 0.0000 176.2074 179.3530 291.2262 336.2768 530.2563 615.5142
[8] 659.0832 776.6527 901.2203
attr(,"class")
[1] "lorenz"
x <- c(778, 815, 857, 888, 925, 930, 965, 990, 1012)
wgt <- runif(n=length(x))
Gini(x, wgt)
[1] 0.03611
x <- c(778, 815, 857, 888, 925, 930, 965, 990, 1012)
Atkinson(x, epsilon = 0.5)
[1] 0.001719
The SciencesPo brings in few themes and color palettes to enhance ggplot2 graphs, but there are specialized packages like ggthemes, ggthemr, cowplot, ggalt, GGally, and sjPlot that can make graph decoration a lot easier. There’s also an interesting discussion on plot design in the outstanding reference Cookbook for R that might be of your interest.
With SciencesPo, one can “preview” ggplot themes by simply invoking the Previewplot()
function. This will draw a predefined plot that we can combine with themes, for instance.
Previewplot() + theme_grey()
In my point of view, the default ggplot2 design has its charm, but, very often I don’t like the gray background grid. I feel it distracts from the data. For example, see the following ggplot2 graph with default settings on.
# detach("package:SciencesPo", unload = TRUE)
gg <- ggplot(mtcars, aes(mpg, disp,color=factor(carb),size=hp))
gg <- gg + geom_point(alpha=0.7) + labs(title="Bubble Plot")
gg <- gg +scale_size_continuous(range = c(3,8))
gg
theme_blank()
Produces a consistent theme for ggplot2 graphs with no background, axis texts, or grid lines.library(SciencesPo)
gg <- ggplot(mtcars, aes(mpg, disp,color=factor(carb),size=hp))
gg <- gg + geom_point(alpha=0.7) + labs(title="Bubble Plot")
gg <- gg +scale_size_continuous(range = c(3,8))
gg <- gg + theme_blank()
gg
theme_pub()
Produces a consistent theme for ggplot2 graphs with a white background and fewer grid lines.library(SciencesPo)
gg <- ggplot(mtcars, aes(mpg, disp,color=factor(carb),size=hp))
gg <- gg + geom_point(alpha=0.7) + labs(title="Bubble Plot")
gg <- gg +scale_size_continuous(range = c(3,8))
gg <- gg + theme_pub()
gg
theme_fte()
Produces a consistent theme for ggplot2 graphs, similar to the plots used in the http://fivethirtyeight.com/.library(SciencesPo)
gg <- ggplot(mtcars, aes(mpg, disp,color=factor(carb),size=hp))
gg <- gg + geom_point(alpha=0.7) + labs(title="Bubble Plot")
gg <- gg +scale_size_continuous(range = c(3,8))
gg <- gg + theme_538()
gg
theme_darkside()
Produces a consistent theme for ggplot2 graphs with a dark background and fewer grid lines.library(SciencesPo)
gg <- ggplot(mtcars, aes(mpg, disp,color=factor(carb),size=hp))
gg <- gg + geom_point(alpha=0.7) + labs(title="Bubble Plot")
gg <- gg +scale_size_continuous(range = c(3,8))
gg <- gg + theme_darkside()
gg
If you want to change the theme for an entire session you can use theme_set()
as in theme_set(theme_gray())
to switch to default ggplot2 theme for all subsequent plots. Otherwise, you may also apply themes without changing the defaults as of plot + theme_pub()
.
To modify general aspects of the theme_pub()
as fontsize, font family, color etc:
require(SciencesPo)
# "Verdana", "Tahoma", "Gill Sans" "serif" and "sans" are also high-readability fonts
theme_set(theme_pub(base_size=18, base_family = "serif"))
Modify it with theme()
. You might modify text sizes, colors etc; then add to the plot.
prefs <- theme(axis.text = element_text(size=18,
family = "Gill Sans",
colour="red"))
Alternatively, we can use a set of built in function to quick plot customization:
Previewplot() +
theme_538() +
align_title_right()
Previewplot() +
theme_538() +
no_y_gridlines()
Previewplot() +
theme_538() +
no_y_gridlines() +
no_x_gridlines()
The function Footnote()
can be handy for adding extra text information, such as data source, equation, or credits as footnotes to a ggplot2 or gtable objects. The next chunck demonstrates its usage:
gg <- ggplot(Presidents, aes(x=height_ratio, y=vote_support))
gg <- gg + geom_smooth(method=lm, colour="red", fill="gold")
gg <- gg + geom_point(size = 5, alpha = .7)
gg <- gg + xlim(0.9,1.2) + ylim(.40, .70)
gg <- gg + labs(x="Winner/Opponent height ratio", y="Winner vote share", title="Does Height Matter in Presidential Elections?")
gg <- gg + theme_pub()
# Commence adding layers here
Render(gg) + Footnote(note="danielmarcelino.github.io", color="orange")
with(mtcars, Scatterplot(x = wt, y = mpg,
main = "Vehicle Weight-Gas Mileage Relationship",
xlab = "Vehicle Weight",
ylab = "Miles per Gallon",
font.family = "serif") )
The following scales help passing manual values to scale colors for ggplot2 objects. In many ways, I think these scales are consistent with the themes whitin the SciencesPo package described earlier.
scale_color_pub()
scale_fill_pub()
scale_shape_pub()
scale_color_party()
scale_fill_party()
library("scales", quietly = TRUE)
show_col(pub_pal("pub12")(12))
show_col(pub_pal("gray5")(6), labels = FALSE)
show_col(pub_pal("manyeyes")(20))
show_col(pub_pal("tableau20")(20))
show_col(pub_pal("tableau10")(10))
show_col(pub_pal("tableau10light")(10))
show_col(pub_pal("cyclic")(20))
party_pal("BRA", plot=TRUE)
show_col(pub_pal("trafficlight")(9))
show_col(pub_pal("bivariate1")(9))
show_col(pub_pal("bivariate2")(9))
show_col(pub_pal("bivariate3")(9))
show_col(pub_pal("bivariate4")(9))
setting value
version R version 3.3.0 (2016-05-03)
system x86_64, darwin14.5.0
ui X11
language (EN)
collate C
tz America/Sao_Paulo
date 2016-08-04
package * version date source
R6 2.1.2 2016-01-26 CRAN (R 3.3.0)
Rcpp 0.12.6 2016-07-19 CRAN (R 3.3.0)
SciencesPo * 1.4.1 2013-08-27 CRAN
assertthat 0.1 2013-12-06 cran (@0.1)
chron 2.3-47 2015-06-24 cran (@2.3-47)
codetools 0.2-14 2015-07-15 CRAN (R 3.3.0)
colorspace 1.2-6 2015-03-11 cran (@1.2-6)
crayon 1.3.2 2016-06-28 CRAN (R 3.3.0)
data.table 1.9.6 2015-09-19 cran (@1.9.6)
devtools * 1.12.0 2016-06-24 CRAN (R 3.3.0)
digest 0.6.9 2016-01-08 CRAN (R 3.3.0)
evaluate 0.9 2016-04-29 CRAN (R 3.3.0)
formatR 1.4 2016-05-09 CRAN (R 3.3.0)
ggplot2 * 2.1.0.9000 2016-08-04 Github (hadley/ggplot2@91859da)
gtable 0.2.0 2016-02-26 cran (@0.2.0)
htmltools 0.3.5 2016-03-21 CRAN (R 3.3.0)
httpuv 1.3.3 2015-08-04 CRAN (R 3.3.0)
knitr * 1.13 2016-05-09 CRAN (R 3.3.0)
lubridate 1.5.6 2016-04-06 cran (@1.5.6)
magrittr 1.5 2014-11-22 CRAN (R 3.3.0)
memoise 1.0.0 2016-01-29 CRAN (R 3.3.0)
mime 0.5 2016-07-07 CRAN (R 3.3.0)
munsell 0.4.3 2016-02-13 cran (@0.4.3)
pander 0.6.0 2015-11-23 cran (@0.6.0)
plyr 1.8.4.9000 2016-07-27 Github (hadley/plyr@fe19241)
rmarkdown 0.9.6.14 2016-07-27 Github (rstudio/rmarkdown@cb74ee5)
roxygen2 5.0.1 2015-11-11 CRAN (R 3.3.0)
rstudioapi 0.6 2016-06-27 CRAN (R 3.3.0)
scales * 0.4.0 2016-02-26 cran (@0.4.0)
shiny 0.13.2 2016-03-28 CRAN (R 3.3.0)
stringi 1.1.1 2016-05-27 CRAN (R 3.3.0)
stringr 1.0.0 2015-04-30 CRAN (R 3.3.0)
testthat 1.0.2 2016-04-23 CRAN (R 3.3.0)
tibble 1.1 2016-07-04 cran (@1.1)
withr 1.0.2 2016-06-20 CRAN (R 3.3.0)
xtable 1.8-2 2016-02-05 CRAN (R 3.3.0)
yaml 2.1.13 2014-06-12 CRAN (R 3.3.0)
Arnold, Jeffrey B. 2016. Ggthemes: Extra Themes, Scales and Geoms for ’Ggplot2’. https://CRAN.R-project.org/package=ggthemes.
Lüdecke, Daniel. 2016. SjPlot: Data Visualization for Statistics in Social Science. http://CRAN.R-project.org/package=sjPlot.
Meyer, David, Evgenia Dimitriadou, Kurt Hornik, Andreas Weingessel, and Friedrich Leisch. 2015. E1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. https://CRAN.R-project.org/package=e1071.
R Core Team. 2016. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Rudis, Bob. 2016. Ggalt: Extra Coordinate Systems, Geoms and Statistical Transformations for ’Ggplot2’. https://CRAN.R-project.org/package=ggalt.
Schloerke, Barret, Jason Crowley, Di Cook, Francois Briatte, Moritz Marbach, Edwin Thoen, and Amos Elberg. 2016. GGally: Extension to Ggplot2. https://CRAN.R-project.org/package=GGally.
Wickham, Hadley. 2009. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.
Wilke, Claus O. 2016. Cowplot: Streamlined Plot Theme and Plot Annotations for ’Ggplot2’. https://CRAN.R-project.org/package=cowplot.
Xie, Yihui. 2016. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://CRAN.R-project.org/package=knitr.