furniture
We will first make a fictitious data set:
df <- data.frame(a = rnorm(100, 1.5, 2),
                 b = seq(1, 100, 1),
                 c = c(rep("control", 40), rep("Other", 7), rep("treatment", 50), rep("None", 3)),
                 d = c(sample(1:1000, 90, replace = TRUE), rep(-99, 10)))
There are two functions that we’ll demonstrate here:
washer
table1
washer is a great function for quick data cleaning. It is useful in situations where there are placeholders, extra levels in a factor, or several values that need to be changed to another.
library(furniture)  ## provides washer() and table1()
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df <- df %>%
  mutate(d = washer(d, -99),  ## changes the placeholder -99 to NA
         c = washer(c, "Other", "None", value = "control"))  ## changes "Other" and "None" to "control"
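For comparison, the same two replacements can be written by hand in base R. This is only a sketch of the effect on a small toy data frame (the values are made up for illustration), not a description of how washer is implemented.

```r
# Toy data with the same kinds of placeholders as above (illustrative values).
toy <- data.frame(
  c = c("control", "Other", "treatment", "None"),
  d = c(12, -99, 7, -99),
  stringsAsFactors = FALSE
)

# Replace the -99 placeholder with NA.
toy$d[toy$d == -99] <- NA

# Collapse the stray values "Other" and "None" into "control".
toy$c[toy$c %in% c("Other", "None")] <- "control"

print(toy)
```

The washer call does both steps inside a single mutate, which keeps the cleaning next to the rest of the pipeline.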
Now that the data is “washed”, we can start exploring and reporting.
table1(df, a, b, factor(c), d)
##
## |==================================
## Mean/Count (SD/%)
## Observations 100
## a
## 1.58 (1.75)
## b
## 50.5 (29.01)
## factor(c)
## control 50 (50%)
## treatment 50 (50%)
## d
## 460 (300.01)
## |==================================
The variables must be numeric or factor. Because table1 uses non-standard evaluation, we can transform variables inside the call (e.g., factor(c)). This extends to creating a whole new variable inside the call as well.
table1(df, a, b, d, ifelse(a > 1, 1, 0))  ## the last argument creates a 0/1 indicator on the fly
##
## |=========================================
## Mean/Count (SD/%)
## Observations 100
## a
## 1.58 (1.75)
## b
## 50.5 (29.01)
## d
## 460 (300.01)
## ifelse(a > 1, 1, 0)
## 0.56 (0.5)
## |=========================================
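To see why an expression such as ifelse(a > 1, 1, 0) can be passed directly, here is a minimal base R sketch of non-standard evaluation. The helper mean_of is hypothetical and is not part of furniture; it only illustrates the general idea of capturing an expression and evaluating it against the columns of a data frame.

```r
# Hypothetical helper (not from furniture): capture the unevaluated
# expression, then evaluate it with the data frame's columns in scope.
mean_of <- function(data, expr) {
  captured <- substitute(expr)  # the expression exactly as the caller wrote it
  values <- eval(captured, data, parent.frame())
  list(label = deparse(captured), mean = mean(values, na.rm = TRUE))
}

toy <- data.frame(a = c(0.5, 1.5, 2.5, -1))
res <- mean_of(toy, ifelse(a > 1, 1, 0))

res$label  # the text of the expression, usable as a row label
res$mean   # proportion of rows with a > 1
```

Capturing the expression also explains why the table can label a row with the code that produced it.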
This is just the beginning though. Two powerful things the function can do are shown below:
table1(df, a, b, d, ifelse(a > 1, 1, 0),
       splitby = ~factor(c),
       test = TRUE)
##
## |===============================================================
## control treatment P-Value
## Observations 50 50
## a 0.722
## 1.65 (1.76) 1.52 (1.76)
## b <.001
## 28.5 (22.37) 72.5 (14.58)
## d 0.005
## 542.64 (321.5) 369.67 (248.04)
## ifelse(a > 1, 1, 0) 0.691
## 0.58 (0.5) 0.54 (0.5)
## |===============================================================
The splitby = ~factor(c) argument stratifies the means and counts by a factor variable (in this case either control or treatment). When we use this, we can also automatically compute tests of significance using test = TRUE.
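The stratified summaries can be approximated by hand in base R, which may help when checking a table. The sketch below uses made-up toy values, and the Welch t-test from t.test is an assumption: the text above does not say which test table1 applies, so its p-values may be computed differently.

```r
# Toy data (illustrative values only).
toy <- data.frame(
  a = c(1.0, 2.0, 3.0, 4.0, 6.0, 8.0),
  grp = factor(c("control", "control", "control",
                 "treatment", "treatment", "treatment"))
)

# Mean and SD of a within each level of the grouping factor.
group_means <- tapply(toy$a, toy$grp, mean)
group_sds   <- tapply(toy$a, toy$grp, sd)

# A two-sample comparison of means; table1's own choice of test may differ.
p_value <- t.test(a ~ grp, data = toy)$p.value

print(round(group_means, 2))
print(round(group_sds, 2))
print(p_value)
```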
Finally, you can polish it quite a bit using a few other options. For example, you can do the following:
table1(df, a, b, d, ifelse(a > 1, 1, 0),
       splitby = ~factor(c),
       test = TRUE,
       var_names = c("A", "B", "D", "New Var"),
       splitby_labels = c("Control", "Treatment"))
##
## |========================================================
## Control Treatment P-Value
## Observations 50 50
## A 0.722
## 1.65 (1.76) 1.52 (1.76)
## B <.001
## 28.5 (22.37) 72.5 (14.58)
## D 0.005
## 542.64 (321.5) 369.67 (248.04)
## New Var 0.691
## 0.58 (0.5) 0.54 (0.5)
## |========================================================
This can also be output as a LaTeX table:
table1(df, a, b, d, ifelse(a > 1, 1, 0),
       splitby = ~factor(c),
       test = TRUE,
       var_names = c("A", "B", "D", "New Var"),
       splitby_labels = c("Control", "Treatment"),
       output_type = "latex")
Both table1 and washer add simplicity to cleaning up and understanding your data. Use these pieces of furniture to make your quantitative life a bit easier.