Furniture

2016-10-20

Using furniture

We will first make a fictitious data set:

df <- data.frame(a = rnorm(100, 1.5, 2), 
                 b = seq(1, 100, 1), 
                 c = c(rep("control", 40), rep("Other", 7), rep("treatment", 50), rep("None", 3)),
                 d = c(sample(1:1000, 90, replace=TRUE), rep(-99, 10)))

There are two functions that we’ll demonstrate here:

  1. washer
  2. table1

Washer

washer is a function for quick data cleaning. It is useful when a variable contains placeholder values, when a factor has extra levels, or when several values need to be changed to another.

library(furniture)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
df <- df %>%
  mutate(d = washer(d, -99),  ## changes the placeholder -99 to NA
         c = washer(c, "Other", "None", value = "control")) ## changes "Other" and "None" to "control"
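
To make the replacement behaviour concrete, here is a minimal base-R sketch of the idea. The helper wash_sketch is hypothetical and is not the package's implementation; it only assumes that the listed values are swapped for NA by default, or for value when one is supplied.

```r
# Hypothetical helper, not part of furniture: a base-R approximation of the
# replacement behaviour described above.
wash_sketch <- function(x, ..., value = NA) {
  targets <- c(...)                      # values to be replaced
  was_factor <- is.factor(x)
  if (was_factor) x <- as.character(x)   # work on labels, not integer codes
  x[x %in% targets] <- value             # swap every matching entry
  if (was_factor) x <- factor(x)         # rebuild so unused levels are dropped
  x
}

# A placeholder value becomes NA
d <- c(12, -99, 47, -99)
wash_sketch(d, -99)

# Extra factor levels are folded into an existing level
grp <- factor(c("control", "Other", "treatment", "None"))
levels(wash_sketch(grp, "Other", "None", value = "control"))
```

Converting the factor to character before replacing and rebuilding it afterwards is one way to make sure the old levels do not linger as empty categories.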

Table1

Now that the data is “washed” we can start exploring and reporting.

table1(df, a, b, factor(c), d)
## 
## |==================================
##               Mean/Count (SD/%)
##  Observations 100              
##  a                             
##               1.58 (1.75)      
##  b                             
##               50.5 (29.01)     
##  factor(c)                     
##     control   50 (50%)         
##     treatment 50 (50%)         
##  d                             
##               460 (300.01)     
## |==================================

Each variable must be numeric or a factor. Because table1 uses non-standard evaluation, variables can be transformed inside the call (e.g., factor(c)). The same mechanism lets you create a whole new variable inside the call:

table1(df, a, b, d, ifelse(a > 1, 1, 0))
## 
## |=========================================
##                      Mean/Count (SD/%)
##  Observations        100              
##  a                                    
##                      1.58 (1.75)      
##  b                                    
##                      50.5 (29.01)     
##  d                                    
##                      460 (300.01)     
##  ifelse(a > 1, 1, 0)                  
##                      0.56 (0.5)       
## |=========================================
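
For readers curious about how non-standard evaluation lets an expression such as ifelse(a > 1, 1, 0) be used as a "variable", the following base-R sketch shows one common pattern. The function summarise_sketch is hypothetical and is not how table1 is necessarily written; it only illustrates capturing unevaluated arguments and evaluating them inside the data frame.

```r
# Hypothetical helper, not part of furniture: capture the arguments as
# unevaluated expressions, then evaluate each one with the data frame's
# columns in scope.
summarise_sketch <- function(.data, ...) {
  caller <- parent.frame()                      # where to look up non-column names
  exprs  <- as.list(substitute(list(...)))[-1]  # drop the leading `list` symbol
  labels <- vapply(exprs,
                   function(e) paste(deparse(e), collapse = ""),
                   character(1))                # the text of each expression
  means  <- vapply(exprs,
                   function(e) mean(eval(e, .data, caller), na.rm = TRUE),
                   numeric(1))
  setNames(means, labels)
}

toy <- data.frame(a = c(0.5, 1.5, 2.5, 3.5))
summarise_sketch(toy, a, ifelse(a > 1, 1, 0))
```

Because the expression text is kept alongside the value, the row label can show exactly what was typed, which matches the ifelse(a > 1, 1, 0) label in the table above.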

This is just the beginning, though. Two more powerful features of the function are shown below:

table1(df, a, b, d, ifelse(a > 1, 1, 0),
       splitby=~factor(c), 
       test=TRUE)
## 
## |===============================================================
##                      control        treatment       P-Value
##  Observations        50             50                     
##  a                                                  0.722  
##                      1.65 (1.76)    1.52 (1.76)            
##  b                                                  <.001  
##                      28.5 (22.37)   72.5 (14.58)           
##  d                                                  0.005  
##                      542.64 (321.5) 369.67 (248.04)        
##  ifelse(a > 1, 1, 0)                                0.691  
##                      0.58 (0.5)     0.54 (0.5)             
## |===============================================================

The splitby = ~factor(c) argument stratifies the means and counts by a factor variable (here, control or treatment). When splitby is set, test = TRUE also computes a test of significance for each variable.
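
As a rough illustration of what a stratified summary with a test involves, the sketch below computes group means, standard deviations, and a p-value by hand in base R. The toy data and column names are made up for illustration, and the two-sample t-test is an assumption standing in for whatever test table1 actually applies to a numeric variable.

```r
# Illustrative data only; score and arm are hypothetical names
toy <- data.frame(
  score = c(1, 2, 3, 4, 5, 3, 4, 5, 6, 7),
  arm   = factor(rep(c("control", "treatment"), each = 5))
)

# Mean and SD of score within each level of arm
group_mean <- tapply(toy$score, toy$arm, mean, na.rm = TRUE)
group_sd   <- tapply(toy$score, toy$arm, sd, na.rm = TRUE)

# Assumption: a Welch two-sample t-test as a stand-in for the reported test
p_value <- t.test(score ~ arm, data = toy)$p.value

# Assemble a small summary similar in spirit to the stratified table
data.frame(arm  = names(group_mean),
           mean = as.numeric(group_mean),
           sd   = round(as.numeric(group_sd), 2))
p_value
```

Doing this for every variable and formatting the result is the repetitive work that splitby and test take care of in a single call.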

Finally, you can polish the table with a few other options, such as custom variable names and group labels:

table1(df, a, b, d, ifelse(a > 1, 1, 0),
       splitby=~factor(c), 
       test=TRUE,
       var_names = c("A", "B", "D", "New Var"),
       splitby_labels = c("Control", "Treatment"))
## 
## |========================================================
##               Control        Treatment       P-Value
##  Observations 50             50                     
##  A                                           0.722  
##               1.65 (1.76)    1.52 (1.76)            
##  B                                           <.001  
##               28.5 (22.37)   72.5 (14.58)           
##  D                                           0.005  
##               542.64 (321.5) 369.67 (248.04)        
##  New Var                                     0.691  
##               0.58 (0.5)     0.54 (0.5)             
## |========================================================

This can also be output as a LaTeX table:

table1(df, a, b, d, ifelse(a > 1, 1, 0),
       splitby=~factor(c), 
       test=TRUE,
       var_names = c("A", "B", "D", "New Var"),
       splitby_labels = c("Control", "Treatment"),
       output_type = "latex")

Conclusion

Both table1 and washer add simplicity to cleaning up and understanding your data. Use these pieces of furniture to make your quantitative life a bit easier.