sum_up prints detailed summary statistics (corresponds to Stata summarize)
library(dplyr)
library(statar)
N <- 100
df <- tibble(
id = 1:N,
v1 = sample(5, N, TRUE),
v2 = sample(1e6, N, TRUE)
)
sum_up(df)
df %>% sum_up(starts_with("v"), d = TRUE)
df %>% group_by(v1) %>% sum_up()
tab prints distinct rows with their count
N <- 1e2
df <- tibble(
id = sample(5, N, TRUE),
v1 = sample(5, N, TRUE)
)
tab(df, id, v1)
tab(df, id, v1, na.rm = TRUE)
df %>% group_by(id) %>% tab(v1)
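For readers more familiar with dplyr, the counts that tab reports can be approximated with dplyr::count. This is only a sketch of the idea, not how statar implements tab, and the toy data frame is made up for illustration.

```r
library(dplyr)

# Hypothetical toy data, not taken from the statar documentation
toy <- tibble(
  id = c(1, 1, 2, 2, 2),
  v1 = c("a", "a", "a", "b", "b")
)

# One row per distinct (id, v1) combination, with its frequency in column n
counts <- toy %>% count(id, v1)
print(counts)

# Grouped variant: count v1 within each id
counts_by_id <- toy %>% group_by(id) %>% count(v1) %>% ungroup()
```

Both calls return three rows here, one for each combination of id and v1 that occurs in the data.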
join is a wrapper for dplyr merge functionalities.
The option “kind” specifies the kind of join, using SQL terminology. Possible kinds are: left, right, inner, full, semi, anti and cross.
Stata | statar |
---|---|
merge v1 | join(x, y, kind = "full") |
merge v1, keep(master matched) | join(x, y, kind = "left") |
merge v1, keep(matched using) | join(x, y, kind = "right") |
merge v1, keep(matched) | join(x, y, kind = "inner") |
merge v1, keep(matched) keepusing(v1) | join(x, y, kind = "semi") |
merge v1, keep(master) keepusing(v1) | join(x, y, kind = "anti") |
crossby | join(x, y, kind = "cross") |
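As a rough guide to what most of these kinds return, the sketch below calls the corresponding dplyr verbs directly on two made-up tables. The tables x and y and their columns are assumptions for illustration, and join itself may differ in details such as column ordering. The cross kind is left out of the sketch.

```r
library(dplyr)

# Hypothetical master (x) and using (y) tables sharing the key v1
x <- tibble(v1 = c(1, 2, 3), a = c("x1", "x2", "x3"))
y <- tibble(v1 = c(2, 3, 4), b = c("y2", "y3", "y4"))

full  <- full_join(x, y, by = "v1")   # keys from either table: 1, 2, 3, 4
left  <- left_join(x, y, by = "v1")   # keys from x: 1, 2, 3
right <- right_join(x, y, by = "v1")  # keys from y: 2, 3, 4
inner <- inner_join(x, y, by = "v1")  # keys present in both: 2, 3
semi  <- semi_join(x, y, by = "v1")   # rows of x with a match, columns of x only
anti  <- anti_join(x, y, by = "v1")   # rows of x without a match: key 1
```

The semi and anti joins only filter x, which is why the Stata equivalents in the table combine keep() with keepusing(v1).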
The option “check” checks there are no duplicates in the master or using data.tables (as in Stata).
# merge m:1 v1
join(x, y, kind = "full", check = m~1)
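The idea behind check = m~1 is that the key must identify rows uniquely in the using table, while the master table may repeat it. A minimal base R sketch of such a test follows. The helper name is_unique_key is hypothetical and is not part of statar.

```r
# Hypothetical helper: TRUE when the key columns identify each row uniquely
is_unique_key <- function(data, keys) {
  anyDuplicated(data[, keys, drop = FALSE]) == 0
}

x <- data.frame(v1 = c(1, 1, 2), a = 1:3)        # several rows per key (the "m" side)
y <- data.frame(v1 = c(1, 2), b = c("p", "q"))   # one row per key (the "1" side)

is_unique_key(x, "v1")  # FALSE: v1 = 1 appears twice
is_unique_key(y, "v1")  # TRUE

# An m:1 merge is only valid when the using table passes the test
stopifnot(is_unique_key(y, "v1"))
```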
The option “gen” specifies the name of a new variable that identifies matched and unmatched rows (as in Stata).
# merge m:1 v1, gen(_merge)
join(x, y, kind = "full", gen = "_merge")
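To make the role of such an indicator concrete, here is a sketch that builds a similar variable by hand with dplyr. The column name merge_status, its labels, and the toy tables are assumptions. The values statar actually stores in the generated variable may be coded differently.

```r
library(dplyr)

# Hypothetical master (x) and using (y) tables
x <- tibble(v1 = c(1, 2, 3), a = c("x1", "x2", "x3"))
y <- tibble(v1 = c(2, 3, 4), b = c("y2", "y3", "y4"))

# Flag the origin of each row before joining, then classify after the join
merged <- full_join(
  mutate(x, in_x = TRUE),
  mutate(y, in_y = TRUE),
  by = "v1"
) %>%
  mutate(
    merge_status = case_when(
      !is.na(in_x) & !is.na(in_y) ~ "matched",
      !is.na(in_x)                ~ "master only",
      TRUE                        ~ "using only"
    )
  ) %>%
  select(-in_x, -in_y)

print(merged)
```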
The option “update” fills missing values in the master dataset with the corresponding values from the using dataset.
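One way to picture this behaviour is a left join followed by dplyr::coalesce, which keeps the master value when it is present and falls back to the using value otherwise. This is a sketch under that assumption, with made-up tables and column names, and not necessarily how join implements the option.

```r
library(dplyr)

# Hypothetical master and using tables with an overlapping column "price"
master <- tibble(v1 = c(1, 2, 3), price = c(10, NA, 30))
using  <- tibble(v1 = c(1, 2, 3), price = c(11, 22, NA))

updated <- master %>%
  # Rename the using column so both versions survive the join
  left_join(rename(using, price_using = price), by = "v1") %>%
  # Keep the master value; use the using value only where the master is missing
  mutate(price = coalesce(price, price_using)) %>%
  select(-price_using)

print(updated)
```

Only the missing price for v1 = 2 is filled in. The non-missing master values are left untouched even where the using table disagrees.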
graph is a wrapper for ggplot2 functionalities, useful for interactive exploration of datasets
library(data.table)
N <- 10000
DT <- data.table(
id = sample(c("id1","id2","id3"), N, TRUE),
v1 = sample(c(1:5), N, TRUE),
v2 = rnorm(N, sd = 20),
v3 = sample(runif(100, max=100), N, TRUE)
)
DT[, v4 := (id=="id1")* v2 + rnorm(N, sd = 5)]
graph(DT)
graph(DT, by = id)
graph(DT, by = id, type = "boxplot")
graph(DT, v3, v4, along_with = v2)
You can also plot the regression of one variable on another after partialling out the controls specified in the formula:
graph(DT, v3, along_with = v2, by = id, type = "felm", formula = ~v4|v1)
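Partialling out can be sketched in base R through the Frisch-Waugh-Lovell result: residualize both variables on the controls, then relate the residuals. The sketch below uses lm in place of felm and simulated data, so the variable names and the data-generating process are assumptions rather than a description of what graph does internally.

```r
set.seed(1)
n  <- 500
v1 <- sample(1:5, n, TRUE)   # treated as a categorical control (fixed effect)
v4 <- rnorm(n)               # continuous control
v2 <- 0.5 * v4 + rnorm(n)
v3 <- 2 * v2 - v4 + v1 + rnorm(n)

# Residualize the outcome and the regressor on the controls
r_v3 <- resid(lm(v3 ~ v4 + factor(v1)))
r_v2 <- resid(lm(v2 ~ v4 + factor(v1)))

# The slope on the residuals equals the slope of v2 in the full regression
slope_partial <- coef(lm(r_v3 ~ r_v2))[["r_v2"]]
slope_full    <- coef(lm(v3 ~ v2 + v4 + factor(v1)))[["v2"]]

# Scatter of the residuals with the fitted line
plot(r_v2, r_v3)
abline(lm(r_v3 ~ r_v2))
```

The two slopes agree up to numerical precision, which is why a scatter plot of the residuals gives a faithful picture of the controlled relationship.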
The functions select variables with a syntax similar to dplyr's (see the dplyr vignette for more details).
# NSE version
sum_up(DT, list(v2, v3), by = list(id,v1))
# SE version
sum_up_(DT, c("v2","v3"), by = c("id","v1"))