The explore package simplifies Exploratory Data Analysis (EDA). Get faster insights with less code!
explore package on GitHub: https://github.com/rolkra/explore
The explore functions fit well into the tidyverse, so we load the dplyr package as well.
Explore your data set (in this case the iris data set) in one line of code:
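A minimal sketch of the setup and the one-line call, assuming both packages are installed. Calling explore() on a data frame without naming a variable opens the interactive Shiny app, so the call is guarded here to run only in an interactive session.

```r
library(dplyr)    # provides %>% as well as select() and filter()
library(explore)

# explore() on a whole data set opens the Shiny app,
# which only makes sense in an interactive R session
if (interactive()) {
  explore(iris)
}
```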
A Shiny app is launched in which you can inspect individual variables, explore their relation to a target (binary / categorical / numerical), grow a decision tree or create a fully automated report of all variables with a few “mouse clicks”.
You can choose any variable as a target, as long as it is binary (0/1, FALSE/TRUE or “no”/“yes”), categorical or numeric.
Create a rich HTML report of all variables with one line of code:
Or you can simply add a target and create the report. In this case we use a binary target, but a categorical or numerical target would work as well.
# report of all variables and their relationship with a binary target
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris %>%
  report(output_file = "report.html",
         output_dir = tempdir(),
         target = is_versicolor)
If you use a binary target, the parameter split = FALSE (or targetpct = TRUE) will give you a different view on the data.
Grow a decision tree with one line of code:
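A minimal sketch of the one-line call, assuming a categorical target (Species). It mirrors the explain_tree() call with a binary target shown further down and relies on the packages that explain_tree() uses for fitting and plotting being installed.

```r
library(dplyr)
library(explore)

# Species is a categorical target with three levels
iris %>% explain_tree(target = Species)
```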
You can grow a decision tree with a binary target too.
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris %>% select(-Species) %>% explain_tree(target = is_versicolor)
Or using a numerical target. The syntax stays the same.
You can control the growth of the tree using the parameters maxdepth, minsplit and cp.
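As a sketch, the call below uses a numerical target and passes all three growth parameters. The values are illustrative assumptions chosen for this example, not recommended settings.

```r
library(dplyr)
library(explore)

# numerical target; the parameter values are illustrative only
iris %>%
  explain_tree(target = Sepal.Length,
               maxdepth = 2,    # limit the depth of the tree
               minsplit = 30,   # minimum observations in a node before a split is tried
               cp = 0.01)       # complexity parameter
```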
To create other types of models use explain_forest(), explain_xgboost() and explain_logreg().
Explore your table with one line of code to see which type of variables it contains.
You can also use describe_tbl() if you just need the main facts without visualization.
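Both calls can be sketched as follows, assuming the packages are loaded as above: explore_tbl() draws an overview of the variable types, while describe_tbl() reports the main facts as text.

```r
library(dplyr)
library(explore)

# visual overview of the table
iris %>% explore_tbl()

# main facts only, without a plot
iris %>% describe_tbl()
```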
Explore a variable with one line of code. You don’t have to care if a variable is numerical or categorical.
Explore a variable and its relationship with a binary target with one line of code. You don’t have to care if a variable is numerical or categorical.
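A short sketch of these calls, assuming the packages are loaded and using the same binary helper variable is_versicolor that the report example defines. The same explore() call is used for a categorical and a numerical variable.

```r
library(dplyr)
library(explore)

# binary target: 1 for versicolor, 0 otherwise
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)

# single variable, categorical or numerical
iris %>% explore(Species)
iris %>% explore(Sepal.Length)

# single variable and its relationship with the binary target
iris %>% explore(Sepal.Length, target = is_versicolor)
```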
Using split = FALSE changes the plot to show the percentage of the target (%target):
The target can have more than two levels:
Or the target can even be numeric:
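The three variations above can be sketched as follows, assuming the packages are loaded. The choice of Sepal.Length as the explored variable and Petal.Length as the numeric target is an illustrative assumption.

```r
library(dplyr)
library(explore)

iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)

# binary target, shown as %target instead of a split view
iris %>% explore(Sepal.Length, target = is_versicolor, split = FALSE)

# target with more than two levels
iris %>% explore(Sepal.Length, target = Species)

# numeric target
iris %>% explore(Sepal.Length, target = Petal.Length)
```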
iris %>%
  select(Sepal.Length, Sepal.Width, is_versicolor) %>%
  explore_all(target = is_versicolor, split = FALSE)
To use explore_all() with a large number of variables in an R Markdown file, you need to set a meaningful fig.width and fig.height in the chunk options. The function total_fig_height() helps to set fig.height automatically:
fig.height=total_fig_height(iris)
If you use a target:
fig.height=total_fig_height(iris, var_name_target = "Species")
You can control total_fig_height() with the parameters ncols (number of columns of plots) and size (height of one plot).
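To see what the function returns before using it in a chunk option, it can be called directly. This is a sketch; the ncols and size values are illustrative assumptions.

```r
library(explore)

# total height for plotting all variables of iris with the default layout
h_default <- total_fig_height(iris)

# with a target, three plots per row and a smaller height per plot
h_custom <- total_fig_height(iris,
                             var_name_target = "Species",
                             ncols = 3,
                             size = 2)

h_default
h_custom
```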
Explore correlation between two variables with one line of code:
You can add a target too:
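A minimal sketch of both calls, assuming the packages are loaded. The pair Sepal.Length / Petal.Length is an illustrative choice.

```r
library(dplyr)
library(explore)

iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)

# relationship between two variables
iris %>% explore(Sepal.Length, Petal.Length)

# the same relationship, additionally split by a target
iris %>% explore(Sepal.Length, Petal.Length, target = is_versicolor)
```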
If you use explore to explore a variable and want to set lower and upper limits for values, you can use the min_val and max_val parameters. All values below min_val will be set to min_val. All values above max_val will be set to max_val.
explore uses auto-scale by default. To deactivate it, use the parameter auto_scale = FALSE.
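The capping rule can be illustrated in base R before passing the limits to explore(). This is a sketch of the described behaviour, not the package's internal code, and the limits 4.5 and 7 are illustrative.

```r
library(dplyr)
library(explore)

# base-R illustration: values below 4.5 become 4.5, values above 7 become 7
x <- iris$Sepal.Length
capped <- pmin(pmax(x, 4.5), 7)
range(capped)

# the same limits passed to explore()
iris %>% explore(Sepal.Length, min_val = 4.5, max_val = 7)

# switch off automatic scaling
iris %>% explore(Sepal.Length, auto_scale = FALSE)
```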
Describe your data in one line of code:
iris %>% describe()
#> # A tibble: 5 x 8
#> variable type na na_pct unique min mean max
#> <chr> <chr> <int> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 Sepal.Length dbl 0 0 35 4.3 5.84 7.9
#> 2 Sepal.Width dbl 0 0 23 2 3.06 4.4
#> 3 Petal.Length dbl 0 0 43 1 3.76 6.9
#> 4 Petal.Width dbl 0 0 22 0.1 1.2 2.5
#> 5 Species fct 0 0 3 NA NA NA
The result is a data frame in which each row is a variable of your data. You can use filter from dplyr for quick checks:
# show all variables that contain fewer than 5 unique values
iris %>% describe() %>% filter(unique < 5)
#> # A tibble: 1 x 8
#> variable type na na_pct unique min mean max
#> <chr> <chr> <int> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 Species fct 0 0 3 NA NA NA
# show all variables that contain NA values
iris %>% describe() %>% filter(na > 0)
#> # A tibble: 0 x 8
#> # i 8 variables: variable <chr>, type <chr>, na <int>, na_pct <dbl>,
#> # unique <int>, min <dbl>, mean <dbl>, max <dbl>
You can use describe for describing single variables too. It does not matter whether a variable is numerical or categorical. The output is text.
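A minimal sketch, assuming the packages are loaded: the same describe() call is used for a categorical and a numerical variable.

```r
library(dplyr)
library(explore)

# categorical variable
iris %>% describe(Species)

# numerical variable
iris %>% describe(Sepal.Length)
```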
Use one of the prepared datasets to explore:
use_data_beer()
use_data_diamonds()
use_data_iris()
use_data_mpg()
use_data_mtcars()
use_data_penguins()
use_data_starwars()
use_data_titanic()
use_data_wordle()
use_data_beer() %>% describe()
#> # A tibble: 11 x 8
#> variable type na na_pct unique min mean max
#> <chr> <chr> <int> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 name chr 0 0 161 NA NA NA
#> 2 brand chr 0 0 29 NA NA NA
#> 3 country chr 0 0 3 NA NA NA
#> 4 year dbl 0 0 1 2023 2023 2023
#> 5 type chr 0 0 3 NA NA NA
#> 6 color_dark dbl 0 0 2 0 0.09 1
#> 7 alcohol_vol_pct dbl 2 1.2 35 0 4.32 8.4
#> 8 original_wort dbl 5 3.1 54 5.1 11.3 18.3
#> 9 energy_kcal_100ml dbl 11 6.8 34 20 39.9 62
#> 10 carb_g_100ml dbl 16 9.9 44 1.5 3.53 6.7
#> 11 sugar_g_100ml dbl 16 9.9 26 0 0.72 4.6
Or create an artificial dataset to explore:
create_data_abtest()
create_data_app()
create_data_buy()
create_data_churn()
create_data_esoteric()
create_data_newsletter()
create_data_person()
create_data_unfair()
create_data_random()
# create dataset and describe it
data <- create_data_app(obs = 100)
describe(data)
#> # A tibble: 7 x 8
#> variable type na na_pct unique min mean max
#> <chr> <chr> <int> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 os chr 0 0 3 NA NA NA
#> 2 free int 0 0 2 0 0.62 1
#> 3 downloads int 0 0 99 255 6704. 18386
#> 4 rating dbl 0 0 5 1 3.44 5
#> 5 type chr 0 0 10 NA NA NA
#> 6 updates dbl 0 0 72 1 45.6 99
#> 7 screen_sizes dbl 0 0 5 1 2.61 5
# create dataset and describe it
data <- create_data_random(obs = 100, vars = 5)
describe(data)
#> # A tibble: 7 x 8
#> variable type na na_pct unique min mean max
#> <chr> <chr> <int> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 id int 0 0 100 1 50.5 100
#> 2 target_ind int 0 0 2 0 0.53 1
#> 3 var_1 int 0 0 61 1 51.4 99
#> 4 var_2 int 0 0 63 1 48.6 98
#> 5 var_3 int 0 0 62 1 49.2 100
#> 6 var_4 int 0 0 68 0 48.6 100
#> 7 var_5 int 0 0 64 2 51.9 99
You can build your own random dataset by using the create_data_empty() and add_var_random_*() functions:
# create dataset and describe it
data <- create_data_empty(obs = 1000) %>%
  add_var_random_01("target") %>%
  add_var_random_dbl("age", min_val = 18, max_val = 80) %>%
  add_var_random_cat("gender",
                     cat = c("male", "female", "other"),
                     prob = c(0.4, 0.4, 0.2)) %>%
  add_var_random_starsign() %>%
  add_var_random_moon()
describe(data)
#> # A tibble: 5 x 8
#> variable type na na_pct unique min mean max
#> <chr> <chr> <int> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 target int 0 0 2 0 0.51 1
#> 2 age dbl 0 0 1000 18.2 49.1 80.0
#> 3 gender chr 0 0 3 NA NA NA
#> 4 random_starsign chr 0 0 12 NA NA NA
#> 5 random_moon chr 0 0 4 NA NA NA
To clean a variable you can use clean_var. With one line of code you can rename a variable, replace NA values and set a minimum and maximum for the values.
iris %>%
  clean_var(Sepal.Length,
            min_val = 4.5,
            max_val = 7.0,
            na = 5.8,
            name = "sepal_length") %>%
  describe()
#> # A tibble: 5 x 8
#> variable type na na_pct unique min mean max
#> <chr> <chr> <int> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 sepal_length dbl 0 0 26 4.5 5.81 7
#> 2 Sepal.Width dbl 0 0 23 2 3.06 4.4
#> 3 Petal.Length dbl 0 0 43 1 3.76 6.9
#> 4 Petal.Width dbl 0 0 22 0.1 1.2 2.5
#> 5 Species fct 0 0 3 NA NA NA
To drop variables or observations you can use the drop_var_*() and drop_obs_*() functions.
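As a sketch, two members of these function families are shown below. It assumes that the installed version of explore provides drop_var_not_numeric() and drop_obs_with_na(); check the package reference for the full list available to you.

```r
library(dplyr)
library(explore)

# keep only the numeric variables (assumes drop_var_not_numeric() is available)
numeric_only <- iris %>% drop_var_not_numeric()

# drop all observations that contain NA values (assumes drop_obs_with_na() is available)
complete_obs <- iris %>% drop_obs_with_na()

names(numeric_only)
nrow(complete_obs)
```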
Create an RMarkdown template to explore your own data. Set output_dir to the folder where the template should be written; an existing file may be overwritten.
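A minimal sketch, assuming the template function is create_notebook_explore() with the arguments output_dir and output_file. The file name is an illustrative choice, and the template is written to a temporary directory so that no existing file is overwritten.

```r
library(explore)

# write the template to a temporary directory
create_notebook_explore(
  output_dir = tempdir(),
  output_file = "notebook-explore.Rmd")

# path of the generated template
file.path(tempdir(), "notebook-explore.Rmd")
```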