Santoku is a package for cutting data into intervals. It provides
chop(), a replacement for base R’s cut() function, as well as several
convenience functions to cut different kinds of intervals.
To install santoku from CRAN, run:
install.packages("santoku")
Use chop() like cut(), to cut numeric data into intervals between a
set of breaks.
library(santoku)
x <- runif(10, 0, 10)
(chopped <- chop(x, breaks = 0:10))
#> [1] [4, 5) [8, 9) [3, 4) [4, 5) [7, 8) [9, 10] [6, 7) [8, 9) [1, 2)
#> [10] [4, 5)
#> Levels: [1, 2) [3, 4) [4, 5) [6, 7) [7, 8) [8, 9) [9, 10]
data.frame(x, chopped)
#> x chopped
#> 1 4.978 [4, 5)
#> 2 8.970 [8, 9)
#> 3 3.392 [3, 4)
#> 4 4.677 [4, 5)
#> 5 7.057 [7, 8)
#> 6 9.708 [9, 10]
#> 7 6.714 [6, 7)
#> 8 8.377 [8, 9)
#> 9 1.086 [1, 2)
#> 10 4.495 [4, 5)
chop() returns a factor.
If data lies beyond the limits of breaks, the breaks are extended
automatically:
chopped <- chop(x, breaks = 3:7)
data.frame(x, chopped)
#> x chopped
#> 1 4.978 [4, 5)
#> 2 8.970 [7, 9.708]
#> 3 3.392 [3, 4)
#> 4 4.677 [4, 5)
#> 5 7.057 [7, 9.708]
#> 6 9.708 [7, 9.708]
#> 7 6.714 [6, 7)
#> 8 8.377 [7, 9.708]
#> 9 1.086 [1.086, 3)
#> 10 4.495 [4, 5)
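For comparison, base R’s cut() does not extend the breaks: values outside them become NA, and its intervals are right-closed by default. A minimal sketch with three hand-picked values (the vector values is illustrative and not part of the data above):

```r
library(santoku)

# Three illustrative values: below, inside and above the breaks 3:7.
values <- c(1, 4, 9)

# Base R: values outside the breaks become NA; intervals are right-closed.
base_cut <- cut(values, breaks = 3:7)

# santoku: the breaks are extended to cover the whole range of the data.
santoku_chop <- chop(values, breaks = 3:7)

data.frame(values, base_cut, santoku_chop)
```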
To chop a single number into a separate category, put the number
twice in breaks:
x_fives <- x
x_fives[1:5] <- 5
chopped <- chop(x_fives, c(2, 5, 5, 8))
data.frame(x_fives, chopped)
#> x_fives chopped
#> 1 5.000 {5}
#> 2 5.000 {5}
#> 3 5.000 {5}
#> 4 5.000 {5}
#> 5 5.000 {5}
#> 6 9.708 [8, 9.708]
#> 7 6.714 (5, 8)
#> 8 8.377 [8, 9.708]
#> 9 1.086 [1.086, 2)
#> 10 4.495 [2, 5)
To quickly produce a table of chopped data, use tab():
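As a minimal sketch, tab() can be applied to the integers 1 to 10 with breaks at 2, 5 and 8. The counts mentioned in the comment follow from the intervals these breaks define.

```r
library(santoku)

# Chop 1:10 at 2, 5 and 8, then tabulate the resulting intervals.
# The breaks are extended to 1 and 10, so the four intervals
# hold 1, 3, 3 and 3 values respectively.
counts <- tab(1:10, c(2, 5, 8))
counts
```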
To chop into fixed-width intervals, starting at the minimum value,
use chop_width():
chopped <- chop_width(x, 2)
data.frame(x, chopped)
#> x chopped
#> 1 4.978 [3.086, 5.086)
#> 2 8.970 [7.086, 9.086)
#> 3 3.392 [3.086, 5.086)
#> 4 4.677 [3.086, 5.086)
#> 5 7.057 [5.086, 7.086)
#> 6 9.708 [9.086, 11.09]
#> 7 6.714 [5.086, 7.086)
#> 8 8.377 [7.086, 9.086)
#> 9 1.086 [1.086, 3.086)
#> 10 4.495 [3.086, 5.086)
To chop into a fixed number of intervals, each with the same width,
use chop_evenly():
chopped <- chop_evenly(x, intervals = 3)
data.frame(x, chopped)
#> x chopped
#> 1 4.978 [3.96, 6.834)
#> 2 8.970 [6.834, 9.708]
#> 3 3.392 [1.086, 3.96)
#> 4 4.677 [3.96, 6.834)
#> 5 7.057 [6.834, 9.708]
#> 6 9.708 [6.834, 9.708]
#> 7 6.714 [3.96, 6.834)
#> 8 8.377 [6.834, 9.708]
#> 9 1.086 [1.086, 3.96)
#> 10 4.495 [3.96, 6.834)
To chop into groups with a fixed number of members, use chop_n():
chopped <- chop_n(x, 4)
table(chopped)
#> chopped
#> [1.086, 4.978) [4.978, 8.97) [8.97, 9.708]
#> 4 4 2
To chop into a fixed number of groups, each with the same number of
elements, use chop_equally():
chopped <- chop_equally(x, groups = 5)
table(chopped)
#> chopped
#> [1.086, 4.275) [4.275, 4.858) [4.858, 6.851) [6.851, 8.495) [8.495, 9.708]
#> 2 2 2 2 2
To chop data up by quantiles, use chop_quantiles():
chopped <- chop_quantiles(x, c(0.25, 0.5, 0.75))
data.frame(x, chopped)
#> x chopped
#> 1 4.978 [25%, 50%)
#> 2 8.970 [75%, 100%]
#> 3 3.392 [0%, 25%)
#> 4 4.677 [25%, 50%)
#> 5 7.057 [50%, 75%)
#> 6 9.708 [75%, 100%]
#> 7 6.714 [50%, 75%)
#> 8 8.377 [75%, 100%]
#> 9 1.086 [0%, 25%)
#> 10 4.495 [0%, 25%)
To chop data up by proportions of the data range, use
chop_proportions():
chopped <- chop_proportions(x, c(0.25, 0.5, 0.75))
data.frame(x, chopped)
#> x chopped
#> 1 4.978 [3.242, 5.397)
#> 2 8.970 [7.552, 9.708]
#> 3 3.392 [3.242, 5.397)
#> 4 4.677 [3.242, 5.397)
#> 5 7.057 [5.397, 7.552)
#> 6 9.708 [7.552, 9.708]
#> 7 6.714 [5.397, 7.552)
#> 8 8.377 [7.552, 9.708]
#> 9 1.086 [1.086, 3.242)
#> 10 4.495 [3.242, 5.397)
You can think of these six functions as logically arranged in a table.
To chop into… | Sizing intervals by number of elements: | Sizing intervals by interval width:
---|---|---
a specific number of equal intervals… | chop_equally() | chop_evenly()
intervals of one specific size… | chop_n() | chop_width()
intervals of different specific sizes… | chop_quantiles() | chop_proportions()
To chop data by standard deviations around the mean, use
chop_mean_sd():
chopped <- chop_mean_sd(x)
data.frame(x, chopped)
#> x chopped
#> 1 4.978 [-1 sd, 0 sd)
#> 2 8.970 [1 sd, 2 sd)
#> 3 3.392 [-1 sd, 0 sd)
#> 4 4.677 [-1 sd, 0 sd)
#> 5 7.057 [0 sd, 1 sd)
#> 6 9.708 [1 sd, 2 sd)
#> 7 6.714 [0 sd, 1 sd)
#> 8 8.377 [0 sd, 1 sd)
#> 9 1.086 [-2 sd, -1 sd)
#> 10 4.495 [-1 sd, 0 sd)
To chop data into attractive intervals, use chop_pretty(). This
selects intervals which are a multiple of 2, 5 or 10. It’s useful for
producing bar plots.
chopped <- chop_pretty(x)
data.frame(x, chopped)
#> x chopped
#> 1 4.978 [4, 6)
#> 2 8.970 [8, 10]
#> 3 3.392 [2, 4)
#> 4 4.677 [4, 6)
#> 5 7.057 [6, 8)
#> 6 9.708 [8, 10]
#> 7 6.714 [6, 8)
#> 8 8.377 [8, 10]
#> 9 1.086 [0, 2)
#> 10 4.495 [4, 6)
tab_n(), tab_width(), and friends act similarly to tab(), calling the
related chop_* function and then table() on the result.
tab_n(x, 4)
#> [1.086, 4.978) [4.978, 8.97) [8.97, 9.708]
#> 4 4 2
tab_width(x, 2)
#> [1.086, 3.086) [3.086, 5.086) [5.086, 7.086) [7.086, 9.086) [9.086, 11.09]
#> 1 4 2 2 1
tab_evenly(x, 5)
#> [1.086, 2.81) [2.81, 4.535) [4.535, 6.259) [6.259, 7.983) [7.983, 9.708]
#> 1 2 2 2 3
tab_mean_sd(x)
#> [-2 sd, -1 sd) [-1 sd, 0 sd) [0 sd, 1 sd) [1 sd, 2 sd)
#> 1 4 3 2
By default, santoku labels intervals using mathematical notation:
[0, 1] means all numbers between 0 and 1 inclusive.
(0, 1) means all numbers strictly between 0 and 1, not including the endpoints.
[0, 1) means all numbers between 0 and 1, including 0 but not 1.
(0, 1] means all numbers between 0 and 1, including 1 but not 0.
{0} means just the number 0.
To override these labels, provide names to the breaks argument:
chopped <- chop(x, c(Lowest = 1, Low = 2, Higher = 5, Highest = 8))
data.frame(x, chopped)
#> x chopped
#> 1 4.978 Low
#> 2 8.970 Highest
#> 3 3.392 Low
#> 4 4.677 Low
#> 5 7.057 Higher
#> 6 9.708 Highest
#> 7 6.714 Higher
#> 8 8.377 Highest
#> 9 1.086 Lowest
#> 10 4.495 Low
Or, you can specify factor labels with the labels argument:
chopped <- chop(x, c(2, 5, 8), labels = c("Lowest", "Low", "Higher", "Highest"))
data.frame(x, chopped)
#> x chopped
#> 1 4.978 Low
#> 2 8.970 Highest
#> 3 3.392 Low
#> 4 4.677 Low
#> 5 7.057 Higher
#> 6 9.708 Highest
#> 7 6.714 Higher
#> 8 8.377 Highest
#> 9 1.086 Lowest
#> 10 4.495 Low
You need as many labels as there are intervals: one fewer than
length(breaks) if your data doesn’t extend beyond breaks, one more
than length(breaks) if it extends beyond breaks at both ends.
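The two cases can be sketched with small hand-picked vectors (the data and label names here are illustrative only):

```r
library(santoku)

breaks <- c(2, 5, 8)

# Data inside the breaks: 2 intervals, so length(breaks) - 1 labels.
inside <- chop(c(2, 3, 6), breaks, labels = c("Low", "High"))

# Data beyond the breaks at both ends: the breaks are extended on each
# side, giving 4 intervals, so length(breaks) + 1 labels.
beyond <- chop(c(1, 3, 6, 9), breaks,
               labels = c("Lowest", "Low", "High", "Highest"))

data.frame(x = c(1, 3, 6, 9), beyond)
```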
To label intervals with a dash, use lbl_dash():
chopped <- chop(x, c(2, 5, 8), labels = lbl_dash())
data.frame(x, chopped)
#> x chopped
#> 1 4.978 2—5
#> 2 8.970 8—9.708
#> 3 3.392 2—5
#> 4 4.677 2—5
#> 5 7.057 5—8
#> 6 9.708 8—9.708
#> 7 6.714 5—8
#> 8 8.377 8—9.708
#> 9 1.086 1.086—2
#> 10 4.495 2—5
To label integer data, use lbl_discrete(). It uses more informative
right endpoints:
chopped <- chop(1:10, c(2, 5, 8), labels = lbl_discrete())
chopped2 <- chop(1:10, c(2, 5, 8), labels = lbl_dash())
data.frame(x = 1:10, lbl_discrete = chopped, lbl_dash = chopped2)
#> x lbl_discrete lbl_dash
#> 1 1 1 1—2
#> 2 2 2—4 2—5
#> 3 3 2—4 2—5
#> 4 4 2—4 2—5
#> 5 5 5—7 5—8
#> 6 6 5—7 5—8
#> 7 7 5—7 5—8
#> 8 8 8—10 8—10
#> 9 9 8—10 8—10
#> 10 10 8—10 8—10
You can customize the first or last labels:
chopped <- chop(x, c(2, 5, 8), labels = lbl_dash(first = "< 2", last = "8+"))
data.frame(x, chopped)
#> x chopped
#> 1 4.978 2—5
#> 2 8.970 8+
#> 3 3.392 2—5
#> 4 4.677 2—5
#> 5 7.057 5—8
#> 6 9.708 8+
#> 7 6.714 5—8
#> 8 8.377 8+
#> 9 1.086 < 2
#> 10 4.495 2—5
To label intervals in order, use lbl_seq():
chopped <- chop(x, c(2, 5, 8), labels = lbl_seq())
data.frame(x, chopped)
#> x chopped
#> 1 4.978 b
#> 2 8.970 d
#> 3 3.392 b
#> 4 4.677 b
#> 5 7.057 c
#> 6 9.708 d
#> 7 6.714 c
#> 8 8.377 d
#> 9 1.086 a
#> 10 4.495 b
You can use numerals or even roman numerals:
chop(x, c(2, 5, 8), labels = lbl_seq("(1)"))
#> [1] (2) (4) (2) (2) (3) (4) (3) (4) (1) (2)
#> Levels: (1) (2) (3) (4)
chop(x, c(2, 5, 8), labels = lbl_seq("i."))
#> [1] ii. iv. ii. ii. iii. iv. iii. iv. i. ii.
#> Levels: i. ii. iii. iv.
Other labelling functions include:
lbl_endpoints() - use left endpoints as labels
lbl_midpoints() - use interval midpoints as labels
lbl_glue() - specify labels flexibly with the {glue} package
By default, chop() extends breaks if necessary. If you don’t want
that, set extend = FALSE:
chopped <- chop(x, c(3, 5, 7), extend = FALSE)
data.frame(x, chopped)
#> x chopped
#> 1 4.978 [3, 5)
#> 2 8.970 <NA>
#> 3 3.392 [3, 5)
#> 4 4.677 [3, 5)
#> 5 7.057 <NA>
#> 6 9.708 <NA>
#> 7 6.714 [5, 7]
#> 8 8.377 <NA>
#> 9 1.086 <NA>
#> 10 4.495 [3, 5)
Data outside the range of breaks will become NA.
By default, intervals are closed on the left, i.e. they include their
left endpoints. If you want right-closed intervals, set left = FALSE:
y <- 1:5
data.frame(
y = y,
left_closed = chop(y, 1:5),
right_closed = chop(y, 1:5, left = FALSE)
)
#> y left_closed right_closed
#> 1 1 [1, 2) [1, 2]
#> 2 2 [2, 3) [1, 2]
#> 3 3 [3, 4) (2, 3]
#> 4 4 [4, 5] (3, 4]
#> 5 5 [4, 5] (4, 5]
By default, the last interval is closed on both ends. If you want to
keep the last interval open at the end, set close_end = FALSE:
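A minimal sketch of the difference, reusing the vector y from above (redefined here so the snippet is self-contained). With close_end = FALSE, the value 5 no longer shares an interval with 4:

```r
library(santoku)

y <- 1:5

# Default: the last interval is closed on both ends, [4, 5].
closed_end <- chop(y, 1:5)

# close_end = FALSE: the last interval stays right-open, [4, 5).
open_end <- chop(y, 1:5, close_end = FALSE)

data.frame(y, closed_end, open_end)
```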
You can chop many kinds of vectors with santoku, including Date objects…
y2k <- as.Date("2000-01-01") + 0:10 * 7
data.frame(
y2k = y2k,
chopped = chop(y2k, as.Date(c("2000-02-01", "2000-03-01")))
)
#> y2k chopped
#> 1 2000-01-01 [2000-01-01, 2000-02-01)
#> 2 2000-01-08 [2000-01-01, 2000-02-01)
#> 3 2000-01-15 [2000-01-01, 2000-02-01)
#> 4 2000-01-22 [2000-01-01, 2000-02-01)
#> 5 2000-01-29 [2000-01-01, 2000-02-01)
#> 6 2000-02-05 [2000-02-01, 2000-03-01)
#> 7 2000-02-12 [2000-02-01, 2000-03-01)
#> 8 2000-02-19 [2000-02-01, 2000-03-01)
#> 9 2000-02-26 [2000-02-01, 2000-03-01)
#> 10 2000-03-04 [2000-03-01, 2000-03-11]
#> 11 2000-03-11 [2000-03-01, 2000-03-11]
… and POSIXct (date-time) objects:
# hours of the 2020 Crew Dragon flight:
crew_dragon <- seq(as.POSIXct("2020-05-30 18:00", tz = "GMT"),
length.out = 24, by = "hours")
liftoff <- as.POSIXct("2020-05-30 15:22", tz = "America/New_York")
dock <- as.POSIXct("2020-05-31 10:16", tz = "America/New_York")
data.frame(
crew_dragon = crew_dragon,
chopped = chop(crew_dragon, c(liftoff, dock),
labels = c("pre-flight", "flight", "docked"))
)
#> Warning in .check_tzones(e1, e2): 'tzone' attributes are inconsistent
#> Warning in .check_tzones(e1, e2): 'tzone' attributes are inconsistent
#> crew_dragon chopped
#> 1 2020-05-30 18:00:00 pre-flight
#> 2 2020-05-30 19:00:00 pre-flight
#> 3 2020-05-30 20:00:00 flight
#> 4 2020-05-30 21:00:00 flight
#> 5 2020-05-30 22:00:00 flight
#> 6 2020-05-30 23:00:00 flight
#> 7 2020-05-31 00:00:00 flight
#> 8 2020-05-31 01:00:00 flight
#> 9 2020-05-31 02:00:00 flight
#> 10 2020-05-31 03:00:00 flight
#> 11 2020-05-31 04:00:00 flight
#> 12 2020-05-31 05:00:00 flight
#> 13 2020-05-31 06:00:00 flight
#> 14 2020-05-31 07:00:00 flight
#> 15 2020-05-31 08:00:00 flight
#> 16 2020-05-31 09:00:00 flight
#> 17 2020-05-31 10:00:00 flight
#> 18 2020-05-31 11:00:00 flight
#> 19 2020-05-31 12:00:00 flight
#> 20 2020-05-31 13:00:00 flight
#> 21 2020-05-31 14:00:00 flight
#> 22 2020-05-31 15:00:00 docked
#> 23 2020-05-31 16:00:00 docked
#> 24 2020-05-31 17:00:00 docked
Note how santoku correctly handles the different timezones.
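The breaks and the data are compared as instants in time, not by their printed clock times. As a base R sketch of why the first two rows are pre-flight, the liftoff time can be printed in GMT (New York was on daylight saving time, four hours behind GMT, on that date):

```r
# Liftoff expressed in New York local time, as in the example above.
liftoff <- as.POSIXct("2020-05-30 15:22", tz = "America/New_York")

# The same instant printed in GMT, the timezone of the crew_dragon vector.
liftoff_gmt <- format(liftoff, "%Y-%m-%d %H:%M", tz = "GMT")
liftoff_gmt
#> [1] "2020-05-30 19:22"
```

So 18:00 and 19:00 GMT fall before liftoff, and 20:00 GMT falls after it.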
You can use chop_width() with objects from the lubridate package, to
chop by irregular periods such as months:
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
data.frame(
y2k = y2k,
chopped = chop_width(y2k, months(1))
)
#> y2k chopped
#> 1 2000-01-01 [2000-01-01, 2000-02-01)
#> 2 2000-01-08 [2000-01-01, 2000-02-01)
#> 3 2000-01-15 [2000-01-01, 2000-02-01)
#> 4 2000-01-22 [2000-01-01, 2000-02-01)
#> 5 2000-01-29 [2000-01-01, 2000-02-01)
#> 6 2000-02-05 [2000-02-01, 2000-03-01)
#> 7 2000-02-12 [2000-02-01, 2000-03-01)
#> 8 2000-02-19 [2000-02-01, 2000-03-01)
#> 9 2000-02-26 [2000-02-01, 2000-03-01)
#> 10 2000-03-04 [2000-03-01, 2000-04-01)
#> 11 2000-03-11 [2000-03-01, 2000-04-01)
You can format labels using format strings from strptime().
lbl_discrete() is useful here:
data.frame(
y2k = y2k,
chopped = chop_width(y2k, months(1), labels = lbl_discrete(fmt = "%e %b"))
)
#> y2k chopped
#> 1 2000-01-01 1 Jan—31 Jan
#> 2 2000-01-08 1 Jan—31 Jan
#> 3 2000-01-15 1 Jan—31 Jan
#> 4 2000-01-22 1 Jan—31 Jan
#> 5 2000-01-29 1 Jan—31 Jan
#> 6 2000-02-05 1 Feb—29 Feb
#> 7 2000-02-12 1 Feb—29 Feb
#> 8 2000-02-19 1 Feb—29 Feb
#> 9 2000-02-26 1 Feb—29 Feb
#> 10 2000-03-04 1 Mar—31 Mar
#> 11 2000-03-11 1 Mar—31 Mar
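The format codes are the ones base R’s format() accepts for dates. A small sketch using only numeric codes, since month names such as those produced by %b depend on the locale:

```r
leap_day <- as.Date("2000-02-29")

# %d is the zero-padded day of the month, %m the zero-padded month number.
day_month <- format(leap_day, "%d/%m")
day_month
#> [1] "29/02"

# %e is the day of the month padded with a space instead of a zero.
day_spaced <- format(as.Date("2000-03-04"), "%e")
day_spaced
```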
You can also chop vectors with units, using the units package:
library(units)
#> udunits database from /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/units/share/udunits/udunits2.xml
x <- set_units(1:10 * 10, cm)
br <- set_units(1:3, ft)
data.frame(
x = x,
chopped = chop(x, br)
)
#> x chopped
#> 1 10 [cm] [ 10.00 [cm], 30.48 [cm])
#> 2 20 [cm] [ 10.00 [cm], 30.48 [cm])
#> 3 30 [cm] [ 10.00 [cm], 30.48 [cm])
#> 4 40 [cm] [ 30.48 [cm], 60.96 [cm])
#> 5 50 [cm] [ 30.48 [cm], 60.96 [cm])
#> 6 60 [cm] [ 30.48 [cm], 60.96 [cm])
#> 7 70 [cm] [ 60.96 [cm], 91.44 [cm])
#> 8 80 [cm] [ 60.96 [cm], 91.44 [cm])
#> 9 90 [cm] [ 60.96 [cm], 91.44 [cm])
#> 10 100 [cm] [ 91.44 [cm], 100.00 [cm]]
You should be able to chop anything that has a comparison operator. You can even chop character data using lexical ordering. By default santoku emits a warning in this case, to avoid accidentally misinterpreting results:
chop(letters[1:10], c("d", "f"))
#> Warning in categorize_non_numeric(x, breaks, left): `x` or `breaks` is of type
#> character, using lexical sorting. To turn off this warning, run:
#> options(santoku.warn_character = FALSE)
#> [1] [a, d) [a, d) [a, d) [d, f) [d, f) [f, j] [f, j] [f, j] [f, j] [f, j]
#> Levels: [a, d) [d, f) [f, j]
If you find a type of data that you can’t chop, please file an issue.