Introduction to kit


Overview

kit provides a collection of fast utility functions implemented in C for data manipulation in R. It serves as a lightweight, high-performance toolkit for tasks that are either slow or cumbersome in base R, such as row-wise operations, vectorized conditionals, and duplicate detection.

Key features include:

Parallel statistical functions: Row-wise operations (psum, pmean, pfirst) using OpenMP.
Vectorized conditionals: Fast if-else logic (iif, nif, vswitch) that preserves attributes.
Efficient set operations: Fast alternatives to unique() and duplicated(), plus occurrence counting (funique, fduplicated, countOccur), for vectors and data frames.
Partial sorting: Retrieve top N elements without sorting the entire vector (topn).
Factor utilities: Fast character-to-factor conversion (charToFact) and level manipulation (setlevels).

Most functions are implemented in C and support multi-threading where applicable, making them significantly faster than their base R equivalents on large datasets.

Parallel Statistical Functions

Computing row-wise statistics across multiple vectors or data frame columns is a common task. While base R has pmin() and pmax(), it lacks efficient equivalents for sum, mean, or product. kit fills this gap.

Row-wise Arithmetic

psum(), pmean(), and pprod() compute parallel sum, mean, and product respectively. They accept multiple vectors or a single list/data frame.

x <- c(1, 3, NA, 5)
y <- c(2, NA, 4, 1)
z <- c(3, 4, 4, 1)

# Parallel sum
psum(x, y, z, na.rm = TRUE)
#> [1] 6 7 8 7

# Parallel mean
pmean(x, y, z, na.rm = TRUE)
#> [1] 2.000000 3.500000 4.000000 2.333333

They are particularly useful for data frames:

df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))
psum(df)
#> [1] 12 15 18
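
As a cross-check on the semantics, the results above can be reproduced in base R with the matrix equivalents listed in the Summary. This is a minimal sketch for comparison only, not how kit computes the values; the names m, ref_sum, and ref_mean are illustrative.

```r
# Base R reference for the row-wise sum and mean (comparison only).
x <- c(1, 3, NA, 5)
y <- c(2, NA, 4, 1)
z <- c(3, 4, 4, 1)

m <- cbind(x, y, z)                     # one column per input vector

ref_sum  <- rowSums(m, na.rm = TRUE)    # same values as the psum() call above
ref_mean <- rowMeans(m, na.rm = TRUE)   # same values as the pmean() call above

ref_sum
#> [1] 6 7 8 7
```

Note that the base R route first builds an intermediate matrix with cbind() before any arithmetic happens.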

Row-wise Min, Max, and Range

fpmin(), fpmax(), and prange() compute parallel minimum, maximum, and range (max - min) respectively. They complement base R’s pmin() and pmax(), providing greater performance and the ability to work efficiently with data frames.

x <- c(1, 3, NA, 5)
y <- c(2, NA, 4, 1)
z <- c(3, 4, 4, 1)

# Parallel minimum
fpmin(x, y, z, na.rm = TRUE)
#> [1] 1 3 4 1

# Parallel maximum
fpmax(x, y, z, na.rm = TRUE)
#> [1] 3 4 4 5

# Parallel range (max - min)
prange(x, y, z, na.rm = TRUE)
#> [1] 2 1 0 4

Like psum() and pmean(), these functions preserve the input type when all inputs have the same type, and automatically promote to the highest type when inputs are mixed (logical < integer < double). prange() always returns double to avoid integer overflow.

# With data frames
fpmin(df)
#> [1] 1 2 3
fpmax(df)
#> [1] 7 8 9
prange(df)
#> [1] 6 6 6
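
For comparison, the same numbers can be obtained in base R. The sketch below only checks the results; the names ref_min, ref_max, and ref_range are illustrative, and nothing here reflects how kit is implemented.

```r
# Base R reference for row-wise min, max, and range (comparison only).
x <- c(1, 3, NA, 5)
y <- c(2, NA, 4, 1)
z <- c(3, 4, 4, 1)

ref_min   <- pmin(x, y, z, na.rm = TRUE)
ref_max   <- pmax(x, y, z, na.rm = TRUE)
ref_range <- ref_max - ref_min          # range defined as max - min

ref_range
#> [1] 2 1 0 4

# A data frame is a list of columns, so do.call() spreads them as arguments.
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))
do.call(pmax, df) - do.call(pmin, df)
#> [1] 6 6 6
```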

Coalescing Values

pfirst() and plast() return the first or last non-missing value across a set of vectors. pfirst() is the equivalent of the SQL COALESCE function.

primary   <- c(NA, 2, NA, 4)
secondary <- c(1, NA, 3, NA)
fallback  <- c(0, 0, 0, 0)

# Take first available value
pfirst(primary, secondary, fallback)
#> [1] 1 2 3 4
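
The coalescing rule can be spelled out in a few lines of base R. The helper coalesce_ref() below is hypothetical and assumes all inputs have the same length; it is meant as a readable reference for the behaviour, not as kit's implementation.

```r
# Base R sketch of "first non-missing value per position" (comparison only).
coalesce_ref <- function(...) {
  vecs <- list(...)
  out <- vecs[[1]]
  for (v in vecs[-1]) {
    gaps <- is.na(out)        # positions still missing
    out[gaps] <- v[gaps]      # fill them from the next vector
  }
  out
}

primary   <- c(NA, 2, NA, 4)
secondary <- c(1, NA, 3, NA)
fallback  <- c(0, 0, 0, 0)

coalesce_ref(primary, secondary, fallback)
#> [1] 1 2 3 4

# Reversing the argument order gives the "last non-missing" variant.
coalesce_ref(fallback, secondary, primary)
#> [1] 0 0 0 0
```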

Logical and Count Operations

You can check conditions or count values row-wise with pall(), pany(), pcount(), and pcountNA().

a <- c(TRUE, FALSE, NA, TRUE)
b <- c(TRUE, NA, TRUE, FALSE)
c <- c(NA, TRUE, FALSE, TRUE)

# Any TRUE per row?
pany(a, b, c, na.rm = TRUE)
#> [1] TRUE TRUE TRUE TRUE

# Count NAs per row
pcountNA(a, b, c)
#> [1] 1 1 1 0

# Count specific value (e.g., TRUE) per row
pcount(a, b, c, value = TRUE)
#> [1] 2 1 1 2
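
The three results above can be verified against a logical matrix in base R. This is a rough reference sketch only; m and the ref_* names are illustrative.

```r
# Base R reference for the row-wise logical checks and counts (comparison only).
a <- c(TRUE, FALSE, NA, TRUE)
b <- c(TRUE, NA, TRUE, FALSE)
c <- c(NA, TRUE, FALSE, TRUE)

m <- cbind(a, b, c)                            # logical matrix, one column per input

ref_any     <- rowSums(m, na.rm = TRUE) > 0    # any TRUE per row, ignoring NA
ref_count   <- rowSums(m == TRUE, na.rm = TRUE)  # number of TRUE values per row
ref_countNA <- rowSums(is.na(m))               # number of NA values per row

ref_countNA
#> [1] 1 1 1 0
```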

Vectorized Conditionals

Fast If-Else (`iif`)

Base R’s ifelse() is known to be slow and often strips attributes (like Date class or factor levels). iif() is a faster, more robust alternative that preserves attributes from the yes argument.

dates <- as.Date(c("2024-01-01", "2024-01-02", "2024-01-03"))

# Base ifelse strips class
class(ifelse(dates > "2024-01-01", dates, dates - 1))
#> [1] "numeric"

# iif preserves class
class(iif(dates > "2024-01-01", dates, dates - 1))
#> [1] "Date"

It also supports explicit NA handling:

x <- c(-2, -1, NA, 1, 2)
iif(x > 0, "positive", "non-positive", na = "missing")
#> [1] "non-positive" "non-positive" "missing"      "positive"     "positive"
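
To see why starting from the yes argument keeps its class, and how a separate NA branch behaves, here is a base R sketch using subassignment. It illustrates the semantics only and is not how iif() is implemented; out and label are illustrative names.

```r
# Base R sketch of an if-else that keeps the class of `yes` (comparison only).
dates <- as.Date(c("2024-01-01", "2024-01-02", "2024-01-03"))
test  <- dates > as.Date("2024-01-01")

out <- dates                         # start from `yes`, so its class is kept
out[!test] <- (dates - 1)[!test]     # overwrite with `no` where test is FALSE
class(out)
#> [1] "Date"

# Explicit NA branch: fill the default first, then the TRUE and NA positions.
x <- c(-2, -1, NA, 1, 2)
label <- rep("non-positive", length(x))
label[which(x > 0)] <- "positive"    # which() drops NA positions
label[is.na(x)]     <- "missing"
label
#> [1] "non-positive" "non-positive" "missing"      "positive"     "positive"
```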

Nested Conditionals (`nif`)

For multiple conditions, nif() offers a cleaner, more efficient syntax than nested ifelse() calls, similar to SQL’s CASE WHEN.

score <- c(95, 82, 67, 45, 78)

nif(
  score >= 90, "A",
  score >= 80, "B", 
  score >= 70, "C",
  score >= 60, "D",
  default = "F"
)
#> [1] "A" "B" "D" "F" "C"
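
The order of the conditions matters: a score of 95 satisfies all four tests, yet only the first match applies, as in CASE WHEN. The base R sketch below reproduces that first-match-wins rule by applying the conditions in reverse; conditions, outputs, and grade are illustrative names, and this is not kit's implementation.

```r
# Base R sketch of first-match-wins conditional assignment (comparison only).
score <- c(95, 82, 67, 45, 78)

conditions <- list(score >= 90, score >= 80, score >= 70, score >= 60)
outputs    <- c("A", "B", "C", "D")

grade <- rep("F", length(score))           # default value
for (i in rev(seq_along(conditions))) {    # apply in reverse order ...
  grade[conditions[[i]]] <- outputs[i]     # ... so earlier conditions win
}
grade
#> [1] "A" "B" "D" "F" "C"
```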

Vectorized Switch (`vswitch`, `nswitch`)

vswitch() maps input values to outputs efficiently.

status_code <- c(1L, 2L, 3L, 1L, 4L)

vswitch(
  x = status_code,
  values = c(1L, 2L, 3L),
  outputs = c("pending", "approved", "rejected"),
  default = "unknown"
)
#> [1] "pending"  "approved" "rejected" "pending"  "unknown"
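
The Summary lists match() plus indexing as the base R counterpart. A minimal sketch of that approach, with idx and res as illustrative names:

```r
# Base R lookup via match() and indexing (comparison only).
status_code <- c(1L, 2L, 3L, 1L, 4L)
values  <- c(1L, 2L, 3L)
outputs <- c("pending", "approved", "rejected")

idx <- match(status_code, values)   # position of each code in `values`, NA if absent
res <- outputs[idx]
res[is.na(idx)] <- "unknown"        # default for unmatched codes
res
#> [1] "pending"  "approved" "rejected" "pending"  "unknown"
```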

For pairwise syntax, nswitch() pairs values and outputs directly.

nswitch(status_code,
  1L, "pending",
  2L, "approved", 
  3L, "rejected",
  default = "unknown"
)
#> [1] "pending"  "approved" "rejected" "pending"  "unknown"

The outputs of nswitch() can also be vectors (such as data frame columns), and scalars and vectors can be mixed:

df <- data.frame(
  code = c(1, 2, 1, 3, 2),
  val_a = c(10, 20, 30, 40, 50),
  val_b = c(100, 200, 300, 400, 500)
)
with(df, nswitch(code,
  1, val_a,
  2, val_b,
  3, 0,
  default = NA_real_
))
#> [1]  10 200  30   0 500

Fast Unique and Duplicates

kit provides optimized versions of unique() and duplicated() that are significantly faster for vectors and data frames.

Unique Values and Duplicates

vec <- c("a", "b", "a", "c", "b")

# Get unique values
funique(vec)
#> [1] "a" "b" "c"

# Check for duplicates
fduplicated(vec)
#> [1] FALSE FALSE  TRUE FALSE  TRUE

uniqLen() efficiently counts the number of unique elements (or unique rows of a data frame) without allocating the unique vector itself:

df <- data.frame(
  x = c(1, 1, 2, 2),
  y = c("a", "a", "b", "b")
)
uniqLen(df)
#> [1] 2
funique(df)
#>   x y
#> 1 1 a
#> 2 2 b
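
For reference, the same counts can be obtained in base R, at the cost of building intermediate objects. A short sketch for comparison:

```r
# Base R reference for counting distinct values and rows (comparison only).
vec <- c("a", "b", "a", "c", "b")
length(unique(vec))        # number of distinct values
#> [1] 3

df <- data.frame(x = c(1, 1, 2, 2), y = c("a", "a", "b", "b"))
nrow(unique(df))           # number of distinct rows
#> [1] 2
sum(!duplicated(df))       # same count without building the unique data frame
#> [1] 2
```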

Counting Occurrences

countOccur() produces a frequency table (similar to table() or dplyr::count()) but returns a standard data frame.

countOccur(c("apple", "banana", "apple", "cherry"))
#>   Variable Count
#> 1    apple     2
#> 2   banana     1
#> 3   cherry     1
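
Getting a comparable data frame out of table() takes an extra conversion step. The sketch below shows one way to do it in base R; fruit and freq are illustrative names, and the columns are renamed by hand to match the output above.

```r
# Base R frequency table converted to a plain data frame (comparison only).
fruit <- c("apple", "banana", "apple", "cherry")

freq <- as.data.frame(table(fruit), stringsAsFactors = FALSE)
names(freq) <- c("Variable", "Count")   # match the column names shown above
freq
#>   Variable Count
#> 1    apple     2
#> 2   banana     1
#> 3   cherry     1
```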

Sorting and Utilities

Partial Sorting (`topn`)

Sorting a large vector just to get the top few elements is inefficient. topn() uses a partial sorting algorithm to retrieve the indices or values of the top (or bottom) N elements.

set.seed(42)
x <- rnorm(1000)

# Get indices of top 5 values
topn(x, n = 5)
#> [1] 988 525 820 459 900

# Get the 5 smallest values instead of indices
# (decreasing = FALSE selects the bottom, index = FALSE returns values)
topn(x, n = 5, decreasing = FALSE, index = FALSE)
#> [1] -3.371739 -3.017933 -2.993090 -2.958780 -2.699930
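
Base R can illustrate the difference between a full sort and a partial sort. The sketch below uses order() as the full-sort reference and the partial argument of sort() as a stand-in for partial sorting; it does not claim to be the algorithm topn() uses, and top_idx and bottom_vals are illustrative names.

```r
# Full sort versus partial sort in base R (comparison only).
set.seed(42)
x <- rnorm(1000)

# Full-sort reference: indices of the 5 largest values.
top_idx <- order(x, decreasing = TRUE)[1:5]

# Partial sort: only positions 1..5 are guaranteed to be in final sorted order.
bottom_vals <- sort(x, partial = 1:5)[1:5]

# The partial result agrees with the full sort on those positions.
all.equal(bottom_vals, sort(x)[1:5])
#> [1] TRUE
```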

Factor Manipulation

charToFact() is a fast alternative to as.factor() for character vectors, with control over NA levels.

charToFact(c("a", "b", NA, "a"))
#> [1] a    b    <NA> a   
#> Levels: a b <NA>
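
In base R, whether NA becomes a level depends on the exclude argument of factor(). The sketch below shows both variants for comparison with the output above; f_default and f_with_na are illustrative names.

```r
# Base R factors with and without an explicit NA level (comparison only).
x <- c("a", "b", NA, "a")

f_default <- factor(x)                  # NA is not a level
f_with_na <- factor(x, exclude = NULL)  # NA kept as an explicit level

levels(f_default)
#> [1] "a" "b"
levels(f_with_na)
#> [1] "a" "b" NA
```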

setlevels() allows you to change factor levels by reference (in-place), avoiding object copying.

Finding Positions (`fpos`)

fpos() finds the positions of a pattern (needle) within a vector (haystack). It can be used to find occurrences of one vector inside another.

haystack <- c(1, 2, 3, 4, 1, 2, 5)
needle <- c(1, 2)

fpos(needle, haystack)
#> [1] 1 5
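
The search can be pictured as a sliding window over the haystack. The helper find_positions() below is a hypothetical, naive base R sketch of that idea for plain vectors; it is not how fpos() is implemented and makes no attempt at speed.

```r
# Naive sliding-window search for a sub-vector (comparison only).
find_positions <- function(needle, haystack) {
  n <- length(needle)
  h <- length(haystack)
  if (n == 0L || n > h) return(integer(0))
  starts <- seq_len(h - n + 1L)         # every possible start position
  hit <- vapply(
    starts,
    function(s) isTRUE(all(haystack[s:(s + n - 1L)] == needle)),
    logical(1)
  )
  starts[hit]                           # start positions where the window matches
}

find_positions(c(1, 2), c(1, 2, 3, 4, 1, 2, 5))
#> [1] 1 5
```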

Summary

Task	kit function	Base R equivalent
Row-wise sum	`psum()`	`rowSums(cbind(...))`
Row-wise mean	`pmean()`	`rowMeans(cbind(...))`
Row-wise min	`fpmin()`	`pmin(...)`
Row-wise max	`fpmax()`	`pmax(...)`
Row-wise range	`prange()`	`pmax(...) - pmin(...)`
First non-NA	`pfirst()`	`apply(cbind(...), 1, function(x) x[!is.na(x)][1])`
Fast if-else	`iif()`	`ifelse()`
Nested if-else	`nif()`	Nested `ifelse()`
Switch	`vswitch()`	`match()` + indexing
Unique values	`funique()`	`unique()`
Top N indices	`topn()`	`order(x, decreasing = TRUE)[1:n]`
Char to Factor	`charToFact()`	`as.factor()`

For comprehensive details and performance benchmarks, please refer to the individual function documentation.

