The datacheckr::check_data() function takes two arguments: the data frame to check and a named list specifying the various conditions.

Checking Columns and Classes

The names of the list elements specify the columns that need to appear in the data frame while the classes of the vectors specify the classes of the columns.

Thus, to specify that mtcars must contain a column called col1 of class integer, the call would be as follows.

library(datacheckr)
check_data(mtcars, list(col1 = integer()))
## Error: column col1 in mtcars must be of class 'integer'

To specify that mtcars cannot contain a column called mpg, the call is just

check_data(mtcars, list(mpg = NULL))
## Error: mtcars must not include column mpg

and to specify that it may optionally contain a column col1 holding either integer or numeric values, the call would be

check_data(mtcars, list(
  col1 = integer(), 
  col1 = NULL, 
  col1 = numeric()))

If a column is not named in the list then no checks are performed on it.
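The presence and class rule above can be pictured with a few lines of base R. This is only an illustrative sketch, not datacheckr's implementation; has_column_of_class() is a hypothetical helper introduced here.

```r
# Hypothetical helper (not part of datacheckr): TRUE when `data` has a
# column named `column` whose class includes `class`.
has_column_of_class <- function(data, column, class) {
  column %in% names(data) && inherits(data[[column]], class)
}

has_column_of_class(mtcars, "mpg", "numeric")   # mpg exists and is numeric
has_column_of_class(mtcars, "col1", "integer")  # mtcars has no column col1
```

The name test comes first so that a missing column short-circuits the check before the class is inspected.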

Checking Missing Values

To specify that a column cannot include missing values pass a single non-missing value.

check_data(mtcars, list(mpg = 3))
check_data(mtcars, list(mpg = -1))

To specify that it can include missing values include an NA in the vector

check_data(mtcars, list(mpg = c(NA, 9)))

and to specify that it can only include missing values use

check_data(mtcars, list(mpg = as.numeric(NA)))
## Error: column mpg in mtcars can only include missing values
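As a rough summary of the three cases above, the following base R sketch classifies a specification vector by the missing values it contains. The helper missing_rule() and its return strings are hypothetical, and the sketch assumes a non-empty specification vector.

```r
# Hypothetical helper (not part of datacheckr): describe the missing-value
# rule implied by a non-empty specification vector.
missing_rule <- function(spec) {
  stopifnot(length(spec) > 0)  # the empty case is outside this sketch
  if (all(is.na(spec))) return("only missing values")
  if (anyNA(spec)) return("missing values allowed")
  "no missing values"
}

missing_rule(3)               # a single non-missing value
missing_rule(c(NA, 9))        # an NA alongside a non-missing value
missing_rule(as.numeric(NA))  # nothing but NA
```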

Checking Ranges

To indicate that the non-missing values must fall within a range use two non-missing values (the following code tests for counts).

data1 <- data.frame(
  Count = c(0L, 3L, 3L, 0L), 
  LocationX = c(2000, NA, 2001, NA), 
  Extra = TRUE)

check_data(data1, list(Count = c(0L, .Machine$integer.max)))

As .Machine$integer.max is difficult to remember, the max_integer() wrapper function is provided so that the above code can be written as follows.

check_data(data1, list(Count = c(0L, max_integer())))
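For intuition, a range check of this kind can be sketched in base R as below. The helper in_range() is hypothetical, and the sketch assumes that both bounds are inclusive and that missing values are ignored. This is consistent with the Count example, where the value 0 passes, but it is not taken from datacheckr's source.

```r
# Hypothetical helper (not part of datacheckr): TRUE when every non-missing
# value lies between the smallest and largest element of `spec`.
# Assumption: both bounds are inclusive.
in_range <- function(values, spec) {
  bounds <- range(spec, na.rm = TRUE)
  observed <- values[!is.na(values)]
  all(observed >= bounds[1] & observed <= bounds[2])
}

counts <- c(0L, 3L, 3L, 0L)
in_range(counts, c(0L, .Machine$integer.max))       # all counts are >= 0
in_range(c(-1L, 2L), c(0L, .Machine$integer.max))   # -1 falls below the range
```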

Checking Specific Values

If particular values are required then specify them as a vector of three or more non-missing values.

check_data(data1, list(Count = c(0L, 1L, 3L)))
check_data(data1, list(Count = c(1L, 2L, 2L)))
## Error: column Count in data1 must only include the permitted values 1 and 2

The order is unimportant.
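A permitted-values check behaves like a set membership test, which is why order and repetition in the specification do not matter. The sketch below uses base R's %in% operator; only_permitted() is a hypothetical helper and is not how datacheckr is necessarily implemented.

```r
# Hypothetical helper (not part of datacheckr): TRUE when every non-missing
# value appears somewhere in `permitted`.
only_permitted <- function(values, permitted) {
  all(values[!is.na(values)] %in% permitted)
}

counts <- c(0L, 3L, 3L, 0L)
only_permitted(counts, c(0L, 1L, 3L))  # 0 and 3 are both permitted
only_permitted(counts, c(3L, 1L, 0L))  # same set in a different order
only_permitted(counts, c(1L, 2L, 2L))  # 0 and 3 are not permitted
```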

Checking Numeric, Date and POSIXct Vectors

Numeric, Date and POSIXct vectors have exactly the same behaviour regarding ranges and specific values as illustrated above using integers.
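The reason the same rules carry over is that base R comparison operators work on Date and POSIXct values just as they do on integers. The following sketch applies an inclusive range test to dates. The data and the inclusive bounds are assumptions made for illustration, and no datacheckr functions are called.

```r
# Illustrative data (hypothetical), including a missing date.
dates  <- as.Date(c("2001-05-01", NA, "2003-12-31"))
bounds <- as.Date(c("2000-01-01", "2005-01-01"))

# Drop missing values, then compare against the two bounds.
observed <- dates[!is.na(dates)]
dates_ok <- all(observed >= bounds[1] & observed <= bounds[2])
dates_ok
```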

Checking Logical Vectors

With logical values, two non-missing values produce the same behaviour as three or more non-missing values. For example, to test for only TRUE values use

check_data(data1, list(Extra = c(TRUE, TRUE)))
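One way to see why the two cases coincide for logical vectors: a logical vector has only two possible non-missing values, so an inclusive range test and a set membership test always agree. The base R sketch below checks this. Both helpers are hypothetical and assume inclusive bounds.

```r
# Hypothetical helpers (not part of datacheckr).
as_range <- function(values, spec) {
  observed <- values[!is.na(values)]
  all(observed >= min(spec) & observed <= max(spec))
}
as_set <- function(values, spec) {
  all(values[!is.na(values)] %in% spec)
}

extra <- c(TRUE, TRUE, TRUE, TRUE)
as_range(extra, c(TRUE, TRUE))           # only TRUE lies within [TRUE, TRUE]
as_set(extra, c(TRUE, TRUE))             # only TRUE is in the set
as_range(c(TRUE, FALSE), c(TRUE, TRUE))  # FALSE lies outside the range
as_set(c(TRUE, FALSE), c(TRUE, TRUE))    # FALSE is not in the set
```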

Checking Character Vectors

To specify that col1 in a data frame x must be a character vector use

check_data(x, list(col1 = "b"))

while the following requires that each value match both character elements, which are treated as regular expressions

check_data(x, list(col1 = c("^\\d", ".*")))

With three or more non-missing character elements, each value in col1 must match at least one of the elements, which are treated as regular expressions. Regular expressions are matched using grepl() with perl = TRUE.
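The difference between matching every pattern and matching at least one can be sketched with grepl(..., perl = TRUE) in base R. The helpers and the example strings below are hypothetical and only illustrate the two rules; they are not datacheckr's implementation.

```r
# Hypothetical helper: logical matrix with one row per value and one
# column per pattern.
pattern_hits <- function(values, patterns) {
  hits <- matrix(FALSE, nrow = length(values), ncol = length(patterns))
  for (j in seq_along(patterns)) {
    hits[, j] <- grepl(patterns[j], values, perl = TRUE)
  }
  hits
}

# Every value must match all patterns (the two-element rule).
match_every <- function(values, patterns) {
  hits <- pattern_hits(values, patterns)
  all(rowSums(hits) == ncol(hits))
}

# Every value must match at least one pattern (the three-or-more rule).
match_any <- function(values, patterns) {
  all(rowSums(pattern_hits(values, patterns)) > 0)
}

match_every(c("1a", "2b"), c("^\\d", ".*"))            # both start with a digit
match_every(c("1a", "b2"), c("^\\d", ".*"))            # "b2" does not
match_any(c("apple", "banana"), c("^a", "^b", "^c"))   # each matches one pattern
match_any(c("apple", "date"), c("^a", "^b", "^c"))     # "date" matches none
```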

Checking Factors

To indicate that supp should be a factor use either of the following

check_data(ToothGrowth, list(supp = factor()))
check_data(ToothGrowth, list(supp = factor("blahblah")))

To specify that supp should be a factor that includes the factor levels OJ and VC (in any order), pass two non-missing values

check_data(ToothGrowth, list(supp = factor(c("VC", "OJ"))))

And to specify the exact factor levels that supp must have, pass three or more non-missing values

check_data(ToothGrowth, list(supp = factor(c("VC", "OJ", "OJ"))))
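The two factor rules can be contrasted in base R as follows. The sketch assumes that "includes" means the required levels are a subset of the column's levels and that "exact" means identical levels in the same order. Both readings are interpretations of the text above rather than something confirmed from datacheckr's source.

```r
supp_levels <- levels(ToothGrowth$supp)

# Two values: the listed levels must be present, in any order.
includes_levels <- all(c("VC", "OJ") %in% supp_levels)

# Three or more values: the levels of the specification factor must equal
# the column's levels (assumption: same levels, same order).
spec <- factor(c("VC", "OJ", "OJ"))
exact_levels <- identical(levels(spec), supp_levels)

includes_levels
exact_levels
```

Note that factor() sorts its levels by default, so the order in which the values are typed in the specification does not change levels(spec).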