Type:

Package

Title:

Dataframe Difference Tool

Version:

1.1.1

Description:

Functions for comparing two data.frames against each other. The core functionality is to provide a detailed breakdown of any differences between two data.frames as well as providing utility functions to help narrow down the source of problems and differences.

Encoding:

UTF-8

Language:

en-GB

Depends:

R (≥ 3.1.2)

Imports:

tibble, assertthat, methods

Suggests:

testthat, lubridate, knitr, rmarkdown, purrr, dplyr, stringi, stringr, devtools, covr, bit64

RoxygenNote:

7.3.2

VignetteBuilder:

knitr

License:

MIT + file LICENSE

URL:

https://gowerc.github.io/diffdf/, https://github.com/gowerc/diffdf/

Config/testthat/edition:

BugReports:

https://github.com/gowerc/diffdf/issues

NeedsCompilation:

Packaged:

2024-09-24 16:38:01 UTC; gowerc

Author:

Craig Gower-Page [cre, aut], Kieran Martin [aut]

Maintainer:

Craig Gower-Page <craig.gower-page@roche.com>

Repository:

CRAN

Date/Publication:

2024-09-24 17:00:02 UTC

as_ascii_table

Description

This function takes a data.frame and attempts to convert it into a simple ascii format suitable for printing to the screen. It is assumed that all variable values have an as.character() method in order to cast them to character.

Usage

as_ascii_table(dat, line_prefix = "  ")

Arguments

dat

Input dataset to convert into an ascii table

line_prefix

Symbols to prefix in front of every line of the table

as_character

Description

Stub function to enable mocking in unit tests

Usage

as_character()

Format vector to printable string

Description

Coerces a vector of any type into a printable string. The most significant transformation is performed on existing character vectors which will be truncated, have newlines converted to explicit symbols and will be wrapped in quotes if they contain white space.

Usage

as_fmt_char(x, ...)

## S3 method for class 'numeric'
as_fmt_char(x, ...)

## S3 method for class 'NULL'
as_fmt_char(x, ...)

## S3 method for class 'list'
as_fmt_char(x, ...)

## S3 method for class 'factor'
as_fmt_char(x, ...)

## S3 method for class 'character'
as_fmt_char(x, add_quotes = TRUE, crop_at = 30, ...)

## Default S3 method:
as_fmt_char(x, ...)

## S3 method for class 'POSIXt'
as_fmt_char(x, ...)

Arguments

x

(vector)
vector to be converted to character

...

additional arguments (not currently used)

add_quotes

(logical)
if TRUE, strings that contain white space are wrapped in quotes

crop_at

(numeric)
the length at which strings are truncated
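
The character transformation described above can be sketched in base R. This is a minimal illustration only and not the package's implementation: the helper name fmt_char_sketch, the "<nl>" newline symbol and the "..." truncation marker are all assumptions made for the example.

```r
# Hypothetical helper: NOT the package's as_fmt_char() implementation.
fmt_char_sketch <- function(x, add_quotes = TRUE, crop_at = 30) {
    x <- as.character(x)
    # Make newlines visible instead of letting them break the layout
    # ("<nl>" is an assumed symbol)
    x <- gsub("\n", "<nl>", x, fixed = TRUE)
    # Truncate long strings and mark the truncation ("..." is assumed)
    too_long <- !is.na(x) & nchar(x) > crop_at
    x[too_long] <- paste0(substr(x[too_long], 1, crop_at), "...")
    # Wrap strings that still contain white space in quotes
    if (add_quotes) {
        has_ws <- !is.na(x) & grepl("[[:space:]]", x)
        x[has_ws] <- paste0('"', x[has_ws], '"')
    }
    x
}

fmt_char_sketch(c("abc", "a b", "line1\nline2"))
fmt_char_sketch(strrep("x", 40), crop_at = 5)
```

Only the second element of the first call gains quotes, because the newline in the third element is replaced before the white space check runs.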

Assert that keys are valid

Description

Utility function to check that user-provided "keys" aren't listed as a problem variable in the current list of issues.

Usage

assert_valid_keys(COMPARE, KEYS, component, msg)

Arguments

COMPARE

(list)
A named list in which each element is a data.frame with the column VARIABLE

KEYS

(character)
name of key variables to check to make sure they don't contain any issues

component

(character)
name of the component within COMPARE to check against

msg

(character)
error message to print if any of KEYS are found within COMPARE[[component]]$VARIABLE

cast_variables

Description

Function to cast dataset columns if they have differing types. Restricted to specific cases: currently integer and double, and character and factor.

Usage

cast_variables(
  BASE,
  COMPARE,
  ignore_vars = NULL,
  cast_integers = FALSE,
  cast_factors = FALSE
)

Arguments

BASE

base dataset

COMPARE

comparison dataset

ignore_vars

Variables not to be considered for casting

cast_integers

Logical - Whether integers should be cast to double when compared to doubles

cast_factors

Logical - Whether factors should be cast to characters when compared to characters
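
The casting rule can be sketched in base R as follows. This is an illustration under stated assumptions rather than the package's code: cast_sketch is a hypothetical name, the ignore_vars argument is left out for brevity, and only the first class of each column is inspected.

```r
# Hypothetical sketch of the casting rule; not the package's cast_variables().
cast_sketch <- function(base, compare, cast_integers = FALSE, cast_factors = FALSE) {
    for (v in intersect(names(base), names(compare))) {
        cb <- class(base[[v]])[1]
        ck <- class(compare[[v]])[1]
        # integer vs double: promote the integer side to double
        if (cast_integers && cb == "integer" && ck == "numeric") {
            base[[v]] <- as.double(base[[v]])
        }
        if (cast_integers && cb == "numeric" && ck == "integer") {
            compare[[v]] <- as.double(compare[[v]])
        }
        # factor vs character: convert the factor side to character
        if (cast_factors && cb == "factor" && ck == "character") {
            base[[v]] <- as.character(base[[v]])
        }
        if (cast_factors && cb == "character" && ck == "factor") {
            compare[[v]] <- as.character(compare[[v]])
        }
    }
    list(base = base, compare = compare)
}

b <- data.frame(n = 1:3, f = factor(c("a", "b", "c")))
k <- data.frame(n = c(1, 2, 3), f = c("a", "b", "c"), stringsAsFactors = FALSE)
res <- cast_sketch(b, k, cast_integers = TRUE, cast_factors = TRUE)
sapply(res$base, class)
```

Columns whose types already agree are left untouched, as are mismatches outside the two supported pairs.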

class_merge

Description

Convenience function to put all classes an object has into one string

Usage

class_merge(x)

Arguments

x

an object

compare_vectors

Description

Compare two vectors looking for differences

Usage

compare_vectors(target, current, ...)

Arguments

target

the base vector

current

a vector to compare target to

...

Additional arguments which might be passed through (numerical accuracy)

compare_vectors.default

Description

Default method, used if the vector is not numeric or factor. Performs a basic comparison.

Usage

## Default S3 method:
compare_vectors(target, current, ...)

Arguments

target

the base vector

current

a vector to compare target to

...

Additional arguments which might be passed through (numerical accuracy)

compare_vectors.factor

Description

Compares factors by converting them to character and then comparing the results.

Usage

## S3 method for class 'factor'
compare_vectors(target, current, ...)

Arguments

target

the base vector

current

a vector to compare target to

...

Additional arguments which might be passed through (numerical accuracy)

compare_vectors.int64

Description

Handle int64 vectors. Uses numeric comparison

Usage

## S3 method for class 'integer64'
compare_vectors(
  target,
  current,
  tolerance = sqrt(.Machine$double.eps),
  scale = NULL,
  ...
)

Arguments

target

the base vector

current

a vector to compare target to

tolerance

Level of tolerance for differences between two variables

scale

Scale that tolerance should be set on. If NULL assume absolute

...

Not used

compare_vectors.numeric

Description

This is a modified version of the all.equal function which returns a vector rather than a message

Usage

## S3 method for class 'numeric'
compare_vectors(
  target,
  current,
  tolerance = sqrt(.Machine$double.eps),
  scale = NULL,
  ...
)

Arguments

target

the base vector

current

a vector to compare target to

tolerance

Level of tolerance for differences between two variables

scale

Scale that tolerance should be set on. If NULL assume absolute

...

Not used
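
Whereas all.equal summarises a comparison in a single result, an element-wise check returns one logical value per position. The sketch below is a minimal illustration, not the package's method: differs_sketch is a hypothetical name, and it assumes that the absolute difference is what gets compared against the tolerance.

```r
# Hypothetical sketch of an element-wise tolerance check.
differs_sketch <- function(target, current,
                           tolerance = sqrt(.Machine$double.eps),
                           scale = NULL) {
    diff <- abs(target - current)         # assumption: absolute difference
    if (!is.null(scale)) {
        diff <- diff / scale              # optional rescaling before the test
    }
    diff > tolerance                      # TRUE marks a difference
}

x <- c(1, 2, 3)
y <- c(1.1, 2, 3)
differs_sketch(x, y)                      # only the first element differs
differs_sketch(x, y, tolerance = 0.2)     # 0.1 is within a tolerance of 0.2
differs_sketch(x, y, scale = 10, tolerance = 0.2)
```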

construct_issue

Description

Makes an S3 object with class issue and an optional additional class, and assigns the other arguments to attributes

Usage

construct_issue(value, message, add_class = NULL)

Arguments

value

the value of the object

message

the value of the message attribute

add_class

additional class to add
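
A minimal sketch of this kind of constructor in base R, assuming the message is stored in an attribute named "message" and that the new classes are placed ahead of the value's existing classes. The name make_issue_sketch and the class "issue_demo" are hypothetical.

```r
# Hypothetical sketch; the real construct_issue() may differ in detail.
make_issue_sketch <- function(value, message, add_class = NULL) {
    structure(
        value,
        message = message,
        # assumed ordering: additional class, then "issue", then original classes
        class = c(add_class, "issue", class(value))
    )
}

iss <- make_issue_sketch(
    data.frame(VARIABLE = "v1"),
    message = "Example message",
    add_class = "issue_demo"
)
inherits(iss, "issue")
attr(iss, "message")
```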

convert_to_issue

Description

Converts the count value into the correct issue format

Usage

convert_to_issue(datin)

Arguments

datin

data inputted

Describe the datasets being compared

Description

This function is used to produce a basic summary table of the core features of the two data.frames being compared.

Usage

describe_dataframe(base, comp, base_name, comp_name)

Arguments

base

(data.frame)
base dataset to be described

comp

(data.frame)
comparison dataset to be described

base_name

(character)
name of the base dataset

comp_name

(character)
name of the comparison dataset

diffdf

Description

Compares 2 dataframes and outputs any differences.

Usage

diffdf(
  base,
  compare,
  keys = NULL,
  suppress_warnings = FALSE,
  strict_numeric = TRUE,
  strict_factor = TRUE,
  file = NULL,
  tolerance = sqrt(.Machine$double.eps),
  scale = NULL,
  check_column_order = FALSE,
  check_df_class = FALSE
)

Arguments

base

input dataframe

compare

comparison dataframe

keys

vector of variables (as strings) that defines a unique row in the base and compare dataframes

suppress_warnings

Do you want to suppress warnings? (logical)

strict_numeric

Flag for strict numeric to numeric comparisons (default = TRUE). If FALSE, diffdf will cast integer to double where required for comparisons. Note that variables specified in the keys will never be cast.

strict_factor

Flag for strict factor to character comparisons (default = TRUE). If FALSE, diffdf will cast factors to characters where required for comparisons. Note that variables specified in the keys will never be cast.

file

Location and name of a text file to output the results to. Setting to NULL will cause no file to be produced.

tolerance

Set tolerance for numeric comparisons. Note that comparisons fail if (x-y)/scale > tolerance.

scale

Set scale for numeric comparisons. Note that comparisons fail if (x-y)/scale > tolerance. Setting as NULL is a slightly more efficient version of scale = 1.

check_column_order

Should the column ordering be checked? (logical)

check_df_class

Do you want to check for differences in the class between base and compare? (logical)

Examples

x <- subset(iris, select = -Species)
x[1, 2] <- 5
COMPARE <- diffdf(iris, x)
print(COMPARE)

#### Sample data frames

DF1 <- data.frame(
    id = c(1, 2, 3, 4, 5, 6),
    v1 = letters[1:6],
    v2 = c(NA, NA, 1, 2, 3, NA)
)

DF2 <- data.frame(
    id = c(1, 2, 3, 4, 5, 7),
    v1 = letters[1:6],
    v2 = c(NA, NA, 1, 2, NA, NA),
    v3 = c(NA, NA, 1, 2, NA, 4)
)

diffdf(DF1, DF1, keys = "id")

# We can control matching with scale/location for example:

DF1 <- data.frame(
    id = c(1, 2, 3, 4, 5, 6),
    v1 = letters[1:6],
    v2 = c(1, 2, 3, 4, 5, 6)
)
DF2 <- data.frame(
    id = c(1, 2, 3, 4, 5, 6),
    v1 = letters[1:6],
    v2 = c(1.1, 2, 3, 4, 5, 6)
)

diffdf(DF1, DF2, keys = "id")
diffdf(DF1, DF2, keys = "id", tolerance = 0.2)
diffdf(DF1, DF2, keys = "id", scale = 10, tolerance = 0.2)

# We can use strict_factor to compare factors with characters for example:

DF1 <- data.frame(
    id = c(1, 2, 3, 4, 5, 6),
    v1 = letters[1:6],
    v2 = c(NA, NA, 1, 2, 3, NA),
    stringsAsFactors = FALSE
)

DF2 <- data.frame(
    id = c(1, 2, 3, 4, 5, 6),
    v1 = letters[1:6],
    v2 = c(NA, NA, 1, 2, 3, NA),
    stringsAsFactors = TRUE
)

diffdf(DF1, DF2, keys = "id", strict_factor = TRUE)
diffdf(DF1, DF2, keys = "id", strict_factor = FALSE)

diffdf_has_issues

Description

Utility function which returns TRUE if a diffdf object has issues or FALSE if it does not

Usage

diffdf_has_issues(x)

Arguments

x

diffdf object

Examples


# Example with no issues
x <- diffdf(iris, iris)
diffdf_has_issues(x)

# Example with issues
iris2 <- iris
iris2[2, 2] <- NA
x <- diffdf(iris, iris2, suppress_warnings = TRUE)
diffdf_has_issues(x)

Identify Issue Rows

Description

This function takes a diffdf object and a dataframe and subsets the data.frame for problem rows as identified in the comparison object. If vars has been specified only issue rows associated with those variable(s) will be returned.

Usage

diffdf_issuerows(df, diff, vars = NULL)

Arguments

df

dataframe to be subsetted

diff

diffdf object

vars

(optional) character vector containing names of issue variables to subset dataframe on. A value of NULL (default) will be taken to mean all available issue variables.

Details

Note that diffdf_issuerows can be used to subset against any dataframe. The only requirement is that the original variables specified in the keys argument to diffdf are present on the dataframe you are subsetting against. However, please note that if no keys were specified in diffdf then the row number is used. This means that using diffdf_issuerows without keys against an arbitrary dataset can easily result in nonsense rows being returned. It is always recommended to supply keys to diffdf.
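
The difference between matching on keys and matching on row numbers can be illustrated with base R alone. The data frames below are invented for the example and diffdf_issuerows itself is not called.

```r
# Base R illustration only; diffdf_issuerows() is not called here.
issue_keys <- data.frame(id = c(2, 5))    # keys of the rows flagged as issues
other <- data.frame(
    id = c(5, 4, 3, 2, 1),                # same ids, different row order
    value = letters[1:5],
    stringsAsFactors = FALSE
)

# Matching on the key finds the intended rows whatever the ordering
by_key <- other[other$id %in% issue_keys$id, ]

# Matching on row position (the fallback without keys) picks unrelated rows
by_position <- other[c(2, 5), ]

by_key$id
by_position$id
```

Here the key-based subset returns ids 5 and 2, while the positional subset returns ids 4 and 1, neither of which was flagged.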

Examples

iris2 <- iris
for (i in 1:3) iris2[i, i] <- 99
x <- diffdf(iris, iris2, suppress_warnings = TRUE)
diffdf_issuerows(iris, x)
diffdf_issuerows(iris2, x)
diffdf_issuerows(iris2, x, vars = "Sepal.Length")
diffdf_issuerows(iris2, x, vars = c("Sepal.Length", "Sepal.Width"))

factor_to_character

Description

Takes a dataframe and converts any factor variables to character

Usage

factor_to_character(dsin, vars = NULL)

Arguments

dsin

input dataframe

vars

variables to consider for conversion. Default NULL will consider every variable within the dataset

find_difference

Description

This determines if two vectors are different. It expects vectors of the same length and type, and is intended to be used after checks have already been done. It initially picks out any NAs (matching NAs count as a match), then compares the remaining values.

Usage

find_difference(target, current, ...)

Arguments

target

the base vector

current

a vector to compare target to

...

Additional arguments which might be passed through (numerical accuracy)
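
The NA handling described above can be sketched in base R. This is an illustration, not the package's code: na_aware_diff is a hypothetical name, and treating an NA on one side only as a difference is an assumption.

```r
# Hypothetical sketch of NA-aware comparison; not the package's find_difference().
na_aware_diff <- function(target, current) {
    stopifnot(length(target) == length(current))
    both_na <- is.na(target) & is.na(current)      # matching NAs count as a match
    one_na <- xor(is.na(target), is.na(current))   # assumed: one-sided NA differs
    differs <- target != current                   # NA wherever either side is NA
    differs[both_na] <- FALSE
    differs[one_na] <- TRUE
    differs
}

na_aware_diff(c(1, NA, 3, NA), c(1, NA, 4, 2))
```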

Generate unique key name

Description

Function to generate a name for the keys if not provided

Usage

generate_keyname(
  BASE,
  COMP,
  replace_names = c("..ROWNUMBER..", "..RN..", "..ROWN..", "..N..")
)

Arguments

BASE

base dataset

COMP

comparison dataset

replace_names

a vector of replacement names. Used for recursion, should be edited in function for clarity
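
One way to pick a key name that clashes with neither dataset is to walk the candidate list recursively. The sketch below assumes that the first unused candidate wins; keyname_sketch is a hypothetical name and the error raised when every candidate is taken is also an assumption.

```r
# Hypothetical sketch; assumes the first unused candidate name is chosen.
keyname_sketch <- function(base, comp,
                           candidates = c("..ROWNUMBER..", "..RN..", "..ROWN..", "..N..")) {
    if (length(candidates) == 0) {
        stop("All candidate key names are already in use")
    }
    taken <- c(names(base), names(comp))
    if (!(candidates[1] %in% taken)) {
        return(candidates[1])
    }
    # Recurse with the remaining candidates
    keyname_sketch(base, comp, candidates[-1])
}

clash <- setNames(data.frame(1), "..ROWNUMBER..")
keyname_sketch(iris, iris)
keyname_sketch(clash, iris)
```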

get_casted_dataset

Description

Internal utility function to loop across a dataset casting all target variables

Usage

get_casted_dataset(df, columns, whichdat)

Arguments

df

dataset to be casted

columns

columns to be casted

whichdat

whether base or compare is being casted (used for messages)

get_casted_vector

Description

casts a vector depending on its type and input

Usage

get_casted_vector(colin, colname, whichdat)

Arguments

colin

column to cast

colname

name of vector

whichdat

whether base or compare is being casted (used for messages)

get_issue_dataset

Description

Internal function used by diffdf_issuerows to extract the dataframe from a target issue. In particular it also strips off any non-key variables

Usage

get_issue_dataset(issue, diff)

Arguments

issue

name of issue to extract the dataset from diff

diff

diffdf object which contains issues

get_issue_message

Description

Simple function to grab the issue message

Usage

get_issue_message(object, ...)

Arguments

object

inputted object of class issue

...

other arguments

get_print_message

Description

Get the required text depending on type of issue

Usage

get_print_message(object, ...)

Arguments

object

inputted object of class issue

...

other arguments

get_print_message.default

Description

Errors, as this should only ever be given an issue

Usage

## Default S3 method:
get_print_message(object, ...)

Arguments

object

issue

...

Not used

get_print_message.issue

Description

Get text from a basic issue, based on the class of the value of the issue

Usage

## S3 method for class 'issue'
get_print_message(object, row_limit, ...)

Arguments

object

an object of class issue_basic

row_limit

Max row limit for difference tables (NULL to show all rows)

...

Additional arguments (not used)

get_table

Description

Generate nice looking table from a data frame

Usage

get_table(dsin, row_limit = 10)

Arguments

dsin

dataset

row_limit

Max row limit for difference tables (NULL to show all rows)

has_unique_rows

Description

Check if a data set's rows are unique

Usage

has_unique_rows(DAT, KEYS)

Arguments

DAT

input data set (data frame)

KEYS

Set of keys which should be unique
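
A uniqueness check over a set of key columns can be sketched with duplicated(). This is a base R illustration with a hypothetical function name, not the package's implementation.

```r
# Hypothetical sketch of a key-uniqueness check.
unique_rows_sketch <- function(dat, keys) {
    # drop = FALSE keeps a data frame even when only one key is supplied
    !any(duplicated(dat[, keys, drop = FALSE]))
}

dat <- data.frame(id = c(1, 1, 2), visit = c(1, 2, 1))
unique_rows_sketch(dat, "id")                # id alone repeats
unique_rows_sketch(dat, c("id", "visit"))    # the pair identifies each row
```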

Identify differences in attributes

Description

Identifies any attribute differences between two data frames

Usage

identify_att_differences(BASE, COMP, exclude_cols = "")

Arguments

BASE

Base dataset for comparison (data.frame)

COMP

Comparator dataset to compare base against (data.frame)

exclude_cols

Columns to exclude from comparison

identify_class_differences

Description

Identifies any class differences between two data frames

Usage

identify_class_differences(BASE, COMP)

Arguments

BASE

Base dataset for comparison (data.frame)

COMP

Comparator dataset to compare base against (data.frame)

Find column ordering differences

Description

Compares two datasets and outputs a table listing any differences in the column orders between the two datasets. Columns that are not contained within both are ignored; however, column ordering is derived prior to removing these columns.

Usage

identify_column_order_differences(BASE, COMP)

Arguments

BASE

(data.frame)
Base dataset for comparison

COMP

(data.frame)
Comparator dataset to compare base against

identify_differences

Description

Compares each column within 2 datasets to identify any values on which they mismatch.

Usage

identify_differences(
  BASE,
  COMP,
  KEYS,
  exclude_cols,
  tolerance = sqrt(.Machine$double.eps),
  scale = NULL
)

Arguments

BASE

Base dataset for comparison (data.frame)

COMP

Comparator dataset to compare base against (data.frame)

KEYS

List of variables that define a unique row within the datasets (strings)

exclude_cols

Columns to exclude from comparison

tolerance

Level of tolerance for numeric differences between two variables

scale

Scale that tolerance should be set on. If NULL assume absolute

identify_extra_cols

Description

Identifies columns that are in a baseline dataset but not in a comparator dataset

Usage

identify_extra_cols(DS1, DS2)

Arguments

DS1

Baseline dataset (data frame)

DS2

Comparator dataset (data frame)

identify_extra_rows

Description

Identifies rows that are in a baseline dataset but not in a comparator dataset

Usage

identify_extra_rows(DS1, DS2, KEYS)

Arguments

DS1

Baseline dataset (data frame)

DS2

Comparator dataset (data frame)

KEYS

List of variables that define a unique row within the datasets (strings)
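
The two "extra" checks amount to a set difference on column names and an anti-join on the key variables. The base R sketch below uses hypothetical function names and invented data; the "\r" separator used to build a composite key is an assumption chosen because it is unlikely to appear in real key values.

```r
# Hypothetical base R sketches; not the package's internal implementations.
extra_cols_sketch <- function(ds1, ds2) {
    setdiff(names(ds1), names(ds2))
}

extra_rows_sketch <- function(ds1, ds2, keys) {
    # Build one composite key string per row
    key1 <- do.call(paste, c(unname(as.list(ds1[keys])), sep = "\r"))
    key2 <- do.call(paste, c(unname(as.list(ds2[keys])), sep = "\r"))
    # Keep the key columns of rows present in ds1 but absent from ds2
    ds1[!(key1 %in% key2), keys, drop = FALSE]
}

DF1 <- data.frame(id = c(1, 2, 3, 4, 5, 6), v2 = c(NA, NA, 1, 2, 3, NA))
DF2 <- data.frame(id = c(1, 2, 3, 4, 5, 7), v3 = c(NA, NA, 1, 2, NA, 4))

extra_cols_sketch(DF1, DF2)
extra_rows_sketch(DF1, DF2, "id")
extra_rows_sketch(DF2, DF1, "id")
```

Swapping the argument order reports what is extra in the other dataset.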

identify_matching_cols

Description

Identifies columns with the same name in two data frames

Usage

identify_matching_cols(DS1, DS2, EXCLUDE = "")

Arguments

DS1

Input dataset 1 (data frame)

DS2

Input dataset 2 (data frame)

EXCLUDE

Columns to ignore

identify_mode_differences

Description

Identifies any mode differences between two data frames

Usage

identify_mode_differences(BASE, COMP)

Arguments

BASE

Base dataset for comparison (data.frame)

COMP

Comparator dataset to compare base against (data.frame)

identify_properties

Description

Returns a dataframe of metadata for a given dataset. Returned values include variable names, class, mode, type and attributes

Usage

identify_properties(dsin)

Arguments

dsin

input dataframe that you want to get the metadata from

identify_unsupported_cols

Description

Identifies any columns which the package is not set up to handle

Usage

identify_unsupported_cols(dsin)

Arguments

dsin

input dataset

invert

Description

Utility function used to replicate purrr::transpose. Turns a list inside out.

Usage

invert(x)

Arguments

x

list
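
Turning a list inside out means that x$a$p becomes result$p$a. A minimal base R sketch, assuming that every inner list carries the same names as the first one (invert_sketch is a hypothetical name):

```r
# Hypothetical sketch of list transposition; not the package's invert().
invert_sketch <- function(x) {
    inner_names <- names(x[[1]])    # assumes all inner lists share these names
    out <- lapply(inner_names, function(nm) lapply(x, `[[`, nm))
    names(out) <- inner_names
    out
}

x <- list(a = list(p = 1, q = 2), b = list(p = 3, q = 4))
inv <- invert_sketch(x)
inv$p$b
```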

is_variable_different

Description

This subsets the data set on the variable name, picks out differences and returns a tibble of differences for the given variable

Usage

is_variable_different(variablename, keynames, datain, ...)

Arguments

variablename

name of variable being compared

keynames

name of keys

datain

Inputted dataset with base and compare vectors

...

Additional arguments which might be passed through (numerical accuracy)

Value

A logical vector which is TRUE where target and current are different

Print diffdf objects

Description

Print nicely formatted version of a diffdf object

Usage

## S3 method for class 'diffdf'
print(x, row_limit = 10, as_string = FALSE, ...)

Arguments

x

comparison object created by diffdf().

row_limit

Max row limit for difference tables (NULL to show all rows)

as_string

Return printed message as an R character vector?

...

Additional arguments (not used)

Examples

x <- subset(iris, select = -Species)
x[1, 2] <- 5
COMPARE <- diffdf(iris, x)
print(COMPARE)
print(COMPARE, row_limit = 5)

recursive_reduce

Description

Utility function used to replicate purrr::reduce. Recursively applies a function to a list of elements until only 1 element remains

Usage

recursive_reduce(.l, .f)

Arguments

.l

list of values to apply a function to

.f

function to apply to each element of the list in turn. See details.

Details

This function is essentially performing the following operation:

.l[[1]] <- .f( .l[[1]] , .l[[2]]) ; .l[[1]] <- .f( .l[[1]] , .l[[3]])
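
The operation above can be written as a short recursive function. This sketch uses a hypothetical name and is not the package's code; for a non-empty list it gives the same result as base R's Reduce().

```r
# Hypothetical sketch of a recursive reduce; not the package's recursive_reduce().
reduce_sketch <- function(.l, .f) {
    if (length(.l) == 1) {
        return(.l[[1]])
    }
    # Fold the first two elements together, then recurse on the shorter list
    .l[[2]] <- .f(.l[[1]], .l[[2]])
    reduce_sketch(.l[-1], .f)
}

reduce_sketch(list(1, 2, 3, 4), `+`)
reduce_sketch(list("a", "b", "c"), paste0)
```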

sort_then_join

Description

Convenience function to sort two strings and paste them together

Usage

sort_then_join(string1, string2)

Arguments

string1

first string

string2

second string

Pad String

Description

Utility function used to replicate str_pad. Adds white space to either end of a string to get it to equal the desired length

Usage

string_pad(x, width)

Arguments

x

string

width

desired length
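
A base R sketch of padding a string to a given width. It assumes that the white space is split between both ends, with any odd space going to the right, and that strings already at or beyond the width are returned unchanged; pad_sketch is a hypothetical name.

```r
# Hypothetical sketch of string padding; not the package's string_pad().
pad_sketch <- function(x, width) {
    total <- pmax(width - nchar(x), 0)   # never pad by a negative amount
    left <- floor(total / 2)
    right <- total - left                # assumed: odd space goes to the right
    paste0(strrep(" ", left), x, strrep(" ", right))
}

pad_sketch("ab", 5)
nchar(pad_sketch(c("a", "abc", "abcdef"), 4))
```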