dtrackr assumes a tidy data paradigm where one row of data represents
one logical entity, whether that is a car, an iris, a diamond, or
anything else. This is not always the case, for example when the data
you are processing comes from a join of data sets. Here we simulate a
set of patients, test samples, and test results in a hypothetical trial:
# tidyverse supplies dplyr, tidyr, purrr and tibble, all used below
library(tidyverse)
library(dtrackr)

age_cats = factor(sprintf("%02d-%02d",seq(0,80,5),seq(4,84,5)))

# A set of synthetic patients:
patients = tibble::tibble(
  patient_id = 1:100,
  age_category = sample(age_cats,100, replace=TRUE),
  ethnicity = sample(1:6, 100, replace = TRUE),
  gender = sample(c("Male","Female"), 100, replace=TRUE),
  group = sample(c("Cases","Controls"), 100, replace=TRUE)
)

# each patient is going to have a random selection of tests
tests = tibble::tibble(
  test_id = 1:1000,
  patient_id = sample(1:100,1000, replace = TRUE),
  test_type = sample(c("FBC","LFT","Electrolytes"), 1000, replace=TRUE),
  test_date = as.Date("2025-01-01")+sample.int(50, 1000, replace=TRUE)
)

# and each test a random selection of results consisting of components and
# values:
tests = tests %>% mutate(
  result = purrr::map(test_type, ~ case_when(
    .x == "FBC" ~ list(tibble::tibble(
      component = c("HB","platelets","WCC"),
      value = c( runif(1,13.5,15), runif(1,100,1000), runif(1,0,30))
    )),
    .x == "LFT" ~ list(tibble::tibble(
      component = c("AST","GGT"),
      value = c( runif(1,0,100), runif(1,0,100))
    )),
    .x == "Electrolytes" ~ list(tibble::tibble(
      component = c("NA","K","Glucose"),
      value = c( runif(1,130,150), runif(1,3.3,5.2), runif(1,50,150))
    ))
  ))
)

# case_when() wraps each tibble in a list, so two unnest() calls are needed
data = patients %>% inner_join(
  tests %>% unnest(result) %>% unnest(result),
  by="patient_id"
)

data %>% glimpse()
## Rows: 2,654
## Columns: 10
## $ patient_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, …
## $ age_category <fct> 10-14, 10-14, 10-14, 10-14, 10-14, 10-14, 10-14, 10-14, 1…
## $ ethnicity <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 6, 6, 6, 6, 6, …
## $ gender <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male", "…
## $ group <chr> "Cases", "Cases", "Cases", "Cases", "Cases", "Cases", "Ca…
## $ test_id <int> 272, 272, 446, 446, 684, 684, 684, 685, 685, 685, 781, 78…
## $ test_type <chr> "LFT", "LFT", "LFT", "LFT", "Electrolytes", "Electrolytes…
## $ test_date <date> 2025-02-18, 2025-02-18, 2025-01-15, 2025-01-15, 2025-02-…
## $ component <chr> "AST", "GGT", "AST", "GGT", "NA", "K", "Glucose", "HB", "…
## $ value <dbl> 51.800040, 80.631694, 85.141485, 10.422086, 138.213790, 4…
We might want to prepare this data set for analysis, with inclusion or exclusion criteria that apply at different levels. We might need to exclude patients who are too young or too old, or who have evidence of diabetes, and to exclude tests that were taken at the wrong time or specific test results that are out of range. All of this needs to be done while stratifying by case or control group status.
To achieve this we use nesting to collapse the data frame into one
row per patient, one row per test, or one row per test result, depending
on what we are trying to exclude. This allows dtrackr to dynamically
change what it regards as a single countable thing, depending on the
context of the pipeline.
processed = data %>%
  # the data is originally long format with one row per test result:
  track("{.count} test results") %>%
  mutate(maybe_diabetic = any(component == "Glucose" & value>130), .by = patient_id) %>%
  nest(test_panel = c(component,value), .messages="") %>%
  # Now the data is long format with one row per test:
  comment("{.count} tests") %>%
  nest(tests = starts_with("test_"), .messages="") %>%
  # and now long format with one row per patient:
  comment("{.count} patients") %>%
  group_by(group) %>%
  comment("{.count} patients") %>%
  # these exclusions are at the patient level
  exclude_all(
    .headline = "people",
    maybe_diabetic ~ "{.excluded} diabetics",
    age_category %in% age_cats[1:4] ~ "{.excluded} under 20"
  ) %>%
  # these are now back at the test level
  unnest(tests) %>%
  comment("{.count} tests",.headline = "") %>%
  exclude_all(
    .headline = "tests",
    test_date < "2025-01-07" ~ "{.excluded} with invalid dates"
  ) %>%
  count_subgroup(test_type, .headline = "") %>%
  # and finally at the granular test result level
  unnest(test_panel) %>%
  exclude_all(
    .headline = "results",
    component == "HB" & value < 14 ~ "{.excluded} invalid Hb results",
    component == "K" & value < 3.5 ~ "{.excluded} haemolysed K+"
  ) %>%
  group_by(test_type, .add=TRUE, .messages="By tests") %>%
  count_subgroup(component, .headline = "{test_type}") %>%
  ungroup(.messages = "{.count} eligible results") %>%
  nest(test_panel = c(component,value), .messages="") %>%
  comment("{.count} eligible tests") %>%
  nest(tests = starts_with("test_"), .messages="") %>%
  comment("{.count} eligible patients")

processed %>%
  flowchart()
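To see the change of granularity in isolation from the tracking, the following is a minimal sketch using only tidyr and dplyr. The toy data frame and its values are hypothetical and not part of the simulated trial; the point is only that each nest() reduces the row count to the next level up, and that unnest() restores it.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: 2 patients, 3 tests, 7 test results
results = tibble::tibble(
  patient_id = c(1, 1, 1, 1, 1, 2, 2),
  test_id    = c(10, 10, 11, 11, 11, 12, 12),
  component  = c("AST", "GGT", "HB", "platelets", "WCC", "AST", "GGT"),
  value      = c(40, 55, 14.2, 250, 7.1, 30, 80)
)

# Collapse to one row per test: results go into a list column
per_test = results %>% nest(test_panel = c(component, value))

# Collapse to one row per patient: test_id and test_panel both match "test_"
per_patient = per_test %>% nest(tests = starts_with("test_"))

# Reverse both steps to get back to one row per test result
restored = per_patient %>% unnest(tests) %>% unnest(test_panel)

nrow(results)      # one row per test result
nrow(per_test)     # one row per test
nrow(per_patient)  # one row per patient
nrow(restored)     # one row per test result again
```

Because the row count at each stage is what a "{.count}" message reports, nesting before a comment or exclusion step determines what is being counted.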
Going back to the original example data, suppose (in a slightly contrived example) that we want to exclude age categories that don’t have a close gender match between cases and controls. To count these we have to create a lot of small groups.
data %>%
  group_by(age_category, gender, group) %>%
  summarise(
    n = n_distinct(patient_id)
  ) %>%
  pivot_wider(values_from = n, names_from = group) %>%
  filter(abs(Cases-Controls) <= 1) %>%
  glimpse()
## `summarise()` has grouped output by 'age_category', 'gender'. You can override
## using the `.groups` argument.
## Rows: 18
## Columns: 4
## Groups: age_category, gender [18]
## $ age_category <fct> 00-04, 00-04, 05-09, 10-14, 15-19, 25-29, 30-34, 35-39, 4…
## $ gender <chr> "Female", "Male", "Male", "Male", "Female", "Female", "Fe…
## $ Cases <int> 2, 2, 1, 3, 2, 1, 1, 3, 2, 1, 2, 1, 2, 1, 3, 2, 1, 2
## $ Controls <int> 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 3, 1, 2, 2, 1, 2
If we were to try to track this data frame through the pipeline there
would be a problem with the flowchart, because too many groups are
generated. This causes performance and legibility issues for the
resulting graph, and is the result of an interim stage of the data
pipeline where grouping is used for a fine-scale summarisation
operation. The maximum number of groups that dtrackr will attempt to
keep track of is configurable but defaults to 16. If the number of
groups exceeds that, it pauses tracking until the number of groups
returns to a lower number, at which point it starts following again.
A “< hidden steps >” message is inserted into the graph when this
happens, but this can be changed, or disabled altogether with
options(dtrackr.hidden_steps = ""). By default dtrackr does not warn
the user of this unless options(dtrackr.verbose=TRUE) is set.
old = options(dtrackr.verbose=TRUE)

data %>%
  track() %>%
  group_by(gender) %>%
  comment(c("{.count} items","before pause")) %>%
  # the tracking is paused on this next step as the number of groups becomes >16
  group_by(age_category, group, .add=TRUE) %>%
  comment("This message is not tracked") %>%
  summarise(
    n = n_distinct(patient_id)
  ) %>%
  pivot_wider(values_from = n, names_from = group) %>%
  filter(abs(Cases-Controls) <= 1) %>%
  # the tracking is automatically resumed at this point as the grouping has
  # returned to manageable levels.
  group_by(gender) %>%
  comment(c("{.count} summarised rows","after resume")) %>%
  flowchart()
## • This group_by() has created more than the maximum number of supported groupings (16) which will likely impact performance. We have paused tracking the dataframe.
## • To change this limit set the option 'dtrackr.max_supported_groupings'.
## • Automatically resuming tracking.
By default this behaviour is triggered when the number of groups exceeds 16. This can be changed by setting the option dtrackr.max_supported_groupings, as named in the message above.
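As a sketch of how the limit might be raised for one pipeline and then put back, the option can be set with base R's options(), which returns the previous values so they can be restored afterwards. The option name is taken from the message above; the value 32 is an arbitrary illustration, not a recommendation.

```r
# Raise the grouping limit, keeping the previous setting for later
old = options(dtrackr.max_supported_groupings = 32)

# The new value is what will be seen while the option is in force
raised = getOption("dtrackr.max_supported_groupings")
raised

# ... run the tracked pipeline that needs the higher limit here ...

# Restore whatever was set before (including "not set")
options(old)
```

Restoring the old value keeps the change local, so that later pipelines in the same session fall back to the default behaviour.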
Pausing and resuming the tracking can also be done manually by
calling dtrackr::pause() and dtrackr::resume(). This is a fairly
experimental feature, and I don’t expect it to be heavily used.
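A manual pause might look like the following minimal sketch. It uses the built-in iris data rather than the trial data, and assumes that pause() and resume() can be called with no arguments other than the piped data frame; the expectation is that steps between the two calls still transform the data but are left out of the history.

```r
library(dplyr)
library(dtrackr)

tracked = iris %>%
  track() %>%
  comment("{.count} rows before pausing") %>%
  pause() %>%
  # the filter is still applied to the data while tracking is paused
  filter(Species != "setosa") %>%
  comment("This message is not tracked") %>%
  resume() %>%
  comment("{.count} rows after resuming")

tracked %>% flowchart()
```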