The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.
Joining dtrackr
tracked data is supported and allows us
to combine linked data sets. In this toy example the data sets are
characters from a popular film from my youth.
# here we create a set of linked data from the starwars data
# in a real example these data sets would have come from different places
people = starwars %>% select(-films, -vehicles, -starships)
vehicles = starwars %>% select(name,vehicles) %>% unnest(cols = c(vehicles))
starships = starwars %>% select(name,starships) %>% unnest(cols = c(starships))
films = starwars %>% select(name,films) %>% unnest(cols = c(films))
# these 4 data frames are linked together by the name attribute
# we track both input data sets:
tmp1 = people %>% track() %>% comment("People df {.total}")
tmp2 = films %>% track() %>% comment("Films df {.total}") %>% comment("a test comment")
# and here we (re)join the two data sets:
tmp1 %>%
inner_join(tmp2, by="name") %>%
comment("joined {.total}") %>%
flowchart()
# The join message used by inner_join here is configurable but defaults to
# {.count.lhs} on LHS
# {.count.rhs} on RHS
# {.count.out} in linked set
All dplyr
join types are supported by
dtrackr
which allows us to report on the numbers on either
side of the join and on the resulting total. This can help detect if any
data items are lost during the join. However we do not yet capture data
that becomes excluded during joins, as the interpretation depends on the
type of join employed.
Another type of binary operator is a union. This is a simpler problem
and works as expected. In this example the early part of the pipeline is
detected to be the same on both branches of the data flow. This
therefore results in a flow that splits then subsequently joins again
during the union (bind_rows
) operator.
tmp = people %>% comment("start")
tmp1 = tmp %>% include_any(
species == "Human" ~ "{.included} humans",
species == "Droid" ~ "{.included} droids"
)
tmp2 = tmp %>% include_any(
species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans")
tmp3 = bind_rows(tmp1,tmp2) %>% comment("{.count} human,droids and gungans")
tmp3 %>% flowchart()
Other dplyr
set operations are supported such as
setdiff()
, union()
, union_all()
and intersect()
which are included in the function
documentation.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.