rquery is a SQL query generator for R. It is based on Edgar F. Codd's relational algebra plus experience using SQL and dplyr at big data scale. The design represents an attempt to make SQL more teachable by denoting composition by a sequential pipeline notation instead of nested queries or functions. The implementation delivers reliable high-performance data processing on large data systems such as Spark and databases. Package features include: data processing trees or pipelines as observable objects (able to report both columns produced and columns used), optimized SQL generation as an explicit user-visible modeling step, convenience methods for applying query trees to in-memory data.frames, and low direct package dependencies.
SQL
SQL is a very powerful data processing (or data engineering) grammar. Data scientists are well advised to learn to work with SQL.
An inessential difficulty in using SQL is that SQL represents composition of operations by nesting, which can rapidly become confusing and illegible. This can be overcome by using a query composer such as rquery (some more query composers are listed here).
Let’s set up our environment so we can work with examples.
run_vignette <- requireNamespace("RSQLite", quietly = TRUE)
library("rquery")
# example database connection
db <- DBI::dbConnect(RSQLite::SQLite(),
":memory:")
RSQLite::initExtension(db)
dbopts <- dbi_connection_preferences(db)
print(dbopts)
## $rquery.SQLiteConnection.use_pass_limit
## [1] TRUE
##
## $rquery.SQLiteConnection.use_DBI_dbExistsTable
## [1] TRUE
##
## $rquery.SQLiteConnection.use_DBI_dbListFields
## [1] FALSE
##
## $rquery.SQLiteConnection.use_DBI_dbRemoveTable
## [1] FALSE
##
## $rquery.SQLiteConnection.use_DBI_dbExecute
## [1] TRUE
##
## $rquery.SQLiteConnection.create_temporary
## [1] TRUE
##
## $rquery.SQLiteConnection.control_temporary
## [1] TRUE
##
## $rquery.SQLiteConnection.control_rownames
## [1] TRUE
options(dbopts)
# copy in example data
dbi_copy_to(
db, 'd',
data.frame(v = c(1, -5, 3)),
temporary = FALSE,
overwrite = TRUE)
## [1] "table('d')"
# produce a handle to an existing table
d <- dbi_table(db, "d")
d is a "table description", which is just the name of a table and the names of expected columns. d does not store data or a database reference (making it safe to serialize/de-serialize). All rquery operation trees or pipelines must start either with a table description or a data.frame. We will discuss table descriptions later.
Note: in examples we use dbi_copy_to() to create data. This is only for the purpose of having easy portable examples. With big data the data is usually already in the remote database or Spark system. The task is almost always to connect to and work with this pre-existing remote data, and the method to do this is dbi_table(), which builds a reference to a remote table given the table name. The suggested pattern for working with remote tables is to get inputs via dbi_table() and land remote results with materialize(). To work with local data one can copy data from memory to the database with dbi_copy_to() and bring back results with execute() (though be aware that operation on remote non-memory data is rquery's primary intent).
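For concreteness, here is a minimal sketch of that suggested pattern (the table name "sales", its column "amt", and the destination name "sales_result" are hypothetical; we assume materialize() accepts the destination as a table_name argument):
# hypothetical pre-existing remote table "sales" with a numeric column "amt"
sales <- dbi_table(db, "sales")          # build a reference; no data moves
ops <- sales %.>%
  sql_node(., "amt_abs" := "ABS(amt)")   # stand-in for real transform steps
# land the result remotely instead of pulling rows back into R
materialize(db, ops, table_name = "sales_result")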
For our first example we will introduce a new column and perform a calculation using this column. This is achieved in SQL by writing code in one of two styles: defining the first new column twice (once to land the value and once to use it), or sequencing two queries by nesting. We will demonstrate both methods.
The define-the-column-twice solution looks like the following.
DBI::dbGetQuery(db, "
SELECT
*,
ABS(v) AS absv,
ABS(v) - v AS delta
FROM
d
")
## v absv delta
## 1 1 1 0
## 2 -5 5 10
## 3 3 3 0
In SQL the column absv is not available for calculation in the same query in which it is produced.
The nested method looks like the following: we produce the column absv in one query and then wrap that in another query to use the column later. For expressions longer than ABS(v) this is the preferred solution (until one moves to something like common table expressions).
DBI::dbGetQuery(db, "
SELECT
*,
absv - v AS delta
FROM (
SELECT
*,
ABS(v) AS absv
FROM
d
) subtab
")
## v absv delta
## 1 1 1 0
## 2 -5 5 10
## 3 3 3 0
sql_node()
Using rquery we can write the SQL composition using pipe notation (where composition is written as x %.>% f %.>% g instead of g(f(x))). We are going to use the wrapr dot-pipe instead of the magrittr pipe to pick up a neat feature we will use later (all other examples will work with the magrittr pipe). The "%.>%" glyph can be bound to a keyboard shortcut for convenience.
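As a quick orientation, the dot pipe simply moves a value from left to right, so nested function application can be written in execution order (a small sketch, independent of rquery):
4 %.>% sqrt(.) %.>% sin(.)   # same as sin(sqrt(4))
## [1] 0.9092974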
The rquery realization of the above calculation is as follows:
op_tree <- d %.>%
sql_node(., "absv" := "ABS(v)") %.>%
sql_node(., "delta" := "absv - v")
execute(db, op_tree)
## absv delta v
## 1 1 0 1
## 2 5 10 -5
## 3 3 0 3
The above is what we call "piped SQL" and represents a major convenience for users, as the details of how to compose the statements are left to the package. The sql_node() is a very powerful node. We will use it in our first few examples and then move on to more convenient higher-level relational nodes.
We can view the SQL translation of the operations tree as follows:
cat(to_sql(op_tree, db))
SELECT
`absv` AS `absv`,
absv - v AS `delta`,
`v` AS `v`
FROM (
SELECT
`v` AS `v`,
ABS(v) AS `absv`
FROM (
SELECT
`d`.`v`
FROM
`d`
) tsql_76657282644575127952_0000000000
) tsql_76657282644575127952_0000000001
Notice the above translation did not add identifier quotes to our use of "v" in "ABS(v)". This is because the SQL expression is not parsed in R. If we want to identify terms as variables we can wrap them with as.name() or quote() to get the quoting (and other variable-oriented features). The extra SELECT step to pull data from the inner table is used by rquery for important column-narrowing steps, and can actually improve query performance.
op_tree <- d %.>%
sql_node(., "absv" := list(list("ABS(", quote(v), ")"))) %.>%
sql_node(., "delta" := list(list(quote(absv),"-", quote(v))))
cat(to_sql(op_tree, db))
SELECT
`absv` AS `absv`,
`absv` - `v` AS `delta`,
`v` AS `v`
FROM (
SELECT
`v` AS `v`,
ABS( `v` ) AS `absv`
FROM (
SELECT
`d`.`v`
FROM
`d`
) tsql_03186459973308663628_0000000000
) tsql_03186459973308663628_0000000001
The list(list()) notation is how we say in R that we have a single-element list (i.e., one expression) that is built up as a list of terms. The marking notation is cumbersome, but is not needed when we move on to relational nodes, which are parsed in R and can spot identifiers without additional help.
op_tree itself is an object with its own presentation format:
cat(format(op_tree))
## table('d') %.>%
## sql_node(.,
## absv := ABS( v ),
## *=TRUE) %.>%
## sql_node(.,
## delta := absv - v,
## *=TRUE)
The op_tree supplies a number of important summaries about the proposed query:
column_names(op_tree)
## [1] "absv" "delta" "v"
tables_used(op_tree)
## [1] "d"
columns_used(op_tree)
## $d
## [1] "v"
We can add nodes to an op_tree to build larger operator trees (or pipelines).
op_tree <- op_tree %.>%
sql_node(., "prod" := "absv * delta")
cat(format(op_tree))
## table('d') %.>%
## sql_node(.,
## absv := ABS( v ),
## *=TRUE) %.>%
## sql_node(.,
## delta := absv - v,
## *=TRUE) %.>%
## sql_node(.,
## prod := absv * delta,
## *=TRUE)
And the op_tree record keeping can be used to catch potential errors early in pipeline construction. For example, if we try to refer to a non-existent variable when adding an operator we get a thrown exception (note: a sql_node() being added must have its variables marked as above for pre-checking to occur; relational nodes get this checking automatically).
op_tree <- op_tree %.>%
sql_node(., "z" := list(list("1 + ", quote(z))))
## Error in sql_node.relop(., `:=`("z", list(list("1 + ", quote(z))))): rquery::sql_node.relop undefined columns: z
We can express non-trivial operations in sql_node()s. For example, we can build a node that calculates, for each row, how many columns contain NA/NULL, as is demonstrated here.
# load up example data
d2 <- dbi_copy_to(
db, 'd2',
data.frame(v1 = c(1, 2, NA, 3),
v2 = c(NA, "b", NA, "c"),
v3 = c(NA, NA, 7, 8),
stringsAsFactors = FALSE))
# look at table
execute(db, d2)
## v1 v2 v3
## 1 1 <NA> NA
## 2 2 b NA
## 3 NA <NA> 7
## 4 3 c 8
# get list of columns
vars <- column_names(d2)
print(vars)
## [1] "v1" "v2" "v3"
# build a NA/NULLs per-row counting expression.
# names are "quoted" by wrapping them with as.name().
# constants can be quoted by an additional list wrapping.
expr <- lapply(vars,
function(vi) {
list("+ (CASE WHEN (",
as.name(vi),
"IS NULL ) THEN 1.0 ELSE 0.0 END)")
})
expr <- unlist(expr, recursive = FALSE)
expr <- c(list(0.0), expr)
cat(paste(unlist(expr), collapse = " "))
## 0 + (CASE WHEN ( v1 IS NULL ) THEN 1.0 ELSE 0.0 END) + (CASE WHEN ( v2 IS NULL ) THEN 1.0 ELSE 0.0 END) + (CASE WHEN ( v3 IS NULL ) THEN 1.0 ELSE 0.0 END)
# instantiate the operator node
op_tree_count_null <- d2 %.>%
sql_node(., "num_missing" := list(expr))
cat(format(op_tree_count_null))
## table('d2') %.>%
## sql_node(.,
## num_missing := 0 + (CASE WHEN ( v1 IS NULL ) THEN 1.0 ELSE 0.0 END) + (CASE WHEN ( v2 IS NULL ) THEN 1.0 ELSE 0.0 END) + (CASE WHEN ( v3 IS NULL ) THEN 1.0 ELSE 0.0 END),
## *=TRUE)
# examine produced SQL
sql <- to_sql(op_tree_count_null, db)
cat(sql)
## SELECT
## `v3` AS `v3`,
## 0 + (CASE WHEN ( `v1` IS NULL ) THEN 1.0 ELSE 0.0 END) + (CASE WHEN ( `v2` IS NULL ) THEN 1.0 ELSE 0.0 END) + (CASE WHEN ( `v3` IS NULL ) THEN 1.0 ELSE 0.0 END) AS `num_missing`,
## `v1` AS `v1`,
## `v2` AS `v2`
## FROM (
## SELECT
## `d2`.`v1`,
## `d2`.`v2`,
## `d2`.`v3`
## FROM
## `d2`
## ) tsql_48999109988716345118_0000000000
# execute
execute(db, op_tree_count_null)
## v3 num_missing v1 v2
## 1 NA 2 1 <NA>
## 2 NA 1 2 b
## 3 7 2 NA <NA>
## 4 8 0 3 c
And, as this is an important capability, this exact functionality is wrapped in count_null_cols().
# whole process wrapped in convenience node
d2 %.>%
count_null_cols(., vars, "nnull") %.>%
execute(db, .)
## v3 nnull v1 v2
## 1 NA 2 1 <NA>
## 2 NA 1 2 b
## 3 7 2 NA <NA>
## 4 8 0 3 c
There is a method to apply a parameterized SQL expression to a set of columns.
# vector of columns we want to work on
colset <- qc(v1, v2, v3)
# build new names we want as results
colterms <- paste0(colset, "_isNA") := colset
map_to_char(colterms)
## [1] "c('v1_isNA' = 'v1', 'v2_isNA' = 'v2', 'v3_isNA' = 'v3')"
# build an apply expression to set of columns query
s_tree <- d2 %.>%
sql_expr_set(., colterms,
"CASE WHEN . IS NULL THEN 1 ELSE 0 END")
cat(to_sql(s_tree, db))
## SELECT
## v1 AS `v1`,
## v2 AS `v2`,
## v3 AS `v3`,
## CASE WHEN `v1` IS NULL THEN 1 ELSE 0 END AS `v1_isNA`,
## CASE WHEN `v2` IS NULL THEN 1 ELSE 0 END AS `v2_isNA`,
## CASE WHEN `v3` IS NULL THEN 1 ELSE 0 END AS `v3_isNA`
## FROM (
## SELECT
## `d2`.`v1`,
## `d2`.`v2`,
## `d2`.`v3`
## FROM
## `d2`
## ) tsql_57668660136678033503_0000000000
execute(db, s_tree)
## v1 v2 v3 v1_isNA v2_isNA v3_isNA
## 1 1 <NA> NA 0 1 1
## 2 2 b NA 0 0 1
## 3 NA <NA> 7 1 1 0
## 4 3 c 8 0 0 0
SQL first
rquery is a "SQL first" system. It is designed to create SQL queries and dispatch them to remote systems (SQLite, Spark, PostgreSQL, Redshift, and other databases) for execution. The execute() method can be used with big data by adding a table_name argument (or by using the materialize() method) to land results in a remote table instead of pulling them back to R.
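A small sketch of landing results remotely (the destination name "d_result" is hypothetical):
# land results in a remote table instead of pulling them back to R
execute(db, op_tree, table_name = "d_result")
# later steps can pick the landed result back up as a table description
d_result <- dbi_table(db, "d_result")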
The mantra of SQL-first is that data starts in the database and stays in the database (i.e., it is too large to depend on round-tripping through R). Another important SQL-first package is cdata, which provides pure-SQL implementations of operators that generalize pivot/un-pivot, cast/melt, or spread/gather.
The better the database implementation, the better rquery will be, both in terms of performance and in terms of function (such as the availability of SQL window functions).
As a convenience rquery can work with in-memory data.frames by sending them to the SQL service provider. This provider defaults to RSQLite, or can be set by setting the global variable winvector_temp_db_handle. We demonstrate this below.
winvector_temp_db_handle <- list(db = db)
data.frame(v = -2:2) %.>%
execute(., op_tree)
## v prod absv delta
## 1 -2 8 2 4
## 2 -1 2 1 2
## 3 0 0 0 0
## 4 1 0 1 0
## 5 2 0 2 0
When using the wrapr dot pipe the above can be abbreviated as:
data.frame(v = -2:2) %.>% op_tree
## v prod absv delta
## 1 -2 8 2 4
## 2 -1 2 1 2
## 3 0 0 0 0
## 4 1 0 1 0
## 5 2 0 2 0
The above calculation is managed by the wrapr dot pipe's S3 wrapr_function extensions.
rquery operators can be used directly (without any table description nodes) when working with in-memory data.frames.
data.frame(x = 5) %.>% sql_node(., "z" := "sqrt(x)")
## x z
## 1 5 2.236068
The above calculation is triggered by S3 overrides of any of print(), as.data.frame(), and head(). Remote tables need an execute() or materialize() step to specify the database connection.
rquery table descriptions are simple objects that store only the name of a table and its expected columns. Any local data or database table that has at least the set of columns named in the table description can be used in a given rquery pipeline.
The table description "d" we have been using in examples was produced as a result of moving data to a database by dbi_copy_to(). However, we can also create a description of an existing table with dbi_table(), or even build a description by hand with table_source().
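For example, a description equivalent to d can be built by hand (a sketch):
# a table description is only a name plus expected columns;
# no data or connection is stored
d_desc <- table_source("d", columns = "v")
column_names(d_desc)
## [1] "v"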
The sql_node() alone can make writing, understanding, and maintaining complex data transformations as queries easier. And this node is a good introduction to some of the power of the rquery package. However, the primary purpose of rquery is to provide ready-made relational operators that simplify things further, to the point of rarely needing to use the sql_node() directly.
The primary operators supplied by rquery are:
The primary relational operators include:
- extend_nse()/extend_se(). Extend adds derived columns to a relation table. With a sufficiently powerful SQL provider this includes ordered and partitioned window functions. This operator also includes built-in seplyr-style assignment partitioning.
- project(). Project is usually portrayed as the equivalent to column selection, though the original definition includes aggregation. In our opinion the original relational nature of the operator is best captured by moving SQL's "GROUP BY" aggregation functionality into it.
- natural_join(). This is a specialized relational join operator, using all common columns as an equi-join condition.
- theta_join(). This is the relational join operator allowing an arbitrary predicate.
- select_rows(). This is Codd's relational row selection. Obviously select alone is an over-used and now ambiguous term (for example: it is already used as the "doit" verb in SQL and the column selector in dplyr).
- rename_columns(). This operator renames sets of columns.
The primary non-relational (traditional SQL) operators are:
- select_columns(). This allows choice of columns (central to SQL), but is not a relational operator as it can damage row-uniqueness.
- orderby(). Row order is not a concept in the relational algebra (and also not maintained in most SQL implementations). This operator is only useful when used with its limit= option, or as the last step as data comes out of the relation store and is moved to R (where row-order is usually maintained).
The above list (and especially the naming) was chosen to first match Codd's relational concepts (project, select, rename, join, aggregation), then SQL naming conventions. Notice this covers the primary dplyr operators: mutate() (Codd's extend), select() (not relational), filter() (Codd's select, represented in SQL by "WHERE"), summarise() (Codd's project or aggregate concepts, triggered in SQL by "GROUP BY"), and arrange() (not a relational concept, implemented in SQL by "ORDER BY"). This correspondence is due to Codd's ideas and SQL driving data engineering thinking for almost the last 50 years (both with and without credit or citation).
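As a small illustration, a few of the above operators can be chained directly (a sketch using the non-standard-evaluation forms extend_nse() and select_rows_nse()):
d %.>%
  extend_nse(., v_sq := v * v) %.>%      # add a derived column
  select_rows_nse(., v_sq > 1) %.>%      # Codd's row selection
  orderby(., "v") %.>%                   # order rows as data leaves the store
  execute(db, .)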
With relational operators the user can work fast and work further away from syntactic details. For example some R operators (such as is.na) are translated to SQL analogues (in this case IS NULL).
d %.>%
extend_nse(., was_na := ifelse(is.na(v), 1, 0)) %.>%
to_sql(., db) %.>%
cat(.)
## SELECT
## `v`,
## ( CASE WHEN ( ( `v` ) IS NULL ) THEN ( 1 ) ELSE ( 0 ) END ) AS `was_na`
## FROM (
## SELECT
## `d`.`v`
## FROM
## `d`
## ) tsql_72414376057825575488_0000000000
The exact translation depends on the database (which is why to_sql() takes a database argument). Some care has to be taken, as SQL types differ from R types (in particular, for some databases logical types are not numeric).
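When an explicit type is needed, one workaround is a standard SQL CAST inside a sql_node() (a sketch; CAST is standard SQL and works on SQLite):
# force a numeric type explicitly with a SQL CAST
d %.>%
  sql_node(., "v_real" := "CAST(v AS REAL)") %.>%
  execute(db, .)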
With a database that supplies window functions one can quickly work the "logistic scoring by hand" example from Let's Have Some Sympathy For The Part-time R User. This example, worked with rquery code that runs on both PostgreSQL and Spark, can be found here.
We can demonstrate the pipeline, but the SQLite database we are using in this vignette does not have the window functions required to execute it. PostgreSQL, Spark, and many other databases do have the necessary functionality. The pipeline is a good example of a non-trivial sequence of relational nodes.
scale <- 0.237
dq <- table_source("d3",
columns = qc(subjectID,
surveyCategory,
assessmentTotal)) %.>%
extend_nse(.,
probability :=
exp(assessmentTotal * scale)) %.>%
normalize_cols(.,
"probability",
partitionby = 'subjectID') %.>%
pick_top_k(.,
partitionby = 'subjectID',
rev_orderby = c('probability', 'surveyCategory')) %.>%
rename_columns(., 'diagnosis' := 'surveyCategory') %.>%
select_columns(., c('subjectID',
'diagnosis',
'probability')) %.>%
orderby(., 'subjectID')
qc() is "quoting concatenate", a convenience function that lets us skip a few quote marks. No list(), as.name(), or quote() steps are needed, as the operator nodes are parsed by R to find identifiers. The scale constant was added to the environment, as pipelines try to bind constants during pipe construction (else scale would be treated as a missing column name).
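For example (a tiny sketch):
qc(a, b, c)   # quoting concatenate: no quote marks needed
## [1] "a" "b" "c"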
Even though we are not going to run this query here, we can still check some properties of the query.
tables_used(dq)
## [1] "d3"
columns_used(dq)
## $d3
## [1] "subjectID" "surveyCategory" "assessmentTotal"
column_names(dq)
## [1] "subjectID" "diagnosis" "probability"
The operations can be printed as an operations tree.
cat(format(dq))
## table('d3') %.>%
## extend(.,
## probability := exp(assessmentTotal * scale)) %.>%
## extend(.,
## probability := probability / sum(probability),
## p= subjectID) %.>%
## extend(.,
## row_number := row_number(),
## p= subjectID,
## o= "probability" DESC, "surveyCategory" DESC) %.>%
## select_rows(.,
## row_number <= 1) %.>%
## rename(.,
## c('diagnosis' = 'surveyCategory')) %.>%
## select_columns(.,
## subjectID, diagnosis, probability) %.>%
## orderby(., subjectID)
Notice the returned presentation is not exactly the set of nodes we specified. This is because the nodes we used (normalize_cols() and pick_top_k()) are actually higher-order nodes (implemented in terms of other nodes). Also, extend() nodes are re-factored to be unambiguous in their use and re-use of column names.
We can also exhibit the SQL this operations tree renders to (though the SQLite database we are using in vignettes does not have the required window functions to execute it; we suggest using PostgreSQL).
cat(to_sql(dq, db))
## SELECT * FROM (
## SELECT
## `subjectID`,
## `diagnosis`,
## `probability`
## FROM (
## SELECT
## `probability` AS `probability`,
## `subjectID` AS `subjectID`,
## `surveyCategory` AS `diagnosis`
## FROM (
## SELECT * FROM (
## SELECT
## `probability`,
## `subjectID`,
## `surveyCategory`,
## row_number ( ) OVER ( PARTITION BY `subjectID` ORDER BY `probability` DESC, `surveyCategory` DESC ) AS `row_number`
## FROM (
## SELECT
## `subjectID`,
## `surveyCategory`,
## `probability` / sum ( `probability` ) OVER ( PARTITION BY `subjectID` ) AS `probability`
## FROM (
## SELECT
## `subjectID`,
## `surveyCategory`,
## `assessmentTotal`,
## exp ( `assessmentTotal` * 0.237 ) AS `probability`
## FROM (
## SELECT
## `d3`.`subjectID`,
## `d3`.`surveyCategory`,
## `d3`.`assessmentTotal`
## FROM
## `d3`
## ) tsql_88285858144118776056_0000000000
## ) tsql_88285858144118776056_0000000001
## ) tsql_88285858144118776056_0000000002
## ) tsql_88285858144118776056_0000000003
## WHERE `row_number` <= 1
## ) tsql_88285858144118776056_0000000004
## ) tsql_88285858144118776056_0000000005
## ) tsql_88285858144118776056_0000000006 ORDER BY `subjectID`
The above query is long, but actually quite performant.
To see the query executed, please see here.
Non-SQL nodes
Not all data transform steps can conveniently be written as a single SQL statement. To work around this potential limitation rquery supplies a special type of node called non_sql_node(). non_sql_node() is used to implement arbitrary table-to-table transforms as rquery pipeline steps. A prototypical non_sql_node() is rsummary_node().
rsummary_node() builds a table of summary information about another database table. The format is: each column of the original table produces a row of summary information in the result table. Here is a simple example.
op_tree %.>%
rsummary_node(.) %.>%
execute(db, .)
## column index class nrows nna nunique min max mean sd
## 1 v 1 numeric 3 0 NA -5 3 -0.3333333 4.163332
## 2 prod 2 numeric 3 0 NA 0 50 16.6666667 28.867513
## 3 absv 3 numeric 3 0 NA 1 5 3.0000000 2.000000
## 4 delta 4 numeric 3 0 NA 0 10 3.3333333 5.773503
## lexmin lexmax
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
Users can add additional capabilities by writing their own non_sql_node()s.
rquery goes out of its way to supply easy-to-program-over value-oriented interfaces. For any meta-programming we suggest using wrapr::let(), a powerful and well-documented meta-programming system.
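A minimal wrapr::let() sketch (the placeholder symbol COL and the result column absval are hypothetical names chosen for illustration):
# substitute the column name "v" for the placeholder symbol COL,
# then build and run the pipeline
wrapr::let(
  c(COL = "v"),
  d %.>%
    extend_nse(., absval := abs(COL)) %.>%
    execute(db, .)
)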
rquery accepts many assignments in a sql_node() or in a single extend node. The extend node comes with automatic assignment partitioning to ensure correct and performant results. This allows the user to write large extend blocks and know they will be executed correctly.
Here is an example.
ot <- table_source('d4',
columns = qc('a', 'b', 'c', 'd')) %.>%
extend_nse(.,
x = a + 1,
y = x + 1,
u = b + 1,
v = c + 1,
w = d + 1)
cat(format(ot))
## table('d4') %.>%
## extend(.,
## x := a + 1,
## u := b + 1,
## v := c + 1,
## w := d + 1) %.>%
## extend(.,
## y := x + 1)
Notice the dependent assignment was moved into its own extend block. This sort of transform is critical in getting correct results from SQL (here is an example of what can happen when one does not correctly mitigate this issue).
A node that uses the assignment partitioning and re-ordering is the if_else_block(), which can be used to simulate block-oriented if-else semantics as seen in systems such as SAS (also meaning rquery can be critical in porting code from SAS to SQL-based R). This allows coordinated assignments such as the following:
ifet <- table_source("d5",
columns = "test") %.>%
extend_se(.,
c(qae(x = '',
y = ''),
if_else_block(
qe(test > 5),
thenexprs = qae(x = 'a',
y = 'b'),
elseexprs = qae(x = 'b',
y = 'a')
)))
cat(format(ifet))
## table('d5') %.>%
## extend(.,
## x := "",
## y := "",
## ifebtest_1 := test > 5) %.>%
## extend(.,
## x := ifelse(ifebtest_1, "a", x),
## y := ifelse(ifebtest_1, "b", y)) %.>%
## extend(.,
## x := ifelse(!( ifebtest_1 ), "b", x),
## y := ifelse(!( ifebtest_1 ), "a", y))
As you can see, the if_else_block() works by landing the test in a column and then using that column to condition all further statements. qe() and qae() are quoting convenience functions. Note the if_else_block depends on x and y being defined before entering the block, as they are self-assigned (this is checked by the extend node). The if_else_block() returns a list of assignments, which is then used in the extend_se() statement, which in turn is re-factored into a sequence of safe extend nodes.
As rquery pipelines are factored into stages similar to the common relational operators, they tend to be very compatible with downstream query optimizers. We think some of the advantage comes from the fact that rquery deliberately does not have a group_by operator, but instead considers this as the partitionby attribute of a project() node (non-trivial example here).
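For example, a grouped sum is written as an attribute of the projection itself (a sketch; we assume the project_nse() form with a groupby argument, and a hypothetical table d7):
# aggregation lives on the project node, not on a separate group_by step
agg <- table_source("d7", columns = qc(g, v)) %.>%
  project_nse(., groupby = "g", sumv := sum(v))
cat(format(agg))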
We have seen database-based rquery outperform both in-memory dplyr and database-based dplyr (figure from: here).
In addition rquery includes automatic column narrowing: only columns used to construct the final result are pulled from initial tables. This feature is important in production (where data marts can be quite wide) and has shown significant additional performance advantages.
From a coding point of view the automatic narrowing effect looks like this.
wp <- table_source(table = 'd6',
columns = letters[1:5]) %.>%
extend_nse(., res := a + b)
# full query
cat(to_sql(wp, db))
## SELECT
## `a`,
## `b`,
## `c`,
## `d`,
## `e`,
## `a` + `b` AS `res`
## FROM (
## SELECT
## `d6`.`a`,
## `d6`.`b`,
## `d6`.`c`,
## `d6`.`d`,
## `d6`.`e`
## FROM
## `d6`
## ) tsql_31602941834594284081_0000000000
# longer pipeline
wn <- wp %.>%
select_columns(., "res")
# notice select at end of the pipeline automatically
# gets propagated back to the beginning of the
# pipeline
cat(to_sql(wn, db))
## SELECT
## `res`
## FROM (
## SELECT
## `a`,
## `b`,
## `a` + `b` AS `res`
## FROM (
## SELECT
## `d6`.`a`,
## `d6`.`b`
## FROM
## `d6`
## ) tsql_30287613790029703802_0000000000
## ) tsql_30287613790029703802_0000000001
A graph of the effects of this kind of narrowing (for dplyr by hand, as dplyr currently does not have the above type of automatic query analysis/optimization) shows the sensitivity to this optimization.
rquery is a new package, but it is already proving to be correct (avoiding known data processing issues) and performant. For working with R at big data scale (say, using PostgreSQL or Spark) rquery is the right specialized tool for specifying data manipulation.
For deeper dives into specific topics, please see also:
rm(list = "winvector_temp_db_handle")
DBI::dbDisconnect(db)