sparkbq is a sparklyr extension package providing an integration with Google BigQuery. It builds on top of spark-bigquery, which provides a Google BigQuery data source to Apache Spark.
You can install the released version of sparkbq from CRAN via
install.packages("sparkbq")
or the latest development version from GitHub with

```r
devtools::install_github("miraisolutions/sparkbq", ref = "develop")
```
The following table provides an overview of the supported versions of Apache Spark, Scala, and Google Dataproc:
| sparkbq | spark-bigquery | Apache Spark | Scala | Google Dataproc |
|---------|----------------|--------------|-------|-----------------|
| 0.1.x | 0.1.0 | 2.2.x and 2.3.x | 2.11 | 1.2.x and 1.3.x |
sparkbq is based on the Spark package spark-bigquery, which is available in a separate GitHub repository.
```r
library(sparklyr)
library(sparkbq)
library(dplyr)

config <- spark_config()

sc <- spark_connect(master = "local[*]", config = config)

# Set Google BigQuery default settings
bigquery_defaults(
  billingProjectId = "<your_billing_project_id>",
  gcsBucket = "<your_gcs_bucket>",
  datasetLocation = "US",
  serviceAccountKeyFile = "<your_service_account_key_file>",
  type = "direct"
)

# Reading the public shakespeare data table
# https://cloud.google.com/bigquery/public-data/
# https://cloud.google.com/bigquery/sample-tables
hamlet <-
  spark_read_bigquery(
    sc,
    name = "hamlet",
    projectId = "bigquery-public-data",
    datasetId = "samples",
    tableId = "shakespeare") %>%
  filter(corpus == "hamlet") # NOTE: predicate pushdown to BigQuery!

# Retrieve results into a local tibble
hamlet %>% collect()

# Write result into "mysamples" dataset in our BigQuery (billing) project
spark_write_bigquery(
  hamlet,
  datasetId = "mysamples",
  tableId = "hamlet",
  mode = "overwrite")
```
When running outside of Google Cloud it is necessary to specify a service account JSON key file. The service account key file can be passed as parameter `serviceAccountKeyFile` to `bigquery_defaults`, or directly to `spark_read_bigquery` and `spark_write_bigquery`.
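For example, the key file can be supplied per call; a minimal sketch, where the path is a placeholder:

```r
# Pass the service account key file directly instead of via bigquery_defaults()
hamlet <- spark_read_bigquery(
  sc,
  name = "hamlet",
  projectId = "bigquery-public-data",
  datasetId = "samples",
  tableId = "shakespeare",
  serviceAccountKeyFile = "/path/to/your/service_account_keyfile.json"
)
```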
Alternatively, an environment variable

```sh
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/service_account_keyfile.json
```

can be set (see https://cloud.google.com/docs/authentication/getting-started for more information). Make sure the variable is set before starting the R session.
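As a quick sanity check, you can confirm from within R that the variable is visible to the session (base R, no sparkbq involved):

```r
# Prints the key file path if the variable was set before R started;
# an empty string means the session cannot see it.
Sys.getenv("GOOGLE_APPLICATION_CREDENTIALS")
```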
When running on Google Cloud, e.g. Google Cloud Dataproc, application default credentials (ADC) may be used, in which case it is not necessary to specify a service account key file.
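In that case the defaults can simply omit the key file; a minimal sketch, assuming the instance's default credentials carry the required BigQuery and Google Cloud Storage permissions:

```r
# On Google Cloud (e.g. Dataproc), application default credentials are used,
# so serviceAccountKeyFile can be omitted.
bigquery_defaults(
  billingProjectId = "<your_billing_project_id>",
  gcsBucket = "<your_gcs_bucket>",
  datasetLocation = "US",
  type = "direct"
)
```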