Type: | Package |
Title: | Load Avro file into 'Apache Spark' |
Version: | 0.3.0 |
Author: | Aki Ariga |
Maintainer: | Aki Ariga <chezou@gmail.com> |
Description: | Load Avro files into 'Apache Spark' using 'sparklyr'. This allows reading files in the 'Apache Avro' format https://avro.apache.org/. |
License: | Apache License 2.0 | file LICENSE |
BugReports: | https://github.com/chezou/sparkavro |
Encoding: | UTF-8 |
LazyData: | true |
Imports: | sparklyr, dplyr, DBI |
RoxygenNote: | 7.0.2 |
Suggests: | testthat |
Language: | en-us |
NeedsCompilation: | no |
Packaged: | 2020-01-08 23:45:31 UTC; aki |
Repository: | CRAN |
Date/Publication: | 2020-01-10 04:40:02 UTC |
Reads an Avro File into Apache Spark
Description
Reads an Avro file into Apache Spark using sparklyr.
Usage
spark_read_avro(
sc,
name,
path,
readOptions = list(),
repartition = 0L,
memory = TRUE,
overwrite = TRUE
)
Arguments
sc |
An active spark_connection. |
name |
The name to assign to the newly generated table. |
path |
The path to the file. Needs to be accessible from the cluster. Supports the ‘"hdfs://"’, ‘"s3n://"’ and ‘"file://"’ protocols. |
readOptions |
A list of strings with additional options. |
repartition |
The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it already exists? |
Examples
## Not run:
## If you haven't got a Spark cluster, you can install Spark locally like this
library(sparklyr)
spark_install(version = "2.0.1")
sc <- spark_connect(master = "local")
df <- spark_read_avro(
sc,
"twitter",
system.file("extdata/twitter.avro", package = "sparkavro"),
repartition = 0L,
memory = FALSE,
overwrite = FALSE
)
spark_disconnect(sc)
## End(Not run)
Write a Spark DataFrame to an Avro file
Description
Serialize a Spark DataFrame to the Avro format.
Usage
spark_write_avro(x, path, mode = NULL, options = list())
Arguments
x |
A Spark DataFrame or dplyr operation |
path |
The path to the file. Needs to be accessible from the cluster. Supports the ‘"hdfs://"’, ‘"s3n://"’ and ‘"file://"’ protocols. |
mode |
Specifies the behavior when data or table already exists. |
options |
A list of strings with additional options. See http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration. |
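Examples

Unlike spark_read_avro, no example ships with this help page. The sketch below shows a round trip through spark_write_avro, assuming a local Spark connection; the data frame name, table names, and output path are illustrative, not part of the package.

```r
## Not run:
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

## Copy a small R data frame into Spark (illustrative data)
df <- copy_to(sc, iris, "iris_tbl")

## Write the Spark DataFrame out as Avro; mode = "overwrite"
## replaces any existing data at the target path
spark_write_avro(df, path = "file:///tmp/iris_avro", mode = "overwrite")

## Read the files back to verify the round trip
df2 <- spark_read_avro(sc, "iris_check", "file:///tmp/iris_avro")

spark_disconnect(sc)

## End(Not run)
```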