Type: | Package |
Title: | Load WARC Files into Apache Spark |
Version: | 0.1.6 |
Maintainer: | Edgar Ruiz <edgar@rstudio.com> |
Description: | Load WARC (Web ARChive) files into Apache Spark using 'sparklyr'. This makes it possible to read files from the Common Crawl project <http://commoncrawl.org/>. |
License: | Apache License 2.0 |
BugReports: | https://github.com/r-spark/sparkwarc |
Encoding: | UTF-8 |
Imports: | DBI, sparklyr, Rcpp |
RoxygenNote: | 7.1.1 |
LinkingTo: | Rcpp |
SystemRequirements: | C++11 |
NeedsCompilation: | yes |
Packaged: | 2022-01-10 16:40:06 UTC; yitaoli |
Author: | Javier Luraschi [aut], Yitao Li |
Repository: | CRAN |
Date/Publication: | 2022-01-11 08:50:02 UTC |
Provides WARC paths for commoncrawl.org
Description
Provides WARC paths for commoncrawl.org. To be used with spark_read_warc.
Usage
cc_warc(start, end = start)
Arguments
start |
The first path to retrieve. |
end |
The last path to retrieve. |
Examples
cc_warc(1)
cc_warc(2, 3)
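As a rough sketch of how the two functions might be combined: the description above says these paths are meant for spark_read_warc, so the block below passes the result of cc_warc() as its path argument. The wrapper name read_cc_warc and the table name "cc_warc" are made up for illustration, and running it assumes a live Spark connection that can reach the Common Crawl storage.

```r
# Hypothetical wrapper (not part of sparkwarc): load the first `n`
# Common Crawl WARC paths into a Spark table.
read_cc_warc <- function(sc, n = 1, name = "cc_warc") {
  sparkwarc::spark_read_warc(
    sc,
    name = name,
    path = sparkwarc::cc_warc(1, n)  # paths 1..n, as in cc_warc(start, end)
  )
}

# Usage, assuming a working Spark installation:
# sc <- sparklyr::spark_connect(master = "local")
# cc <- read_cc_warc(sc, n = 1)
# sparklyr::spark_disconnect(sc)
```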
Loads the sample WARC file using Rcpp
Description
Loads the sample WARC file using Rcpp.
Usage
rcpp_read_warc_sample(filter = "", include = "")
Arguments
filter |
A regular expression used to filter each WARC entry efficiently by running native code. |
include |
A regular expression used to keep only matching lines efficiently by running native code. |
Reads a WARC File using Rcpp
Description
Reads a WARC (Web ARChive) file using Rcpp.
Usage
spark_rcpp_read_warc(path, match_warc, match_line)
Arguments
path |
The path to the file. Needs to be accessible from the cluster. Supports the ‘"hdfs://"’, ‘"s3n://"’ and ‘"file://"’ protocols. |
match_warc |
Include only WARC files matching this character string. |
match_line |
Include only lines matching this character string. |
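To make the two filters more concrete, here is a minimal base R sketch of the idea. It does not call sparkwarc at all: the tiny in-memory "WARC" text, the helper name filter_warc, and the assumption that both arguments behave like regular expressions applied per record and per line are all illustrative guesses, not the package's actual implementation.

```r
# Hypothetical in-memory stand-in for a WARC file containing two records.
warc_text <- c(
  "WARC/1.0",
  "WARC-Type: response",
  "WARC-Target-URI: http://example.com/a",
  "",
  "<html><body><a href=\"/x\">x</a></body></html>",
  "WARC/1.0",
  "WARC-Type: response",
  "WARC-Target-URI: http://example.org/b",
  "",
  "<html><body>no links here</body></html>"
)

# Split the lines into records: a new record starts at each "WARC/1.0" line.
record_id <- cumsum(warc_text == "WARC/1.0")
records <- split(warc_text, record_id)

# Illustrative helper (not part of sparkwarc): keep the records in which any
# line matches `match_warc`, then keep only the lines matching `match_line`.
filter_warc <- function(records, match_warc = "", match_line = "") {
  if (nzchar(match_warc)) {
    keep <- vapply(records, function(r) any(grepl(match_warc, r)), logical(1))
    records <- records[keep]
  }
  if (nzchar(match_line)) {
    records <- lapply(records, function(r) r[grepl(match_line, r)])
  }
  records
}

# Keep only records mentioning example.com, and within them only anchor tags.
filtered <- filter_warc(records, match_warc = "example\\.com", match_line = "<a ")
```

With empty strings (the defaults used elsewhere in this package), the sketch leaves the records untouched, which mirrors the idea that no filtering is requested.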
Reads a WARC File into Apache Spark
Description
Reads a WARC (Web ARChive) file into Apache Spark using sparklyr.
Usage
spark_read_warc(
  sc,
  name,
  path,
  repartition = 0L,
  memory = TRUE,
  overwrite = TRUE,
  match_warc = "",
  match_line = "",
  parser = c("r", "scala"),
  ...
)
Arguments
sc |
An active Spark connection, such as one created with spark_connect(). |
name |
The name to assign to the newly generated table. |
path |
The path to the file. Needs to be accessible from the cluster. Supports the ‘"hdfs://"’, ‘"s3n://"’ and ‘"file://"’ protocols. |
repartition |
The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it already exists? |
match_warc |
Include only WARC files matching this character string. |
match_line |
Include only lines matching this character string. |
parser |
Which parser implementation to use. Options are "scala" or "r" (default). |
... |
Additional arguments reserved for future use. |
Examples
## Not run:
library(sparklyr)
library(sparkwarc)
sc <- spark_connect(master = "local")
sdf <- spark_read_warc(
  sc,
  name = "sample_warc",
  path = system.file(file.path("samples", "sample.warc"), package = "sparkwarc"),
  memory = FALSE,
  overwrite = FALSE
)
spark_disconnect(sc)
## End(Not run)
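Once a table has been registered under the given name, it can presumably be queried like any other Spark table. The following is a minimal sketch, assuming the connection is still open (that is, before spark_disconnect is called) and relying on sparklyr connections working with DBI, which this package imports. The helper name count_warc_rows is hypothetical.

```r
# Hypothetical helper (not part of sparkwarc): count the rows of a table
# previously created with spark_read_warc(), using a plain SQL query.
count_warc_rows <- function(sc, table = "sample_warc") {
  DBI::dbGetQuery(sc, paste0("SELECT COUNT(*) AS n FROM ", table))
}

# Usage, assuming `sc` is an open connection and "sample_warc" exists:
# count_warc_rows(sc)
```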
Loads the sample WARC file in Spark
Description
Loads the sample WARC file in Spark.
Usage
spark_read_warc_sample(sc, filter = "", include = "")
Arguments
sc |
An active Spark connection, such as one created with spark_connect(). |
filter |
A regular expression used to filter each WARC entry efficiently by running native code. |
include |
A regular expression used to keep only matching lines efficiently by running native code. |
Retrieves the sample WARC path
Description
Retrieves the path to the sample WARC file.
Usage
spark_warc_sample_path()
sparkwarc
Description
Sparklyr extension for loading WARC Files into Apache Spark