
Type: Package
Title: Load WARC Files into Apache Spark
Version: 0.1.6
Maintainer: Edgar Ruiz <edgar@rstudio.com>
Description: Load WARC (Web ARChive) files into Apache Spark using 'sparklyr'. This allows reading files from the Common Crawl project http://commoncrawl.org/.
License: Apache License 2.0
BugReports: https://github.com/r-spark/sparkwarc
Encoding: UTF-8
Imports: DBI, sparklyr, Rcpp
RoxygenNote: 7.1.1
LinkingTo: Rcpp
SystemRequirements: C++11
NeedsCompilation: yes
Packaged: 2022-01-10 16:40:06 UTC; yitaoli
Author: Javier Luraschi [aut], Yitao Li [aut], Edgar Ruiz [aut, cre]
Repository: CRAN
Date/Publication: 2022-01-11 08:50:02 UTC

Provides WARC paths for commoncrawl.org

Description

Provides WARC paths for commoncrawl.org. To be used with spark_read_warc.

Usage

cc_warc(start, end = start)

Arguments

start

The first path to retrieve.

end

The last path to retrieve.

Examples


cc_warc(1)
cc_warc(2, 3)
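A hedged sketch of feeding the path returned by cc_warc() to spark_read_warc(). The connection setup is an assumption (a local Spark installation); a full Common Crawl WARC segment is large, so this is illustrative only.

```r
## Not run: 
library(sparklyr)
library(sparkwarc)

sc <- spark_connect(master = "local")
# cc_warc(1) returns the first Common Crawl WARC path, which can be
# passed directly as the path argument of spark_read_warc().
sdf <- spark_read_warc(sc, name = "cc_first", path = cc_warc(1))
spark_disconnect(sc)

## End(Not run)
```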


Loads the sample WARC file using Rcpp

Description

Loads the sample WARC file using Rcpp.

Usage

rcpp_read_warc_sample(filter = "", include = "")

Arguments

filter

A regular expression used to filter each WARC entry efficiently by running native code through Rcpp.

include

A regular expression used to keep only matching lines efficiently by running native code through Rcpp.
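A hedged sketch of reading the bundled sample WARC entirely in R via Rcpp, with no Spark connection required. The regular expressions are illustrative, and the shape of the returned object is an assumption.

```r
library(sparkwarc)

# Keep only entries mentioning "html", and within those only lines
# containing "http"; both regexes are illustrative, not required.
sample_entries <- rcpp_read_warc_sample(filter = "html", include = "http")
str(sample_entries)
```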


Reads a WARC File using Rcpp

Description

Reads a WARC (Web ARChive) file using Rcpp.

Usage

spark_rcpp_read_warc(path, match_warc, match_line)

Arguments

path

The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3n://" and "file://" protocols.

match_warc

Include only WARC files matching this character string.

match_line

Include only lines matching this character string.
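A hedged sketch of the Rcpp reader applied to the bundled sample file. spark_warc_sample_path() is used here only to obtain a readable local path; treating the result as a data-frame-like object is an assumption.

```r
library(sparkwarc)

# Local path of the sample WARC shipped with the package.
path <- spark_warc_sample_path()

# Empty match strings keep every WARC entry and every line.
entries <- spark_rcpp_read_warc(path, match_warc = "", match_line = "")
head(entries)
```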


Reads a WARC File into Apache Spark

Description

Reads a WARC (Web ARChive) file into Apache Spark using sparklyr.

Usage

spark_read_warc(
  sc,
  name,
  path,
  repartition = 0L,
  memory = TRUE,
  overwrite = TRUE,
  match_warc = "",
  match_line = "",
  parser = c("r", "scala"),
  ...
)

Arguments

sc

An active spark_connection.

name

The name to assign to the newly generated table.

path

The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3n://" and "file://" protocols.

repartition

The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning.

memory

Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?)

overwrite

Boolean; overwrite the table with the given name if it already exists?

match_warc

Include only WARC files matching this character string.

match_line

Include only lines matching this character string.

parser

Which parser implementation to use? Options are "scala" or "r" (default).

...

Additional arguments reserved for future use.

Examples


## Not run: 
library(sparklyr)
library(sparkwarc)
sc <- spark_connect(master = "local")
sdf <- spark_read_warc(
  sc,
  name = "sample_warc",
  path = system.file(file.path("samples", "sample.warc"), package = "sparkwarc"),
  memory = FALSE,
  overwrite = FALSE
)

spark_disconnect(sc)

## End(Not run)
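A further hedged sketch showing the match_line and parser arguments, which the example above does not exercise. The regular expression is illustrative, and a local Spark installation is assumed.

```r
## Not run: 
library(sparklyr)
library(sparkwarc)
sc <- spark_connect(master = "local")

# Keep only lines that look like an opening HTML tag, using the
# default R parser.
html_sdf <- spark_read_warc(
  sc,
  name = "html_lines",
  path = system.file(file.path("samples", "sample.warc"), package = "sparkwarc"),
  match_line = "<html",
  parser = "r"
)

spark_disconnect(sc)

## End(Not run)
```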


Loads the sample WARC file into Spark

Description

Loads the sample WARC file into Spark.

Usage

spark_read_warc_sample(sc, filter = "", include = "")

Arguments

sc

An active spark_connection.

filter

A regular expression used to filter each WARC entry efficiently by running native code through Rcpp.

include

A regular expression used to keep only matching lines efficiently by running native code through Rcpp.
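A hedged sketch of loading the bundled sample WARC into a local Spark session. The connection setup is an assumption, and empty regexes (keeping everything) are used for simplicity.

```r
## Not run: 
library(sparklyr)
library(sparkwarc)

sc <- spark_connect(master = "local")
# Empty filter/include regexes keep every entry and every line.
sample_tbl <- spark_read_warc_sample(sc, filter = "", include = "")
spark_disconnect(sc)

## End(Not run)
```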


Retrieves the sample WARC path

Description

Retrieves the path of the sample WARC file bundled with the package.

Usage

spark_warc_sample_path()
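A short sketch of using the returned path; passing it on to the readers above is the intended use, though the exact on-disk location is installation-dependent.

```r
library(sparkwarc)

# Local path of the sample WARC shipped with the package; useful as the
# `path` argument of spark_read_warc() or spark_rcpp_read_warc().
path <- spark_warc_sample_path()
file.exists(path)
```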

sparkwarc

Description

Sparklyr extension for loading WARC Files into Apache Spark
