Version: | 0.1.3 |
Title: | Be Nice on the Web |
Description: | Be responsible when scraping data from websites by following polite principles: introduce yourself, ask for permission, take slowly and never ask twice. |
License: | MIT + file LICENSE |
Encoding: | UTF-8 |
ByteCompile: | true |
URL: | https://github.com/dmi3kno/polite, https://dmi3kno.github.io/polite/ |
BugReports: | https://github.com/dmi3kno/polite/issues |
RoxygenNote: | 7.2.3 |
Imports: | httr, magrittr, memoise, ratelimitr, robotstxt, rvest, stats, usethis |
Suggests: | dplyr, testthat, covr, webmockr |
NeedsCompilation: | no |
Packaged: | 2023-06-27 09:31:16 UTC; dm0737pe |
Author: | Dmytro Perepolkin |
Maintainer: | Dmytro Perepolkin <dperepolkin@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2023-06-30 08:30:02 UTC |
Pipe operator
Description
See magrittr::%>% for details.
Usage
lhs %>% rhs
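The pipe is re-exported so that polite verbs can be chained without attaching magrittr. A minimal, illustrative chain (the host URL is the one used in the other examples in this manual):
Examples
library(polite)
# bow() introduces the client to the host; scrape() then fetches the page
bow("https://www.cheese.com") %>%
  scrape()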
Introduce yourself to the host
Description
Introduce yourself to the host
Usage
bow(
url,
user_agent = "polite R package",
delay = 5,
times = 3,
force = FALSE,
verbose = FALSE,
...
)
is.polite(x)
Arguments
url | URL |
user_agent | character value passed to the user agent string |
delay | desired delay between scraping attempts. Final value will be the maximum of the desired delay and the delay mandated by the host's robots.txt for the relevant user agent |
times | number of times to attempt scraping. Default is 3. |
force | refresh all memoised functions. Clears up the memoise cache. Default is FALSE |
verbose | TRUE/FALSE |
... | other curl parameters wrapped into httr::config() |
x | object of class polite, session |
Value
object of class polite, session
Examples
library(polite)
host <- "https://www.cheese.com"
session <- bow(host)
session
Guess download file name from the URL
Description
Guess download file name from the URL
Usage
guess_basename(x)
Arguments
x | URL to guess the basename from |
Value
guessed file name
Examples
guess_basename("https://bit.ly/polite_sticker")
Convert collection of html nodes into data frame
Description
Convert collection of html nodes into data frame
Usage
html_attrs_dfr(
x,
attrs = NULL,
trim = FALSE,
defaults = NA_character_,
add_text = TRUE
)
Arguments
x | collection of html nodes (e.g. output of rvest::html_nodes()) |
attrs | character vector of attribute names. If missing, all attributes will be used |
trim | if TRUE, leading and trailing whitespace is trimmed from values |
defaults | character vector of default values to be passed to rvest::html_attr() |
add_text | if TRUE, the text of each node is added as an html_text column |
Value
data frame with one row per xml node, consisting of an html_text column with text and additional columns with attributes
Examples
library(polite)
library(rvest)
bow("https://en.wikipedia.org/wiki/List_of_cognitive_biases") %>%
scrape() %>%
html_nodes("tr td:nth-child(1) a") %>%
html_attrs_dfr()
Agree on a modification of the session path with the host
Description
Agree on a modification of the session path with the host
Usage
nod(bow, path, verbose = FALSE)
Arguments
bow | object of class polite, session, created by bow() |
path | string value of path/URL to follow. The function accepts either a path (the part of the URL following the domain name) or a full URL |
verbose | TRUE/FALSE |
Value
object of class polite, session, with modified URL
Examples
library(polite)
host <- "https://www.cheese.com"
session <- bow(host) %>%
  nod(path = "by_type")
session
Null coalescing operator
Description
Null coalescing operator
Usage
lhs %otherwise% rhs
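A brief sketch of the intended semantics, assuming the operator is available to the caller: the left-hand side is returned unless it is NULL, in which case the right-hand side is used.
Examples
NULL %otherwise% "fallback"      # returns "fallback"
"value" %otherwise% "fallback"   # returns "value"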
Give your web-scraping function good manners
Description
Give your web-scraping function good manners
Usage
politely(
fun,
user_agent = paste0("polite ", getOption("HTTPUserAgent"), " bot"),
robots = TRUE,
force = FALSE,
delay = 5,
verbose = FALSE,
cache = memoise::cache_memory()
)
Arguments
fun | function to be turned "polite". Must contain an argument named url, which receives the URL to be queried |
user_agent | optional, user agent string to be used. Defaults to paste0("polite ", getOption("HTTPUserAgent"), " bot") |
robots | optional, should robots.txt be consulted for permissions. Default is TRUE |
force | whether or not to force a fresh download of robots.txt |
delay | minimum delay in seconds, not less than 1. Default is 5. |
verbose | output more information about the querying process |
cache | memoise cache function for storing results. Default is memoise::cache_memory() |
Value
polite function
Examples
polite_GET <- politely(httr::GET)
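# The returned wrapper is then called in place of the original function.
# A minimal sketch; the host URL below is illustrative only.
polite_GET("https://www.cheese.com")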
Print host introduction object
Description
Print host introduction object
Usage
## S3 method for class 'polite'
print(x, ...)
Arguments
x | object of class polite, session |
... | other parameters passed to methods |
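A minimal illustration: printing the session object created by bow(); the host URL is reused from the other examples in this manual.
Examples
library(polite)
session <- bow("https://www.cheese.com")
print(session)  # equivalent to typing `session` at the prompt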
Polite file download
Description
Polite file download
Usage
rip(
bow,
destfile = NULL,
...,
mode = "wb",
path = tempdir(),
overwrite = FALSE
)
Arguments
bow | host introduction object of class polite, session, created by bow() or nod() |
destfile | optional new file name to use when saving the file. If missing, it will be guessed from basename(url) |
... | other parameters passed to the download function |
mode | character. The mode with which to write the file. Useful values are "w", "wb" (binary), "a" (append) and "ab" |
path | character. Path where to save the destfile. By default, a temporary directory created with tempdir() |
overwrite | if TRUE, overwrite the file on disk if it already exists |
Value
Full path to the locally saved file, as indicated by the user in destfile (and path)
Examples
bow("https://en.wikipedia.org/") %>%
nod("wiki/Flag_of_the_United_States#/media/File:Flag_of_the_United_States.svg") %>%
rip()
Scrape the content of authorized page/API
Description
Scrape the content of authorized page/API
Usage
scrape(
bow,
query = NULL,
params = NULL,
accept = "html",
content = NULL,
verbose = FALSE
)
Arguments
bow | host introduction object of class polite, session, created by bow() or nod() |
query | named list of parameters to be appended to the URL in the format list(param1 = "valA", param2 = "valB") |
params | deprecated. Use the query argument instead |
accept | character value of expected data type to be returned by host (e.g. "html", "json", "xml", "csv" or "txt") |
content | MIME type (aka internet media type) used to override the content type returned by the server. See http://en.wikipedia.org/wiki/Internet_media_type for a list of common types. You can add the charset parameter to override the server's default encoding (as in the example below) |
verbose | extra feedback from the function. Defaults to FALSE |
Value
Object of class httr::response
which can be further processed by functions in rvest
package
Examples
library(rvest)
bow("https://en.wikipedia.org/wiki/List_of_cognitive_biases") %>%
  scrape(content="text/html; charset=UTF-8") %>%
  html_nodes(".wikitable") %>%
  html_table()
Reset scraping/ripping rate limit
Description
Reset scraping/ripping rate limit
Usage
set_scrape_delay(delay)
set_rip_delay(delay)
Arguments
delay | delay between subsequent requests. The package default is 5 seconds. It can be set lower only when a custom user-agent string is specified. |
Value
Updates the rate-limit property of the scrape and rip functions, respectively.
Examples
library(polite)
host <- "https://www.cheese.com"
session <- bow(host)
session
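# A minimal sketch of the setters themselves: raise the minimum delay
# between subsequent scrape() calls to 10 seconds.
set_scrape_delay(10)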
Use manners in your own package or script
Description
Creates a collection of polite functions for scraping and downloading
Usage
use_manners(save_as = "R/polite-scrape.R", open = TRUE)
Arguments
save_as | file where the functions should be created. Defaults to "R/polite-scrape.R" |
open | if TRUE, open the newly created file |
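A hedged sketch: use_manners() writes a file into the current project, so it is shown as a not-run example. The default save_as path comes from the Usage block above.
Examples
## Not run:
# run from the root of your package or project to create R/polite-scrape.R
use_manners()
## End(Not run)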