robotstxt NEWS
0.7.15 | 2024-08-24
- CRAN compliance - Packages which use Internet resources should fail gracefully
- CRAN compliance - fix R CMD check NOTEs
0.7.14 | 2024-08-24
- CRAN compliance - Packages which use Internet resources should fail gracefully
0.7.13 | 2020-09-03
- CRAN compliance - prevent URL forwarding (HTTP 301): add www to URLs
0.7.12 | 2020-09-03
- CRAN compliance - prevent URL forwarding (HTTP 301): add trailing slashes to URLs
0.7.11 | 2020-09-02
- CRAN compliance - LICENCE file wording; prevent URL forwarding (HTTP 301)
0.7.10 | 2020-08-19
- fix problem in parse_robotstxt() - a comment in the last line of a robots.txt file would lead to erroneous parsing - reported by @gittaca, https://github.com/ropensci/robotstxt/pull/59 and https://github.com/ropensci/robotstxt/issues/60
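A minimal sketch of the fixed case (the robots.txt string is made up for illustration):

    library(robotstxt)
    # last line of the file is a comment - this used to break parsing
    txt <- "User-agent: *\nDisallow: /private/\n# trailing comment"
    parse_robotstxt(txt)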
0.7.9 | 2020-08-02
- fix problem in is_valid_robotstxt() - the robots.txt validity check was too lax - reported by @gittaca, https://github.com/ropensci/robotstxt/issues/58
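A minimal sketch of the stricter validity check (the expected results are assumptions, not verified output):

    library(robotstxt)
    is_valid_robotstxt("User-agent: *\nDisallow: /")          # should be TRUE
    is_valid_robotstxt("<html><body>not found</body></html>") # should be FALSE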
0.7.8 | 2020-07-22
- fix problem with domain name extraction - reported by @gittaca,
https://github.com/ropensci/robotstxt/issues/57
- fix problem with vArYING CasE in robots.txt field names - reported
by @steffilazerte,
https://github.com/ropensci/robotstxt/issues/55
0.7.7 | 2020-06-17
- fix problem in rt_request_handler - reported by @MHWauben
https://github.com/dmi3kno/polite/issues/28 - patch by @dmi3kno
0.7.6 | 2020-06-13
- make info about whether or not results were cached available - requested by @dmi3kno, https://github.com/ropensci/robotstxt/issues/53
0.7.5 | 2020-06-07
- fix passing through more parameters from
robotstxt() to get_robotstxt() - reported and implemented by @dmi3kno
0.7.3 | 2020-05-29
- minor : improve printing of robots.txt
- add request data as attribute to robots.txt
- add as.list() method for robots.txt (see the sketch below)
- adding several paragraphs to the README file
- major : finishing handlers - quality check, documentation
- fix : Partial matching warnings #51 - reported by @mine-cetinkaya-rundel
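A minimal sketch of the new as.list() method (domain is illustrative; it is assumed the method applies to the object returned by get_robotstxt()):

    library(robotstxt)
    rt <- get_robotstxt("example.com")
    as.list(rt)  # robots.txt content plus attached request data as a plain list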
0.7.2 | 2020-05-04
- minor : changes in dependencies were introducing errors when no scheme/protocol was provided in the URL – fixed https://github.com/ropensci/robotstxt/issues/50
0.7.1 | 2018-01-09
- minor : modifying robots.txt parser to be more
robust against different formatting of robots.txt files – fixed
https://github.com/ropensci/robotstxt/issues/48
0.7.0 | 2018-11-27
- major : introducing http handler to allow for better interpretation of robots.txt files in case of certain events: redirects, server error, client error, suspicious content, …
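A hedged sketch of the handler mechanism; it assumes the exported default handler objects (e.g. on_server_error_default) and that a handler is a list with a signal element - both are assumptions about the API, not confirmed by this file:

    library(robotstxt)
    # assumption: downgrade the reaction to server errors from error to warning
    handler <- on_server_error_default
    handler$signal <- "warning"
    rt <- get_robotstxt("example.com", on_server_error = handler)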
0.6.4 | 2018-09-14
- minor : pass through of parameter for content
encoding
0.6.3 | 2018-09-14
- minor : introduced parameter encoding to get_robotstxt() that defaults to "UTF-8" - which is what the content function used anyway, but now it will not complain about it (see the sketch below)
- minor : added comment to help files specifying use of trailing slash in paths pointing to folders in paths_allowed() and robotstxt().
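A minimal sketch of the new encoding parameter (domain is illustrative):

    library(robotstxt)
    rtxt <- get_robotstxt("example.com", encoding = "UTF-8")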
0.6.2 | 2018-07-18
- minor : changed from future::future_lapply() to future.apply::future_lapply() to make package compatible with versions of future after 1.8.1
0.6.1 | 2018-05-30
- minor : package was moved to other repo location
and project status badge was added
0.6.0 | 2018-02-10
- change/fix : check function paths_allowed() would not return correct results in some edge cases, indicating that the spiderbar/rep-cpp check method is more reliable and shall be the default and only method: see 1, see 2, see 3
0.5.2 | 2017-11-12
- fix : rt_get_rtxt() would break on Windows due to trying to readLines() from a folder
0.5.1 | 2017-11-11
- change : spiderbar is now non-default second
(experimental) check method
- fix : there were warnings in case of multiple
domain guessing
0.5.0 | 2017-10-07
- feature : spiderbar's can_fetch() was added; now one can choose which check method to use for checking access rights
- feature : use futures (from package future) to speed up retrieval and parsing
- feature : now there is a get_robotstxts() function which is a 'vectorized' version of get_robotstxt()
- feature : paths_allowed() now allows checking via either robotstxt's parsed robots.txt files or via functionality provided by the spiderbar package (the latter should be faster by approximately a factor of 10) - see the sketch after this list
- feature : various functions now have an ssl_verifypeer option (analogous to the CURL option https://curl.haxx.se/libcurl/c/CURLOPT_SSL_VERIFYPEER.html) which might help with robots.txt file retrieval in some cases
- change : user_agent for robots.txt file retrieval will now default to: sessionInfo()$R.version$version.string
- change : robotstxt now assumes it knows how to parse --> if it cannot parse, it assumes that it got no valid robots.txt file, meaning that there are no restrictions
- fix : valid_robotstxt would not accept some actually valid robotstxt files
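A minimal sketch of the 0.5.0 options (domain is illustrative; the ssl_verifypeer value is an assumption based on the description above):

    library(robotstxt)
    # pick the check method explicitly
    paths_allowed("/images/", domain = "example.com", check_method = "spiderbar")
    paths_allowed("/images/", domain = "example.com", check_method = "robotstxt")
    # relax peer certificate verification if retrieval fails because of it
    get_robotstxt("example.com", ssl_verifypeer = 0)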
0.4.1 | 2017-08-20
- restructure : put each function in a separate file
- fix : parsing would go bonkers for robots.txt of cdc.gov (e.g. combining all robots with all permissions) due to erroneous handling of the carriage return character (reported by @hrbrmstr - thanks)
0.4.0 | 2017-07-14
- user_agent parameter added to robotstxt() and paths_allowed() to allow for a user-defined HTTP user agent to be sent when retrieving the robots.txt file from a domain
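A minimal sketch (user agent string and domain are illustrative):

    library(robotstxt)
    paths_allowed("/", domain = "example.com",
                  user_agent = "mybot/0.1 (+https://example.com/bot-info)")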
0.3.4 | 2017-07-08
- fix : non-robots.txt files (e.g. html files returned by the server instead of the requested robots.txt / facebook.com) would be handled as if they were non-existent / empty files (reported by @simonmunzert - thanks)
- fix : UTF-8 encoded robots.txt with BOM (byte order
mark) would break parsing although files were otherwise valid robots.txt
files
0.3.3 | 2016-12-10
- updating NEWS file and switching to NEWS.md
0.3.2 | 2016-04-28
0.3.1 | 2016-04-27
- get_robotstxt() tests for HTTP errors and handles them; warnings might be suppressed while implausible HTTP status codes will lead to stopping the function https://github.com/ropenscilabs/robotstxt#5
- dropping R6 dependency and using a list implementation instead https://github.com/ropenscilabs/robotstxt#6
- use caching for get_robotstxt() https://github.com/ropenscilabs/robotstxt#7 / https://github.com/ropenscilabs/robotstxt/commit/90ad735b8c2663367db6a9d5dedbad8df2bc0d23
- make explicit, less error-prone usage of httr::content(rtxt) https://github.com/ropenscilabs/robotstxt#
- replace usage of missing for parameter checks with an explicit NULL as default value for the parameter https://github.com/ropenscilabs/robotstxt#9
- partial match useragent / useragents https://github.com/ropenscilabs/robotstxt#10
- explicit declaration of encoding: encoding = "UTF-8" in httr::content() https://github.com/ropenscilabs/robotstxt#11
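The explicit-encoding bullet corresponds to the following pattern; a minimal sketch using httr directly (URL is illustrative):

    library(httr)
    res <- GET("https://example.com/robots.txt")
    txt <- content(res, as = "text", encoding = "UTF-8")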
0.1.2 | 2016-02-08
- first feature-complete version on CRAN