This vignette for the unpivotr package demonstrates unpivoting HTML tables of various kinds.
The HTML files are in the package directory at system.file("extdata", c("rowspan.html", "colspan.html", "nested.html"), package = "unpivotr").
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
If a table has cells merged across rows or columns (or both), then as_cells() does not attempt to fill the cell contents across the rows or columns. This is different from other packages, e.g. rvest, which repeats the merged cell's text in each position it covers. However, if merged cells cause a table not to be square, then as_cells() pads the missing cells with blanks (NA in the html column).
Header (1:2, 1) | Header (1, 2) |
---|---|
 | cell (2, 2) |
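The two outputs that follow look like the same file parsed once with rvest and once with unpivotr. A minimal sketch of such calls, assuming the rowspan.html file named in the list above and that dplyr, rvest and unpivotr are installed; the names rvest_tables and cells are illustrative and not taken from the vignette:

```r
library(dplyr)    # provides the %>% pipe
library(rvest)    # read_html(), html_table()
library(unpivotr) # as_cells()

# Path to the example file shipped with unpivotr (file name taken from the list above)
rowspan <- system.file("extdata", "rowspan.html", package = "unpivotr")

# rvest: returns a list of wide tibbles, one per table
rvest_tables <- rowspan %>% read_html() %>% html_table()

# unpivotr: returns a list of tibbles with one row per cell position
cells <- rowspan %>% read_html() %>% as_cells()

rvest_tables
cells
```

Comparing the two results shows the difference described above: the position covered by the merged header has no html of its own in the as_cells() result.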
## [[1]]
## # A tibble: 1 × 2
## `Header (1:2, 1)` `Header (1, 2)`
## <chr> <chr>
## 1 Header (1:2, 1) cell (2, 2)
## [[1]]
## # A tibble: 4 × 4
## row col data_type html
## <int> <int> <chr> <chr>
## 1 1 1 html "<th rowspan=\"2\">Header (1:2, 1)</th>"
## 2 2 1 html <NA>
## 3 1 2 html "<th>Header (1, 2)</th>"
## 4 2 2 html "<td>cell (2, 2)</td>"
Header (1, 1:2) | |
---|---|
cell (2, 1) | cell (2, 2) |
## [[1]]
## # A tibble: 1 × 2
## `Header (1, 1:2)` `Header (1, 1:2)`
## <chr> <chr>
## 1 cell (2, 1) cell (2, 2)
## [[1]]
## # A tibble: 4 × 4
## row col data_type html
## <int> <int> <chr> <chr>
## 1 1 1 html "<th colspan=\"2\">Header (1, 1:2)</th>"
## 2 2 1 html "<td>cell (2, 1)</td>"
## 3 1 2 html <NA>
## 4 2 2 html "<td>cell (2, 2)</td>"
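# Path to the example file with a cell merged across both rows and columns
rowandcolspan <- system.file("extdata",
                             "row-and-colspan.html",
                             package = "unpivotr")
# includeHTML() comes from the htmltools package; it renders the file inline
includeHTML(rowandcolspan)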
Header (1:2, 1:2) | | Header (2, 3) | | |
---|---|---|---|---|
 | | cell (3, 1) | cell (3, 2) | cell (3, 3) |
## [[1]]
## # A tibble: 1 × 5
## `Header (1:2, 1:2)` `Header (1:2, 1:2)` `Header (2, 3)` `` ``
## <chr> <chr> <chr> <chr> <chr>
## 1 Header (1:2, 1:2) Header (1:2, 1:2) cell (3, 1) cell (3, 2) cell (3, …
## [[1]]
## # A tibble: 10 × 4
## row col data_type html
## <int> <int> <chr> <chr>
## 1 1 1 html "<th colspan=\"2\" rowspan=\"2\">Header (1:2, 1:2)</th…
## 2 2 1 html <NA>
## 3 1 2 html <NA>
## 4 2 2 html <NA>
## 5 1 3 html "<th>Header (2, 3)</th>"
## 6 2 3 html "<td>cell (3, 1)</td>"
## 7 1 4 html <NA>
## 8 2 4 html "<td>cell (3, 2)</td>"
## 9 1 5 html <NA>
## 10 2 5 html "<td>cell (3, 3)</td>"
as_cells() never descends into cells. If there is a table inside a cell, then to parse that table apply as_cells() (or html_table()) again to the HTML of that cell.
Header (1, 1) | Header (1, 2) |
---|---|
cell (2, 1) | (nested table: Header (2, 2)(1, 1), Header (2, 2)(1, 2); cell (2, 2)(2, 1), cell (2, 2)(2, 1)) |
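The results printed next are consistent with calls like the following. This is a sketch under assumptions: nested.html is taken to be the file named in the list at the top, and x is the name that the later code uses for the cells of the outer table.

```r
library(dplyr)    # provides the %>% pipe
library(rvest)    # read_html(), html_table()
library(unpivotr) # as_cells()

# Path to the example file shipped with unpivotr (file name taken from the list above)
nested <- system.file("extdata", "nested.html", package = "unpivotr")

# rvest: parse every table it finds in the document
nested %>% read_html() %>% html_table()

# unpivotr: keep the first element of the result, the cells of the outer table
x <- nested %>% read_html() %>% as_cells() %>% .[[1]]
x
```

The html column of x keeps the full markup of each cell, so the nested table is still available as a string for a second pass.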
## [[1]]
## # A tibble: 3 × 6
## `Header (1, 1)` `Header (1, 2)` `` `` `` ``
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 cell (2, 1) "Header (2, 2)(1, 1)\n … Head… Head… cell… cell…
## 2 Header (2, 2)(1, 1) "Header (2, 2)(1, 2)" <NA> <NA> <NA> <NA>
## 3 cell (2, 2)(2, 1) "cell (2, 2)(2, 1)" <NA> <NA> <NA> <NA>
##
## [[2]]
## # A tibble: 1 × 2
## `Header (2, 2)(1, 1)` `Header (2, 2)(1, 2)`
## <chr> <chr>
## 1 cell (2, 2)(2, 1) cell (2, 2)(2, 1)
## # A tibble: 4 × 4
## row col data_type html
## <int> <int> <chr> <chr>
## 1 1 1 html "<th>Header (1, 1)</th>"
## 2 2 1 html "<td>cell (2, 1)</td>"
## 3 1 2 html "<th>Header (1, 2)</th>"
## 4 2 2 html "<td>\n <table>\n<tr>\n<th>Header (2, 2)(1, 1)…
# The html of the table inside a cell
# (x holds the as_cells() result for the outer table, printed above)
cell <-
  x %>%
  dplyr::filter(row == 2, col == 2) %>%
  dplyr::pull(html)
cell
## [1] "<td>\n <table>\n<tr>\n<th>Header (2, 2)(1, 1)</th>\n <th>Header (2, 2)(1, 2)</th>\n </tr>\n<tr>\n<td>cell (2, 2)(2, 1)</td>\n <td>cell (2, 2)(2, 1)</td>\n </tr>\n</table>\n</td>"
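To parse the nested table, the cell's HTML can be read again and passed to as_cells(). A sketch, with cell rebuilt by hand from the value printed above so that the block runs on its own (whitespace may differ from the original file); inner is an illustrative name:

```r
library(dplyr)    # provides the %>% pipe
library(rvest)    # read_html()
library(unpivotr) # as_cells()

# The html of the cell that contains the nested table, copied from the printed value
cell <- paste0(
  "<td>\n <table>\n<tr>\n<th>Header (2, 2)(1, 1)</th>\n",
  " <th>Header (2, 2)(1, 2)</th>\n </tr>\n<tr>\n",
  "<td>cell (2, 2)(2, 1)</td>\n <td>cell (2, 2)(2, 1)</td>\n </tr>\n",
  "</table>\n</td>"
)

# Read the cell's html as a document of its own and unpivot the table inside it
inner <- cell %>% read_html() %>% as_cells()
inner
```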
## [[1]]
## # A tibble: 4 × 4
## row col data_type html
## <int> <int> <chr> <chr>
## 1 1 1 html <th>Header (2, 2)(1, 1)</th>
## 2 2 1 html <td>cell (2, 2)(2, 1)</td>
## 3 1 2 html <th>Header (2, 2)(1, 2)</th>
## 4 2 2 html <td>cell (2, 2)(2, 1)</td>
A motivation for using unpivotr::as_cells() is that it extracts more than just text: it can extract whatever part of the HTML you need.
Here, we extract URLs.
Scraping HTML. | ||
Sweet | as? | Yeah, right. |
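# Return the href of every link in a cell's html (NA for a cell with no html)
cell_url <- function(x) {
  if (is.na(x)) return(NA_character_)
  x %>%
    read_html() %>%
    html_nodes("a") %>%
    html_attr("href")
}
# Return the text of every link in a cell's html (NA for a cell with no html)
cell_text <- function(x) {
  if (is.na(x)) return(NA_character_)
  x %>%
    read_html() %>%
    html_nodes("a") %>%
    html_text()
}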
# urls: the example HTML rendered above
urls %>%
  read_html() %>%
  as_cells() %>%
  .[[1]] %>%
  mutate(text = purrr::map(html, cell_text),
         url = purrr::map(html, cell_url)) %>%
  tidyr::unnest(c(text, url))
## # A tibble: 8 × 6
## row col data_type html text url
## <int> <int> <chr> <chr> <chr> <chr>
## 1 1 1 html "<td colspan=\"2\">\n<a href=\"https://www.… Scra… http…
## 2 1 1 html "<td colspan=\"2\">\n<a href=\"https://www.… HTML. http…
## 3 2 1 html "<td><a href=\"https://cran.r-project.org/\… Sweet http…
## 4 1 2 html <NA> <NA> <NA>
## 5 2 2 html "<td><a href=\"https://cran.r-project.org/p… as? http…
## 6 1 3 html <NA> <NA> <NA>
## 7 2 3 html "<td>\n<a href=\"https://cran.r-project.org… Yeah, http…
## 8 2 3 html "<td>\n<a href=\"https://cran.r-project.org… righ… http…