The ntdr package is an easy way to access the National Transit Database (NTD) from R. The package is available on GitHub, and you can install it from there with remotes::install_github():
remotes::install_github("https://github.com/vgXhc/ntdr", build_vignettes = TRUE)
In addition to loading the ntdr package, we also load the tidyverse.
library(ntdr)
library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.2.3
#> Warning: package 'dplyr' was built under R version 4.2.3
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr 1.1.1 ✔ readr 2.1.4
#> ✔ forcats 1.0.0 ✔ stringr 1.5.0
#> ✔ ggplot2 3.4.1 ✔ tibble 3.2.1
#> ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
#> ✔ purrr 1.0.1
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
get_ntd()
get_ntd() is the main function of the package. It doesn’t have any required parameters:
ntd_data <- get_ntd()
#> New names:
#> • `22526` -> `22526...34`
#> • `22523` -> `22523...71`
#> • `22526` -> `22526...95`
#> • `23344` -> `23344...100`
#> • `22523` -> `22523...102`
#> • `23344` -> `23344...139`
#> • `19423` -> `19423...147`
#> • `19423` -> `19423...187`
ntd_data
#> # A tibble: 576,300 × 12
#> ntd_id_5 ntd_id_4 agency active reporter_type uza uza_name modes tos
#> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
#> 1 00001 0001 King Count… Active Full Reporte… 14 Seattle… FB DO
#> 2 00001 0001 King Count… Active Full Reporte… 14 Seattle… FB DO
#> 3 00001 0001 King Count… Active Full Reporte… 14 Seattle… FB DO
#> 4 00001 0001 King Count… Active Full Reporte… 14 Seattle… FB DO
#> 5 00001 0001 King Count… Active Full Reporte… 14 Seattle… FB DO
#> 6 00001 0001 King Count… Active Full Reporte… 14 Seattle… FB DO
#> 7 00001 0001 King Count… Active Full Reporte… 14 Seattle… FB DO
#> 8 00001 0001 King Count… Active Full Reporte… 14 Seattle… FB DO
#> 9 00001 0001 King Count… Active Full Reporte… 14 Seattle… FB DO
#> 10 00001 0001 King Count… Active Full Reporte… 14 Seattle… FB DO
#> # ℹ 576,290 more rows
#> # ℹ 3 more variables: month <date>, value <dbl>, ntd_variable <chr>
colnames(ntd_data)
#> [1] "ntd_id_5" "ntd_id_4" "agency" "active"
#> [5] "reporter_type" "uza" "uza_name" "modes"
#> [9] "tos" "month" "value" "ntd_variable"
By default, the package downloads what the NTD calls “Complete Monthly Ridership (with adjustments and estimates).” Alternatively, you can request raw data (“Raw Monthly Ridership (No Adjustments or Estimates)”). For a more detailed discussion of the difference between the two data types, see this blog post.
As you can see, the package downloads a fairly large xlsx file from the web and returns a tibble with 576,300 rows and 12 columns. The first two columns are identifiers for the transit agency, followed by a human-readable agency name. Note that the agency name may not be what you expect. For example, the name of our local agency in Madison (Wisconsin) is “Metro Transit”, but in the NTD data it is listed as “City of Madison”. So if you cannot find your agency, use the uza_name variable described below.
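The lookup suggested above can be sketched as follows. To keep the example runnable offline, it uses a small hand-made tibble instead of real get_ntd() output. The object name toy_ntd and the uza_name strings are illustrative placeholders; only the column names and the two agency names come from this vignette.

```r
library(dplyr)

# Toy stand-in for the result of get_ntd(); the uza_name values are
# illustrative placeholders, not copied from NTD data.
toy_ntd <- tibble(
  agency   = c("City of Madison", "Capital Area Transportation Authority"),
  uza_name = c("Madison, WI", "Lansing, MI")
)

# Search the urbanized area name instead of guessing the agency name.
madison_agencies <- toy_ntd |>
  filter(grepl("Madison", uza_name, fixed = TRUE)) |>
  distinct(agency, uza_name)

madison_agencies
```

On the full data set, the same filter() call applied to ntd_data should list every agency reporting in a matching urbanized area, which you can then pass to get_ntd(agency = ...).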
NTD data go back as far as 2002, and some agencies no longer actively report data, report them under a different ID, or no longer exist. This is reflected in the active column.
reporter_type most commonly takes on the value Full Reporter, but smaller or rural systems in particular may have a different value for this variable. For agencies that aren’t full reporters, the NTD data may include projections rather than actually reported data.
uza is an identifier for urbanized areas, and uza_name has the name of that area (this will usually be how you find your local agency).
modes denotes the type of transit reported on.
ntd_data |>
  count(modes)
#> # A tibble: 22 × 2
#> modes n
#> <chr> <int>
#> 1 AG 1275
#> 2 AR 255
#> 3 CB 35700
#> 4 CC 255
#> 5 CR 9690
#> 6 DR 237150
#> 7 FB 11985
#> 8 HR 3825
#> 9 IP 1020
#> 10 LR 8925
#> # ℹ 12 more rows
There are a lot of different modes, including rather obscure ones like “Inclined Plane” (IP) or “Alaska Railroad” (AR). You can find documentation of the different modes here.
The tos variable represents the “type of service”:
ntd_data |>
  count(tos, sort = TRUE)
#> # A tibble: 13 × 2
#> tos n
#> <chr> <int>
#> 1 DO 294270
#> 2 PT 244545
#> 3 TX 31365
#> 4 TN 3060
#> 5 Reduced Reporters 510
#> 6 Rural Reporters 510
#> 7 <NA> 510
#> 8 Rolling 12-Month Sum 255
#> 9 Rolling 12-Month Sum with Reduced Reporters 255
#> 10 Rolling 12-Month Sum with Reduced and Rural Reporter Estimates 255
#> 11 Subtotal with Reduced Reporters 255
#> 12 Subtotal with Reduced and Rural Reporter Estimates 255
#> 13 Total 255
The most common values are DO, which is directly operated service, i.e. a transit agency running its own service, and PT for “purchased transportation”, i.e. a transit agency contracting out services. Often agencies will have an entry for both of these, with DO being the regular, fixed-route service and PT being paratransit or other more specialized forms of transit.
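The count above also contains labels such as Total and Rolling 12-Month Sum, which look like summary rows rather than service types. Assuming that reading is right, you may want to drop them before summing ridership so that trips are not counted twice. The sketch below does this on a hand-made tibble; toy_ntd and its numbers are invented for illustration, and only the tos labels are taken from the output above.

```r
library(dplyr)

# Toy rows using tos values that appear in the count above; the
# ridership numbers are invented for illustration.
toy_ntd <- tibble(
  tos   = c("DO", "PT", "Total", "Rolling 12-Month Sum", NA),
  value = c(100, 20, 120, 1400, 5)
)

# Keep only the four two-letter service-type codes. Rows with a summary
# label or a missing tos are dropped, because NA %in% ... is FALSE.
service_rows <- toy_ntd |>
  filter(tos %in% c("DO", "PT", "TX", "TN"))

service_rows
```

Whether the rows with a missing tos should be kept depends on what they represent in the NTD file, so it is worth inspecting them before discarding them.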
Finally, the month and value variables provide the actual transit data for a given month. Which metric value represents is recorded in the ntd_variable column. If you call get_ntd() without any additional parameters, it will return the “unlinked passenger trips” (UPT) metric for all agencies, modes, and types of service.
The data are returned in a long format, which makes it easy to create plots:
get_ntd(agency = c("City of Madison", "Capital Area Transportation Authority"), modes = "MB") |>
  dplyr::filter(tos == "DO") |>
  ggplot(aes(month, value, color = agency)) +
  geom_line() +
  labs(title = "Monthly unlinked passenger trips in Madison and Lansing") +
  theme_minimal()
#> New names:
#> • `22526` -> `22526...34`
#> • `22523` -> `22523...71`
#> • `22526` -> `22526...95`
#> • `23344` -> `23344...100`
#> • `22523` -> `22523...102`
#> • `23344` -> `23344...139`
#> • `19423` -> `19423...147`
#> • `19423` -> `19423...187`
#> Warning: Removed 3 rows containing missing values (`geom_line()`).
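The long format is also convenient for aggregation. As a minimal sketch, the code below rolls monthly values up to annual totals per agency. It runs on a hand-made tibble so that it works without a download; toy_ntd, the agency labels, and all numbers are made up, and only the month and value column names mirror the get_ntd() output.

```r
library(dplyr)

# Toy monthly data in the same long shape as the get_ntd() output.
toy_ntd <- tibble(
  agency = rep(c("Agency A", "Agency B"), each = 3),
  month  = rep(as.Date(c("2022-01-01", "2022-02-01", "2023-01-01")), times = 2),
  value  = c(10, 20, 30, 1, 2, NA)
)

# Derive the calendar year from the month column, then sum per agency
# and year. With na.rm = TRUE, a group holding only NA sums to 0.
annual <- toy_ntd |>
  mutate(year = as.integer(format(month, "%Y"))) |>
  group_by(agency, year) |>
  summarise(trips = sum(value, na.rm = TRUE), .groups = "drop")

annual
```

Because missing months are silently treated as zero here, a year with gaps will look lower than it really was. Counting the non-missing months per group alongside the sum is one way to flag such years.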