The wordbankr package allows you to access data in the Wordbank database from R. This vignette shows some examples of how to use the data loading functions and what the resulting data look like.
There are three different data views that you can pull out of Wordbank: by-administration, by-item, and administration-by-item.
The get_administration_data function gives by-administration information, for either a specific language and form or for all instruments:
english_ws_admins <- get_administration_data("English", "WS")
head(english_ws_admins)
## Source: local data frame [6 x 14]
##
## data_id age comprehension production language form birth_order
## <dbl> <int> <int> <int> <chr> <chr> <fctr>
## 1 51699 27 497 497 English WS Fourth
## 2 51700 21 369 369 English WS Second
## 3 51701 26 190 190 English WS Fourth
## 4 51702 27 264 264 English WS Second
## 5 51703 19 159 159 English WS Second
## 6 51704 30 513 513 English WS Second
## Variables not shown: ethnicity <fctr>, sex <fctr>, zygosity <chr>, norming
## <lgl>, longitudinal <lgl>, source_name <chr>, mom_ed <fctr>.
all_admins <- get_administration_data()
head(all_admins)
## Source: local data frame [6 x 14]
##
## data_id age comprehension production language form birth_order
## <dbl> <int> <int> <int> <chr> <chr> <fctr>
## 1 29821 13 293 88 Croatian WG NA
## 2 29822 16 122 12 Croatian WG NA
## 3 29823 9 3 0 Croatian WG NA
## 4 29824 12 0 0 Croatian WG NA
## 5 29825 12 44 0 Croatian WG NA
## 6 29826 8 14 5 Croatian WG NA
## Variables not shown: ethnicity <fctr>, sex <fctr>, zygosity <chr>, norming
## <lgl>, longitudinal <lgl>, source_name <chr>, mom_ed <fctr>.
The get_item_data function gives by-item information, for either a specific language and form or for all instruments:
spanish_wg_items <- get_item_data("Spanish", "WG")
head(spanish_wg_items)
## Source: local data frame [6 x 11]
##
## item_id definition language form type category lexical_category
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 item_1 ¡am! Spanish WG word sounds other
## 2 item_2 ¡ay! Spanish WG word sounds other
## 3 item_3 bee/mee Spanish WG word sounds other
## 4 item_4 cuacuá Spanish WG word sounds other
## 5 item_5 guaguá Spanish WG word sounds other
## 6 item_6 miau Spanish WG word sounds other
## Variables not shown: lexical_class <chr>, uni_lemma <chr>,
## complexity_category <chr>, num_item_id <dbl>.
all_items <- get_item_data()
head(all_items)
## Source: local data frame [6 x 11]
##
## item_id definition language form type category lexical_category
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 item_81 gristi Croatian WG word action_words predicates
## 2 item_264 puhati Croatian WG word action_words predicates
## 3 item_269 razbiti Croatian WG word action_words predicates
## 4 item_64 donijeti Croatian WG word action_words predicates
## 5 item_153 kupiti Croatian WG word action_words predicates
## 6 item_36 čistiti Croatian WG word action_words predicates
## Variables not shown: lexical_class <chr>, uni_lemma <chr>,
## complexity_category <chr>, num_item_id <dbl>.
If you are only looking at total vocabulary size, the administration data is all you need, since it already includes both productive and receptive vocabulary sizes. If you are looking at specific items or subsets of items, you need to load instrument data using the get_instrument_data function. Pass it an instrument language and form, along with a vector of the items you want to extract (by item_id).
eng_ws_canines <- get_instrument_data(instrument_language = "English",
                                      instrument_form = "WS",
                                      items = c("item_26", "item_46"))
head(eng_ws_canines)
## Source: local data frame [6 x 3]
##
## data_id value num_item_id
## <dbl> <chr> <dbl>
## 1 51699 produces 26
## 2 51700 produces 26
## 3 51701 produces 26
## 4 51702 produces 26
## 5 51703 26
## 6 51704 produces 26
By default, get_instrument_data returns a data frame with columns for the administration’s data_id, the item’s num_item_id (the numerical item_id), and the corresponding value. To include administration information, set the administrations argument to TRUE, or pass the result of get_administration_data as administrations (this prevents the administration data from being loaded multiple times). Similarly, you can set the iteminfo argument to TRUE, or pass it the result of get_item_data.
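As a sketch of the second option described above, the helper below loads the administration and item tables once and passes them in, rather than setting both arguments to TRUE. The name load_canines_with_info is hypothetical and not part of wordbankr. The argument names are the ones used in this vignette and may differ in other versions of the package, and the call needs a connection to the Wordbank database.

```r
# Hypothetical helper, not part of wordbankr: loads the two example items
# together with administration and item information.
load_canines_with_info <- function() {
  # Load each table once so it can be reused across several calls.
  admins <- wordbankr::get_administration_data("English", "WS")
  items <- wordbankr::get_item_data("English", "WS")

  wordbankr::get_instrument_data(instrument_language = "English",
                                 instrument_form = "WS",
                                 items = c("item_26", "item_46"),
                                 administrations = admins,
                                 iteminfo = items)
}
```

Passing administrations = TRUE and iteminfo = TRUE instead is shorter, but the tables are then loaded again on each call.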
Loading the data is fast if you need only a handful of items, but the time scales roughly linearly with the number of items and can get quite slow if you need many or all of them. It’s therefore a good idea to filter down to only the items you need before calling get_instrument_data.
As an example, let’s say we want to look at the production of animal words on English Words & Sentences over age. First we get the items we want:
library(dplyr)  # provides %>% and filter()

animals <- get_item_data("English", "WS") %>%
  filter(category == "animals")
Then we get the instrument data for those items:
animal_data <- get_instrument_data(instrument_language = "English",
                                   instrument_form = "WS",
                                   items = animals$item_id,
                                   administrations = english_ws_admins)
Finally, we calculate how many animal words each child produces and the median number of animal words produced in each age bin:
animal_summary <- animal_data %>%
  mutate(produces = value == "produces") %>%
  group_by(age, data_id) %>%
  summarise(num_animals = sum(produces, na.rm = TRUE)) %>%
  group_by(age) %>%
  summarise(median_num_animals = median(num_animals, na.rm = TRUE))
library(ggplot2)  # provides ggplot(), aes(), and geom_point()

ggplot(animal_summary, aes(x = age, y = median_num_animals)) +
  geom_point()
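To make the two-step summary concrete, here is a minimal sketch of the same logic on a made-up data frame, written in base R so that it runs without a database connection. The toy values are invented for illustration and are not Wordbank data; only the column names (data_id, age, value) follow the output shown earlier.

```r
# Toy data (invented): three children, three animal items each.
toy <- data.frame(
  data_id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  age     = c(20, 20, 20, 20, 20, 20, 24, 24, 24),
  value   = c("produces", "", "produces",
              "", "", "produces",
              "produces", "produces", "produces"),
  stringsAsFactors = FALSE
)

# Step 1: flag produced items and count them per child.
toy$produces <- toy$value == "produces"
per_child <- aggregate(produces ~ age + data_id, data = toy, FUN = sum)
names(per_child)[names(per_child) == "produces"] <- "num_animals"

# Step 2: take the median count across children within each age.
per_age <- aggregate(num_animals ~ age, data = per_child, FUN = median)
names(per_age)[names(per_age) == "num_animals"] <- "median_num_animals"

print(per_age)
```

The two 20-month-olds produce 2 and 1 animal words, so their median is 1.5, while the single 24-month-old produces 3.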