Introduction to ripums - IPUMS Data in R

Minnesota Population Center

2017-11-02

The ripums package allows you to read in data from your extract into R along with the associated metadata like variable labels, value labels and more. IPUMS is a great source of international census and survey data.

IPUMS provides census and survey data from around the world integrated across time and space. IPUMS integration and documentation makes it easy to study change, conduct comparative research, merge information across data types, and analyze individuals within family and community context. Data and services are available free of charge.

Learn more here: https://www.ipums.org/whatIsIPUMS.shtml

Basics

This vignette gives the basic outline of the ripums package. There are also vignettes that show how to use the value labels and geographic data provided by IPUMS and others that run through examples using CPS data and NHGIS data.

To get to them run the following commands:

vignette("value-labels", package = "ripums")
vignette("ipums-geography", package = "ripums")
vignette("ipums-cps", package = "ripums")
vignette("ipums-nhgis", package = "ripums")

Getting Data from the IPUMS website

IPUMS data is downloaded from our website at https://www.ipums.org. The website provides an interactive extract system that allows you to select only the sample and variables that are relevant to your research question.

Currently we do not support the TerraPop, but we hope to add support soon!

For microdata projects (all supported projects except NHGIS), once you have created your extract, you should choose to download data as either fixed-width-files or comma separated ones.

Once your extract is complete, download the data file (.dat or .csv) and the DDI, by right-clicking on the file and selecting “Save As…”. Note that there is no R-specific syntax file, as there is for the other statistical packages like SAS, Stata and SPSS.

Annotated screenshot for downloading microdata

Annotated screenshot for downloading microdata

For NHGIS, download the table data as a csv, and, if you want the associated mapping data, the GIS data. NHGIS provides the option to download csvs with an extra header row, it does not matter which option you select.

Import Functions

Once your extract is downloaded, the ripums package functions read_*() help you load the data into R.

Metadata Functions

Once the data is into R, you can learn information about the extract using the metadata function.

Grab Bag

Survey Weights

The data from most projects contain some form of weighting variable that should be used to calculate estimates that are representative of the whole population. Many projects also provide specifications to help estimate variance given the complex design of the survey, such as replicate weights or design variables like STRATUM and PSU. The survey package provides functions that allow you to estimate variance taking this into account, and the srvyr package implements dplyr-like syntax for survey analysis, using the survey package’s functions.

For more information about what these variables mean and how to use them, see the website for the project you are interested in.

Non-Extract Data

Some projects have data that is not contained within the extract system and so no DDI is provided for this data. In this case, either use the csv if available, or the haven package to read one of the files intended for another statistical software (like Stata, SAS or SPSS).

Value Labels

The way that IPUMS treats value labels does not align with factors (the main way that R is able to store values associated with labels). R’s factor variables can only store values as an integer sequence (1, 2, 3, …), but IPUMS conventions are to store missing / not in universe codes as large numbers, to distinguish them from the normal values.

Therefore, the ripums package uses the labelled class from the haven package to store labelled values. See the “value-labels” vignette for more information (vignette('value-labels')).

In summary, it is generally best if you wish to use the labels, to convert from the labelled class to factor early on in data analysis workflow. This is because many data manipulation functions will lose the associated labels. The function as_factor() is the main function to create factors from labels, but often you will need to do more manipulation before that.

library(ripums)
library(dplyr, warn.conflicts = FALSE)

# Note that you can pass in the loaded DDI into the `read_ipums_micro()`
cps_ddi <- read_ipums_ddi(ripums_example("cps_00006.xml"))
cps_data <- read_ipums_micro(cps_ddi, verbose = FALSE)

# Show which variables have labels
cps_data %>%
  select_if(is.labelled)
#> # A tibble: 7,668 x 3
#>     STATEFIP     MONTH    INCTOT
#>    <int+lbl> <int+lbl> <dbl+lbl>
#>  1        55         3      4883
#>  2        55         3      5800
#>  3        55         3  99999998
#>  4        27         3     14015
#>  5        27         3     16552
#>  6        27         3      6375
#>  7        19         3  99999999
#>  8        19         3         0
#>  9        19         3       600
#> 10        19         3  99999999
#> # ... with 7,658 more rows

# Notice how the tibble print function shows the dbl+lbl class on top

# Investigate labels
ipums_val_labels(cps_data$STATEFIP)
#> # A tibble: 75 x 2
#>      val                  lbl
#>    <dbl>                <chr>
#>  1     1              Alabama
#>  2     2               Alaska
#>  3     4              Arizona
#>  4     5             Arkansas
#>  5     6           California
#>  6     8             Colorado
#>  7     9          Connecticut
#>  8    10             Delaware
#>  9    11 District of Columbia
#> 10    12              Florida
#> # ... with 65 more rows

# Convert the labels to factors (and drop the unused levels)
cps_data <- cps_data %>%
  mutate(STATE_factor = droplevels(as_factor(STATEFIP)))

table(cps_data$STATE_factor, useNA = "always")
#> 
#>         Iowa    Minnesota North Dakota South Dakota    Wisconsin 
#>         1892         2362          188          227         2999 
#>         <NA> 
#>            0
# Manipulating the labelled value before as_factor 
# often leads to losing the information...
# Say we want to set Iowa (STATEFIP == 19) to missing
cps_data <- cps_data %>%
  mutate(STATE_factor2 = as_factor(ifelse(STATEFIP == 19, NA, STATEFIP)))
#> Error in mutate_impl(.data, dots): Evaluation error: no applicable method for 'as_factor' applied to an object of class "c('integer', 'numeric')".
# Currently the best solution is to convert to factor first, then manipulate using
# factor. We hope to improve this.
cps_data <- cps_data %>%
  mutate(STATE_factor3 = droplevels(as_factor(STATEFIP), "Iowa"))

table(cps_data$STATE_factor3, useNA = "always")
#> 
#>    Minnesota North Dakota South Dakota    Wisconsin         <NA> 
#>         2362          188          227         2999         1892

# The as_factor function also has a "levels" argument that can 
# put both the labels and values into the factor
cps_data <- cps_data %>%
  mutate(STATE_factor4 = droplevels(as_factor(STATEFIP, levels = "both")))

table(cps_data$STATE_factor4, useNA = "always")
#> 
#>         [19] Iowa    [27] Minnesota [38] North Dakota [46] South Dakota 
#>              1892              2362               188               227 
#>    [55] Wisconsin              <NA> 
#>              2999                 0

Other IPUMS attributes

Similarly, the other attributes that ripums stores about the data are often lost during an analysis. One way to deal with this is to load the DDI or codebook in addition to the actual data using the functions read_ipums_ddi() and read_ipums_codebook(). This way, when you wish to refer to variable labels or other metadata, you can use the DDI object, which does not get modified during your analysis.

library(ripums)
library(dplyr, warn.conflicts = FALSE)

# Note that you can pass in the loaded DDI into the `read_ipums_micro()`
cps_ddi <- read_ipums_ddi(ripums_example("cps_00006.xml"))
cps_data <- read_ipums_micro(cps_ddi, verbose = FALSE)

# Currently variable description is available for year
ipums_var_desc(cps_data$YEAR)
#> [1] "YEAR reports the year in which the survey was conducted.  YEARP is repeated on person records."

# But after using ifelse it is gone
cps_data <- cps_data %>%
  mutate(YEAR = ifelse(YEAR == 1962, 62, NA))
ipums_var_desc(cps_data$YEAR)
#> [1] NA

# So you can use the DDI
ipums_var_desc(cps_ddi, "YEAR")
#> [1] "YEAR reports the year in which the survey was conducted.  YEARP is repeated on person records."

# The DDI also has file level information that is not available from just
# the data.
ipums_file_info(cps_ddi, "extract_notes")
#> [1] "User-provided description:  Minimal test extract\nSamples: 1962, 1963\nVariables: STATEFIP, INCTOT (automatically Year, SERIAL, HWTSUPP, MONTH, WTSUPP)\nSelect Cases: State - Minnesota, Iowa, Wisconsin, South Dakota, North Dakota"

“dplyr select-style” Syntax

Several functions within the ripums package allow for “dplyr select-style” syntax. This means that they accept either a character vector of values (eg c("YEAR", "AGE")), bare vectors of values (eg c(YEAR, AGE)) and the helper functions allowed in dplyr::select() (eg one_of(c("YEAR", "AGE"))).

library(ripums)
library(dplyr, warn.conflicts = FALSE)

# The vars argument for `read_ipums_micro` uses this syntax
# So these are all equivalent
cf <- ripums_example("cps_00006.xml")
read_ipums_micro(cf, vars = c("YEAR", "INCTOT"), verbose = FALSE) %>%
  names()
#> [1] "YEAR"   "INCTOT"

read_ipums_micro(cf, vars = c(YEAR, INCTOT), verbose = FALSE) %>%
  names()
#> [1] "YEAR"   "INCTOT"

read_ipums_micro(cf, vars = c(one_of("YEAR"), starts_with("INC")), verbose = FALSE) %>%
  names()
#> [1] "YEAR"   "INCTOT"

# `data_layer` and `shape_layer` arguments to `read_nhgis()` and terra functions
# also use it.
# (Sometimes extracts have multiple files, though all examples only have one)
nf <- ripums_example("nhgis0008_csv.zip")
ipums_list_files(nf)
#> # A tibble: 1 x 2
#>    type                                        file
#>   <chr>                                       <chr>
#> 1  data nhgis0008_csv/nhgis0008_ds135_1990_pmsa.csv

ipums_list_files(nf, data_layer = "nhgis0008_csv/nhgis0008_ds135_1990_pmsa.csv")
#> # A tibble: 1 x 2
#>    type                                        file
#>   <chr>                                       <chr>
#> 1  data nhgis0008_csv/nhgis0008_ds135_1990_pmsa.csv

ipums_list_files(nf, data_layer = contains("ds135"))
#> # A tibble: 1 x 2
#>    type                                        file
#>   <chr>                                       <chr>
#> 1  data nhgis0008_csv/nhgis0008_ds135_1990_pmsa.csv

Hierarchical data structures

For certain IPUMS projects, the data is hierarchical, multiple people are included in a single household, or multiple activities are performed by a single person. The ripums package provides two data structures for storing such data (for users who did not select the “rectangularize” option on the website). The data can be loaded as a "list" or "long".

List data loads each record type into a separate data.frame. The names of the recordtype data.frames are the value of the RECTYPE variable (eg “H” and “P”). Use the function read_ipums_micro_list() to load the data this way.

Long data has one row per unit, regardless of what type of record the unit is. Therefore, datasets loaded this way often contain variables with a large number of missings, for the variables that only apply to certain record types. Use the function read_ipums_micro() to load the data this way.

library(ripums)
library(dplyr, warn.conflicts = FALSE)

# List data
cps <- read_ipums_micro_list(
  ripums_example("cps_00010.xml"),
  verbose = FALSE
)

cps$PERSON
#> # A tibble: 7,668 x 6
#>      RECTYPE  YEAR SERIAL PERNUM  WTSUPP    INCTOT
#>    <chr+lbl> <dbl>  <dbl>  <dbl>   <dbl> <dbl+lbl>
#>  1         P  1962     80      1 1475.59      4883
#>  2         P  1962     80      2 1470.72      5800
#>  3         P  1962     80      3 1578.75  99999998
#>  4         P  1962     82      1 1597.61     14015
#>  5         P  1962     83      1 1706.65     16552
#>  6         P  1962     84      1 1790.25      6375
#>  7         P  1962    107      1 4355.40  99999999
#>  8         P  1962    107      2 1385.81         0
#>  9         P  1962    107      3 1629.10       600
#> 10         P  1962    107      4 1432.24  99999999
#> # ... with 7,658 more rows

cps$HOUSEHOLD
#> # A tibble: 3,385 x 6
#>      RECTYPE  YEAR SERIAL HWTSUPP  STATEFIP     MONTH
#>    <chr+lbl> <dbl>  <dbl>   <dbl> <int+lbl> <int+lbl>
#>  1         H  1962     80 1475.59        55         3
#>  2         H  1962     82 1597.61        27         3
#>  3         H  1962     83 1706.65        27         3
#>  4         H  1962     84 1790.25        27         3
#>  5         H  1962    107 4355.40        19         3
#>  6         H  1962    108 1479.05        19         3
#>  7         H  1962    122 3602.75        27         3
#>  8         H  1962    124 4104.41        55         3
#>  9         H  1962    125 2182.17        55         3
#> 10         H  1962    126 1826.38        55         3
#> # ... with 3,375 more rows

# Long data
cps <- read_ipums_micro(
  ripums_example("cps_00010.xml"),
  verbose = FALSE
)

cps
#> # A tibble: 11,053 x 9
#>      RECTYPE  YEAR SERIAL HWTSUPP  STATEFIP     MONTH PERNUM  WTSUPP
#>    <chr+lbl> <dbl>  <dbl>   <dbl> <int+lbl> <int+lbl>  <dbl>   <dbl>
#>  1         H  1962     80 1475.59        55         3     NA      NA
#>  2         P  1962     80      NA        NA        NA      1 1475.59
#>  3         P  1962     80      NA        NA        NA      2 1470.72
#>  4         P  1962     80      NA        NA        NA      3 1578.75
#>  5         H  1962     82 1597.61        27         3     NA      NA
#>  6         P  1962     82      NA        NA        NA      1 1597.61
#>  7         H  1962     83 1706.65        27         3     NA      NA
#>  8         P  1962     83      NA        NA        NA      1 1706.65
#>  9         H  1962     84 1790.25        27         3     NA      NA
#> 10         P  1962     84      NA        NA        NA      1 1790.25
#> # ... with 11,043 more rows, and 1 more variables: INCTOT <dbl+lbl>

Geospatial Packages: sf vs sp

The ripums package allows for loading geospatial data in two formats (sf for Simple Features and sp for Spatial). The sf package is relatively new, and so does not have as widespread support as the sp package. However, (in my opinion) it does allow for easier analysis, and so may be a better place to start if you have not used GIS data in R before.

For more details about how to load geographic data using ripums, see the vignette “ipums-geography” (vignette("ipums-geography", package = "ripums"))