Getting started with pulso

Loading GEIH microdata with pulso

pulso provides programmatic access to Colombia’s Gran Encuesta Integrada de Hogares (GEIH), the household labor force survey published monthly by DANE (Departamento Administrativo Nacional de Estadística).

Quick start

library(pulso)

# 2024-06 is a validated period -- loads without any warning
df <- pulso_load(year = 2024, month = 6, module = "ocupados")

The result is a tibble with the survey microdata. By default, all columns are returned with their original DANE codes (e.g., P6020, P3271).

Validated periods and the allow_unvalidated parameter

pulso maintains a registry of periods that have been manually verified against DANE published figures. As of v0.1.0-rc2, five periods are validated, 2024-06 among them; pulso_list_validated_range() lists them all.

For all other periods, pulso_load() raises a pulso_data_not_validated error by default:

# Raises pulso_data_not_validated -- 2024-09 is not yet validated
df <- pulso_load(year = 2024, month = 9, module = "ocupados")

# Explicitly allow unvalidated periods -- emits a visible warning
df <- pulso_load(year = 2024, month = 9, module = "ocupados",
                 allow_unvalidated = TRUE)
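
A script that loops over many periods may prefer to handle this error rather than stop. The sketch below shows one way to do that with base R's tryCatch(). It assumes the error is signalled as a condition whose class includes pulso_data_not_validated (the name used above). load_period_stub() and try_load() are hypothetical helpers defined here so the example runs without the package or a download; they are not part of pulso.

```r
# Hypothetical stand-in for pulso_load(): it only mimics the documented
# behaviour of raising a classed error for an unvalidated period.
load_period_stub <- function(year, month, validated = c("2024-06")) {
  period <- sprintf("%d-%02d", year, month)
  if (!period %in% validated) {
    stop(structure(
      list(message = paste("Period", period, "is not validated"), call = NULL),
      class = c("pulso_data_not_validated", "error", "condition")
    ))
  }
  data.frame(period = period, stringsAsFactors = FALSE)
}

# Return the data when the period loads, NULL when it is unvalidated.
try_load <- function(year, month) {
  tryCatch(
    load_period_stub(year, month),
    pulso_data_not_validated = function(e) {
      message("Skipping: ", conditionMessage(e))
      NULL
    }
  )
}

ok <- try_load(2024, 6)       # a one-row data frame
skipped <- try_load(2024, 9)  # NULL, with a message
```

With the real package, the same handler would wrap pulso_load() in place of the stub, provided the class assumption holds.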

To check the validation status of a specific period:

pulso_validation_status(2024, 6)

Or list all validated periods:

pulso_list_validated_range()

Accessing variable metadata

Pass metadata = TRUE to get DANE codebook information attached to the result:

df <- pulso_load(year = 2024, month = 6, module = "ocupados",
                 metadata = TRUE)

You can describe individual columns:

cat(pulso_describe_column(df, "p6020"))

Or list metadata for all columns:

metadata_summary <- pulso_list_columns_metadata(df)
print(metadata_summary)

Exploring the variable catalog

pulso ships a canonical variable catalog (variable_map.json) that maps harmonized variable names to their epoch-specific DANE source codes. These catalog functions work offline; no data download is needed.

List all canonical variables (first 10 rows):

library(pulso)
vars <- pulso_list_variables()
head(vars[, c("canonical_name", "module", "has_warning")], 10)
#> # A tibble: 10 × 3
#>    canonical_name      module                    has_warning
#>    <chr>               <chr>                     <lgl>      
#>  1 alfabetiza          caracteristicas_generales FALSE      
#>  2 anios_educ          caracteristicas_generales TRUE       
#>  3 area                caracteristicas_generales TRUE       
#>  4 asiste_educ         caracteristicas_generales FALSE      
#>  5 busco_trabajo       desocupados               TRUE       
#>  6 condicion_actividad caracteristicas_generales TRUE       
#>  7 cotiza_pension      ocupados                  FALSE      
#>  8 departamento        caracteristicas_generales FALSE      
#>  9 disponible          desocupados               TRUE       
#> 10 edad                caracteristicas_generales FALSE
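
The has_warning column is a convenient filter for variables whose harmonization carries a caveat. The sketch below re-enters the ten rows printed above as a plain data frame so that it runs without the package. With pulso installed, the same subsetting would presumably apply directly to the result of pulso_list_variables().

```r
# The ten catalog rows shown above, re-entered by hand for illustration.
vars_head <- data.frame(
  canonical_name = c("alfabetiza", "anios_educ", "area", "asiste_educ",
                     "busco_trabajo", "condicion_actividad", "cotiza_pension",
                     "departamento", "disponible", "edad"),
  module = c(rep("caracteristicas_generales", 4), "desocupados",
             "caracteristicas_generales", "ocupados",
             "caracteristicas_generales", "desocupados",
             "caracteristicas_generales"),
  has_warning = c(FALSE, TRUE, TRUE, FALSE, TRUE,
                  TRUE, FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Keep only the variables flagged with a warning.
flagged <- vars_head[vars_head$has_warning, ]

# Count flagged variables per module.
flagged_by_module <- table(flagged$module)
print(flagged$canonical_name)
print(flagged_by_module)
```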

Describe a single canonical variable and its epoch mappings:

cat(pulso_describe_variable("sexo"))
#> Variable: sexo
#> Module: caracteristicas_generales
#> Description: Sexo de la persona.
#> Epochs:
#>   geih_2006_2020: P6020
#>   geih_2021_present: P3271
#> WARNING: El código de variable cambió de P6020 (marco 2005) a P3271 (marco 2018). La Phase 1 Curator reportó P6016 como sexo en GEIH-2, pero los datos del June 2024 muestran P6016 con 17+ valores. P3271 (binario 1/2, cubre 70.020 personas) es el candidato confirmado; requiere verificación humana contra el cuestionario DANE.
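
To make the epoch mapping concrete, the sketch below picks the DANE source code for sexo by survey year. The year boundaries are read off the epoch names (geih_2006_2020, geih_2021_present) and are an assumption, not something taken from variable_map.json. The helper source_code_for_year() is hypothetical and not part of pulso.

```r
# Epoch table for `sexo`, copied from the output above. The year ranges are
# an assumption inferred from the epoch names.
sexo_epochs <- data.frame(
  epoch = c("geih_2006_2020", "geih_2021_present"),
  first_year = c(2006, 2021),
  last_year = c(2020, Inf),
  code = c("P6020", "P3271"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the source code whose epoch covers `year`,
# or NA when no epoch matches.
source_code_for_year <- function(epochs, year) {
  hit <- epochs$first_year <= year & year <= epochs$last_year
  if (any(hit)) epochs$code[which(hit)[1]] else NA_character_
}

code_2019 <- source_code_for_year(sexo_epochs, 2019)
code_2024 <- source_code_for_year(sexo_epochs, 2024)
```

A lookup of this kind is what harmonization spares the user: the canonical name sexo stays the same while the underlying source column changes between epochs.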

Describe a survey module (reads sources.json bundled in the package):

cat(pulso_describe("ocupados"))
#> Module: ocupados
#> Level: persona
#> Description: Información laboral de las personas ocupadas en la semana de referencia.
#> Available in epochs: geih_2006_2020, geih_2021_present
#> Harmonized variables (11): cotiza_pension, hogar_id, horas_trabajadas_sem, ingreso_laboral, ocupacion ... and 6 more

What is GEIH?

GEIH is Colombia’s primary labor market survey, conducted monthly since 2007. It collects data on:

Microdata is freely published by DANE in monthly zip files. pulso automates the download, parsing, and harmonization across the four GEIH design epochs (2007-2018, 2019-2023, 2024-present, plus the historical ECH 2000-2006).

Comparison with the Python package

pulso (R) mirrors the API of pulso-co (Python). For example:

# Python
import pulso
df = pulso.load(year=2024, month=6, module="ocupados", metadata=True)
print(pulso.describe_column(df, "P6020"))

# R
library(pulso)
df <- pulso_load(year = 2024, month = 6, module = "ocupados",
                 metadata = TRUE)
cat(pulso_describe_column(df, "p6020"))

Both packages share the same canonical data files (sources.json, variable_map.json, dane_codebook.json) via the monorepo at https://github.com/Stebandido77/pulso.

Caching

Downloaded microdata is cached at tools::R_user_dir("pulso", "cache") to avoid re-downloading. Pass cache = FALSE to force re-download.
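
To see where the cache lives and how much space it uses, base R is enough. The sketch below relies only on tools::R_user_dir() (available from R 4.0.0) and standard file utilities. The layout of files inside the directory is not documented here, so the listing is deliberately generic.

```r
# Resolve the cache directory named above (requires R >= 4.0.0).
cache_dir <- tools::R_user_dir("pulso", "cache")

# List cached files, if the directory exists yet.
cached_files <- if (dir.exists(cache_dir)) {
  list.files(cache_dir, recursive = TRUE, full.names = TRUE)
} else {
  character(0)
}

# Total size in megabytes (0 when nothing is cached).
cache_mb <- sum(file.size(cached_files), na.rm = TRUE) / 1024^2

cat("Cache directory:", cache_dir, "\n")
cat("Files:", length(cached_files), " Size (MB):", round(cache_mb, 1), "\n")
```

Removing the directory with unlink(cache_dir, recursive = TRUE) would clear the cache by hand; whether pulso offers its own helper for this is not stated here.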

Breaking changes in 0.1.0-rc2

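If you used pulso_load() in earlier development versions, note that the default behavior for unvalidated periods has changed: pulso_load() now raises a pulso_data_not_validated error unless you pass allow_unvalidated = TRUE, in which case the data loads with a visible warning.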

This change aligns the R package with pulso-co (Python) and protects users from inadvertently using unvalidated data.

Coverage and limitations

pulso v0.1.0-rc2 supports the following:

Known limitations:

See the GitHub issues for roadmap and known limitations.