pulso provides programmatic access to Colombia’s Gran
Encuesta Integrada de Hogares (GEIH), the household labor force survey
published monthly by DANE (Departamento Administrativo Nacional de
Estadística).
library(pulso)
# 2024-06 is a validated period -- loads without any warning
df <- pulso_load(year = 2024, month = 6, module = "ocupados")

The result is a tibble with the survey microdata. By default, all columns are returned with their original DANE codes (e.g., P6020, P3271).
pulso maintains a registry of periods that have been manually verified against DANE published figures. As of v0.1.0-rc2, 5 periods are validated:
For all other periods, pulso_load() raises a
pulso_data_not_validated error by default:
# Raises pulso_data_not_validated -- 2024-09 is not yet validated
df <- pulso_load(year = 2024, month = 9, module = "ocupados")
# Explicitly allow unvalidated periods -- emits a visible warning
df <- pulso_load(year = 2024, month = 9, module = "ocupados",
                 allow_unvalidated = TRUE)

To check the validation status of a specific period, use pulso_validation_status(). To list all validated periods, use pulso_list_validated_range().
Pass metadata = TRUE to get DANE codebook information
attached to the result:
You can describe individual columns with pulso_describe_column(), or list metadata for all columns with pulso_list_columns_metadata().
pulso ships a canonical variable catalog
(variable_map.json) that maps harmonized variable names to
their epoch-specific DANE source codes. These catalog functions work
offline – no data download needed.
List all canonical variables (first 10 rows):
library(pulso)
vars <- pulso_list_variables()
head(vars[, c("canonical_name", "module", "has_warning")], 10)
#> # A tibble: 10 × 3
#>    canonical_name      module                    has_warning
#>    <chr>               <chr>                     <lgl>
#>  1 alfabetiza          caracteristicas_generales FALSE
#>  2 anios_educ          caracteristicas_generales TRUE
#>  3 area                caracteristicas_generales TRUE
#>  4 asiste_educ         caracteristicas_generales FALSE
#>  5 busco_trabajo       desocupados               TRUE
#>  6 condicion_actividad caracteristicas_generales TRUE
#>  7 cotiza_pension      ocupados                  FALSE
#>  8 departamento        caracteristicas_generales FALSE
#>  9 disponible          desocupados               TRUE
#> 10 edad                caracteristicas_generales FALSE

Describe a single canonical variable and its epoch mappings:
cat(pulso_describe_variable("sexo"))
#> Variable: sexo
#> Module: caracteristicas_generales
#> Description: Sexo de la persona.
#> Epochs:
#> geih_2006_2020: P6020
#> geih_2021_present: P3271
#> WARNING: El código de variable cambió de P6020 (marco 2005) a P3271 (marco 2018). La Phase 1 Curator reportó P6016 como sexo en GEIH-2, pero los datos del June 2024 muestran P6016 con 17+ valores. P3271 (binario 1/2, cubre 70.020 personas) es el candidato confirmado; requiere verificación humana contra el cuestionario DANE.

Describe a survey module (reads sources.json bundled in the package):
cat(pulso_describe("ocupados"))
#> Module: ocupados
#> Level: persona
#> Description: Información laboral de las personas ocupadas en la semana de referencia.
#> Available in epochs: geih_2006_2020, geih_2021_present
#> Harmonized variables (11): cotiza_pension, hogar_id, horas_trabajadas_sem, ingreso_laboral, ocupacion ... and 6 more

GEIH is Colombia’s primary labor market survey, conducted monthly since 2007. It collects data on:
Microdata is freely published by DANE in monthly zip files.
pulso automates the download, parsing, and harmonization
across the four GEIH design epochs (2007-2018, 2019-2023, 2024-present,
plus the historical ECH 2000-2006).
pulso (R) mirrors the API of pulso-co
(Python). For example:
# Python
import pulso
df = pulso.load(year=2024, month=6, module="ocupados", metadata=True)
print(pulso.describe_column(df, "P6020"))

# R
library(pulso)
df <- pulso_load(year = 2024, month = 6, module = "ocupados",
                 metadata = TRUE)
cat(pulso_describe_column(df, "p6020"))

Both packages share the same canonical data files (sources.json, variable_map.json, dane_codebook.json) via the monorepo at https://github.com/Stebandido77/pulso.
Downloaded microdata is cached at
tools::R_user_dir("pulso", "cache") to avoid
re-downloading. Pass cache = FALSE to force
re-download.
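
To see where the cache lives on a given machine, the following minimal sketch resolves the directory with base R and lists whatever is stored there. It relies only on tools::R_user_dir(), which is available in R 4.0.0 and later. The layout of files inside the cache is not described in this document, so the listing is purely exploratory.

```r
# Resolve the per-user cache directory that pulso is documented to use.
cache_dir <- tools::R_user_dir("pulso", "cache")
print(cache_dir)

# The directory only exists once something has been downloaded.
if (dir.exists(cache_dir)) {
  cached_files <- list.files(cache_dir, recursive = TRUE, full.names = TRUE)
  # Report file sizes in megabytes. The file layout is an implementation
  # detail of the package and is not assumed here.
  print(data.frame(
    file    = basename(cached_files),
    size_mb = round(file.size(cached_files) / 1024^2, 2)
  ))
} else {
  message("No pulso cache found yet at: ", cache_dir)
}
```

Deleting the directory by hand with unlink(cache_dir, recursive = TRUE) would clear everything at once, whereas cache = FALSE bypasses the cache for a single call.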
If you used pulso_load() in earlier development
versions, note that the default behavior has changed for
unvalidated periods:
pulso_load() now raises pulso_data_not_validated for these periods unless allow_unvalidated = TRUE is specified.

This change aligns the R package with pulso-co (Python) and protects users from inadvertently using unvalidated data.
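
Scripts that loop over many periods may prefer to skip unvalidated ones rather than stop. The sketch below shows one way to do that with tryCatch(). It assumes the error is signaled as an R condition whose class vector includes "pulso_data_not_validated"; the text names the error but does not show its class structure, so treat this as an assumption. load_validated_or_null() is a hypothetical helper, not part of pulso, and fake_loader() is a stand-in for pulso_load() so the example runs without the package or a network connection.

```r
# Hypothetical helper (not part of pulso): call a loader and return NULL
# instead of stopping when the requested period is not validated.
# ASSUMPTION: the error carries the condition class "pulso_data_not_validated".
load_validated_or_null <- function(loader, ...) {
  tryCatch(
    loader(...),
    pulso_data_not_validated = function(e) {
      message("Skipping unvalidated period: ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-in for pulso_load() so the sketch is self-contained.
# It signals a classed error for every period except 2024-06.
fake_loader <- function(year, month, module) {
  if (!(year == 2024 && month == 6)) {
    stop(structure(
      class = c("pulso_data_not_validated", "error", "condition"),
      list(message = sprintf("%d-%02d is not validated", year, month),
           call = NULL)
    ))
  }
  data.frame(year = year, month = month, module = module)
}

ok      <- load_validated_or_null(fake_loader, year = 2024, month = 6,
                                  module = "ocupados")
skipped <- load_validated_or_null(fake_loader, year = 2024, month = 9,
                                  module = "ocupados")
```

With pulso installed, the same helper could be called as load_validated_or_null(pulso_load, year = 2024, month = 9, module = "ocupados"), provided the class assumption above holds.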
pulso v0.1.0-rc2 supports the following:
- pulso_load()
- pulso_load_merged()
- pulso_describe_column() and pulso_list_columns_metadata()
- pulso_describe()
- pulso_describe_variable() and pulso_list_variables()
- pulso_validation_status() and pulso_list_validated_range()

Known limitations:
- Only the validated periods have been checked against DANE published figures. Use allow_unvalidated = TRUE for the rest, with awareness that results may differ from DANE official tables.
- Some entries in variable_map.json are theoretical mappings pending empirical verification. Use has_warning from pulso_list_variables() to identify these entries.
- […] pulso_load_merged() are deferred to v0.2.0.

See the GitHub issues for roadmap and known limitations.
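
The has_warning flag mentioned in the limitations can be used to isolate the catalog entries that still need verification. The sketch below works on a toy data frame that copies the first 10 catalog rows shown earlier in this document, so it runs without the package. With pulso installed, vars would presumably come from pulso_list_variables() instead. The name flagged is illustrative.

```r
# Toy copy of the first 10 catalog rows shown earlier. With pulso installed,
# replace this with: vars <- pulso_list_variables()
vars <- data.frame(
  canonical_name = c("alfabetiza", "anios_educ", "area", "asiste_educ",
                     "busco_trabajo", "condicion_actividad", "cotiza_pension",
                     "departamento", "disponible", "edad"),
  module = c("caracteristicas_generales", "caracteristicas_generales",
             "caracteristicas_generales", "caracteristicas_generales",
             "desocupados", "caracteristicas_generales", "ocupados",
             "caracteristicas_generales", "desocupados",
             "caracteristicas_generales"),
  has_warning = c(FALSE, TRUE, TRUE, FALSE, TRUE,
                  TRUE, FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Keep only the variables whose mapping still carries a warning.
flagged <- vars[vars$has_warning, c("canonical_name", "module")]
print(flagged)

# Count flagged variables per module to see where verification is pending.
print(table(flagged$module))
```

Each flagged name can then be passed to pulso_describe_variable() to read the warning text for that variable.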