---
title: "Getting started with pulso"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting started with pulso}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = TRUE
)
```

# Loading GEIH microdata with pulso

`pulso` provides programmatic access to Colombia's Gran Encuesta
Integrada de Hogares (GEIH), the household labor force survey
published monthly by DANE (Departamento Administrativo Nacional
de Estadística).

## Quick start

```{r, eval=FALSE}
library(pulso)

# 2024-06 is a validated period -- loads without any warning
df <- pulso_load(year = 2024, month = 6, module = "ocupados")
```

The result is a tibble with the survey microdata. By default, all
columns are returned with their original DANE codes (e.g., P6020,
P3271).

## Validated periods and the allow_unvalidated parameter

`pulso` maintains a registry of periods that have been manually verified
against DANE's published figures. As of v0.1.0-rc2, **5 periods are
validated**:

- 2007-12
- 2015-06
- 2021-12
- 2022-01
- 2024-06

For all other periods, `pulso_load()` raises a `pulso_data_not_validated`
error by default:

```{r, eval=FALSE}
# Raises pulso_data_not_validated -- 2024-09 is not yet validated
df <- pulso_load(year = 2024, month = 9, module = "ocupados")

# Explicitly allow unvalidated periods -- emits a visible warning
df <- pulso_load(year = 2024, month = 9, module = "ocupados",
                 allow_unvalidated = TRUE)
```
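
Because the rejection is signalled as an error named
`pulso_data_not_validated`, a script can react to it specifically instead
of catching every failure. The sketch below assumes that the condition
object carries `pulso_data_not_validated` as one of its classes, which
this vignette does not state explicitly. `load_with_fallback()` and its
`loader` argument are hypothetical names introduced for illustration.

```{r, eval=FALSE}
# Hypothetical helper: try the strict load first and opt in to unvalidated
# data only when the period is rejected for that specific reason.
# Assumption: the error has class "pulso_data_not_validated".
load_with_fallback <- function(year, month, module,
                               loader = pulso::pulso_load) {
  tryCatch(
    loader(year = year, month = month, module = module),
    pulso_data_not_validated = function(e) {
      message("Period ", year, "-", month,
              " is not validated; retrying with allow_unvalidated = TRUE")
      loader(year = year, month = month, module = module,
             allow_unvalidated = TRUE)
    }
  )
}
```

Under that assumption, `load_with_fallback(2024, 9, "ocupados")` would
print a message and then load the data, while any other error (a network
failure, for instance) still propagates unchanged.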

To check the validation status of a specific period:

```{r, eval=FALSE}
pulso_validation_status(2024, 6)
```

Or list all validated periods:

```{r, eval=FALSE}
pulso_list_validated_range()
```
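
When planning an analysis it can help to check several periods at once
before downloading anything. The chunk below is a minimal base-R sketch
that works from a hand-copied version of the list above rather than from
the package registry, whose return format is not shown in this vignette.
`validated_periods`, `period_key()`, and `is_validated()` are hypothetical
names, and `pulso_list_validated_range()` remains the authoritative source.

```{r}
# Hand-copied from the list above (v0.1.0-rc2); not read from the package.
validated_periods <- c("2007-12", "2015-06", "2021-12", "2022-01", "2024-06")

# Format a year/month pair as "YYYY-MM".
period_key <- function(year, month) {
  sprintf("%04d-%02d", as.integer(year), as.integer(month))
}

# TRUE for periods that load without allow_unvalidated = TRUE.
is_validated <- function(year, month) {
  period_key(year, month) %in% validated_periods
}

is_validated(2024, c(6, 9))
```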

## Accessing variable metadata

Pass `metadata = TRUE` to get DANE codebook information attached
to the result:

```{r, eval=FALSE}
df <- pulso_load(year = 2024, month = 6, module = "ocupados",
                 metadata = TRUE)
```

You can describe individual columns:

```{r, eval=FALSE}
cat(pulso_describe_column(df, "p6020"))
```

Or list metadata for all columns:

```{r, eval=FALSE}
metadata_summary <- pulso_list_columns_metadata(df)
print(metadata_summary)
```

## Exploring the variable catalog

`pulso` ships a canonical variable catalog (`variable_map.json`) that
maps harmonized variable names to their epoch-specific DANE source
codes. These catalog functions work offline -- no data download needed.

List all canonical variables (first 10 rows):

```{r}
library(pulso)
vars <- pulso_list_variables()
head(vars[, c("canonical_name", "module", "has_warning")], 10)
```
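
The `has_warning` column makes it possible to separate entries that still
need checking from the rest of the catalog. A minimal sketch, assuming
`has_warning` is (or can be coerced to) a logical column;
`flagged_variables()` is a hypothetical helper, not part of the package.

```{r, eval=FALSE}
# Hypothetical helper: keep only catalog rows whose has_warning flag is TRUE,
# sorted by module and then by canonical name.
flagged_variables <- function(vars) {
  keep <- which(as.logical(vars$has_warning))  # which() also drops NA flags
  flagged <- vars[keep, c("canonical_name", "module"), drop = FALSE]
  flagged[order(flagged$module, flagged$canonical_name), , drop = FALSE]
}

# Usage with the catalog loaded above:
# flagged_variables(vars)
```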

Describe a single canonical variable and its epoch mappings:

```{r}
cat(pulso_describe_variable("sexo"))
```

Describe a survey module (reads `sources.json` bundled in the package):

```{r}
cat(pulso_describe("ocupados"))
```

## What is GEIH?

GEIH is Colombia's primary labor market survey, conducted monthly
since 2007. It collects data on:

- Labor force participation (employed, unemployed, inactive)
- Wages and informal employment
- Demographic characteristics (age, sex, education)
- Household composition

Microdata is freely published by DANE in monthly zip files.
`pulso` automates the download, parsing, and harmonization across
the three GEIH design epochs (2007-2018, 2019-2023, 2024-present).
A fourth epoch, the historical ECH (2000-2006), is not yet supported.
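
The epoch boundaries above translate into a simple year lookup, which can
be handy when grouping results by survey design. This is an illustrative
sketch only: `geih_epoch()` and its labels are hypothetical and not part
of the package.

```{r}
# Hypothetical helper: map a survey year to the design epoch named above.
geih_epoch <- function(year) {
  breaks <- c(2000, 2007, 2019, 2024, Inf)
  labels <- c("ECH 2000-2006", "GEIH 2007-2018",
              "GEIH 2019-2023", "GEIH 2024-present")
  # right = FALSE makes each interval closed on the left: [2007, 2019), ...
  as.character(cut(year, breaks = breaks, labels = labels, right = FALSE))
}

geih_epoch(c(2006, 2007, 2018, 2019, 2024))
```

Years before 2000 fall outside every interval and come back as `NA`.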

## Comparison with the Python package

`pulso` (R) mirrors the API of `pulso-co` (Python). For example:

```python
# Python
import pulso
df = pulso.load(year=2024, month=6, module="ocupados", metadata=True)
print(pulso.describe_column(df, "P6020"))
```

```{r, eval=FALSE}
# R
library(pulso)
df <- pulso_load(year = 2024, month = 6, module = "ocupados",
                 metadata = TRUE)
cat(pulso_describe_column(df, "p6020"))
```

Both packages share the same canonical data files (`sources.json`,
`variable_map.json`, `dane_codebook.json`) via the monorepo at
<https://github.com/Stebandido77/pulso>.

## Caching

Downloaded microdata is cached at
`tools::R_user_dir("pulso", "cache")` to avoid re-downloading.
Pass `cache = FALSE` to force re-download.
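
To see how much disk space the cache occupies, the directory can be
inspected with base R. The sketch below only assumes the cache location
stated above. The layout of files inside it is not documented here, so
they are simply listed recursively. Note that `tools::R_user_dir()`
requires R 4.0.0 or later.

```{r, eval=FALSE}
cache_dir <- tools::R_user_dir("pulso", "cache")

# list.files() returns character(0) if the directory does not exist yet.
cached <- list.files(cache_dir, recursive = TRUE, full.names = TRUE)
size_mb <- sum(file.size(cached)) / 1024^2

cat(length(cached), "cached file(s),", round(size_mb, 1), "MB in",
    cache_dir, "\n")
```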

## Breaking changes in 0.1.0-rc2

If you used `pulso_load()` in earlier development versions, note that
**the default behavior has changed for unvalidated periods**:

- **Before:** `pulso_load()` loaded every period silently, whether or
  not it had been validated.
- **After:** `pulso_load()` raises `pulso_data_not_validated` unless
  `allow_unvalidated = TRUE` is specified.

This change aligns the R package with `pulso-co` (Python) and protects
users from inadvertently using unvalidated data.

## Coverage and limitations

`pulso` v0.1.0-rc2 supports the following:

- Single year/month/module loads via `pulso_load()`
- Multi-module persona-level merges via `pulso_load_merged()`
- Column metadata via `pulso_describe_column()` and
  `pulso_list_columns_metadata()`
- Module discovery via `pulso_describe()`
- Canonical variable catalog via `pulso_describe_variable()` and
  `pulso_list_variables()`
- Validation status queries via `pulso_validation_status()` and
  `pulso_list_validated_range()`
- Coverage: 2007-01 to present, as listed in `sources.json`

Known limitations:

- Only 5 of 230 periods are validated. Use `allow_unvalidated = TRUE`
  for the rest, keeping in mind that results may differ from DANE's
  official tables.
- Curator entries in `variable_map.json` are theoretical mappings
  pending empirical verification. Use `has_warning` from
  `pulso_list_variables()` to identify these entries.
- Nested-zip periods (2024-03, 2024-04) are deferred to v0.2.0.
- Mixed-level merges (persona + hogar) in `pulso_load_merged()` are
  deferred to v0.2.0.
- The ECH epoch (2000-2006) is not yet supported; it is planned for v0.2.0.
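
Taken together, these limitations suggest a cautious pattern for loading
a full year: skip the deferred nested-zip periods and opt in to
unvalidated data explicitly. The following is a sketch under those
assumptions. `load_year()`, `deferred`, and the `loader` argument are
hypothetical names, and it is assumed (not documented here) that passing
`allow_unvalidated = TRUE` is harmless for periods that are already
validated.

```{r, eval=FALSE}
# Hypothetical helper: load several months of one module into a named list.
load_year <- function(year, module, months = 1:12,
                      loader = pulso::pulso_load) {
  # Nested-zip periods listed above as deferred to v0.2.0.
  deferred <- c("2024-03", "2024-04")
  keys <- sprintf("%04d-%02d", as.integer(year), as.integer(months))
  todo <- !keys %in% deferred

  out <- lapply(months[todo], function(m) {
    loader(year = year, month = m, module = module,
           allow_unvalidated = TRUE)
  })
  names(out) <- keys[todo]
  out  # named list with one element per loaded period
}
```

Each unvalidated period would still emit its warning, so the opt-in stays
visible in the console output.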

See the project's GitHub issues for the roadmap and further known limitations.
