
Introduction to xtsum

Joao Claudio Macosso

2023-12-07

Introduction

xtsum is an R wrapper based on the Stata xtsum command; it is used to provide summary statistics for a panel data set. It decomposes the variable \(x_{it}\) into a between component \((\bar{x}_i)\) and a within component \((x_{it} - \bar{x}_i + \bar{\bar{x}})\), the global mean \(\bar{\bar{x}}\) being added back in to make results comparable; see (StataCorp 2023).
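The decomposition can be reproduced by hand in base R. The sketch below uses a small made-up panel (the data frame `panel` and its columns `id`, `t` and `x` are illustrative assumptions, not part of the package) and does not call xtsum itself.

```r
# Toy unbalanced panel: three individuals observed over two or three periods.
# All values are invented for illustration.
panel <- data.frame(
  id = c(1, 1, 1, 2, 2, 2, 3, 3),
  t  = c(1, 2, 3, 1, 2, 3, 1, 2),
  x  = c(2, 4, 6, 10, 12, 14, 5, 7)
)

global_mean <- mean(panel$x)            # global mean of x over all rows
group_mean  <- ave(panel$x, panel$id)   # individual mean, repeated on each row of that individual

# Between component: one mean per individual.
between_x <- tapply(panel$x, panel$id, mean)

# Within component: deviation from the individual mean, with the global mean added back.
within_x <- panel$x - group_mean + global_mean

between_x
within_x
```

Because the global mean is added back, the within component has the same mean as the original variable, which is what makes the overall and within rows comparable.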

Installation

install.packages("xtsum")

# For dev version
# install.packages("devtools")
devtools::install_github("macosso/xtsum")

Getting Started

# Load the library
library(xtsum)

xtsum

This function computes summary statistics for panel data, including overall statistics, between-group statistics, and within-group statistics.

Usage

xtsum(
  data,
  variables = NULL,
  id = NULL,
  t = NULL,
  na.rm = FALSE,
  return.data.frame = TRUE,
  dec = 3
)

Arguments

data A data.frame or pdata.frame object representing panel data.
variables (Optional) Vector of variable names for which to calculate statistics. If not provided, all numeric variables in the data will be used.
id (Optional) Name of the individual identifier variable.
t (Optional) Name of the time identifier variable.
na.rm Logical indicating whether to remove NAs when calculating statistics.
return.data.frame Logical indicating whether the returned object should be a data.frame.
dec Number of significant digits to report

Example

General example

Based on the National Longitudinal Survey of Young Women, 14-24 years old in 1968.

data("nlswork", package = "sampleSelection")
xtsum(nlswork, "hours", id = "idcode", t = "year", na.rm = TRUE, dec = 6)

Variable	Dim	Mean	SD	Min	Max	Observations
___________	_________
hours	overall	36.55956	9.869623	1	168	N = 28467
	between		7.846585	1	83.5	n = 4710
	within		7.520712	-2.154726	130.05956	T = 6.043949

The table above can be interpreted as follows, paraphrased from (StataCorp 2023).

The overall and within statistics are calculated over N = 28,467 person-years of data. The between statistics are calculated over n = 4,710 persons, and the average number of years a person was observed in the hours data is T = 6.04.
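The three counts are related: in the table above, 28467 / 4710 ≈ 6.043949, so T is the average number of observations per person. A minimal base R sketch on made-up data (the vectors `id` and `x` are assumptions for illustration) shows the same bookkeeping when some values are missing.

```r
# Toy panel with one missing value; all values are invented for illustration.
id <- c(1, 1, 1, 2, 2, 2, 3, 3)
x  <- c(2, 4, NA, 10, 12, 14, 5, 7)

observed <- !is.na(x)                 # rows that actually contribute to the statistics

N     <- sum(observed)                # number of individual-period observations
n     <- length(unique(id[observed])) # number of individuals with at least one observation
T_bar <- N / n                        # average number of observations per individual

c(N = N, n = n, T_bar = T_bar)
```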

xtsum also reports standard deviations (SD), minimums (Min), and maximums (Max).

Hours worked varied between the overall Min = 1 and the overall Max = 168. Average hours worked for each woman ranged from the between Min = 1 to the between Max = 83.5. “Hours worked within” varied between the within Min = −2.15 and the within Max = 130.1, which is not to say that any woman actually worked negative hours. The within number refers to the deviation from each individual’s average, and naturally, some of those deviations must be negative. The negative value is therefore not disturbing, but the positive value is. Did some woman really deviate from her average by +130.1 hours? No. In our definition of within, we add back in the global average of 36.6 hours. Some woman did deviate from her average by 130.1 − 36.6 = 93.5 hours, which is still large.
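The same arithmetic can be checked on a small made-up example in base R (the vectors `hours` and `id` are assumptions for illustration). The within minimum is negative even though no raw value is, and subtracting the global mean from the within maximum recovers the largest deviation from an individual's own average.

```r
# Two individuals; the first has one very large value. Values are invented.
hours <- c(0, 100, 2, 2, 2, 2)
id    <- c(1, 1, 2, 2, 2, 2)

global_mean  <- mean(hours)                       # 18
person_mean  <- ave(hours, id)                    # 50 for id 1, 2 for id 2
within_hours <- hours - person_mean + global_mean # deviation plus global mean

min(within_hours)                # negative, although min(hours) is 0
max(within_hours)                # larger than any deviation, because the global mean is added back
max(within_hours) - global_mean  # largest deviation from an individual's own mean
```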

The reported standard deviations tell us that the variation in hours worked last week across women is nearly equal to that observed within a woman over time. That is, if you were to draw two women randomly from our data, the difference in hours worked is expected to be nearly equal to the difference for the same woman in two randomly selected years.
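To see how the two standard deviations can differ, the sketch below computes them by hand on made-up data where individuals differ a lot from each other but change little over time. It assumes the usual sample standard deviation (`sd()`) for both components; the exact denominators used inside the package are not confirmed here, and the names `sd_between` and `sd_within` are illustrative.

```r
# Toy panel: large differences across individuals, small changes over time.
hours <- c(35, 40, 45, 20, 20, 20, 50, 60)
id    <- c(1, 1, 1, 2, 2, 2, 3, 3)

person_mean  <- tapply(hours, id, mean)              # one mean per individual
within_hours <- hours - ave(hours, id) + mean(hours) # within component

sd_between <- sd(person_mean)   # spread of individual means (n values)
sd_within  <- sd(within_hours)  # spread around individual means (N values)

c(between = sd_between, within = sd_within)
```

Here the between value is several times the within value, so most of the variation comes from differences across individuals rather than changes over time.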

A more detailed interpretation can be found in the handout (Porter n.d.).

Using pdata.frame object

library(plm)  # provides pdata.frame()
data("Gasoline", package = "plm")
Gas <- pdata.frame(Gasoline, index = c("country", "year"), drop.index = TRUE)
xtsum(Gas)

Variable	Dim	Mean	SD	Min	Max	Observations
___________	_________
lgaspcar	overall	4.296	0.549	3.38	6.157	N = 342
	between		0.515	3.73	5.766	n = 18
	within		0.224	3.545	5.592	T = 19
___________	_________
lincomep	overall	-6.139	0.635	-8.073	-5.221	N = 342
	between		0.609	-7.816	-5.449	n = 18
	within		0.225	-6.877	-5.6	T = 19
___________	_________
lrpmg	overall	-0.523	0.678	-2.896	1.125	N = 342
	between		0.684	-2.709	0.739	n = 18
	within		0.127	-1.057	-0.137	T = 19
___________	_________
lcarpcap	overall	-9.042	1.219	-13.475	-7.536	N = 342
	between		1.114	-12.459	-7.781	n = 18
	within		0.557	-11.332	-7.691	T = 19

Using regular data.frame with id and t specified

data("Crime", package = "plm")
xtsum(Crime, variables = c("polpc", "avgsen", "crmrte"), id = "county", t = "year")

Variable	Dim	Mean	SD	Min	Max	Observations
___________	_________
polpc	overall	0.002	0.003	0	0.036	N = 630
	between		0.002	0.001	0.016	n = 90
	within		0.002	-0.013	0.022	T = 7
___________	_________
avgsen	overall	8.955	2.658	4.22	25.83	N = 630
	between		1.498	6.277	14.581	n = 90
	within		2.201	1.313	20.203	T = 7
___________	_________
crmrte	overall	0.032	0.018	0.002	0.164	N = 630
	between		0.017	0.004	0.089	n = 90
	within		0.007	-0.011	0.126	T = 7

Specifying variables to include in the summary

xtsum(Gas, variables = c("lincomep", "lgaspcar"))

Variable	Dim	Mean	SD	Min	Max	Observations
___________	_________
lincomep	overall	-6.139	0.635	-8.073	-5.221	N = 342
	between		0.609	-7.816	-5.449	n = 18
	within		0.225	-6.877	-5.6	T = 19
___________	_________
lgaspcar	overall	4.296	0.549	3.38	6.157	N = 342
	between		0.515	3.73	5.766	n = 18
	within		0.224	3.545	5.592	T = 19

Returning a data.frame object

Returning a data.frame might be useful if you wish to perform additional manipulation of the results or if you intend to use other reporting packages such as stargazer (Hlavac 2018) or kableExtra (Zhu 2021).

xtsum(Gas, variables = c("lincomep", "lgaspcar"), return.data.frame = TRUE)
#> # A tibble: 8 × 7
#>   Variable    Dim       Mean   SD    Min    Max    Observations
#>   <chr>       <chr>     <chr>  <chr> <chr>  <chr>  <chr>       
#> 1 ___________ _________ <NA>   <NA>  <NA>   <NA>   <NA>        
#> 2 lincomep    overall   -6.139 0.635 -8.073 -5.221 N = 342     
#> 3 <NA>        between   <NA>   0.609 -7.816 -5.449 n = 18      
#> 4 <NA>        within    <NA>   0.225 -6.877 -5.6   T = 19      
#> 5 ___________ _________ <NA>   <NA>  <NA>   <NA>   <NA>        
#> 6 lgaspcar    overall   4.296  0.549 3.38   6.157  N = 342     
#> 7 <NA>        between   <NA>   0.515 3.73   5.766  n = 18      
#> 8 <NA>        within    <NA>   0.224 3.545  5.592  T = 19
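Since every column of the returned object is character and the variable name appears only on the "overall" row, some tidying is usually needed before further computation. The sketch below rebuilds part of the printed output by hand (only the Variable, Dim, Mean and SD columns, copied from the table above) so that it runs without the package; the object name `res` is an assumption.

```r
# Hand-copied subset of the printed output above; all columns are character.
res <- data.frame(
  Variable = c("___________", "lincomep", NA, NA, "___________", "lgaspcar", NA, NA),
  Dim      = c("_________", "overall", "between", "within",
               "_________", "overall", "between", "within"),
  Mean     = c(NA, "-6.139", NA, NA, NA, "4.296", NA, NA),
  SD       = c(NA, "0.635", "0.609", "0.225", NA, "0.549", "0.515", "0.224"),
  stringsAsFactors = FALSE
)

# Drop the separator rows.
res <- res[res$Dim %in% c("overall", "between", "within"), ]

# Carry each variable name down to its between and within rows.
last_named   <- cummax(seq_along(res$Variable) * !is.na(res$Variable))
res$Variable <- res$Variable[last_named]

# Convert the statistics to numeric for further computation.
res$Mean <- as.numeric(res$Mean)
res$SD   <- as.numeric(res$SD)

# Example: within standard deviation of each variable.
res[res$Dim == "within", c("Variable", "SD")]
```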

Other Functions

The functions below serve as helpers when the user is not interested in a full report but only wants to check a specific value.

between_max

This function computes the maximum between-group effect in panel data.

Usage

between_max(data, variable, id = NULL, t = NULL, na.rm = FALSE)

Arguments

data A data.frame or pdata.frame object containing the panel data.
variable The variable for which the maximum between-group effect is calculated.
id (Optional) Name of the individual identifier variable.
t (Optional) Name of the time identifier variable.
na.rm Logical. Should missing values be removed? Default is FALSE.

Example

Using pdata.frame

data("Gasoline", package = "plm")
Gas <- pdata.frame(Gasoline, index = c("country", "year"), drop.index = TRUE)
between_max(Gas, variable = "lgaspcar")
#> [1] 5.766355

Using regular data.frame with id and t specified

data("Crime", package = "plm")
between_max(Crime, variable = "crmrte", id = "county", t = "year")
#> [1] 0.08868547

between_min

This function computes the minimum between-group effect in panel data.

Usage

between_min(data, variable, id = NULL, t = NULL, na.rm = FALSE)

Arguments

data A data.frame or pdata.frame object containing the panel data.
variable The variable for which the minimum between-group effect is calculated.
id (Optional) Name of the individual identifier variable.
t (Optional) Name of the time identifier variable.
na.rm Logical. Should missing values be removed? Default is FALSE.

Value The minimum between-group effect.

Example

Using pdata.frame

data("Gasoline", package = "plm")
Gas <- pdata.frame(Gasoline, index = c("country", "year"), drop.index = TRUE)
between_min(Gas, variable = "lgaspcar")
#> [1] 3.729646

Using regular data.frame with id and t specified

data("Crime", package = "plm")
between_min(Crime, variable = "crmrte", id = "county", t = "year")
#> [1] 0.003969886

between_sd

This function calculates the standard deviation of between-group effects in panel data.

Usage

between_sd(data, variable, id = NULL, t = NULL, na.rm = FALSE)

Arguments

data A data.frame or pdata.frame object containing the panel data.
variable The variable for which the standard deviation of between-group effects is calculated.
id (Optional) Name of the individual identifier variable.
t (Optional) Name of the time identifier variable.
na.rm Logical. Should missing values be removed? Default is FALSE.

Value The standard deviation of between-group effects.

Example

Using pdata.frame

data("Gasoline", package = "plm")
Gas <- pdata.frame(Gasoline, index = c("country", "year"), drop.index = TRUE)
between_sd(Gas, variable = "lgaspcar")
#> [1] 0.5150439

Using regular data.frame with id and t specified

data("Crime", package = "plm")
between_sd(Crime, variable = "crmrte", id = "county", t = "year")
#> [1] 0.01698929

within_max

This function computes the maximum within-group effect in panel data.

Usage

within_max(data, variable, id = NULL, t = NULL, na.rm = FALSE)

Arguments

data A data.frame or pdata.frame object containing the panel data.
variable The variable for which the maximum within-group effect is calculated.
id (Optional) Name of the individual identifier variable.
t (Optional) Name of the time identifier variable.
na.rm Logical. Should missing values be removed? Default is FALSE.

Value The maximum within-group effect.

Example

Using pdata.frame

data("Gasoline", package = "plm")
Gas <- pdata.frame(Gasoline, index = c("country", "year"), drop.index = TRUE)
within_max(Gas, variable = "lgaspcar")
#> [1] 5.591887

Using regular data.frame with id and t specified

data("Crime", package = "plm")
within_max(Crime, variable = "crmrte", id = "county", t = "year")
#> [1] 0.1258057

within_min

This function computes the minimum within-group effect in panel data.

Usage

within_min(data, variable, id = NULL, t = NULL, na.rm = FALSE)

Arguments

data A data.frame or pdata.frame object containing the panel data.
variable The variable for which the minimum within-group effect is calculated.
id (Optional) Name of the individual identifier variable.
t (Optional) Name of the time identifier variable.
na.rm Logical. Should missing values be removed? Default is FALSE.

Value The minimum within-group effect.

Example

Using pdata.frame

data("Gasoline", package = "plm")
Gas <- pdata.frame(Gasoline, index = c("country", "year"), drop.index = TRUE)
within_min(Gas, variable = "lgaspcar")
#> [1] 3.545347

Using regular data.frame with id and t specified

data("Crime", package = "plm")
within_min(Crime, variable = "crmrte", id = "county", t = "year")
#> [1] -0.01128364

within_sd

This function computes the standard deviation of within-group effects in panel data.

Usage

within_sd(data, variable, id = NULL, t = NULL, na.rm = FALSE)

Arguments

data A data.frame or pdata.frame object containing the panel data.
variable The variable for which the standard deviation of within-group effects is calculated.
id (Optional) Name of the individual identifier variable.
t (Optional) Name of the time identifier variable.
na.rm Logical. Should missing values be removed? Default is FALSE.

Value The standard deviation of within-group effects.

Example

Using pdata.frame

data("Gasoline", package = "plm")
Gas <- pdata.frame(Gasoline, index = c("country", "year"), drop.index = TRUE)
within_sd(Gas, variable = "lgaspcar")
#> [1] 0.2236768

Using regular data.frame with id and t specified

data("Crime", package = "plm")
within_sd(Crime, variable = "crmrte", id = "county", t = "year")
#> [1] 0.006517892

References

Hlavac, Marek. 2018. “Stargazer.” CRAN.R-project.org. https://CRAN.R-project.org/package=stargazer.

Porter, Stephen. n.d. Understanding Xtsum Output. stephenporter.org. Accessed December 6, 2023. https://stephenporter.org/files/xtsum_handout.pdf.

StataCorp. 2023. “Stata Longitudinal-Data/Panel-Data Reference Manual Release 18.” A Stata Press Publication.

Zhu, Hao. 2021. “kableExtra.” CRAN.R-project.org. https://CRAN.R-project.org/package=kableExtra.
