Labelr - An Introduction

A Whirlwind Tour

labelr is an experimental (“beta”) R package that supports creation and use of multiple types of labels for data.frames and data.frame variables (columns). This vignette provides an ad hoc introduction to core and ancillary labelr functionalities and use cases.

Types of Labels

labelr supports three core types of data.frame labels, the last of which comes in three flavors:

  1. Frame labels - Each data.frame may be given a single “frame label” of 500 characters or fewer, which may describe key general features or characteristics of the data set (e.g., source, date produced or published, high-level contents).

  2. Name labels - Each variable may be given exactly one name label, which is an extended variable name or brief description of the variable. For example, if a variable called “st_b” refers to a survey respondent’s state of birth, then a sensible and useful name label might be “State of Birth”. Or, if a variable called “trust1” consisted of responses to the consumer survey question, “How much do you trust BBC news to give you unbiased information?,” a sensible name label might be “BBC News Trust.” As such, name labels are comparable to what Stata and SAS call “variable labels.”

  3. Value labels - labelr offers three kinds of value labels.

    • One-to-one labels - The canonical value-labeling use case entails mapping distinct values of a variable to distinct labels in a one-to-one fashion, so that each value label uniquely identifies a substantive value. For instance, an administrative data set might assign the integers 1-7 to seven distinct racial/ethnic groups, and value labels would be critical in mapping those numbers to socially substantive racial/ethnic category concepts (e.g., Which number corresponds to the category “Asian American?”).

    • Many-to-one labels - In an alternative use case, value labels may serve to distill or “bucket” distinct variable values in a way that deliberately “throws away” information for purposes of simplification. For example, one may wish to give the single label “Agree” to the responses “Very Strongly Agree,” “Strongly Agree,” and “Agree.” Or one may wish to differentiate self-identified “White” respondents from “People of Color,” applying the latter value label to all categories other than “White.”

    • Numerical range labels - Finally, one may wish to carve a numerical variable into an ordinal or qualitative range, such as dichotomizing a variable or dividing it into quantiles. Numerical range labels support one-to-many assignment of a single value label to a range of numerical values for a given variable.
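
To make the three flavors concrete, here is a minimal base R sketch of each kind of mapping. It is an illustration only, not labelr's own implementation: it uses named character vectors and cut(), and all object names and toy values are invented for this example.

```r
# Illustration only: plain base R, no labelr functions are used here.

# One-to-one: each distinct value gets its own distinct label.
one_to_one <- c("1" = "White", "2" = "Black", "3" = "Hispanic")

# Many-to-one: several distinct values deliberately share one label.
many_to_one <- c(
  "Very Strongly Agree" = "Agree",
  "Strongly Agree" = "Agree",
  "Agree" = "Agree"
)

# Numerical range: every value in a range gets that range's label
# (a hypothetical split at 100; intervals are closed on the right).
scores <- c(87.5, 100, 100.1, 140)
score_labels <- as.character(cut(
  scores,
  breaks = c(-Inf, 100, Inf),
  labels = c("at or below 100", "above 100")
))

unname(one_to_one[as.character(c(3, 1))])
unname(many_to_one[c("Strongly Agree", "Agree")])
score_labels
```

The first two mappings are simple lookups by name, whereas the third has to compare each value against a boundary, which is why range labels need a separate mechanism.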

Core Use Cases and Capabilities

More specifically, labelr functions support the following actions:

  1. Assigning variable value labels, name labels, and a frame label to data.frames and modifying those labels thereafter.

  2. Generating and accessing simple look-up table-style data.frames to inform or remind you about a given variable’s name label, frame label, or the value labels that correspond to its unique values (e.g., Which racial/ethnic identity category label corresponds to a value of 3?).

  3. Swapping out variable names for variable labels and back again.

  4. Replacing variables’ values with their corresponding labels.

  5. Augmenting a data.frame by adding columns of variable labels that can exist alongside the original columns (variables) from which they were derived.

  6. Engaging in base::subset()-like row-filtering, using value labels to guide the filtering but returning a subsetted data.frame in terms of the original variable values.

  7. Tabulating value frequencies that can be expressed in terms of raw values or value labels – again, without explicitly modifying or converting the raw data.frame values.

  8. Preserving and restoring a data.frame’s labels in the event that some unsupported R operation destroys them.

  9. Applying a single value-labeling scheme to many variables at once (for example, assigning the same set of Likert-scale labels, such as “Strongly Agree,” to all variables that share a common variable name character substring).

Adding and Looking up Frame, Name, and Value Labels

We’ll start our exploration of core labelr functions with a fake “demographic” data.frame. First, though, let’s load the package labelr.

# install.packages("devtools")
# devtools::install_github("rhartmano/labelr")
library(labelr)

Note: To minimize dependencies and reduce unexpected behaviors, labelr works exclusively with Base R data.frames and vectors and will coerce any augmented data.frame (e.g., tibble, data.table) to a base data.frame. The suggested workflow is to affix and use labels before coercing to an augmented data.frame if at all. While some augmented data.frames and their functions may “play well” with labelr-style labels, this is not guaranteed.

Make Toy Demographic Data.Frame

We’ll use make_demo_data() (included with labelr) to create the fictional data set.

set.seed(555) # for reproducibility
df <- make_demo_data(n = 1000) # you can specify the number of fictional obs.

# make a backup for later comparison
df_copy <- df

Add a “FRAME label” Using add_frame_lab()

We’ll start our labeling session by providing a fittingly fictional high-level description of this fictional data set (labelr calls this a FRAME label).

df <- add_frame_lab(df, frame.lab = "Demographic and reaction time test score
                    records collected by Royal Statistical Agency of
                    Fictionaslavica. Data fictionally collected in the year
                    1987. As published in A. Smithee (1988). Some Fictional Data
                    for Your Amusement. Mad Magazine, 10(1), 1-24.")


get_frame_lab(df)
### >   data.frame
### > 1         df
### >                                                                                                                                                                                                                                                       frame.lab
### > 1 Demographic and reaction time test score records collected by Royal Statistical Agency of Fictionaslavica. Data fictionally collected in the year 1987. As published in A. Smithee (1988). Some Fictional Data for Your Amusement. Mad Magazine, 10(1), 1-24.

Add Variable NAME Labels Using add_name_labs()

Now, let’s add (some fairly trivial) variable NAME labels.

df <- add_name_labs(df, name.labs = c(
  "age" = "Age in years",
  "raceth" = "Racial/ethnic identity group category",
  "gender" = "Gender identity category",
  "edu" = "Highest education level attained",
  "x1" = "Space Invaders reaction time test scores",
  "x2" = "Galaga reaction time test scores"
))

Even if we do nothing else with these name labels, we can access or manipulate a simple lookup table as needed.

get_name_labs(df)
### >      var                                      lab
### > 1     id                                       id
### > 2    age                             Age in years
### > 3 gender                 Gender identity category
### > 4 raceth    Racial/ethnic identity group category
### > 5    edu         Highest education level attained
### > 6     x1 Space Invaders reaction time test scores
### > 7     x2         Galaga reaction time test scores

Add VALUE labels Using add_val_labs()

Now, let’s do some VALUE labeling. First, let’s use add_val_labs() to add one-to-one value labels for the variable “raceth”. Note: max.unique.vals sets an upper limit on the number of unique values a variable may have and still be considered “value-label-able.” Additionally, labelr sets an overall upper limit of 5000 unique value labels permissible per variable.

df <- add_val_labs(df, # data.frame with to-be-value-labeled column
  vars = "raceth", # quoted variable name of to-be-labeled variable/column
  vals = c(1:7), # label values 1 through 7, inclusive
  labs = c(
    "White", "Black", "Hispanic", # apply these labels in this order to vals 1-7
    "Asian", "AIAN", "Multi", "Other"
  ),
  max.unique.vals = 10 # maximum number of unique values permitted
)

Note that labelr generates an NA label for each value-labeled variable, regardless of whether the variable contains any actual NA values (you will see it in the lookup table shown shortly). This is a way of letting you know that labelr handles NA (and “irregular”) value-labeling without your help. We’ll illustrate this further later on.

Add Value Labels Using add_val1()

Now let’s add value labels for variable “gender.” Function add_val1 is a variant of add_val_labs that allows you to supply the variable name unquoted, provided you are value-labeling only one variable. (It’s not evident from the above, but add_val_labs supports labeling multiple variables at once).

df <- add_val1(
  data = df,
  var = gender, # contrast this var argument to the vars argument demo'd above
  vals = c(0, 1, 2), # the values to be labeled
  labs = c("Male", "Female", "Other"), # the labels, applied in order to the vals
  max.unique.vals = 10
)

Once again, we can create a lookup table with the labels-to-values mappings. Because we used add_val_labs() and add_val1(), each unique value of our value-labeled variables will (must) have one unique label (one-to-one mapping), and any unique values that were not explicitly assigned a label will be given one automatically (the value itself, coerced to character as needed).

get_val_labs(df)
### >       var vals     labs
### > 1  gender    0     Male
### > 2  gender    1   Female
### > 3  gender    2    Other
### > 4  gender   NA       NA
### > 5  raceth    1    White
### > 6  raceth    2    Black
### > 7  raceth    3 Hispanic
### > 8  raceth    4    Asian
### > 9  raceth    5     AIAN
### > 10 raceth    6    Multi
### > 11 raceth    7    Other
### > 12 raceth   NA       NA
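
The rule just described (an explicit label where one was supplied, otherwise the value itself as a character string) can be sketched in base R. This is an illustration of the rule, not labelr's internal code; lookup_label() and gender_labels are hypothetical names invented for this example, and value 2 is deliberately left unlabeled.

```r
# Illustration only: a base R sketch of the fallback rule.
# `lookup_label()` is a hypothetical helper, not a labelr function.
lookup_label <- function(values, labels) {
  keys <- as.character(values)
  keys[is.na(keys)] <- "NA"               # treat NA as the string "NA"
  out <- unname(labels[keys])             # explicit labels; NA where none exists
  no_label <- is.na(out)
  out[no_label] <- keys[no_label]         # fall back to the value itself
  out
}

gender_labels <- c("0" = "Male", "1" = "Female") # value 2 has no label
lookup_label(c(1, 0, 2, NA), gender_labels)
```

Values 1 and 0 pick up their explicit labels, while 2 and NA fall back to their own character representations.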

Add NUMERICAL RANGE Labels Using add_quant_labs()

Traditionally, value labels are intended for categorical variables, such as binary, nominal, or (integer) ordinal variables with limited numbers of distinct categories. Further, as just noted, value labels that are added using add_val_labs (or add_val1) are constrained to map one-to-one to distinct values: No two distinct values can share a label or vice versa.

If you wish to apply a label to a range of values of a numerical variable, such as labeling each value according to the quintile or decile to which it belongs, you can use add_quant_labs() (or add_quant1) to do so.

Here, we will use add_quant_labs with the partial argument set to TRUE to apply quintile range labels to all variables of df that have an “x” in their names (i.e., vars “x1” and “x2”). We’ll demonstrate this capability further at the end of this vignette.

df_temp <- add_quant_labs(data = df, vars = "x", qtiles = 5, partial = TRUE)

Be careful with setting partial to TRUE like this: If your data set featured a column called “sex” or that featured the suffix “max,” add_quant_labs() would attempt to apply the value labeling scheme to that column as well!
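
The hazard is easy to reproduce in base R. The sketch below assumes plain substring matching, which is what the warning above implies; it does not call labelr, and the column names are hypothetical.

```r
# Illustration only: substring matching on hypothetical column names.
col_names <- c("id", "sex", "x1", "x2", "score_max")

# Any name containing "x" matches, including unintended ones.
matched <- col_names[grepl("x", col_names, fixed = TRUE)]
matched

# A stricter pattern: names that are "x" followed only by digits.
strict <- col_names[grepl("^x[0-9]+$", col_names)]
strict
```

One way to stay safe is to build the vector of exact column names yourself, as with strict here, and supply those names explicitly rather than relying on partial matching.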

We can use the same function to assign arbitrary, user-specified range labels. Here, we assign numerical range labels based on arbitrary cutpoints that differentiate values of “x1” and “x2” that are at or below 100 from values that are greater than 100 but at or below 150.

df_temp <- add_quant_labs(
  data = df_temp, vars = "x", vals = c(100, 150),
  partial = TRUE
)
### > Warning in add_quant_labs(data = df_temp, vars = "x", vals = c(100, 150), : 
### > 
### > Some of the supplied vals argument values are outside
### > the observed range of var --x2-- values

Having demonstrated the basic functionality, let’s use add_quant1 to apply quintile range labeling to the single variable “x1” only. This function only accepts one variable, but its name can be supplied unquoted.

df <- add_quant1(df, # data.frame
  x1, # variable to value-label
  qtiles = 5 # number of quantiles (5 = quintiles) that define the range labels
)

We’ll preserve the “x1” range labels going forward, keeping “x2” unlabeled.
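
For intuition, quintile range labeling can be approximated in base R with quantile() and cut(). This is a sketch under stated assumptions, not labelr's implementation: the quantile method labelr uses is not confirmed here, the data are simulated, and the “q020”-style label format simply imitates the labels that appear in this vignette's lookup tables.

```r
# Illustration only: base R approximation of quintile range labels.
set.seed(1)
x <- rnorm(200, mean = 100, sd = 20) # hypothetical scores

probs <- seq(0.2, 1, by = 0.2)
upper <- quantile(x, probs = probs) # upper bound of each quintile
labs <- sprintf("q%03d", as.integer(round(probs * 100))) # "q020" ... "q100"

# Each value gets the label of the lowest upper bound it does not exceed.
breaks <- c(-Inf, upper[-length(upper)], Inf)
x_lab <- as.character(cut(x, breaks = breaks, labels = labs))

table(x_lab)
```

Each label names the upper end of its range, so “q020” covers values at or below the 20th percentile and “q100” covers everything above the 80th.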

Add MANY-TO-ONE VALUE Labels Using add_m1_lab()

If you wish to apply a single label to multiple distinct values, this can be done through successive calls to add_m1_lab() (or add1m1(), if working with a single variable). Here “m1” is shorthand for “many to one” (many values get the same one value label).

Note that each call to add_m1_lab() applies a single value label, so multiple calls are needed to apply multiple labels. Here, we illustrate this workflow, applying the label “Some College+” to values 3, 4, or 5 of the variable “edu”, then applying other distinct labels to values 1 and 2, respectively.

df <- add_m1_lab(df, "edu", vals = c(3:5), lab = "Some College+")

df <- add_m1_lab(df, "edu", vals = 1, lab = "Not HS Grad")

df <- add_m1_lab(df, "edu", vals = 2, lab = "HSG, No College")

get_val_labs(df)
### >       var    vals            labs
### > 1  gender       0            Male
### > 2  gender       1          Female
### > 3  gender       2           Other
### > 4  gender      NA              NA
### > 5  raceth       1           White
### > 6  raceth       2           Black
### > 7  raceth       3        Hispanic
### > 8  raceth       4           Asian
### > 9  raceth       5            AIAN
### > 10 raceth       6           Multi
### > 11 raceth       7           Other
### > 12 raceth      NA              NA
### > 13    edu       1     Not HS Grad
### > 14    edu       2 HSG, No College
### > 15    edu       3   Some College+
### > 16    edu       4   Some College+
### > 17    edu       5   Some College+
### > 18    edu      NA              NA
### > 19     x1  82.976            q020
### > 20     x1  95.238            q040
### > 21     x1 106.142            q060
### > 22     x1 117.524            q080
### > 23     x1  157.98            q100
### > 24     x1      NA              NA

Where Do We Stand?

All of this is nice, but have we really accomplished anything? A casual view of the data.frame raises doubts: it does not appear to have changed from its initial state.

head(df_copy, 3) # our pre-labeling copy of the data.frame
### >     id age gender raceth edu     x1     x2
### > T-1  1  59      1      4   5 120.25 0.5928
### > N-2  2  56      1      1   2  67.12 0.9116
### > D-3  3  54      1      6   3  79.28 0.6993

head(df, 3) # our latest, post-labeling version of same data.frame
### >     id age gender raceth edu     x1     x2
### > T-1  1  59      1      4   5 120.25 0.5928
### > N-2  2  56      1      1   2  67.12 0.9116
### > D-3  3  54      1      6   3  79.28 0.6993

But labeling has introduced unobtrusive but important features for us to use. We’ll put them to work in a moment. But first let’s back them up in case we lose them.

Preserving and Restoring Labels

Lose them, you say? labelr labels are data.frame attributes, and certain Base R functions (like some forms of subsetting) are known to destroy attributes. For this reason, once you’re done labeling your data.frame, it’s wise to create an in-session backup of your label information by assigning it to a stand-alone object. You can do this with get_all_lab_atts(), which will return all labels (frame, name, and value) as a list that you can subsequently (re-) attach to a data.frame.

labs.df <- get_all_lab_atts(df)

Now, we can remove them explicitly, simulating what certain R functions do implicitly.

df <- strip_labs(df) # remove our labels
get_all_lab_atts(df) # show that they're gone
### > named list()

Now, let’s restore them, using the labs.df list object we just created.

df <- add_lab_atts(df, labs.df)

get_all_lab_atts(df)
### > $frame.lab
### > [1] "Demographic and reaction time test score records collected by Royal Statistical Agency of Fictionaslavica. Data fictionally collected in the year 1987. As published in A. Smithee (1988). Some Fictional Data for Your Amusement. Mad Magazine, 10(1), 1-24."
### > 
### > $name.labs
### >                                         id 
### >                                       "id" 
### >                                        age 
### >                             "Age in years" 
### >                                     gender 
### >                 "Gender identity category" 
### >                                     raceth 
### >    "Racial/ethnic identity group category" 
### >                                        edu 
### >         "Highest education level attained" 
### >                                         x1 
### > "Space Invaders reaction time test scores" 
### >                                         x2 
### >         "Galaga reaction time test scores" 
### > 
### > $val.labs.gender
### >        0        1        2       NA 
### >   "Male" "Female"  "Other"     "NA" 
### > 
### > $val.labs.raceth
### >          1          2          3          4          5          6          7 
### >    "White"    "Black" "Hispanic"    "Asian"     "AIAN"    "Multi"    "Other" 
### >         NA 
### >       "NA" 
### > 
### > $val.labs.edu
### >                 1                 2                 3                 4 
### >     "Not HS Grad" "HSG, No College"   "Some College+"   "Some College+" 
### >                 5                NA 
### >   "Some College+"              "NA" 
### > 
### > $val.labs.x1
### >  82.976  95.238 106.142 117.524  157.98      NA 
### >  "q020"  "q040"  "q060"  "q080"  "q100"    "NA"

We’re back(ed up)!
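
The same back-up-and-restore pattern can be sketched with base R attribute functions alone. The example below is an illustration, not how labelr stores or restores its labels internally; the toy data.frame and its single custom attribute are hypothetical.

```r
# Illustration only: custom attributes on a toy data.frame, base R only.
toy <- data.frame(id = 1:3, grp = c(1, 2, 1))
attr(toy, "frame.lab") <- "A toy data.frame" # hypothetical label attribute

# Back up every attribute except the structural ones base R manages itself.
structural <- c("names", "row.names", "class")
backup <- attributes(toy)[setdiff(names(attributes(toy)), structural)]

attr(toy, "frame.lab") <- NULL # simulate losing the label
lost <- is.null(attr(toy, "frame.lab"))

# Restore each saved attribute by name.
for (nm in names(backup)) attr(toy, nm) <- backup[[nm]]
attr(toy, "frame.lab")
```

Keeping the backup as an ordinary list means it survives whatever happens to the data.frame itself.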

In addition to this hack, labelr provides label-preserving variants of common data management functions, including sfilter(), sselect(), ssubset(), srename(), ssort(), and others (the “s” prefix is for “safely,” as in, “your labels will be safely retained”). Other popular packages (e.g., “dplyr”) also preserve label attributes. An advantage of labelr functions like sselect() is that they will update the label attributes of affected columns. For example, if your use of sselect() or sdrop() removes a column from the returned data.frame, any labels associated with that column will be removed from the data.frame’s attributes, as well.

“Using” Value Labels

Now that our data.frame is labeled (and our labels backed up), let’s demonstrate some ways that we can use them.

Show First, Last, or Random Rows with Value Labels Overlaid

Base R includes the head() and tail() functions, which allow you to show the first n or last n rows of a data.frame. In addition, the “car” package offers a similar function called some(), which allows you to show a random n rows of a data.frame.

labelr provides versions of these functions that will display value labels in place of values (without actually altering the values in the underlying data.frame). Let’s demonstrate each of the three standard functions, followed by its labelr counterpart. (Note: the unconventional rownames, e.g., “T-1,” “N-2,” are unique row identifiers, provided as an aid to help you visually locate a literal row that may appear across calls.)

head(df, 5) # Base R function utils::head()
### >     id age gender raceth edu     x1     x2
### > T-1  1  59      1      4   5 120.25 0.5928
### > N-2  2  56      1      1   2  67.12 0.9116
### > D-3  3  54      1      6   3  79.28 0.6993
### > Q-4  4  46      1      5   4  99.59 0.2243
### > E-5  5  18      1      6   4  90.49 0.0099

headl(df, 5) # labelr function headl() (note the "l")
### >     id age gender raceth             edu   x1     x2
### > T-1  1  59 Female  Asian   Some College+ q100 0.5928
### > N-2  2  56 Female  White HSG, No College q020 0.9116
### > D-3  3  54 Female  Multi   Some College+ q020 0.6993
### > Q-4  4  46 Female   AIAN   Some College+ q060 0.2243
### > E-5  5  18 Female  Multi   Some College+ q040 0.0099

tail(df, 5) # Base R function utils::tail()
### >          id age gender raceth edu     x1     x2
### > Z-996   996  63      0      1   4  92.36 0.0447
### > S-997   997  18      0      4   4 147.40 0.2252
### > K-998   998  45      0      5   2 106.87 0.1610
### > I-999   999  46      1      4   2 119.13 0.7666
### > H-1000 1000  68      0      6   5  70.38 0.5123

taill(df, 5) # labelr function taill() (note the extra "l")
### >          id age gender raceth             edu   x1     x2
### > Z-996   996  63   Male  White   Some College+ q040 0.0447
### > S-997   997  18   Male  Asian   Some College+ q100 0.2252
### > K-998   998  45   Male   AIAN HSG, No College q080 0.1610
### > I-999   999  46 Female  Asian HSG, No College q100 0.7666
### > H-1000 1000  68   Male  Multi   Some College+ q020 0.5123

set.seed(293)
car::some(df, 5) # car package function car::some()
### >        id age gender raceth edu     x1     x2
### > F-181 181  44      1      5   2  87.46 0.0965
### > K-248 248  30      1      2   3 129.62 0.4484
### > N-341 341  19      1      5   2  45.21 0.6074
### > F-457 457  58      1      5   4 124.84 0.9890
### > P-458 458  30      1      7   3  96.22 0.5607

set.seed(293)
somel(df, 5) # labelr function somel() (note the "l")
### >        id age gender raceth             edu   x1     x2
### > F-181 181  44 Female   AIAN HSG, No College q040 0.0965
### > N-341 341  19 Female   AIAN HSG, No College q020 0.6074
### > P-458 458  30 Female  Other   Some College+ q060 0.5607
### > F-457 457  58 Female   AIAN   Some College+ q100 0.9890
### > K-248 248  30 Female  Black   Some College+ q100 0.4484

Note that some() and somel() both return random rows, but they will not necessarily return the same random rows, even with the same random number seed.

Swap out Values for Labels with use_val_labs() and uvl()

With use_val_labs(), we can generalize this overlaying (aka “turning on” aka “swapping in”) of value labels to the entire data.frame. We might do this temporarily, to visualize the labels in place of values.

use_val_labs(df)[1:20, ] # headl() is just a more compact shortcut for this
### >      id age gender   raceth             edu   x1     x2
### > T-1   1  59 Female    Asian   Some College+ q100 0.5928
### > N-2   2  56 Female    White HSG, No College q020 0.9116
### > D-3   3  54 Female    Multi   Some College+ q020 0.6993
### > Q-4   4  46 Female     AIAN   Some College+ q060 0.2243
### > E-5   5  18 Female    Multi   Some College+ q040 0.0099
### > K-6   6  45   Male    Black   Some College+ q020 0.9250
### > Y-7   7  57   Male    White HSG, No College q060 0.9446
### > C-8   8  46   Male Hispanic HSG, No College q080 0.4053
### > W-9   9  37 Female    Black   Some College+ q020 0.3998
### > A-10 10  12 Female    Other HSG, No College q060 0.5857
### > A-11 11  46   Male    Other   Some College+ q020 0.7027
### > S-12 12  28   Male Hispanic   Some College+ q020 0.6538
### > Z-13 13  15 Female     AIAN   Some College+ q080 0.6267
### > H-14 14  39 Female     AIAN   Some College+ q020 0.8989
### > A-15 15  18 Female    White   Some College+ q100 0.2974
### > B-16 16  48   Male    Multi   Some College+ q080 0.2212
### > H-17 17  39   Male     AIAN   Some College+ q060 0.3127
### > F-18 18  52   Male Hispanic   Some College+ q060 0.4350
### > F-19 19  33   Male    Other   Some College+ q100 0.2809
### > A-20 20  29   Male    White   Some College+ q060 0.8188

We can wrap a call to this function around our data.frame and pass the result to other functions, which may yield more interpretable output, depending on the function. Here is an illustration that passes a use_val_labs()-wrapped data.frame to the qsu() function of the collapse package. To save typing, we’ll use uvl(), a more compact alias for use_val_labs().

First we show the unwrapped call to collapse::qsu(), followed by an otherwise identical call that wraps the data.frame in uvl(). Focus your eyes on the leftmost column of the console outputs of the respective calls.

# `collapse::qsu()`
# with labels "off" (i.e., using regular values of "raceth" as by var)
(by_demog_val <- collapse::qsu(df, cols = c("x2"), by = ~raceth))
### >      N    Mean      SD     Min     Max
### > 1  156  0.5067  0.2696  0.0018  0.9966
### > 2  147  0.4922  0.2755  0.0041  0.9951
### > 3  144  0.4951   0.299  0.0172  0.9992
### > 4  127  0.5461  0.2873   0.006  0.9885
### > 5  155  0.5476  0.2995  0.0076   0.994
### > 6  140  0.5163  0.2798  0.0099  0.9915
### > 7  131  0.5132  0.2786  0.0014  0.9918

# with labels "on" (i.e., using labels, thanks to `uvl()`)
(by_demog_lab <- collapse::qsu(uvl(df), cols = c("x2"), by = ~raceth))
### >             N    Mean      SD     Min     Max
### > AIAN      155  0.5476  0.2995  0.0076   0.994
### > Asian     127  0.5461  0.2873   0.006  0.9885
### > Black     147  0.4922  0.2755  0.0041  0.9951
### > Hispanic  144  0.4951   0.299  0.0172  0.9992
### > Multi     140  0.5163  0.2798  0.0099  0.9915
### > Other     131  0.5132  0.2786  0.0014  0.9918
### > White     156  0.5067  0.2696  0.0018  0.9966

Note that the second call would achieve the same result if we used use_val_labs(), but uvl() is more compact for typing and printing purposes.

Non-standard Evaluation using with_val_labs() and wvl()

labelr also offers an option to overlay (“swap out”) value labels using base::with()-like non-standard evaluation. This is helpful in a few specific cases.

with(df, table(gender, raceth)) # base::with()
### >       raceth
### > gender  1  2  3  4  5  6  7
### >      0 83 66 62 65 64 68 61
### >      1 70 74 79 56 83 69 64
### >      2  3  7  3  6  8  3  6

with_val_labs(df, table(gender, raceth)) # labelr::with_val_labs()
### >         raceth
### > gender   AIAN Asian Black Hispanic Multi Other White
### >   Female   83    56    74       79    69    64    70
### >   Male     64    65    66       62    68    61    83
### >   Other     8     6     7        3     3     6     3

wvl(df, table(gender, raceth)) # labelr::wvl is a more compact alias
### >         raceth
### > gender   AIAN Asian Black Hispanic Multi Other White
### >   Female   83    56    74       79    69    64    70
### >   Male     64    65    66       62    68    61    83
### >   Other     8     6     7        3     3     6     3

with(use_val_labs(df), table(gender, raceth)) # this does same thing
### >         raceth
### > gender   AIAN Asian Black Hispanic Multi Other White
### >   Female   83    56    74       79    69    64    70
### >   Male     64    65    66       62    68    61    83
### >   Other     8     6     7        3     3     6     3

In a little bit, we’ll see that we have some parallel options for overlaying (“turning on”) NAME labels.

Add value labels back to the data.frame with add_lab_cols()

If all this wrapping and interactive toggling back and forth is making you dizzy, we could do something more permanent.

For example, we can assign the result of a use_val_labs() call to an object. The result will be a data.frame with the same names and dimensions as the one supplied, with value labels replacing values for all value-labeled variables (or for a subset of those variables, if you specify them). Those variables will be coerced to character (if they were not already). Since there is no “undo” shortcut for this action, it is safest to assign the result to a new object.

df_labd <- use_val_labs(df)
head(df_labd) # note, this is utils::head(), not labelr::headl()
### >     id age gender raceth             edu   x1     x2
### > T-1  1  59 Female  Asian   Some College+ q100 0.5928
### > N-2  2  56 Female  White HSG, No College q020 0.9116
### > D-3  3  54 Female  Multi   Some College+ q020 0.6993
### > Q-4  4  46 Female   AIAN   Some College+ q060 0.2243
### > E-5  5  18 Female  Multi   Some College+ q040 0.0099
### > K-6  6  45   Male  Black   Some College+ q020 0.9250

Better still, we do not strictly need to choose between values and labels. We can use add_lab_cols() to preserve all existing variables (columns), including the value-labeled ones, while adding to our data.frame an additional labels-as-values column for each value-labeled column.

Easier done than said, perhaps. Take a look:

df_plus_labs <- add_lab_cols(df)
head(df_plus_labs[c("gender", "gender_lab", "raceth", "raceth_lab")])
### >     gender gender_lab raceth raceth_lab
### > T-1      1     Female      4      Asian
### > N-2      1     Female      1      White
### > D-3      1     Female      6      Multi
### > Q-4      1     Female      5       AIAN
### > E-5      1     Female      6      Multi
### > K-6      0       Male      2      Black
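
The idea of keeping values and labels side by side can be sketched in base R. This is an illustration only, not the add_lab_cols() implementation; the toy data and the gender_labels vector are hypothetical, and the “_lab” suffix simply imitates the column names shown above.

```r
# Illustration only: add a labels-as-values companion column in base R.
toy <- data.frame(gender = c(1, 1, 0, 2), raceth = c(4, 1, 2, 7))
gender_labels <- c("0" = "Male", "1" = "Female", "2" = "Other")

# The original numeric column is left untouched; the new column is character.
toy$gender_lab <- unname(gender_labels[as.character(toy$gender)])
toy
```

Because the original column survives, numeric operations and label-based displays can both work from the same data.frame.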

“Filter values using labels” with flab()

We can filter a value-labeled data.frame on the basis of value labels, returning the subsetted data.frame expressed in terms of the original values (i.e., with the labels still in the background). For example, here we use the more semantically meaningful value labels to filter our data.frame.

head(df)
### >     id age gender raceth edu     x1     x2
### > T-1  1  59      1      4   5 120.25 0.5928
### > N-2  2  56      1      1   2  67.12 0.9116
### > D-3  3  54      1      6   3  79.28 0.6993
### > Q-4  4  46      1      5   4  99.59 0.2243
### > E-5  5  18      1      6   4  90.49 0.0099
### > K-6  6  45      0      2   4  78.55 0.9250

df1 <- flab(df, raceth == "Asian" & gender == "Female")

head(df1, 5) # returned df1 is in terms of values, just like df
### >      id age gender raceth edu     x1     x2
### > T-1   1  59      1      4   5 120.25 0.5928
### > D-40 40  60      1      4   4  78.12 0.9885
### > E-67 67  39      1      4   5  98.21 0.6244
### > I-73 73  36      1      4   2  98.42 0.2102
### > V-80 80  27      1      4   4 122.62 0.3137

headl(df1, 5) # note use of labelr::headl; labels are there
### >      id age gender raceth             edu   x1     x2
### > T-1   1  59 Female  Asian   Some College+ q100 0.5928
### > D-40 40  60 Female  Asian   Some College+ q020 0.9885
### > E-67 67  39 Female  Asian   Some College+ q060 0.6244
### > I-73 73  36 Female  Asian HSG, No College q060 0.2102
### > V-80 80  27 Female  Asian   Some College+ q100 0.3137
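
Conceptually, this kind of filtering evaluates the condition against the labels but subsets the original rows. A minimal base R sketch of that idea follows; it is not flab()'s implementation, and the toy data and label vectors are hypothetical.

```r
# Illustration only: filter on labels, return rows in terms of values.
toy <- data.frame(
  id = 1:5,
  gender = c(1, 0, 1, 1, 2),
  raceth = c(4, 4, 1, 4, 6)
)
gender_labels <- c("0" = "Male", "1" = "Female", "2" = "Other")
raceth_labels <- c("1" = "White", "4" = "Asian", "6" = "Multi")

# Evaluate the condition on the labels...
keep <- unname(
  raceth_labels[as.character(toy$raceth)] == "Asian" &
    gender_labels[as.character(toy$gender)] == "Female"
)

# ...but subset the original, value-coded rows.
filtered <- toy[keep, ]
filtered
```

The returned rows still hold the numeric codes 1 and 4, not the strings “Female” and “Asian”.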

“Subset using labels” with slab()

As with base::subset(), we can also limit which columns we return.

head(slab(df, raceth == "Black" & gender == "Male", gender, raceth), 10)
### >       gender raceth
### > K-6        0      2
### > F-22       0      2
### > E-30       0      2
### > O-46       0      2
### > Q-48       0      2
### > F-72       0      2
### > T-117      0      2
### > K-149      0      2
### > M-161      0      2
### > A-167      0      2

In the case of slab(), we simply list the desired columns – unquoted and comma-separated – after the filter expression.

Tabulate frequencies with tabl()

labelr’s tabl() function supports count tabulations with labels turned “on” or “off” and offers some other functionalities. For example, tables can be generated…

…in terms of values

head(tabl(df), 20) # labs.on = FALSE is default
### > Warning in tabl(df): 
### > Excluding variable --id-- (includes decimals or exceeds max.unique.vals).
### > Warning in tabl(df): 
### > Excluding variable --age-- (includes decimals or exceeds max.unique.vals).
### > Warning in tabl(df): 
### > Excluding variable --x1-- (includes decimals or exceeds max.unique.vals).
### > Warning in tabl(df): 
### > Excluding variable --x2-- (includes decimals or exceeds max.unique.vals).
### >    gender raceth edu  n
### > 1       1      2   2 29
### > 2       0      1   3 28
### > 3       1      1   2 28
### > 4       1      5   2 27
### > 5       1      5   3 26
### > 6       1      6   2 26
### > 7       1      7   2 25
### > 8       1      3   3 24
### > 9       0      2   3 23
### > 10      0      4   2 23
### > 11      0      5   2 23
### > 12      0      1   2 22
### > 13      0      3   3 22
### > 14      0      6   3 22
### > 15      1      3   4 22
### > 16      0      1   4 21
### > 17      0      2   2 21
### > 18      1      2   3 21
### > 19      1      3   2 21
### > 20      0      6   2 20

…or in terms of labels

head(tabl(df, labs.on = TRUE), 20) # labs.on = TRUE is not the default
### > Warning in tabl(df, labs.on = TRUE): 
### > Excluding variable --id-- (includes decimals or exceeds max.unique.vals).
### > Warning in tabl(df, labs.on = TRUE): 
### > Excluding variable --age-- (includes decimals or exceeds max.unique.vals).
### > Warning in tabl(df, labs.on = TRUE): 
### > Excluding variable --x2-- (includes decimals or exceeds max.unique.vals).
### >    gender   raceth           edu   x1  n
### > 1    Male    Other Some College+ q020 18
### > 2    Male    White Some College+ q020 14
### > 3    Male    White Some College+ q060 14
### > 4  Female     AIAN Some College+ q020 13
### > 5  Female     AIAN Some College+ q060 13
### > 6  Female Hispanic Some College+ q040 13
### > 7  Female Hispanic Some College+ q100 13
### > 8    Male     AIAN Some College+ q040 13
### > 9  Female     AIAN Some College+ q080 12
### > 10 Female    Asian Some College+ q080 12
### > 11 Female    Black Some College+ q060 12
### > 12   Male    Multi Some College+ q020 12
### > 13   Male    Multi Some College+ q100 12
### > 14   Male    White Some College+ q080 12
### > 15 Female Hispanic Some College+ q020 11
### > 16 Female    White Some College+ q080 11
### > 17   Male    Asian Some College+ q040 11
### > 18   Male    Black Some College+ q040 11
### > 19   Male Hispanic Some College+ q100 11
### > 20 Female     AIAN Some College+ q040 10

…in proportions

head(tabl(df, labs.on = TRUE, prop.digits = 3), 20)
### > Warning in tabl(df, labs.on = TRUE, prop.digits = 3): 
### > Excluding variable --id-- (includes decimals or exceeds max.unique.vals).
### > Warning in tabl(df, labs.on = TRUE, prop.digits = 3): 
### > Excluding variable --age-- (includes decimals or exceeds max.unique.vals).
### > Warning in tabl(df, labs.on = TRUE, prop.digits = 3): 
### > Excluding variable --x2-- (includes decimals or exceeds max.unique.vals).
### >    gender   raceth           edu   x1     n
### > 1    Male    Other Some College+ q020 0.018
### > 2    Male    White Some College+ q020 0.014
### > 3    Male    White Some College+ q060 0.014
### > 4  Female     AIAN Some College+ q020 0.013
### > 5  Female     AIAN Some College+ q060 0.013
### > 6  Female Hispanic Some College+ q040 0.013
### > 7  Female Hispanic Some College+ q100 0.013
### > 8    Male     AIAN Some College+ q040 0.013
### > 9  Female     AIAN Some College+ q080 0.012
### > 10 Female    Asian Some College+ q080 0.012
### > 11 Female    Black Some College+ q060 0.012
### > 12   Male    Multi Some College+ q020 0.012
### > 13   Male    Multi Some College+ q100 0.012
### > 14   Male    White Some College+ q080 0.012
### > 15 Female Hispanic Some College+ q020 0.011
### > 16 Female    White Some College+ q080 0.011
### > 17   Male    Asian Some College+ q040 0.011
### > 18   Male    Black Some College+ q040 0.011
### > 19   Male Hispanic Some College+ q100 0.011
### > 20 Female     AIAN Some College+ q040 0.010

…cross-tab style

head(tabl(df, labs.on = TRUE, wide.col = "gender"), 20)
### > Warning in tabl(df, labs.on = TRUE, wide.col = "gender"): 
### > Excluding variable --id-- (includes decimals or exceeds max.unique.vals).
### > Warning in tabl(df, labs.on = TRUE, wide.col = "gender"): 
### > Excluding variable --age-- (includes decimals or exceeds max.unique.vals).
### > Warning in tabl(df, labs.on = TRUE, wide.col = "gender"): 
### > Excluding variable --x2-- (includes decimals or exceeds max.unique.vals).
### >      raceth           edu   x1 Male Female Other
### > 1     Other Some College+ q020   18      5     1
### > 2     White Some College+ q020   14      5     0
### > 3     White Some College+ q060   14      5     2
### > 4      AIAN Some College+ q020    6     13     1
### > 5      AIAN Some College+ q060    8     13     0
### > 6  Hispanic Some College+ q040    8     13     0
### > 7  Hispanic Some College+ q100   11     13     0
### > 8      AIAN Some College+ q040   13     10     1
### > 9      AIAN Some College+ q080    4     12     3
### > 10    Asian Some College+ q080    5     12     0
### > 11    Black Some College+ q060    9     12     1
### > 12    Multi Some College+ q020   12     10     1
### > 13    Multi Some College+ q100   12     10     0
### > 14    White Some College+ q080   12     11     0
### > 15 Hispanic Some College+ q020    8     11     2
### > 16    Asian Some College+ q040   11      6     0
### > 17    Black Some College+ q040   11      6     1
### > 18    Black Some College+ q020    9     10     2
### > 19 Hispanic Some College+ q060   10     10     1
### > 20 Hispanic Some College+ q080    5     10     0

…with non-value-labeled data.frames

tabl(iris, "Species") # explicit vars arg with one-var ("Species")
### >      Species  n
### > 1     setosa 50
### > 2 versicolor 50
### > 3  virginica 50

tabl(mtcars, zero.rm = TRUE) # vars arg null
### > Warning in tabl(mtcars, zero.rm = TRUE): 
### > Excluding variable --mpg-- (includes decimals or exceeds max.unique.vals).
### > Warning in tabl(mtcars, zero.rm = TRUE): 
### > Excluding variable --disp-- (includes decimals or exceeds max.unique.vals).
### > Warning in tabl(mtcars, zero.rm = TRUE): 
### > Excluding variable --hp-- (includes decimals or exceeds max.unique.vals).
### > Warning in tabl(mtcars, zero.rm = TRUE): 
### > Excluding variable --drat-- (includes decimals or exceeds max.unique.vals).
### > Warning in tabl(mtcars, zero.rm = TRUE): 
### > Excluding variable --wt-- (includes decimals or exceeds max.unique.vals).
### > Warning in tabl(mtcars, zero.rm = TRUE): 
### > Excluding variable --qsec-- (includes decimals or exceeds max.unique.vals).
### >    cyl vs am gear carb n
### > 1    8  0  0    3    4 5
### > 2    4  1  1    4    1 4
### > 3    8  0  0    3    2 4
### > 4    8  0  0    3    3 3
### > 5    4  1  0    4    2 2
### > 6    4  1  1    4    2 2
### > 7    6  0  1    4    4 2
### > 8    6  1  0    3    1 2
### > 9    6  1  0    4    4 2
### > 10   4  0  1    5    2 1
### > 11   4  1  0    3    1 1
### > 12   4  1  1    5    2 1
### > 13   6  0  1    5    6 1
### > 14   8  0  1    5    4 1
### > 15   8  0  1    5    8 1
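If we want to sanity-check tabl()-style counts against base R, a rough cross-check is possible with table() and as.data.frame(). This is only a sketch: it uses no labelr functions, it returns factor columns rather than the original values, and the choice of cyl, am, and gear is ours, for illustration.

```r
# Base-R cross-check of tabl()-style counts; no labelr functions are used here.

# One variable: counts per level of Species, with the count column named "n".
species_counts <- as.data.frame(table(Species = iris$Species),
  responseName = "n"
)
species_counts

# Several variables: count every combination of cyl, am, and gear.
combo_counts <- as.data.frame(table(mtcars[c("cyl", "am", "gear")]),
  responseName = "n"
)
combo_counts <- combo_counts[combo_counts$n > 0, ] # drop zero-count combinations
combo_counts <- combo_counts[order(-combo_counts$n), ] # largest counts first
head(combo_counts)
```

The one-variable result should agree with the tabl(iris, "Species") output above, and the counts in the second table should sum to the 32 rows of mtcars.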

“Using” NAME labels

Just as we used use_val_labs() to swap out values for value labels, we can use use_name_labs() to swap out variable names for variable NAME labels. Let’s illustrate this with the mtcars data.frame.

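First, we’ll construct a named vector of labels, where each element’s name is a variable name and each element’s value is that variable’s name label.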

names_labs_vec <- c(
  "mpg" = "Miles/(US) gallon",
  "cyl" = "Number of cylinders",
  "disp" = "Displacement (cu.in.)",
  "hp" = "Gross horsepower",
  "drat" = "Rear axle ratio",
  "wt" = "Weight (1000 lbs)",
  "qsec" = "1/4 mile time",
  "vs" = "Engine (0 = V-shaped, 1 = straight)",
  "am" = "Transmission (0 = automatic, 1 = manual)",
  "gear" = "Number of forward gears",
  "carb" = "Number of carburetors"
)

Now, we will apply them to mtcars and assign the resulting data.frame to a new data.frame called mt2.

mt2 <- add_name_labs(mtcars,
  vars = names(names_labs_vec),
  labs = names_labs_vec
)

Here is an alternative syntax (same end state):

mt2 <- add_name_labs(mtcars,
  name.labs = c(
    "mpg" = "Miles/(US) gallon",
    "cyl" = "Number of cylinders",
    "disp" = "Displacement (cu.in.)",
    "hp" = "Gross horsepower",
    "drat" = "Rear axle ratio",
    "wt" = "Weight (1000 lbs)",
    "qsec" = "1/4 mile time",
    "vs" = "Engine (0 = V-shaped, 1 = straight)",
    "am" = "Transmission (0 = automatic, 1 = manual)",
    "gear" = "Number of forward gears",
    "carb" = "Number of carburetors"
  )
)

Now, let’s swap out names for NAME labels.

mt2 <- use_name_labs(mt2)

head(mt2[c(1, 2)])
### >                   Miles/(US) gallon Number of cylinders
### > Mazda RX4                      21.0                   6
### > Mazda RX4 Wag                  21.0                   6
### > Datsun 710                     22.8                   4
### > Hornet 4 Drive                 21.4                   6
### > Hornet Sportabout              18.7                   8
### > Valiant                        18.1                   6

Yikes, the longer column names stretch things out quite a bit.

One thing we can do is use get_name_labs() to get a look-up table, then copy and paste the labels we need from it. For example:

lm(`Miles/(US) gallon` ~ `Number of cylinders`, data = mt2) # pasting in var names
### > 
### > Call:
### > lm(formula = `Miles/(US) gallon` ~ `Number of cylinders`, data = mt2)
### > 
### > Coefficients:
### >           (Intercept)  `Number of cylinders`  
### >                37.885                 -2.876
lm(mpg ~ cyl, data = use_var_names(mt2)) # same result if name labels are "off"
### > 
### > Call:
### > lm(formula = mpg ~ cyl, data = use_var_names(mt2))
### > 
### > Coefficients:
### > (Intercept)          cyl  
### >      37.885       -2.876

But freehand typing or copy-paste is clunky and tedious. There are other less painful ways we can use these NAME labels, once we’ve turned them on.

sapply(mt2, median) # get the median for every name-labeled variable
### >                        Miles/(US) gallon 
### >                                   19.200 
### >                      Number of cylinders 
### >                                    6.000 
### >                    Displacement (cu.in.) 
### >                                  196.300 
### >                         Gross horsepower 
### >                                  123.000 
### >                          Rear axle ratio 
### >                                    3.695 
### >                        Weight (1000 lbs) 
### >                                    3.325 
### >                            1/4 mile time 
### >                                   17.710 
### >      Engine (0 = V-shaped, 1 = straight) 
### >                                    0.000 
### > Transmission (0 = automatic, 1 = manual) 
### >                                    0.000 
### >                  Number of forward gears 
### >                                    4.000 
### >                    Number of carburetors 
### >                                    2.000

collapse::qsu(mt2) # use an external package for more informative descriptives
### >                                            N      Mean        SD    Min    Max
### > Miles/(US) gallon                         32   20.0906    6.0269   10.4   33.9
### > Number of cylinders                       32    6.1875    1.7859      4      8
### > Displacement (cu.in.)                     32  230.7219  123.9387   71.1    472
### > Gross horsepower                          32  146.6875   68.5629     52    335
### > Rear axle ratio                           32    3.5966    0.5347   2.76   4.93
### > Weight (1000 lbs)                         32    3.2173    0.9785  1.513  5.424
### > 1/4 mile time                             32   17.8487    1.7869   14.5   22.9
### > Engine (0 = V-shaped, 1 = straight)       32    0.4375     0.504      0      1
### > Transmission (0 = automatic, 1 = manual)  32    0.4063     0.499      0      1
### > Number of forward gears                   32    3.6875    0.7378      3      5
### > Number of carburetors                     32    2.8125    1.6152      1      8
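Another way to avoid retyping long labels is to build a model formula from a look-up vector. The sketch below is base R only and makes a few assumptions: name_labs is a two-variable subset of the label vector used above, and mt_demo is a hand-renamed stand-in for a data.frame whose name labels are “on” (it is not produced by labelr).

```r
# Look-up vector of name labels (a two-variable subset of the vector above).
name_labs <- c(
  "mpg" = "Miles/(US) gallon",
  "cyl" = "Number of cylinders"
)

# Stand-in for a name-labels-"on" data.frame: rename the columns in base R.
mt_demo <- mtcars[names(name_labs)]
names(mt_demo) <- unname(name_labs)

# Build the formula from the look-up; backticks protect non-syntactic names.
rhs <- sprintf("`%s`", name_labs[["cyl"]])
lhs <- as.name(name_labs[["mpg"]])
fml <- reformulate(termlabels = rhs, response = lhs)

fit <- lm(fml, data = mt_demo)
coef(fit)
```

The coefficients should match the lm() results shown earlier (about 37.885 and -2.876), since only the column names differ.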

Okay, let’s revert to our original variable names.

mt2 <- use_var_names(mt2)
head(mt2[c(1, 2)])
### >                    mpg cyl
### > Mazda RX4         21.0   6
### > Mazda RX4 Wag     21.0   6
### > Datsun 710        22.8   4
### > Hornet 4 Drive    21.4   6
### > Hornet Sportabout 18.7   8
### > Valiant           18.1   6

We can use with_name_labs() (or the more compact alias wnl()) to display name labels in place of column names in fairly flexible ways.

First, let’s show that mt2’s name labels are “off,” then we’ll verify that the labels are still there in the background.

# first, show mt2 with name labels off but verify that we still have name labels
head(mt2)
### >                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
### > Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
### > Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
### > Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
### > Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
### > Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
### > Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
get_name_labs(mt2)
### >     var                                      lab
### > 1   mpg                        Miles/(US) gallon
### > 2   cyl                      Number of cylinders
### > 3  disp                    Displacement (cu.in.)
### > 4    hp                         Gross horsepower
### > 5  drat                          Rear axle ratio
### > 6    wt                        Weight (1000 lbs)
### > 7  qsec                            1/4 mile time
### > 8    vs      Engine (0 = V-shaped, 1 = straight)
### > 9    am Transmission (0 = automatic, 1 = manual)
### > 10 gear                  Number of forward gears
### > 11 carb                    Number of carburetors

Now, pay attention to the variable names in the console output of the following calls:

# demo wnl() (note that with_name_labs() will achieve same result)
wnl(mt2, t.test(mpg ~ am))
### > 
### >   Welch Two Sample t-test
### > 
### > data:  Miles/(US) gallon by Transmission (0 = automatic, 1 = manual)
### > t = -3.7671, df = 18.332, p-value = 0.001374
### > alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
### > 95 percent confidence interval:
### >  -11.280194  -3.209684
### > sample estimates:
### > mean in group 0 mean in group 1 
### >        17.14737        24.39231

wnl(mt2, lm(mpg ~ am))
### > 
### > Call:
### > lm(formula = `Miles/(US) gallon` ~ `Transmission (0 = automatic, 1 = manual)`)
### > 
### > Coefficients:
### >                                (Intercept)  
### >                                     17.147  
### > `Transmission (0 = automatic, 1 = manual)`  
### >                                      7.245

wnl(mt2, summary(mt2))
### >  Miles/(US) gallon Number of cylinders Displacement (cu.in.) Gross horsepower
### >  Min.   :10.40     Min.   :4.000       Min.   : 71.1         Min.   : 52.0   
### >  1st Qu.:15.43     1st Qu.:4.000       1st Qu.:120.8         1st Qu.: 96.5   
### >  Median :19.20     Median :6.000       Median :196.3         Median :123.0   
### >  Mean   :20.09     Mean   :6.188       Mean   :230.7         Mean   :146.7   
### >  3rd Qu.:22.80     3rd Qu.:8.000       3rd Qu.:326.0         3rd Qu.:180.0   
### >  Max.   :33.90     Max.   :8.000       Max.   :472.0         Max.   :335.0   
### >  Rear axle ratio Weight (1000 lbs) 1/4 mile time  
### >  Min.   :2.760   Min.   :1.513     Min.   :14.50  
### >  1st Qu.:3.080   1st Qu.:2.581     1st Qu.:16.89  
### >  Median :3.695   Median :3.325     Median :17.71  
### >  Mean   :3.597   Mean   :3.217     Mean   :17.85  
### >  3rd Qu.:3.920   3rd Qu.:3.610     3rd Qu.:18.90  
### >  Max.   :4.930   Max.   :5.424     Max.   :22.90  
### >  Engine (0 = V-shaped, 1 = straight) Transmission (0 = automatic, 1 = manual)
### >  Min.   :0.0000                      Min.   :0.0000                          
### >  1st Qu.:0.0000                      1st Qu.:0.0000                          
### >  Median :0.0000                      Median :0.0000                          
### >  Mean   :0.4375                      Mean   :0.4062                          
### >  3rd Qu.:1.0000                      3rd Qu.:1.0000                          
### >  Max.   :1.0000                      Max.   :1.0000                          
### >  Number of forward gears Number of carburetors
### >  Min.   :3.000           Min.   :1.000        
### >  1st Qu.:3.000           1st Qu.:2.000        
### >  Median :4.000           Median :2.000        
### >  Mean   :3.688           Mean   :2.812        
### >  3rd Qu.:4.000           3rd Qu.:4.000        
### >  Max.   :5.000           Max.   :8.000

wnl(mt2, xtabs(~gear))
### > Number of forward gears
### >  3  4  5 
### > 15 12  5

with(mt2, xtabs(~gear)) # compare to directly above
### > gear
### >  3  4  5 
### > 15 12  5

Keep in mind that wnl() is intended for self-contained calls involving exploratory analysis activities, like simple plots, descriptives, and models. It’s based on fairly brittle regular expressions and will throw an error if you are using particularly exotic operators, trying out multi-step workflows, or attempting to use it for data management or cleaning. Still, as shown above, it works reasonably well for a range of “workhorse” commands.

NA and “Irregular” Values

labelr is no fan of NA values or other “irregular” values, which are defined as infinite values, not-a-number values, and character values that look like them (e.g., “NAN”, “INF”, “inf”, “Na”).

When value-labeling a column / variable, such values are automatically given the catch-all label “NA” (which will be converted to an actual NA in any columns created by add_lab_cols() or use_val_labs()). You do not need (and should not try) to specify this yourself, and you should not try to override labelr on this. If you want to use labelr and your data contain these sorts of values, your options are to accept the default “NA” label or convert these values to something else before labeling. The reasoning is that value labels are rarely appropriate for the types of variables and scenarios where you absolutely need to preserve the nuances of exotic values like -Inf and NaN.
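To find out ahead of time whether a column contains such values, a small base R check along these lines may help. Note that is_irregular() is a hypothetical helper written for this illustration, not a labelr function, and the exact set of strings it matches (case-insensitive “NA”, “NAN”, “INF”, “-INF”) is our assumption rather than labelr’s documented rule.

```r
# Hypothetical helper (not part of labelr): flag NA, NaN, Inf, -Inf, and
# character values that look like them. The matched strings are an assumption.
is_irregular <- function(x) {
  if (is.numeric(x)) {
    return(is.na(x) | is.infinite(x)) # is.na() is also TRUE for NaN
  }
  x_chr <- toupper(trimws(as.character(x)))
  is.na(x) | x_chr %in% c("NA", "NAN", "INF", "-INF")
}

num_flags <- is_irregular(c(1, NA, NaN, Inf, -Inf))
num_flags # FALSE TRUE TRUE TRUE TRUE

chr_flags <- is_irregular(c("A", "NA", "Inf", "inf", "Na", NA))
chr_flags # FALSE TRUE TRUE TRUE TRUE TRUE
```

Values flagged this way could then be recoded to something else before labeling, if the default “NA” label is not what we want.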

With that said, let’s see how labelr handles this, with an assist from our old friend mtcars (packaged with R’s base distribution).

First, let’s assign mtcars to a new data.frame object that we will besmirch.

mtbad <- mtcars

Let’s get on with the besmirching.

mtbad[1, 1:11] <- NA
rownames(mtbad)[1] <- "Missing Car"
mtbad[2, "am"] <- Inf
mtbad[3, "gear"] <- -Inf
mtbad[5, "carb"] <- NaN
mtbad[2, "mpg"] <- Inf
mtbad[3, "mpg"] <- NaN

# add a character variable, for demonstration purposes
# if it makes you feel better, you can pretend these are Consumer Reports or
# ...JD Power ratings or something
set.seed(9202) # for reproducibility
mtbad$grade <- sample(c("A", "B", "C"), nrow(mtbad), replace = TRUE)
mtbad[4, "grade"] <- NA
mtbad[5, "grade"] <- "NA"
mtbad[6, "grade"] <- "Inf"

# see where this leaves us
head(mtbad)
### >                    mpg cyl disp  hp drat    wt  qsec vs  am gear carb grade
### > Missing Car         NA  NA   NA  NA   NA    NA    NA NA  NA   NA   NA     B
### > Mazda RX4 Wag      Inf   6  160 110 3.90 2.875 17.02  0 Inf    4    4     C
### > Datsun 710         NaN   4  108  93 3.85 2.320 18.61  1   1 -Inf    1     C
### > Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1   0    3    1  <NA>
### > Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0   0    3  NaN    NA
### > Valiant           18.1   6  225 105 2.76 3.460 20.22  1   0    3    1   Inf

sapply(mtbad, class)
### >         mpg         cyl        disp          hp        drat          wt 
### >   "numeric"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric" 
### >        qsec          vs          am        gear        carb       grade 
### >   "numeric"   "numeric"   "numeric"   "numeric"   "numeric" "character"

Now, let’s add value labels to this unruly data.frame.

mtlabs <- mtbad |>
  add_val1(grade,
    vals = c("A", "B", "C"),
    labs = c("Gold", "Silver", "Bronze")
  ) |>
  add_val1(am,
    vals = c(0, 1),
    labs = c("auto", "stick")
  ) |>
  add_val1(carb,
    vals = c(1, 2, 3, 4, 6, 8), # not the most inspired use of labels
    labs = c(
      "1c", "2c", "3c",
      "4c", "6c", "8c"
    )
  ) |>
  add_val1(gear,
    vals = 3:5, # again, not the most compelling use case
    labs = c(
      "3-speed",
      "4-speed",
      "5-speed"
    )
  ) |>
  add_quant1(mpg, qtiles = 4) # add quartile-based value labels
get_val_labs(mtlabs, "am") # NA values were detected and dealt with
### >   var vals  labs
### > 6  am    0  auto
### > 7  am    1 stick
### > 8  am   NA    NA

Let’s streamline the data.frame with sselect() to make it more manageable.

mtless <- sselect(mtlabs, mpg, cyl, am, gear, carb, grade) # safely select

head(mtless, 5) # note that the irregular values are still here
### >                    mpg cyl  am gear carb grade
### > Missing Car         NA  NA  NA   NA   NA     B
### > Mazda RX4 Wag      Inf   6 Inf    4    4     C
### > Datsun 710         NaN   4   1 -Inf    1     C
### > Hornet 4 Drive    21.4   6   0    3    1  <NA>
### > Hornet Sportabout 18.7   8   0    3  NaN    NA

Notice how all irregular values are coerced to NA when we substitute labels for values with use_val_labs().

head(use_val_labs(mtless), 5) # but they all go to NA if we `use_val_labs`
### >                    mpg cyl    am    gear carb  grade
### > Missing Car       <NA>  NA  <NA>    <NA> <NA> Silver
### > Mazda RX4 Wag     <NA>   6  <NA> 4-speed   4c Bronze
### > Datsun 710        <NA>   4 stick    <NA>   1c Bronze
### > Hornet 4 Drive    q075   6  auto 3-speed   1c     NA
### > Hornet Sportabout q050   8  auto 3-speed <NA>     NA
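One way to build intuition for this behavior is to think of value labels as a look-up keyed on the values: anything without an entry in the look-up comes back as NA. The base R sketch below illustrates that idea only; it is not a description of labelr’s internals, and am_vals simply retypes the first five “am” values of mtbad.

```r
# Toy look-up illustration (not labelr's implementation).
am_vals <- c(NA, Inf, 1, 0, 0) # first five "am" values of mtbad
am_labs <- c("0" = "auto", "1" = "stick") # value-to-label look-up

# Values with no entry in the look-up (NA, Inf) come back as NA.
am_as_labs <- unname(am_labs[as.character(am_vals)])
am_as_labs
```

The result lines up with the “am” column in the use_val_labs() output above: two NA values followed by “stick”, “auto”, and “auto”.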

Now, let’s try an add_lab_cols() view.

mtlabs_plus <- add_lab_cols(mtlabs, c("mpg", "am")) # this creates and adds "mpg_lab" and "am_lab" cols
mtlabs_plus <- sselect(mtlabs_plus, mpg, mpg_lab, am, am_lab) # keep the two variables and their label cols

head(mtlabs_plus) # here's where we landed
### >                    mpg mpg_lab  am am_lab
### > Missing Car         NA    <NA>  NA   <NA>
### > Mazda RX4 Wag      Inf    <NA> Inf   <NA>
### > Datsun 710         NaN    <NA>   1  stick
### > Hornet 4 Drive    21.4    q075   0   auto
### > Hornet Sportabout 18.7    q050   0   auto
### > Valiant           18.1    q050   0   auto
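The q050 and q075 labels above come from the add_quant1(mpg, qtiles = 4) call. A rough base R analogue of quartile-based labeling is sketched below. It runs on the clean mtcars data rather than mtbad, and the right-closed intervals and the label names passed to cut() are our assumptions, chosen to mirror the output above rather than taken from labelr’s source.

```r
# Rough base-R analogue of quartile labels (assumptions noted in the text).
breaks <- quantile(mtcars$mpg, probs = seq(0, 1, by = 0.25))

mpg_demo <- data.frame(
  mpg = mtcars$mpg,
  mpg_q = cut(mtcars$mpg,
    breaks = breaks,
    labels = c("q025", "q050", "q075", "q100"),
    include.lowest = TRUE # keep the minimum value in the first bin
  ),
  row.names = rownames(mtcars)
)

mpg_demo[c("Hornet 4 Drive", "Hornet Sportabout", "Valiant"), ]
```

Under these assumptions, the three cars shown land in the same bins as in the mpg_lab column above (q075, q050, and q050).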

What if we had tried to explicitly label the NA values and/or irregular values themselves? We would have failed.

# Trying to Label an Irregular Value (-Inf)
mtbad <- add_val1(
  data = mtcars,
  var = gear,
  vals = -Inf,
  labs = c("neg.inf")
)
### > Error in add_val1(data = mtcars, var = gear, vals = -Inf, labs = c("neg.inf")): 
### > Cannot supply NA, NaN, Inf, or character variants as a val or lab arg.
### > These are handled automatically.

# Trying to Label an Irregular Value (NA)
mtbad <- add_val_labs(
  data = mtbad,
  vars = "grade",
  vals = NA,
  labs = c("miss")
)
### > Error in add_val_labs(data = mtbad, vars = "grade", vals = NA, labs = c("miss")): 
### > Cannot supply NA, NaN, Inf, or character variants as a val or lab arg.
### > These are handled automatically.

# Trying to Label an Irregular Value (NaN)
mtbad <- add_val_labs(
  data = mtbad,
  vars = "carb",
  vals = NaN,
  labs = c("nan-v")
)
### > Error in add_val_labs(data = mtbad, vars = "carb", vals = NaN, labs = c("nan-v")): 
### > Cannot supply NA, NaN, Inf, or character variants as a val or lab arg.
### > These are handled automatically.

# labelr also treats "character variants" of irregular values as irregular values.
mtbad <- add_val1(
  data = mtbad,
  var = carb,
  vals = "NAN",
  labs = c("nan-v")
)
### > Error in add_val1(data = mtbad, var = carb, vals = "NAN", labs = c("nan-v")): 
### > Cannot supply NA, NaN, Inf, or character variants as a val or lab arg.
### > These are handled automatically.

Again, labelr handles NA and irregular values and resists our efforts to take such matters into our own hands.

Factors and Value Labels

R’s concept of a factor variable shares some affinities with the concept of a value-labeled variable and can be viewed as one approach to value labeling. However, factors can manifest idiosyncratic and surprising behaviors depending on the function to which you’re trying to apply them. They are character-like, but they are not character values. They are built on top of integers, but they won’t submit to all of the operations that integers do. They do some very handy things in certain model-fitting applications, but their behavior “under the hood” can be counter-intuitive or opaque. Simply put, they are their own thing.

So, while factors have their purposes, it would be nice to associate value labels with the distinct values of data.frame variables in a manner that preserves the integrity and transparency of the underlying values (factors tend to be a bit opaque about this) and that allows you to view or use the labels in flexible ways.

And if you wanted to work with a factor, it would be nice if you could add value labels to it without it ceasing to be and behave like a factor.
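To make the point about factor quirks concrete, here is a short base R illustration (no labelr involved). The toy factor f is ours, chosen because its levels look like numbers.

```r
f <- factor(c("10", "20", "10"))

is.character(f) # FALSE: character-like, but not character
as.integer(f) # 1 2 1: the underlying integer codes, not the values
as.numeric(as.character(f)) # 10 20 10: go through character to recover values

# Arithmetic is refused: the result is all NA (with a warning, suppressed here).
suppressWarnings(f + 1)
```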

Adding Labels to a Factor

With that said, let’s see if we can have our label-factor cake and eat it, too, using the iris data.frame that comes pre-packaged with R.

unique(iris$Species)
### > [1] setosa     versicolor virginica 
### > Levels: setosa versicolor virginica

sapply(iris, class) # nothing up our sleeve -- "Species" is a factor
### > Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
### >    "numeric"    "numeric"    "numeric"    "numeric"     "factor"

Let’s add value labels to “Species” and assign the result to a new data.frame that we’ll call irlab. For our value labels, we’ll use “se”, “ve”, and “vi”, which do not add much new information but will help to illustrate what we can do with labelr and a factor variable.

irlab <- add_val_labs(iris,
  vars = "Species",
  vals = c("setosa", "versicolor", "virginica"),
  labs = c("se", "ve", "vi")
)

# this also would've worked
# irlab_dos <- add_val1(iris, Species,
#   vals = c("setosa", "versicolor", "virginica"),
#   labs = c("se", "ve", "vi")
# )

Note that we could have just as (or even more) easily used add_val1(), which works for a single variable at a time and allows us to avoid quoting our column name, if that matters to us. In contrast, add_val_labs() requires us to put our variable name(s) in quotes, but it also gives us the option to apply a common value-label scheme to several variables at once (e.g., Likert-style survey responses). We’ll see an example of this type of use case in action in a little bit.

For now, though, let’s prove that the iris and irlab data.frames are functionally identical.

First, note that irlab looks and acts just like iris in the usual ways that matter.

summary(iris)
### >   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
### >  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
### >  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
### >  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
### >  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
### >  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
### >  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
### >        Species  
### >  setosa    :50  
### >  versicolor:50  
### >  virginica :50  
### >                 
### >                 
### > 

summary(irlab)
### >   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
### >  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
### >  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
### >  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
### >  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
### >  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
### >  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
### >        Species  
### >  setosa    :50  
### >  versicolor:50  
### >  virginica :50  
### >                 
### >                 
### > 

head(iris, 4)
### >   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
### > 1          5.1         3.5          1.4         0.2  setosa
### > 2          4.9         3.0          1.4         0.2  setosa
### > 3          4.7         3.2          1.3         0.2  setosa
### > 4          4.6         3.1          1.5         0.2  setosa

head(irlab, 4)
### >   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
### > 1          5.1         3.5          1.4         0.2  setosa
### > 2          4.9         3.0          1.4         0.2  setosa
### > 3          4.7         3.2          1.3         0.2  setosa
### > 4          4.6         3.1          1.5         0.2  setosa

lm(Sepal.Length ~ Sepal.Width + Species, data = iris)
### > 
### > Call:
### > lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
### > 
### > Coefficients:
### >       (Intercept)        Sepal.Width  Speciesversicolor   Speciesvirginica  
### >            2.2514             0.8036             1.4587             1.9468

lm(Sepal.Length ~ Sepal.Width + Species, data = irlab) # values are same
### > 
### > Call:
### > lm(formula = Sepal.Length ~ Sepal.Width + Species, data = irlab)
### > 
### > Coefficients:
### >       (Intercept)        Sepal.Width  Speciesversicolor   Speciesvirginica  
### >            2.2514             0.8036             1.4587             1.9468

Note also that irlab’s “Species” is still a factor, just like its iris counterpart/parent.

sapply(irlab, class)
### > Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
### >    "numeric"    "numeric"    "numeric"    "numeric"     "factor"

levels(irlab$Species)
### > [1] "setosa"     "versicolor" "virginica"

But irlab’s “Species” has value labels!

get_val_labs(irlab, "Species")
### >       var       vals labs
### > 1 Species     setosa   se
### > 2 Species versicolor   ve
### > 3 Species  virginica   vi
### > 4 Species         NA   NA

And they work.

head(use_val_labs(irlab))
### >   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
### > 1          5.1         3.5          1.4         0.2      se
### > 2          4.9         3.0          1.4         0.2      se
### > 3          4.7         3.2          1.3         0.2      se
### > 4          4.6         3.1          1.5         0.2      se
### > 5          5.0         3.6          1.4         0.2      se
### > 6          5.4         3.9          1.7         0.4      se
ir_v <- flab(irlab, Species == "vi")
head(ir_v, 5)
### >     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
### > 101          6.3         3.3          6.0         2.5 virginica
### > 102          5.8         2.7          5.1         1.9 virginica
### > 103          7.1         3.0          5.9         2.1 virginica
### > 104          6.3         2.9          5.6         1.8 virginica
### > 105          6.5         3.0          5.8         2.2 virginica

Our take-aways so far? Factors can be value-labeled while staying factors, and we can use the labels to do labelr-y things with those factors. We can have both.

We may want to go further and add the labeled variable alongside the factor version.

irlab_aug <- add_lab_cols(irlab, vars = "Species")

This gives us a new variable called “Species_lab”. Let’s look at a random sample of rows from the resulting data.frame, so that we can see all the different species.

set.seed(231)
sample_rows <- sample(seq_len(nrow(irlab)), 10, replace = FALSE)

irlab_aug[sample_rows, ]
### >     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species Species_lab
### > 7            4.6         3.4          1.4         0.3     setosa          se
### > 91           5.5         2.6          4.4         1.2 versicolor          ve
### > 41           5.0         3.5          1.3         0.3     setosa          se
### > 133          6.4         2.8          5.6         2.2  virginica          vi
### > 130          7.2         3.0          5.8         1.6  virginica          vi
### > 19           5.7         3.8          1.7         0.3     setosa          se
### > 104          6.3         2.9          5.6         1.8  virginica          vi
### > 43           4.4         3.2          1.3         0.2     setosa          se
### > 8            5.0         3.4          1.5         0.2     setosa          se
### > 68           5.8         2.7          4.1         1.0 versicolor          ve

sapply(irlab_aug, class)
### > Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species  Species_lab 
### >    "numeric"    "numeric"    "numeric"    "numeric"     "factor"  "character"

with(irlab_aug, table(Species, Species_lab))
### >             Species_lab
### > Species      se ve vi
### >   setosa     50  0  0
### >   versicolor  0 50  0
### >   virginica   0  0 50

Caution: Replacing the entire data.frame using use_val_labs() WILL coerce factors to character, since the value labels are character values, not recognized factor levels.

ir_char <- use_val_labs(irlab) # we assign this to a new data.frame
sapply(ir_char, class)
### > Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
### >    "numeric"    "numeric"    "numeric"    "numeric"  "character"

head(ir_char, 3)
### >   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
### > 1          5.1         3.5          1.4         0.2      se
### > 2          4.9         3.0          1.4         0.2      se
### > 3          4.7         3.2          1.3         0.2      se

class(ir_char$Species) # it's character
### > [1] "character"
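
Why does the class change? A value label is just a character string looked up for each underlying value. The base R sketch below mimics that lookup with a hypothetical named vector (lab_lookup is an illustration only, not labelr's internal representation) and shows that the result is character, not factor.

```r
# Hypothetical lookup table: factor level -> short label (illustration only)
lab_lookup <- c(setosa = "se", versicolor = "ve", virginica = "vi")

sp <- iris$Species                              # a factor
sp_lab <- unname(lab_lookup[as.character(sp)])  # look up a label for each value

class(sp)     # "factor"
class(sp_lab) # "character"
```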

Of course, even then, we could explicitly coerce the labels to be factors if we wanted.

ir_fact <- use_val_labs(irlab)

ir_fact$Species <- factor(ir_fact$Species,
  levels = c("se", "ve", "vi"),
  labels = c("se", "ve", "vi")
)
head(ir_fact, 3)
### >   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
### > 1          5.1         3.5          1.4         0.2      se
### > 2          4.9         3.0          1.4         0.2      se
### > 3          4.7         3.2          1.3         0.2      se

class(ir_fact$Species) # it's a factor
### > [1] "factor"

levels(ir_fact$Species) # its levels are the labels
### > [1] "se" "ve" "vi"

We’ve recovered.

with(ir_fact, tapply(Sepal.Width, Species, mean))
### >    se    ve    vi 
### > 3.428 2.770 2.974
with(irlab, tapply(Sepal.Width, Species, mean))
### >     setosa versicolor  virginica 
### >      3.428      2.770      2.974
with(iris, tapply(Sepal.Width, Species, mean))
### >     setosa versicolor  virginica 
### >      3.428      2.770      2.974

Ordered factors

Value labels work with ordered factors, too. Let’s make a fictional ordered factor that we add to ir_ord. We can pretend that this is some sort of judge’s overall quality rating, if that helps.

ir_ord <- iris

set.seed(293)
qrating <- c("AAA", "AA", "A", "BBB", "AA", "BBB", "A")

ir_ord$qrat <- sample(qrating, 150, replace = TRUE)

ir_ord$qrat <- factor(ir_ord$qrat,
  ordered = TRUE,
  levels = c("AAA", "AA", "A", "BBB")
)

Where do we stand with this factor?

levels(ir_ord$qrat)
### > [1] "AAA" "AA"  "A"   "BBB"

class(ir_ord$qrat)
### > [1] "ordered" "factor"
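
Because this is an ordered factor, base R comparison operators respect the level order rather than alphabetical order. A minimal sketch on a toy vector (q is a made-up example that reuses the same level order, not part of ir_ord):

```r
# Toy ordered factor using the same level order as qrat (best to worst)
q <- factor(c("AA", "BBB", "AAA", "A"),
  ordered = TRUE,
  levels = c("AAA", "AA", "A", "BBB")
)

q < "A"              # TRUE where the rating sorts before "A" in level order
as.character(min(q)) # earliest level present
as.character(max(q)) # latest level present
```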

Now, let’s add value labels to it.

ir_ord <- add_val_labs(ir_ord,
  vars = "qrat",
  vals = c("AAA", "AA", "A", "BBB"),
  labs = c(
    "unimpeachable",
    "excellent",
    "very good",
    "meh"
  )
)

Let’s add a separate column with those labels as a distinct (character) variable unto itself, existing in addition to (not replacing) “qrat”.

ir_ord <- add_lab_cols(ir_ord, vars = "qrat")

head(ir_ord, 10)
### >    Sepal.Length Sepal.Width Petal.Length Petal.Width Species qrat      qrat_lab
### > 1           5.1         3.5          1.4         0.2  setosa   AA     excellent
### > 2           4.9         3.0          1.4         0.2  setosa   AA     excellent
### > 3           4.7         3.2          1.3         0.2  setosa   AA     excellent
### > 4           4.6         3.1          1.5         0.2  setosa  AAA unimpeachable
### > 5           5.0         3.6          1.4         0.2  setosa   AA     excellent
### > 6           5.4         3.9          1.7         0.4  setosa  BBB           meh
### > 7           4.6         3.4          1.4         0.3  setosa  AAA unimpeachable
### > 8           5.0         3.4          1.5         0.2  setosa   AA     excellent
### > 9           4.4         2.9          1.4         0.2  setosa    A     very good
### > 10          4.9         3.1          1.5         0.1  setosa    A     very good

with(ir_ord, table(qrat_lab, qrat))
### >                qrat
### > qrat_lab        AAA AA  A BBB
### >   excellent       0 49  0   0
### >   meh             0  0  0  43
### >   unimpeachable  11  0  0   0
### >   very good       0  0 47   0

class(ir_ord$qrat)
### > [1] "ordered" "factor"

levels(ir_ord$qrat)
### > [1] "AAA" "AA"  "A"   "BBB"

class(ir_ord$qrat_lab)
### > [1] "character"

get_val_labs(ir_ord, "qrat") # labs are still there for qrat
### >    var vals          labs
### > 1 qrat    A     very good
### > 2 qrat   AA     excellent
### > 3 qrat  AAA unimpeachable
### > 4 qrat  BBB           meh
### > 5 qrat   NA            NA

get_val_labs(ir_ord, "qrat_lab") # no labs here; this is just a character var
### > Warning in get_val_labs(ir_ord, "qrat_lab"): 
### >  
### >   No val.labs found.
### > [1] var  vals labs
### > <0 rows> (or 0-length row.names)

It appears that you really can have it all, where “it all” is defined as “factors and labels.”

Larger Data Frames

labelr is not intended for “large” data.frames, which is a fuzzy concept. To give a sense of what labelr can handle, let’s see it in action with the NYC Flights 2013 data set: a moderate-not-big data.frame of ~340K rows.

Let’s load labelr and the nycflights13 package.

opening_ding <- Sys.time() # to time labelr

library(nycflights13)
### > Warning: package 'nycflights13' was built under R version 4.3.2

We’ll assign the data.frame to one we call df.

df <- flights

nrow(df)
### > [1] 336776

We’ll add a “frame label,” which describes the data.frame overall.

df <- add_frame_lab(df, frame.lab = "On-time data for all flights that
                    departed NYC (i.e. JFK, LGA or EWR) in 2013.")
### > Warning in as_base_data_frame(data): 
### > data argument object coerced from augmented to conventional (Base R) data.frame.

Let’s see what this did.

attr(df, "frame.lab") # check for attribute
### > [1] "On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013."

get_frame_lab(df) # return frame.lab alongside data.frame name as a data.frame
### >   data.frame
### > 1         df
### >                                                                        frame.lab
### > 1 On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.

get_frame_lab(df)$frame.lab
### > [1] "On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013."
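
As the attr() call above suggests, the frame label is stored as an ordinary R attribute on the data.frame. The base R sketch below shows the general mechanism on a toy data.frame; the attribute name "note" is made up for illustration and is not one that labelr uses.

```r
d <- data.frame(a = 1:3)

# Attach free-form meta-data; "note" is a hypothetical attribute name
attr(d, "note") <- "Toy data for illustration."

attr(d, "note")      # retrieve it
names(attributes(d)) # "note" sits alongside names, class, and row.names
```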

Now, let’s assign variable NAME labels.

names_labs_vec <- c(
  "year" = "Year of departure",
  "month" = "Month of departure",
  "day" = "Day of departure",
  "dep_time" = "Actual departure time (format HHMM or HMM), local tz",
  "arr_time" = "Actual arrival time (format HHMM or HMM), local tz",
  "sched_dep_time" = "Scheduled departure times (format HHMM or HMM)",
  "sched_arr_time" = "Scheduled arrival time (format HHMM or HMM)",
  "dep_delay" = "Departure delays, in minutes",
  "arr_delay" = "Arrival delays, in minutes",
  "carrier" = "Two letter airline carrier abbreviation",
  "flight" = "Flight number",
  "tailnum" = "Plane tail number",
  "origin" = "Flight origin airport code",
  "dest" = "Flight destination airport code",
  "air_time" = "Minutes spent in the air",
  "distance" = "Miles between airports",
  "hour" = "Hour of scheduled departure time",
  "minute" = "Minutes component of scheduled departure time",
  "time_hour" = "Scheduled date and hour of the flight as a POSIXct date"
)

df <- add_name_labs(df, name.labs = names_labs_vec)

get_name_labs(df) # show that they've been added
### >               var                                                     lab
### > 1            year                                       Year of departure
### > 2           month                                      Month of departure
### > 3             day                                        Day of departure
### > 4        dep_time    Actual departure time (format HHMM or HMM), local tz
### > 5  sched_dep_time          Scheduled departure times (format HHMM or HMM)
### > 6       dep_delay                            Departure delays, in minutes
### > 7        arr_time      Actual arrival time (format HHMM or HMM), local tz
### > 8  sched_arr_time             Scheduled arrival time (format HHMM or HMM)
### > 9       arr_delay                              Arrival delays, in minutes
### > 10        carrier                 Two letter airline carrier abbreviation
### > 11         flight                                           Flight number
### > 12        tailnum                                       Plane tail number
### > 13         origin                              Flight origin airport code
### > 14           dest                         Flight destination airport code
### > 15       air_time                                Minutes spent in the air
### > 16       distance                                  Miles between airports
### > 17           hour                        Hour of scheduled departure time
### > 18         minute           Minutes component of scheduled departure time
### > 19      time_hour Scheduled date and hour of the flight as a POSIXct date
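
Note that c() does not check a named vector for repeated or misspelled names, so it can be worth sanity-checking a name-label vector against the data.frame's columns before applying it. The helper below, check_labs(), is a hypothetical base R sketch written for this illustration; it is not a labelr function.

```r
# Hypothetical helper: compare a named label vector to a data.frame's columns
check_labs <- function(data, labs) {
  list(
    duplicated = unique(names(labs)[duplicated(names(labs))]), # names used twice
    unlabeled = setdiff(names(data), names(labs)),             # columns with no label
    unknown = setdiff(names(labs), names(data))                # labels with no column
  )
}

df_mini <- data.frame(year = 2013, month = 1, day = 1)

# A deliberately flawed vector: "year" appears twice and "day" is missing
labs_flawed <- c(
  year = "Year of departure",
  month = "Month of departure",
  year = "Day of departure"
)

res <- check_labs(df_mini, labs_flawed)
res$duplicated
res$unlabeled
```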

Let’s add variable VALUE labels for variable “carrier.” Helpfully, a lookup table of carrier codes and airline names ships with the nycflights13 package itself.

airlines <- nycflights13::airlines

head(airlines)
### > # A tibble: 6 × 2
### >   carrier name                    
### >   <chr>   <chr>                   
### > 1 9E      Endeavor Air Inc.       
### > 2 AA      American Airlines Inc.  
### > 3 AS      Alaska Airlines Inc.    
### > 4 B6      JetBlue Airways         
### > 5 DL      Delta Air Lines Inc.    
### > 6 EV      ExpressJet Airlines Inc.

The carrier field of airlines matches the carrier column of df (formerly, flights).

ny_val <- airlines$carrier

The name field of airlines gives us the full airline names.

ny_lab <- airlines$name

df (flights) also has an integer month variable. We will “hand-jam” month value labels.

ny_month_vals <- c(1:12) # values
ny_month_labs <- c(
  "JAN", "FEB", "MAR", "APR", "MAY", "JUN",
  "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"
) # labels
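
As an aside, the same twelve labels can be derived from base R's built-in month.abb constant instead of being typed by hand. The object name ny_month_labs_alt is introduced here only for illustration.

```r
# month.abb is a base R constant: "Jan", "Feb", ..., "Dec"
ny_month_labs_alt <- toupper(month.abb)
ny_month_labs_alt
```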

Let’s add these value labels. First, we’ll demo add_val1(), then add_val_labs(), then add_quant_labs().

df <- add_val1(df,
  var = carrier, vals = ny_val,
  labs = ny_lab,
  max.unique.vals = 20
)
### > Warning in add_val1(df, var = carrier, vals = ny_val, labs = ny_lab, max.unique.vals = 20): 
### > 
### > Note: labelr is not optimized for data.frames this large.
df <- add_val_labs(df,
  vars = "month",
  vals = ny_month_vals,
  labs = ny_month_labs,
  max.unique.vals = 20
)
### > Warning in add_val_labs(df, vars = "month", vals = ny_month_vals, labs = ny_month_labs, : 
### > 
### > Note: labelr is not optimized for data.frames this large.
df <- add_quant_labs(df, "dep_time", qtiles = 5)
### > Warning in add_quant_labs(df, "dep_time", qtiles = 5): 
### > 
### > Note: labelr is not optimized for data.frames this large.

Let’s see where this leaves us.

get_val_labs(df)
### >         var vals                        labs
### > 1     month    1                         JAN
### > 2     month    2                         FEB
### > 3     month    3                         MAR
### > 4     month    4                         APR
### > 5     month    5                         MAY
### > 6     month    6                         JUN
### > 7     month    7                         JUL
### > 8     month    8                         AUG
### > 9     month    9                         SEP
### > 10    month   10                         OCT
### > 11    month   11                         NOV
### > 12    month   12                         DEC
### > 13    month   NA                          NA
### > 14 dep_time  827                        q020
### > 15 dep_time 1200                        q040
### > 16 dep_time 1536                        q060
### > 17 dep_time 1830                        q080
### > 18 dep_time 2400                        q100
### > 19 dep_time   NA                          NA
### > 20  carrier   9E           Endeavor Air Inc.
### > 21  carrier   AA      American Airlines Inc.
### > 22  carrier   AS        Alaska Airlines Inc.
### > 23  carrier   B6             JetBlue Airways
### > 24  carrier   DL        Delta Air Lines Inc.
### > 25  carrier   EV    ExpressJet Airlines Inc.
### > 26  carrier   F9      Frontier Airlines Inc.
### > 27  carrier   FL AirTran Airways Corporation
### > 28  carrier   HA      Hawaiian Airlines Inc.
### > 29  carrier   MQ                   Envoy Air
### > 30  carrier   OO       SkyWest Airlines Inc.
### > 31  carrier   UA       United Air Lines Inc.
### > 32  carrier   US             US Airways Inc.
### > 33  carrier   VX              Virgin America
### > 34  carrier   WN      Southwest Airlines Co.
### > 35  carrier   YV          Mesa Airlines Inc.
### > 36  carrier   NA                          NA
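
The dep_time rows above suggest that each quantile label is keyed to the upper bound of its bin. A rough base R sketch of that idea, on toy data, uses quantile() and cut(). This is an illustration under that assumption, not labelr's implementation, and labelr's exact quantile conventions may differ.

```r
x <- seq(10, 100, by = 10) # toy numeric variable

# Upper bounds of five equal-probability bins (20th, 40th, ..., 100th percentiles)
upper_bounds <- quantile(x, probs = seq(0.2, 1, by = 0.2), na.rm = TRUE)

# Labels in the same "q020" ... "q100" style shown above
q_labs <- sprintf("q%03d", as.integer(seq(20, 100, by = 20)))

# Assign each value to the first bin whose upper bound it does not exceed
x_lab <- as.character(cut(x, breaks = c(-Inf, upper_bounds), labels = q_labs))

data.frame(x, x_lab)
```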

We can use head() to get a baseline look at select rows and variables.

head(df[c("origin", "dep_time", "dest", "year", "month", "carrier")])
### >   origin dep_time dest year month carrier
### > 1    EWR      517  IAH 2013     1      UA
### > 2    LGA      533  IAH 2013     1      UA
### > 3    JFK      542  MIA 2013     1      AA
### > 4    JFK      544  BQN 2013     1      B6
### > 5    LGA      554  ATL 2013     1      DL
### > 6    EWR      554  ORD 2013     1      UA

Now, let’s do the same for a version we modified with use_val_labs(). Note that this cannot be “undone” (except for the usual clunky way of re-running our script up to this point and not doing this!).

df_swapd <- use_val_labs(df)
### > Warning in use_val_labs(df): 
### > Note: labelr is not optimized for data.frames this large.

head(df_swapd[c("origin", "dep_time", "dest", "year", "month", "carrier")])
### >   origin dep_time dest year month                carrier
### > 1    EWR     q020  IAH 2013   JAN  United Air Lines Inc.
### > 2    LGA     q020  IAH 2013   JAN  United Air Lines Inc.
### > 3    JFK     q020  MIA 2013   JAN American Airlines Inc.
### > 4    JFK     q020  BQN 2013   JAN        JetBlue Airways
### > 5    LGA     q020  ATL 2013   JAN   Delta Air Lines Inc.
### > 6    EWR     q020  ORD 2013   JAN  United Air Lines Inc.

Instead of replacing values (which we can’t undo), it might be safer to simply add “value-labels-on” character variables to the data.frame. With three labeled variables and 336,776 rows, this adds just over 1 million new cells, but let’s throw caution to the wind with add_lab_cols().

df_plus <- add_lab_cols(df, vars = c("carrier", "month", "dep_time"))
### > Warning in add_lab_cols(df, vars = c("carrier", "month", "dep_time")): 
### > 
### > Note: labelr is not optimized for data.frames this large.

head(df_plus[c(
  "origin", "dest", "year",
  "month", "month_lab",
  "dep_time", "dep_time_lab",
  "carrier", "carrier_lab"
)])
### >   origin dest year month month_lab dep_time dep_time_lab carrier
### > 1    EWR  IAH 2013     1       JAN      517         q020      UA
### > 2    LGA  IAH 2013     1       JAN      533         q020      UA
### > 3    JFK  MIA 2013     1       JAN      542         q020      AA
### > 4    JFK  BQN 2013     1       JAN      544         q020      B6
### > 5    LGA  ATL 2013     1       JAN      554         q020      DL
### > 6    EWR  ORD 2013     1       JAN      554         q020      UA
### >              carrier_lab
### > 1  United Air Lines Inc.
### > 2  United Air Lines Inc.
### > 3 American Airlines Inc.
### > 4        JetBlue Airways
### > 5   Delta Air Lines Inc.
### > 6  United Air Lines Inc.

We can use flab() to filter df based on month and carrier, even when value labels are “invisible” (i.e., existing only as attributes() meta-data).

# labels are not visible (they exist only as attributes() meta-data)
head(df[c("carrier", "arr_delay")])
### >   carrier arr_delay
### > 1      UA        11
### > 2      UA        20
### > 3      AA        33
### > 4      B6       -18
### > 5      DL       -25
### > 6      UA        12

# we still can use them to filter (note: we're filtering on "JetBlue Airways",
# ...NOT its obscure code "B6")
df_fl <- flab(df, carrier == "JetBlue Airways" & arr_delay > 20)
### > Warning in use_val_labs(data): 
### > Note: labelr is not optimized for data.frames this large.

# here's what's returned when we filtered on "JetBlue Airways" using flab()
head(df_fl[c("carrier", "arr_delay")])
### >     carrier arr_delay
### > 70       B6        44
### > 129      B6        24
### > 174      B6        40
### > 203      B6        42
### > 292      B6        29
### > 314      B6        38

# double-check that this is JetBlue
head(use_val_labs(df_fl)[c("carrier", "arr_delay")])
### >             carrier arr_delay
### > 70  JetBlue Airways        44
### > 129 JetBlue Airways        24
### > 174 JetBlue Airways        40
### > 203 JetBlue Airways        42
### > 292 JetBlue Airways        29
### > 314 JetBlue Airways        38
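
Conceptually, filtering on a label while keeping the underlying codes amounts to evaluating the condition against looked-up labels and then subsetting the original rows. The base R sketch below illustrates that idea with a toy data.frame and a hypothetical lookup vector (flights_mini and carrier_labs are made up here); it is not how flab() is implemented.

```r
flights_mini <- data.frame(
  carrier = c("UA", "B6", "AA", "B6"),
  arr_delay = c(11, 44, 33, -18)
)

# Hypothetical code -> label lookup (illustration only)
carrier_labs <- c(
  UA = "United Air Lines Inc.",
  B6 = "JetBlue Airways",
  AA = "American Airlines Inc."
)

# Evaluate the condition on the labels, but keep the original coded rows
keep <- unname(carrier_labs[flights_mini$carrier]) == "JetBlue Airways" &
  flights_mini$arr_delay > 20

filtered <- flights_mini[keep, ]
filtered
```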

How long did this entire NYC Flights session take (results will vary)?

the_buzzer <- Sys.time()
the_buzzer - opening_ding
### > Time difference of 1.409979 mins
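
Subtracting two Sys.time() stamps, as above, times everything between them. To time a single expression instead, base R's system.time() is a common alternative; the sketch below times an arbitrary toy computation, not a labelr call.

```r
# Time one expression; the sort is just a stand-in workload
timing <- system.time({
  x_sorted <- sort(runif(1e5))
})

timing[["elapsed"]] # wall-clock seconds; results will vary
```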

Value-Labeling Many Variables at Once

As shown earlier, functions for adding value labels (e.g., add_val_labs, add_quant_labs and add_m1_lab) will do partial matching if the partial argument is set to TRUE. Let’s use labelr’s make_likert_data() function to generate some fake Likert scale-style survey data to demonstrate this more fully.

set.seed(272) # for reproducibility
dflik <- make_likert_data(scale = 1:7) # another labelr function
head(dflik)
### >     id x1 x2 x3 x4 x5 y1 y2 y3 y4 y5
### > U-1  1  5  7  2  2  2  7  1  1  4  2
### > O-2  2  6  2  7  6  2  3  5  4  1  4
### > H-3  3  7  7  5  5  6  6  4  1  5  7
### > Z-4  4  4  5  5  4  5  6  3  7  3  4
### > C-5  5  3  3  3  1  6  2  7  6  3  5
### > P-6  6  7  3  5  3  7  5  7  1  6  2

We’ll put the values we wish to label and the labels we wish to use in stand-alone vectors, which we will supply to add_val_labs in a moment.

vals2label <- 1:7
labs2use <- c(
  "VSD",
  "SD",
  "D",
  "N",
  "A",
  "SA",
  "VSA"
)

Now, let’s associate/apply the value labels to ALL vars with “x” in their name and also to var “y3.” Note: partial = TRUE.

dflik <- add_val_labs(
  data = dflik, vars = c("x", "y3"), # note the vars arg
  vals = vals2label,
  labs = labs2use,
  partial = TRUE # applying to all cols with "x" or "y3" substring in names
)
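
To preview which columns a set of substrings would pick up, a base R sketch with grepl() can help. It assumes plain substring matching, as the code comment above describes, and is not labelr's own matching code; cols and patterns are illustrative names.

```r
cols <- c("id", "x1", "x2", "x3", "x4", "x5", "y1", "y2", "y3", "y4", "y5")
patterns <- c("x", "y3")

# TRUE for a column name if it contains at least one of the substrings
is_match <- sapply(cols, function(nm) {
  any(sapply(patterns, function(p) grepl(p, nm, fixed = TRUE)))
})

matched <- cols[is_match]
matched
```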

Let’s compare dflik with value labels present but “off” to labels “on.”

First, present but “off.”

head(dflik)
### >     id x1 x2 x3 x4 x5 y1 y2 y3 y4 y5
### > U-1  1  5  7  2  2  2  7  1  1  4  2
### > O-2  2  6  2  7  6  2  3  5  4  1  4
### > H-3  3  7  7  5  5  6  6  4  1  5  7
### > Z-4  4  4  5  5  4  5  6  3  7  3  4
### > C-5  5  3  3  3  1  6  2  7  6  3  5
### > P-6  6  7  3  5  3  7  5  7  1  6  2

Now, let’s “turn on” (use) these value labels.

lik1 <- uvl(dflik) # assign to new object, since we can't "undo"
head(lik1) # we could have skipped previous call by using labelr::headl(dflik)
### >     id  x1  x2  x3  x4  x5 y1 y2  y3 y4 y5
### > U-1  1   A VSA  SD  SD  SD  7  1 VSD  4  2
### > O-2  2  SA  SD VSA  SA  SD  3  5   N  1  4
### > H-3  3 VSA VSA   A   A  SA  6  4 VSD  5  7
### > Z-4  4   N   A   A   N   A  6  3 VSA  3  4
### > C-5  5   D   D   D VSD  SA  2  7  SA  3  5
### > P-6  6 VSA   D   A   D VSA  5  7 VSD  6  2

Yea, verily: All variables with “x” in their name (and “y3”) got the labels!

Suppose we want to drop these value labels for a select few, but not all, of these variables. drop_val_labs can get the job done.

dfdrop <- drop_val_labs(dflik,
  c("x2", "y3"),
  partial = FALSE
)

Most of our previously labeled columns remain so; but not “x2” and “y3.”

get_val_labs(dfdrop, "x2")
### > Warning in get_val_labs(dfdrop, "x2"): 
### >  
### >   No val.labs found.
### > [1] var  vals labs
### > <0 rows> (or 0-length row.names)

Compare to values for variable “x1” (we did not drop value labels from this one).

get_val_labs(dfdrop, "x1")
### >   var vals labs
### > 1  x1    1  VSD
### > 2  x1    2   SD
### > 3  x1    3    D
### > 4  x1    4    N
### > 5  x1    5    A
### > 6  x1    6   SA
### > 7  x1    7  VSA
### > 8  x1   NA   NA

Just like we did with add_val_labs(), we also can use a single command to drop value labels from all variables with “x” in their variable names.

dfxgone <- drop_val_labs(dflik,
  c("x"),
  partial = TRUE # note
)

“y3” still has value labels, but now all “x” var value labels are gone.

get_val_labs(dfxgone)
### >   var vals labs
### > 1  y3    1  VSD
### > 2  y3    2   SD
### > 3  y3    3    D
### > 4  y3    4    N
### > 5  y3    5    A
### > 6  y3    6   SA
### > 7  y3    7  VSA
### > 8  y3   NA   NA

Alias Functions and Conclusion

This concludes our whirlwind tour of labelr functionalities. You’ve graduated.

Well, almost. Before you go, here is a list of aliases for common functions. Aside from having a different name, each alias function is identical to (i.e., performs the same operations, returning the same result as) the parent function that it aliases. More concise and more cryptic, these alias functions will save you some typing at the console (and some characters in your scripts).

The available aliases are as follows: