Skip to contents

Labelled data in R

Historically, software like SPSS, SAS or Stata have worked with labelled data. The closest function in R base corresponds to factors, which actually work in reverse, assigning values to text variables with few distinct values.

Being able to work with labelled data is one of the features that R users who come from the previously mentioned systems miss the most.

In the R ecosystem there are 3 fundamental packages for working with labelled data haven, labelled y sjlabelled1.

However, these packages work in a similar way. They all expect you to load previously labelled data or assign the labels manually for each of the variables.

If you work with a dozen variables, this scheme is possible. However, as the number of variables grows, this scheme becomes difficult to handle. It should even be noted that some operations in R remove most of the attributes on objects, so perhaps some labels may even have to be assigned repeatedly.

In this sense, labeler offers a new scheme to work with labelled data in R, which is based on:

  1. The data does not need to be pre-labelled or manually labelled. For these purposes a dictionary, that may have been received with the data or built on the fly, is used.
  2. The assignment of the labels is done at the end of the pipeline, and in a semi-automated way.

The dictionary

The dictionary object consists of a list that contains each of the variables to be labeled. The variables in turn, consist of lists that contain inside:

  1. lab: A character string that describes the variable.
  2. labs: A named vector of the possible values that the variable takes.

In this sense, a valid dictionary in labeler can be built and is displayed as follows:

dict <- list(
  GENDER = list(
    lab = "Gender of the person",
    labs = c("Man" = 1, "Woman" = 2)
  ),
  AGE = list(
    lab = "Age of the person"
  ),
  SEX = list(
    lab = "link::GENDER",
    labs = "link::GENDER"
  ),
  Married = list(
    lab = "Indicates whether or not the person is married",
    labs = c("No" = 0, "Yes" = 1)
  )
)

Following these basic criteria, you can create a custom dictionary, either by building it from scratch, or by adding the necessary labels for the variables that you create through your analysis.

Parsing the dictionary

Working with non-ASCII characters in R is a tedious task. labeler offers a way to work with non-ASCII characters in dictionaries. For these purposes there is the parse_dict() function that allows parsing a dictionary to replace these characters. Note that this process is necessary only if the dictionary is intended to be saved as a valid R object, as part of a package for example. The other functions in labeler are designed to reverse this automatic forming process.

Browsing the dictionary

Using the browse_dict() function, the user will be able to interactively consult the dictionary in order to become familiar with its content.

Setting the labels

The function set_labels (), has as main objective to take the labels of variables and values available in a dictionary and apply them to a data table or dataframe.

The following are basic examples of using this feature. For the sake of familiarity, we will call the data set enft which is the acronym for the National Survey of the Dominican Republic Workforce, hinting that the way in which the functions are used in these examples is the way in which the one you will use when working with these surveys.

enft <- data.frame(
  GENDER = c(rep(1, 5), rep(2, 5)),
  AGE = c(seq(1, 30, 3)),
  SEX = c(rep(1, 5), rep(2, 5)),
  MARRIED = c(rep(0, 4), rep(1, 6))
)

Notice in the code below that just by applying the set_labels () function our data now contains labels we can use in our analysis.

str(enft)
#> 'data.frame':    10 obs. of  4 variables:
#>  $ GENDER : num  1 1 1 1 1 2 2 2 2 2
#>  $ AGE    : num  1 4 7 10 13 16 19 22 25 28
#>  $ SEX    : num  1 1 1 1 1 2 2 2 2 2
#>  $ MARRIED: num  0 0 0 0 1 1 1 1 1 1

str(set_labels(enft, dict = dict))
#> 'data.frame':    10 obs. of  4 variables:
#>  $ GENDER : num  1 1 1 1 1 2 2 2 2 2
#>   ..- attr(*, "label")= chr "Gender of the person"
#>   ..- attr(*, "labels")= Named num [1:2] 1 2
#>   .. ..- attr(*, "names")= chr [1:2] "Man" "Woman"
#>  $ AGE    : num  1 4 7 10 13 16 19 22 25 28
#>  $ SEX    : num  1 1 1 1 1 2 2 2 2 2
#>   ..- attr(*, "label")= chr "Gender of the person"
#>   ..- attr(*, "labels")= Named num [1:2] 1 2
#>   .. ..- attr(*, "names")= chr [1:2] "Man" "Woman"
#>  $ MARRIED: num  0 0 0 0 1 1 1 1 1 1

On the other hand, it is important to mention that as indicated in the description of the function’s arguments, it is possible to assign the data labels to specific variables.

str(set_labels(enft, dict = dict, vars = c("GENDER")))
#> 'data.frame':    10 obs. of  4 variables:
#>  $ GENDER : num  1 1 1 1 1 2 2 2 2 2
#>   ..- attr(*, "label")= chr "Gender of the person"
#>   ..- attr(*, "labels")= Named num [1:2] 1 2
#>   .. ..- attr(*, "names")= chr [1:2] "Man" "Woman"
#>  $ AGE    : num  1 4 7 10 13 16 19 22 25 28
#>  $ SEX    : num  1 1 1 1 1 2 2 2 2 2
#>  $ MARRIED: num  0 0 0 0 1 1 1 1 1 1

Compare this result with the previous one to notice that in this case the GENDER variable contains data labels, as indicated in the function call.

If you have paid enough attention to the output of the code, you will have noticed that the data labels have never been assigned to the MARRIED variable. This is so, because R is case-sensitive, which means that MARRIED as stated in the dataset is not the same as Married as stated in the dictionary. Fortunately, the ignore_case argument of the set_labels function is designed especially for this purpose. As shown below, using this argument you can make labels are assigned to this variable as well.

str(set_labels(enft, dict = dict, ignore_case = T))
#> 'data.frame':    10 obs. of  4 variables:
#>  $ GENDER : num  1 1 1 1 1 2 2 2 2 2
#>   ..- attr(*, "label")= chr "Gender of the person"
#>   ..- attr(*, "labels")= Named num [1:2] 1 2
#>   .. ..- attr(*, "names")= chr [1:2] "Man" "Woman"
#>  $ AGE    : num  1 4 7 10 13 16 19 22 25 28
#>  $ SEX    : num  1 1 1 1 1 2 2 2 2 2
#>   ..- attr(*, "label")= chr "Gender of the person"
#>   ..- attr(*, "labels")= Named num [1:2] 1 2
#>   .. ..- attr(*, "names")= chr [1:2] "Man" "Woman"
#>  $ MARRIED: num  0 0 0 0 1 1 1 1 1 1
#>   ..- attr(*, "label")= chr "Indicates whether or not the person is married"
#>   ..- attr(*, "labels")= Named num [1:2] 0 1
#>   .. ..- attr(*, "names")= chr [1:2] "No" "Yes"

However, depending on the size and origin of your data set and dictionary, you should be cautious with this option, to avoid unexpected results. So it might be a good idea to use this argument when specifying target variables.

Another detail, which if you are a good observer you will have noticed, is the use of the wildcard link:: in the definition of the labels of the SEX variable. In this case, as you will guess when you see the labels that are assigned to this last variable, what is achieved by using this wildcard is to reference the labels between one variable and another. This way, if you create a variable that uses exactly the same labels as another, it will be much easier to add the same to the dictionary. Of course, a more plausible option is to create your own custom dictionary of frequently created variables that you can reuse from analysis to analysis.

Also, if for some reason you want to overwrite some of the predefined tags in the dictionary, you just need to take into account the structure of the dictionary and make the required replacements. For example, if instead of Man / Woman you want to use the abbreviations of these words, you can get it with the code below.

dict[["SEX"]]$labs <- c("M" = 1, "W" = 2)

str(set_labels(enft, dict = dict, vars = c("SEX")))
#> 'data.frame':    10 obs. of  4 variables:
#>  $ GENDER : num  1 1 1 1 1 2 2 2 2 2
#>  $ AGE    : num  1 4 7 10 13 16 19 22 25 28
#>  $ SEX    : num  1 1 1 1 1 2 2 2 2 2
#>   ..- attr(*, "label")= chr "Gender of the person"
#>   ..- attr(*, "labels")= Named num [1:2] 1 2
#>   .. ..- attr(*, "names")= chr [1:2] "M" "W"
#>  $ MARRIED: num  0 0 0 0 1 1 1 1 1 1

##chec Using the labels

As you may have noticed in all the previous examples, we have used the str() function to show the changes in the data, resulting from the application of the labels. This is so, because even having assigned the labels, your data still looks exactly the same. At this point the use_labels() function comes into play, which allows us to substitute the values of a variable for the corresponding labels.

use_labels(enft, dict = dict, ignore_case = T)
#>    GENDER AGE SEX MARRIED
#> 1     Man   1   M      No
#> 2     Man   4   M      No
#> 3     Man   7   M      No
#> 4     Man  10   M      No
#> 5     Man  13   M     Yes
#> 6   Woman  16   W     Yes
#> 7   Woman  19   W     Yes
#> 8   Woman  22   W     Yes
#> 9   Woman  25   W     Yes
#> 10  Woman  28   W     Yes

You can use the set_labels() function to assign the labels at a point in the pipeline, and the use_labels() function at a later stage to use them. Or you can, as shown below and the easiest way, pass the dictionary directly to the use_labels() function which will take care of calling the set_labels() function automatically. Anyway, a more convenient option is to assign / use the labels once a summary table of the data has been generated. What has the advantage of operating on a smaller number of variables.

enft %>% 
  dplyr::group_by(
    gender = GENDER,
    united = MARRIED
  ) %>% 
  dplyr::count() %>% 
  set_labels(dict = dict, ignore_case = T) %>% 
  use_labels(NULL)
#> # A tibble: 3 × 3
#> # Groups:   gender, united [3]
#>   gender united     n
#>   <fct>   <dbl> <int>
#> 1 Man         0     4
#> 2 Man         1     1
#> 3 Woman       1     5

As can be seen in the previous example, even though the GENDER variable has been renamed, just by assigning the ignore_case = TRUE argument, the function was able to assign the data labels. While in the case of the variable MARRIED the name has been changed to one that is not present in the dictionary, so the function was not able to find the corresponding labels. This could have been achieved in the following way, for example:

enft %>% 
  dplyr::group_by(
    gender = GENDER, 
    married = MARRIED
    ) %>% 
  dplyr::count() %>% 
  use_labels(dict = dict, ignore_case = T) %>% 
  dplyr::rename("united" = "married")
#> # A tibble: 3 × 3
#> # Groups:   gender, united [3]
#>   gender united     n
#>   <fct>  <fct>  <int>
#> 1 Man    No         4
#> 2 Man    Yes        1
#> 3 Woman  Yes        5

Furthermore, regardless of which variables have assigned labels, it is possible to specify which of them you want to use. This is possible using the vars argument. Note also that the arguments can be combined, arriving at the desired result.

enft %>% 
  dplyr::group_by(
    gender = GENDER, 
    married = MARRIED
    ) %>% 
  dplyr::count() %>% 
  use_labels(dict = dict, vars = c("gender"), ignore_case = T) %>% 
  dplyr::rename("united" = "married")
#> # A tibble: 3 × 3
#> # Groups:   gender, united [3]
#>   gender united     n
#>   <fct>   <dbl> <int>
#> 1 Man         0     4
#> 2 Man         1     1
#> 3 Woman       1     5

When variables contain values that are not in the dictionary, use_labels() by default prevents labels from being assigned to these variables and prints a warning in the console indicating the variables that could not be labeled. You can assign the argument use_labels(..., check = FALSE) so that the labels are used regardless of those values that are not present in the dictionary, which will be converted to NA. In this sense, the best option is to add to the dictionary all the values that the variable has to take. If you just want to avoid the console warning, use the vars argument to specify the variables to be labelled, excluding the conflicting variables.

enft %>% 
  dplyr::group_by(
    GENDER, 
    SEX = SEX + 1
    ) %>% 
  dplyr::count() %>% 
  use_labels(dict = dict)
#> Warning in use_labels(., dict = dict): The following (1) variables contain values
#>       that are not in the dictionary and were not labeled: 
#> SEX.
#>       Please see "https://adatar-do.github.io/labeler/articles/labeler.html" for more details.
#> # A tibble: 2 × 3
#> # Groups:   GENDER, SEX [2]
#>   GENDER   SEX     n
#>   <fct>  <dbl> <int>
#> 1 Man        2     5
#> 2 Woman      3     5