In addition to interlacing values and missing reasons, many statistical software packages will store categorical values and missing reasons as alphanumeric codes. Working with these files can be a pain because the codes are often arbitrary magic numbers that obfuscate the meaning of your syntax and results.
To facilitate working with such data, interlacer provides a new
cfactor type. The cfactor allows you to attach
labels to coded data and work with it as a regular R
factor. Unlike a regular R factor, however, a
cfactor can be converted back into its coded representation
at any time (whereas R factor values lose their original
codes).
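To make the contrast concrete, here is a minimal base R sketch of how a plain factor discards its codes. It uses no interlacer functions, and the codes, labels, and variable names (raw_codes, code_lookup, recovered) are illustrative assumptions rather than anything taken from the package:

```r
# Codes as they might appear in a raw data file (illustrative values)
raw_codes <- c(10, 20, 30, 10)

# A plain factor keeps the labels but does not keep the codes
f <- factor(
  raw_codes,
  levels = c(10, 20, 30),
  labels = c("LEVEL_A", "LEVEL_B", "LEVEL_C")
)

levels(f)
as.integer(f) # level indices (1, 2, 3, 1), not the original 10, 20, 30, 10

# Getting the codes back requires carrying a separate lookup table yourself
code_lookup <- c(LEVEL_A = 10, LEVEL_B = 20, LEVEL_C = 30)
recovered <- unname(code_lookup[as.character(f)])
recovered
```

Keeping such a lookup table in sync by hand is the bookkeeping that a type storing its own codes is meant to remove.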
⚠️ ⚠️ ⚠️ WARNING ⚠️ ⚠️ ⚠️
The cfactor type is a highly experimental feature (even
compared to the rest of interlacer) and has not been thoroughly tested!
I’m sharing it in a super pre-alpha, unstable state to get feedback
before I invest more time polishing its implementation.
SPSS-style codes
As a motivating example, consider this coded version of the
colors.csv example:
library(readr)
library(dplyr, warn.conflicts = FALSE)
library(interlacer, warn.conflicts = FALSE)
read_file(
interlacer_example("colors_coded.csv")
) |>
cat()
#> person_id,age,favorite_color
#> 1,20,1
#> 2,-98,1
#> 3,21,-98
#> 4,30,-97
#> 5,1,-99
#> 6,41,2
#> 7,50,-97
#> 8,30,3
#> 9,-98,-98
#> 10,-97,2
#> 11,10,-98
Where missing reasons are:
-99: N/A
-98: REFUSED
-97: OMITTED
And colors are coded:
1: BLUE
2: RED
3: YELLOW
This style of coding, with positive values representing categorical levels and negative values representing missing values, is a common format used by SPSS.
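As a rough illustration of this convention, the following base R sketch separates the favorite_color column above into value labels and missing-reason flags using only the sign of each code. It does not use interlacer, and the variable names (favorite_color_raw, value_codes, is_missing_reason, value_label) are hypothetical:

```r
# The favorite_color column from colors_coded.csv, as raw numeric codes
favorite_color_raw <- c(1, 1, -98, -97, -99, 2, -97, 3, -98, 2, -98)

# Positive codes are categorical levels
value_codes <- c(BLUE = 1, RED = 2, YELLOW = 3)

# Negative codes mark a missing reason rather than a value
is_missing_reason <- favorite_color_raw < 0

# Look up a label for each value code; missing reasons get NA here
value_label <- names(value_codes)[match(favorite_color_raw, value_codes)]

data.frame(favorite_color_raw, is_missing_reason, value_label)
```

This is only a sketch of the idea: it handles a single column and throws away which missing reason applied.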
These data can be loaded as interlaced numeric values as follows:
(df_coded <- read_interlaced_csv(
interlacer_example("colors_coded.csv"),
na = c(-99, -98, -97)
))
#> # A tibble: 11 × 3
#> person_id age favorite_color
#> <dbl,int> <dbl,int> <dbl,int>
#> 1 1 20 1
#> 2 2 <-98> 1
#> 3 3 21 <-98>
#> 4 4 30 <-97>
#> 5 5 1 <-99>
#> 6 6 41 2
#> 7 7 50 <-97>
#> 8 8 30 3
#> 9 9 <-98> <-98>
#> 10 10 <-97> 2
#> 11 11 10 <-98>
This representation is awkward to work with because the codes are
meaningless and obfuscate the significance of any code you write or any
results you output. If you wanted to select everyone with a
BLUE favorite color, for example, you would write:
df_coded |>
filter(favorite_color == 1)
#> # A tibble: 2 × 3
#> person_id age favorite_color
#> <dbl,int> <dbl,int> <dbl,int>
#> 1 1 20 1
#> 2 2 <-98> 1
Similarly, if you wanted to filter for OMITTED favorite
colors, you would write:
df_coded |>
filter(favorite_color == na(-97))
#> # A tibble: 0 × 3
#> # ℹ 3 variables: person_id <dbl,int>, age <dbl,int>, favorite_color <dbl,int>
To make these data more ergonomic to work with, you can use
interlacer’s v_col_cfactor() and
na_col_cfactor() collector types to load these values as a
cfactor instead, which allows you to associate codes with
human-readable labels:
(df_decoded <- read_interlaced_csv(
interlacer_example("colors_coded.csv"),
col_types = x_cols(
favorite_color = v_col_cfactor(codes = c(BLUE = 1, RED = 2, YELLOW = 3)),
),
na = na_col_cfactor(REFUSED = -99, OMITTED = -98, `N/A` = -97)
))
#> # A tibble: 11 × 3
#> person_id age favorite_color
#> <dbl,cfct> <dbl,cfct> <cfct,cfct>
#> 1 1 20 BLUE
#> 2 2 <OMITTED> BLUE
#> 3 3 21 <OMITTED>
#> 4 4 30 <N/A>
#> 5 5 1 <REFUSED>
#> 6 6 41 RED
#> 7 7 50 <N/A>
#> 8 8 30 YELLOW
#> 9 9 <OMITTED> <OMITTED>
#> 10 10 <N/A> RED
#> 11 11 10 <OMITTED>
Now human-readable labels, instead of the magic codes, can be used when working with the data:
df_decoded |>
filter(favorite_color == "BLUE")
#> # A tibble: 2 × 3
#> person_id age favorite_color
#> <dbl,cfct> <dbl,cfct> <cfct,cfct>
#> 1 1 20 BLUE
#> 2 2 <OMITTED> BLUE
df_decoded |>
filter(favorite_color == na("OMITTED"))
#> # A tibble: 0 × 3
#> # ℹ 3 variables: person_id <dbl,cfct>, age <dbl,cfct>,
#> # favorite_color <cfct,cfct>
But you can still convert the labels of values or missing reasons
back to codes if you wish, using as.codes(). The following
will convert the missing reason channel of age and the
value channel of favorite_color into their coded
representations:
df_decoded |>
mutate(
age = map_na_channel(age, as.codes),
favorite_color = map_value_channel(favorite_color, as.codes)
)
#> # A tibble: 11 × 3
#> person_id age favorite_color
#> <dbl,cfct> <dbl,int> <int,cfct>
#> 1 1 20 1
#> 2 2 <-98> 1
#> 3 3 21 <OMITTED>
#> 4 4 30 <N/A>
#> 5 5 1 <REFUSED>
#> 6 6 41 2
#> 7 7 50 <N/A>
#> 8 8 30 3
#> 9 9 <-98> <OMITTED>
#> 10 10 <-97> 2
#> 11 11 10 <OMITTED>
To recode all cfactor channels in a data frame into
their coded representation, you can do the following:
df_decoded |>
mutate(
across_value_channels(where_value_channel(is.cfactor), as.codes),
across_na_channels(where_na_channel(is.cfactor), as.codes),
)
#> # A tibble: 11 × 3
#> person_id age favorite_color
#> <dbl,int> <dbl,int> <int,int>
#> 1 1 20 1
#> 2 2 <-98> 1
#> 3 3 21 <-98>
#> 4 4 30 <-97>
#> 5 5 1 <-99>
#> 6 6 41 2
#> 7 7 50 <-97>
#> 8 8 30 3
#> 9 9 <-98> <-98>
#> 10 10 <-97> 2
#> 11 11 10 <-98>
SAS- and Stata-style codes
Like SPSS, SAS and Stata encode factor levels as numeric values, but they represent missing reasons with character codes rather than negative numbers:
read_file(
interlacer_example("colors_coded_char.csv")
) |>
cat()
#> person_id,age,favorite_color
#> 1,20,1
#> 2,.a,1
#> 3,21,.a
#> 4,30,.b
#> 5,1,.
#> 6,41,2
#> 7,50,.b
#> 8,30,3
#> 9,.a,.a
#> 10,.b,2
#> 11,10,.a
In this example, the same value coding scheme is used for
favorite_color as in the previous example, except the missing
reason channels are coded as follows:
“.”: N/A
“.a”: REFUSED
“.b”: OMITTED
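To show what this coding implies for a single column, here is a minimal base R sketch that splits the age column above into a numeric value channel and a labelled missing-reason channel. It does not use interlacer, and the variable names (raw_age, reason_codes, is_reason, age_value, age_reason) are hypothetical:

```r
# The age column from colors_coded_char.csv, read as character
raw_age <- c("20", ".a", "21", "30", "1", "41", "50", "30", ".a", ".b", "10")

# Character codes for missing reasons, as listed above
reason_codes <- c(`N/A` = ".", REFUSED = ".a", OMITTED = ".b")

# Anything matching a reason code is a missing reason, not a value
is_reason <- raw_age %in% reason_codes

# Value channel: numeric where a value exists, NA otherwise
age_value <- rep(NA_real_, length(raw_age))
age_value[!is_reason] <- as.numeric(raw_age[!is_reason])

# Missing-reason channel: label where a reason exists, NA otherwise
age_reason <- names(reason_codes)[match(raw_age, reason_codes)]

data.frame(raw_age, age_value, age_reason)
```

Because the reason codes are not numeric, the column has to be read as character first and converted only where a real value is present.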
These data can be easily loaded by interlacer into a
cfactor missing reason channel as follows:
read_interlaced_csv(
interlacer_example("colors_coded_char.csv"),
col_types = x_cols(
favorite_color = v_col_cfactor(codes = c(BLUE = 1, RED = 2, YELLOW = 3)),
),
na = c(`N/A` = ".", REFUSED = ".a", OMITTED = ".b"),
)
#> # A tibble: 11 × 3
#> person_id age favorite_color
#> <dbl,cfct> <dbl,cfct> <cfct,cfct>
#> 1 1 20 BLUE
#> 2 2 <REFUSED> BLUE
#> 3 3 21 <REFUSED>
#> 4 4 30 <OMITTED>
#> 5 5 1 <N/A>
#> 6 6 41 RED
#> 7 7 50 <OMITTED>
#> 8 8 30 YELLOW
#> 9 9 <REFUSED> <REFUSED>
#> 10 10 <OMITTED> RED
#> 11 11 10 <REFUSED>
The cfactor type
The cfactor is an extension of base R’s
factor type. A cfactor is created from numeric or
character codes using the cfactor()
function:
(example_cfactor <- cfactor(
c(10, 20, 30, 10, 20, 30),
codes = c(LEVEL_A = 10, LEVEL_B = 20, LEVEL_C = 30)
))
#> <cfactor<int+bd96a>[6]>
#> [1] LEVEL_A LEVEL_B LEVEL_C LEVEL_A LEVEL_B LEVEL_C
#>
#> Categorical levels:
#> label code
#> LEVEL_A 10
#> LEVEL_B 20
#> LEVEL_C 30
(example_cfactor2 <- cfactor(
c("a", "b", "c", "a", "b", "c"),
codes = c(LEVEL_A = "a", LEVEL_B = "b", LEVEL_C = "c")
))
#> <cfactor<chr+99cda>[6]>
#> [1] LEVEL_A LEVEL_B LEVEL_C LEVEL_A LEVEL_B LEVEL_C
#>
#> Categorical levels:
#> label code
#> LEVEL_A a
#> LEVEL_B b
#> LEVEL_C c
cfactor vectors can be used wherever regular base R
factor types are used, because they are fully compatible
factor types:
is.factor(example_cfactor)
#> [1] TRUE
levels(example_cfactor)
#> [1] "LEVEL_A" "LEVEL_B" "LEVEL_C"
is.factor(example_cfactor2)
#> [1] TRUE
levels(example_cfactor2)
#> [1] "LEVEL_A" "LEVEL_B" "LEVEL_C"
But unlike a regular factor, a cfactor
additionally stores the codes for the factor levels. This means you can
convert it back into its coded representation at any time, if
desired:
codes(example_cfactor)
#> LEVEL_A LEVEL_B LEVEL_C
#> 10 20 30
as.codes(example_cfactor)
#> [1] 10 20 30 10 20 30
codes(example_cfactor2)
#> LEVEL_A LEVEL_B LEVEL_C
#> "a" "b" "c"
as.codes(example_cfactor2)
#> [1] "a" "b" "c" "a" "b" "c"
IMPORTANT: The as.numeric() and
as.integer() functions do not convert a
cfactor with numeric codes into its coded representation.
Instead, to retain full compatibility with the base R
factor type, they always return the
index of each level in the factor:
as.numeric(example_cfactor)
#> [1] 1 2 3 1 2 3
as.numeric(example_cfactor2)
#> [1] 1 2 3 1 2 3
When the levels are changed, the cfactor will drop its
codes and degrade into a regular R factor:
cfactor_copy <- example_cfactor
# cfactor_copy is a cfactor and a factor
is.cfactor(cfactor_copy)
#> [1] TRUE
is.factor(cfactor_copy)
#> [1] TRUE
levels(cfactor_copy)
#> [1] "LEVEL_A" "LEVEL_B" "LEVEL_C"
codes(cfactor_copy)
#> LEVEL_A LEVEL_B LEVEL_C
#> 10 20 30
# modify the levels of the cfactor as if it was a regular factor
levels(cfactor_copy) <- c("C", "B", "A")
# now cfactor_copy is just a regular factor
is.cfactor(cfactor_copy)
#> [1] FALSE
is.factor(cfactor_copy)
#> [1] TRUE
levels(cfactor_copy)
#> [1] "C" "B" "A"
codes(cfactor_copy)
#> NULL
Finally, if you have a base R factor or character vector
of labels, you can add codes to it via as.cfactor():
as.cfactor(
c("LEVEL_A", "LEVEL_B", "LEVEL_C", "LEVEL_A", "LEVEL_B", "LEVEL_C"),
codes = c(LEVEL_A = 10, LEVEL_B = 20, LEVEL_C = 30)
)
#> <cfactor<int+bd96a>[6]>
#> [1] LEVEL_A LEVEL_B LEVEL_C LEVEL_A LEVEL_B LEVEL_C
#>
#> Categorical levels:
#> label code
#> LEVEL_A 10
#> LEVEL_B 20
#> LEVEL_C 30
Re-coding and writing an interlaced data frame
Re-coding and writing an interlaced data frame is as simple as
calling as.codes() on all cfactor type value
and missing reason channels, and then calling one of the
write_interlaced_*() family of functions:
df_decoded |>
mutate(
across_value_channels(where_value_channel(is.cfactor), as.codes),
across_na_channels(where_na_channel(is.cfactor), as.codes),
) |>
write_interlaced_csv("output.csv")
haven
The haven package has
functions for loading native SPSS, SAS, and Stata file formats
into special data frames that use column attributes and special values
to keep track of value labels and missing reasons. For a complete
discussion of how this compares to interlacer’s approach, see
vignette("other-approaches").
