Wide to long datasets the tidyverse way

In my collaborative stats work, I often analyze longitudinal or clustered data that need to be in a long, tidy format for analysis. These datasets typically arrive as an Excel spreadsheet, CSV, or in a REDCap database, and are almost always in a wide format with separate columns for baseline and each follow-up variable.

When there are many variables collected at each timepoint, the task of reshaping and tidying the dataset can be tedious. After fixing any inconsistencies in the variable naming, the approach shown here scales by only adding to a vector of measure names, which is πŸ‘πŸ‘πŸ‘ .

You can find more info on tidy data in Hadley Wickham’s article in JSS.

# Load R packages
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
library(tibble)
library(rlang)

The dataset

The dataset my_example has 3 constant or baseline variables – id, group, and age – and four repeated measures variables plus the date of collection for each measure. Each repeated measure is collected ten times.

my_example
## # A tibble: 50 x 53
##    study_id tx_group   age visit_date_1 first_measure_1 second_measure_1
##       <int> <fct>    <dbl> <date>                 <dbl> <chr>           
##  1        1 Tx        3.00 2017-01-01           -0.366  g               
##  2        2 Ctrl      5.00 2017-01-01            0.838  j               
##  3        3 Tx        8.00 2017-01-01            0.547  b               
##  4        4 Ctrl      8.00 2017-01-01           -0.484  g               
##  5        5 Tx        5.00 2017-01-01           -0.468  n               
##  6        6 Ctrl      2.00 2017-01-01            0.119  k               
##  7        7 Tx        6.00 2017-01-01            1.07   n               
##  8        8 Ctrl      5.00 2017-01-01           -0.0386 k               
##  9        9 Tx        8.00 2017-01-01           -0.0960 b               
## 10       10 Ctrl      4.00 2017-01-01            1.30   n               
## # ... with 40 more rows, and 47 more variables: third_measure_1 <dbl>,
## #   fourth_measure_1 <dbl>, visit_date_2 <date>, first_measure_2 <dbl>,
## #   second_measure_2 <chr>, third_measure_2 <dbl>, fourth_measure_2 <dbl>,
## #   visit_date_3 <date>, first_measure_3 <dbl>, second_measure_3 <chr>,
## #   third_measure_3 <dbl>, fourth_measure_3 <dbl>, visit_date_4 <date>,
## #   first_measure_4 <dbl>, second_measure_4 <chr>, third_measure_4 <dbl>,
## #   fourth_measure_4 <dbl>, visit_date_5 <date>, first_measure_5 <dbl>,
## #   second_measure_5 <chr>, third_measure_5 <dbl>, fourth_measure_5 <dbl>,
## #   visit_date_6 <date>, first_measure_6 <dbl>, second_measure_6 <chr>,
## #   third_measure_6 <dbl>, fourth_measure_6 <dbl>, visit_date_7 <date>,
## #   first_measure_7 <dbl>, second_measure_7 <chr>, third_measure_7 <dbl>,
## #   fourth_measure_7 <dbl>, visit_date_8 <date>, first_measure_8 <dbl>,
## #   second_measure_8 <chr>, third_measure_8 <dbl>, fourth_measure_8 <dbl>,
## #   visit_date_9 <date>, first_measure_9 <dbl>, second_measure_9 <chr>,
## #   third_measure_9 <dbl>, fourth_measure_9 <dbl>, visit_date_10 <date>,
## #   first_measure_10 <dbl>, second_measure_10 <chr>,
## #   third_measure_10 <dbl>, fourth_measure_10 <dbl>

The workhorse function

The split_gather() function does most of the work here. It selects all variables with a given prefix in their variable names, gathers them into a long format including a column for the variable names, then finally removes the prefix from the variable names column. The remaining text in the names column represents the visit number and can be further cleaned up as needed.

split_gather <- function(.data, prefix) {

  # Remove any trailing special characters for variable name
  prefix_symbol <- stringr::str_replace_all(prefix, "[[:punct:]]$", "") 

  # Select vars, gather (using prefix_symbol value with rlang's !!), and remove
  # prefix from visit_num values
  rtn <- .data %>% 
    dplyr::select(.data$study_id, tx_group, starts_with(prefix)) %>% 
    tidyr::gather(key = "visit_num", value = !!(prefix_symbol), 
                  -study_id, -tx_group) %>% 
    dplyr::mutate(visit_num = visit_num %>% 
                    stringr::str_replace_all(prefix, ""))

  return(rtn)

}

Do the work

Now that we have our workhorse function, we just collect the unique variable name prefixes and use map/reduce to apply the function to each prefix then join the resulting dataframes. The final step is to merge any non-time-varying data back on to the long dataset.

# Non-time-varying data
my_example_base <- my_example %>% select(study_id:age) 

# Grab varname prefixes
measure_names <- my_example %>% 
  select(-c(study_id:age)) %>% names %>%
  str_replace("[0-9]$|[0-9][0-9]$", "") %>% unique

# Create longitudinal dataframes then join them
long <- measure_names %>%
    map(~ split_gather(.data = my_example, prefix = .x)) %>%
    reduce(full_join) 

# Combine long data with baseline for final long dataset
long <- full_join(my_example_base, long)

Final steps

We now have a long dataset, shown below, with a column for each repeated measure, one for visit number, visit date, and the non-time-varying columns. Next up would be the final cleaning and analysis.

long[1:10, ] %>% knitr::kable(.)
study_id tx_group age visit_num visit_date first_measure second_measure third_measure fourth_measure
1 Tx 3 1 2017-01-01 -0.3661234 g 54.11452 0.1910377
1 Tx 3 2 2017-01-01 0.5324592 h 62.65747 2.8440664
1 Tx 3 3 2017-01-01 0.4580568 j 68.84082 0.8803083
1 Tx 3 4 2017-01-01 0.4275926 k 54.53255 2.4082683
1 Tx 3 5 2017-01-01 -0.3909250 v 29.03957 4.0649792
1 Tx 3 6 2017-01-01 -0.0283963 q 52.03553 1.5210179
1 Tx 3 7 2017-01-01 -0.3904764 v 86.47805 0.5549501
1 Tx 3 8 2017-01-01 1.5038701 k 76.68763 5.8673193
1 Tx 3 9 2017-01-01 0.1429862 f 22.17073 1.0447512
1 Tx 3 10 2017-01-01 -1.2442755 y 98.68397 0.1927802

A few departing notes

The workhorse function will need some fine tuning based on the dataset but overall this code has been useful on a number of occasions.

Parsing data from variable names is a fairly problem/dataset-specific task, so the prefix portions of the workhorse function will need to be adapted as-needed. Note that the join in the reduce() function occurs on any matching column names. Here study_id and visit_num are used. In this case, the variable name minus your prefix (or common name for a measure) is saved as the visit_num variable.

Also, this batch approach works better without factor variables – import with stringsAsFactors = FALSE and create factors after reshaping – otherwise you will get a warning messages about attributes being not identical.







P.S. Making the example data

Data was created by doing the split-gather in reverse.

numobs     <- 50
numreps    <- 10
my_example <- data.frame(study_id = 1:numobs, 
                         tx_group = rep(c("Tx","Ctrl"), numobs/2),
                         age = runif(numobs, 1, 10) %>% floor) 

create_visit_dat <- function(x) {
  rtn <- data.frame(v1 = rep(as.Date("2017-01-01"), numobs),
                    v2 = rnorm(numobs),
                    v3 = sample(letters, numobs, replace = TRUE),
                    v4 = runif(numobs, 10, 100),
                    v5 = rexp(numobs, 0.5),
                    stringsAsFactors = FALSE)
  colnames(rtn) <- paste0(c("visit_date_", "first_measure_", 
                            "second_measure_", "third_measure_",
                            "fourth_measure_"), x)
  return(rtn)
}

my_example <- map(1:numreps, ~ create_visit_dat(.x)) %>%
  reduce(cbind) %>% 
  cbind(my_example, .) %>% tbl_df

str(my_example, list.len = 10)
## Classes 'tbl_df', 'tbl' and 'data.frame':    50 obs. of  53 variables:
##  $ study_id         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ tx_group         : Factor w/ 2 levels "Ctrl","Tx": 2 1 2 1 2 1 2 1 2 1 ...
##  $ age              : num  5 9 5 9 9 2 7 1 8 3 ...
##  $ visit_date_1     : Date, format: "2017-01-01" "2017-01-01" ...
##  $ first_measure_1  : num  -0.588 0.325 -0.704 1.598 0.115 ...
##  $ second_measure_1 : chr  "x" "h" "j" "z" ...
##  $ third_measure_1  : num  97 47.8 91.2 53.2 62.7 ...
##  $ fourth_measure_1 : num  0.565 3.258 0.991 0.391 1.491 ...
##  $ visit_date_2     : Date, format: "2017-01-01" "2017-01-01" ...
##  $ first_measure_2  : num  0.1497 -0.7329 0.4616 -1.3273 -0.0183 ...
##   [list output truncated]

Session info

## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.3
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2    rlang_0.2.0     tibble_1.4.2    stringr_1.3.0  
##  [5] purrr_0.2.4     tidyr_0.8.0     dplyr_0.7.4     emo_0.0.0.9000 
##  [9] devtools_1.13.4 nvimcom_0.9-57 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.15     knitr_1.19       bindr_0.1        magrittr_1.5    
##  [5] tidyselect_0.2.3 R6_2.2.2         highr_0.6        tools_3.4.3     
##  [9] xfun_0.1         utf8_1.1.3       cli_1.0.0        withr_2.1.1     
## [13] htmltools_0.3.6  yaml_2.1.16      rprojroot_1.3-2  digest_0.6.15   
## [17] assertthat_0.2.0 crayon_1.3.4     bookdown_0.6     glue_1.2.0      
## [21] memoise_1.1.0    evaluate_0.10.1  rmarkdown_1.8    blogdown_0.5    
## [25] stringi_1.1.6    pillar_1.1.0     compiler_3.4.3   backports_1.1.2 
## [29] lubridate_1.7.3  pkgconfig_2.0.1