Data Manipulation of Longitudinal Data

1 Preamble

1.1 Load Libraries

library("tidyverse")

1.2 Simulate Data

First, let’s generate some example data.

set.seed(52242) # for a reproducible result

exampleData_long <- expand.grid(
  ID = 1001:1003,
  timepoint = 1:4
)

exampleData_long$score <- sample(
  x = 50:100,
  size = nrow(exampleData_long),
  replace = TRUE
)

The data are in long form—each participant (identified by ID) has multiple rows (i.e., each row is uniquely identified by the combination of ID and timepoint; thus, we would say the data are in ID–timepoint form):

exampleData_long
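
The claim that each row is uniquely identified by the combination of ID and timepoint can be checked directly. The following is a minimal sketch that recreates the example data and counts rows per key; the object names rowsPerKey, isUniqueKey, and rowsPerID are illustrative and not part of the original example.

```r
library("tidyverse")

# Recreate the example data (same seed and structure as above)
set.seed(52242)
exampleData_long <- expand.grid(
  ID = 1001:1003,
  timepoint = 1:4
)
exampleData_long$score <- sample(
  x = 50:100,
  size = nrow(exampleData_long),
  replace = TRUE
)

# Count rows for each ID-timepoint combination (hypothetical helper object)
rowsPerKey <- exampleData_long |>
  dplyr::count(ID, timepoint, name = "n_rows")

# TRUE when ID and timepoint together uniquely identify every row
isUniqueKey <- all(rowsPerKey$n_rows == 1)

# Count rows for each participant; in long form this exceeds 1
rowsPerID <- exampleData_long |>
  dplyr::count(ID, name = "n_rows")

isUniqueKey
rowsPerID
```

If isUniqueKey were FALSE, some participant would have more than one row at the same timepoint, and widening the data would need a rule for combining those duplicates.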

2 Transform Data From Long to Wide

To transform the data from long to wide, we can use the following syntax:

exampleData_wide <- exampleData_long |>
  tidyr::pivot_wider(
    names_from  = timepoint,
    values_from = score,
    names_prefix = "timepoint_"
  )

Now the data are in wide form. Each participant (identified by ID) has one row; that is, each row is uniquely identified by ID, so we would say the data are in ID form:

exampleData_wide
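
As a quick sanity check on the wide form, the sketch below recreates the data and confirms that there is one row per ID and one score column per timepoint. The object names wideDims, wideNames, and isUniqueID are illustrative assumptions.

```r
library("tidyverse")

# Recreate the long-form example data (same seed and structure as above)
set.seed(52242)
exampleData_long <- expand.grid(
  ID = 1001:1003,
  timepoint = 1:4
)
exampleData_long$score <- sample(
  x = 50:100,
  size = nrow(exampleData_long),
  replace = TRUE
)

# Widen: one column per timepoint, named timepoint_1, timepoint_2, ...
exampleData_wide <- exampleData_long |>
  tidyr::pivot_wider(
    names_from  = timepoint,
    values_from = score,
    names_prefix = "timepoint_"
  )

# Dimensions of the wide data: rows, then columns
wideDims <- dim(exampleData_wide)

# Column names: ID followed by one column per timepoint
wideNames <- names(exampleData_wide)

# TRUE when ID alone uniquely identifies every row
isUniqueID <- !any(duplicated(exampleData_wide$ID))

wideDims
wideNames
isUniqueID
```

With 3 participants and 4 timepoints, the wide data have 3 rows and 5 columns (ID plus four score columns).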

3 Transform Data From Wide to Long

To transform the data from wide back to long, we can use the following syntax:

exampleData_long2 <- exampleData_wide |>
  tidyr::pivot_longer(
    cols = starts_with("timepoint_"),
    names_to = "timepoint",
    names_prefix = "timepoint_",
    values_to = "score"
  )

Now the data are back in long form. Each participant (identified by ID) has multiple rows; that is, each row is uniquely identified by the combination of ID and timepoint, so we would say the data are in ID–timepoint form. Note that timepoint is now a character variable, because pivot_longer() builds it from the column names:

exampleData_long2
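
Because pivot_longer() builds timepoint from column names, the variable comes back as character rather than integer. One way to restore the original type during the pivot is the names_transform argument of pivot_longer(). The sketch below recreates the data, converts timepoint to integer, and checks that the round trip reproduces the original values; exampleData_long3, originalSorted, roundTripSorted, and roundTripMatches are illustrative names not used elsewhere.

```r
library("tidyverse")

# Recreate the long-form example data (same seed and structure as above)
set.seed(52242)
exampleData_long <- expand.grid(
  ID = 1001:1003,
  timepoint = 1:4
)
exampleData_long$score <- sample(
  x = 50:100,
  size = nrow(exampleData_long),
  replace = TRUE
)

# Long to wide, as in the section above
exampleData_wide <- exampleData_long |>
  tidyr::pivot_wider(
    names_from  = timepoint,
    values_from = score,
    names_prefix = "timepoint_"
  )

# Wide to long, converting timepoint back to integer during the pivot
exampleData_long3 <- exampleData_wide |>
  tidyr::pivot_longer(
    cols = starts_with("timepoint_"),
    names_to = "timepoint",
    names_prefix = "timepoint_",
    names_transform = list(timepoint = as.integer),
    values_to = "score"
  )

# Sort both versions the same way before comparing them
originalSorted <- exampleData_long |>
  dplyr::arrange(ID, timepoint)

roundTripSorted <- exampleData_long3 |>
  dplyr::arrange(ID, timepoint)

# TRUE when every ID, timepoint, and score survives the round trip
roundTripMatches <-
  all(originalSorted$ID == roundTripSorted$ID) &&
  all(originalSorted$timepoint == roundTripSorted$timepoint) &&
  all(originalSorted$score == roundTripSorted$score)

roundTripMatches
```

Sorting matters here because pivot_longer() returns rows grouped by ID, whereas expand.grid() varies ID fastest, so the two data sets list the same observations in a different order.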

4 Example of Exploding Number of Columns When Transforming From Long to Wide

Transforming from long to wide tends to work best when participants have the same values on the time variable (e.g., age). When participants all have different values on the time variable, the number of columns can explode when the data are transformed from long to wide form. For example, consider the following data:

set.seed(52242) # for a reproducible result

exampleData2_long <- expand.grid(
  ID = 1001:1003,
  instance = 1:4
) |>
  select(-instance)

exampleData2_long$age <- sample(
  x = 1:99,
  size = nrow(exampleData2_long),
  replace = FALSE
)

exampleData2_long$score <- sample(
  x = 50:100,
  size = nrow(exampleData2_long),
  replace = TRUE
)

Here are the data in long form:

exampleData2_long

Now, let’s widen the data by age:

exampleData2_wide <- exampleData2_long |>
  tidyr::pivot_wider(
    names_from  = age,
    values_from = score,
    names_prefix = "age_"
  )

Here are the data in wide form:

exampleData2_wide

Notice how we went from one age column in long form to 12 age columns in wide form, one for each distinct age in the data. Widening also introduces many missing values: each age column contains only one observed score, so with 3 participants and 12 age columns, 24 of the 36 cells are missing.
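
The sketch below quantifies the column explosion and shows one possible workaround, which is an assumption rather than part of the original example: number each participant's measurements in order of age and widen by that occasion index instead of by age, keeping age as a value column. The occasion variable and the object names ageColumns, nAgeColumns, nMissing, nObservedPerColumn, and exampleData2_wideByOccasion are hypothetical.

```r
library("tidyverse")

# Recreate the example data with participant-specific ages (same seed as above)
set.seed(52242)
exampleData2_long <- expand.grid(
  ID = 1001:1003,
  instance = 1:4
) |>
  select(-instance)
exampleData2_long$age <- sample(
  x = 1:99,
  size = nrow(exampleData2_long),
  replace = FALSE
)
exampleData2_long$score <- sample(
  x = 50:100,
  size = nrow(exampleData2_long),
  replace = TRUE
)

# Widen by age, as in the section above
exampleData2_wide <- exampleData2_long |>
  tidyr::pivot_wider(
    names_from  = age,
    values_from = score,
    names_prefix = "age_"
  )

# Quantify the explosion: number of age columns and missing cells
ageColumns <- exampleData2_wide |>
  dplyr::select(starts_with("age_"))
nAgeColumns <- ncol(ageColumns)
nMissing <- sum(is.na(ageColumns))
nObservedPerColumn <- colSums(!is.na(ageColumns))

# Possible alternative (assumption): widen by measurement occasion within ID,
# carrying both age and score so that no information is lost
exampleData2_wideByOccasion <- exampleData2_long |>
  dplyr::group_by(ID) |>
  dplyr::arrange(age, .by_group = TRUE) |>
  dplyr::mutate(occasion = dplyr::row_number()) |>
  dplyr::ungroup() |>
  tidyr::pivot_wider(
    names_from  = occasion,
    values_from = c(age, score)
  )

nAgeColumns
nMissing
exampleData2_wideByOccasion
```

Widening by occasion yields columns such as age_1 and score_1 for each participant's first measurement, so the number of columns depends on the number of measurements per participant rather than on the number of distinct ages. Whether this layout suits a given analysis depends on whether the occasions are comparable across participants.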

Reuse

CC BY 4.0
