Creating a New Column Based on Values in Other Rows
In this article, we will explore how to create a new column in a data frame that takes values from other rows only for certain conditions. We’ll use the dplyr and tidyr packages in R to achieve this.
Background
When working with data frames, it’s common to have situations where you need to perform calculations or assignments based on values in other columns or even entire rows. In such cases, creating a new column that incorporates these values can be both useful and efficient.
Using dplyr and tidyr
Here, we will use the dplyr package for its powerful grammar of data manipulation, which allows us to specify operations in a declarative way. Specifically, we’ll utilize the group_by function followed by mutate to create new columns and perform calculations.
Identifying Baseline Med
Let’s start with identifying the baseline medication (med) for each study participant. This involves finding the first visit where the participant had less than or equal to 8 medications.
library(dplyr)
library(tidyr)
dset %>%
group_by(studynr) %>% # Group by studynr
mutate(baseline_med = case_when( # Define a conditional statement for baseline_med
med_dt == min(med_dt) & TOTAL <= 8 ~ 'A', # If TOTAL is less than or equal to 8, assign 'A'
med_dt == min(med_dt) & TOTAL > 8 ~ 'B', # If TOTAL is greater than 8, assign 'B'
TRUE ~ NA # Otherwise, return NA
)) %>%
fill(baseline_med, .direction = 'updown') %>% # Fill missing values with the previous value in baseline_med column
ungroup() # Remove group_by from result
Creating a New Column
The mutate function creates a new column called baseline_med. We’ve used a conditional statement to assign either ‘A’ or ‘B’ based on whether the first visit had fewer than or equal to 8 medications.
Note that we use min(med_dt) to identify the baseline visit for each participant. This assumes that the earliest date represents the participant’s first study visit.
Results
After executing this code, you’ll see a new column called baseline_med in your data frame with values assigned as ‘A’ or ‘B’.
The resulting output from running this code:
| studynr | med_dt | TOTAL | baseline_med |
|---|---|---|---|
| 1 | 2012-12-13 12:00:00 | 7 | A |
| 1 | 2013-07-03 12:00:00 | 9 | A |
| 1 | 2017-06-01 12:00:00 | 10 | A |
| 2 | 2012-10-23 12:00:00 | 8 | A |
| 2 | 2017-10-24 12:00:00 | 7 | A |
| 2 | 2016-11-12 12:00:00 | 9 | A |
| 3 | 2016-05-27 12:00:00 | 10 | B |
| 3 | 2017-09-25 12:00:00 | 8 | B |
| 3 | 2018-09-01 12:00:00 | 6 | B |
| 4 | 2019-06-05 12:00:00 | 5 | B |
| 4 | 2015-06-11 12:00:00 | 8 | B |
| 4 | 2015-03-31 12:00:00 | 10 | B |
| 5 | 2017-02-07 12:00:00 | 15 | B |
| 5 | 2017-12-21 12:00:00 | 12 | B |
| 5 | 2014-09-19 12:00:00 | 10 | B |
Using fill with Upward Direction
The fill function is used to fill missing values in the baseline_med column. The .direction = 'updown' argument tells dplyr to use upward direction for filling.
This means that if there are any missing values in this new column, they will be replaced with the previous value from the previous row using the upward direction (i.e., moving up from the most recent row).
This approach ensures consistency and coherence in our output data frame.
Last modified on 2023-05-21