Pairwise Table in widyr: A Practical Guide for Co-Occurrence Analysis in R

====================================

In this article, we explore how to create a pairwise table using the widyr package in R. The pairwise_count function counts how often pairs of items co-occur, but it expects tidy input with one row per item and feature combination. This tutorial focuses on transforming semicolon-separated data into that format.

Introduction

The widyr package performs pairwise operations, such as counts and correlations, on tidy data: it widens the data internally, does the computation, and returns a tidy result. Its pairwise_count function therefore needs one row per observation, which here means one row per combination of title and author. This article gives practical guidance on reshaping your data into that form.

Understanding the Problem

The original problem statement presents a dataset of titles and authors in which the authors column lists each paper's authors as semicolon-separated values. Commas appear inside each name ("Smith, David"), so only the semicolon separates one author from the next. The goal is to build a pairwise table that counts, for every pair of authors, how many titles they share.

# Create a dummy dataframe with authors and titles
authors_titles <- data.frame(title = c("paper_a", "paper_b", "paper_c", "paper_d"),
                              authors = c("Smith, David; Wright, James; Hughs, Jessica; Barro, Albert",
                                         "Smith, David; Wright, Jessica; Wright, James",
                                         "Smith, Jenny; Hughs, Jessica",
                                         "Wright, James; Hughs, Jessica; Barro, Albert"))

# Print the original dataframe
print(authors_titles)
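To make the structure of the authors column concrete, here is a minimal base R sketch that splits a single entry by hand. The variable names one_row and authors are illustrative only; the point is that splitting on the semicolon leaves each "Last, First" name intact, although it also leaves stray spaces that need trimming.

```r
# One value from the authors column of the dummy data
one_row <- "Smith, David; Wright, Jessica; Wright, James"

# Split on the literal semicolon, then strip the spaces left around each name
authors <- trimws(strsplit(one_row, ";", fixed = TRUE)[[1]])

print(authors)
```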

Solution Overview

To create a pairwise table in widyr, we need to:

Split each semicolon-separated authors string into individual authors.
Transform the data into a long format in which each row holds exactly one title and one author.
Use pairwise_count to count how many titles each pair of authors shares.
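The three steps above can be sketched compactly before going through them one at a time. The sketch below is an alternative to the column-by-column walk-through that follows, not the article's own method: it assumes tidyr::separate_rows is acceptable for the split, and the names authors_long and author_pairs are illustrative.

```r
library(dplyr)
library(tidyr)
library(widyr)

# Same dummy data as in the article
authors_titles <- data.frame(
  title = c("paper_a", "paper_b", "paper_c", "paper_d"),
  authors = c("Smith, David; Wright, James; Hughs, Jessica; Barro, Albert",
              "Smith, David; Wright, Jessica; Wright, James",
              "Smith, Jenny; Hughs, Jessica",
              "Wright, James; Hughs, Jessica; Barro, Albert")
)

# Steps 1 and 2: one row per title and author.
# The regular expression also consumes any spaces around the semicolon.
authors_long <- authors_titles %>%
  separate_rows(authors, sep = "\\s*;\\s*") %>%
  rename(author = authors)

# Step 3: for each pair of authors, count the titles they share
author_pairs <- authors_long %>%
  pairwise_count(author, title, sort = TRUE)

print(author_pairs)
```

By default pairwise_count reports every pair in both orders, so each co-authorship appears twice in the printed result.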

Step-by-Step Solution

Step 1: Count the Authors per Paper

We use the stringr::str_count function to count the number of semicolons (;) in each authors string and store the result in a new column. A paper with k semicolons has k + 1 authors, so the largest count plus one tells us how many author columns the next step needs.

# Load necessary libraries
library(widyr)
library(dplyr)
library(tidyr)

# Create a dataframe with authors and titles
authors_titles <- data.frame(title = c("paper_a", "paper_b", "paper_c", "paper_d"),
                              authors = c("Smith, David; Wright, James; Hughs, Jessica; Barro, Albert",
                                         "Smith, David; Wright, Jessica; Wright, James",
                                         "Smith, Jenny; Hughs, Jessica",
                                         "Wright, James; Hughs, Jessica; Barro, Albert"))

# Count the number of semicolons in each authors string and keep the result
authors_titles <- authors_titles %>% 
  mutate(n_sep = stringr::str_count(authors, ";"))

Step 2: Separate Authors into Individual Columns

Using separate from the tidyr package, we split the authors into individual columns. Papers with fewer authors than the maximum are padded with NA on the right.

# Separate the authors into one column per author position
authors_wide <- authors_titles %>% 
  separate(authors, sep = ";",
           into = sprintf("author.%d", seq_len(max(.$n_sep) + 1)),
           fill = "right") %>% 
  select(-n_sep)

Step 3: Trim Whitespace from Author Names

Splitting on ";" leaves a leading space on every author after the first, so we strip it from all author columns with the trimws function. Without this step, " Wright, James" and "Wright, James" would be counted as different authors.

# Remove leading and trailing whitespace from every author column
authors_wide <- authors_wide %>% 
  mutate(across(starts_with("author."), trimws))

Step 4: Reshape to Long Format and Count Pairs

Finally, we reshape the data so that each row holds one title and one author, and pass it to pairwise_count. Here author is the item being paired and title is the feature that links the items.

# Reshape to one row per title and author, dropping the NA padding
authors_long <- authors_wide %>% 
  pivot_longer(-title, values_to = "author", values_drop_na = TRUE) %>% 
  select(title, author)

# Count how many titles each pair of authors shares
authors_long %>% 
  pairwise_count(author, title, sort = TRUE)
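Once pairwise_count has run on a long table with one row per title and author, its result has the columns item1, item2 and n, and by default lists each pair in both orders. The sketch below shows two ways of reading that result; it is an illustration under stated assumptions, not part of the original solution. The long table is typed out by hand so the block runs on its own, and the names unique_pairs and coauthors_of_james are hypothetical. widyr also documents an upper argument for keeping one ordering; an explicit filter is used here instead.

```r
library(dplyr)
library(widyr)

# The long table from the dummy data: one row per title and author
authors_long <- data.frame(
  title = rep(c("paper_a", "paper_b", "paper_c", "paper_d"), times = c(4, 3, 2, 3)),
  author = c("Smith, David", "Wright, James", "Hughs, Jessica", "Barro, Albert",
             "Smith, David", "Wright, Jessica", "Wright, James",
             "Smith, Jenny", "Hughs, Jessica",
             "Wright, James", "Hughs, Jessica", "Barro, Albert")
)

author_pairs <- authors_long %>%
  pairwise_count(author, title, sort = TRUE)

# Keep each unordered pair once by retaining a single ordering
unique_pairs <- author_pairs %>%
  filter(item1 < item2)

# All co-authors of one person, most frequent first
coauthors_of_james <- author_pairs %>%
  filter(item1 == "Wright, James") %>%
  arrange(desc(n), item2)

print(unique_pairs)
print(coauthors_of_james)
```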

The complete code for this solution is as follows:

# Load necessary libraries
library(widyr)
library(dplyr)
library(tidyr)

# Create a dataframe with authors and titles
authors_titles <- data.frame(title = c("paper_a", "paper_b", "paper_c", "paper_d"),
                              authors = c("Smith, David; Wright, James; Hughs, Jessica; Barro, Albert",
                                         "Smith, David; Wright, Jessica; Wright, James",
                                         "Smith, Jenny; Hughs, Jessica",
                                         "Wright, James; Hughs, Jessica; Barro, Albert"))

# Count the number of semicolons in each authors string and keep the result
authors_titles <- authors_titles %>% 
  mutate(n_sep = stringr::str_count(authors, ";"))

# Separate the authors into one column per author position (semicolons + 1)
authors_wide <- authors_titles %>% 
  separate(authors, sep = ";",
           into = sprintf("author.%d", seq_len(max(.$n_sep) + 1)),
           fill = "right") %>% 
  select(-n_sep)

# Remove leading and trailing whitespace from every author column
authors_wide <- authors_wide %>% 
  mutate(across(starts_with("author."), trimws))

# Reshape to one row per title and author, dropping the NA padding
authors_long <- authors_wide %>% 
  pivot_longer(-title, values_to = "author", values_drop_na = TRUE) %>% 
  select(title, author)

# Count how many titles each pair of authors shares
authors_long %>% 
  pairwise_count(author, title, sort = TRUE)
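Because pairwise_count returns one row per ordered pair, its result is a long table. If a square author-by-author layout is easier to read, one possible follow-up is to spread it with tidyr::pivot_wider. This is a sketch rather than part of the original solution: the names author_pairs and pair_table are illustrative, and the long table is rebuilt with separate_rows so the block runs on its own.

```r
library(dplyr)
library(tidyr)
library(widyr)

# Rebuild the pair counts from the dummy data
author_pairs <- data.frame(
  title = c("paper_a", "paper_b", "paper_c", "paper_d"),
  authors = c("Smith, David; Wright, James; Hughs, Jessica; Barro, Albert",
              "Smith, David; Wright, Jessica; Wright, James",
              "Smith, Jenny; Hughs, Jessica",
              "Wright, James; Hughs, Jessica; Barro, Albert")
) %>%
  separate_rows(authors, sep = "\\s*;\\s*") %>%
  pairwise_count(authors, title)

# Spread item2 across the columns; pairs that never co-occur get 0
pair_table <- author_pairs %>%
  arrange(item1, item2) %>%
  pivot_wider(names_from = item2, values_from = n, values_fill = 0)

print(pair_table)
```

The column names contain commas and spaces, so individual columns are easiest to reach with double brackets, for example pair_table[["Wright, James"]].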

Last modified on 2023-12-18