How to Remove Rows with Missing Values from a Data Frame in R

Subset in R not removing rows in data frame

Understanding the Problem

The problem at hand is a common source of confusion when working with data frames in R. A user has pulled data from a web source, structured it into a data frame, and attempted to remove the rows that fail a condition. Instead of every non-qualifying row being removed, only a few are, leaving many observations with fewer than the required number of games played.

Background

To understand this issue, we need to dive into the world of data frames, subset functions, and data types in R. A data frame is a two-dimensional table that stores data in rows and columns. Each column represents a variable, while each row represents an observation or record.

In R, data frames are created using the data.frame() function, which combines multiple vectors into a single data structure. The subset() function is used to filter rows based on specific conditions.
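As a minimal sketch of those two functions working together, the following builds a small data frame and filters it. The `players` data and its values are invented for illustration and are not taken from the original question.

```r
# Hypothetical player data, invented for illustration
players <- data.frame(
  Player = c("A", "B", "C", "D"),
  G      = c(82, 12, 45, 7),
  PER    = c(21.3, 9.8, 15.1, 11.0)
)

# Keep only the rows where G exceeds 20
regulars <- subset(players, G > 20)
print(regulars)
```

Because `G` is a genuine numeric column here, the filter behaves as expected and only the rows with more than 20 games remain.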

Data Types in R

Before we dive deeper, let’s discuss data types in R:

  • Numeric: whole numbers and decimal values (e.g., 1, 2.5)
  • Character: strings of characters (e.g., “hello”, ‘goodbye’)
  • Logical: TRUE or FALSE values
  • Factor: categorical values stored internally as integer codes with text labels
  • Date: calendar dates, printed in the format YYYY-MM-DD

When working with data frames, it’s essential to understand that each column has a single data type. If a column holds numbers stored as text or as a factor, a comparison such as G > 20 does not behave numerically and can return the wrong rows without raising an error.
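One way to check the type of every column is to apply class() across the data frame. The sketch below uses an invented `scraped` data frame in which the games column arrived as text, and shows what happens when that text is compared with a number.

```r
# Hypothetical scraped table: every column arrives as text
scraped <- data.frame(
  Player = c("A", "B", "C"),
  G      = c("82", "5", "45"),
  stringsAsFactors = FALSE
)

# Report the class of each column
col_classes <- sapply(scraped, class)
print(col_classes)

# Comparing text with a number coerces the number to text,
# so the comparison is alphabetical rather than numeric
text_result <- scraped$G > 20
print(text_result)
```

Here the player with 5 games passes the filter, because the string "5" sorts after the string "20". Checking column classes before filtering catches this kind of silent mismatch.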

The Issue at Hand

In the provided code snippet, we’re trying to keep only players who have played more than 20 games. However, some observations with 20 or fewer games played are still included in the result.

advstats[,c('PER', 'BPM', 'G')] <- sapply(advstats[,c('PER','BPM', 'G')], as.numeric)
advstats <- subset(advstats, G > 20)

Here’s what happens:

  • We first convert columns PER, BPM, and G to numeric data type using as.numeric().
  • Then, we use the subset() function to filter rows where G is greater than 20.

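However, there’s an important detail: as.numeric() behaves differently depending on how a column is stored. Character values such as “5” or “6” convert cleanly to the numbers 5 and 6, and text that is not a number becomes NA with a warning. If G was read in as a factor, though, which is common for scraped data and was the default for data.frame() before R 4.0, as.numeric() returns each value’s internal level code instead of the number on display. The filter G > 20 then runs against those codes, so rows with few games played can survive it.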
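One cause worth ruling out, assuming the column was stored as a factor (the snippet above does not confirm this), is that as.numeric() returns level codes rather than the displayed numbers. The vector `g_factor` below is invented to show the difference.

```r
# Hypothetical games-played column stored as a factor
g_factor <- factor(c("82", "5", "45", "6"))

# as.numeric() on a factor returns the level codes, not the labels
codes <- as.numeric(g_factor)
print(codes)

# Going through as.character() first recovers the real numbers
values <- as.numeric(as.character(g_factor))
print(values)
```

The codes are small integers that index the factor's levels, so a filter applied to them bears no relation to the number of games actually played.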

The Fix

The issue arises because as.numeric() was applied directly to the columns without accounting for how they were stored. Assuming the columns are factors, the fix is to convert them to character first and then to numeric, so that the labels rather than the level codes become the numbers. The same code is also safe for columns that are already character.

advstats[,c('PER', 'BPM', 'G')] <- lapply(advstats[,c('PER','BPM', 'G')], function(x) as.numeric(as.character(x)))

In this revised code, we use lapply() to apply the conversion function to each column. as.character() recovers the text of each value, and as.numeric() then parses that text. Any value that cannot be parsed as a number becomes NA, and R issues a warning.

We then proceed with the subset operation as before.

advstats <- subset(advstats, G > 20)

With the columns holding the actual numeric values, the comparison G > 20 works as intended, and subset() also drops any rows where G is NA.
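The convert-then-filter workflow can be sketched end to end as follows. The data frame `advstats_demo` and its values are invented stand-ins for the real advstats table, and the conversion goes through as.character() on the assumption that the columns are factors.

```r
# Hypothetical stand-in for advstats, with numbers stored as factors
advstats_demo <- data.frame(
  Player = c("A", "B", "C", "D"),
  G      = factor(c("82", "5", "45", "6")),
  PER    = factor(c("21.3", "9.8", "15.1", "11.0")),
  BPM    = factor(c("4.2", "-1.5", "0.3", "-2.0"))
)

num_cols <- c("PER", "BPM", "G")

# Convert via character so factor labels, not level codes, become numbers
advstats_demo[num_cols] <- lapply(
  advstats_demo[num_cols],
  function(x) as.numeric(as.character(x))
)

# Count values that failed to convert (they become NA)
na_counts <- colSums(is.na(advstats_demo[num_cols]))
print(na_counts)

# Filter on the now-numeric column
filtered <- subset(advstats_demo, G > 20)
print(filtered)
```

Counting the NA values after conversion is a cheap sanity check: a nonzero count points to entries such as repeated header rows or footnote markers in the scraped source.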

Example Walkthrough

Let’s walk through an example to illustrate this concept:

Suppose we have a data frame df with three columns: Name, Age, and Height.

# Create a sample data frame
library(dplyr)

df <- tibble(
  Name = c("John", "Alice", "Bob", "Eve"),
  Age = c(25, 30, NA, 35),
  Height = c(180, 165, 175, NA)
)

In this example, the Age and Height columns contain both numeric and missing values.

If we subset people who are older than 30 using the subset() function, no conversion is needed, because Age is already numeric:

# Subset directly; Age is already numeric
df_subset <- subset(df, Age > 30)

This runs without error and returns only Eve’s row. Bob’s row is dropped as well, even though his age is unknown.

This is because the comparison NA > 30 evaluates to NA, and subset() keeps only the rows where the condition is TRUE.
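To see how missing values interact with a filter, the sketch below compares subset() with bracket indexing on the same ages. It uses a base R data frame named `people`, an illustrative stand-in for the example data.

```r
# Same ages as the example, in a base R data frame
people <- data.frame(
  Name = c("John", "Alice", "Bob", "Eve"),
  Age  = c(25, 30, NA, 35)
)

# Comparing NA with a number yields NA, not FALSE
older <- people$Age > 30
print(older)

# subset() treats an NA condition as "do not keep"
via_subset <- subset(people, Age > 30)

# Bracket indexing returns an all-NA row for each NA in the condition
via_bracket <- people[older, ]

print(nrow(via_subset))
print(nrow(via_bracket))
```

The two approaches return different row counts on the same condition, which is why the choice between them matters whenever a column contains NA.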

Note that as.numeric() has no na.rm argument and does not remove missing values. Bracket indexing also needs the column referenced through the data frame, and it needs an explicit guard against NA:

# Age and Height are already numeric, so no conversion is required
is.numeric(df$Age)
is.numeric(df$Height)

# Subset people who are older than 30, excluding rows where Age is missing
df_subset <- df[!is.na(df$Age) & df$Age > 30, ]

In this revised code, df$Age replaces the bare Age, which R would not find outside of subset(), and !is.na(df$Age) excludes rows with an unknown age, which would otherwise come back as rows of NA.
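When the goal is to remove the rows with missing values themselves, base R offers na.omit() and complete.cases(). The sketch below applies them to a base R copy of the example data. The name `people` is illustrative.

```r
# Same values as the example data frame, built with base R
people <- data.frame(
  Name   = c("John", "Alice", "Bob", "Eve"),
  Age    = c(25, 30, NA, 35),
  Height = c(180, 165, 175, NA)
)

# Drop every row that has a missing value in any column
complete_rows <- na.omit(people)

# Drop rows only when Age is missing, keeping rows with a missing Height
age_known <- people[!is.na(people$Age), ]

# complete.cases() selects the same rows as na.omit()
same_rows <- people[complete.cases(people), ]

print(complete_rows)
print(age_known)
```

Filtering on a single column with is.na() is the more conservative choice when other columns may legitimately be incomplete.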

By applying these steps consistently throughout your data analysis workflow, you’ll be able to avoid common pitfalls like this one and ensure accurate results.


Last modified on 2023-07-06