Removing Duplicate Records with Old ID in SQL/HiveQL: A Step-by-Step Guide to Efficient Data Cleaning

Removing Duplicate Records with Old ID in SQL/HiveQL

Introduction

Have you ever encountered a situation where you need to remove duplicate records from a table, but the duplicates have an older id or refresh_id? This problem is more common than you think, and it can be challenging to solve. In this article, we will explore how to use SQL and HiveQL to remove duplicate records with old IDs.

Understanding Duplicate Records

Duplicate records are rows in a table that have the same values for certain columns, but different ids or refresh_ids. For example, consider a table with the following structure:

dayproduct_idrefresh_id
2021-11-19abcde1637515839
2021-11-19abcde1637515839
2021-11-19abcde1637515839
2021-11-19abcde1637515839
2021-11-19efghg1637430516

In this example, the record with product_id = abcde has four duplicate records with different refresh_ids. However, we want to keep only one of these records, say the one with the highest refresh_id.

Using SQL to Remove Duplicate Records

To remove duplicate records, you can use a combination of the ROW_NUMBER() and DENSE_RANK() functions in SQL.

Step 1: Identify the Duplicate Rows

First, we need to identify which rows are duplicates. We can do this by using the ROW_NUMBER() function to assign a unique number to each row within a partition of the result set.

SELECT day,
       product_id,
       refresh_id,
       ROW_NUMBER() OVER (PARTITION BY day ORDER BY refresh_id DESC) AS row_num
FROM table_name;

This will return something like this:

dayproduct_idrefresh_idrow_num
2021-11-19abcde16375158391
2021-11-19abcde16375158392
2021-11-19abcde16375158393
2021-11-19abcde16375158394
2021-11-19efghg16374305161

As we can see, the row_num column identifies which row is the original (i.e., the one with the highest refresh_id) for each group of duplicate rows.

Step 2: Filter Out the Duplicate Rows

Next, we use a subquery to select only the rows with row_num = 1, which correspond to the original duplicate rows.

SELECT day,
       product_id,
       refresh_id
FROM (
    SELECT day,
           product_id,
           refresh_id,
           ROW_NUMBER() OVER (PARTITION BY day ORDER BY refresh_id DESC) AS row_num
    FROM table_name
) AS subquery
WHERE row_num = 1;

This will return the original duplicate rows, with their respective refresh_ids.

Step 3: Update or Delete the Duplicate Rows

Finally, if you want to keep only one record per group of duplicates and delete the rest, you can use an UPDATE statement. However, be cautious when using DELETE statements, as they permanently remove data from your table.

UPDATE table_name
SET refresh_id = (
    SELECT refresh_id FROM (
        SELECT day,
               product_id,
               refresh_id,
               ROW_NUMBER() OVER (PARTITION BY day ORDER BY refresh_id DESC) AS row_num
        FROM table_name
    ) AS subquery
    WHERE day = '2021-11-19' AND product_id = 'abcde'
)
WHERE (day, product_id) IN (
    SELECT day,
           product_id
    FROM (
        SELECT day,
               product_id,
               refresh_id,
               ROW_NUMBER() OVER (PARTITION BY day ORDER BY refresh_id DESC) AS row_num
        FROM table_name
    ) AS subquery
    WHERE row_num > 1
);

DELETE FROM table_name
WHERE (day, product_id) IN (
    SELECT day,
           product_id
    FROM (
        SELECT day,
               product_id,
               refresh_id,
               ROW_NUMBER() OVER (PARTITION BY day ORDER BY refresh_id DESC) AS row_num
        FROM table_name
    ) AS subquery
    WHERE row_num > 1
);

HiveQL Equivalent

The same problem can be solved in HiveQL using the ROW_NUMBER() function, but with some differences in syntax. Here’s an example:

SELECT day,
       product_id,
       refresh_id
FROM (
    SELECT day,
           product_id,
           refresh_id,
           ROW_NUMBER() OVER (PARTITION BY day ORDER BY refresh_id DESC) AS row_num
    FROM table_name
) AS subquery
WHERE row_num = 1;

Conclusion

Removing duplicate records with old IDs is a common problem in data analysis. By using SQL and HiveQL, you can easily identify the original duplicates and filter out the rest. Remember to use ROW_NUMBER() or DENSE_RANK() functions to assign unique numbers to each row within a partition of the result set.

  1. Identify duplicate rows using ROW_NUMBER() function.
  2. Filter out duplicate rows using subquery with row_num = 1.
  3. Update or delete duplicate rows based on your requirements.

By following these steps, you can efficiently remove duplicate records from your data and improve its quality.


Last modified on 2024-01-13