BigQuery: Find Duplicates Using WHERE, QUALIFY and HAVING

BigQuery: Finding Duplicates Using WHERE, QUALIFY and HAVING https://www.bensound.com/ (energy) https://www.bensound.com/

Cleaning data is a core analyst task — and duplicates are a major source of bad metrics. They inflate aggregates, distort reporting, and break downstream logic. BigQuery’sQUALIFY clause provides a clean, direct way to filter window functions without subqueries. This guide shows how to use QUALIFY to spot and remove duplicates and how it differs from WHERE and HAVING

Where QUALIFY Fits in the SQL Workflow • Before the examples, here’s how the clauses line up: • WHERE → filters rows before grouping or window functions • HAVING → filters rows after aggregations • QUALIFY → filters rows after window functions • Easy rule of thumb: • Use WHERE for raw row‑level filters • Use HAVING for aggregated filters • Use QUALIFY for window‑function filter

What the QUALIFY Clause Does QUALIFY lets you filter on the result of a window function after it has been calculated. • Without QUALIFY: you need a subquery SELECT * FROM • ( • SELECT *, • ROW_NUMBER() OVER (PARTITION BY email ORDER BY updated_at DESC) AS rn • FROM sample_data • ) WHERE rn = 1; With QUALIFY: you don’t need a subquery SELECT *, ROW_NUMBER() OVER (PARTITION BY email ORDER BY updated_at DESC) AS rn FROM sample_data QUALIFY rn = 1; It’s cleaner, easier to read, and avoids unnecessary subqueries.

Why QUALIFY Is So Useful • QUALIFY is becoming a standard SQL pattern because it: • avoids unnecessary subqueries • keeps window logic and filtering together • improves readability • reduces indentation and nesting • makes deduplication patterns easy to teach and reuse • For analytics pipelines, dbt models, and BI‑ready tables, it’s one of the cleanest tools you can add to your SQL workflow

Sample Data With Duplicates Here’s a small dataset you can paste straight into BigQuery: WITH sample_data AS ( SELECT * FROM UNNEST([ STRUCT(1 AS id, "alice@example.com" AS email, TIMESTAMP("2024-01-01 10:00:00") AS updated_at), STRUCT(2 AS id, "bob@example.com" AS email, TIMESTAMP("2024-01-02 09:00:00") AS updated_at), STRUCT(3 AS id, "alice@example.com" AS email, TIMESTAMP("2024-01-03 12:30:00") AS updated_at), STRUCT(4 AS id, "carol@example.com" AS email, TIMESTAMP("2024-01-04 08:15:00") AS updated_at), STRUCT(5 AS id, "bob@example.com" AS email, TIMESTAMP("2024-01-05 14:45:00") AS updated_at), STRUCT(6 AS id, "dave@example.com" AS email, TIMESTAMP("2024-01-06 11:20:00") AS updated_at), STRUCT(7 AS id, "bob@example.com" AS email, TIMESTAMP("2024-01-07 16:10:00") AS updated_at) ]) ) SELECT * FROM sample_data;

Sample Data With Duplicates • Duplicates: • • alice@example.com : appears twice • • bob@example.com : appears three times • Perfect for demonstrating deduplication.

Finding Duplicate Rows with QUALIFY • The core pattern: • SELECT *, • ROW_NUMBER() OVER ( PARTITION BY email ORDER BY updated_at DESC • ) AS rn FROM sample_data • QUALIFY rn > 1; What each part does: • PARTITION BY email — defines what counts as a duplicate • ORDER BY updated_at DESC — ranks the newest record first • ROW_NUMBER() — assigns a unique number within each group • QUALIFY rn > 1 — returns only the duplicates

Finding Duplicate Rows with QUALIFY This gives you all rows except the most recent one for each email.This screenshot shows the duplicate rows returned by the query. Each row has an rn value greater than 1, which means it isn’t the most recent record for that email.

Keeping Only the Latest Record • To keep the newest version of each row: • SELECT * • FROM sample_data • QUALIFY ROW_NUMBER() OVER (PARTITION BY email ORDER BY updated_at DESC • ) = 1; • Ideal for: • slowly changing dimensions • event logs • incremental ingestion • cleaning dimension tables before loading into BI tools

Keeping Only the Latest Record This screenshot shows the latest record for each email, with only the most recent value kept and all older duplicates removed.

WHERE Still Matters • If you need to filter the raw dataset before running window functions, use WHERE: • SELECT * from • ( • SELECT sample_data.* • , count(1) over (partition by email) as rn • FROM sample_data • ) a • where rn > 1 • Use case: apply row‑level filters before window logic runs.

WHERE Still Matters This output shows the rows where each email appears more than once, with the rn value indicating how many times that email occurs across the dataset.

Where HAVING Still Matters If you only need to know which keys are duplicated—not the full rows—HAVING is still the right tool: SELECT email, COUNT(*) AS cnt FROM sample_data GROUP BY email HAVING cnt > 1; Use case: identify duplicated keys without returning every record.

Where HAVING Still Matters Use HAVING when you’re filtering on aggregated values. HAVING is useful when you want to return one row per group and filter based on an aggregated value, like showing only the emails that appear more than once.

WHERE vs HAVING vs QUALIFY

Reusable Templates These templates turn the examples above into simple, reusable patterns you can plug into your own queries

Conclusion QUALIFY is one of those features that instantly improves your SQL. It makes deduplication clearer, reduces boilerplate, and keeps your logic in one readable block. If you’re working in BigQuery and still using subqueries to filter window functions, switching to QUALIFY will make your code cleaner, simpler, and easier to maintain.

Credit: simon.harrison@selectdistinct.co.uk Want to unlock more resources? Visit our SQL Glossary, BigQuery Glossary or Business Analytics Blog

BigQuery: Find Duplicates Using WHERE, QUALIFY and HAVING