
SQL for Data Engineers: The Complete Guide (2026)

SQL for data engineers goes far beyond SELECT queries. This guide covers the patterns you use every day: ETL with MERGE and INSERT INTO SELECT, CTEs for pipeline logic, window functions for deduplication, incremental load watermarks, SCD Type 2, data quality checks, and cross-database syntax differences for Snowflake, BigQuery, PostgreSQL, and MySQL.

Apr 9, 2026 18 min read

What SQL Skills Do Data Engineers Need?

Data engineers and data analysts both write SQL, but the problems they solve are different. Analysts focus on answering questions: aggregations, filters, joins, and reporting queries. Data engineers focus on moving and transforming data reliably at scale: loading tables, handling duplicates, tracking changes over time, and making pipelines that can fail and retry safely.

That distinction changes which SQL skills matter most. The data engineering SQL toolkit includes:

  • ETL patterns — INSERT INTO SELECT for bulk loads, MERGE for upserts, and TRUNCATE/INSERT for full refreshes
  • CTEs — breaking complex pipeline transformations into readable, testable steps
  • Window functions — deduplication, ranking, and computing values across ordered rows without collapsing the result set
  • Incremental load patterns — processing only new or changed records to avoid reloading entire tables
  • Data quality checks — NULL assertions, referential integrity, row count validation built directly into the pipeline
  • Slowly changing dimensions — tracking historical state of dimension records (SCD Type 2)
  • Cross-database awareness — Snowflake, BigQuery, PostgreSQL, and MySQL each have syntax quirks that matter when writing portable or platform-specific SQL

This guide walks through each area with real query patterns you can adapt directly.

ETL with SQL: INSERT INTO SELECT and MERGE/UPSERT

The two core ETL write patterns are INSERT INTO SELECT (for appending rows) and MERGE (for upserts). Choosing the right one depends on whether duplicate keys can appear in the target table.

INSERT INTO SELECT

Use INSERT INTO SELECT when you are certain the source rows do not already exist in the target, or when the target table is truncated before every load.

-- Full refresh: truncate then reload
TRUNCATE TABLE warehouse.orders;

INSERT INTO warehouse.orders (order_id, customer_id, amount, status, created_at)
SELECT
    o.order_id,
    o.customer_id,
    o.amount,
    o.status,
    o.created_at
FROM staging.raw_orders o
WHERE o.created_at >= '2026-01-01';

For append-only fact tables where the source is already filtered to new rows only, INSERT INTO SELECT is simple and fast. The risk is that a pipeline failure followed by a retry can insert duplicate rows — which is why MERGE is safer for tables with natural keys.

MERGE / UPSERT

MERGE matches source rows against target rows on a key, then updates existing rows or inserts new ones in a single atomic statement. This makes pipelines idempotent: safe to rerun without creating duplicates.

-- MERGE syntax (Snowflake, BigQuery, PostgreSQL 15+)
MERGE INTO warehouse.customers AS target
USING staging.raw_customers AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND source.updated_at > target.updated_at THEN
    UPDATE SET
        email      = source.email,
        name       = source.name,
        updated_at = source.updated_at
WHEN NOT MATCHED THEN
    INSERT (customer_id, email, name, created_at, updated_at)
    VALUES (source.customer_id, source.email, source.name,
            source.created_at, source.updated_at);

PostgreSQL does not support MERGE before version 15. For older versions, use INSERT ... ON CONFLICT:

-- PostgreSQL < 15 upsert with ON CONFLICT (available since 9.5)
INSERT INTO warehouse.customers AS c (customer_id, email, name, updated_at)
SELECT customer_id, email, name, updated_at
FROM staging.raw_customers
ON CONFLICT (customer_id)
DO UPDATE SET
    email      = EXCLUDED.email,
    name       = EXCLUDED.name,
    updated_at = EXCLUDED.updated_at
WHERE EXCLUDED.updated_at > c.updated_at;  -- alias needed: the target row
                                           -- cannot be schema-qualified here

BigQuery supports MERGE with syntax nearly identical to the example above. MySQL has no MERGE statement; its upsert form is INSERT ... ON DUPLICATE KEY UPDATE.
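For MySQL, a minimal upsert sketch using ON DUPLICATE KEY UPDATE (this assumes customer_id is the table's PRIMARY KEY or a UNIQUE key; in MySQL the schema prefixes refer to databases):

```sql
-- MySQL upsert: ON DUPLICATE KEY UPDATE instead of MERGE
INSERT INTO warehouse.customers (customer_id, email, name, updated_at)
SELECT customer_id, email, name, updated_at
FROM staging.raw_customers
ON DUPLICATE KEY UPDATE
    email      = VALUES(email),      -- VALUES() is deprecated in 8.0.20+
    name       = VALUES(name),       -- in favor of a row alias, but remains
    updated_at = VALUES(updated_at); -- the most widely supported form
```

One caveat versus MERGE: there is no built-in way to express the "only update if source is newer" condition, so a conditional expression (e.g. wrapping each assignment in IF()) is needed if stale source rows are possible.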

CTEs for Data Pipeline Logic

Common Table Expressions (CTEs) let you break a complex transformation into named, sequential steps. For data pipelines this is essential: a single monolithic SELECT with five nested subqueries is hard to test, debug, and hand off to another engineer. A chain of CTEs is readable and debuggable at each step.

For a deeper reference on CTE syntax and capabilities, see the complete CTE guide.

-- Multi-step ETL transformation using chained CTEs
WITH

-- Step 1: filter raw events to the target date range
filtered_events AS (
    SELECT *
    FROM raw.events
    WHERE event_date >= CURRENT_DATE - INTERVAL '7 days'
      AND event_type IN ('purchase', 'refund')
),

-- Step 2: join to customer dimension for enrichment
enriched AS (
    SELECT
        e.event_id,
        e.event_date,
        e.event_type,
        e.amount,
        c.customer_segment,
        c.country
    FROM filtered_events e
    LEFT JOIN warehouse.customers c
        ON e.customer_id = c.customer_id
),

-- Step 3: aggregate by segment and date
aggregated AS (
    SELECT
        event_date,
        customer_segment,
        country,
        SUM(CASE WHEN event_type = 'purchase' THEN amount ELSE 0 END) AS gross_revenue,
        SUM(CASE WHEN event_type = 'refund'   THEN amount ELSE 0 END) AS refunds,
        COUNT(DISTINCT event_id) AS event_count
    FROM enriched
    GROUP BY event_date, customer_segment, country
)

-- Final write: computed metrics ready for the mart
SELECT
    event_date,
    customer_segment,
    country,
    gross_revenue,
    refunds,
    gross_revenue - refunds AS net_revenue,
    event_count
FROM aggregated
ORDER BY event_date DESC, gross_revenue DESC;

Each CTE is a checkpoint. When this query returns unexpected results you can run each step in isolation by commenting out the final SELECT and querying the intermediate CTE directly. That is not possible with deeply nested subqueries.
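For instance, to sanity-check the first step of the query above, keep the CTE definition and swap the final SELECT for a quick probe:

```sql
-- Probe an intermediate CTE: same definition, different final SELECT
WITH filtered_events AS (
    SELECT *
    FROM raw.events
    WHERE event_date >= CURRENT_DATE - INTERVAL '7 days'
      AND event_type IN ('purchase', 'refund')
)
SELECT event_type, COUNT(*) AS rows_in_step
FROM filtered_events
GROUP BY event_type;
```

If the counts here look wrong, the bug is in step 1; if they look right, move the probe down to the next CTE.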

Recursive CTEs for Hierarchies

Recursive CTEs are useful for traversing parent-child hierarchies in a single query — organization charts, product category trees, bill-of-materials structures.

-- Walk the org chart from the top-level manager downward
WITH RECURSIVE org_tree AS (
    -- Anchor: start with the root manager
    SELECT employee_id, manager_id, name, 1 AS depth
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive step: find direct reports of each row
    SELECT e.employee_id, e.manager_id, e.name, ot.depth + 1
    FROM employees e
    INNER JOIN org_tree ot ON e.manager_id = ot.employee_id
)
SELECT employee_id, name, depth
FROM org_tree
ORDER BY depth, name;

Window Functions for Data Engineering

Data engineers use window functions differently from analysts. The most common data engineering use cases are deduplication and incremental load row selection — not running totals or moving averages.

The core pattern is ROW_NUMBER() OVER (PARTITION BY key ORDER BY recency DESC). This assigns rank 1 to the most recent version of each key, letting you filter to exactly one row per entity without a self-join.

Incremental Load Patterns

Reloading an entire source table on every pipeline run is expensive and slow once tables grow past a few million rows. Incremental loads process only rows that are new or changed since the last run.

Watermark-based incremental loads

The watermark pattern stores the maximum updated_at (or created_at) timestamp from the last successful run, then uses it as a filter on the next run.

-- Step 1: read the last watermark from a pipeline metadata table
SELECT MAX(last_loaded_at)
FROM pipeline.metadata
WHERE pipeline_name = 'orders_incremental';

-- Step 2: load only rows changed since the watermark
-- (substitute :last_watermark with the value from step 1)
INSERT INTO warehouse.orders (order_id, customer_id, amount, status, updated_at)
SELECT order_id, customer_id, amount, status, updated_at
FROM staging.raw_orders
WHERE updated_at > :last_watermark;

-- Step 3: after a successful load, advance the watermark to the
-- maximum timestamp actually loaded (safer than NOW(), which can
-- skip rows that commit late with earlier updated_at values)
UPDATE pipeline.metadata
SET last_loaded_at = (SELECT MAX(updated_at) FROM warehouse.orders)
WHERE pipeline_name = 'orders_incremental';

Using MAX as a dynamic watermark inside a single query

-- Self-contained watermark without a metadata table
INSERT INTO warehouse.orders (order_id, customer_id, amount, status, updated_at)
SELECT s.order_id, s.customer_id, s.amount, s.status, s.updated_at
FROM staging.raw_orders s
WHERE s.updated_at > (
    SELECT COALESCE(MAX(updated_at), '1970-01-01')
    FROM warehouse.orders
);

The COALESCE handles the first run when the target table is empty, defaulting to the Unix epoch so all source rows are included.

Deduplication with ROW_NUMBER

Source data rarely arrives clean. APIs return duplicate events, CDC streams can emit the same row multiple times, and manual exports often include repeated records. The standard SQL deduplication pattern uses ROW_NUMBER() to rank duplicates and keep only rank 1.

-- Deduplicate orders: keep the most recent record per order_id
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY order_id
            ORDER BY updated_at DESC
        ) AS rn
    FROM staging.raw_orders
)
SELECT
    order_id,
    customer_id,
    amount,
    status,
    updated_at
FROM ranked
WHERE rn = 1;

When the "most recent" definition is ambiguous (no reliable timestamp), use a tiebreaker column such as a surrogate key or ingestion timestamp:

WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY updated_at DESC, ingestion_id DESC
        ) AS rn
    FROM staging.raw_customers
)
SELECT * FROM ranked WHERE rn = 1;

For Snowflake and BigQuery, you can also use QUALIFY to inline the filter without a wrapper CTE:

-- Snowflake / BigQuery: QUALIFY simplifies the dedup pattern
SELECT *
FROM staging.raw_orders
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY order_id
    ORDER BY updated_at DESC
) = 1;

Need to write complex ETL SQL faster? AI2SQL generates window functions, CTEs, and MERGE statements from plain English descriptions.

Data Quality Checks in SQL

Data quality gates belong inside the pipeline, not downstream. Write them as assertions that raise an error (or log a warning) when data conditions are violated. The three categories to cover for every pipeline stage are: NULL checks, referential integrity, and row count validation.

NULL checks

-- Fail if any required columns contain NULLs
SELECT
    COUNT(*) AS total_rows,
    SUM(CASE WHEN order_id   IS NULL THEN 1 ELSE 0 END) AS null_order_id,
    SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer_id,
    SUM(CASE WHEN amount      IS NULL THEN 1 ELSE 0 END) AS null_amount
FROM staging.raw_orders
WHERE order_date = CURRENT_DATE;

-- Pipeline fails if null_order_id > 0 (enforced in orchestrator)

Referential integrity

-- Find orders whose customer_id does not exist in the customer dimension
SELECT o.order_id, o.customer_id
FROM staging.raw_orders o
LEFT JOIN warehouse.customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;

-- Zero rows expected; any result triggers an alert

Row count validation

-- Compare source row count to target row count after load
WITH source_count AS (
    SELECT COUNT(*) AS cnt FROM staging.raw_orders WHERE order_date = CURRENT_DATE
),
target_count AS (
    SELECT COUNT(*) AS cnt FROM warehouse.orders WHERE order_date = CURRENT_DATE
)
SELECT
    s.cnt AS source_rows,
    t.cnt AS target_rows,
    s.cnt - t.cnt AS discrepancy
FROM source_count s
CROSS JOIN target_count t;

These checks are most effective when embedded as steps in your orchestration tool (Airflow, dbt tests, Dagster asset checks) so a failure halts downstream tasks before bad data propagates.
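Where orchestrator wiring is not available, some warehouses let the check fail itself. A sketch using BigQuery's ERROR() function, rewriting the referential integrity check above as a hard assertion (platform-specific; PostgreSQL would need a DO block or a function for the same effect):

```sql
-- BigQuery: ERROR() aborts the query, so the pipeline task fails directly
SELECT
    IF(COUNT(*) = 0,
       'referential integrity ok',
       ERROR(FORMAT('%d orders reference missing customers', COUNT(*))))
FROM staging.raw_orders o
LEFT JOIN warehouse.customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
```

The IF() short-circuits, so ERROR() is only evaluated when orphaned rows exist.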

Partitioning and Performance

Large tables in data warehouses are almost always partitioned. Writing queries that take advantage of partition pruning is one of the highest-leverage performance skills for a data engineer.

Partition pruning

Partition pruning works when the WHERE clause filters on the partition column directly, without wrapping it in a function.

-- Good: partition pruning works, scans only the 2026-04-09 partition
SELECT order_id, amount
FROM warehouse.orders
WHERE order_date = '2026-04-09';

-- Bad: wrapping in a function disables pruning, full table scan
SELECT order_id, amount
FROM warehouse.orders
WHERE DATE_TRUNC('month', order_date) = '2026-04-01';

In BigQuery, partition columns must appear in the WHERE clause as literals or query parameters. Using a subquery or function on the partition column forces a full scan and a billing surprise.

Clustering

Snowflake and BigQuery support clustering keys that physically sort data within a partition by additional columns. Queries that filter on cluster keys skip micro-partitions (Snowflake) or storage blocks (BigQuery) entirely.

-- BigQuery: create a partitioned and clustered table
CREATE TABLE warehouse.events
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id, event_type
AS
SELECT * FROM staging.raw_events;

-- A query filtering on both partition and cluster keys is maximally
-- efficient (BigQuery special-cases DATE() on the partition column,
-- so pruning still applies despite the function wrapper)
SELECT event_id, user_id, event_type, properties
FROM warehouse.events
WHERE DATE(event_timestamp) = '2026-04-09'
  AND user_id = 'u_12345'
  AND event_type = 'purchase';
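The Snowflake equivalent declares a clustering key on the table; a minimal sketch, with the column choices illustrative rather than prescriptive:

```sql
-- Snowflake: declare a clustering key so micro-partitions are
-- co-located by these expressions (reclustering runs automatically
-- in the background and consumes credits)
ALTER TABLE warehouse.events
    CLUSTER BY (TO_DATE(event_timestamp), user_id);
```

Clustering pays off on large, frequently filtered tables; on small tables the reclustering cost can outweigh the scan savings.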

Working Across Databases

Data engineers frequently move between platforms. The SQL is mostly portable, but a handful of differences cause real friction when switching between MySQL, PostgreSQL, Snowflake, and BigQuery.

  • UPSERT syntax — MySQL: INSERT ... ON DUPLICATE KEY UPDATE; PostgreSQL: INSERT ... ON CONFLICT DO UPDATE (MERGE in 15+); Snowflake: MERGE; BigQuery: MERGE
  • QUALIFY clause — MySQL: not supported; PostgreSQL: not supported; Snowflake: supported; BigQuery: supported
  • Date truncation — MySQL: DATE_FORMAT(col, '%Y-%m-01'); PostgreSQL: DATE_TRUNC('month', col); Snowflake: DATE_TRUNC('month', col); BigQuery: DATE_TRUNC(col, MONTH)
  • String aggregation — MySQL: GROUP_CONCAT(); PostgreSQL: STRING_AGG(); Snowflake: LISTAGG(); BigQuery: STRING_AGG()
  • JSON access — MySQL: JSON_EXTRACT(col, '$.key'); PostgreSQL: col->>'key'; Snowflake: col:key or PARSE_JSON(); BigQuery: JSON_VALUE(col, '$.key')
  • Recursive CTEs — MySQL: supported (8.0+); PostgreSQL, Snowflake, and BigQuery: supported

The BigQuery argument order for DATE_TRUNC is the reverse of PostgreSQL and Snowflake — a common source of syntax errors when porting queries.

SQL Patterns Every Data Engineer Should Know

Beyond the fundamentals, these are the patterns that separate engineers who can maintain a production warehouse from those who struggle with it.

SCD Type 2: Slowly Changing Dimensions

Slowly Changing Dimension Type 2 preserves the full history of a dimension record. Instead of overwriting the row when an attribute changes (Type 1), SCD Type 2 inserts a new row and marks the old one as expired.

The target table schema includes effective_from, effective_to, and is_current columns:

CREATE TABLE warehouse.dim_customers (
    customer_sk   BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate key
    customer_id   VARCHAR(50),         -- natural key from source
    email         VARCHAR(255),
    customer_tier VARCHAR(20),
    country       VARCHAR(50),
    effective_from DATE NOT NULL,
    effective_to   DATE,               -- NULL means current row
    is_current     BOOLEAN NOT NULL DEFAULT TRUE
);

When a customer's tier changes, two SQL statements run: one to close the old row and one to insert the new one.

-- Step 1: expire the current row for changed customers
UPDATE warehouse.dim_customers
SET
    effective_to = CURRENT_DATE - INTERVAL '1 day',
    is_current   = FALSE
WHERE customer_id IN (
    -- source rows where a tracked attribute has changed
    SELECT s.customer_id
    FROM staging.raw_customers s
    INNER JOIN warehouse.dim_customers d
        ON s.customer_id = d.customer_id
        AND d.is_current = TRUE
    WHERE s.customer_tier IS DISTINCT FROM d.customer_tier  -- NULL-safe:
       OR s.country       IS DISTINCT FROM d.country        -- plain != misses NULL changes
);

-- Step 2: insert new rows for changed and new customers
INSERT INTO warehouse.dim_customers
    (customer_id, email, customer_tier, country, effective_from, effective_to, is_current)
SELECT
    s.customer_id,
    s.email,
    s.customer_tier,
    s.country,
    CURRENT_DATE AS effective_from,
    NULL         AS effective_to,
    TRUE         AS is_current
FROM staging.raw_customers s
LEFT JOIN warehouse.dim_customers d
    ON s.customer_id = d.customer_id
    AND d.is_current = TRUE
WHERE d.customer_id IS NULL                                 -- new customers
   OR s.customer_tier IS DISTINCT FROM d.customer_tier      -- changed tier (NULL-safe)
   OR s.country       IS DISTINCT FROM d.country;           -- changed country

To query a point-in-time view of the dimension (as of a specific date), filter on effective_from and effective_to:

-- Customers as they were on 2026-01-15
SELECT customer_id, email, customer_tier, country
FROM warehouse.dim_customers
WHERE effective_from <= '2026-01-15'
  AND (effective_to > '2026-01-15' OR effective_to IS NULL);

Pivot / Crosstab

Pivoting rows into columns comes up often when preparing data for reporting tables. PostgreSQL and BigQuery use conditional aggregation; Snowflake has a native PIVOT clause.

-- Conditional aggregation pivot (works in all databases)
SELECT
    order_date,
    SUM(CASE WHEN status = 'completed' THEN amount ELSE 0 END) AS completed,
    SUM(CASE WHEN status = 'refunded'  THEN amount ELSE 0 END) AS refunded,
    SUM(CASE WHEN status = 'pending'   THEN amount ELSE 0 END) AS pending
FROM warehouse.orders
GROUP BY order_date
ORDER BY order_date;
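A sketch of the same pivot using Snowflake's native PIVOT clause (the status values must be enumerated in the IN list, so this only works when the set of output columns is known up front):

```sql
-- Snowflake native PIVOT: one column per status value
SELECT *
FROM (
    SELECT order_date, status, amount
    FROM warehouse.orders
)
PIVOT (SUM(amount) FOR status IN ('completed', 'refunded', 'pending'))
    AS p (order_date, completed, refunded, pending);
```

When the value list is dynamic, the portable conditional-aggregation form above is usually the better choice.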

Gap detection in time series

Finding missing dates or broken sequences in a time series is a classic data engineering QA task. Use LAG or a date spine to detect gaps.

For a full reference on LAG, LEAD, and offset functions, see the SQL window functions guide. For JOIN patterns used in spine-based gap detection, see SQL joins explained.

-- Find gaps in daily event data using LAG
WITH daily_counts AS (
    SELECT
        event_date,
        COUNT(*) AS event_count,
        LAG(event_date) OVER (ORDER BY event_date) AS prev_date
    FROM warehouse.events
    GROUP BY event_date
)
SELECT
    prev_date  AS gap_after,
    event_date AS gap_before,
    event_date - prev_date - 1 AS missing_days
FROM daily_counts
WHERE event_date - prev_date > 1
ORDER BY event_date;
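The date-spine alternative generates every expected calendar date and anti-joins it against the data. A PostgreSQL sketch using generate_series (the start date is illustrative):

```sql
-- Date spine (PostgreSQL): list calendar dates with no events at all
WITH spine AS (
    SELECT generate_series(
        DATE '2026-01-01', CURRENT_DATE, INTERVAL '1 day'
    )::date AS expected_date
)
SELECT s.expected_date AS missing_date
FROM spine s
LEFT JOIN warehouse.events e ON e.event_date = s.expected_date
WHERE e.event_date IS NULL
ORDER BY s.expected_date;
```

Unlike the LAG version, the spine also catches gaps at the very start and end of the range, since it does not depend on neighboring rows existing.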

Frequently Asked Questions

What SQL skills does a data engineer need?

The core data engineering SQL skills are: writing CTEs for multi-step pipeline logic, window functions for deduplication and row selection, MERGE/UPSERT for idempotent loads, incremental load patterns with watermark queries, SCD Type 2 for dimension history, and data quality assertions. Cross-database syntax differences between Snowflake, BigQuery, PostgreSQL, and MySQL are also important to understand.

What is an incremental load pattern in SQL?

An incremental load processes only rows that are new or changed since the last pipeline run, instead of reloading the full source table. The most common approach stores a watermark (the maximum updated_at from the last run) and filters the next query with WHERE updated_at > last_watermark. This dramatically reduces cost and runtime for large tables.

How do you deduplicate data in SQL?

Use ROW_NUMBER() OVER (PARTITION BY natural_key ORDER BY updated_at DESC) to rank duplicate rows, then filter to WHERE rn = 1. This keeps the most recent version of each record and discards all others. On Snowflake and BigQuery, the QUALIFY clause lets you do this inline without a wrapper CTE.

What is the difference between INSERT INTO SELECT and MERGE?

INSERT INTO SELECT appends rows and is best for truncate-reload patterns or append-only tables where duplicates cannot occur. MERGE handles both inserts and updates atomically: it matches source to target on a key, updates matched rows, and inserts unmatched ones. MERGE makes pipelines idempotent — safe to rerun without creating duplicates — which is why it is preferred for tables with natural keys.

What is SCD Type 2 and how is it implemented in SQL?

SCD Type 2 preserves full history for slowly changing dimension records. Each change produces a new row instead of overwriting the existing one. Rows carry effective_from, effective_to (NULL for the current row), and is_current columns. When a dimension attribute changes, the current row is closed by setting effective_to and is_current = FALSE, and a new row is inserted with the updated values and is_current = TRUE.

Generate Complex ETL Queries Instantly

Describe your data transformation in plain English. AI2SQL writes the SQL — with CTEs, window functions, and JOINs — for MySQL, PostgreSQL, Snowflake, BigQuery, and more.

Try AI2SQL Free

No credit card required