
SQL for Data Science: Essential Queries Every Data Scientist Needs (2026)

15 SQL patterns that cover 90% of data science work: aggregation, window functions, cohort analysis, funnel queries, feature engineering, and more. With copy-paste examples.

Mar 24, 2026 14 min read

SQL Is the Most Underrated Skill in Data Science

Every data science bootcamp teaches Python. Most of them cover pandas, scikit-learn, and TensorFlow in the first month. But the skill that separates productive data scientists from everyone else gets surprisingly little attention: SQL.

Here is the reality. Your data lives in a database. Not a CSV file, not a Jupyter notebook, not a pandas DataFrame. A database. And the fastest, most efficient way to access, filter, aggregate, and prepare that data is SQL. Before you can train a model, build a dashboard, or run an A/B test analysis, you need to pull the right data in the right shape. That is SQL's job.

According to Stack Overflow's 2025 developer survey, SQL remains the third most used programming language overall and the single most requested skill in data science job postings. Yet many data scientists treat it as an afterthought, pulling raw tables with SELECT * and doing all transformations in pandas. This works for small datasets. It falls apart at scale.

This guide covers the 15 SQL patterns that handle 90% of real data science work. Each one includes a copy-paste example you can adapt to your own tables.

Why SQL Matters for Data Scientists

SQL is not just a data retrieval tool. For data scientists, it serves three critical functions.

1. Data Access and Extraction

Production data sits in PostgreSQL, MySQL, BigQuery, Snowflake, or Redshift. You cannot run experiments without querying these systems. Writing efficient SQL means you get the data you need in seconds, not minutes. It also means you pull only what you need instead of dumping entire tables into memory.

2. Exploratory Data Analysis

Before building a model, you need to understand the data. What is the distribution of order values? How many users signed up last month? What percentage of sessions result in a purchase? All of these questions are faster to answer in SQL than in pandas, especially when the underlying table has millions of rows. A well-written GROUP BY query returns results in under a second. Loading the same table into a DataFrame, then running df.groupby(), takes orders of magnitude longer.
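To make that concrete, here is a miniature sketch of the idea using Python's built-in sqlite3 driver. The orders table and total_amount column are invented for the example; the point is that one GROUP BY answers a distribution question directly in the database.

```python
import sqlite3

# Illustrative schema: orders(id, total_amount) is assumed, not real.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total_amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 20.0), (2, 35.0), (3, 35.0), (4, 120.0)])

# One GROUP BY answers "what is the distribution of order values?"
# without pulling the raw rows into Python first.
rows = conn.execute("""
    SELECT total_amount, COUNT(*) AS n
    FROM orders
    GROUP BY total_amount
    ORDER BY total_amount
""").fetchall()
print(rows)  # [(20.0, 1), (35.0, 2), (120.0, 1)]
```

On a real warehouse the same query shape runs server-side over millions of rows and returns only the summary.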

3. Feature Engineering

Many machine learning features are aggregations over historical data. Average purchase value per customer. Days since last login. Number of support tickets in the past 30 days. Computing these features in SQL and joining them to your training set is cleaner and faster than writing equivalent pandas code. It also makes your feature pipeline reproducible: a SQL query is easy to schedule, version, and audit.
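A minimal sketch of that idea, again with sqlite3 and an assumed orders(customer_id, total_amount, order_date) schema: the per-customer aggregates come back as a small feature table, ready to join onto a training set.

```python
import sqlite3

# Assumed, illustrative schema: orders(customer_id, total_amount, order_date).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer_id INTEGER, total_amount REAL, order_date TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 50.0, "2026-01-10"),
    (1, 70.0, "2026-02-01"),
    (2, 30.0, "2026-01-20"),
])

# Each row is one customer's feature vector: average purchase value,
# last order date, and order count, all computed in SQL.
features = conn.execute("""
    SELECT
        customer_id,
        AVG(total_amount) AS avg_purchase,
        MAX(order_date)   AS last_order,
        COUNT(*)          AS order_count
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id
""").fetchall()
print(features)
# [(1, 60.0, '2026-02-01', 2), (2, 30.0, '2026-01-20', 1)]
```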

15 Essential SQL Patterns for Data Science

1. Aggregation with GROUP BY

The foundation of any analytical query. Group rows by a dimension and compute summary statistics.

-- Revenue and order count by product category
SELECT
    p.category,
    COUNT(DISTINCT o.id) AS order_count,
    SUM(oi.quantity * oi.unit_price) AS total_revenue,
    AVG(oi.quantity * oi.unit_price) AS avg_order_value
FROM order_items oi
JOIN orders o ON oi.order_id = o.id
JOIN products p ON oi.product_id = p.id
WHERE o.order_date >= '2026-01-01'
GROUP BY p.category
ORDER BY total_revenue DESC;

2. Window Functions for Ranking

Window functions let you compute values across a set of rows without collapsing them. This is essential for ranking, running totals, and comparing each row to its group.

-- Rank customers by total spend within each region
SELECT
    customer_id,
    region,
    total_spend,
    RANK() OVER (PARTITION BY region ORDER BY total_spend DESC) AS region_rank,
    1.0 * total_spend / SUM(total_spend) OVER (PARTITION BY region) AS pct_of_region
FROM customer_summary;

3. CTEs for Readable Multi-Step Queries

Common Table Expressions turn complex queries into a sequence of named steps. Use them whenever a query has more than two levels of nesting.

-- Find high-value customers who have not ordered in 90 days
WITH customer_stats AS (
    SELECT
        customer_id,
        SUM(total_amount) AS lifetime_value,
        MAX(order_date) AS last_order_date,
        COUNT(*) AS order_count
    FROM orders
    GROUP BY customer_id
),
at_risk AS (
    SELECT *
    FROM customer_stats
    WHERE lifetime_value > 500
      AND last_order_date < CURRENT_DATE - INTERVAL '90 days'
)
SELECT
    c.name,
    c.email,
    ar.lifetime_value,
    ar.last_order_date,
    ar.order_count
FROM at_risk ar
JOIN customers c ON ar.customer_id = c.id
ORDER BY ar.lifetime_value DESC;

4. Random Sampling

Pull a random sample for exploratory analysis or model training. Different databases have different syntax, but the pattern is the same. Note that ORDER BY RANDOM() sorts the entire table, so on very large tables prefer TABLESAMPLE where available.

-- PostgreSQL: random 5% sample
SELECT *
FROM events
TABLESAMPLE BERNOULLI(5);

-- General approach (PostgreSQL/SQLite; use RAND() in MySQL): random 1000 rows
SELECT *
FROM users
ORDER BY RANDOM()
LIMIT 1000;

-- Stratified sample: 100 users per plan type
SELECT *
FROM (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY plan_type ORDER BY RANDOM()) AS rn
    FROM users
) sub
WHERE rn <= 100;

5. Pivoting Data

Transform rows into columns. Useful for creating feature matrices or summary reports.

-- Monthly revenue pivot by category
SELECT
    DATE_TRUNC('month', o.order_date) AS month,
    SUM(CASE WHEN p.category = 'electronics' THEN oi.quantity * oi.unit_price ELSE 0 END) AS electronics,
    SUM(CASE WHEN p.category = 'clothing' THEN oi.quantity * oi.unit_price ELSE 0 END) AS clothing,
    SUM(CASE WHEN p.category = 'books' THEN oi.quantity * oi.unit_price ELSE 0 END) AS books,
    SUM(oi.quantity * oi.unit_price) AS total
FROM order_items oi
JOIN orders o ON oi.order_id = o.id
JOIN products p ON oi.product_id = p.id
GROUP BY DATE_TRUNC('month', o.order_date)
ORDER BY month;

6. Date Manipulation

Data science revolves around time. These patterns let you group, filter, and calculate time-based metrics.

-- Daily active users for the past 30 days
SELECT
    DATE_TRUNC('day', event_time) AS day,
    COUNT(DISTINCT user_id) AS dau
FROM events
WHERE event_time >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY DATE_TRUNC('day', event_time)
ORDER BY day;

-- Day of week analysis (in PostgreSQL, DOW runs 0 = Sunday to 6 = Saturday)
SELECT
    EXTRACT(DOW FROM order_date) AS day_of_week,
    COUNT(*) AS orders,
    AVG(total_amount) AS avg_amount
FROM orders
GROUP BY EXTRACT(DOW FROM order_date)
ORDER BY day_of_week;

7. Cohort Analysis

Cohort analysis groups users by their signup month and tracks behavior over time. This is one of the most valuable queries for product analytics.

-- Monthly cohort retention
WITH user_cohort AS (
    SELECT
        user_id,
        DATE_TRUNC('month', created_at) AS cohort_month
    FROM users
),
user_activity AS (
    SELECT
        user_id,
        DATE_TRUNC('month', event_time) AS activity_month
    FROM events
    GROUP BY user_id, DATE_TRUNC('month', event_time)
)
SELECT
    uc.cohort_month,
    COUNT(DISTINCT uc.user_id) AS cohort_size,
    COUNT(DISTINCT CASE
        WHEN ua.activity_month = uc.cohort_month + INTERVAL '1 month'
        THEN uc.user_id END) AS retained_month_1,
    COUNT(DISTINCT CASE
        WHEN ua.activity_month = uc.cohort_month + INTERVAL '2 months'
        THEN uc.user_id END) AS retained_month_2,
    COUNT(DISTINCT CASE
        WHEN ua.activity_month = uc.cohort_month + INTERVAL '3 months'
        THEN uc.user_id END) AS retained_month_3
FROM user_cohort uc
LEFT JOIN user_activity ua ON uc.user_id = ua.user_id
GROUP BY uc.cohort_month
ORDER BY uc.cohort_month;

8. Retention Queries

Measure what percentage of users return after their first visit. Critical for subscription businesses and product growth analysis.

-- Week-over-week retention rate
WITH first_visit AS (
    SELECT
        user_id,
        MIN(DATE_TRUNC('week', event_time)) AS first_week
    FROM events
    GROUP BY user_id
),
weekly_activity AS (
    SELECT DISTINCT
        user_id,
        DATE_TRUNC('week', event_time) AS active_week
    FROM events
)
SELECT
    fv.first_week,
    COUNT(DISTINCT fv.user_id) AS new_users,
    ROUND(100.0 * COUNT(DISTINCT wa1.user_id) / COUNT(DISTINCT fv.user_id), 1) AS week_1_pct,
    ROUND(100.0 * COUNT(DISTINCT wa2.user_id) / COUNT(DISTINCT fv.user_id), 1) AS week_2_pct,
    ROUND(100.0 * COUNT(DISTINCT wa4.user_id) / COUNT(DISTINCT fv.user_id), 1) AS week_4_pct
FROM first_visit fv
LEFT JOIN weekly_activity wa1 ON fv.user_id = wa1.user_id AND wa1.active_week = fv.first_week + INTERVAL '1 week'
LEFT JOIN weekly_activity wa2 ON fv.user_id = wa2.user_id AND wa2.active_week = fv.first_week + INTERVAL '2 weeks'
LEFT JOIN weekly_activity wa4 ON fv.user_id = wa4.user_id AND wa4.active_week = fv.first_week + INTERVAL '4 weeks'
GROUP BY fv.first_week
ORDER BY fv.first_week;

9. Funnel Queries

Track conversion through a sequence of steps. Every product team needs this.

-- Signup to purchase funnel
WITH funnel AS (
    SELECT
        user_id,
        MAX(CASE WHEN event_name = 'page_view' THEN 1 ELSE 0 END) AS visited,
        MAX(CASE WHEN event_name = 'signup' THEN 1 ELSE 0 END) AS signed_up,
        MAX(CASE WHEN event_name = 'add_to_cart' THEN 1 ELSE 0 END) AS added_to_cart,
        MAX(CASE WHEN event_name = 'purchase' THEN 1 ELSE 0 END) AS purchased
    FROM events
    WHERE event_time >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY user_id
)
SELECT
    COUNT(*) AS total_visitors,
    SUM(signed_up) AS signups,
    SUM(added_to_cart) AS add_to_carts,
    SUM(purchased) AS purchases,
    ROUND(100.0 * SUM(signed_up) / COUNT(*), 1) AS signup_rate,
    ROUND(100.0 * SUM(purchased) / NULLIF(SUM(signed_up), 0), 1) AS purchase_rate
FROM funnel
WHERE visited = 1;

10. Percentile Calculations

Percentiles are more informative than averages for skewed distributions, which describes most real-world data.

-- Order value distribution
SELECT
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY total_amount) AS p25,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY total_amount) AS median,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY total_amount) AS p75,
    PERCENTILE_CONT(0.90) WITHIN GROUP (ORDER BY total_amount) AS p90,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY total_amount) AS p99,
    AVG(total_amount) AS mean
FROM orders
WHERE order_date >= '2026-01-01';

-- Percentile by category
SELECT
    category,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY price) AS median_price,
    AVG(price) AS mean_price
FROM products
GROUP BY category;

11. Moving Averages

Smooth out daily noise to see the underlying trend. Essential for time series analysis. One caveat: ROWS BETWEEN counts rows, not calendar days, so gaps in the data (days with no orders) will silently widen the window.

-- 7-day and 30-day moving averages of daily revenue
WITH daily_revenue AS (
    SELECT
        DATE_TRUNC('day', order_date) AS day,
        SUM(total_amount) AS revenue
    FROM orders
    GROUP BY DATE_TRUNC('day', order_date)
)
SELECT
    day,
    revenue,
    AVG(revenue) OVER (
        ORDER BY day
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7d,
    AVG(revenue) OVER (
        ORDER BY day
        ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
    ) AS moving_avg_30d
FROM daily_revenue
ORDER BY day;

12. Year-over-Year Comparison

Compare metrics across the same period in different years. This controls for seasonality in your analysis.

-- Monthly revenue: this year vs last year
WITH monthly AS (
    SELECT
        EXTRACT(YEAR FROM order_date) AS year,
        EXTRACT(MONTH FROM order_date) AS month,
        SUM(total_amount) AS revenue
    FROM orders
    WHERE order_date >= '2025-01-01'
    GROUP BY EXTRACT(YEAR FROM order_date), EXTRACT(MONTH FROM order_date)
)
SELECT
    curr.month,
    curr.revenue AS revenue_2026,
    prev.revenue AS revenue_2025,
    ROUND(100.0 * (curr.revenue - prev.revenue) / NULLIF(prev.revenue, 0), 1) AS yoy_growth_pct
FROM monthly curr
JOIN monthly prev ON curr.month = prev.month
    AND curr.year = 2026 AND prev.year = 2025
ORDER BY curr.month;

13. NULL Handling

NULLs propagate silently through calculations and break aggregations. Handle them explicitly.

-- COALESCE: replace NULL with a default
SELECT
    user_id,
    COALESCE(phone, 'no phone') AS phone,
    COALESCE(last_login, created_at) AS last_active
FROM users;

-- Count NULLs per column (data quality check)
SELECT
    COUNT(*) AS total_rows,
    COUNT(*) - COUNT(email) AS null_emails,
    COUNT(*) - COUNT(phone) AS null_phones,
    COUNT(*) - COUNT(last_login) AS null_logins,
    ROUND(100.0 * (COUNT(*) - COUNT(email)) / COUNT(*), 1) AS pct_null_email
FROM users;

-- NULLIF: avoid division by zero
SELECT
    campaign_name,
    clicks,
    impressions,
    ROUND(100.0 * clicks / NULLIF(impressions, 0), 2) AS ctr
FROM ad_campaigns;

14. CASE WHEN for Feature Engineering

CASE WHEN is how you create categorical features, bin continuous variables, and build flag columns directly in SQL.

-- Create features for a churn prediction model
SELECT
    user_id,
    CASE
        WHEN days_since_last_login <= 7 THEN 'active'
        WHEN days_since_last_login <= 30 THEN 'at_risk'
        WHEN days_since_last_login <= 90 THEN 'dormant'
        ELSE 'churned'
    END AS activity_segment,
    CASE
        WHEN lifetime_value >= 1000 THEN 'high'
        WHEN lifetime_value >= 200 THEN 'medium'
        ELSE 'low'
    END AS value_tier,
    CASE WHEN has_support_ticket THEN 1 ELSE 0 END AS support_flag,
    CASE WHEN referral_source IS NOT NULL THEN 1 ELSE 0 END AS was_referred,
    CASE
        WHEN order_count = 1 THEN 'single_purchase'
        WHEN order_count BETWEEN 2 AND 5 THEN 'repeat'
        ELSE 'loyal'
    END AS purchase_behavior
FROM user_features;

15. Self-Joins for Sequential Events

Self-joins let you compare rows within the same table. Use them to find sequences, calculate time between events, or pair records. On databases with window function support, LAG and LEAD can often express the same logic more concisely, but the self-join pattern works everywhere.

-- Time between first and second purchase per customer
WITH ordered_purchases AS (
    SELECT
        customer_id,
        order_date,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS purchase_num
    FROM orders
)
SELECT
    p1.customer_id,
    p1.order_date AS first_purchase,
    p2.order_date AS second_purchase,
    p2.order_date - p1.order_date AS days_between
FROM ordered_purchases p1
JOIN ordered_purchases p2
    ON p1.customer_id = p2.customer_id
    AND p1.purchase_num = 1
    AND p2.purchase_num = 2;

-- Average time to second purchase
WITH ordered_purchases AS (
    SELECT
        customer_id,
        order_date,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS purchase_num
    FROM orders
)
SELECT
    AVG(p2.order_date - p1.order_date) AS avg_days_to_repeat
FROM ordered_purchases p1
JOIN ordered_purchases p2
    ON p1.customer_id = p2.customer_id
    AND p1.purchase_num = 1
    AND p2.purchase_num = 2;

SQL vs Pandas: When to Use Which

The SQL-versus-pandas debate misses the point. They are complementary tools. The question is not which one to learn. It is which one to reach for at each stage of your workflow.

| Operation           | SQL                       | Pandas                     | Winner                  |
|---------------------|---------------------------|----------------------------|-------------------------|
| Filter 100M rows    | WHERE status = 'active'   | df[df.status == 'active']  | SQL (runs on DB server) |
| Group and aggregate | GROUP BY category         | df.groupby('category')     | SQL (for large data)    |
| Join tables         | JOIN orders ON ...        | pd.merge(df1, df2)         | SQL (optimized joins)   |
| Pivot / reshape     | CASE WHEN + GROUP BY      | df.pivot_table()           | Pandas (simpler syntax) |
| Statistical tests   | Limited                   | scipy.stats                | Pandas/Python           |
| Visualization       | Not possible              | matplotlib, seaborn        | Pandas/Python           |
| Feature engineering | CASE WHEN, window funcs   | df.apply(), np.where()     | Depends on scale        |

The practical workflow: use SQL to pull, filter, join, and aggregate your data down to a manageable size. Then load the result into pandas for statistical analysis, visualization, and modeling. This gives you the performance of the database engine for heavy lifting and the flexibility of Python for the final analysis.
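Here is that workflow in miniature. Everything in the sketch is illustrative: the orders table is invented, and sqlite3 plus the stdlib statistics module stand in for your warehouse and analysis stack. The database collapses row-level orders into a per-day summary; only the small result crosses into Python.

```python
import sqlite3
from statistics import mean

# Assumed, illustrative schema: orders(order_date, total_amount).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, total_amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [
    ("2026-03-01", 40.0), ("2026-03-01", 60.0),
    ("2026-03-02", 90.0), ("2026-03-03", 20.0),
])

# Step 1: the database does the heavy lifting (filter + aggregate).
daily = conn.execute("""
    SELECT order_date, SUM(total_amount) AS revenue
    FROM orders
    GROUP BY order_date
    ORDER BY order_date
""").fetchall()

# Step 2: the small summary moves to Python for the final analysis.
revenues = [revenue for _, revenue in daily]
avg_daily = mean(revenues)
print(avg_daily)  # 70.0
```

In practice step 2 is where pandas, scipy, or your modeling code takes over; the shape of the handoff is the same.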

How AI Accelerates SQL for Data Science

The patterns in this guide are powerful, but some of them have verbose syntax. A cohort analysis query can be 30 lines long. A funnel query with five steps takes careful CASE WHEN construction. And window function syntax trips up even experienced analysts.

AI-powered SQL tools change the workflow. Instead of writing a 30-line cohort query from scratch, you describe what you need:

  • "Show me monthly retention rates for users who signed up in Q1 2026"
  • "Calculate 7-day moving average revenue by product category"
  • "Build a funnel from page_view to signup to purchase for the last 30 days"

The AI generates the complete SQL, adapted to your specific database schema. You review, adjust if needed, and run it. This cuts query development time from minutes to seconds, especially for complex analytical patterns.

Try it yourself: describe your analysis in plain English and get production-ready SQL. AI2SQL supports PostgreSQL, MySQL, SQL Server, BigQuery, and more.

This does not replace understanding SQL. You still need to know what a window function does to verify the output. You still need to understand joins to catch errors. But it removes the friction of remembering exact syntax, letting you focus on the analysis instead of the language.

Frequently Asked Questions

Do data scientists need to know SQL?

Yes. SQL is consistently ranked as the most in-demand technical skill for data science roles. While Python and R handle modeling and visualization, SQL is how you access, filter, and prepare data at the source. Most production data lives in relational databases, and writing SQL directly is far more efficient than extracting everything into a DataFrame first.

Is SQL enough for data science?

SQL alone is not enough, but it is a critical foundation. You also need Python or R for statistical modeling, machine learning, and visualization. However, SQL handles data extraction, cleaning, aggregation, and feature engineering more efficiently than pandas for large datasets. Many data scientists do 60-80% of their data preparation work in SQL.

What SQL topics should data scientists focus on?

Data scientists should prioritize window functions (ROW_NUMBER, RANK, LAG, LEAD), CTEs for readable multi-step queries, aggregation with GROUP BY and HAVING, date manipulation for time series analysis, CASE WHEN for feature engineering, percentile and statistical functions, and self-joins for sequential event analysis. These patterns cover most analytical SQL work.

Should I use SQL or pandas for data analysis?

Use SQL for initial data extraction, filtering large datasets, joins across tables, and aggregations that reduce data volume. Use pandas for exploratory analysis on smaller datasets, complex transformations, statistical operations, and visualization. The best workflow combines both: SQL to pull and pre-aggregate data, pandas for final analysis and modeling.

Can AI tools help data scientists write SQL faster?

Yes. AI-powered SQL generators like AI2SQL let data scientists describe their analysis in plain English and get working queries instantly. This is especially valuable for complex patterns like cohort analysis, funnel queries, and window functions that have verbose syntax. AI tools reduce the time spent on query syntax so you can focus on the analysis itself.

Write Data Science SQL in Seconds

Describe your analysis in plain English. AI2SQL generates the window functions, CTEs, and cohort queries for you.

Try AI2SQL Free

No credit card required