SQL Performance Tips — Write Faster Queries

Your query takes 2 minutes to run. Your teammate's identical logic finishes in 3 seconds. The difference? These 15 performance patterns.


Indexes — The #1 Performance Lever

An index is a data structure that allows the database to find rows without scanning the entire table. Think of it like a book's index: instead of reading every page to find "window functions," you look it up in the index and jump to page 143.

How Indexes Speed Up Queries

Without an index on customer_id:

SELECT * FROM orders WHERE customer_id = 12345;
-- Database scans ALL 10 million rows to find matches (slow)

With an index on customer_id:

SELECT * FROM orders WHERE customer_id = 12345;
-- Same query, but now the database uses the index
-- and finds matching rows in milliseconds instead of seconds
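You can watch this switch happen locally with Python's built-in sqlite3 module. This is a sketch, not the exact PostgreSQL behavior — SQLite's plan text differs — but the table and column names mirror the examples above, and the before/after contrast is the same: a full scan becomes an index search.

```python
# Illustrative sketch using Python's built-in sqlite3.
# SQLite's plan wording differs from PostgreSQL's, but the idea is identical:
# no index -> full table SCAN; index -> SEARCH using the index.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i, i % 1000) for i in range(10_000)],
)

query = "SELECT * FROM orders WHERE customer_id = 123"

# Without an index, SQLite reports a full table scan
before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(before[0][-1])  # detail mentions 'SCAN'

conn.execute("CREATE INDEX idx_customer_id ON orders(customer_id)")

# With the index, the plan becomes an index search
after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(after[0][-1])   # detail mentions 'USING INDEX idx_customer_id'
```

The `EXPLAIN QUERY PLAN` rows end with a human-readable detail string, which is what changes once the index exists.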

When to Use Indexes

Add indexes on columns that appear in:

  • WHERE clauses (most common)
  • JOIN conditions
  • ORDER BY clauses
  • GROUP BY clauses

Creating Indexes

-- Single-column index
CREATE INDEX idx_customer_id ON orders(customer_id);

-- Multi-column index (composite)
CREATE INDEX idx_customer_date ON orders(customer_id, order_date);

-- Unique index (enforces uniqueness)
CREATE UNIQUE INDEX idx_email ON customers(email);
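A unique index doesn't just speed up lookups — it actively rejects duplicate writes. A minimal sqlite3 sketch (the `customers`/`email` names follow the example above; the specific exception class is SQLite's):

```python
# Sketch: a UNIQUE index enforces uniqueness at write time.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
conn.execute("CREATE UNIQUE INDEX idx_email ON customers(email)")

conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")
try:
    # Second row with the same email violates the unique index
    conn.execute("INSERT INTO customers VALUES (2, 'a@example.com')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)  # True
```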

Index Best Practices

Do:

  • Index foreign keys used in JOINs
  • Index columns frequently used in WHERE
  • Use multi-column indexes for common filter combinations

Don't:

  • Over-index — each index slows down INSERT/UPDATE/DELETE
  • Index small tables (under 1000 rows) — scans are faster
  • Index low-cardinality columns (e.g., status flags) or columns that change very frequently — the write overhead outweighs the benefit

Warning: Indexes speed up reads but slow down writes. Each INSERT, UPDATE, or DELETE must update all relevant indexes. Balance read performance with write overhead.

Index Pitfalls

Functions on indexed columns kill index usage:

-- BAD: function on indexed column (index not used)
SELECT * FROM orders WHERE YEAR(order_date) = 2026;

-- GOOD: rewrite as a range predicate to preserve index usage
SELECT * FROM orders
WHERE order_date >= '2026-01-01' AND order_date < '2027-01-01';
-- The half-open range also catches timestamps late on Dec 31
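The same pitfall is easy to reproduce locally with sqlite3 (a sketch — SQLite has no YEAR(), so the function-wrapped version uses strftime, but the effect is the same: wrapping the indexed column in a function forces a scan, while a plain range uses the index):

```python
# Sketch: function on an indexed column vs. a sargable range predicate.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i, f"2026-{(i % 12) + 1:02d}-15") for i in range(1000)],
)
conn.execute("CREATE INDEX idx_order_date ON orders(order_date)")

# Function wrapped around the indexed column: the index cannot be used
fn_plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders "
    "WHERE strftime('%Y', order_date) = '2026'").fetchall()

# Plain range on the column itself: index search
range_plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders "
    "WHERE order_date >= '2026-01-01' AND order_date < '2027-01-01'").fetchall()

print(fn_plan[0][-1])     # full SCAN
print(range_plan[0][-1])  # SEARCH ... USING INDEX idx_order_date
```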

Query Structure — Filter Early, Aggregate Late

1. Avoid SELECT * — Specify Columns

-- BAD: retrieves all columns even if you only need 3
SELECT * FROM orders WHERE city = 'Mumbai';

-- GOOD: specify only needed columns
SELECT order_id, customer_id, amount FROM orders WHERE city = 'Mumbai';

In columnar databases like BigQuery, SELECT * scans all columns → higher cost.

2. Filter Early with WHERE

Push filters as early as possible to reduce the data processed:

-- Option 1: filter in WHERE and let the planner do the work
SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2026-01-01';
-- Modern query planners push this filter down before the join automatically

-- Option 2: filter explicitly in a derived table
-- (useful when the planner cannot push the filter down, e.g. across some views)
SELECT o.order_id, c.name
FROM (
  SELECT order_id, customer_id FROM orders WHERE order_date >= '2026-01-01'
) o
JOIN customers c ON o.customer_id = c.customer_id;

Modern databases (PostgreSQL, BigQuery) optimize WHERE filters automatically, but understanding the concept helps you write better CTEs and subqueries.

3. Use EXISTS Instead of IN for Large Subqueries

-- SLOWER: IN with large subquery
SELECT * FROM customers
WHERE customer_id IN (SELECT customer_id FROM orders WHERE amount > 1000);

-- FASTER: EXISTS stops at first match
SELECT * FROM customers c
WHERE EXISTS (
  SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id AND o.amount > 1000
);

In some engines, IN evaluates the entire subquery before filtering, while EXISTS can stop at the first match for each row. Modern planners often rewrite both to the same plan, so check EXPLAIN before assuming one is faster.
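Whatever the plan, the two forms return the same rows, which a small sqlite3 sketch confirms (the sample data is invented for illustration):

```python
# Sketch: IN and EXISTS are logically equivalent here — same result set.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, name TEXT);
CREATE TABLE orders (customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
INSERT INTO orders VALUES (1, 1500), (1, 200), (2, 50);
""")

in_rows = conn.execute(
    "SELECT name FROM customers WHERE customer_id IN "
    "(SELECT customer_id FROM orders WHERE amount > 1000) "
    "ORDER BY name").fetchall()

exists_rows = conn.execute(
    "SELECT name FROM customers c WHERE EXISTS "
    "(SELECT 1 FROM orders o "
    " WHERE o.customer_id = c.customer_id AND o.amount > 1000) "
    "ORDER BY name").fetchall()

print(in_rows, exists_rows)  # both [('Asha',)]
```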

4. Limit Early with CTEs

Break complex queries into CTEs to filter at each step:

-- Filter early in each CTE
WITH high_value_orders AS (
  SELECT customer_id, SUM(amount) AS total
  FROM orders
  WHERE order_date >= '2026-01-01'  -- Filter here
  GROUP BY customer_id
  HAVING SUM(amount) > 10000  -- Filter again
),
active_customers AS (
  SELECT customer_id, name
  FROM customers
  WHERE status = 'active'  -- Filter here too
)
SELECT ac.name, hvo.total
FROM high_value_orders hvo
JOIN active_customers ac ON hvo.customer_id = ac.customer_id;



JOIN Optimization

1. Join Smaller Tables First

Most modern databases optimize join order automatically, but if you're using older systems or writing manual subqueries, join small tables to large tables:

-- Join small dimension tables to large fact tables
SELECT o.order_id, c.name, p.product_name
FROM orders o  -- Large table (10M rows)
JOIN products p ON o.product_id = p.product_id  -- Small table (1K rows)
JOIN customers c ON o.customer_id = c.customer_id;  -- Medium table (500K rows)

2. Index All JOIN Columns

-- Ensure both sides of the JOIN have indexes
CREATE INDEX idx_orders_customer ON orders(customer_id);
CREATE INDEX idx_customers_id ON customers(customer_id);
-- Note: if customer_id is the customers table's primary key, it is already indexed

Without indexes on JOIN keys, the database does a nested loop: for each row in the left table, scan the entire right table.

3. Avoid Cartesian Products

A missing JOIN condition creates a cartesian product: every row × every row.

-- BAD: missing ON clause (cartesian product)
SELECT * FROM orders, customers;
-- 10M orders × 500K customers = 5 trillion rows (will crash)

-- GOOD: proper JOIN
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;

4. LEFT JOIN + IS NULL for Anti-Joins

Finding rows in A not in B:

-- Find customers with no orders
SELECT c.customer_id, c.name
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.customer_id IS NULL;

This is often faster than NOT IN when the right table is large, and unlike NOT IN it is not broken by NULLs in the subquery (a single NULL makes NOT IN return no rows at all). NOT EXISTS usually performs comparably in modern databases.
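The NULL trap is worth seeing once. In this sqlite3 sketch (invented sample data), the LEFT JOIN and NOT EXISTS anti-joins agree, while NOT IN silently returns nothing because the subquery contains a NULL:

```python
# Sketch: three anti-join forms — and why NOT IN fails with NULLs.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, name TEXT);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
INSERT INTO orders VALUES (10, 1), (11, NULL);  -- note the NULL customer_id
""")

left_join = conn.execute("""
  SELECT c.name FROM customers c
  LEFT JOIN orders o ON c.customer_id = o.customer_id
  WHERE o.customer_id IS NULL ORDER BY c.name""").fetchall()

not_exists = conn.execute("""
  SELECT c.name FROM customers c
  WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id)
  ORDER BY c.name""").fetchall()

# 'x NOT IN (1, NULL)' evaluates to UNKNOWN, which excludes every row
not_in = conn.execute("""
  SELECT name FROM customers
  WHERE customer_id NOT IN (SELECT customer_id FROM orders)""").fetchall()

print(left_join)   # [('Ben',), ('Chen',)]
print(not_exists)  # [('Ben',), ('Chen',)]
print(not_in)      # [] — the NULL poisons NOT IN
```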


Aggregation and Grouping

1. Use WHERE Before GROUP BY

Filter rows before grouping to reduce the data processed:

-- GOOD: filter before grouping
SELECT city, SUM(amount) AS revenue
FROM orders
WHERE order_date >= '2026-01-01'  -- Reduces rows before aggregation
GROUP BY city;

2. Use HAVING Only for Aggregate Filters

HAVING filters after grouping. Use WHERE for non-aggregate filters:

-- BAD: non-aggregate filter in HAVING
SELECT city, SUM(amount) AS revenue
FROM orders
GROUP BY city
HAVING city = 'Mumbai';  -- Filters after grouping (inefficient)

-- GOOD: non-aggregate filter in WHERE
SELECT city, SUM(amount) AS revenue
FROM orders
WHERE city = 'Mumbai'  -- Filters before grouping
GROUP BY city;

3. Avoid DISTINCT When Possible

DISTINCT requires sorting or hashing all rows. Eliminate duplicates earlier in the pipeline when you can; for a simple column list, GROUP BY returns the same result:

-- SLOWER: DISTINCT on large result set
SELECT DISTINCT customer_id FROM orders;

-- ALTERNATIVE: GROUP BY (same result; many modern engines produce an identical plan)
SELECT customer_id FROM orders GROUP BY customer_id;
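A quick sqlite3 sketch confirming the two forms are interchangeable for this shape of query (tiny invented data set):

```python
# Sketch: SELECT DISTINCT col and SELECT col ... GROUP BY col return the same rows.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,), (1,), (3,), (2,)])

distinct = conn.execute(
    "SELECT DISTINCT customer_id FROM orders ORDER BY customer_id").fetchall()
grouped = conn.execute(
    "SELECT customer_id FROM orders GROUP BY customer_id "
    "ORDER BY customer_id").fetchall()

print(distinct, grouped)  # identical: [(1,), (2,), (3,)]
```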

4. Partial Aggregations

For very large datasets, aggregate in stages:

-- Aggregate daily first, then monthly
WITH daily_sales AS (
  SELECT DATE(order_date) AS day, SUM(amount) AS daily_revenue
  FROM orders
  GROUP BY day
)
SELECT DATE_TRUNC('month', day) AS month, SUM(daily_revenue) AS monthly_revenue
FROM daily_sales
GROUP BY month;
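Staged aggregation must give the same totals as aggregating directly, since SUM is associative. A sqlite3 sketch checking that equivalence (SQLite lacks DATE_TRUNC, so strftime('%Y-%m', ...) stands in for the monthly truncation; the data is invented):

```python
# Sketch: aggregating daily then monthly equals aggregating monthly directly.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [
    ("2026-01-05", 100.0), ("2026-01-20", 50.0),
    ("2026-02-03", 75.0),  ("2026-02-28", 25.0),
])

# Direct monthly aggregation
direct = conn.execute("""
  SELECT strftime('%Y-%m', order_date) AS month, SUM(amount)
  FROM orders GROUP BY month ORDER BY month""").fetchall()

# Staged: daily totals first, then roll them up to months
staged = conn.execute("""
  WITH daily_sales AS (
    SELECT date(order_date) AS day, SUM(amount) AS daily_revenue
    FROM orders GROUP BY day
  )
  SELECT strftime('%Y-%m', day) AS month, SUM(daily_revenue)
  FROM daily_sales GROUP BY month ORDER BY month""").fetchall()

print(direct)  # [('2026-01', 150.0), ('2026-02', 100.0)]
print(staged)  # same totals
```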

EXPLAIN Plans — Debugging Slow Queries

EXPLAIN shows how the database executes your query. Use it to find bottlenecks.

Running EXPLAIN

-- PostgreSQL / MySQL
EXPLAIN SELECT * FROM orders WHERE customer_id = 12345;

-- PostgreSQL with detailed stats
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 12345;

-- BigQuery
-- Click "Execution details" in the BigQuery console after running a query

Reading EXPLAIN Output

Key things to look for:

  1. Seq Scan / Table Scan: Bad for large tables — means no index is used

    • Fix: Add an index on the filtered column
  2. Index Scan: Good — database is using an index

  3. Nested Loop: Can be slow if the inner loop is large

    • Fix: Index JOIN columns or rewrite as a hash join
  4. Hash Join / Merge Join: Good for large JOINs

  5. Rows: Estimated number of rows processed at each step

    • If much higher than expected, check your filters

Example EXPLAIN Output (PostgreSQL)

EXPLAIN SELECT * FROM orders WHERE customer_id = 12345;

Seq Scan on orders  (cost=0.00..180000.00 rows=100 width=100)
  Filter: (customer_id = 12345)

Translation: Sequential scan (no index) → costs 180,000 units → slow.

After creating an index:

Index Scan using idx_customer_id on orders  (cost=0.42..8.44 rows=1 width=100)
  Index Cond: (customer_id = 12345)

Translation: Index scan → estimated cost 8.44 units → roughly a 21,000× lower cost estimate (cost is the planner's estimate, not measured runtime, but a drop this large means a dramatically faster query).

BigQuery Performance Tips

In BigQuery, focus on bytes processed (shows in query details):

  1. Partition tables by date — filter on partition column to scan less data

    -- Partitioned table
    SELECT * FROM orders
    WHERE DATE(order_date) = '2026-03-01';
    -- Only scans 1 day of data
  2. Cluster tables — physically groups similar rows for faster scans

    -- Cluster by city for city-level queries
    CREATE TABLE orders_clustered
    CLUSTER BY city AS
    SELECT * FROM orders;
  3. Use _PARTITIONTIME for date filtering on ingestion-time partitioned tables

    SELECT * FROM `dataset.orders`
    WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2026-01-01') AND TIMESTAMP('2026-03-31');
    -- Note: _PARTITIONTIME exists only on ingestion-time partitioned tables
