Google BigQuery Tutorial for Data Analysts

BigQuery is Google's serverless data warehouse that can scan petabytes of data in seconds. No servers to manage, no indexes to tune: just write SQL and pay only for what you query.

📚 Intermediate · ⏱️ 12 min · ✅ 7 quizzes
โ˜๏ธ

What is Google BigQuery?

Google BigQuery is a serverless, highly scalable data warehouse built for analytics. Unlike traditional databases (MySQL, PostgreSQL), you don't manage servers, configure clusters, or tune indexes. You just load data and query it.

Key Features

| Feature | What it Means | Benefit |
|---------|---------------|---------|
| Serverless | No infrastructure to manage | Start querying in seconds, no DevOps |
| Columnar storage | Data stored by column, not row | Aggregations (SUM, AVG) are 10-100x faster |
| Massively parallel | Queries run on thousands of nodes | Scan terabytes in seconds |
| Pay-per-query | $5 per TB scanned (on-demand) | Only pay for what you query, not storage |
| Separation of storage/compute | Storage and queries billed separately | Store petabytes cheaply, query on-demand |

How it Differs from Traditional Databases

MySQL/PostgreSQL (OLTP):

  • Row-oriented (fast for single-row reads/writes)
  • You manage servers, memory, indexes
  • Best for: Transactional workloads (INSERT, UPDATE, DELETE)

BigQuery (OLAP):

  • Column-oriented (fast for aggregations)
  • Fully managed (no servers)
  • Best for: Analytical queries (SELECT with GROUP BY across millions of rows)

Think of it this way...

MySQL is like renting a car: you manage fuel, maintenance, and parking. BigQuery is like Uber: you just say where you want to go, and Google handles everything. You pay per ride (per query), not monthly rent.

📊 BigQuery Basics: Datasets, Tables, and Queries

Structure

BigQuery organizes data hierarchically:

```
Project (billing account)
└── Dataset (like a database schema)
    └── Table (actual data)
```

Example: Swiggy's BigQuery setup:

```
swiggy-prod (project)
├── raw_data (dataset)
│   ├── orders
│   ├── customers
│   └── restaurants
└── analytics (dataset)
    ├── fact_orders
    └── dim_customers
```

Creating a Dataset

```sql
-- In BigQuery Console
CREATE SCHEMA `swiggy-prod.analytics`
OPTIONS (
  location = 'asia-south1',  -- Mumbai region
  description = 'Cleaned, transformed tables for dashboards'
);
```

Creating a Table

```sql
CREATE TABLE `swiggy-prod.analytics.fact_orders` (
  order_id STRING NOT NULL,
  customer_id STRING NOT NULL,
  restaurant_id STRING NOT NULL,
  order_date DATE NOT NULL,
  delivery_time_minutes INT64,
  amount FLOAT64,
  city STRING
)
PARTITION BY order_date
CLUSTER BY city;
```

Querying a Table

```sql
-- Total orders and revenue by city in March 2026
SELECT
  city,
  COUNT(*) AS order_count,
  SUM(amount) AS total_revenue,
  AVG(delivery_time_minutes) AS avg_delivery_time
FROM `swiggy-prod.analytics.fact_orders`
WHERE order_date BETWEEN '2026-03-01' AND '2026-03-31'
GROUP BY city
ORDER BY total_revenue DESC;
```

Result:

| city | order_count | total_revenue | avg_delivery_time |
|------|-------------|---------------|-------------------|
| Mumbai | 1,245,000 | 62,250,000 | 27.3 |
| Bangalore | 1,098,000 | 54,900,000 | 24.8 |
| Delhi | 987,000 | 49,350,000 | 29.1 |

Info

BigQuery Syntax: Table names are fully qualified: `project.dataset.table`. Backticks are required if names contain hyphens (e.g., swiggy-prod).

โš ๏ธ CheckpointQuiz error: Missing or invalid options array

💰 BigQuery Pricing: Pay Only for What You Query

BigQuery has two pricing models:

1. On-Demand Pricing (Most Common)

You pay per query based on data scanned:

  • $5 per TB scanned (first 1 TB per month is free)
  • Storage: $0.02 per GB per month (active), $0.01 per GB (long-term, 90+ days)

Example costs:

| Query | Data Scanned | Cost |
|-------|--------------|------|
| `SELECT COUNT(*) FROM orders` (500 GB table) | 500 GB | $2.50 |
| `SELECT city, SUM(amount) FROM orders GROUP BY city` (500 GB table) | 500 GB | $2.50 |
| `SELECT * FROM orders WHERE order_date = '2026-03-22'` (partitioned, 2 GB) | 2 GB | $0.01 |
| `SELECT order_id, amount FROM orders` (only 2 columns, 50 GB) | 50 GB | $0.25 |

Key insight: Columnar storage means you only pay for columns you query. Selecting 2 columns from a 50-column table is 25x cheaper than SELECT *.
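The column-level savings are easy to verify with arithmetic. A minimal sketch in Python, assuming the $5/TB on-demand rate above and (purely for illustration) equally sized columns:

```python
# Sketch: on-demand query cost under columnar storage.
# Assumes $5 per TB scanned and, for illustration only, equally sized columns.
PRICE_PER_TB = 5.00  # USD, on-demand rate

def query_cost(gb_scanned: float) -> float:
    """Cost in USD for a query scanning `gb_scanned` GB (1 TB = 1000 GB)."""
    return round(gb_scanned / 1000 * PRICE_PER_TB, 4)

table_gb = 500                 # 50-column table, 500 GB total
col_gb = table_gb / 50         # 10 GB per column (equal-size assumption)

select_star = query_cost(table_gb)   # SELECT * scans every column
select_two = query_cost(2 * col_gb)  # selecting 2 columns scans only those

print(select_star, select_two, round(select_star / select_two))
```

With these numbers, `SELECT *` costs $2.50 while the two-column query costs $0.10: the 25x gap described above.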

2. Flat-Rate Pricing (For Heavy Users)

Pay a fixed monthly fee for guaranteed query capacity:

  • $2,000/month for 100 slots (processing units)
  • Good for companies running 24/7 queries (e.g., real-time dashboards)

Example: Zepto (10-minute delivery) runs hundreds of queries per hour. They use flat-rate pricing to cap costs at $2,000/month instead of paying per query.
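The choice between the two models reduces to a break-even scan volume. A rough sketch using the figures above ($2,000/month flat-rate, $5/TB on-demand); the monthly volumes passed in are made-up inputs:

```python
# Sketch: break-even between on-demand ($5/TB) and flat-rate ($2,000/month).
FLAT_RATE_USD = 2000   # per month, 100 slots
ON_DEMAND_PER_TB = 5   # USD per TB scanned

breakeven_tb = FLAT_RATE_USD / ON_DEMAND_PER_TB  # 400 TB/month

def cheaper_plan(tb_per_month: float) -> str:
    """Pick the cheaper pricing model for a given monthly scan volume."""
    on_demand_cost = tb_per_month * ON_DEMAND_PER_TB
    return "flat-rate" if on_demand_cost > FLAT_RATE_USD else "on-demand"

print(breakeven_tb)        # 400.0
print(cheaper_plan(50))    # on-demand
print(cheaper_plan(900))   # flat-rate
```

Below 400 TB scanned per month, on-demand is cheaper; above it, flat-rate caps the bill.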

Cost Optimization Tips

  1. Use partitioning (scan only relevant partitions)
  2. Use clustering (prune data within partitions)
  3. Select only needed columns (avoid SELECT *)
  4. Preview data (use the free table preview; LIMIT does not reduce the data scanned or the cost)
  5. Use approximate aggregations (APPROX_COUNT_DISTINCT vs COUNT(DISTINCT))

Info

Free Tier: The first 1 TB scanned per month is free. For small projects, you might never pay. Storage is also free up to 10 GB.

🚀 Partitioning and Clustering: Optimizing Query Performance

BigQuery's two key optimization features:

Partitioning: Split Tables by Date/Range

Partitioning divides a table into segments based on a column (usually a date). Queries that filter on the partition column scan only relevant partitions, not the entire table.

Example: Partitioning by order_date:

```sql
CREATE TABLE `swiggy-prod.analytics.fact_orders`
PARTITION BY order_date
AS
SELECT * FROM `swiggy-prod.raw_data.orders`;
```

Without partitioning:

```sql
-- Scans entire 2 TB table
SELECT COUNT(*) FROM fact_orders
WHERE order_date = '2026-03-22';
-- Cost: $10
```

With partitioning:

```sql
-- Scans only the 2026-03-22 partition (2 GB)
SELECT COUNT(*) FROM fact_orders
WHERE order_date = '2026-03-22';
-- Cost: $0.01
```

Partition types:

  • Daily: One partition per day (most common)
  • Monthly: One partition per month
  • Integer range: Partition by customer_id ranges
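The savings above can be sketched numerically, assuming the example's figures (a 2 TB table, roughly 2 GB per daily partition, $5/TB):

```python
# Sketch: bytes scanned with and without partition pruning,
# using the example's figures (2 TB table, ~2 GB daily partitions).
PRICE_PER_TB = 5.0

def cost_usd(gb_scanned: float) -> float:
    """On-demand cost for a query scanning `gb_scanned` GB."""
    return round(gb_scanned / 1000 * PRICE_PER_TB, 2)

full_scan = cost_usd(2000)  # no usable partition filter: whole 2 TB table
pruned = cost_usd(2)        # WHERE order_date = '2026-03-22': one partition

print(full_scan, pruned)  # 10.0 0.01
```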

Clustering: Organize Data Within Partitions

Clustering sorts data within each partition by one or more columns. Queries filtering on clustered columns skip irrelevant blocks.

Example: Clustering by city:

```sql
CREATE TABLE `swiggy-prod.analytics.fact_orders`
PARTITION BY order_date
CLUSTER BY city
AS
SELECT * FROM `swiggy-prod.raw_data.orders`;
```

Query:

```sql
-- Partition prunes to one day (2 GB)
-- Clustering prunes to Mumbai blocks (200 MB)
SELECT COUNT(*) FROM fact_orders
WHERE order_date = '2026-03-22'
  AND city = 'Mumbai';
-- Cost: $0.001 (only 200 MB scanned)
```

When to use clustering:

  • High-cardinality columns (city, product_id, customer_id)
  • Frequently filtered columns
  • Up to 4 clustering columns
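The block-skipping idea behind clustering can be simulated in a few lines: sort rows by the clustered column, split them into blocks, and keep only blocks whose min/max range could contain the filter value. A toy sketch (block size and city data are invented):

```python
# Toy simulation of clustering: BigQuery keeps min/max metadata per storage
# block and skips blocks that cannot contain the filtered value.
BLOCK_SIZE = 3  # rows per block (real blocks are far larger)

# Rows already sorted by the clustering column (city).
rows = sorted(["Bangalore", "Bangalore", "Chennai", "Delhi", "Delhi",
               "Mumbai", "Mumbai", "Mumbai", "Pune"])
blocks = [rows[i:i + BLOCK_SIZE] for i in range(0, len(rows), BLOCK_SIZE)]

def blocks_scanned(city: str) -> int:
    """Count blocks whose [min, max] range could contain `city`."""
    return sum(1 for block in blocks if block[0] <= city <= block[-1])

print(len(blocks), blocks_scanned("Mumbai"))  # 3 2: one of three blocks skipped
```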

Think of it this way...

Flipkart's orders table (5 billion rows, 2 TB) is:

  • Partitioned by order_date (daily)
  • Clustered by category, city

A query for "Electronics orders in Mumbai on 2026-03-22" scans only:

  • One partition (one day)
  • Electronics + Mumbai blocks within that partition

Without optimization: 2 TB scanned ($10)
With optimization: 50 MB scanned ($0.0003)

🌍 BigQuery Public Datasets: Free Data to Practice

BigQuery provides free public datasets (Google doesn't charge for storage; you pay only for queries):

Popular Public Datasets

  1. Google Analytics Sample

    • bigquery-public-data.ga4_obfuscated_sample_ecommerce
    • Real e-commerce website traffic and conversions
  2. COVID-19 Open Data

    • bigquery-public-data.covid19_open_data.covid19_open_data
    • Global COVID-19 cases, deaths, vaccinations
  3. GitHub Archive

    • githubarchive.day.20260322
    • Every public GitHub event (commits, issues, PRs)
  4. New York Taxi Trips

    • bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2023
    • 1.4 billion taxi rides with pickup/dropoff times, fares, distances
  5. Stack Overflow

    • bigquery-public-data.stackoverflow.posts_questions
    • 50+ million Stack Overflow questions and answers

Example: Analyzing GitHub Activity

```sql
-- Top 10 most active GitHub repositories on March 22, 2026
SELECT
  repo.name AS repo_name,
  COUNT(*) AS event_count
FROM `githubarchive.day.20260322`
GROUP BY repo_name
ORDER BY event_count DESC
LIMIT 10;
```

Result:

| repo_name | event_count |
|-----------|-------------|
| microsoft/vscode | 12,450 |
| pytorch/pytorch | 8,932 |
| facebook/react | 7,821 |

Example: COVID-19 Analysis (India)

```sql
-- Daily new COVID-19 cases in India (last 30 days)
SELECT
  date,
  new_confirmed AS new_cases,
  new_deceased AS new_deaths
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE country_name = 'India'
  AND date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
ORDER BY date;
```
Info

Practice for Free: Query public datasets to learn BigQuery without using your own data. The first 1 TB/month is free, and most public dataset queries are under 10 GB.

⚙️ Advanced BigQuery Features

1. Nested and Repeated Fields (Arrays/Structs)

BigQuery supports complex data types (unlike most traditional SQL databases):

```sql
-- Table with nested fields
CREATE TABLE orders (
  order_id STRING,
  customer STRUCT<
    name STRING,
    email STRING,
    address STRUCT<city STRING, state STRING>
  >,
  items ARRAY<STRUCT<
    product_id STRING,
    quantity INT64,
    price FLOAT64
  >>
);

-- Query nested fields
SELECT
  order_id,
  customer.name AS customer_name,
  customer.address.city AS city,
  -- Unnest array to get individual items
  item.product_id,
  item.quantity
FROM orders
CROSS JOIN UNNEST(items) AS item
WHERE customer.address.city = 'Mumbai';
```

Use case: Storing JSON events (e.g., Google Analytics hits) without flattening.
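CROSS JOIN UNNEST is essentially a nested-loop flatten. The same shape in plain Python, with dicts and lists standing in for STRUCT and ARRAY (the sample data is invented):

```python
# Sketch: the Python equivalent of CROSS JOIN UNNEST(items),
# using dicts/lists in place of STRUCT/ARRAY. Sample data is invented.
orders = [
    {
        "order_id": "o1",
        "customer": {"name": "Asha", "address": {"city": "Mumbai"}},
        "items": [
            {"product_id": "p1", "quantity": 2},
            {"product_id": "p2", "quantity": 1},
        ],
    },
    {
        "order_id": "o2",
        "customer": {"name": "Ravi", "address": {"city": "Delhi"}},
        "items": [{"product_id": "p3", "quantity": 5}],
    },
]

# One output row per (order, item) pair, filtered on the nested city field.
flat = [
    (o["order_id"], o["customer"]["name"], item["product_id"], item["quantity"])
    for o in orders
    if o["customer"]["address"]["city"] == "Mumbai"
    for item in o["items"]
]
print(flat)  # [('o1', 'Asha', 'p1', 2), ('o1', 'Asha', 'p2', 1)]
```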

2. Scheduled Queries

Automate queries to run daily/hourly:

```sql
-- This query runs daily at 2 AM
CREATE OR REPLACE TABLE analytics.daily_summary AS
SELECT
  order_date,
  COUNT(*) AS order_count,
  SUM(amount) AS revenue
FROM raw_data.orders
WHERE order_date = CURRENT_DATE() - 1
GROUP BY order_date;
```

Set schedule: In BigQuery Console → Schedule Query → Daily at 02:00

3. Machine Learning with BigQuery ML

Train ML models using SQL:

```sql
-- Train a model to predict customer churn
CREATE OR REPLACE MODEL analytics.churn_model
OPTIONS(model_type='logistic_reg', input_label_cols=['churned']) AS
SELECT
  days_since_last_order,
  total_orders,
  avg_order_value,
  churned
FROM analytics.customer_features;

-- Predict churn for current customers
SELECT
  customer_id,
  predicted_churned,
  predicted_churned_probs[OFFSET(1)].prob AS churn_probability
FROM ML.PREDICT(MODEL analytics.churn_model,
  (SELECT * FROM analytics.current_customers)
)
ORDER BY churn_probability DESC;
```

Use cases: Churn prediction, recommendation systems, demand forecasting
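Under the hood, a logistic_reg model scores each row as a sigmoid over a weighted sum of the features. A minimal sketch of that scoring step; the weights here are invented (BigQuery ML learns real ones from the training table):

```python
import math

def churn_probability(features: dict, weights: dict, bias: float) -> float:
    """Logistic-regression score: sigmoid(bias + sum(w_i * x_i))."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1 / (1 + math.exp(-z))

# Invented weights: more days since the last order raises churn risk;
# more orders and higher spend lower it.
weights = {"days_since_last_order": 0.05, "total_orders": -0.10, "avg_order_value": -0.002}
bias = -1.0

p = churn_probability(
    {"days_since_last_order": 60, "total_orders": 3, "avg_order_value": 250},
    weights, bias,
)
print(round(p, 3))  # a probability between 0 and 1
```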

4. Streaming Inserts

Insert rows in real-time (not batches):

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "swiggy-prod.analytics.realtime_orders"

rows_to_insert = [
    {"order_id": "12345", "amount": 450, "timestamp": "2026-03-22T14:32:10"},
    {"order_id": "12346", "amount": 890, "timestamp": "2026-03-22T14:32:15"},
]

errors = client.insert_rows_json(table_id, rows_to_insert)
if errors:
    print(f"Errors: {errors}")
```

Use case: Real-time dashboards (e.g., live order tracking)

Info

Streaming cost: $0.01 per 200 MB inserted. For high-volume streaming, use BigQuery Storage Write API (cheaper).
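At $0.01 per 200 MB, streaming cost scales linearly with bytes inserted. A quick sketch (row size and daily volume are invented inputs):

```python
# Sketch: monthly cost of legacy streaming inserts at $0.01 per 200 MB.
PRICE_PER_CHUNK = 0.01  # USD
CHUNK_MB = 200

def streaming_cost(rows_per_day: int, avg_row_bytes: int, days: int = 30) -> float:
    """Approximate monthly streaming-insert cost in USD."""
    total_mb = rows_per_day * avg_row_bytes * days / 1_000_000
    return round(total_mb / CHUNK_MB * PRICE_PER_CHUNK, 2)

# 1 million 2 KB rows per day -> 60 GB/month -> $3.00
print(streaming_cost(1_000_000, 2_000))  # 3.0
```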

โš ๏ธ FinalQuiz error: Missing or invalid questions array

โš ๏ธ SummarySection error: Missing or invalid items array

Received: {"hasItems":false,"isArray":false}