Google BigQuery Tutorial for Data Analysts

BigQuery is Google's serverless data warehouse that can scan petabytes of data in seconds. No servers to manage, no indexes to tune: just write SQL and pay only for what you query.

📚 Intermediate · ⏱️ 12 min · ✅ 7 quizzes
โ˜๏ธ

What is Google BigQuery?

Google BigQuery is a serverless, highly scalable data warehouse built for analytics. Unlike traditional databases (MySQL, PostgreSQL), you don't manage servers, configure clusters, or tune indexes. You just load data and query it.

Key Features

| Feature | What it Means | Benefit |
|---------|---------------|---------|
| Serverless | No infrastructure to manage | Start querying in seconds, no DevOps |
| Columnar storage | Data stored by column, not row | Aggregations (SUM, AVG) are 10-100x faster |
| Massively parallel | Queries run on thousands of nodes | Scan terabytes in seconds |
| Pay-per-query | $5 per TB scanned (on-demand) | Only pay for what you query, not storage |
| Separation of storage/compute | Storage and queries billed separately | Store petabytes cheaply, query on-demand |

How it Differs from Traditional Databases

MySQL/PostgreSQL (OLTP):

  • Row-oriented (fast for single-row reads/writes)
  • You manage servers, memory, indexes
  • Best for: Transactional workloads (INSERT, UPDATE, DELETE)

BigQuery (OLAP):

  • Column-oriented (fast for aggregations)
  • Fully managed (no servers)
  • Best for: Analytical queries (SELECT with GROUP BY across millions of rows)

Think of it this way...

MySQL is like renting a car: you manage fuel, maintenance, and parking. BigQuery is like Uber: you just say where you want to go, and Google handles everything. You pay per ride (per query), not monthly rent.

📊 BigQuery Basics: Datasets, Tables, and Queries

Structure

BigQuery organizes data hierarchically:

```
Project (billing account)
└── Dataset (like a database schema)
    └── Table (actual data)
```

Example: Swiggy's BigQuery setup:

```
swiggy-prod (project)
├── raw_data (dataset)
│   ├── orders
│   ├── customers
│   └── restaurants
└── analytics (dataset)
    ├── fact_orders
    └── dim_customers
```

Creating a Dataset

```sql
-- In BigQuery Console
CREATE SCHEMA `swiggy-prod.analytics`
OPTIONS (
  location = 'asia-south1',  -- Mumbai region
  description = 'Cleaned, transformed tables for dashboards'
);
```

Creating a Table

```sql
CREATE TABLE `swiggy-prod.analytics.fact_orders` (
  order_id STRING NOT NULL,
  customer_id STRING NOT NULL,
  restaurant_id STRING NOT NULL,
  order_date DATE NOT NULL,
  delivery_time_minutes INT64,
  amount FLOAT64,
  city STRING
)
PARTITION BY order_date
CLUSTER BY city;
```

Querying a Table

```sql
-- Total orders and revenue by city in March 2026
SELECT
  city,
  COUNT(*) AS order_count,
  SUM(amount) AS total_revenue,
  AVG(delivery_time_minutes) AS avg_delivery_time
FROM `swiggy-prod.analytics.fact_orders`
WHERE order_date BETWEEN '2026-03-01' AND '2026-03-31'
GROUP BY city
ORDER BY total_revenue DESC;
```

Result:

| city | order_count | total_revenue | avg_delivery_time |
|------|-------------|---------------|-------------------|
| Mumbai | 1,245,000 | 62,250,000 | 27.3 |
| Bangalore | 1,098,000 | 54,900,000 | 24.8 |
| Delhi | 987,000 | 49,350,000 | 29.1 |

Info

BigQuery Syntax: Table names are fully qualified: `project.dataset.table`. Backticks are required if names contain hyphens (e.g., swiggy-prod).

โš ๏ธ CheckpointQuiz error: Missing or invalid options array

💰 BigQuery Pricing: Pay Only for What You Query

BigQuery has two pricing models:

1. On-Demand Pricing (Most Common)

You pay per query based on data scanned:

  • $5 per TB scanned (first 1 TB per month is free)
  • Storage: $0.02 per GB per month (active), $0.01 per GB (long-term, 90+ days)

Example costs:

| Query | Data Scanned | Cost |
|-------|--------------|------|
| `SELECT COUNT(*) FROM orders` (500 GB table) | 500 GB | $2.50 |
| `SELECT city, SUM(amount) FROM orders GROUP BY city` (500 GB table) | 500 GB | $2.50 |
| `SELECT * FROM orders WHERE order_date = '2026-03-22'` (partitioned, 2 GB) | 2 GB | $0.01 |
| `SELECT order_id, amount FROM orders` (only 2 columns, 50 GB) | 50 GB | $0.25 |

Key insight: Columnar storage means you only pay for columns you query. Selecting 2 columns from a 50-column table is 25x cheaper than SELECT *.
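The column-level savings are easy to verify with arithmetic. A minimal sketch in Python, assuming the $5/TB on-demand rate above and (purely for illustration) equally sized columns:

```python
# Sketch: on-demand query cost under columnar storage.
# Assumes $5 per TB scanned and, for illustration only, equally sized columns.
PRICE_PER_TB = 5.00  # USD, on-demand rate

def query_cost(gb_scanned: float) -> float:
    """Cost in USD for a query scanning `gb_scanned` GB (1 TB = 1000 GB)."""
    return round(gb_scanned / 1000 * PRICE_PER_TB, 4)

table_gb = 500                 # 50-column table, 500 GB total
col_gb = table_gb / 50         # 10 GB per column (equal-size assumption)

select_star = query_cost(table_gb)   # SELECT * scans every column
select_two = query_cost(2 * col_gb)  # selecting 2 columns scans only those

print(select_star, select_two, round(select_star / select_two))
```

With these numbers, `SELECT *` costs $2.50 while the two-column query costs $0.10: the 25x gap described above.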

2. Flat-Rate Pricing (For Heavy Users)

Pay a fixed monthly fee for guaranteed query capacity:

  • $2,000/month for 100 slots (processing units)
  • Good for companies running 24/7 queries (e.g., real-time dashboards)

Example: Zepto (10-minute delivery) runs hundreds of queries per hour. They use flat-rate pricing to cap costs at $2,000/month instead of paying per query.
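The choice between the two models reduces to a break-even scan volume. A rough sketch using the figures above ($2,000/month flat-rate, $5/TB on-demand); the monthly volumes passed in are made-up inputs:

```python
# Sketch: break-even between on-demand ($5/TB) and flat-rate ($2,000/month).
FLAT_RATE_USD = 2000   # per month, 100 slots
ON_DEMAND_PER_TB = 5   # USD per TB scanned

breakeven_tb = FLAT_RATE_USD / ON_DEMAND_PER_TB  # 400 TB/month

def cheaper_plan(tb_per_month: float) -> str:
    """Pick the cheaper pricing model for a given monthly scan volume."""
    on_demand_cost = tb_per_month * ON_DEMAND_PER_TB
    return "flat-rate" if on_demand_cost > FLAT_RATE_USD else "on-demand"

print(breakeven_tb)        # 400.0
print(cheaper_plan(50))    # on-demand
print(cheaper_plan(900))   # flat-rate
```

Below 400 TB scanned per month, on-demand is cheaper; above it, flat-rate caps the bill.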

Cost Optimization Tips

  1. Use partitioning (scan only relevant partitions)
  2. Use clustering (prune data within partitions)
  3. Select only needed columns (avoid SELECT *)
  4. Preview data (use the free table preview; LIMIT does not reduce the data scanned or the cost)
  5. Use approximate aggregations (APPROX_COUNT_DISTINCT vs COUNT(DISTINCT))

Info

Free Tier: The first 1 TB scanned per month is free. For small projects, you might never pay. Storage is also free up to 10 GB.

🚀 Partitioning and Clustering: Optimizing Query Performance

BigQuery's two key optimization features:

Partitioning: Split Tables by Date/Range

Partitioning divides a table into segments based on a column (usually a date). Queries that filter on the partition column scan only relevant partitions, not the entire table.

Example: Partitioning by order_date:

```sql
CREATE TABLE `swiggy-prod.analytics.fact_orders`
PARTITION BY order_date
AS
SELECT * FROM `swiggy-prod.raw_data.orders`;
```

Without partitioning:

```sql
-- Scans entire 2 TB table
SELECT COUNT(*) FROM fact_orders
WHERE order_date = '2026-03-22';
-- Cost: $10
```

With partitioning:

```sql
-- Scans only the 2026-03-22 partition (2 GB)
SELECT COUNT(*) FROM fact_orders
WHERE order_date = '2026-03-22';
-- Cost: $0.01
```

Partition types:

  • Daily: One partition per day (most common)
  • Monthly: One partition per month
  • Integer range: Partition by customer_id ranges
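The savings above can be sketched numerically, assuming the example's figures (a 2 TB table, roughly 2 GB per daily partition, $5/TB):

```python
# Sketch: bytes scanned with and without partition pruning,
# using the example's figures (2 TB table, ~2 GB daily partitions).
PRICE_PER_TB = 5.0

def cost_usd(gb_scanned: float) -> float:
    """On-demand cost for a query scanning `gb_scanned` GB."""
    return round(gb_scanned / 1000 * PRICE_PER_TB, 2)

full_scan = cost_usd(2000)  # no usable partition filter: whole 2 TB table
pruned = cost_usd(2)        # WHERE order_date = '2026-03-22': one partition

print(full_scan, pruned)  # 10.0 0.01
```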

Clustering: Organize Data Within Partitions

Clustering sorts data within each partition by one or more columns. Queries filtering on clustered columns skip irrelevant blocks.

Example: Clustering by city:

```sql
CREATE TABLE `swiggy-prod.analytics.fact_orders`
PARTITION BY order_date
CLUSTER BY city
AS
SELECT * FROM `swiggy-prod.raw_data.orders`;
```

Query:

```sql
-- Partition prunes to one day (2 GB)
-- Clustering prunes to Mumbai blocks (200 MB)
SELECT COUNT(*) FROM fact_orders
WHERE order_date = '2026-03-22'
  AND city = 'Mumbai';
-- Cost: $0.001 (only 200 MB scanned)
```

When to use clustering:

  • High-cardinality columns (city, product_id, customer_id)
  • Frequently filtered columns
  • Up to 4 clustering columns
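The block-skipping idea behind clustering can be simulated in a few lines: sort rows by the clustered column, split them into blocks, and keep only blocks whose min/max range could contain the filter value. A toy sketch (block size and city data are invented):

```python
# Toy simulation of clustering: BigQuery keeps min/max metadata per storage
# block and skips blocks that cannot contain the filtered value.
BLOCK_SIZE = 3  # rows per block (real blocks are far larger)

# Rows already sorted by the clustering column (city).
rows = sorted(["Bangalore", "Bangalore", "Chennai", "Delhi", "Delhi",
               "Mumbai", "Mumbai", "Mumbai", "Pune"])
blocks = [rows[i:i + BLOCK_SIZE] for i in range(0, len(rows), BLOCK_SIZE)]

def blocks_scanned(city: str) -> int:
    """Count blocks whose [min, max] range could contain `city`."""
    return sum(1 for block in blocks if block[0] <= city <= block[-1])

print(len(blocks), blocks_scanned("Mumbai"))  # 3 2: one of three blocks skipped
```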

Think of it this way...

Flipkart's orders table (5 billion rows, 2 TB) is:

  • Partitioned by order_date (daily)
  • Clustered by category, city

A query for "Electronics orders in Mumbai on 2026-03-22" scans only:

  • One partition (one day)
  • Electronics + Mumbai blocks within that partition

Without optimization: 2 TB scanned ($10)
With optimization: 50 MB scanned ($0.0003)

🌍 BigQuery Public Datasets: Free Data to Practice

BigQuery provides free public datasets (Google doesn't charge for storage; you pay only for queries):

Popular Public Datasets

  1. Google Analytics Sample

    • bigquery-public-data.ga4_obfuscated_sample_ecommerce
    • Real e-commerce website traffic and conversions
  2. COVID-19 Open Data

    • bigquery-public-data.covid19_open_data.covid19_open_data
    • Global COVID-19 cases, deaths, vaccinations
  3. GitHub Archive

    • githubarchive.day.20260322
    • Every public GitHub event (commits, issues, PRs)
  4. New York Taxi Trips

    • bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2023
    • 1.4 billion taxi rides with pickup/dropoff times, fares, distances
  5. Stack Overflow

    • bigquery-public-data.stackoverflow.posts_questions
    • 50+ million Stack Overflow questions and answers

Example: Analyzing GitHub Activity

```sql
-- Top 10 most active GitHub repositories on March 22, 2026
SELECT
  repo.name AS repo_name,
  COUNT(*) AS event_count
FROM `githubarchive.day.20260322`
GROUP BY repo_name
ORDER BY event_count DESC
LIMIT 10;
```

Result:

| repo_name | event_count |
|-----------|-------------|
| microsoft/vscode | 12,450 |
| pytorch/pytorch | 8,932 |
| facebook/react | 7,821 |

Example: COVID-19 Analysis (India)

```sql
-- Daily new COVID-19 cases in India (last 30 days)
SELECT
  date,
  new_confirmed AS new_cases,
  new_deceased AS new_deaths
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE country_name = 'India'
  AND date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
ORDER BY date;
```
Info

Practice for Free: Query public datasets to learn BigQuery without using your own data. The first 1 TB/month is free, and most public dataset queries are under 10 GB.

⚙️ Advanced BigQuery Features

1. Nested and Repeated Fields (Arrays/Structs)

BigQuery supports complex data types (unlike most traditional SQL databases):

```sql
-- Table with nested fields
CREATE TABLE orders (
  order_id STRING,
  customer STRUCT<
    name STRING,
    email STRING,
    address STRUCT<city STRING, state STRING>
  >,
  items ARRAY<STRUCT<
    product_id STRING,
    quantity INT64,
    price FLOAT64
  >>
);

-- Query nested fields
SELECT
  order_id,
  customer.name AS customer_name,
  customer.address.city AS city,
  -- Unnest array to get individual items
  item.product_id,
  item.quantity
FROM orders
CROSS JOIN UNNEST(items) AS item
WHERE customer.address.city = 'Mumbai';
```

Use case: Storing JSON events (e.g., Google Analytics hits) without flattening.
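CROSS JOIN UNNEST is essentially a nested-loop flatten. The same shape in plain Python, with dicts and lists standing in for STRUCT and ARRAY (the sample data is invented):

```python
# Sketch: the Python equivalent of CROSS JOIN UNNEST(items),
# using dicts/lists in place of STRUCT/ARRAY. Sample data is invented.
orders = [
    {
        "order_id": "o1",
        "customer": {"name": "Asha", "address": {"city": "Mumbai"}},
        "items": [
            {"product_id": "p1", "quantity": 2},
            {"product_id": "p2", "quantity": 1},
        ],
    },
    {
        "order_id": "o2",
        "customer": {"name": "Ravi", "address": {"city": "Delhi"}},
        "items": [{"product_id": "p3", "quantity": 5}],
    },
]

# One output row per (order, item) pair, filtered on the nested city field.
flat = [
    (o["order_id"], o["customer"]["name"], item["product_id"], item["quantity"])
    for o in orders
    if o["customer"]["address"]["city"] == "Mumbai"
    for item in o["items"]
]
print(flat)  # [('o1', 'Asha', 'p1', 2), ('o1', 'Asha', 'p2', 1)]
```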

2. Scheduled Queries

Automate queries to run daily/hourly:

```sql
-- This query runs daily at 2 AM
CREATE OR REPLACE TABLE analytics.daily_summary AS
SELECT
  order_date,
  COUNT(*) AS order_count,
  SUM(amount) AS revenue
FROM raw_data.orders
WHERE order_date = CURRENT_DATE() - 1
GROUP BY order_date;
```

Set schedule: In BigQuery Console → Schedule Query → Daily at 02:00

3. Machine Learning with BigQuery ML

Train ML models using SQL:

```sql
-- Train a model to predict customer churn
CREATE OR REPLACE MODEL analytics.churn_model
OPTIONS(model_type='logistic_reg', input_label_cols=['churned']) AS
SELECT
  days_since_last_order,
  total_orders,
  avg_order_value,
  churned
FROM analytics.customer_features;

-- Predict churn for current customers
SELECT
  customer_id,
  predicted_churned,
  predicted_churned_probs[OFFSET(1)].prob AS churn_probability
FROM ML.PREDICT(MODEL analytics.churn_model,
  (SELECT * FROM analytics.current_customers)
)
ORDER BY churn_probability DESC;
```

Use cases: Churn prediction, recommendation systems, demand forecasting
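Under the hood, a logistic_reg model scores each row as a sigmoid over a weighted sum of the features. A minimal sketch of that scoring step; the weights here are invented (BigQuery ML learns real ones from the training table):

```python
import math

def churn_probability(features: dict, weights: dict, bias: float) -> float:
    """Logistic-regression score: sigmoid(bias + sum(w_i * x_i))."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1 / (1 + math.exp(-z))

# Invented weights: more days since the last order raises churn risk;
# more orders and higher spend lower it.
weights = {"days_since_last_order": 0.05, "total_orders": -0.10, "avg_order_value": -0.002}
bias = -1.0

p = churn_probability(
    {"days_since_last_order": 60, "total_orders": 3, "avg_order_value": 250},
    weights, bias,
)
print(round(p, 3))  # a probability between 0 and 1
```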

4. Streaming Inserts

Insert rows in real-time (not batches):

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "swiggy-prod.analytics.realtime_orders"

rows_to_insert = [
    {"order_id": "12345", "amount": 450, "timestamp": "2026-03-22T14:32:10"},
    {"order_id": "12346", "amount": 890, "timestamp": "2026-03-22T14:32:15"},
]

errors = client.insert_rows_json(table_id, rows_to_insert)
if errors:
    print(f"Errors: {errors}")
```

Use case: Real-time dashboards (e.g., live order tracking)

Info

Streaming cost: $0.01 per 200 MB inserted. For high-volume streaming, use BigQuery Storage Write API (cheaper).
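At $0.01 per 200 MB, streaming cost scales linearly with bytes inserted. A quick sketch (row size and daily volume are invented inputs):

```python
# Sketch: monthly cost of legacy streaming inserts at $0.01 per 200 MB.
PRICE_PER_CHUNK = 0.01  # USD
CHUNK_MB = 200

def streaming_cost(rows_per_day: int, avg_row_bytes: int, days: int = 30) -> float:
    """Approximate monthly streaming-insert cost in USD."""
    total_mb = rows_per_day * avg_row_bytes * days / 1_000_000
    return round(total_mb / CHUNK_MB * PRICE_PER_CHUNK, 2)

# 1 million 2 KB rows per day -> 60 GB/month -> $3.00
print(streaming_cost(1_000_000, 2_000))  # 3.0
```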

โš ๏ธ FinalQuiz error: Missing or invalid questions array

โš ๏ธ SummarySection error: Missing or invalid items array

Received: {"hasItems":false,"isArray":false}