Big Data Analytics (CS 604) - Semester 6 BTech IT at AUV

Overview

🎯

Why it matters

Facebook processes petabytes of data. Netflix's recommendation system analyzes billions of records. Big Data = big money: companies pay a premium for engineers who can handle massive datasets.

💼

Placement relevance

Data Engineer roles at FAANG and analytics positions value Hadoop/Spark skills. A growing field, with ₹20-45 LPA for big data specialists. Cloud companies also need big data expertise.

🔗

Prerequisites for

Data Engineering · Data Science · Cloud Data Platforms · Stream Processing · Data Warehousing

📚

Recommended books

Hadoop: The Definitive Guide by Tom White · Learning Spark by Holden Karau et al. · Big Data: Principles and Best Practices of Scalable Realtime Data Systems by Nathan Marz and James Warren · MongoDB: The Definitive Guide by Shannon Bradshaw et al.

Curriculum — 4 Units

Unit 1 · 7 Topics

Big Data Basics

⚡ Key Formulae

MapReduce: Map(key, value) → Shuffle/Sort → Reduce(key, list<values>)

HDFS: NameNode (metadata) + DataNodes (blocks, default replication factor 3)

3Vs (Volume, Velocity, Variety)

Big Data Characteristics (Veracity, Value)

Distributed Systems Concepts

Hadoop Ecosystem Overview

HDFS Architecture

MapReduce Programming Model

YARN (Resource Management)

Unit 2 · 7 Topics

NoSQL Databases

⚡ Key Formulae

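CAP: Consistency, Availability, Partition Tolerance (at most 2 of 3 can be guaranteed; during a network partition the choice is between consistency and availability)

BASE: Basically Available, Soft state, Eventual consistency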

CAP Theorem

BASE vs ACID

MongoDB (Document Store)

Cassandra (Column Store)

HBase (Column-Oriented)

Redis (Key-Value Store)

Graph Databases (Neo4j)

Unit 3 · 7 Topics

Apache Spark

⚡ Key Formulae

RDD Operations: Lazy transformations + eager actions

DAG: Directed Acyclic Graph for execution optimization

RDDs (Resilient Distributed Datasets)

Transformations (map, filter, flatMap)

Actions (collect, count, reduce)

Spark SQL & DataFrames

Spark Streaming

MLlib (Machine Learning Library)

Spark vs MapReduce

Unit 4 · 7 Topics

Data Analytics & Tools

⚡ Key Formulae

ETL: Extract → Transform → Load (data pipeline)

Lambda Architecture: Batch Layer + Speed Layer + Serving Layer
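
The ETL flow listed above can be sketched with the Python standard library alone. This is a minimal illustration, not a production pipeline: the sample CSV text, the `orders` table, and the function names `extract`, `transform`, and `load` are all assumptions made up for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; a real pipeline would read from files, APIs or logs.
RAW_CSV = """order_id,amount,city
1, 250.50 ,pune
2,,mumbai
3,99.00,Delhi
"""

def extract(text):
    # Extract: parse the raw CSV text into a list of dict rows.
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    # Transform: drop rows with a missing amount, cast types, tidy city names.
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue
        cleaned.append((int(row["order_id"]), float(amount), row["city"].strip().title()))
    return cleaned

def load(rows, conn):
    # Load: write the cleaned rows into a target table.
    conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, city TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory database stands in for a warehouse
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT order_id, amount, city FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

Keeping the three stages as separate functions mirrors the Extract → Transform → Load arrow: each stage can be tested or swapped out on its own.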

Hive (SQL on Hadoop)

Pig (Data Flow Language)

Apache Kafka (Streaming)

Data Warehousing

ETL Processes

Data Visualization

Real-Time Analytics

Previous Year Questions

Unit 1 · 2023 · End Semester · 10 marks

Write MapReduce pseudocode for the Word Count problem. Given the input 'hello world hello', show the Map output, Shuffle phase, and Reduce output step-by-step.

Unit 2 · 2023 · End Semester · 8 marks

Explain the CAP theorem with examples. For an e-commerce site, would you prioritize CA, CP, or AP? Justify. Compare MongoDB and Cassandra.

Unit 3 · 2022 · End Semester · 6 marks

What are RDDs in Spark? Explain transformations vs actions with examples. Why is Spark faster than MapReduce?

Exam Strategy

🗺️

MapReduce examples

Word count, average calculation, max value — practice 5 problems. Show Map output (key-value pairs), Shuffle phase, Reduce output. Tabular format helps.
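
To make the three phases concrete, here is a small single-machine simulation of word count on the input 'hello world hello'. It only mimics the Map → Shuffle/Sort → Reduce data flow in plain Python; it is not Hadoop code, and the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are assumptions chosen for the example.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def shuffle_phase(pairs):
    # Shuffle/Sort: group the values by key, then order the keys.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_phase(grouped):
    # Reduce: sum the list of values for each key.
    return {key: sum(values) for key, values in grouped.items()}

mapped = map_phase("hello world hello")
shuffled = shuffle_phase(mapped)
reduced = reduce_phase(shuffled)

print("Map output:    ", mapped)    # [('hello', 1), ('world', 1), ('hello', 1)]
print("Shuffle output:", shuffled)  # {'hello': [1, 1], 'world': [1]}
print("Reduce output: ", reduced)   # {'hello': 2, 'world': 1}
```

The three printed lines correspond to the three columns of the table suggested above. For average or max problems, only the emitted value and the reduce function would change.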

🎯

CAP theorem is gold

CAP theorem + ACID vs BASE comparison appear in EVERY exam. Make a comparison table. Give examples: MongoDB (usually classed as CP), Cassandra (usually classed as AP).

⚡

Spark vs Hadoop

Why Spark is faster (in-memory vs disk). RDD lineage for fault tolerance. Lazy evaluation concept. Always asked in exams.
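
Lazy evaluation and lineage are easier to explain with a toy model. The sketch below is plain Python and deliberately NOT the Spark API: `ToyRDD` is a hypothetical class that only records transformations (its lineage) and runs them when an action such as `collect` is called.

```python
class ToyRDD:
    """Toy stand-in for an RDD, for illustration only (not the Spark API)."""

    def __init__(self, source, lineage=()):
        self.source = source    # original data, kept so results can be recomputed
        self.lineage = lineage  # recorded transformations, not yet executed

    def map(self, fn):
        # Transformation: lazy, only extends the lineage.
        return ToyRDD(self.source, self.lineage + (("map", fn),))

    def filter(self, fn):
        # Transformation: lazy, only extends the lineage.
        return ToyRDD(self.source, self.lineage + (("filter", fn),))

    def _compute(self):
        # Replay the lineage over the source data.
        data = iter(self.source)
        for kind, fn in self.lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    def collect(self):
        # Action: eager, triggers the actual computation.
        return list(self._compute())

calls = []

def square(x):
    calls.append(x)  # record each invocation to show when work really happens
    return x * x

rdd = ToyRDD([1, 2, 3, 4]).map(square).filter(lambda x: x % 2 == 0)
calls_before = len(calls)  # 0: no transformation has run yet
result = rdd.collect()     # the action triggers map and filter
calls_after = len(calls)   # 4: square ran once per element

print(calls_before, result, calls_after)  # 0 [4, 16] 4
```

Because a `ToyRDD` holds only its source and the list of steps, a lost result can be rebuilt by replaying the lineage, which is the idea behind RDD fault tolerance.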

Related Subjects

Database Management Systems