Why Python for Data Analysis?
Python has become the dominant language for data analysis — and for good reason. While SQL is perfect for querying databases, Python excels at everything that comes after: cleaning messy data, performing complex transformations, statistical analysis, and creating visualizations.
At companies like Swiggy, PhonePe, and Flipkart, data analysts use SQL to extract data and Python to analyze it. Here's why:
Flexibility: Python handles any data format — CSV, JSON, Excel, APIs, databases — without breaking a sweat. Need to merge data from three different sources with different structures? Python makes it straightforward.
Libraries: The Python ecosystem offers specialized tools for every task:
- Pandas: DataFrames for tabular data manipulation (like Excel on steroids)
- NumPy: Fast numerical computations and array operations
- Matplotlib/Seaborn: Publication-quality visualizations
- Scikit-learn: Machine learning for predictions and clustering
Reproducibility: Unlike Excel, Python scripts are reproducible. Write your analysis once, run it on updated data anytime. No manual copy-pasting, no "I forgot which cells I edited" moments.
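As a sketch of that three-source merge, here's a toy example (the DataFrames below stand in for real CSV/JSON/Excel files — the structure is made up for illustration):

```python
import pandas as pd

# Hypothetical sources with different structures:
# an orders table, a customer lookup, and a per-city target sheet
orders = pd.DataFrame({'order_id': [1, 2], 'customer_id': [10, 11], 'amount': [2500, 3200]})
customers = pd.DataFrame({'customer_id': [10, 11], 'city': ['Mumbai', 'Delhi']})
targets = pd.DataFrame({'city': ['Mumbai', 'Delhi'], 'target': [40000, 35000]})

# Two chained merges combine all three sources into one table
merged = orders.merge(customers, on='customer_id').merge(targets, on='city')
print(merged)
```

Each `merge` joins on a shared key column, the same idea as a SQL JOIN.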
SQL is like a librarian who helps you find books. Python is what you do with those books — reading, highlighting, summarizing, and connecting ideas across multiple sources to write your own analysis.
Setting Up Your Python Environment
Before writing code, you need Python installed and a way to run your analysis. Here's the quickest path:
Install Python with Anaconda
Anaconda is a Python distribution that includes Python + all major data science libraries pre-installed. Download it from anaconda.com/download — it's free.
Once installed, you can run Python in three ways:
1. Jupyter Notebook (recommended for beginners):
jupyter notebook
Opens in your browser. Combines code, output, and notes in one document. Perfect for exploratory analysis.
2. VS Code (recommended for serious projects):
Install VS Code, add the Python extension, and run .py files directly. Better for scripts you'll reuse.
3. Command line (quick tests):
python
>>> 2 + 2
4
Essential Libraries to Install
If you're not using Anaconda, install these manually:
pip install pandas numpy matplotlib seaborn jupyter
Verify your installation:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
print(f"Pandas: {pd.__version__}")
print(f"NumPy: {np.__version__}")
# Should print version numbers without errors
Pro Tip: Create a dedicated folder for your data projects (e.g., C:\data-projects\ or ~/data-projects/). Store your datasets, notebooks, and scripts there. Keep it organized from day one.
Your First Data Analysis in Python
Let's analyze real data from scratch. We'll use a simple Flipkart orders dataset to answer: Which city generates the most revenue?
Step 1: Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Step 2: Load Data
# Load CSV file
orders = pd.read_csv('flipkart_orders.csv')
# View first 5 rows
print(orders.head())
# Check shape: (rows, columns)
print(f"Dataset has {orders.shape[0]} rows and {orders.shape[1]} columns")
Sample output:
order_id city amount order_date
0 1001 Mumbai 2500 2026-01-15
1 1002 Delhi 3200 2026-01-16
2 1003 Mumbai 1800 2026-01-16
3 1004 Bangalore 4100 2026-01-17
4 1005 Chennai 2900 2026-01-18
Dataset has 50000 rows and 4 columns
Step 3: Explore the Data
# Get summary statistics
print(orders.describe())
# Check for missing values
print(orders.isnull().sum())
# Check data types
print(orders.dtypes)
Step 4: Analyze Revenue by City
# Group by city and sum revenue
revenue_by_city = orders.groupby('city')['amount'].sum().sort_values(ascending=False)
print(revenue_by_city)
Output:
city
Mumbai 42350000
Delhi 38200000
Bangalore 35600000
Chennai 28900000
Name: amount, dtype: int64
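If you want more than one statistic per city, `groupby` can compute several at once with `.agg`. A minimal sketch on toy data (the column names mirror the orders example above):

```python
import pandas as pd

orders = pd.DataFrame({
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Chennai'],
    'amount': [2500, 3200, 1800, 2900],
})

# One pass over the data yields total, average, and order count per city
stats = orders.groupby('city')['amount'].agg(['sum', 'mean', 'count'])
print(stats)
```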
Step 5: Visualize Results
# Create bar chart
revenue_by_city.plot(kind='bar', color='steelblue', figsize=(10, 6))
plt.title('Revenue by City', fontsize=16)
plt.xlabel('City', fontsize=12)
plt.ylabel('Revenue (₹)', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
That's it! Five steps to go from raw data to insights with visualization. This workflow — load, explore, analyze, visualize — is what data analysts do daily.
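If you want the chart as an image file for a report, call `plt.savefig` before `plt.show`. A self-contained sketch (the small revenue Series and filename are made up for illustration; the `Agg` backend lets it run without a display):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, safe on servers
import matplotlib.pyplot as plt
import pandas as pd

revenue = pd.Series({'Mumbai': 42350000, 'Delhi': 38200000}, name='amount')
revenue.plot(kind='bar', color='steelblue')
plt.title('Revenue by City')
plt.tight_layout()
plt.savefig('revenue_by_city.png', dpi=150, bbox_inches='tight')  # writes a PNG to disk
```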
Key Python Concepts for Analysts
You don't need to master all of Python — just the parts relevant to data work. Here are the essentials:
1. Variables and Data Types
# Numbers
revenue = 125000
growth_rate = 0.23
# Strings
city = "Mumbai"
status = 'delivered'
# Booleans
is_active = True
has_discount = False
# Lists (ordered collection)
cities = ['Mumbai', 'Delhi', 'Bangalore']
revenues = [42000, 38000, 35000]
# Dictionaries (key-value pairs)
order = {
'order_id': 1001,
'city': 'Mumbai',
'amount': 2500
}
2. Indexing and Slicing
cities = ['Mumbai', 'Delhi', 'Bangalore', 'Chennai', 'Hyderabad']
# Access by index (starts at 0)
print(cities[0]) # Mumbai
print(cities[-1]) # Hyderabad (last element)
# Slicing: [start:end] (end not included)
print(cities[1:3]) # ['Delhi', 'Bangalore']
print(cities[:2]) # First 2: ['Mumbai', 'Delhi']
print(cities[2:])   # From index 2 onward: ['Bangalore', 'Chennai', 'Hyderabad']
3. Loops and Conditionals
# For loop
cities = ['Mumbai', 'Delhi', 'Bangalore']
for city in cities:
print(f"Analyzing {city} data...")
# If-elif-else
amount = 2500
if amount > 3000:
category = 'High'
elif amount > 1000:
category = 'Medium'
else:
category = 'Low'
4. Functions (Reusable Code Blocks)
def calculate_gst(amount, rate=0.18):
"""Calculate GST on an amount"""
gst = amount * rate
total = amount + gst
return total
# Use the function
order_amount = 2500
final_amount = calculate_gst(order_amount)
print(f"Total with GST: ₹{final_amount}")  # ₹2950.0
5. List Comprehensions (Concise Loops)
# Long way
amounts = [1000, 1500, 2000, 2500]
with_gst = []
for amount in amounts:
with_gst.append(amount * 1.18)
# Short way (list comprehension)
with_gst = [amount * 1.18 for amount in amounts]
print(with_gst) # [1180.0, 1770.0, 2360.0, 2950.0]
# With condition
high_value = [amount for amount in amounts if amount > 1500]
print(high_value)  # [2000, 2500]
For Analysts: You'll use loops less often than you think. Pandas has vectorized operations that apply functions to entire columns at once — much faster than looping row by row.
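To make the vectorization point concrete, here's the GST calculation from earlier applied to a whole column at once — no loop required (toy data for illustration):

```python
import pandas as pd

orders = pd.DataFrame({'amount': [1000, 1500, 2000, 2500]})

# Vectorized: one expression multiplies every row's amount at once,
# running in optimized C under the hood instead of a Python-level loop
orders['with_gst'] = orders['amount'] * 1.18

print(orders)
```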
Introduction to Pandas DataFrames
Pandas is the core library for data analysis in Python. Its central data structure is the DataFrame — a table with rows and columns, like a spreadsheet or SQL result set.
Creating DataFrames
import pandas as pd
# From a dictionary
data = {
'city': ['Mumbai', 'Delhi', 'Bangalore', 'Chennai'],
'orders': [1200, 1100, 950, 800],
'revenue': [42000, 38000, 35000, 28000]
}
df = pd.DataFrame(data)
print(df)
Output:
city orders revenue
0 Mumbai 1200 42000
1 Delhi 1100 38000
2 Bangalore 950 35000
3 Chennai 800 28000
Basic DataFrame Operations
# Select a column (returns a Series)
print(df['city'])
# Select multiple columns (returns a DataFrame)
print(df[['city', 'revenue']])
# Filter rows
high_revenue = df[df['revenue'] > 35000]
print(high_revenue)
# Add a new column
df['avg_order_value'] = df['revenue'] / df['orders']
print(df)
Reading from Files
# CSV files (most common)
df = pd.read_csv('orders.csv')
# Excel files
df = pd.read_excel('sales_data.xlsx', sheet_name='Q1_Sales')
# From SQL database
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql_query('SELECT * FROM orders', conn)
# JSON files (APIs often return JSON)
df = pd.read_json('api_response.json')
Essential DataFrame Methods
# View data
df.head(10) # First 10 rows
df.tail(5) # Last 5 rows
df.sample(10) # Random 10 rows
# Get info
df.shape # (rows, columns)
df.info() # Column types and memory usage
df.describe() # Statistical summary of numeric columns
# Column operations
df.columns # List of column names
df.dtypes # Data type of each column
df['amount'].sum() # Sum of amount column
df['amount'].mean() # Average
df['amount'].median() # Median
df['amount'].max() # Maximum value
If you're comfortable with SQL, think of a DataFrame as the result of a SELECT query that you can continue manipulating in memory. df.groupby('city') is like GROUP BY city, df[df['amount'] > 1000] is like WHERE amount > 1000.
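A side-by-side sketch of that SQL-to-pandas mapping, on toy data — each pandas line plays the role of the SQL statement in the comment above it:

```python
import pandas as pd

orders = pd.DataFrame({
    'city': ['Mumbai', 'Delhi', 'Mumbai'],
    'amount': [2500, 800, 1800],
})

# SELECT * FROM orders WHERE amount > 1000
big = orders[orders['amount'] > 1000]

# SELECT city, SUM(amount) FROM orders GROUP BY city
totals = orders.groupby('city')['amount'].sum()

# SELECT * FROM orders ORDER BY amount DESC LIMIT 2
top2 = orders.sort_values('amount', ascending=False).head(2)

print(big)
print(totals)
print(top2)
```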