Major Dataset Platforms
| Platform | Description | Best For | URL |
|----------|-------------|----------|-----|
| Kaggle | 50K+ datasets, competitions, notebooks | Learning, portfolio projects | kaggle.com/datasets |
| UCI Repository | 600+ classic ML datasets | Benchmarking, academic | archive.ics.uci.edu/ml |
| Google Dataset Search | Search engine for datasets | Discovery across sources | datasetsearch.research.google.com |
| Data.gov.in | Indian government open data | India-specific analysis | data.gov.in |
| World Bank | Global economic indicators | International comparisons | data.worldbank.org |
| Our World in Data | Research datasets (health, climate) | Social impact analysis | ourworldindata.org |
| GitHub Awesome | Curated public datasets | Topic-specific collections | github.com/awesomedata |
| FiveThirtyEight | Journalism datasets with stories | Reproducible analysis | data.fivethirtyeight.com |
Business & E-commerce Datasets
Recommended for Portfolio Projects:
1. Online Retail Dataset (Kaggle)
- Size: 500K+ transactions
- Features: Customer ID, product, quantity, price, date, country
- Projects: RFM analysis, customer segmentation, cohort retention, market basket
- Why it's great: Multiple customers, time series, clean structure
2. Superstore Sales (Kaggle)
- Size: 9,994 orders
- Features: Category, sales, profit, region, shipping
- Projects: Sales dashboard, profitability analysis, regional comparison
- Why it's great: Perfect for Power BI beginners, clear business metrics
3. Olist Brazilian E-commerce (Kaggle)
- Size: 100K orders, multiple tables
- Features: Orders, customers, products, reviews, payments
- Projects: SQL joins, delivery time analysis, review sentiment
- Why it's great: Real-world complexity, multiple tables to join
4. Instacart Market Basket (Kaggle)
- Size: 3+ million grocery orders
- Features: Products, aisles, order sequences
- Projects: Association rules, recommendation systems
- Why it's great: Large scale, interesting insights (milk + bread patterns)
5. Black Friday Sales (Kaggle)
- Size: 550K purchases
- Features: Age, gender, occupation, product category, purchase amount
- Projects: Customer profiling, demographic analysis
- Why it's great: Clean, good for segmentation
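RFM analysis, suggested above for the Online Retail dataset, scores each customer on recency, frequency, and monetary value. The following is a minimal sketch on a tiny made-up sample. The column names (`customer_id`, `invoice_date`, `quantity`, `unit_price`) are assumptions modelled on the feature list, not the actual Kaggle schema, and frequency here simply counts rows rather than distinct invoices.

```python
import pandas as pd

# Made-up transactions for illustration; column names are hypothetical.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "invoice_date": pd.to_datetime([
        "2024-01-05", "2024-03-01", "2024-01-20",
        "2024-02-10", "2024-03-10", "2023-12-15",
    ]),
    "quantity": [2, 1, 5, 3, 4, 1],
    "unit_price": [10.0, 20.0, 4.0, 5.0, 2.5, 50.0],
})
transactions["revenue"] = transactions["quantity"] * transactions["unit_price"]

# Reference date: the day after the last transaction in the data.
snapshot = transactions["invoice_date"].max() + pd.Timedelta(days=1)

# One row per customer: days since last purchase, number of purchases, total spend.
rfm = transactions.groupby("customer_id").agg(
    recency_days=("invoice_date", lambda d: (snapshot - d.max()).days),
    frequency=("invoice_date", "count"),
    monetary=("revenue", "sum"),
)
print(rfm)
```

From here, a typical next step is to bin each column into scores (for example quartiles) and label segments such as "recent high spenders"; the binning scheme is a design choice, not something the dataset prescribes.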
India-Specific Datasets
Why India datasets matter for your portfolio:
- ✅ Relatable to Indian recruiters (₹ vs $, Mumbai vs New York)
- ✅ Shows local market understanding
- ✅ Stands out from international datasets everyone uses
- ✅ Demonstrates initiative (sought out regional data)
Top India Datasets:
1. Zomato Bangalore Restaurants (Kaggle)
- 50K+ restaurants, ratings, cuisines, cost, location
- Projects: Price analysis by area, cuisine trends, rating patterns
- Insight example: "North Indian restaurants in Koramangala charge 40% premium"
2. IPL Complete Dataset (Kaggle)
- All matches 2008-2025, ball-by-ball data
- Projects: Player performance, team analysis, win prediction
- Insight example: "Batsmen average 15% higher in Chennai vs Mumbai"
3. India Air Quality Data (Kaggle)
- PM2.5, PM10 across cities, hourly data
- Projects: Pollution trends, city comparison, seasonal patterns
- Insight example: "Delhi AQI spikes 300% during Diwali week"
4. COVID-19 India (Kaggle)
- State-wise cases, testing, vaccination
- Projects: Time-series forecasting, vaccination pace analysis
- Insight example: "Kerala detected cases 2× faster due to higher testing"
5. Swiggy/Zomato Delivery Data (Search Kaggle)
- Delivery times, restaurant partners, user ratings
- Projects: Delivery optimization, peak hour analysis
- Insight example: "Avg delivery time: 28 min weekday vs 35 min weekend"
6. Naukri/LinkedIn Job Postings (Kaggle)
- Job titles, skills required, salaries, companies
- Projects: Skill demand analysis, salary benchmarking
- Insight example: "SQL appears in 87% of data analyst JDs"
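An insight like the skill-demand figure above comes from counting how many postings mention each skill. A rough sketch using only the standard library follows. The postings and the skill list are made up for illustration, and the plain substring match is an assumption that would need refining on real data (for example, "sql" would also match "mysql").

```python
from collections import Counter

# Made-up job descriptions for illustration only.
postings = [
    "Data analyst with SQL, Excel and Power BI experience",
    "Analyst role: Python, SQL, Tableau",
    "Reporting analyst, advanced Excel required",
    "Data analyst: SQL and Python",
]
skills = ["sql", "python", "excel", "power bi", "tableau"]

counts = Counter()
for text in postings:
    lowered = text.lower()
    for skill in skills:
        # Plain substring match; each skill counts at most once per posting.
        if skill in lowered:
            counts[skill] += 1

# Percentage of postings that mention each skill.
share = {skill: round(100 * counts[skill] / len(postings)) for skill in skills}
print(share)
```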
Domain-Specific Datasets
Finance & Economics:
- Stock Prices: Yahoo Finance API (free), Alpha Vantage
- Cryptocurrency: Coinbase API, CryptoCompare
- Credit Card Fraud: Kaggle (imbalanced classification)
- Loan Default: Kaggle (credit risk modeling)
- Bitcoin Historical: Blockchain.com, CoinMetrics
Healthcare:
- Diabetes Prediction: Kaggle, UCI
- Heart Disease: UCI (Cleveland dataset)
- Cancer Data: UCI (breast cancer Wisconsin)
- COVID-19 Global: Johns Hopkins, Our World in Data
- Hospital Readmission: CMS.gov
Sports & Entertainment:
- IPL Cricket: Kaggle (ball-by-ball 2008-2025)
- FIFA Players: Kaggle (ratings, stats, wages)
- NBA Stats: Basketball-Reference, Kaggle
- Olympics: Kaggle (120 years of data)
- IMDb Movies: Kaggle, OMDb API
- Spotify Music: Kaggle (audio features, popularity)
Social Media:
- Twitter Sentiment: Kaggle (various topics)
- YouTube Trending: Kaggle (daily trending videos)
- Reddit: Pushshift API, Kaggle archives
- Instagram Engagement: Various Kaggle sources
Real Estate:
- House Prices: Kaggle (Ames, Boston housing)
- Airbnb: Inside Airbnb (listings by city)
- Property Valuation: MagicBricks, 99acres (scraped data on Kaggle)
Transportation:
- Uber/Lyft: Kaggle (trip data, surge pricing)
- Flight Delays: US DOT, Kaggle
- NYC Taxi: NYC Open Data (millions of trips)
- Bike Sharing: Capital Bikeshare, Kaggle
How to Choose Datasets for Portfolio
Good Portfolio Dataset Criteria:
1. Size
- ❌ Too small (<1000 rows): Looks like a toy project
- ✅ Sweet spot (10K-500K): Real-world scale
- ⚠️ Too large (>10M): May need cloud tools
2. Business Relevance
- ✅ E-commerce, sales, customers, products
- ✅ Clear business metrics (revenue, churn, conversion)
- ❌ Academic abstractions (iris flowers, wine quality)
3. Complexity (some is good!)
- ✅ Missing values → show data cleaning
- ✅ Outliers → show handling techniques
- ✅ Multiple tables → show SQL JOINs
- ✅ Time component → show trend analysis
- ❌ 90% nulls → too messy to be useful
4. Analysis Potential
- ✅ Multiple questions can be answered
- ✅ Segmentation opportunities
- ✅ Time-series/trends
- ✅ Visual storytelling possible
5. Uniqueness
- ❌ Titanic, Iris (everyone does these)
- ✅ India-specific datasets
- ✅ Recent data (2023-2026)
- ✅ Niche but interesting (IPL, Zomato)
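The cleaning work named under the Complexity criterion (missing values, outliers) can be sketched in a few lines of pandas. This is one possible approach under stated assumptions: the rows are made up, the column names (`name`, `area`, `cost`) are hypothetical, and group-median imputation plus percentile capping are common choices rather than the only correct ones.

```python
import pandas as pd

# Made-up restaurant rows; column names are hypothetical.
df = pd.DataFrame({
    "name": ["A", "A", "B", "C", "D", "E"],
    "area": ["Koramangala", "Koramangala", "Koramangala",
             "Indiranagar", "Indiranagar", "MG Road"],
    "cost": [500.0, 500.0, 700.0, None, 400.0, 9000.0],
})
rows_before = len(df)

# 1. Drop exact duplicate rows and record how many were removed.
df = df.drop_duplicates().reset_index(drop=True)
duplicates_removed = rows_before - len(df)

# 2. Fill missing cost with the median cost of the same area.
area_median = df.groupby("area")["cost"].transform("median")
df["cost"] = df["cost"].fillna(area_median)

# 3. Cap extreme values at the 99th percentile.
cap = df["cost"].quantile(0.99)
df["cost"] = df["cost"].clip(upper=cap)

print(f"Removed {duplicates_removed} duplicate(s); cost capped at {cap:.0f}")
print(df)
```

Recording the counts at each step (rows removed, values imputed, cap used) gives you the numbers to report when you document the project.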
Recommended Portfolio Progression:
Beginner (First 3 projects):
1. Superstore Sales: Power BI dashboard, simple SQL
2. Zomato Bangalore: Python EDA, interesting insights
3. IPL Dataset: Engaging topic, good storytelling
Intermediate (Next 3 projects):
4. Online Retail: RFM analysis, customer segmentation
5. Olist E-commerce: Multiple tables, SQL JOINs, funnel analysis
6. Airbnb Pricing: Regression model, price prediction
Advanced (Stand out):
7. Web scraping: Collect your own Naukri job data
8. API integration: Real-time stock/crypto dashboard
9. Multi-source: Combine Swiggy + weather + traffic data
Dataset Project Best Practices
1. Document Your Work
Instead of: Just uploading the final analysis
Do this: Show your process
# Zomato Bangalore Restaurant Analysis
## Dataset
- Source: Kaggle
- Size: 51,717 restaurants
- Date: Updated March 2024
## Data Cleaning
1. Removed 2,347 duplicates (4.5%)
2. Missing values:
- Cost: 1,203 nulls → imputed with area median
- Rating: 5,421 nulls → excluded (likely new restaurants)
3. Outliers: Capped cost at 99th percentile (₹3,500)
## Business Questions
1. Which areas have highest restaurant density?
2. How does cuisine affect pricing?
3. What rating do you need to charge premium prices?
2. Create Compelling Visualizations
Dashboard essentials:
- 1 headline KPI (avg rating, total revenue)
- Trend over time (sales by month)
- Comparison (top 10 products, regions)
- Distribution (customer age, order value)
- Filters (date range, category, city)
Tools:
- Power BI: Best for business dashboards
- Tableau Public: Great for interactive viz
- Python (matplotlib/seaborn): For custom analysis
- Excel: Quick exploration
3. Share on Multiple Platforms
GitHub:
- Code (Python/SQL scripts)
- README with insights
- requirements.txt
- Screenshots
Kaggle:
- Jupyter notebook
- Markdown explanations
- Public dataset if you scraped data
Tableau Public:
- Interactive dashboard
- Clear title and filters
- Mobile-friendly
LinkedIn:
- Post key insights with visuals
- Link to full project
- Use hashtags: #dataanalysis #python #powerbi
4. Write Business Insights, Not Just Stats
Bad: "Average rating is 3.7"
Good: "Restaurants rated >4.0 charge 25% premium (₹600 vs ₹480 avg). Consider: does quality justify higher prices, or is rating manipulation occurring?"
Bad: "Linear regression R² = 0.73"
Good: "Model explains 73% of price variation. Key drivers: area (Koramangala +40%), cuisine (North Indian +30%), rating (each 0.5⭐ = +15% price)."
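The premium in the first "Good" example is just a percentage difference between two group averages. A minimal sketch of that arithmetic follows; the (rating, cost) pairs are made up and chosen so the averages match the figures quoted above.

```python
from statistics import mean

# Made-up (rating, cost in ₹) pairs, chosen to reproduce the example figures.
restaurants = [(4.3, 650), (4.1, 550), (3.8, 500), (3.5, 460)]

# Split costs by whether the rating is above 4.0.
high = [cost for rating, cost in restaurants if rating > 4.0]
rest = [cost for rating, cost in restaurants if rating <= 4.0]

avg_high, avg_rest = mean(high), mean(rest)

# Premium = how much more the high-rated group charges, relative to the rest.
premium_pct = 100 * (avg_high - avg_rest) / avg_rest

print(f"Restaurants rated >4.0 charge a {premium_pct:.0f}% premium "
      f"(₹{avg_high:.0f} vs ₹{avg_rest:.0f} avg).")
```

Stating both averages next to the percentage, as the sentence does, lets a reader check the claim without opening the notebook.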
5. Iterate Based on Feedback
Get feedback:
- Reddit: r/datascience, r/dataisbeautiful
- LinkedIn posts
- Discord communities (DataTalks.Club)
- Ask senior analysts
Common improvements:
- Add axis labels/units
- Include data source
- Explain statistical methods
- Add business recommendations