This project demonstrates Data Warehouse design and OLAP analysis using PySpark in Databricks.
It follows a Star Schema approach with one fact table (sales_fact) and four dimension tables:
- date_dim
- product_dim
- customer_dim
- store_dim
The workflow includes:
- Synthetic Data Generation with Faker
- Data Modeling in a Star Schema format
- Delta Table storage to simulate Data Warehouse persistence
- OLAP-style queries for business insights
- ER Diagram visualization of the schema
Built with:
- Databricks (PySpark runtime)
- Delta Lake for table storage
- Faker for synthetic data generation
- Matplotlib & NetworkX for ER diagram visualization
With the Star Schema modeled, OLAP-style analysis reduces to joins and aggregations, for example:
```python
from pyspark.sql import functions as F

# Total sales by product category, highest first.
spark.table("sales_fact") \
    .join(spark.table("product_dim"), "ProductID") \
    .groupBy("Category") \
    .agg(F.sum("Sales_Amount").alias("Total_Sales")) \
    .orderBy(F.desc("Total_Sales")) \
    .show()
```