Challenge
Multiple teams needed reusable ingestion pipelines with automatic file-format detection, Slowly Changing Dimension Type 2 (SCD2) handling, and consistent orchestration across varying source formats.
Solution
Built reusable PySpark and Spark SQL ingestion modules, metadata-driven Airflow DAG patterns, and Delta Lake optimization strategies (partitioning and Z-Order clustering).
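The core of the metadata-driven pattern can be sketched as follows: each source is described by a small config record, and pipeline tasks are generated from that record instead of being hand-written per source. All names here (`SourceConfig`, `build_task_specs`, the task-name scheme) are illustrative assumptions, not the actual module API; in the real pipeline each generated spec would map to an Airflow operator.

```python
from dataclasses import dataclass, field

@dataclass
class SourceConfig:
    """Hypothetical metadata record describing one ingestion source."""
    name: str
    fmt: str                      # e.g. "csv", "parquet", "json"
    load_type: str                # "full" or "scd2"
    partition_cols: list = field(default_factory=list)

def build_task_specs(sources):
    """Expand per-source metadata into an ordered list of task names.
    In an Airflow deployment, each name would become an operator in a DAG."""
    specs = []
    for src in sources:
        specs.append(f"detect_format_{src.name}")      # file detection step
        specs.append(f"ingest_{src.fmt}_{src.name}")   # format-specific read
        if src.load_type == "scd2":
            specs.append(f"merge_scd2_{src.name}")     # history-preserving merge
        specs.append(f"optimize_{src.name}")           # partitioning / Z-Order step
    return specs

sources = [
    SourceConfig("orders", "csv", "scd2", ["order_date"]),
    SourceConfig("clicks", "json", "full", ["event_date"]),
]
print(build_task_specs(sources))
```

Onboarding a new source then means adding one `SourceConfig` entry rather than writing a new DAG, which is what drives the reduced onboarding effort noted below.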
Outcomes
- Reduced per-team onboarding effort: new sources are configured through metadata rather than bespoke pipeline code
- Improved query and ingestion performance on high-volume datasets through partitioning and Z-Order clustering
- Raised engineering quality by applying SOLID design principles and pytest-based testing
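The SCD2 handling mentioned above can be illustrated with a pure-Python sketch of the merge logic; in the actual pipeline this would run as a Delta Lake MERGE in PySpark. Rows are plain dicts, `key` identifies the business entity, and `attrs` holds the tracked attributes; these names and the function itself are illustrative assumptions.

```python
def scd2_merge(current_rows, incoming, as_of):
    """SCD2 logic sketch: expire changed current rows, append new versions."""
    by_key = {r["key"]: r for r in current_rows if r["is_current"]}
    result = list(current_rows)
    for rec in incoming:
        existing = by_key.get(rec["key"])
        if existing and existing["attrs"] == rec["attrs"]:
            continue  # unchanged: keep the existing current row
        if existing:
            existing["is_current"] = False  # close out the old version
            existing["end_date"] = as_of
        result.append({
            "key": rec["key"], "attrs": rec["attrs"],
            "start_date": as_of, "end_date": None, "is_current": True,
        })
    return result

history = [{"key": 1, "attrs": {"tier": "silver"},
            "start_date": "2024-01-01", "end_date": None, "is_current": True}]
updated = scd2_merge(history, [{"key": 1, "attrs": {"tier": "gold"}}], "2024-06-01")
```

After the merge, the old row is retained with an `end_date` and the new version becomes current, so the full attribute history survives each load.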