Apache Airflow: Synchronizing Datasets across Multiple Instances
Sebastien Crocquevieille
Who:
- Data Engineer currently working in Taiwan
- French & Mexican
- Speak EN, FR, ES & some ZH
What:
- Hadoop, PySpark, Kubernetes
- Open Source enthusiast
- Tech for Good
- Basketball & Video Games
Where:
- EuroPython 2019-2024
- PyCon TW 2023
Abstract
Data engineering jobs regularly involve scheduling or scaling workflows.
But have you ever asked yourself: can I **scale my scheduling**?
It turns out that you can!
But doing so raises a number of issues that need to be addressed.
In this talk we will be:
- Recapping data-aware scheduling in Apache Airflow
- Discussing different ways to scale up our scheduler
- Choosing a specific solution to synchronize Airflow Datasets between instances
- Discussing some roadblocks and limitations that we ran into along the way
All of this in the context of a professional production environment.
Thank you for your time. I hope you enjoy the talk.