Distributing your pandas ETL job using Ray and Modin

李泓旻 (Andrew)

I currently work as a data engineer in the financial industry. Previously, I was a one-person data science shop in manufacturing, covering data engineering, ETL, modeling, and deployment. I am dedicated to finding the most suitable tool for each need, and I keep contributing to open source projects. LIFE IS SHORT. USE PYTHON.


Abstract

Are you using pandas to process data? Do you want to handle a large dataset with pandas? Do you want to develop Python code on your laptop and run it on the cloud or Kubernetes effortlessly? In this talk, I assume you are familiar with pandas, and I will share how to distribute your pandas ETL job by changing a few lines of code (or even just one).

Description

If you work in data science, pandas is a fantastic tool for Python users. According to the official documentation, pandas is "a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language." However, every tool has limitations. pandas manipulates small datasets efficiently because it holds the data in memory, which also means it struggles with datasets that do not fit in memory.

In this talk, I will share two common cases that show how to distribute your pandas ETL job by changing a few lines of code (or even just one):

  • Handling many small datasets that share the same ETL logic
  • Handling an out-of-core dataset without rewriting the ETL script
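
As a rough illustration of the first case, here is a minimal sketch and not the code from the talk. The `etl` function, its column names (`store`, `price`, `qty`), and the `run_many` helper are all hypothetical. It assumes Ray is installed and falls back to a plain loop otherwise, so the sketch still runs on a machine without Ray.

```python
import pandas as pd

try:
    import ray  # assumption: installed, e.g. with `pip install ray`
except ImportError:
    ray = None  # fall back to a serial loop so the sketch still runs


def etl(df):
    """Hypothetical ETL logic shared by every small dataset."""
    out = df.copy()
    out["amount"] = out["price"] * out["qty"]
    # One row per store with the summed amount.
    return out.groupby("store", as_index=False)["amount"].sum()


def run_many(frames):
    """Apply the same ETL to many small frames, one Ray task per frame."""
    if ray is None:
        return [etl(frame) for frame in frames]
    ray.init(ignore_reinit_error=True)  # starts a local Ray instance by default
    remote_etl = ray.remote(etl)        # wrap the unchanged function as a task
    futures = [remote_etl.remote(frame) for frame in frames]
    return ray.get(futures)             # results come back in submission order


# Hypothetical sample data standing in for many small input files.
frames = [
    pd.DataFrame({"store": ["a", "a", "b"], "price": [1.0, 2.0, 3.0], "qty": [2, 1, 1]}),
    pd.DataFrame({"store": ["c"], "price": [5.0], "qty": [2]}),
]
results = run_many(frames)
```

For the second case, the one-line change is presumably the drop-in import that Modin documents: replacing `import pandas as pd` with `import modin.pandas as pd`, leaving a function like `etl` untouched. How well a given script carries over depends on which pandas operations it uses.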
