I'm a data engineer who focuses on data ingestion in the Hadoop ecosystem. I enjoy reading technical books and trying new technologies. My primary area of research during my graduate studies was optimization theory and scheduling.
Implementing machine learning solutions at scale can be challenging, especially when data processing and modeling need to be deployed on distributed systems.
With its in-memory processing capabilities, Apache Spark has been all the rage for large-scale data processing and analytics, and adopting it in production has become common. Its high-level APIs also flatten the learning curve. However, it is still not painless to turn experimental Python scripts into production Apache Spark jobs.
An open source project, “Koalas”, aims to relieve this pain by implementing the pandas DataFrame API on top of Apache Spark. We will start by briefly introducing Koalas. The main focus will then be on how to use Koalas to run machine learning projects on Spark, including a comparison of Apache Spark, pandas, and Koalas. Finally, through a few examples, we will demonstrate how to develop a single codebase that works with both pandas and Spark.