Koalas - An Interface for pandas Users to Leverage the Spark Framework

Abstract

Implementing machine learning solutions at scale can be challenging, especially when data processing and modeling need to be deployed on distributed systems.

With its in-memory processing capabilities, Apache Spark has been all the rage for large-scale data processing and analytics. Adopting Apache Spark in production has become common, and its high-level APIs flatten the learning curve. However, it is still not painless to move experimental Python scripts into Apache Spark jobs in production.

The open-source project “Koalas” aims to relieve this pain by implementing the pandas DataFrame API on top of Apache Spark. We will start by briefly introducing Koalas. The main focus is then on how to use Koalas to run machine learning projects on Spark, including a comparison of Apache Spark, pandas, and Koalas. Finally, through a few examples, we will demonstrate how to develop a single codebase that works with both pandas and Spark.
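As a rough illustration of the single-codebase idea, the sketch below defines a function that relies only on DataFrame operations available in both pandas and Koalas (`groupby`, `sum`, `sort_values`, `head`). The function name `top_categories`, the column names, and the sample data are hypothetical and are not taken from the talk materials. The code runs with plain pandas; the Koalas variant is shown only in comments and assumes the `koalas` package and PySpark are installed.

```python
import pandas as pd


def top_categories(df, n=2):
    """Return the n categories with the largest total amount.

    Only uses operations that exist in both the pandas and the Koalas
    DataFrame APIs, so `df` may be either kind of DataFrame.
    """
    totals = df.groupby("category")["amount"].sum()
    return totals.sort_values(ascending=False).head(n)


# Hypothetical sample data for local experimentation.
data = {
    "category": ["a", "b", "a", "c", "b", "a"],
    "amount": [10, 5, 7, 1, 20, 3],
}

# Local run with plain pandas.
pdf = pd.DataFrame(data)
result = top_categories(pdf)
print(result)

# On Spark, the same function could be given a Koalas DataFrame instead
# (assumption: the `koalas` package and PySpark are available):
#
#   import databricks.koalas as ks
#   kdf = ks.from_pandas(pdf)
#   top_categories(kdf).to_pandas()
```

Keeping the business logic in functions that accept a DataFrame, rather than constructing one internally, is one way to let the same code path serve both local experiments and Spark jobs.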

Description

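This talk uses the open-source project "Koalas" to show how pandas computations can be conveniently moved to run on Spark.

Python project to be introduced: https://koalas.readthedocs.io/en/latest/

“Koalas”: the pandas API on Spark, which lets developers familiar with pandas quickly run computations on Spark. It is open-sourced and led by Databricks, with releases on a biweekly cadence.

Materials and sample code (TBD)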

Slides

https://docs.google.com/presentation/d/1Q4cX-p0U6jvVAm2dxyUslKsLA6R00C2XeIddJGwFH4U/edit?usp=sharing

Speaker

許理賀

I'm a data engineer who focuses on data ingestion in the Hadoop ecosystem. I enjoy reading technical books and trying new technologies. My primary research areas during my graduate studies were optimization theory and scheduling.