Effectively memory profiling distributed PySpark code

Kaashif Hymabaccus

Kaashif Hymabaccus is a senior software engineer at Bloomberg. His team builds distributed systems to compute and store portfolio analytics, and he and his teammates are heavy users of Python, pandas, and PySpark.

    Abstract

    When you use PySpark to run distributed computations, your code can be difficult to understand and profile, because a PySpark job runs across both Python and JVM processes and may also execute native code. This model is very different to non-distributed code using something like pandas, which runs in a single process. This talk will arm you with the knowledge needed to understand the PySpark driver/worker model, demonstrate how the open source Memray memory profiler can be used to profile Python and native (C/C++/Rust) code across drivers and workers, and take a deep dive into some challenging data processing scenarios where memory usage comes from unexpected places.
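As a rough illustration of the worker-side profiling the abstract mentions, here is a minimal sketch, not taken from the talk itself. It assumes Memray is installed on every worker and that each worker can write to a local directory. The helper names `capture_path` and `with_memray`, and the `tracker_factory` parameter, are hypothetical and exist only for this example.

```python
import functools
import os
import uuid


def capture_path(directory, role):
    """Build a capture file name that is unique per process and per call."""
    name = f"memray-{role}-{os.getpid()}-{uuid.uuid4().hex}.bin"
    return os.path.join(directory, name)


def with_memray(func, directory, tracker_factory=None):
    """Wrap a per-partition function so that it runs under a memory tracker.

    `func` takes an iterator of rows and returns an iterable of results,
    which is the shape PySpark's `RDD.mapPartitions` expects.
    """

    def wrapper(iterator):
        factory = tracker_factory
        if factory is None:
            # Imported lazily so the import happens inside the worker process.
            import memray

            # native_traces=True also records C/C++/Rust stack frames.
            factory = functools.partial(memray.Tracker, native_traces=True)

        with factory(capture_path(directory, "worker")):
            # Yield inside the tracker so that allocations made while the
            # partition is being consumed are recorded.
            yield from func(iterator)

    return wrapper


# Hypothetical usage on a cluster (assumption: `rdd` and `parse_rows` exist):
#   rdd.mapPartitions(with_memray(parse_rows, "/tmp/memray")).count()
```

Each Python worker process then leaves its own capture file, which can be inspected afterwards with Memray's reporters, for example `memray flamegraph`. The unique file name matters because the tracker refuses to overwrite an existing capture.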

    Description

    Location

    R2

    Date

    Day 2 • 03:30-04:15 (UTC)

    Language

    English talk

    Level

    Experienced

    Category

    Data Analysis