Effectively memory profiling distributed PySpark code

Kaashif Hymabaccus


Kaashif Hymabaccus is a senior software engineer at Bloomberg. His team builds distributed systems to compute and store portfolio analytics, and he and his teammates are heavy users of Python, pandas, and PySpark.


Abstract

PySpark code can be difficult to understand and profile because a distributed computation spans both Python and JVM processes, and may also run native code. This model is very different from non-distributed code using something like pandas, which runs in a single process. This talk will arm you with the knowledge needed to understand the PySpark driver/worker model, demonstrate how the open source Memray memory profiler can profile Python and native (C/C++/Rust) code across drivers and workers, and take a deep dive into some challenging data processing scenarios where memory usage comes from unexpected places.

Description

  • PySpark

    • This talk uses PySpark, which is well known and needs only a brief introduction.
    • PySpark 3.4 introduced a built-in memory profiler that can profile Python UDF code running on executors.
    • We will compare and contrast this built-in memory profiler with Memray.
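As a point of reference for the comparison, the built-in profiler is enabled through a single Spark configuration key. The sketch below is hedged: the config key `spark.python.profile.memory` and `show_profiles()` come from the PySpark documentation, the UDF itself is a made-up example, and the Spark-specific part is guarded so the snippet degrades gracefully where PySpark (or a working JVM) is not available.

```python
def double(values):
    """UDF body kept as plain Python so the logic can be exercised without Spark."""
    return [v * 2 for v in values]

try:
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf

    spark = (
        SparkSession.builder
        # Enable the built-in UDF memory profiler (PySpark >= 3.4)
        .config("spark.python.profile.memory", "true")
        .getOrCreate()
    )
    double_udf = udf(lambda v: v * 2, "long")
    spark.range(10).select(double_udf("id")).collect()
    # Prints per-line memory usage of the profiled UDF on the executors
    spark.sparkContext.show_profiles()
except Exception:
    pass  # PySpark not installed or no JVM; the profiling demo is skipped
```

Note that this profiler only sees Python-level allocations inside UDFs, which is one axis of the comparison with Memray made in the talk.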
  • Memray

    • The focus of this talk is using Memray to profile memory usage in challenging distributed situations.
    • Memray is a relatively new (open sourced in 2022) and not yet widely adopted Python memory profiling tool.
    • One of Memray's most innovative features is that it can seamlessly show memory allocations inside native extensions, integrating profiles from C, C++, and Rust libraries. This is critical for understanding memory usage from C extensions, which are common in our high performance, data intensive use cases.
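To make the Memray workflow concrete, here is a hedged sketch. The CLI invocations (`memray run`, `memray flamegraph`) are from Memray's documented interface; the script name `my_spark_job.py` is hypothetical. For worker-side captures, the `memray.Tracker` API can wrap the computation inside a task, with a fallback so the function still works where Memray is not installed.

```python
# Driver-side profiling via the Memray CLI (assumes `memray` is installed):
#
#   memray run -o driver.bin my_spark_job.py   # my_spark_job.py is hypothetical
#   memray flamegraph driver.bin               # render the capture as a flame graph
#
# Worker-side profiling: wrap the task body in a Tracker so allocations
# (including those made by native extensions) are written to a capture file.
import os
import tempfile

def traced_udf(rows):
    """Example worker computation, wrapped in a Memray Tracker when available."""
    try:
        import memray  # may not be installed on every executor
        capture = os.path.join(tempfile.mkdtemp(), "worker.bin")
        with memray.Tracker(capture):
            return [r * 2 for r in rows]
    except Exception:
        # Fall back to untracked execution if Memray is unavailable
        return [r * 2 for r in rows]
```

One capture file per worker process then has to be collected from the executors and inspected individually, which is part of what makes distributed profiling harder than the single-process pandas case.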
  • pandas

Video