Abstract
When writing PySpark code to run distributed computations, it can be difficult to understand and profile what is happening: a PySpark job executes across both Python and JVM processes, and may run native code as well. This model is very different from non-distributed code using something like pandas, which runs in a single process. This talk will arm you with the knowledge needed to understand the PySpark driver/worker model, demonstrate how the open source Memray memory profiler can be used to profile Python and native (C/C++/Rust) code across drivers and workers, and take a deep dive into some challenging data processing scenarios where memory usage comes from unexpected places.
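By way of illustration (this sketch is not taken from the talk itself), here is one way Memray's Tracker API can be pointed at both the driver and the workers: the driver wraps its work in a Tracker, while each worker task opens its own Tracker inside a mapPartitionsWithIndex function. The file paths and the profile_partition helper are illustrative assumptions, and memray must be installed on every node.

```python
import os

import memray
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("memray-demo").getOrCreate()
sc = spark.sparkContext


def profile_partition(index, rows):
    # Each task writes its own capture file to the worker's local disk.
    # native_traces=True also records C/C++/Rust frames, at some extra
    # overhead. Paths here are illustrative.
    path = f"/tmp/memray-worker-{os.getpid()}-{index}.bin"
    with memray.Tracker(path, native_traces=True):
        yield from (row * 2 for row in rows)


# Profile the driver side with a separate Tracker in the driver process.
with memray.Tracker("/tmp/memray-driver.bin", native_traces=True):
    total = (
        sc.parallelize(range(1_000_000), 8)
        .mapPartitionsWithIndex(profile_partition)
        .sum()
    )

print(total)
```

Each resulting .bin capture can then be rendered with `memray flamegraph <file>` to see where allocations originated. Note that in a real cluster the worker captures land on each executor's local disk, so you would need to collect them from the nodes (or write them to shared storage) before analysis.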
Description
Location
R2
Date
Day 2 • 03:30-04:15 (UTC)
Language
English talk
Level
Experienced
Category
Data Analysis