Abstract
Struggling with slow OLAP queries on messy JSON data? In this talk, I’ll share how we transformed our data pipeline using PySpark, moving from traditional JSON access to the Variant format and ultimately to Spark’s new Variant Shredding feature. By applying these changes in our production environment, where we process large-scale semi-structured event data, we achieved up to 6x faster query performance and improved storage efficiency. I’ll walk through how Parquet’s columnar storage works hand in hand with Variant Shredding to deliver these benefits, and discuss the trade-offs, such as increased write latency.