PySpark: next generation cluster computing engine - Wisely Chen

PySpark: next generation cluster computing engine

Wisely Chen / English

Apache Spark™ is a lightning-fast engine for large-scale data processing. It is an in-memory cluster computing framework originally developed at UC Berkeley. According to the benchmarks on its project page, machine learning programs can run up to 100x faster on Spark than on Hadoop MapReduce. Spark can run on Hadoop 2's YARN cluster manager and can read any existing Hadoop data. It currently supports Scala, Java, and Python for writing Spark programs. In this talk, I will cover the general concepts behind Spark's infrastructure, what an RDD (Resilient Distributed Dataset) is, an introduction to PySpark, a demo of PySpark's speed and power, and a head-to-head comparison of two programs doing the same work, one written in Hadoop MapReduce and the other in PySpark. I will conclude with use cases from companies currently using Spark.
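To give a rough feel for the RDD idea named in the abstract, here is a minimal single-process sketch in plain Python. It is not the PySpark API and not material from the talk: `ToyRDD`, `flat_map`, and `compute_partition` are hypothetical names invented for illustration. The sketch assumes only two properties commonly attributed to RDDs: transformations are recorded lazily as a lineage, and any partition can be rebuilt from the source data by replaying that lineage.

```python
from collections import defaultdict


class ToyRDD:
    """Hypothetical single-process stand-in for an RDD (not the PySpark API)."""

    def __init__(self, partitions, lineage=()):
        self.partitions = partitions  # source data, already split into chunks
        self.lineage = lineage        # transformations recorded but not yet run

    def _extend(self, step):
        # Transformations are lazy: remember the step, compute nothing.
        return ToyRDD(self.partitions, self.lineage + (step,))

    def flat_map(self, fn):
        return self._extend(lambda items: [out for x in items for out in fn(x)])

    def map(self, fn):
        return self._extend(lambda items: [fn(x) for x in items])

    def compute_partition(self, index):
        # A partition is rebuilt from the source by replaying the lineage,
        # which is the idea behind the "resilient" part of the name.
        items = self.partitions[index]
        for step in self.lineage:
            items = step(items)
        return items

    def collect(self):
        # An action: this is the point where work actually happens.
        return [
            x
            for i in range(len(self.partitions))
            for x in self.compute_partition(i)
        ]


# Two "partitions" of text lines, as if spread over two workers.
lines = ToyRDD([["spark is fast", "spark is lazy"], ["hadoop is batch"]])

# Build a word-count pipeline; nothing runs until collect() is called.
pairs = lines.flat_map(str.split).map(lambda word: (word, 1))

counts = defaultdict(int)
for word, n in pairs.collect():
    counts[word] += n

print(dict(counts))
```

In real PySpark the same pipeline is usually written with `SparkContext.textFile`, `flatMap`, `map`, and `reduceByKey`, with the partitions living on different machines rather than in one Python list.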

About Speaker

Sr. Software Engineer on the Yahoo! (Taiwan) Data Team. He has been responsible for data infrastructure, data solutions, software releases, and continuous integration management. He is a lifelong student of software development, testing, deployment, and CI processes and best practices, an avid coding puzzle competition fanatic, and an Open Source evangelist.



Job title

Sr. Software Engineer


Built with Django and Mezzanine by PyCon Taiwan
