PySpark: next generation cluster computing engine
Apache Spark™ is a lightning-fast engine for large-scale data processing. It is an in-memory cluster computing framework, originally developed at UC Berkeley. According to the benchmarks on its project page, machine learning programs can run up to 100x faster than on Hadoop MapReduce. Spark can run on Hadoop 2's YARN cluster manager and can read any existing Hadoop data. It currently supports Scala, Java, and Python for writing Spark programs.
In this talk, I will introduce the general concepts of Spark's infrastructure, explain what an RDD (Resilient Distributed Dataset) is, introduce PySpark, demo PySpark's speed and power, and present a head-to-head comparison between two programs doing the same work: one written in Hadoop MapReduce and the other written using PySpark.
I will conclude with the use cases of companies currently using Spark.
About Speaker
Sr. Software Engineer on the Yahoo! (Taiwan) Data Team. He has been responsible for data infrastructure, data solutions, software releases, and continuous integration management. He is a lifelong student of software development, testing, deployment, and CI processes and best practices, an avid coding puzzle competition fanatic, and an open source evangelist.
Organization/Company
Yahoo!
Job title
Sr. Software Engineer