PySpark: next generation cluster computing engine
Apache Spark™ is a lightning-fast engine for large-scale data processing. It is an in-memory cluster computing framework, originally developed at UC Berkeley. According to the benchmarks on its project page, machine learning programs can run up to 100x faster than on Hadoop MapReduce. Spark can run on Hadoop 2's YARN cluster manager and can read any existing Hadoop data. It currently supports Scala, Java, and Python for writing Spark programs.
In this talk, I will cover the general concepts behind Spark's infrastructure; what an RDD (Resilient Distributed Dataset) is; an introduction to PySpark; a demo of PySpark's speed and power; and a head-to-head comparison of two programs doing the same work, one written in Hadoop MapReduce and the other in PySpark.
I will conclude with use cases from companies currently using Spark.
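To give a feel for the RDD idea ahead of the talk, here is a minimal pure-Python sketch. It is not Spark's API: `ToyRDD`, `flat_map`, `compute_partition`, and `word_count` are hypothetical names made up for illustration. It only assumes the general model of an RDD as partitioned data whose transformations are recorded lazily and replayed when an action asks for results.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not the PySpark class)."""

    def __init__(self, partitions, lineage=()):
        # Source partitions are copied and never mutated afterwards
        self._partitions = [list(p) for p in partitions]
        # Recorded per-partition functions, applied only on demand
        self._lineage = tuple(lineage)

    def _derive(self, fn):
        # A transformation returns a new dataset and runs nothing yet
        return ToyRDD(self._partitions, self._lineage + (fn,))

    def map(self, f):
        return self._derive(lambda part: [f(x) for x in part])

    def filter(self, pred):
        return self._derive(lambda part: [x for x in part if pred(x)])

    def flat_map(self, f):
        return self._derive(lambda part: [y for x in part for y in f(x)])

    def compute_partition(self, index):
        # Replaying the lineage rebuilds a single partition from its source,
        # which is the intuition behind "resilient"
        part = self._partitions[index]
        for fn in self._lineage:
            part = fn(part)
        return part

    def collect(self):
        # An action: evaluates every partition and gathers the results
        return [x
                for i in range(len(self._partitions))
                for x in self.compute_partition(i)]


def word_count(rdd):
    """Count words across all partitions of a ToyRDD of text lines."""
    counts = {}
    for word in rdd.flat_map(str.split).map(str.lower).collect():
        counts[word] = counts.get(word, 0) + 1
    return counts


# Two partitions of made-up sample lines
lines = ToyRDD([["spark is fast", "Spark is in memory"], ["hadoop is on disk"]])
counts = word_count(lines)
```

In real PySpark the same shape of pipeline would be distributed across a cluster; the sketch runs in a single process and is meant only to show lazy transformations versus actions.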
About the Speaker
Sr. Software Engineer on the Yahoo! (Taiwan) Data Team. He has been responsible for data infrastructure, data solutions, software releases, and continuous integration management. He is a lifelong student of software development, testing, deployment, and CI processes and best practices, an avid coding puzzle competitor, and an open source evangelist.
Organization/Company
Yahoo!
Title
Sr. Software Engineer