Building Data Pipelines on Apache NiFi with Python

Abstract

In today's big data world, the data you need to analyze comes from diverse sources in a variety of formats. Combining and reconciling all that data is difficult. Adopting an ETL tool that fits your needs and is easy to manage can make data integration easier.

Apache NiFi is an open source tool built to automate and manage the flow of data between systems. You can use NiFi to build streaming data pipelines between different data-related systems, including Apache Kafka, Apache Spark, various RDBMSs, and more.

In this talk, I will start by introducing the concept of ETL and Apache NiFi, what problems NiFi can solve, and how to use Python to extend NiFi's capabilities. Then a sample demo will help you understand how to build a streaming data pipeline with NiFi.

Description

This talk uses the open source tool Apache NiFi to show how to build a streaming data pipeline. It focuses on three topics:

1. What is ETL
2. What is Apache NiFi
3. How do Apache NiFi and Python work together

The talk begins with the concept of ETL, then introduces the basic concepts of NiFi and how to use NiFi to build a data pipeline. The second half explains in more detail how Python can extend NiFi's data processing capabilities.

The Apache NiFi project: https://nifi.apache.org/

Features:

- Web-based user interface
- Extensible: you can develop your own extensions
- Highly configurable

One of the demo examples: [Hello World! NiFi](https://medium.com/@suci/hello-world-nifi-dcafcba0fdb0)

Materials and sample code:

- [Python Script Examples in NiFi](https://github.com/sucitw/python-script-in-NiFi)
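To give a feel for the kind of per-record processing a Python script can contribute to a pipeline, here is a minimal sketch of a transform step that reads newline-delimited JSON from an input stream and writes cleaned records to an output stream. It is not taken from the talk's sample code: the record schema (`sensor`, `fahrenheit`) and the function names `transform`, `process`, and `run_demo` are hypothetical, and the sketch runs as plain Python with in-memory streams. Inside NiFi, a scripting processor would supply the streams instead, and that wiring is omitted here.

```python
import io
import json


def transform(record):
    """Transform step: normalize one reading (hypothetical schema)."""
    return {
        "sensor": record["sensor"].strip().lower(),
        "celsius": round((record["fahrenheit"] - 32) * 5.0 / 9.0, 2),
    }


def process(input_stream, output_stream):
    """Read newline-delimited JSON, transform each record, write it back out.

    Returns the number of records written.
    """
    count = 0
    for line in input_stream.read().decode("utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        out = transform(json.loads(line))
        output_stream.write((json.dumps(out, sort_keys=True) + "\n").encode("utf-8"))
        count += 1
    return count


def run_demo():
    """Stand-alone demo: in-memory streams stand in for pipeline content."""
    source = io.BytesIO(
        b'{"sensor": " Line-A ", "fahrenheit": 212}\n'
        b'{"sensor": "LINE-B", "fahrenheit": 32}\n'
    )
    sink = io.BytesIO()
    n = process(source, sink)
    return n, sink.getvalue().decode("utf-8")


if __name__ == "__main__":
    written, text = run_demo()
    print(written)
    print(text, end="")
```

Keeping the transform logic in a plain function that only sees streams makes it easy to test outside the pipeline before plugging it into a flow.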

Slides

https://speakerdeck.com/sucitw/building-data-pipelines-on-apache-nifi-with-python

Speaker

Shuhsi Lin

A data engineer and Python programmer, currently working on various data applications at a manufacturing company.

Research interests: IoT applications, data stream processing, data analysis, and data visualization.