A Senior ML/Data Engineer at Gogolook. I am currently in charge of implementing streaming ETL infrastructure and NLP-related ML models and applications. I have 4+ years of experience in data science and data engineering, including NLP and streaming (micro-batch) ETL design. My research interests include NLP algorithms, models, and papers, streaming data pipelines, and cloud services. I hope I can contribute something to the data world.
Abstract
In today's data-driven world, we often face the question of how to process and analyze data effectively and in real time, and streaming processing is an important part of the answer. In addition, data can have different schemas for different applications and needs. To ensure data correctness and availability in streaming applications, schema verification needs to be integrated into the streaming process. To achieve this, I will start by introducing the concept and use cases of streaming processing, along with two services: Apache Kafka and Schema Registry. Kafka is a message queue system that can handle large volumes of streaming data, and Schema Registry is a service built on top of Kafka that performs schema verification when producing data to or consuming data from Kafka. Lastly, I will share how to integrate these two services with Python to implement a reliable streaming process.
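As a taste of the integration, below is a minimal producer sketch using the confluent-kafka client, where serialization through Schema Registry is what enforces the schema on the produce side. The broker and registry URLs, the "users" topic, and the record schema are illustrative assumptions, not settings from the talk.

from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

# Hypothetical Avro schema for the demo records.
schema_str = """
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}
"""

# Assumed local endpoints for the broker and Schema Registry.
schema_registry = SchemaRegistryClient({"url": "http://localhost:8081"})
avro_serializer = AvroSerializer(schema_registry, schema_str)

producer = Producer({"bootstrap.servers": "localhost:9092"})

# Serialization validates the record against the registered schema,
# so malformed data is rejected before it ever reaches the topic.
record = {"name": "Alice", "age": 30}
producer.produce(
    topic="users",
    value=avro_serializer(record, SerializationContext("users", MessageField.VALUE)),
)
producer.flush()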
Description
In this session, I will start by explaining the difference between batch and streaming processing to help participants establish the basic concepts, and I will introduce the importance and use cases of streaming processing. I will then highlight the architecture and purpose of Apache Kafka and Schema Registry. Next, I will show how to implement a streaming process in Python, including producing data, consuming data, and performing schema verification, through example code and a demo. Lastly, I will share some important settings and how to fine-tune producers and consumers to achieve high throughput and low latency in a streaming process.
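To complement the producer sketch above, here is a minimal consumer that deserializes Avro records via Schema Registry, with a couple of throughput/latency-related librdkafka settings shown inline. The URLs, group id, topic, and tuning values are illustrative assumptions, not the talk's recommendations.

from confluent_kafka import Consumer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroDeserializer
from confluent_kafka.serialization import SerializationContext, MessageField

# Assumed local Schema Registry; the deserializer fetches the writer's
# schema from the registry using the schema id embedded in each message.
schema_registry = SchemaRegistryClient({"url": "http://localhost:8081"})
avro_deserializer = AvroDeserializer(schema_registry)

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed local broker
    "group.id": "demo-group",               # hypothetical consumer group
    "auto.offset.reset": "earliest",
    # Example tuning knobs (illustrative values): waiting for larger
    # fetches trades a little latency for higher throughput.
    "fetch.min.bytes": 1048576,
    "fetch.wait.max.ms": 500,
})
consumer.subscribe(["users"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        # Deserialization fails loudly if the payload does not match the
        # registered schema, which is the verification step in practice.
        record = avro_deserializer(
            msg.value(), SerializationContext(msg.topic(), MessageField.VALUE)
        )
        print(record)
finally:
    consumer.close()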
Video