Effective ways to scale up and maintain your web crawling project using Scrapy and its ecosystem of tools.

Kevin Lloyd Bernal

Kevin is currently a Software Engineer at Zyte. He builds solutions to crawl the web at scale. He's part of the team that develops and maintains open source packages that enable developers to effectively manage their parsing and crawling solutions. He is also currently pursuing an MS in Computer Science at GA Tech, specializing in Machine Learning.

    Abstract

    Acquiring massive amounts of public data from anywhere on the web is crucial in today's data age. Such an undertaking can be achieved through the use of Spiders, which have two components: (1) Crawling, the means of finding the content of interest, and (2) Extraction, the means of turning that content into a structured format. However, the web changes so fast that scaling and maintaining these spiders becomes an issue. In this talk, we will create an end-to-end web crawling project that walks through each crucial step, the challenges at each stage, and the available tools and techniques to overcome those obstacles. We will be using Scrapy, one of the most popular Python web crawling frameworks, together with its ecosystem of tools.
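    The two components above can be sketched in a framework-free way using only Python's standard library; this is a hypothetical illustration (the class name, sample HTML, and field choices are invented for this sketch), not Scrapy's API, though Scrapy packages the same two concerns into a Spider's `parse` callback and link-following machinery.

    ```python
    from html.parser import HTMLParser

    class LinkAndTitleExtractor(HTMLParser):
        """Minimal spider core: collects hyperlinks (the crawling
        component) and the page title (the extraction component)."""

        def __init__(self):
            super().__init__()
            self.links = []       # (1) Crawling: URLs to visit next
            self.title = None     # (2) Extraction: a structured field
            self._in_title = False

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.append(href)
            elif tag == "title":
                self._in_title = True

        def handle_endtag(self, tag):
            if tag == "title":
                self._in_title = False

        def handle_data(self, data):
            if self._in_title:
                self.title = data.strip()

    # Sample page standing in for a fetched response body.
    html = (
        '<html><head><title>Example</title></head>'
        '<body><a href="/page2">next</a></body></html>'
    )
    parser = LinkAndTitleExtractor()
    parser.feed(html)
    # parser.links -> ["/page2"]; parser.title -> "Example"
    ```

    In a real project, a framework such as Scrapy would schedule the discovered links, deduplicate them, and run many such parsers concurrently, which is exactly where the scaling and maintenance challenges the talk covers come in.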

    Description

    Video

    Location

    R3

    Date

    Day 1 • 13:05-14:35 (GMT+8)

    Language

    English talk

    Level

    Intermediate

    Category

    Project Tooling