Exploring Network Clusters in Python

Anand S

Anand is a co-founder of Gramener, a data science company. He leads a team that automates insights from data and narrates these as visual data stories. He is recognized as one of India's top 10 data scientists, and is a regular TEDx speaker. Anand is a gold medalist at IIM Bangalore and an alumnus of IIT Madras, London Business School, IBM, Infosys, Lehman Brothers, and BCG. More importantly, he has hand-transcribed every Calvin & Hobbes strip ever and dreams of watching every film on the IMDb Top 250. He blogs at http://s-anand.net. His talks are at https://bit.ly/anandtalks

Abstract

This workshop explores popular network clustering algorithms, using the scikit-network and networkx Python libraries, on real-world movie datasets. It shows how to find the path between two actors (via costars), how Iranian and Turkish actors are far removed from Hollywood but close to each other, and which major film industry is far away from the rest of the world.

Description

Network clustering is a way to explore community structure in social networks.

In this talk, we'll take actors on the Internet Movie Database and explore the clusters they form, connections between people, and outliers in those clusters.

PART 1. EXPLORING CLUSTERS (10 min)

We'll group the actor universe into distinct clusters, e.g. Hollywood, British actors, Chinese actors, etc., based on who acts most often with whom. Each actor belongs to one of these clusters.

You'll learn how to do this using different clustering algorithms in scikit-network and understand which model works better in which scenarios.
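
As a rough illustration of clustering by "who acts most often with whom", here is a minimal sketch in plain Python. It is not the workshop's code: the film and actor names are hypothetical, and it uses label propagation, a simple algorithm swapped in here because it fits in a few lines. The workshop itself relies on the clustering algorithms in scikit-network.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data (not from IMDb): film -> cast list.
films = {
    "F1": ["Ana", "Ben", "Cal"],
    "F2": ["Ana", "Ben"],
    "F3": ["Ben", "Cal"],
    "F4": ["Xia", "Yan", "Zoe"],
    "F5": ["Xia", "Yan"],
    "F6": ["Cal", "Xia"],  # the only film linking the two groups
}


def costar_graph(films):
    """Return {actor: {costar: number of films they share}}."""
    graph = defaultdict(lambda: defaultdict(int))
    for cast in films.values():
        for a, b in combinations(sorted(set(cast)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def label_propagation(graph, max_rounds=100):
    """Deterministic label propagation: every actor repeatedly adopts the
    label that carries the most shared-film weight among their costars."""
    labels = {actor: actor for actor in graph}
    for _ in range(max_rounds):
        changed = False
        for actor in sorted(graph):
            score = defaultdict(int)
            for costar, weight in graph[actor].items():
                score[labels[costar]] += weight
            if not score:
                continue
            best = max(score.values())
            candidates = sorted(l for l, s in score.items() if s == best)
            # Keep the current label on ties; otherwise take the first candidate.
            if labels[actor] not in candidates:
                labels[actor] = candidates[0]
                changed = True
        if not changed:
            break
    return labels


graph = costar_graph(films)
labels = label_propagation(graph)
for actor in sorted(labels):
    print(actor, "-> cluster", labels[actor])
```

On this toy data the single cross-group film is not enough to merge the two groups, so each actor ends up in exactly one of two clusters.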

We'll drill down into a few clusters to see the sub-clusters. For example, we'll see how the Mexican and Italian sub-clusters emerge within Hollywood, and how distinct they are.

You'll learn the principles of hierarchical clustering in the process.

PART 2. EXPLORING CONNECTIONS (15 min)

In this second part, we'll explore which clusters are closer to each other. For example, the British cluster is closest to the Hollywood cluster. The Turkish cluster is closest to the Iranian cluster.

You'll learn metrics in clustering -- such as modularity -- that define how tightly knit a cluster is, and how close two clusters are to each other.
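
To make modularity concrete, the sketch below computes it by hand for a tiny weighted graph. The edge list and cluster names are hypothetical. The formula used is the standard weighted one: for each cluster, take the share of edge weight that falls inside it and subtract the share expected by chance from its total degree. networkx offers an equivalent in `networkx.algorithms.community.modularity`; this version only shows what the number means.

```python
from collections import defaultdict

# Hypothetical weighted edges: (actor, actor, number of shared films).
edges = [
    ("Ana", "Ben", 2), ("Ana", "Cal", 1), ("Ben", "Cal", 2),
    ("Xia", "Yan", 2), ("Xia", "Zoe", 1), ("Yan", "Zoe", 1),
    ("Cal", "Xia", 1),
]


def modularity(edges, membership):
    """Q = sum over clusters of inside/m - (degree / 2m)^2,
    where m is the total edge weight of the graph."""
    total = sum(weight for _, _, weight in edges)
    inside = defaultdict(float)   # edge weight with both ends in the cluster
    degree = defaultdict(float)   # summed weighted degree of the cluster
    for a, b, weight in edges:
        degree[membership[a]] += weight
        degree[membership[b]] += weight
        if membership[a] == membership[b]:
            inside[membership[a]] += weight
    return sum(inside[c] / total - (degree[c] / (2 * total)) ** 2 for c in degree)


def weight_between(edges, membership, first, second):
    """Total edge weight linking two clusters: a simple closeness measure."""
    return sum(w for a, b, w in edges
               if {membership[a], membership[b]} == {first, second})


two_clusters = {"Ana": "one", "Ben": "one", "Cal": "one",
                "Xia": "two", "Yan": "two", "Zoe": "two"}
one_cluster = {actor: "all" for actor in two_clusters}

print(modularity(edges, two_clusters))  # clearly positive: a good split
print(modularity(edges, one_cluster))   # zero: no structure captured
print(weight_between(edges, two_clusters, "one", "two"))
```

A higher modularity means the clusters are more tightly knit than chance would predict, while the cross-cluster weight is one simple (assumed) way to compare how close two clusters are.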

We'll see how one cluster can connect with another through connectors -- actors who act across clusters -- and learn the best way for a budding Malaysian actor to enter Hollywood.

You'll learn how to use shortest path algorithms and prioritize when there are multiple paths of equal length.
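
The sketch below shows one way this could work: a breadth-first search that collects every shortest path, followed by a tie-break. The graph and names are hypothetical, and the tie-break rule (prefer the path with the most shared films along it) is an assumption made for illustration, not necessarily the rule used in the workshop. networkx provides `all_shortest_paths` for the first step.

```python
from collections import deque

# Hypothetical costar graph: {actor: {costar: number of shared films}}.
graph = {
    "newcomer": {"p": 1, "q": 3},
    "p": {"newcomer": 1, "star": 2},
    "q": {"newcomer": 3, "star": 2},
    "star": {"p": 2, "q": 2},
}


def all_shortest_paths(graph, source, target):
    """Return every path from source to target with the fewest hops."""
    dist = {source: 0}
    parents = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                parents[neighbour] = [node]
                queue.append(neighbour)
            elif dist[neighbour] == dist[node] + 1:
                # Another equally short way to reach this neighbour.
                parents[neighbour].append(node)
    if target not in dist:
        return []

    def build(node):
        if node == source:
            return [[source]]
        return [path + [node] for parent in parents[node] for path in build(parent)]

    return build(target)


def path_strength(graph, path):
    """Assumed tie-break score: total shared films along the path."""
    return sum(graph[a][b] for a, b in zip(path, path[1:]))


def strongest_shortest_path(graph, source, target):
    paths = all_shortest_paths(graph, source, target)
    return max(paths, key=lambda path: path_strength(graph, path)) if paths else None


paths = all_shortest_paths(graph, "newcomer", "star")
best = strongest_shortest_path(graph, "newcomer", "star")
print(paths)
print(best)
```

Both routes here are two hops long, so hop count alone cannot choose between them; the tie-break picks the route through the costar with whom the newcomer has worked more often.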

PART 3. EXPLORING OUTLIERS (10 min)

In this third part, we'll explore which actors act more with other clusters than their own. Sean Connery and Penelope Cruz, for example, are Hollywood actors who act more outside Hollywood.

You'll learn how to detect outliers within network clusters.
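
One simple way to express "acts more with other clusters than their own" is the share of an actor's costar weight that falls outside their assigned cluster. The sketch below assumes that definition and a 0.5 threshold; the graph, cluster labels, and helper names are hypothetical.

```python
# Hypothetical costar graph: {actor: {costar: number of shared films}}.
graph = {
    "Ana": {"Ben": 2, "Cal": 1},
    "Ben": {"Ana": 2},
    "Cal": {"Ana": 1, "Xia": 2, "Yan": 1},
    "Xia": {"Cal": 2, "Yan": 3},
    "Yan": {"Xia": 3, "Cal": 1},
}
# Hypothetical cluster assignment, e.g. the output of a clustering step.
membership = {"Ana": "one", "Ben": "one", "Cal": "one",
              "Xia": "two", "Yan": "two"}


def outside_share(graph, membership):
    """For each actor, the fraction of costar weight outside their own cluster."""
    shares = {}
    for actor, costars in graph.items():
        total = sum(costars.values())
        outside = sum(weight for costar, weight in costars.items()
                      if membership[costar] != membership[actor])
        shares[actor] = outside / total if total else 0.0
    return shares


def find_outliers(graph, membership, threshold=0.5):
    """Actors whose outside share exceeds the (assumed) threshold."""
    shares = outside_share(graph, membership)
    return sorted(actor for actor, share in shares.items() if share > threshold)


shares = outside_share(graph, membership)
outliers = find_outliers(graph, membership)
print(shares)
print(outliers)
```

In this toy graph only one actor does most of their work with the other cluster, which mirrors the kind of outlier described above.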

SYNTHESIS (5 min)

We'll see how these principles apply to any network -- such as people who email each other, or products that are bought together.

By the end of the talk, you'll have a practical understanding of how to apply network clustering to any network.
