Data Analysis: Strange Loop 2023 Videos

October 18, 2023

What is this?

A few days ago, I saw the videos for the 2023 Strange Loop conference starting to land in my YouTube feed.

This started out as a relatively simple question:

which videos should I watch?

It wasn’t my first time looking at a list of videos on YouTube and wondering how to find “the best ones”.

The rest of this post is the journey of trying to find an answer. I wrote this for a few reasons.

Problem Statement

which videos are worth watching?

Of course, this is highly subjective. In my case, I’ll break it down as:

[Image: example of videos and views from YouTube]

(I’m not trying to pick on anyone; it’s just an example.)

At this point, the plan usually looks the same: get the data, massage it into shape, then explore it.

Getting The Data

I tried not to overthink this; I decided to scrape YouTube straight from Chrome’s Developer Tools.

[Image: using Chrome Developer Tools on that YouTube page]


I used a snippet along these lines in the developer console (the exact selector depends on YouTube’s current markup):

[...document.querySelectorAll('#video-title')].
  map(el => el.ariaLabel).
  filter(text => text).
  join('\n')


Caveat: the page lazy-loads its content, so scroll down far enough to capture all of the 2023 videos before running the snippet.

Creating a project

“Project” is a big word. But when I manipulate data and it involves multiple steps, I usually create a directory to hold my files. Here’s what I did:

The point is to do an amount of bureaucracy proportional to the task at hand.

[Image: the project directory's contents]

Massaging the data

I pasted the data to a file:

$ pbpaste > data.raw

Then, I opened the file in vim and cleaned up the entries:

[Image: raw data in vim]


Looking at the data more closely:

[Image: zoomed-in raw data in vim]

I came up with this awk script:

match($0, /(.*) (.*) views (.*) (.*) ago/, arr) {
  title = arr[1]
  views = arr[2]
  count = arr[3]
  unit  = arr[4]

  gsub(",", "", views)

  multiplier = 1
  if (unit == "week" || unit == "weeks") {
    multiplier = 7
  }

  days = count * multiplier

  print views "\t" days "\t" title
}

Again, I only did what I needed for TODAY. This was a conscious decision.
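For readers who don’t speak awk, the same transformation can be sketched in Python; the sample line at the end is made up, but it follows the shape of the aria-labels above:

```python
import re

# Same shape as the awk pattern: title, views, count, unit.
PATTERN = re.compile(r"(.*) (.*) views (.*) (.*) ago")

# Days per unit; like the awk script, only what's needed for recent videos.
MULTIPLIERS = {"day": 1, "days": 1, "week": 7, "weeks": 7}

def parse(line):
    m = PATTERN.match(line)
    if not m:
        return None
    title, views, count, unit = m.groups()
    views = int(views.replace(",", ""))
    days = int(count) * MULTIPLIERS.get(unit, 1)
    return views, days, title

# Made-up sample line, for illustration only:
print(parse('Some Talk Title 12,345 views 2 weeks ago'))
```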


Here’s what the .tsv looked like:

[Image: sample tab-separated data]

Exploring the data

Personally, I use R. Feel free to use something else. Pick a tool and learn it well.

I loaded the data into R:


library(tidyverse)

d <- read_tsv("data.tsv",
         col_names = c("views", "days", "title"),
         col_types = cols(
           "views" = col_double(),
           "days"  = col_double(),
           "title" = col_character()
         )) |>
     mutate(mean_daily_views = views / days)
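Since the tool is a matter of taste, here’s roughly the same load sketched in plain Python (standard library only), assuming the same three-column data.tsv:

```python
import csv

# Read the three-column TSV and derive a mean_daily_views column,
# mirroring the R pipeline.
def load(path):
    rows = []
    with open(path, newline="") as f:
        for views, days, title in csv.reader(f, delimiter="\t"):
            views, days = float(views), float(days)
            rows.append({
                "views": views,
                "days": days,
                "title": title,
                "mean_daily_views": views / days,
            })
    return rows
```

From there, sorting by the derived column is a one-liner with `sorted()`.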


Let’s plot views against days:

ggplot(d, aes(days, views)) +
  geom_point(alpha=0.3, size=2.5, stroke=0, color="red")

[Image: scatter plot of views against days]


Here’s the (power law) distribution for views:

ggplot(d, aes(reorder(str_trunc(title, 40), views), views)) +
  geom_col(alpha=0.5, fill="red") +
  xlab("title")

[Image: per-video view counts, showing a power law distribution]


What about daily views against views?

ggplot(d, aes(views, mean_daily_views)) +
  geom_point(alpha=0.3, size=2.5, stroke=0, color="red")

[Image: scatter plot of mean daily views against views]


Finally, the videos, by top views and top daily views:

[Image: the videos ranked by top views and by top daily views]
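The two rankings measure different things. A quick Python sketch with made-up numbers shows why: a recent video can lead on daily views while an older one leads on total views.

```python
# Made-up example data, for illustration only.
videos = [
    {"title": "old favorite", "views": 9000, "days": 30},
    {"title": "new upload",   "views": 2000, "days": 2},
]

for v in videos:
    v["mean_daily_views"] = v["views"] / v["days"]

top_total = max(videos, key=lambda v: v["views"])["title"]
top_daily = max(videos, key=lambda v: v["mean_daily_views"])["title"]
print(top_total, top_daily)  # the two rankings disagree
```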


In the end, I didn’t find any deep insights in this data.

Maybe the power law distribution leads to obvious conclusions: watch what everybody else watched?

Here’s another view of the same data:

d |>
  mutate(popular = mean_daily_views >= 500) |>
  ggplot(aes(days, views)) +
  geom_line(alpha=0.1, size=3, color="red") +
  geom_point(aes(color=popular), alpha=0.3, size=2.5, stroke=0) +
  scale_y_log10() +
  scale_x_log10()
[Image: log-log view of the same data]

