Data Analysis: Strange Loop 2023 Videos

October 18, 2023

What is this?

A few days ago, I saw the videos for the 2023 Strange Loop conference starting to land in my YouTube feed.

This started out as a relatively simple question:

which videos should I watch?

It wasn’t my first time looking at a list of videos on YouTube and wondering how to find “the best ones”.

The rest of this post is the journey of trying to find an answer. I wrote this for a few reasons:

Problem Statement

which videos are worth watching?

Of course, this is highly subjective. In my case, I’ll break it down as:

example of videos and views from youtube

(I’m not trying to pick on anyone, it’s just an example)

At this point, the plan usually looks like:

- get the data
- massage it into something tabular
- explore it with a few plots
- decide what to watch

Getting The Data

I tried not to overthink this; I decided to scrape YouTube straight from Chrome’s Developer Tools.

using Chrome Developer Tools on that youtube page

thoughts:

I used this snippet in the developer console:

copy(
  [...document.querySelectorAll("yt-formatted-string#video-title")].
  map(el => el.ariaLabel).
  filter(text => text).
  join("\n")
)

breakdown:

- document.querySelectorAll("yt-formatted-string#video-title") grabs every video title element on the page
- each element's ariaLabel is a single string that bundles the title, the view count, and how long ago the video was posted
- filter(text => text) drops elements without an aria-label
- join("\n") makes it one line per video, and copy() (a DevTools console helper) puts the whole thing on the clipboard

Caveat: the page lazy-loads its contents, so scroll down far enough that all the 2023 videos have been rendered before running the snippet.

Creating a project

“Project” is a big word. But when I manipulate data and it involves multiple steps, I usually create a directory to hold my files. Here’s what I did:

The point is to do an amount of bureaucracy proportional to the task at hand.

the project directory's content

Massaging the data

I pasted the data into a file (pbpaste writes the macOS clipboard to stdout):

$ pbpaste > data.raw

Then, I opened the file in vim and cleaned up the entries:

raw data in vim

thoughts:

Looking at the data more closely:

zoomed in raw data in vim

I came up with this awk script:

# needs gawk: the three-argument match() that fills an array is a GNU extension
match($0, /(.*) (.*) views (.*) (.*) ago/, arr) {
  title = arr[1]   # everything before the view count
  views = arr[2]   # e.g. "1,234"
  count = arr[3]   # e.g. "2"
  unit  = arr[4]   # e.g. "days" or "weeks"

  sub(",", "", views)   # strip the thousands separator

  # normalize the age to days
  multiplier = 1
  if (unit == "week" || unit == "weeks") {
    multiplier = 7
  }
  days = count * multiplier

  print views "\t" days "\t" title
}

breakdown:

- the regex splits each line into four captures: everything before the view count (the title), the count itself, and the two words of the "N days/weeks ago" age
- greedy matching makes arr[2] the last token before " views", i.e. the number
- the age is normalized to days: weeks are multiplied by 7, days pass through as-is
- the output is one tab-separated line per video — views, days, title — which becomes the data.tsv the next step loads

Again, I only did what I needed for TODAY. This was a conscious decision.

Caveats

- only days and weeks are handled; a "1 month ago" entry would silently be counted as 1 day
- sub() strips just the first comma, which is fine as long as no video is over a million views
- everything assumes the cleaned-up lines really follow the "… N views M unit ago" shape

Here’s what the .tsv looked like:

sample tab-separated data

Exploring the data

Personally, I use R. Feel free to use something else. Pick a tool and learn it well.

I loaded the data into R:

library(tidyverse)

d <- read_tsv("data.tsv", col_names=c("views", "days", "title"),
       col_types=cols(
         "views" = col_double(),
         "days" = col_double(),
         "title" = col_character()
     )) |>
     mutate(mean_daily_views = views / days)

breakdown:

- col_names supplies the column names, since the .tsv has no header row
- col_types pins each column's type explicitly: numeric views and days, character title
- mean_daily_views = views / days is a rough "how fast is this accumulating views" metric used in the plots below

Let’s plot views against days:

ggplot(d, aes(days, views)) +
  geom_point(alpha=0.3, size=2.5, stroke=0, color="red")

scatter plot of views against days

thoughts:

Here’s the (power law) distribution for views:

ggplot(d, aes(reorder(str_trunc(title, 40), views), views)) +
  geom_col(alpha=0.5, fill="red") +
  xlab("title") +
  coord_flip()

bar chart of views by talk title

breakdown:

- reorder() sorts the bars by view count
- str_trunc(title, 40) keeps the long talk titles from swallowing the whole plot
- coord_flip() flips to horizontal bars so the titles stay readable

What about daily views against views?

ggplot(d, aes(views, mean_daily_views)) +
  geom_point(alpha=0.3, size=2.5, stroke=0, color="red")

scatter plot of mean daily views against views

breakdown:

Finally, the videos, by top views and top daily views:

the videos ranked by top views and by top daily views
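That figure is just a screenshot; something like this would pull the same rankings out of d (the cutoff of 10 rows is an arbitrary choice):

# top videos by raw view count
d |>
  arrange(desc(views)) |>
  select(title, views) |>
  head(10)

# top videos by mean daily views
d |>
  arrange(desc(mean_daily_views)) |>
  select(title, mean_daily_views) |>
  head(10)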

Discussion

In the end, I didn’t find any deep insights in this data:

Maybe the power law distribution leads to obvious conclusions: watch what everybody else watched?
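One way to make that concrete is a quick sanity check on how much of the total viewing the most-watched videos account for (a sketch, reusing the d tibble from above):

# cumulative share of all views captured by the top videos
d |>
  arrange(desc(views)) |>
  mutate(cumulative_share = cumsum(views) / sum(views)) |>
  select(title, views, cumulative_share) |>
  head(10)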

Here’s another view of the same data: log-log scales, with the “popular” videos (at least 500 mean daily views) highlighted.
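The geom_line layer below references guide.data, which I haven’t shown; a minimal sketch of what it could be, assuming the faint line marks that same 500-views-per-day threshold:

# hypothetical reconstruction: a straight reference line at 500 views per day
guide.data <- tibble(
  x = range(d$days),          # span the observed range of days
  y = 500 * range(d$days)     # views = 500 * days
)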

d |>
  mutate(popular = mean_daily_views >= 500) |>
ggplot(aes(days, views)) +
  geom_line(aes(x=x, y=y), alpha=0.1, size=3, color="red", data=guide.data) +
  geom_point(aes(color=popular), alpha=0.3, size=2.5, stroke=0) +
  scale_y_log10() +
  scale_x_log10() +
  NULL

log-log view of same data

breakdown:

- both axes are log-scaled, which spreads out the power-law tail
- popular flags videos averaging at least 500 views per day and drives the point color
- the faint geom_line is the reference line built from guide.data (sketched above)
- the trailing NULL is just a convenience: layers can be commented out without breaking the chain of +

Discuss on Bluesky