# The Partition Problem

It started out with a seemingly simple real-life problem:

Given a playlist of sequential tracks, how do you divide it

in groups of similar durations?

I wanted to make 30-ish-minute groups; what I need for my daily Spanish practice.

It felt like a “solved problem”, I just needed to find some online “app” to do this…

## Research

At that point, the last thing I wanted was to code this myself. Instead, I was willing to put in the time and do my homework.

I spent HOURS on this. To make a **long** story short:

- it’s a set partition problem
- it’s linear, the order matters

*no good: listening to lesson 17 before lesson 2* - it’s discrete, the items cannot be subdivided

*no good: stopping in the middle of a lesson* - it’s similar to the partition problem
- it’s similar to the subset sum problem
- there were MANY almost-but-not-quite problems and algorithms

The closest thing I found was the linear partition from Skiena’s book though I found the explanation a bit opaque. He describes the problem as:

input: a given arrangement S of non-negative numbers and an integer k

output: partition S into k ranges, so as to minimize the maximum sum over all the ranges

## Coding it, after all

I never found a piece of software to which I could provide a list of numbers and how many parts I want. I decided to code it and try a few approaches. Here’s the data I’m working with and here’s what it looks like:

where the boxplot of the track durations looks like:

more details:

- 58 tracks
- range: 124-520s (2:04-8:40)
- mean: 290s (4:50)
- median: 281s (4:41)
- total duration: 16805s (4:40:05)
- no outliers

## Methodology

The objective is to divide the tracks into groups that are “fair”. There
shouldn’t be a “big group” or a “small group”, they should all be roughly
equal. We want to minimize the spread of the group durations. The **standard
deviation** is a good measure of spread, it’s also familiar and easy to
calculate.

I came across the idea that minimizing the standard deviation really meant minimizing the sum of squares. Although correct and easier to calculate, I found that it was almost more abstract and in units I found hard to relate to. Consequently, the standard deviation is what I’m going to use to describe the “goodness” of my solutions.

For the sake of comparison, I’ll try to split the tracks into 10 groups – 9 groups made the average group duration over 30m and I’m still aiming for around 30m. I’ll come back to this point when I break into different number of groups later.

## Simplest approach: same number of tracks in each group

Relevant code: count.R

This is not *completely* unreasonable, for tracks with similar durations, you
might get good results by making groups with the same number of tracks. On
the other hand, it doesn’t even look at the track durations, so it’s not using
all the information available.

```
sd: 304.5 -- range: 1040-2035s (17:20-33:55)
```

Most groups have 6 tracks, some have 5 – that’s what you’d expect from breaking 58 tracks into 10. A lot of long tracks at the beginning and small tracks at the end give us a lot of variation between groups.

Let’s call this our baseline – clearly, we can do better.

## Greedy approach: accumulate into a group until a maximum

Relevant code: cutoff.R

When I started thinking about the problem, it was the first solution that
came to mind. Accumulate tracks in a group until *around* 30 minutes, then
start a new group.

Using a cutoff of 1680.5 (total duration divided into 10 groups):

```
sd: 406.0 -- range: 148-1672s (2:28-27:52)
```

We can see many problems:

- we ended up with 12 groups – but maybe that’s not a deal-breaker

*the choice of 10 groups was arbitrary anyway* - last group is extreme, stuck with one leftover track
- on the plus side, the other groups at relatively even

This approach doesn’t rely on the number of groups, it has a specific cutoff to determine group membership. It’s possible to change the cutoff until you get a certain number of groups, or a lower standard deviation:

```
cutoff buckets sd
1500 13 288.9883
1550 12 131.9025
1600 12 131.9025
1650 12 266.5167
1700 12 406.5718
1750 11 265.3115
1800 11 270.0163
1850 11 475.3502
1900 10 245.7068
1950 10 399.2165
2000 9 145.1454
2050 9 145.1454
2100 9 145.1454
2150 9 336.371
2200 9 428.1255
2250 8 149.2667
2300 8 149.2667
2350 8 434.7446
2400 8 535.3405
2450 8 535.3405
2500 8 695.6937
```

The “best” attempt falls at cutoff is 2000:

```
sd: 145.1 -- range: 1513-2000s (25:13-33:20)
```

It’s clear from that data above that the exact value of the cutoff has a big impact on the end result. If you’re willing to try multiple values looking for improvements, this algorithm might be enough for you.

But keep in mind that this algorithm makes no effort in looking at neighbor solutions which might be big improvements. For example, you could “even out” the last group by stealing some tracks from its neighbor. Then, you can repeat until … all groups level out.

## Minizinc: letting the solver do its thing

Relevant code: partition.mzn

Although I enjoyed the Modeling Discrete Optimization class, I wasn’t sure that I would find a practical use for what I had learned. But Minizinc is designed to solve problems like this one – it is a language to describe constraints. You don’t have to write the search logic, that’s handled by a solver.

Above, I mentioned that Skiena’s algorithm was looking to “minimize the maximum sum over all the ranges”. I ran that and found:

```
sd: 245.2 -- range: 1040-1892s (17:20-31:32)
```

It’s pushing down on top values until they level out. It’s obvious looking at group 10 that the solver didn’t care
about the group values below the lowest max. We can immediately do better if we try to minimize `max - min`

which
will push down the max AND pull up the min:

```
sd: 132.5 -- range: 1488-1892s (24:48-31:32)
```

The solver converges quickly. Even better, we didn’t have to write any complicated logic – the solver did all the work. But we know that the solver didn’t investigate the groups between the min and max. After all, all the values between min and max don’t change min and max. Depending on your data, you might land on the best answer – or tweaking the mid-groups might improve the standard deviation.

It would have been best to try to minimize the “sum of squares” (or the standard deviation) directly, but the non-linear nature of that problem prevented the solver from doing a good job. The solver got lost, and I couldn’t find a way to guide it with my current Minizinc skills.

## The water flow way: inspired by nature

Relevant code: water-flow.R

If you imagine all the groups as containers filled with water and connected by pipes, you would only have to open the valves for all the containers to find their level. In the discrete case, it means that a whole track flows from one container to the next.

I decided to experiment with a simple algorithm:

- start from some initial state
- pick the biggest group
- send a track to one of its neighbor group
- goto 2

It’s not exactly how water flows, but it’s inspired by it. Always picking the
biggest group bypasses the case where I could pick a group smaller than both
its neighbors. I can code this function, call `optim`

in R in simulated annealing (“sann”)
mode to keep repeating the process.

```
sd: 120.9 -- range: 1488-1892s (24:48-31:32)
```

The range didn’t change compared to the Minizinc version because the min and max didn’t change. However, the algorithm did a better job at smoothing out group 8, 9 and 10.

I was surprised how well this simple algorithm worked in practice. I had expected it to get bogged down based on initial conditions, but that didn’t happen even with extreme initial conditions. This might be dependent on the exact data – your mileage may vary. Maybe it’s best to start the algorithm with something “almost right”, something like the output of one of the other method above.

## Durations and Number of Groups

Now that we have a workable algorithm to reduce the spread of the groups, we can come back to question of how many groups would be ideal.

To stay under 30m (1800s), I would divide the tracks into 11 groups.

```
sd: 109.7 -- range: 1348-1740s (22:28-29:00)
```

This is using the water-flow code. Meanwhile, if you’re curious, the other methods for 11 groups:

```
count: sd = 304.5
cutoff: sd = 265.3
minizinc: sd = 137.1
```

## Conclusion

Simple problems sometimes have complicated solutions :-)

I don’t intend this to be the final word on the problem. In fact, part of me hopes that I missed something.

Let me know on twitter or by email.