How To Shuffle and Sample on the Command-Line

October 16, 2015

Shuffling

How do you shuffle on the command-line? With shuf:

> seq 5          # sequence from 1..5
1
2
3
4
5

> seq 5 | shuf   # same sequence, shuffled
4
3
2
1
5

On Linux, you already have shuf.

On Mac OS X, brew install coreutils installs shuf as gshuf (g for GNU). I usually alias shuf to gshuf so the same command works everywhere.
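A minimal sketch of that setup for a shell startup file, assuming gshuf is already on the PATH. It uses a shell function instead of an alias so it also works inside scripts; that choice is mine, and a plain alias shuf=gshuf does the same job in an interactive shell.

```shell
# Only wrap gshuf when no real shuf is available (e.g. on a Mac with Homebrew coreutils).
if ! command -v shuf >/dev/null 2>&1 && command -v gshuf >/dev/null 2>&1; then
    shuf() { gshuf "$@"; }
fi
```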

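You could use sort -R / sort --random-sort as a poor man’s shuf, but it is not a true shuffle: it sorts lines by a random hash, so identical lines always end up next to each other. For larger files, it’s also a terrible idea because sort has to sort the whole file just to randomize it.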
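A small illustration of the difference, assuming GNU sort and shuf. With duplicate lines in the input, sort -R keeps equal lines together while shuf does not have to:

```shell
# Four lines, two distinct values.
printf 'a\nb\na\nb\n' | sort -R
# Equal lines hash to the same value, so the two a's and the two b's come out adjacent.

printf 'a\nb\na\nb\n' | shuf
# shuf permutes lines regardless of content, so interleaved orders are possible.
```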

Sampling

Sampling is the selection of a subset of individuals from within a statistical population (Wikipedia).

To sample, pass the -n flag to shuf:

> seq 100 | shuf -n 5     # pick 5 from 1..100
78
71
74
4
52
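When the population is just a range of integers, GNU shuf can generate it itself with -i LO-HI, so seq is not needed. A short sketch of the same pick:

```shell
# Pick 5 distinct numbers from 1..100 without piping through seq.
shuf -i 1-100 -n 5
```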

You can allow repeated picks of the same value with the -r flag:

> seq 5 | shuf -r -n 10   # pick 10 from 1..5, with repeats
2
2
1
1
2
5
4
3
3
2

> seq 5 | shuf -n 10      # pick 10 from 1..5, WITHOUT repeats
2
1
3
5
4

Picking one thing is simply:

> seq 100 | shuf -n 1     # pick 1 from 1..100
37

Why Shuffle? Why Sample?

Every time I’m faced with too many things to look at, I don’t trust myself to pick a representative sample. It’s too easy to say “I’ll pick the first 10” and miss a problem that only happened later.

For example, you might have a system that generates files in a directory. There might be hundreds or thousands of files and you only want to get a feel for their content.

> find . -type f | shuf -n 10    # pick 10 files...
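To go from a list of sampled paths to a feel for their content, one option is to hand the sample to head. The sketch below assumes GNU find, shuf and xargs; peek_files is a hypothetical helper name, and the 5-line preview is an arbitrary choice.

```shell
# peek_files DIR N: print the first 5 lines of N randomly chosen files under DIR.
# (peek_files is a hypothetical helper, not a standard command.)
peek_files() {
    find "$1" -type f -print0 |   # NUL-separated paths survive spaces and newlines
        shuf -z -n "$2" |         # -z: treat NUL as the record delimiter
        xargs -0 -r head -n 5     # -r: do nothing if no files were found
}
```

For example, peek_files . 10 previews 10 random files under the current directory, with head printing a ==> name <== banner before each one.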

Or you might want to get a feel for what’s happening in a log file:

> cat /var/log/nginx/access.log | shuf -n 100
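A shuffled log sample loses its chronology. If order matters, one possible approach is to number the lines before sampling and sort the sample back afterwards. This is a sketch with a hypothetical helper name, sample_in_order, assuming the usual nl, shuf and sort tools:

```shell
# sample_in_order FILE N: pick N random lines from FILE, printed in original order.
# (sample_in_order is a hypothetical helper, not a standard command.)
sample_in_order() {
    nl -ba "$1" |        # prefix every line, blank ones included, with its line number
        shuf -n "$2" |   # random sample of the numbered lines
        sort -n          # restore file order using the line-number prefix
}
```

For example, sample_in_order /var/log/nginx/access.log 100 prints 100 random requests, oldest first, each prefixed with its line number in the log.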

Depending on your specific situation, this might raise the interesting question of how many things to look at.
