How To Shuffle and Sample on the Command-Line

October 16, 2015

Shuffling

How do you shuffle on the command-line? With shuf:

> seq 5          # sequence from 1..5
1
2
3
4
5

> seq 5 | shuf   # same sequence, shuffled
4
3
2
1
5

On Linux, you already have shuf.

On Mac OS X, brew install coreutils installs shuf as gshuf (g for GNU), so I usually alias shuf to gshuf to fix that.
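A minimal sketch of that alias, assuming Homebrew's coreutils has put gshuf on your PATH. It would go in whichever shell startup file you use (~/.bashrc and ~/.zshrc are the usual suspects), and the guard leaves machines that already have shuf alone:

```shell
# Only define the alias when shuf is missing and gshuf is available
# (i.e. on a Mac with Homebrew coreutils installed).
if ! command -v shuf >/dev/null 2>&1 && command -v gshuf >/dev/null 2>&1; then
  alias shuf=gshuf
fi
```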

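You could use sort -R / sort --random-sort as a poor man’s shuf, but it is not a true shuffle: it sorts lines by a random hash, so identical lines always end up next to each other. For larger files it’s also a terrible idea, because sort does the full work of a sort just to randomize the order.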
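To see the difference between the two, here is a small sketch, assuming GNU coreutils. GNU sort -R orders lines by a random hash of their content, so duplicate lines come out adjacent, while shuf is free to scatter them:

```shell
# Input with duplicate lines: three a's and three b's.
dupes=$(printf 'a\nb\na\nb\na\nb\n')

# sort -R: equal lines get equal hashes, so they always come out grouped.
sorted=$(printf '%s\n' "$dupes" | sort -R)

# shuf: permutes the lines, so duplicates can land anywhere.
shuffled=$(printf '%s\n' "$dupes" | shuf)

printf 'sort -R: %s\n' "$(printf '%s' "$sorted" | tr '\n' ' ')"
printf 'shuf:    %s\n' "$(printf '%s' "$shuffled" | tr '\n' ' ')"
```

The sort -R line always shows the a's together and the b's together; only which group comes first changes between runs.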

Sampling

“Sampling is the selection of a subset of individuals from within a statistical population” – Wikipedia.

To sample, pass the -n flag to shuf:

> seq 100 | shuf -n 5     # pick 5 from 1..100
78
71
74
4
52
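When the population is just a range of integers, GNU shuf can generate it itself with the -i flag, which skips the seq step. A small sketch, assuming GNU coreutils' shuf:

```shell
# Pick 5 distinct numbers from 1..100 without piping from seq.
picks=$(shuf -i 1-100 -n 5)
printf '%s\n' "$picks"
```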

You can allow repeated picks of the same value with the -r flag:

> seq 5 | shuf -r -n 10   # pick 10 from 1..5, with repeats
2
2
1
1
2
5
4
3
3
2

> seq 5 | shuf -n 10      # ask for 10 from 1..5 WITHOUT repeats: you only get 5
2
1
3
5
4
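
Picking with repeats is handy for simulating independent draws. As a hedged sketch: GNU shuf's -e flag treats its arguments as the input lines, so ten coin flips could look like this:

```shell
# -e: use the arguments as input lines; -r: allow repeats; -n 10: ten draws.
flips=$(shuf -r -n 10 -e heads tails)
printf '%s\n' "$flips"
```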

Picking one thing is simply:

> seq 100 | shuf -n 1     # pick 1 from 1..100
37

Why Shuffle? Why Sample?

Every time I’m faced with too many things to look at, I don’t trust myself to pick a representative sample. It’s too easy to say “I’ll pick the first 10” and miss a problem that only happened later.

For example, you might have a system that generates files in a directory. There might be hundreds or thousands of files and you only want to get a feel for their content.

> find . -type f | shuf -n 10    # pick 10 files...
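
To actually peek inside the sampled files, something like the sketch below might do. The function name peek_sample is made up for illustration, and it assumes GNU find, shuf and xargs so that NUL separators are available:

```shell
# Print the first 3 lines of n randomly chosen files under a directory.
# NUL separators (-print0, -z, -0) keep file names with spaces intact.
peek_sample() {
  dir=$1    # directory to search
  n=$2      # how many files to sample
  find "$dir" -type f -print0 | shuf -z -n "$n" | xargs -0 head -n 3
}
```

Calling peek_sample . 10 would show the start of 10 random files, each under head's usual ==> name <== header.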

Or you might want to get a feel for what’s happening in a log file:

> cat /var/log/nginx/access.log | shuf -n 100
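
One catch with logs is that shuf also scrambles the order of the lines it picks. If chronology matters, a possible workaround is to number the lines first and sort the sample back afterwards. This is only a sketch, and sample_in_order is a hypothetical helper name:

```shell
# Sample n lines from a file but print them in their original order.
sample_in_order() {
  file=$1   # input file, e.g. a log
  n=$2      # number of lines to sample
  # nl -ba prefixes every line (blank ones too) with its line number and a tab,
  # shuf picks n of them, sort -n restores file order, cut drops the number.
  nl -ba "$file" | shuf -n "$n" | sort -n | cut -f2-
}
```

For the log above, that would be sample_in_order /var/log/nginx/access.log 100.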

Depending on your situation, this raises the interesting question of how many things you need to look at.
