How To Shuffle and Sample on the Command-Line
Shuffling
How do you shuffle on the command-line? With shuf:
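A minimal sketch of what that looks like, assuming GNU shuf is on your PATH. The input here is just seq output and a made-up list of fruits, not anything from a real system:

```shell
# Shuffle the numbers 1 to 5; the order changes on every run.
seq 5 | shuf

# shuf also reads a file given as an argument.
# fruits is a throwaway temp file with made-up content.
fruits=$(mktemp)
printf '%s\n' apple banana cherry > "$fruits"
shuf "$fruits"
```

Every input line comes out exactly once, only the order is random.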
On Linux, you already have shuf.
On Mac OS X, brew install coreutils installs shuf as gshuf (g for GNU), so I usually alias shuf to gshuf to fix that.
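One way to set up that alias, as a sketch. It assumes you put it in your shell startup file (~/.bashrc or ~/.zshrc), and the guard is my own addition so the same file stays harmless on a Linux machine:

```shell
# Only alias when shuf is missing and gshuf (from Homebrew coreutils) exists.
if ! command -v shuf >/dev/null 2>&1 && command -v gshuf >/dev/null 2>&1; then
  alias shuf=gshuf
fi
```

Aliases only apply to interactive shells, so scripts that must run on a Mac should still call gshuf directly.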
You could use sort -R / sort --random-sort as a poor man's shuf. For larger files, that's a terrible idea because sort still has to sort the whole file (it orders lines by a random hash) instead of simply shuffling it.
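There is a second difference worth seeing for yourself. With GNU sort, -R sorts by a random hash of each line, so identical lines always end up next to each other. A small sketch with made-up input:

```shell
# sort -R: the three "a" lines and the three "b" lines stay grouped.
printf '%s\n' a b a b a b | sort -R

# shuf: every line is placed independently, so duplicates can interleave.
printf '%s\n' a b a b a b | shuf
```

If your data contains repeated lines, sort -R is not a real shuffle.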
Sampling
"Sampling is the selection of a subset of individuals from within a statistical population" (Wikipedia).
To sample, pass the -n flag to shuf with the number of lines you want:
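For instance, a sketch that samples from seq output (stand-in data, swap in your own input):

```shell
# Pick 3 distinct lines at random out of 10.
seq 10 | shuf -n 3
```

Without -r, a line is never picked twice, and asking for more lines than the input has simply returns the whole input, shuffled.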
You can allow repeated picks of the same value with the -r flag:
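A small sketch, again with seq standing in for real data:

```shell
# Draw 10 values from only 3 candidates; repeats are expected.
seq 3 | shuf -r -n 10
```

Keep the -n when you use -r. With GNU shuf, -r without -n keeps printing forever.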
Picking one thing is simply:
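Something along these lines, where the inputs are made up for illustration:

```shell
# One random line from the input.
seq 10 | shuf -n 1

# -e treats the arguments themselves as the input lines: a coin flip.
shuf -n 1 -e heads tails
```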
Why Shuffle? Why Sample?
Every time I'm faced with too many things to look at, I don't trust myself to pick a representative sample. It's too easy to say "I'll pick the first 10" and miss a problem that only happened later.
For example, you might have a system that generates files in a directory. There might be hundreds or thousands of files and you only want to get a feel for their content.
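A sketch of how that could go. The directory and its 100 files are generated here as a stand-in, so point dir at your real directory instead:

```shell
# Stand-in for the real directory: 100 small generated files (made-up data).
dir=$(mktemp -d)
for i in $(seq 100); do
  echo "content of file $i" > "$dir/file-$i.txt"
done

# Pick 5 files at random and print the first lines of each.
find "$dir" -type f | shuf -n 5 | while IFS= read -r f; do
  echo "== $f"
  head -n 3 "$f"
done
```

Using find rather than ls keeps the full path on every line, so head can open the files directly.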
Or you might want to get a feel for what’s happening in a log file:
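For example, as a sketch. The log file below is generated on the spot as a stand-in, and the nl/sort variant is my own addition for when you want the sample in its original order:

```shell
# Stand-in log file with 1000 made-up lines; use your real log path instead.
log=$(mktemp)
for i in $(seq 1000); do
  echo "request $i handled"
done > "$log"

# 10 random lines from anywhere in the file.
shuf -n 10 "$log"

# Same idea, but number the lines first and restore the original order.
nl -ba "$log" | shuf -n 10 | sort -n
```

The line numbers in the second form also tell you where in the file each sampled line came from.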
Depending on your specific situation, this raises the interesting question of how many things to look at.