Unsorted uniq

May 30, 2014

Everybody gets caught the first time: uniq filters repeated lines – but only if they follow each other. This assumption greatly reduces the memory footprint of uniq and … its usefulness.
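To make the adjacency rule concrete, here is a minimal sketch. The three-line input and the helper name `sample` are made up for illustration and are not part of animals.txt.

```shell
# Hypothetical sample input: "cats" appears twice, but not on adjacent lines.
sample() { printf 'cats\ndogs\ncats\n'; }

# uniq compares each line only with the one before it,
# so both "cats" lines survive.
sample | uniq

# Sorting first makes the duplicates adjacent,
# at the cost of losing the original order.
sample | sort | uniq
```

The first pipeline prints all three lines unchanged; only the sorted one collapses the two cats lines into one.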

I explained previously how awk could be used to replace the classic sort | uniq -c incantation. In short, by skipping the sort, you can scale the solution to much bigger files.
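The earlier post is not reproduced here, so the following is only a sketch of the general idea, not necessarily the exact command it used. The function name `count_lines` is an assumption made for this example.

```shell
# count_lines is a hypothetical helper name, not taken from the earlier post.
count_lines() {
  # cnts[$0]++ tallies each distinct line as it is read;
  # the END block prints every total once the input is exhausted.
  # The order in which "for (line in cnts)" visits keys is unspecified.
  awk '{ cnts[$0]++ } END { for (line in cnts) print cnts[line], line }' "$@"
}

printf 'cats\ndogs\ncats\n' | count_lines
```

Unlike uniq -c, this version does not pad the counts, and the output order can differ between awk implementations, so pipe the result through sort if you need a stable order.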

If, however, all you need is an unsorted uniq, there’s an even shorter awk command you can use:

$ cat animals.txt
cats
cats
cats
dogs
birds
cats
dogs
dogs
birds
dogs
dogs
birds

$ cat animals.txt | awk '!cnts[$0]++'
cats
dogs
birds

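Broken down as:

cnts[$0]: the number of times the current line ($0) has been seen so far; an unseen line counts as 0.
++: increment that count, after its old value has been read.
!: negate the old value, so the expression is true only the first time a line appears.
No action block: awk prints the current line whenever the pattern is true.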

Or, in English, print the current line if you’ve never seen it, and mark it seen.
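If the one-liner feels too terse, the same logic can be spelled out with an explicit test. This is a sketch meant to be equivalent to the command above; the names `unsorted_uniq` and `seen` are chosen for this example only.

```shell
# unsorted_uniq is a hypothetical name; the body is the one-liner spelled out.
unsorted_uniq() {
  awk '{
    if (!($0 in seen)) {   # first time this exact line appears
      seen[$0] = 1         # mark it as seen
      print                # and print it, preserving input order
    }
  }' "$@"
}

printf 'cats\ncats\ndogs\nbirds\ncats\ndogs\n' | unsorted_uniq
```

The function reads standard input when called without arguments, and also accepts file names, as in `unsorted_uniq animals.txt`.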

As a bonus, this command doesn’t have to read the whole file before producing output; it prints each new unique line as soon as it first appears. The trade-off is memory: awk has to remember every distinct line it has seen.
