Unsorted uniq

May 30, 2014

Everybody gets caught the first time: uniq filters repeated lines – but only if they follow each other. This assumption greatly reduces the memory footprint of uniq and … its usefulness.
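
A quick way to see the gotcha, as a minimal sketch. The sample function is a made-up stand-in for a real file:

```shell
# Hypothetical sample input: "cats" appears twice, but not on adjacent lines.
sample() { printf 'cats\ndogs\ncats\n'; }

# uniq only collapses adjacent duplicates, so all three lines survive.
sample | uniq

# Sorting first makes the duplicates adjacent, at the cost of the original order.
sample | sort | uniq
```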

I explained previously how awk could be used to replace the classic sort | uniq -c incantation. In short, by skipping the sort, you can scale the solution to much bigger files.
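
For reference, the counting idea can be sketched roughly as follows. This is an illustration of the approach, not necessarily the exact command from the earlier post; sample, count_lines and the count array are names made up here:

```shell
# Hypothetical sample input.
sample() { printf 'cats\ndogs\ncats\nbirds\ncats\n'; }

# Classic approach: sort so duplicates are adjacent, then count each run.
sample | sort | uniq -c

# awk approach: tally every line in an associative array, print totals at the end.
# The order of "for (line in count)" is unspecified, so output order may vary.
count_lines() {
  awk '{ count[$0]++ } END { for (line in count) print count[line], line }' "$@"
}

sample | count_lines
```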

If, however, all you need is an unsorted uniq, there’s an even shorter awk command you can use:

$ cat animals.txt
cats
cats
cats
dogs
birds
cats
dogs
dogs
birds
dogs
dogs
birds

$ cat animals.txt | awk '!cnts[$0]++'
cats
dogs
birds

Broken down as:

$0 – the whole current line
cnts[$0] – hash (automatically created) lookup of the current line
!cnts[$0] – true if the line’s count is still zero, i.e. it is not in the hash yet (first occurrence)
!cnts[$0]++ – same test on the old count, then increment it by one (flagging the line as “seen”)
{ print $0 } – implicit action of the conditional above, print the current line

Or, in English, print the current line if you’ve never seen it, and mark it seen.
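
The breakdown above can also be spelled out in long form. This is only an expanded sketch of the same one-liner; the unsorted_uniq function name is made up for illustration:

```shell
# Long form of awk '!cnts[$0]++', wrapped in a hypothetical function.
unsorted_uniq() {
  awk '{
    if (cnts[$0] == 0) {   # unset entries compare equal to 0: first occurrence
      print $0
    }
    cnts[$0]++             # mark the line as seen
  }' "$@"
}

printf 'cats\ncats\ndogs\nbirds\ncats\ndogs\n' | unsorted_uniq
# prints: cats, dogs, birds (one per line)
```

The one-liner does the same thing by letting the post-increment return the old count and relying on awk's default action.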

As a bonus, this command doesn’t have to read the whole file before producing output; it prints each new unique line as soon as it shows up. The price is that every distinct line seen so far stays in memory.
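
One way to watch this happen is to feed awk a deliberately slow input. A minimal sketch, assuming a made-up slow_lines generator; exact timing depends on how the awk implementation buffers its input and output:

```shell
# Hypothetical generator: emits one line per second.
slow_lines() {
  for word in cats cats dogs; do
    echo "$word"
    sleep 1
  done
}

# On a terminal, "cats" should typically appear before the input has ended,
# whereas sort | uniq cannot print anything until it has read every line.
slow_lines | awk '!cnts[$0]++'
```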
