My Best Awk Tricks
This is a wrap-up of my AWK tutorial series.
You can start with why learn awk.
Or you can just straight to part 1 of the tutorial.
If you’ve read the tutorial, the amount of magic should be down to a minimum.
Before you ask: I have a cheatsheet and that’s where I keep the recipes that follow. I’m just human – I copy and paste what I need.
I’m not the author of these recipes, I only collected them over time.
Allow me to skip the
cat FILE | or
awk 'YOUR SCRIPT' FILE parts. By now,
I trust you to figure that out.
Which specific columns make sense for your specific needs will depend on you.
I might use
$1, but you’ll have to fix those. That’s what I do after I paste.
uniq without sort
I posted about this before, but it’s still my favorite:
The action is
Related to the above, print duplicates (without sort):
Print the 2nd time you see a string.
Group counts or sums
This was covered in the tutorial, but it’s damn useful:
Accumulate in an array, report at the
END. In both cases, pay attention to the columns you use.
Set operations: union, intersection, difference
If you have multiple files, and you consider their content as sets, you can generate a bunch of interesting subsets.
cat all the files and use the “uniq without sort” recipe from above :-)
For both intersection and difference, you need to accumulate from one file and process the other file.
The main trick is to realize that
FNR will, by definition, only be equal
during the processing of the first file. The
next statement ensures the rest
of the one-liner is skipped. We load the lut (LookUp Table) array with the relevant
parts from the first file.
$0 in lut instead of
lut[$0] for the condition? That’s an
optimization I learned the hard way: even a miss lookup in
instantiate the array location to an empty string – and over the processing of
HUGE files, you’ll eventually consume a lot of memory.
It takes a LOT for this problem to be a problem with the amount of memory
that computers have nowadays … that’s why I didn’t cover the
in operator in
This operation isn’t symmetrical: you’re removing the entries from FILE1 from FILE2. Switch the files around to get the other set difference.
If your AWK script isn’t fast enough, it might be time to consider whether AWK is the right tool for the job. How many GB of data are you piping through it?!
That being said, I know 2 tricks to speed up AWK:
Drop unicode support
LC_ALL variable forced to C will drop unicode support and, sometimes, greatly
speed up processing.
There are many variants of AWK, and the one you’re using is probably GNU AWK.
There are others: mawk is one of the FAST one.
- is mawk already installed?
- how much of a pain will it be to install?
- will my AWK script work without modifications?
These are all good questions. In all likehood:
- mawk won’t be installed…
- it will be easy to install (
brew install mawk, for example)
- your unmodified AWK script will just run FASTER
If you’re hitting the performance wall, giving mawk a chance might be worth it.