Awk Tutorial, part 3
I already mentioned why you should learn AWK.
In part 1, we laid a solid foundation.
In part 2, we covered most of what you would ever need.
Let’s cover what’s left.
How does AWK decide what a “column” is and isn’t?
AWK “trims” the line and separates on adjacent whitespace: \s+ as a regex.
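Here is a quick way to see the default splitting for yourself. It's a minimal sketch: the input string is made up, with leading blanks, a tab, and runs of spaces thrown in on purpose.

```shell
# Leading and trailing blanks are ignored; any run of spaces or tabs
# counts as a single separator. NF is the number of fields AWK found.
printf '   alpha \t beta    gamma  \n' | awk '{ print NF ": " $1 "|" $2 "|" $3 }'
# prints: 3: alpha|beta|gamma
```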
That’s usually what you want, but you can specify what you need with the -F option.
(source file: netflix.csv)
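As a sketch of -F with a plain comma, here is a run on a few made-up rows. I'm assuming netflix.csv uses the usual stock-quote layout (Date, Open, High, Low, Close, Volume, Adj Close); the numbers below are invented, not real market data.

```shell
# Made-up rows in the assumed netflix.csv layout (not real market data).
sample='Date,Open,High,Low,Close,Volume,Adj Close
2016-01-05,110.0,112.0,109.0,111.0,1000,111.0
2015-12-31,115.0,116.0,114.0,114.5,4000,114.5'

# -F, makes the comma the field separator: $1 is the date, $6 the volume.
printf '%s\n' "$sample" | awk -F, '{ print $1, $6 }'
# prints:
#   Date Volume
#   2016-01-05 1000
#   2015-12-31 4000
```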
In reality, the -F option takes a regular expression:
It separated on commas or hyphens, and picked the 3rd column (the day). This can be a useful approach to extract subfields.
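A minimal sketch of that comma-or-hyphen split, again on made-up rows in the assumed netflix.csv layout. The NR > 1 guard is my addition to skip the header line; without it the header would print "High" as its third field.

```shell
# Made-up rows in the assumed netflix.csv layout (not real market data).
sample='Date,Open,High,Low,Close,Volume,Adj Close
2016-01-05,110.0,112.0,109.0,111.0,1000,111.0
2015-12-31,115.0,116.0,114.0,114.5,4000,114.5'

# [,-] is a bracket expression: split on either a comma or a hyphen.
# The date 2016-01-05 becomes $1=2016, $2=01, $3=05.
printf '%s\n' "$sample" | awk -F '[,-]' 'NR > 1 { print $3 }'
# prints:
#   05
#   31
```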
When you use a comma (,) in a print statement, that means “space”, right?
By default, it does. But that’s configurable:
The OFS (Output Field Separator) variable controls what goes between each field. I don’t use it very much; I usually format the output explicitly. But it’s sometimes useful and it’s good to know.
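To make that concrete, here is a small sketch that prints the same made-up row twice: once with the default OFS, once with OFS set in a BEGIN block. The row and the " | " separator are just illustrations.

```shell
row='2016-01-05,111.0,1000'   # made-up row: date, close, volume

# Default: the comma in print emits a single space.
printf '%s\n' "$row" | awk -F, '{ print $1, $3 }'
# prints: 2016-01-05 1000

# With OFS set, the same comma emits " | " instead.
printf '%s\n' "$row" | awk -F, 'BEGIN { OFS = " | " } { print $1, $3 }'
# prints: 2016-01-05 | 1000
```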
Passing in Variables
As in the previous example, you could decide to initialize variables in the BEGIN block. But that’s not always possible: the BEGIN block lives inside the single quotes, which severely limits what you can do.
What if you want to pass in variables, maybe from a shell script?
The -v (mnemonic “var”) option lets you set a variable from outside the script, a convenient place where shell variables and substitutions are available. We revisit the OFS variable from above, set the way I would prefer:
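A minimal sketch of -v in action. The shell variable sep is hypothetical (it could just as well come from a script argument), and the row is made up.

```shell
row='2016-01-05,111.0,1000'   # made-up row: date, close, volume
sep=';'                       # hypothetical shell variable

# -v assigns OFS before the AWK program starts; "$sep" is expanded by
# the shell because it sits outside the single quotes.
printf '%s\n' "$row" | awk -F, -v OFS="$sep" '{ print $1, $3 }'
# prints: 2016-01-05;1000
```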
Imagine a programming language without arrays or dictionaries. That’s how we’ve been using AWK up until now. But everything truly clever you can do with AWK (or any programming language) probably requires arrays.
How would we SUM(volume) GROUP BY year?
- split columns on commas or hyphens
- accumulate volume ($8) in a dictionary, using the year ($1) as key
- volume is a variable that gets created as a dictionary, because we use it as a dictionary – this was discussed in part 2
- at the END, print each year and volume sum
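The steps above can be sketched as a one-liner. The rows are made up and follow the assumed netflix.csv layout (Date, Open, High, Low, Close, Volume, Adj Close), so that after splitting on commas and hyphens the year is $1 and the volume is $8.

```shell
# Made-up rows in the assumed netflix.csv layout (not real market data).
sample='Date,Open,High,Low,Close,Volume,Adj Close
2016-01-05,110.0,112.0,109.0,111.0,1000,111.0
2016-01-04,108.0,110.5,107.0,110.0,2000,110.0
2015-12-31,115.0,116.0,114.0,114.5,4000,114.5'

# Accumulate volume per year, then report every key at the END.
printf '%s\n' "$sample" |
    awk -F '[,-]' '{ volume[$1] += $8 } END { for (year in volume) print year, volume[year] }'
```

On this sample it prints three lines, "2016 3000", "2015 4000" and "Date 0", in whatever order your AWK happens to walk the array.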
When working with arrays, this pattern of “accumulation” followed by “reporting” at the END is commonplace. But there are some problems with the output, and they highlight interesting points:
- the output isn’t sorted: the for-loop makes no guarantee about the order of the keys
(that can be fixed with a trailing | sort)
- the header was treated as a key, “Date”, and accumulated 0
(that can be fixed by removing the first line before AWK (sed 1d) or by using NR > 1 before the accumulator block)
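Here is a sketch with both fixes applied, on the same kind of made-up rows (assumed netflix.csv layout). I picked the NR > 1 variant for the header and a plain sort at the end; piping through sed 1d first would work just as well.

```shell
# Made-up rows in the assumed netflix.csv layout (not real market data).
sample='Date,Open,High,Low,Close,Volume,Adj Close
2016-01-05,110.0,112.0,109.0,111.0,1000,111.0
2016-01-04,108.0,110.5,107.0,110.0,2000,110.0
2015-12-31,115.0,116.0,114.0,114.5,4000,114.5'

# NR > 1 skips the header line; sort puts the years in order.
printf '%s\n' "$sample" |
    awk -F '[,-]' 'NR > 1 { volume[$1] += $8 } END { for (year in volume) print year, volume[year] }' |
    sort
# prints:
#   2015 4000
#   2016 3000
```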
I’ve been selling AWK as a language optimized for one-liners, but it’s possible to reach unpleasant extremes. The last example wasn’t too complicated, but it was long. It could be made more readable.
It’s possible to package multiple lines of AWK in a bash script:
- there’s a shebang line, it’s a bash script on the outside
- cat "$@" is a passthrough: if you feed a filename to the script, it will be used. If you don’t, an empty "$@" leaves cat reading from standard input
(it allows you to call cat file | SCRIPTNAME)
- notice the trailing single-quote on the AWK line, that’s where your script begins
- you can use multiple lines within the single-quotes
- use your own good taste to format the code
- notice the single-quote at the end, that’s a good place to put a pipe (if you need it)
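To try the points above end to end, here is a sketch that writes such a script to a temporary directory and calls it both ways. The script name volume_by_year, the temporary paths and the sample rows are all made up for the illustration; the part between the two EOF markers is the script itself.

```shell
workdir=$(mktemp -d)

# The quoted 'EOF' keeps "$@", $1 and $8 from being expanded by the
# outer shell while the script file is being written.
cat > "$workdir/volume_by_year" <<'EOF'
#!/bin/bash
cat "$@" | awk -F '[,-]' '
    NR > 1 { volume[$1] += $8 }      # skip the header, accumulate per year
    END {
        for (year in volume)
            print year, volume[year]
    }
' | sort
EOF

# Made-up rows in the assumed netflix.csv layout (not real market data).
cat > "$workdir/sample.csv" <<'EOF'
Date,Open,High,Low,Close,Volume,Adj Close
2016-01-05,110.0,112.0,109.0,111.0,1000,111.0
2016-01-04,108.0,110.5,107.0,110.0,2000,110.0
2015-12-31,115.0,116.0,114.0,114.5,4000,114.5
EOF

bash "$workdir/volume_by_year" "$workdir/sample.csv"         # filename argument
cat "$workdir/sample.csv" | bash "$workdir/volume_by_year"   # piped input
# each call prints:
#   2015 4000
#   2016 3000
```

I call the script through bash here so the sketch runs anywhere; after a chmod +x, the shebang line lets you call it directly by name.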
Now you can combine the best of bash scripting, with the power of AWK – all in a very portable package.
Taking inventory: what can you do?
Everything I can.
Specifically, you have:
- more flexible ways to parse inputs with -F
- more convenient ways to print outputs with OFS
- ultimate programming power: arrays!
- a nice way to package your logic in a bash script
- calculate the average closing price, grouped per year
- calculate the max closing price, grouped per month
- calculate the median volume, in 2015 – you might need this
Answers are here.
As a conclusion: my best AWK tricks.
At this point, I hope they won’t be a list of opaque incantations.
You will be able to see what and how it’s done.