Awk Tutorial, part 2
I already mentioned why you should learn AWK.
In part 1, we laid a solid foundation.
Let’s build on top of that.
NOTE: certain command outputs have been pretty-printed. Pipe through column -t
to obtain similar results.
Matching with Regular Expressions
So far, I’ve shown you ways to match lines based on column values. In practice, you usually want to match lines with regular expressions. For example, you can extract data from 2015:
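The one-liner itself did not survive here, so this is a minimal sketch of what such a match can look like. It assumes every line starts with an ISO date (YYYY-MM-DD); the `rows` variable holds made-up sample rows, not real quotes.

```shell
# Made-up sample rows (not real quotes): date open high low close volume adj_close
rows='2014-12-31 10 12 9 11 1000 11
2015-01-01 11 13 10 12 2000 12
2015-06-15 12 14 11 13 1500 13
2016-01-04 13 15 12 14 3000 14'

# A bare /regex/ prints every matching line; ^ anchors the match to the start.
from_2015=$(printf '%s\n' "$rows" | awk '/^2015/')
printf '%s\n' "$from_2015"
```

Only the two rows dated 2015 come out; the 2014 and 2016 rows are dropped.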
(achievement unlocked: you have re-created grep…)
A regular expression, by itself, is a shorthand for the condition: $0 ~ /regex/
This means you can match regular expressions on specific columns. You can extract the data for the 1st of every month:
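Here is a hedged sketch of a column-specific match. It assumes the date lives in the first column in YYYY-MM-DD form; the rows are made up for illustration.

```shell
# Made-up sample rows: date open high low close volume adj_close
rows='2015-01-01 11 13 10 12 2000 12
2015-01-02 12 14 11 13 1500 13
2015-02-01 13 15 12 14 3000 14
2015-02-10 14 16 13 15 2500 15'

# Match only against column 1; $ anchors the regex to the end of that field,
# so "-01" has to be the day, not the month.
first_days=$(printf '%s\n' "$rows" | awk '$1 ~ /-01$/')
printf '%s\n' "$first_days"
```

Note that 2015-01-02 is skipped even though it contains `-01`, because the regex has to match at the end of the first field.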
That’s already way better than grep.
Comparisons and Logic
I glossed over that in part 1, but AWK has all the usual comparison operators (==, !=, <, <=, > and >=) and logical operators (&&, || and !).
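A small sketch of how these combine in a condition. The column layout (open in $2, close in $5, volume in $6), the threshold of 2000 and the rows themselves are assumptions made for the example.

```shell
# Made-up sample rows: date open high low close volume adj_close
rows='2015-01-01 11 13 10 12 2000 12
2015-06-15 12 14 11 13 1500 13
2015-12-01 14 15 12 13 3000 13'

# Close above open AND volume of at least 2000.
up_big=$(printf '%s\n' "$rows" | awk '$5 > $2 && $6 >= 2000 { print $1 }')

# The opposite: close NOT above open, OR volume under 2000.
rest=$(printf '%s\n' "$rows" | awk '!($5 > $2) || $6 < 2000 { print $1 }')

printf 'up_big: %s\n' "$up_big"
printf 'rest:\n%s\n' "$rest"
```

The first condition keeps only 2015-01-01; the second keeps the two other dates.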
You can now build almost arbitrarily complex conditions. The one thing still missing is variables…
Built-in Variables
Some variables just “exist”; they already contain values and are automatically updated. These variables are easy to recognize because they are named in CAPITAL letters. Exception: column variables (starting with a $) are also built-in variables.
There are a bunch of built-in variables, but you’ll mostly use 2:
- NR : the number of records (lines) processed since AWK started
- NF : the number of fields (columns) on the current line
And 2 more if you’re dealing with multiple files:
- FNR : like NR, but resets to 1 when it begins processing a new file
- FILENAME: the name of the file being currently processed
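To see all four at once, here is a minimal sketch that prints them for every line of two tiny files. The file names `one.txt` and `two.txt` and their contents are hypothetical, created in a temporary directory just for the demonstration.

```shell
# Two throwaway files in a temporary directory (hypothetical names and contents).
dir=$(mktemp -d)
printf 'a b c\nd e\n' > "$dir/one.txt"
printf 'f g h i\n' > "$dir/two.txt"

# For each line: file name, global line number, per-file line number, field count.
report=$(cd "$dir" && awk '{ print FILENAME, NR, FNR, NF }' one.txt two.txt)
printf '%s\n' "$report"

rm -r "$dir"
```

NR keeps counting across files (1, 2, 3), while FNR starts over at 1 for `two.txt`.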
User-Defined Variables
There is no need to “declare” a variable or initialize it. A variable “comes to life” when you use it. You can count (and print) how many lines are from December 2015:
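One possible way to write it, as a sketch: the counter is called `n` here (any name would do), and the rows are made up. It prints the running count next to each matching date, so the last number is the total.

```shell
# Made-up sample rows: date open high low close volume adj_close
rows='2015-11-30 10 12 9 11 1000 11
2015-12-01 11 13 10 12 2000 12
2015-12-02 12 14 11 13 1500 13
2016-01-04 13 15 12 14 3000 14'

# n is never declared: it starts out empty, which counts as 0 in arithmetic.
counted=$(printf '%s\n' "$rows" | awk '/^2015-12/ { n = n + 1; print n, $1 }')
printf '%s\n' "$counted"
```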
Not having to declare variables is convenient, but it’s also error-prone. If you misspell a variable, there won’t be any warning, and it might take you a while to discover your mistake. You’ve been warned. Remember: this is a language that optimizes for one-liners.
What are variables initialized to?
An undefined variable, say x, contains the empty string. The first time you access it, that’s what you get. Strings are converted to numbers for numerical operations, so the empty string counts as 0:
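A quick sketch to check this for yourself. The variable names `x` and `y` are arbitrary, and `echo` is only there to feed AWK a single (empty) line so the block runs once.

```shell
# x is never assigned: it prints as nothing, but behaves as 0 in arithmetic.
# y holds the string "42", which is converted to the number 42 when added to.
shown=$(echo | awk '{ print "[" x "]"; print x + 1; y = "42"; print y + 1 }')
printf '%s\n' "$shown"
```

The three output lines are `[]`, `1` and `43`.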
Special Patterns: BEGIN and END
BEGIN and END are special conditions that only get triggered once per run.
- BEGIN gets triggered before processing any line
- END gets triggered after all lines have been processed
These conditions get triggered even if there are no input lines.
BEGIN is usually used to initialize variables, though now you know that’s not necessary for zeroes or empty strings. It can also be used to print a header.
END is usually used to crunch a result and print a summary or report:
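A sketch of both patterns working together, under the assumption that the close is in column 5; the rows and the header text are made up. The second command feeds AWK no input at all, to show that BEGIN and END still fire.

```shell
# Made-up sample rows: date open high low close volume adj_close
rows='2015-01-01 11 13 10 12 2000 12
2015-06-15 12 14 11 13 1500 13
2015-12-01 13 15 12 14 3000 14'

# Header first, one line per row, then a line count at the end.
summary=$(printf '%s\n' "$rows" |
  awk 'BEGIN { print "date close" } { print $1, $5 } END { print NR, "lines" }')
printf '%s\n' "$summary"

# No input lines: BEGIN and END are still triggered, and NR is 0.
empty_run=$(awk 'BEGIN { print "start" } END { print NR, "lines" }' < /dev/null)
printf '%s\n' "$empty_run"
```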
(achievement unlocked: you have re-created wc -l…)
Blocks and Control
You can have multiple condition-block pairs. Each line in the input files gets presented to each block you write:
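For instance, a sketch with two condition-block pairs. The "busy" label and the 2500 volume threshold are invented for the example, as are the rows; volume is assumed to be in column 6.

```shell
# Made-up sample rows: date open high low close volume adj_close
rows='2014-12-31 10 12 9 11 3000 11
2015-01-01 11 13 10 12 2000 12'

# Every line is tested against the first pair, then against the second.
tagged=$(printf '%s\n' "$rows" |
  awk '/^2015/ { print "2015:", $1 } $6 >= 2500 { print "busy:", $1 }')
printf '%s\n' "$tagged"
```

Each row matches exactly one of the two conditions here, so each date is printed once.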
That works great until you have a line that matches both conditions:
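A sketch of that situation, using a single made-up row that is both from 2015 and above a hypothetical volume threshold of 2500 (volume assumed in column 6):

```shell
# One made-up row that satisfies both conditions.
row='2015-12-01 13 15 12 14 3000 14'

# Both blocks run for the same line, one after the other.
twice=$(printf '%s\n' "$row" |
  awk '/^2015/ { print "2015:", $1 } $6 >= 2500 { print "busy:", $1 }')
printf '%s\n' "$twice"
```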
The same line was printed twice! There are two solutions to this problem:
- making your conditions mutually exclusive (which could be easy, but is often tedious and redundant)
- using the next statement:
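As a sketch, with the same invented "busy" label and 2500 volume threshold (volume assumed in column 6) and made-up rows: once the first block has handled a line, `next` moves straight on to the following line.

```shell
# Made-up sample rows: date open high low close volume adj_close
rows='2014-12-31 10 12 9 11 3000 11
2015-01-01 11 13 10 12 2000 12
2015-12-01 13 15 12 14 3000 14'

# Lines from 2015 are printed by the first block and never reach the second.
once=$(printf '%s\n' "$rows" |
  awk '/^2015/ { print "2015:", $1; next } $6 >= 2500 { print "busy:", $1 }')
printf '%s\n' "$once"
```

The 2015-12-01 row matches both conditions, but it is only printed by the first block.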
If you hit a next, your script will stop matching blocks and go to the next line from the input file. Using next means you have to think about the order of your blocks. That’s not necessarily a bad thing.
There’s also an exit statement to stop processing any more input and exit your script. The END block will still be executed, if you have one.
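A sketch of an early exit, assuming the rows are sorted by date; the rows and the summary wording are made up. Processing stops at the first 2016 line, and END still runs.

```shell
# Made-up sample rows, sorted by date: date open high low close volume adj_close
rows='2015-12-30 10 12 9 11 1000 11
2015-12-31 11 13 10 12 2000 12
2016-01-04 12 14 11 13 1500 13
2016-01-05 13 15 12 14 3000 14'

# Stop at the first 2016 line; the last row is never read.
until_2016=$(printf '%s\n' "$rows" |
  awk '/^2016/ { exit } { print $1 } END { print "stopped after", NR, "lines" }')
printf '%s\n' "$until_2016"
```

NR is 3 in the END block because the third line had already been read when exit was hit.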
Taking inventory: what can you do?
At this point, a better question would be: what can’t you do?
In review, you can:
- match a line with regular expressions
- match a line with comparison and logical operators
- use built-in variables, both in conditions and in blocks
- use your own variables, for all other needs
- control what happens at the beginning and the end of your script
- skip lines or exit early
Exercises
Try to:
- only print lines between February 29, 2016 and March 4, 2016
- sum the volumes for all days of January 2016
- average the closing price over all days of March 2015
- check that all lines have 7 columns
- only print every other line (say, even lines)
- remove empty lines in a file
Answers are here.