Awk Tutorial, part 2
I already mentioned why you should learn AWK.
In part 1, we laid a solid foundation.
Let’s build on top of that.
NOTE: certain command outputs have been pretty-printed. Pipe through column -t to obtain similar results.
Matching with Regular Expressions
So far, I’ve shown you ways to match lines based on column values. In practice, you usually want to match lines with regular expressions. For example, you can extract data from 2015:
$ cat netflix.tsv | awk '/^2015-/'
2015-12-31 116.209999 117.459999 114.279999 114.379997 9245000 114.379997
2015-12-30 118.949997 119.019997 116.43 116.709999 8116200 116.709999
2015-12-29 118.190002 119.599998 116.919998 119.120003 8159200 119.120003
2015-12-28 117.260002 117.349998 113.849998 117.110001 8406300 117.110001
2015-12-24 118.220001 118.800003 117.300003 117.330002 3531300 117.330002
# snip
(achievement unlocked: you have re-created grep…)
A regular expression, by itself, is a shorthand for the condition: $0 ~ /regex/
awk '/^2015-/'
# means:
awk '$0 ~ /^2015-/'
This means you can match regular expressions on specific columns. You can extract the data for the 1st of every month:
$ cat netflix.tsv | awk '$1 ~ /-01$/'
2016-03-01 94.580002 99.160004 93.610001 98.300003 16997700 98.300003
2016-02-01 91.790001 97.18 91.300003 94.089996 19618000 94.089996
2015-12-01 124.470001 125.57 122.419998 125.370003 12528800 125.370003
2015-10-01 102.910004 106.110001 101.120003 105.980003 17426900 105.980003
2015-09-01 109.349998 111.239998 103.82 105.790001 35977100 105.790001
# snip
That’s already way better than grep.
Comparisons and Logic
I glossed over that in part 1, but AWK has all the usual comparison operators:
$2 == 124.47 # equality
$2 != 124.47 # inequality
$2 > 124.47 # greater than
$2 >= 124.47 # greater than or equal
$2 < 124.47 # less than
$2 <= 124.47 # less than or equal
$2 ~ /^10.$/ # regex match
$2 !~ /^10.$/ # regex negated match -- this one might be new
and logical operators:
$1 ~ /^2015/ && $6 > 20000000 # and -- high volume in 2015
$6 < 1000000 || $6 > 20000000 # or -- low or high volume
! /^2015/ # not -- not in 2015
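To see these operators inside a complete command, here is a minimal sketch. The rows are made up (they are not taken from netflix.tsv) and use spaces instead of tabs, and the function names sample_rows and notable_days are hypothetical helpers for this example only.

```shell
# Made-up sample rows: date, open, high, low, close, volume, adjusted close.
sample_rows() {
  printf '%s\n' \
    '2015-07-16 115.0 116.5 114.0 115.8 63461000 115.8' \
    '2015-07-17 115.9 117.9 114.5 114.8 21000000 114.8' \
    '2015-07-20 114.9 115.2 110.5 110.6 900000 110.6' \
    '2016-01-04 109.0 110.0 105.2 109.9 20794800 109.9'
}

# High volume in 2015, or very low volume in any year.
# && binds tighter than ||; the parentheses only make the grouping explicit.
notable_days() {
  awk '($1 ~ /^2015/ && $6 > 20000000) || $6 < 1000000 {print $1, $6}'
}

sample_rows | notable_days
```

The first three rows are printed (date and volume only). The 2016 row is skipped: its volume is high, but the year test fails and the volume is not below one million.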
You can create almost arbitrarily complex conditions. The one thing still missing is variables…
Built-in Variables
Some variables just “exist”; they already contain values and are automatically updated. These variables are easy to recognize because they are named in CAPITAL letters. Exception: column variables (starting with a $) are also built-in variables.
There are a bunch of built-in variables, but you’ll mostly use 2:
- NR: the number of records (lines) processed since AWK started
- NF: the number of fields (columns) on the current line
And 2 more if you’re dealing with multiple files:
- FNR: like NR, but resets to 1 when it begins processing a new file
- FILENAME: the name of the file being currently processed
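A quick way to get a feel for these four variables is to print them all on every line. The sketch below assumes two throwaway files, a.txt and b.txt, created in a temporary directory; the file names, their contents, and the show_vars helper are all made up for illustration.

```shell
# Two throwaway files with made-up contents (names are hypothetical).
dir="$(mktemp -d)"
printf 'one two\nthree\n' > "$dir/a.txt"
printf 'four five six\n' > "$dir/b.txt"

# For every line: file name, per-file line number, global line number, field count.
show_vars() {
  awk '{print FILENAME, FNR, NR, NF}' "$@"
}

(cd "$dir" && show_vars a.txt b.txt)
# remove the directory with: rm -r "$dir"
```

This should print:

a.txt 1 1 2
a.txt 2 2 1
b.txt 1 3 3

NR keeps counting across files, while FNR starts over at 1 when b.txt begins.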
User-Defined Variables
There is no need to “declare” a variable or initialize it. A variable “comes to life” the first time you use it. You can count (and print) how many lines fall in December 2015:
$ cat netflix.tsv | awk '/^2015-12/ {count++; print count, $0}'
1 2015-12-31 116.209999 117.459999 114.279999 114.379997 9245000 114.379997
2 2015-12-30 118.949997 119.019997 116.43 116.709999 8116200 116.709999
3 2015-12-29 118.190002 119.599998 116.919998 119.120003 8159200 119.120003
4 2015-12-28 117.260002 117.349998 113.849998 117.110001 8406300 117.110001
5 2015-12-24 118.220001 118.800003 117.300003 117.330002 3531300 117.330002
# snip
22 2015-12-01 124.470001 125.57 122.419998 125.370003 12528800 125.370003
Not having to declare variables is convenient, but it’s also error-prone. If you misspell a variable, there won’t be any warning, and it might take you a while to discover your mistake. You’ve been warned. Remember: this is a language that optimizes for one-liners.
What are variables initialized to?
$ awk 'BEGIN {print x + 2}' # => 2
$ awk 'BEGIN {x = x + 2; print x}' # => 2
$ awk 'BEGIN {print x}' # => <blank> -- empty string, really
# BEGIN will be discussed next...
An undefined x contains the empty string. The first time you access it, that’s what you get. Strings are converted to numbers for numerical operations:
x + 2
# expands:
"" + 2
# expands:
0 + 2
# expands:
2
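This is also why a misspelled variable name goes unnoticed. In the sketch below, cuont is a deliberate typo for count; the sample rows and the typo_demo helper are made up for this example.

```shell
# "cuont" is a deliberate misspelling of "count".
# awk treats it as a brand-new variable: empty string, 0 in arithmetic.
typo_demo() {
  awk '/^2015-12/ {count++; print "typo:", cuont + 0, "intended:", count}'
}

printf '%s\n' '2015-12-31 116.2' '2015-12-30 118.9' '2016-01-04 109.0' | typo_demo
```

This prints "typo: 0 intended: 1" and then "typo: 0 intended: 2", with no error or warning. If you happen to use GNU awk, its --lint option can warn about references to uninitialized variables; other implementations may offer nothing similar.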
Special Patterns: BEGIN and END
BEGIN and END are special conditions that only get triggered once per run.
- BEGIN gets triggered before processing any line
- END gets triggered after all lines have been processed
These conditions get triggered even if there are no input lines.
BEGIN is usually used to initialize variables, though now you know that’s not necessary for zeroes or empty strings. It can also be used to print a header.
END is usually used to crunch a result and print a summary or report:
$ cat netflix.tsv | awk 'END {print NR}'
(achievement unlocked: you have re-created wc -l…)
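Here is a slightly bigger sketch that uses both: BEGIN prints a header and END prints a small report. The three rows are made up and trimmed to three columns (date, open, high), and high_report is a hypothetical helper name.

```shell
high_report() {
  awk '
    BEGIN { print "DATE HIGH" }                     # runs once, before any input
    { print $1, $3 }                                # runs for every line
    $3 + 0 > max + 0 { max = $3; day = $1 }         # remember the largest value in column 3
    END { print "lines:", NR; print "max high:", max, "on", day }
  '
}

printf '%s\n' \
  '2015-12-31 116.21 117.46' \
  '2015-12-30 118.95 119.02' \
  '2015-12-29 118.19 119.60' | high_report
```

The output starts with the DATE HIGH header, lists the three rows, and ends with "lines: 3" and "max high: 119.60 on 2015-12-29". Adding 0 on both sides of the comparison forces it to be numeric, which is a cautious choice rather than a requirement. If you run high_report < /dev/null, the header and "lines: 0" still appear, because BEGIN and END fire even without input.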
Blocks and Control
You can have multiple condition-block pairs. Each line in the input files gets presented to each block you write:
$ cat netflix.tsv | awk '/^2016-03-24/ {print} $4 == 96.43 {print}'
2016-03-24 98.639999 98.849998 97.07 98.360001 10646900 98.360001
2016-03-15 97.870003 98.510002 96.43 97.860001 9678000 97.860001
# could be written as:
#
# $ cat netflix.tsv | awk '/^2016-03-24/; $4 == 96.43'
#
# because we both know what a missing block means...
# but for this example, it's a bit opaque.
That works great until you have a line that matches both conditions:
$ cat netflix.tsv | awk '/^2016-03-24/ {print} $4 == 97.07 {print}'
2016-03-24 98.639999 98.849998 97.07 98.360001 10646900 98.360001
2016-03-24 98.639999 98.849998 97.07 98.360001 10646900 98.360001
The same line was printed twice! There are two solutions to this problem:
- making your conditions mutually exclusive (which could be easy, but is often tedious and redundant)
- using the next statement:
$ cat netflix.tsv | awk '/^2016-03-24/ {print; next} $4 == 97.07 {print}'
2016-03-24 98.639999 98.849998 97.07 98.360001 10646900 98.360001
If you hit a next, your script will stop matching blocks and go to the next line from the input file. Using next means you have to think about the order of your blocks.
That’s not necessarily a bad thing.
There’s also an exit statement to stop processing any more input and exit your script. The END block will still be executed, if you have one.
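A minimal sketch of exit, using five made-up input lines and a hypothetical helper named first_two_then_stop:

```shell
first_two_then_stop() {
  awk '
    NR > 2 { exit }                        # third line: stop reading input
    { print }                              # only reached for lines 1 and 2
    END { print "stopped at line", NR }    # still runs after exit
  '
}

printf '%s\n' a b c d e | first_two_then_stop
```

This prints a, b, and then "stopped at line 3". The remaining lines are never read, which is handy when you only need the top of a large file.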
Taking inventory: what can you do?
At this point, a better question would be: what can’t you do?
In review, you can:
- match a line with regular expressions
- match a line with any comparison or logical operator
- use built-in variables, both in conditions and in blocks
- use your own variables, for all other needs
- control what happens at the beginning and the end of your script
- skip lines or exit early
Exercises
Try to:
- only print lines between February 29, 2016 and March 4, 2016
- sum the volumes for all days of January 2016
- average the closing price over all days of March 2015
- check that all lines have 7 columns
- only print every other line (say, even lines)
- remove empty lines in a file
Answers are here.