Awk Tutorial, part 2

May 4, 2016

I already mentioned why you should learn AWK.
In part 1, we laid a solid foundation.
Let’s build on top of that.

NOTE: certain command outputs have been pretty-printed. Pipe through column -t to obtain similar results.

Matching with Regular Expressions

So far, I’ve shown you ways to match lines based on column values. In practice, you usually want to match lines with regular expressions. For example, you can extract data from 2015:

$ cat netflix.tsv | awk '/^2015-/'
2015-12-31 116.209999 117.459999 114.279999 114.379997 9245000 114.379997
2015-12-30 118.949997 119.019997 116.43 116.709999 8116200 116.709999
2015-12-29 118.190002 119.599998 116.919998 119.120003 8159200 119.120003
2015-12-28 117.260002 117.349998 113.849998 117.110001 8406300 117.110001
2015-12-24 118.220001 118.800003 117.300003 117.330002 3531300 117.330002
# snip

(achievement unlocked: you have re-created grep…)

A regular expression, by itself, is a shorthand for the condition: $0 ~ /regex/

awk '/^2015-/'
# means:
awk '$0 ~ /^2015-/'

This means you can match regular expressions on specific columns. You can extract the data for the 1st of every month:

$ cat netflix.tsv | awk '$1 ~ /-01$/'
2016-03-01 94.580002 99.160004 93.610001 98.300003 16997700 98.300003
2016-02-01 91.790001 97.18 91.300003 94.089996 19618000 94.089996
2015-12-01 124.470001 125.57 122.419998 125.370003 12528800 125.370003
2015-10-01 102.910004 106.110001 101.120003 105.980003 17426900 105.980003
2015-09-01 109.349998 111.239998 103.82 105.790001 35977100 105.790001
# snip

That’s already way better than grep.
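To try this without the full data file, here is a small self-contained sketch. The `rows` helper is hypothetical (it is not part of the tutorial); it just replays three lines copied from the outputs above, tab-separated like `netflix.tsv` is assumed to be.

```shell
# Hypothetical helper: replays three sample lines from the outputs above.
rows() {
  printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    2016-03-01 94.580002 99.160004 93.610001 98.300003 16997700 98.300003 \
    2015-12-31 116.209999 117.459999 114.279999 114.379997 9245000 114.379997 \
    2015-12-01 124.470001 125.57 122.419998 125.370003 12528800 125.370003
}

# A bare regex is matched against the whole line ($0).
whole_line=$(rows | awk '/^2015-/ {print $1}')

# The same condition, spelled out.
explicit=$(rows | awk '$0 ~ /^2015-/ {print $1}')

# Regex restricted to the first column: dates ending in -01.
first_of_month=$(rows | awk '$1 ~ /-01$/ {print $1}')

printf '%s\n' "$whole_line" "$first_of_month"
```

The last condition is the one plain grep struggles with: `-01$` anchors at the end of the first column, not at the end of the line.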

Comparisons and Logic

I glossed over this in part 1, but AWK has all the usual comparison operators:

$2 == 124.47    # equality
$2 != 124.47    # inequality

$2 > 124.47     # greater than
$2 >= 124.47    # greater than or equal
$2 < 124.47     # smaller than
$2 <= 124.47    # smaller than or equal

$2 ~ /^10.$/    # regex match
$2 !~ /^10.$/   # regex negated match -- this one might be new

and logical operators:

$1 ~ /^2015/ && $6 > 20000000   # and -- high volume in 2015
$6 < 1000000 || $6 > 20000000   # or  -- low or high volume
! /^2015/                       # not -- not in 2015
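
Here is a runnable sketch of those logical operators on a few sample lines. The `rows` helper is hypothetical and only replays lines shown elsewhere in this tutorial; the 5000000 threshold is my own choice so that the small sample has a "low volume" line.

```shell
# Hypothetical helper: four sample lines from the tutorial's outputs.
rows() {
  printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    2015-12-31 116.209999 117.459999 114.279999 114.379997 9245000 114.379997 \
    2015-12-24 118.220001 118.800003 117.300003 117.330002 3531300 117.330002 \
    2015-09-01 109.349998 111.239998 103.82 105.790001 35977100 105.790001 \
    2016-02-01 91.790001 97.18 91.300003 94.089996 19618000 94.089996
}

# and: high volume in 2015
high_2015=$(rows | awk '$1 ~ /^2015/ && $6 > 20000000 {print $1}')

# or: low or high volume (assumed threshold of 5000000 for "low")
extreme=$(rows | awk '$6 < 5000000 || $6 > 20000000 {print $1}')

# not: anything outside 2015
not_2015=$(rows | awk '! /^2015/ {print $1}')

printf '%s\n' "$high_2015" "$extreme" "$not_2015"
```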

You can create almost arbitrarily complex conditions. The one thing you are still missing is variables…

Built-in Variables

Some variables just “exist”: they already contain values and are updated automatically. These variables are easy to recognize because they are named in CAPITAL letters. Exception: column variables (starting with a $) are also built-in variables.

There are a bunch of built-in variables, but you’ll mostly use 2:

And 2 more if you’re dealing with multiple files:
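As a sketch, assuming the variables meant here are NR (record number so far) and NF (number of fields in the current line), plus FNR (record number within the current file) and FILENAME for multiple files. All four are standard AWK variables; the file names `one.txt` and `two.txt` are made up for the example.

```shell
# Two throwaway input files in a temporary directory.
tmp=$(mktemp -d)
printf 'a b c\nd e\n' > "$tmp/one.txt"
printf 'f\n' > "$tmp/two.txt"

# For every line: which file, line number in that file,
# overall line number, and number of fields.
out=$(cd "$tmp" && awk '{print FILENAME, FNR, NR, NF}' one.txt two.txt)
printf '%s\n' "$out"

rm -rf "$tmp"
```

Notice that FNR restarts at 1 for the second file while NR keeps counting.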

User-Defined Variables

There is no need to “declare” a variable or initialize it. A variable “comes to life” when you first use it. You can count (and print) how many lines fall in December 2015:

$ cat netflix.tsv | awk '/^2015-12/ {count++; print count, $0}'
1 2015-12-31 116.209999 117.459999 114.279999 114.379997 9245000 114.379997
2 2015-12-30 118.949997 119.019997 116.43 116.709999 8116200 116.709999
3 2015-12-29 118.190002 119.599998 116.919998 119.120003 8159200 119.120003
4 2015-12-28 117.260002 117.349998 113.849998 117.110001 8406300 117.110001
5 2015-12-24 118.220001 118.800003 117.300003 117.330002 3531300 117.330002
# snip
22 2015-12-01 124.470001 125.57 122.419998 125.370003 12528800 125.370003

Not having to declare variables is convenient, but it’s also error-prone. If you misspell a variable, there won’t be any warning and it might take you a while to discover your mistake. You’ve been warned. Remember: this is a language that optimizes for one-liners.
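A tiny sketch of that pitfall: the misspelled `cuont` (a deliberate typo for this example) is silently treated as a brand-new, empty variable.

```shell
# Counts three lines correctly.
right=$(printf 'a\nb\nc\n' | awk '{count++} END {print count + 0}')

# Same script with a typo in END: no error, just a wrong answer.
typo=$(printf 'a\nb\nc\n' | awk '{count++} END {print cuont + 0}')

printf 'right=%s typo=%s\n' "$right" "$typo"
```

If you use GNU awk, its `--lint` option should warn about references to uninitialized variables; other implementations may not have an equivalent.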

What are variables initialized to?

$ awk 'BEGIN {print x + 2}'          # => 2
$ awk 'BEGIN {x = x + 2; print x}'   # => 2
$ awk 'BEGIN {print x}'              # => <blank> -- empty string, really
# BEGIN will be discussed next...

An undefined x contains the empty string. The first time you access it, that’s what you get. Strings are converted to numbers for numerical operations:

x + 2
# expands:
"" + 2
# expands:
0 + 2
# expands:
2
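
The same conversion applies to any string, not just the empty one. A small sketch of what to expect (the literal strings are arbitrary examples):

```shell
out=$(awk 'BEGIN {
  print "3" + 2        # numeric string: converted to 3
  print "3abc" + 2     # only the leading numeric part counts
  print "abc" + 2      # no numeric prefix: treated as 0
  print length(x "")   # unset x concatenated with "": still empty
}')
printf '%s\n' "$out"
```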

Special Patterns: BEGIN and END

BEGIN and END are special conditions that get triggered only once per run: BEGIN before the first input line is read, END after the last one has been processed.

These conditions get triggered even if there are no input lines.

BEGIN is usually used to initialize variables – though now you know that’s not necessary for zeroes or empty strings. It can also be used to print a header.

END is usually used to crunch a result and print a summary or report:

$ cat netflix.tsv | awk 'END {print NR}'

(achievement unlocked: you have re-created wc -l…)
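To put BEGIN and END together, here is a sketch that prints a header, echoes two columns, and finishes with an average. The `rows` helper is hypothetical and replays two lines from the outputs above; the report layout is just one way to do it.

```shell
# Hypothetical helper: two sample lines from the tutorial's outputs.
rows() {
  printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    2015-12-31 116.209999 117.459999 114.279999 114.379997 9245000 114.379997 \
    2015-12-30 118.949997 119.019997 116.43 116.709999 8116200 116.709999
}

report=$(rows | awk '
  BEGIN { print "date", "volume" }       # header, before any input
        { total += $6; print $1, $6 }    # runs for every line
  END   { if (NR > 0) print "average volume:", total / NR }
')
printf '%s\n' "$report"

# Both blocks still run when there is no input at all.
empty=$(awk 'BEGIN {print "start"} END {print "end", NR}' < /dev/null)
printf '%s\n' "$empty"
```

The `NR > 0` check is there only to avoid dividing by zero on empty input.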

Blocks and Control

You can have multiple condition-block pairs. Each line in the input files gets presented to each block you write:

$ cat netflix.tsv | awk '/^2016-03-24/ {print} $4 == 96.43 {print}'
2016-03-24 98.639999 98.849998 97.07 98.360001 10646900 98.360001
2016-03-15 97.870003 98.510002 96.43 97.860001 9678000 97.860001

# could be written as:
#
# $ cat netflix.tsv | awk '/^2016-03-24/; $4 == 96.43'
#
# because we both know what a missing block means...
# but for this example, it's a bit opaque.

That works great until you have a line that matches both conditions:

$ cat netflix.tsv | awk '/^2016-03-24/ {print} $4 == 97.07 {print}'
2016-03-24 98.639999 98.849998 97.07 98.360001 10646900 98.360001
2016-03-24 98.639999 98.849998 97.07 98.360001 10646900 98.360001

The same line was printed twice! There are two solutions for this problem:

$ cat netflix.tsv | awk '/^2016-03-24/ {print; next} $4 == 97.07 {print}'
2016-03-24 98.639999 98.849998 97.07 98.360001 10646900 98.360001

When AWK hits a next, it stops matching blocks against the current line and moves on to the next line of input. Using next means you have to think about the order of your blocks. That’s not necessarily a bad thing.
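Another way to avoid the duplicate, when both blocks do the same thing, is to merge the two conditions with `||`. This sketch compares the three variants; the `rows` helper is hypothetical and replays the two lines from the example above.

```shell
# Hypothetical helper: the two sample lines used in this section.
rows() {
  printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    2016-03-24 98.639999 98.849998 97.07 98.360001 10646900 98.360001 \
    2016-03-15 97.870003 98.510002 96.43 97.860001 9678000 97.860001
}

# Two independent blocks: the first line matches both and prints twice.
twice=$(rows | awk '/^2016-03-24/ {print $1} $4 == 97.07 {print $1}')

# next skips the remaining blocks for the current line.
with_next=$(rows | awk '/^2016-03-24/ {print $1; next} $4 == 97.07 {print $1}')

# One block, one combined condition.
combined=$(rows | awk '/^2016-03-24/ || $4 == 97.07 {print $1}')

printf '%s\n' "$twice" "$with_next" "$combined"
```

Merging only works when the blocks are identical; if each condition needs its own action, next is the tool to reach for.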

There’s also an exit statement to stop processing any more input and exit your script. The END block will still be executed, if you have one.
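A minimal sketch of exit, on made-up input: it prints the first two lines, stops reading, and still runs END.

```shell
out=$(printf '1\n2\n3\n4\n' | awk '
  NR > 2 { exit }                        # stop at the third line
         { print }                       # print everything before that
  END    { print "stopped at line", NR }
')
printf '%s\n' "$out"
```

Without the END block, this is roughly `head -n 2`.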

Taking inventory: what can you do?

At this point, a better question would be: what can’t you do?

In review, you can:

Exercises

Try to:

Answers are here.

What’s next?

Part 3
