Awk Tutorial, part 1

April 5, 2016

I already mentioned why you should learn AWK.
Let me show you how you can start using it today.

Example Data

I think it’s hard to learn AWK in a vacuum. I looked for open data on the web and picked Netflix historical stock prices. The CSV data is available to download from Yahoo Finance or Google Finance. It’s possible to parse this CSV data in AWK, but I replaced commas with TAB characters to make the examples easier. Here is the data we’re going to use:

Date        Open        High        Low         Close       Volume     Adj Close
2016-03-24  98.639999   98.849998   97.07       98.360001   10646900   98.360001
2016-03-23  99.75       100.389999  98.809998   99.589996   8292300    99.589996
2016-03-22  100.480003  101.519997  99.199997   99.839996   9039500    99.839996
2016-03-21  101.150002  102.099998  99.50       101.059998  9562900    101.059998
2016-03-18  100.50      102.410004  100.010002  101.120003  15437300   101.120003
...

Download the data if you want to try examples yourself.
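
If you start from the original CSV download instead, the comma-to-TAB conversion mentioned above can be sketched with tr. This is only a sketch: it assumes no field contains a quoted comma (which holds for this price data), and the file names netflix.csv and netflix.tsv as well as the two-line sample are placeholders I made up for illustration.

```shell
# Build a tiny CSV sample (hypothetical file name; rows copied from the table above)
printf '%s\n' \
  'Date,Open,High,Low,Close,Volume,Adj Close' \
  '2016-03-24,98.639999,98.849998,97.07,98.360001,10646900,98.360001' > netflix.csv

# Replace every comma with a TAB character
tr ',' '\t' < netflix.csv > netflix.tsv

cat netflix.tsv
```

A real-world CSV with quoted fields would need a proper CSV parser rather than tr.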

Printing Columns

Printing columns is probably one of the most useful things you can do in AWK:

$ cat netflix.tsv | awk '{print $2}'
Open
98.639999
99.75
100.480003
101.150002
100.50
# snip

Let’s take it one step at a time:

Alternatively, awk '{print $2}' netflix.tsv would have given us the same result. For this tutorial, I use cat to visually separate the input data from the AWK program itself. This also emphasizes that AWK can process any input, not just existing files.

Yes, you need the curly brackets – I’ll come to that shortly. You already guessed it: column 1 is $1, column 2 is $2, column 7 is $7, etc…

There are 3485 lines in the data file. For most examples, I’ll truncate the output because more isn’t always better.
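
To make the pipe-versus-file equivalence concrete, here is a small sketch that feeds the same three lines to AWK both ways and keeps the results for comparison. The file name sample.tsv and the variable names are made up for this example; the rows are copied from the data above.

```shell
# Three TAB-separated lines from the data above (sample.tsv is a hypothetical file name)
printf 'Date\tOpen\tHigh\n2016-03-24\t98.639999\t98.849998\n2016-03-23\t99.75\t100.389999\n' > sample.tsv

# Input arrives on standard input through a pipe
from_pipe=$(cat sample.tsv | awk '{print $2}')

# Input is read from the file named on the command line
from_file=$(awk '{print $2}' sample.tsv)

echo "$from_pipe"
```

Both variables should hold the same three lines: Open, 98.639999 and 99.75.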

Always Use Single-Quotes with AWK

Let’s get this out of the way: always use single-quotes with AWK.

As you’ve seen above, column names have dollar signs in them ($1, $2, $7…) which would normally be substituted by BASH. Single-quotes are how you tell BASH to keep the content of your strings untouched. Double-quotes won’t work because BASH still substitutes $2 inside them before AWK ever sees it, and backslash escapes might work but are not worth fighting for.

Let’s keep things simple with single-quotes.

If you need to inject some values into your script, I’ll show you how in a follow-up tutorial.
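
Here is a small sketch of what goes wrong with double-quotes. The one-line input and the variable names are invented for the example, and the set -- line is only there to make sure the shell’s own $2 is empty, as it usually is at an interactive prompt.

```shell
# Clear the positional parameters so the shell's own $2 is empty
set --

# Single quotes: AWK receives {print $2} and prints the second column
single=$(printf 'a\tb\n' | awk '{print $2}')

# Double quotes: the shell replaces $2 with an empty string first,
# so AWK receives {print } and prints the whole line instead
double=$(printf 'a\tb\n' | awk "{print $2}")

echo "single: $single"
echo "double: $double"
```

The double-quoted version does not even fail loudly: it quietly runs a different program, which is a good reason to stick to single-quotes.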

What’s With Those Curly Brackets? { }

What’s the difference between:

awk '{print $2}'

awk 'print $2'

Answer: one works and the other doesn’t! (rimshot) We’ll need to take a step back to explain the difference. In AWK, a program is composed of rules which look like:

some-condition { one or many statements }

If it were C code:

if (some-condition) { one or many statements; }

In short, the curly brackets ({ }) tell AWK to do something. AWK allows either the condition or the action to be missing.

What does it mean when the condition is missing?

A missing condition defaults to “always run”:

awk '{print $2}'
# means:
awk '1 {print $2}'      # 0 is false, any other value is true

In other words: the condition is always true, so the 2nd column is printed for every line.
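
A quick way to convince yourself is to run the same action with no condition, with 1, and with 0. This is a minimal sketch on two made-up input lines; the variable names are mine.

```shell
# Two sample lines; the second column holds the values we print
data=$(printf 'x\t10\ny\t20')

implicit=$(printf '%s\n' "$data" | awk '{print $2}')    # no condition: runs on every line
always=$(printf '%s\n' "$data" | awk '1 {print $2}')    # 1 is true: same result
never=$(printf '%s\n' "$data" | awk '0 {print $2}')     # 0 is false: the action never runs

echo "$implicit"
```

The first two variables hold the same two lines (10 and 20), and the last one stays empty.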

What does it mean when the action is missing?

A missing action defaults to “print”:

$ cat netflix.tsv | awk '$2 > 100'
Date        Open        High        Low         Close       Volume     Adj Close
2016-03-22  100.480003  101.519997  99.199997   99.839996   9039500    99.839996
2016-03-21  101.150002  102.099998  99.50       101.059998  9562900    101.059998
2016-03-18  100.50      102.410004  100.010002  101.120003  15437300   101.120003
2016-03-07  101.00      101.790001  95.25       95.489998   23855200   95.489998
2016-01-22  104.720001  104.989998  99.220001   100.720001  26772700   100.720001
# snip -- output has been reformatted to align

A missing block just prints the whole matching line.

awk '$2 > 100'
# means:
awk '$2 > 100 { print }'
# means:
awk '$2 > 100 { print $0 }'

$0 is a special variable that contains the current line, before it was separated into fields. print $0 means “print the current line”. print, by itself, also prints the current line.
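
The sketch below checks that the three spellings really are equivalent, then combines a condition with an action. The file name prices.tsv and the variable names are made up; the rows are taken from the data above. As far as I can tell, the header line matches because "Open" is not a number, so AWK compares it with 100 as a string, and letters sort after digits.

```shell
# Header plus three rows, TAB-separated (prices.tsv is a hypothetical file name)
printf 'Date\tOpen\n2016-03-23\t99.75\n2016-03-22\t100.480003\n2016-03-21\t101.150002\n' > prices.tsv

# Three ways to say "print the matching lines"
no_action=$(awk '$2 > 100' prices.tsv)
with_print=$(awk '$2 > 100 { print }' prices.tsv)
with_line=$(awk '$2 > 100 { print $0 }' prices.tsv)

# Condition and action together: print only the date of each matching line
dates=$(awk '$2 > 100 { print $1 }' prices.tsv)

echo "$dates"
```

The last command prints Date, 2016-03-22 and 2016-03-21; the 99.75 row is filtered out.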

More Printing

You know how to print one column, but what if you need to print many?

$ cat netflix.tsv | awk '{print $1, $6, $5}'
Date        Volume    Close
2016-03-24  10646900  98.360001
2016-03-23  8292300   99.589996
2016-03-22  9039500   99.839996
2016-03-21  9562900   101.059998
2016-03-18  15437300  101.120003
# snip -- output has been reformatted to align

A comma between print values will insert a space in the output. AWK also has printf which unleashes infinite formatting power:

$ cat netflix.tsv | awk '{printf "%s %15s %.1f\n", $1, $6, $5}' | sed 1d
2016-03-24        10646900 98.4
2016-03-23         8292300 99.6
2016-03-22         9039500 99.8
2016-03-21         9562900 101.1
2016-03-18        15437300 101.1
# snip

I removed the header line, which had been mangled in the printf.
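
If the format string looks cryptic, this sketch puts square brackets around each value so the padding becomes visible. The single input line is cut down from the data above (date, close, volume), and %-15s is an extra specifier that the example above does not use; I added it to show left alignment.

```shell
# One data line: date, close, volume
line=$(printf '2016-03-24\t98.360001\t10646900')

# %s     the string as-is
# %15s   right-aligned, padded with spaces to 15 characters
# %-15s  left-aligned, padded with spaces to 15 characters
# %.1f   a number rounded to 1 decimal place
formatted=$(printf '%s\n' "$line" | awk '{printf "[%s] [%15s] [%-15s] [%.1f]\n", $1, $3, $3, $2}')

echo "$formatted"
```

Note that printf, unlike print, does not add a newline for you, hence the \n at the end of the format string.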

AWK does string concatenation without an operator: just put 2 values next to each other. This is useful when you don’t want to reach for printf but still want some formatting flexibility:

$ cat netflix.tsv | awk '{print $1 "," $6}'
Date,Volume
2016-03-24,10646900
2016-03-23,8292300
2016-03-22,9039500
2016-03-21,9562900
2016-03-18,15437300
# snip

Ooooh, we’re back to CSV.
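
To see the difference between a comma and plain concatenation side by side, here is a minimal sketch on a single made-up two-column line; the variable names are mine.

```shell
# One line with two TAB-separated columns
line=$(printf '2016-03-24\t10646900')

with_comma=$(printf '%s\n' "$line" | awk '{print $1, $2}')     # comma: values separated by a space
adjacent=$(printf '%s\n' "$line" | awk '{print $1 $2}')        # adjacent: values glued together
with_text=$(printf '%s\n' "$line" | awk '{print $1 "," $2}')   # a literal string in the middle

echo "$with_comma"
echo "$adjacent"
echo "$with_text"
```

This prints "2016-03-24 10646900", then "2016-03-2410646900", then "2016-03-24,10646900".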

Taking inventory: what can you do?

This is just the beginning, and there’s more to cover. But you now have a solid foundation: you know about conditions and actions, columns and printing. You can:

Exercises

Try to:

Answers are here.

What’s next?

Part 2
