Awk Tutorial, part 1
I already mentioned why you should learn AWK.
Let me show you how you can start using it today.
Example Data
I think it’s hard to learn AWK in a vacuum. I looked for open data on the web and picked Netflix historical stock prices. The CSV data is available to download from Yahoo Finance or Google Finance. It’s possible to parse this CSV data in AWK, but I replaced commas with TAB characters to make the examples easier. Here is the data we’re going to use:
Download the data if you want to try examples yourself.
Printing Columns
Printing columns is probably the most useful thing you can do in AWK:
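Here is that command as a runnable sketch. Since the output depends on the data file, it runs against a tiny stand-in, sample.tsv, which is a hypothetical three-column file (Date, Open, Close) whose rows are made-up placeholders, not real Netflix prices.

```shell
# sample.tsv is a hypothetical 3-column stand-in for netflix.tsv.
# The rows are made-up placeholders, not real prices.
printf 'Date\tOpen\tClose\n'          >  sample.tsv
printf '2016-01-04\t100.00\t104.00\n' >> sample.tsv
printf '2016-01-05\t104.00\t102.00\n' >> sample.tsv

# Send the file to AWK on STDIN and print the 2nd column of every line.
cat sample.tsv | awk '{print $2}'
# Open
# 100.00
# 104.00
```

Swap sample.tsv for netflix.tsv to run the same program on the real data.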
Let’s take it one step at a time:
- cat netflix.tsv | awk: to send netflix.tsv to the STDIN of AWK. Alternatively, awk '{print $2}' netflix.tsv would have given us the same result. For this tutorial, I use cat to visually separate the input data from the AWK program itself. Using cat also emphasizes that AWK can treat any input and not just existing files.
- {print $2}: to print the 2nd column. Yes, you need the curly brackets – I’ll come to that shortly. You already guessed it: column 1 is $1, column 2 is $2, column 7 is $7, etc…
- # snip: to indicate omitted output. There are 3485 lines in the data file. For most examples, I’ll truncate the output because more isn’t always better.
Always Use Single-Quotes with AWK
Let’s get this out of the way: always use single-quotes with AWK.
As you’ve seen above, column names have dollar signs in them ($1, $2, $7…) which would usually be substituted by BASH. Single-quotes are how you tell BASH to keep the content of your strings untouched. Double-quotes won’t work, and backslash escapes might work but are not worth fighting for.
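A quick way to see what the shell does to each kind of quoting is to echo the program text before handing it to AWK. This is a small sketch on a made-up input line; the set -- line is only there to make sure the shell's own $2 is empty for the demonstration.

```shell
# Single quotes: the shell passes the program to AWK untouched.
echo '{print $2}'                      # AWK would receive: {print $2}
printf 'a b c\n' | awk '{print $2}'    # prints: b

# Double quotes: the shell expands $2 (its own 2nd positional parameter)
# before AWK ever runs. With no such parameter set, AWK receives {print }.
set --                                 # clear the shell's positional parameters
echo "{print $2}"                      # AWK would receive: {print }
printf 'a b c\n' | awk "{print $2}"    # prints the whole line: a b c
```

The double-quoted version does not fail loudly; it quietly runs a different program, which is what makes it so easy to miss.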
Let’s keep things simple with single-quotes.
If you need to inject some values into your script, I’ll show you how in a follow-up tutorial.
What’s With Those Curly Brackets? { }
What’s the difference between:
Answer: one works and the other doesn’t! (rimshot) We’ll need to take a step back to explain the difference. In AWK, a program is composed of rules which look like: condition { action }
If it were C code: if (condition) { action }
In short, the curly brackets ({ }) tell AWK to do something. AWK allows either the condition or the action to be missing.
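One way to see the difference for yourself, assuming the comparison is between the same print statement with and without its curly brackets (the input line is made up):

```shell
# With curly brackets: print is an action, so AWK runs it for every line.
printf 'a b c\n' | awk '{print $2}'    # prints: b

# Without curly brackets: print shows up where AWK expects a condition,
# which is not valid, so AWK stops with a syntax error on STDERR.
printf 'a b c\n' | awk 'print $2' || echo 'awk rejected the program'
```

The exact wording of the error message varies between AWK implementations, so it is not shown here.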
What does it mean when the condition is missing?
A missing condition defaults to “always run”:
if true, print the 2nd column.
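To see that default in action, this small sketch (on two made-up input lines) runs the same action twice: once with no condition, and once with an explicit always-true condition, 1. Both print the 2nd column of every line.

```shell
# No condition: the action runs for every line.
printf 'a b c\nd e f\n' | awk '{print $2}'
# b
# e

# Explicit always-true condition: same result.
printf 'a b c\nd e f\n' | awk '1 {print $2}'
# b
# e
```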
What does it mean when the action is missing?
A missing action defaults to “print”:
A missing block prints the whole matching line.
$0 is a special variable that contains the current line, before it was separated into fields. print $0 means “print the current line”. print, by itself, also prints the current line.
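The three spellings can be checked side by side. This sketch uses a hypothetical headerless sample.tsv with made-up Date, Open and Close values, and an arbitrary condition ($2 > 101) chosen only for illustration. The header is left out on purpose, because comparing header text against a number would not behave like a numeric comparison.

```shell
# Hypothetical headerless stand-in data: Date, Open, Close (made-up values).
printf '2016-01-04\t100.00\t104.00\n' >  sample.tsv
printf '2016-01-05\t104.00\t102.00\n' >> sample.tsv
printf '2016-01-06\t102.00\t109.00\n' >> sample.tsv

# Condition only: the missing action defaults to printing the whole line.
awk '$2 > 101' sample.tsv

# The same thing, spelled out with print and with print $0.
awk '$2 > 101 {print}' sample.tsv
awk '$2 > 101 {print $0}' sample.tsv
```

Each of the three commands prints the same two lines, the ones dated 2016-01-05 and 2016-01-06.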
More Printing
You know how to print one column, but what if you need to print many?
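As a minimal sketch, list the columns after print, separated by commas, in whatever order you like. The file sample.tsv below is a hypothetical three-column stand-in (Date, Open, Close) with made-up rows.

```shell
# Hypothetical stand-in data with made-up values.
printf 'Date\tOpen\tClose\n'          >  sample.tsv
printf '2016-01-04\t100.00\t104.00\n' >> sample.tsv
printf '2016-01-05\t104.00\t102.00\n' >> sample.tsv

# Print the 3rd column, then the 1st.
cat sample.tsv | awk '{print $3, $1}'
# Close Date
# 104.00 2016-01-04
# 102.00 2016-01-05
```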
A comma between print values inserts the output field separator, a single space by default. AWK also has printf, which unleashes infinite formatting power:
I removed the header line, which had been mangled in the printf.
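A small printf sketch, on a hypothetical headerless sample.tsv with made-up Date, Open and Close values. The format string here is only an illustration. Note that printf, unlike print, does not add a newline for you, and that a numeric format such as %.1f would turn header text like Close into 0.0, which is one reason to leave the header out.

```shell
# Hypothetical headerless stand-in data: Date, Open, Close (made-up values).
printf '2016-01-04\t100.00\t104.00\n' >  sample.tsv
printf '2016-01-05\t104.00\t102.00\n' >> sample.tsv

# %s inserts a string, %.1f a number with one decimal place.
# The trailing \n is required: printf adds no newline on its own.
cat sample.tsv | awk '{printf "%s closed at %.1f\n", $1, $3}'
# 2016-01-04 closed at 104.0
# 2016-01-05 closed at 102.0
```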
AWK does string concatenation without an operator: just put 2 values next to each other. This is useful when you don’t want to reach for printf but still want some formatting flexibility:
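For instance, placing a quoted comma between the fields joins them into one string. This sketch again uses a hypothetical three-column sample.tsv with made-up rows.

```shell
# Hypothetical stand-in data with made-up values.
printf 'Date\tOpen\tClose\n'          >  sample.tsv
printf '2016-01-04\t100.00\t104.00\n' >> sample.tsv
printf '2016-01-05\t104.00\t102.00\n' >> sample.tsv

# No operator between the values: AWK concatenates fields and "," strings.
cat sample.tsv | awk '{print $1 "," $2 "," $3}'
# Date,Open,Close
# 2016-01-04,100.00,104.00
# 2016-01-05,104.00,102.00
```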
Ooooh, we’re back to CSV.
Taking inventory: what can you do?
This is just the beginning, and there’s more to cover. But you now have a solid foundation: you know about conditions and actions, columns and printing. You can:
- print only the columns you want
- print them in the order you want
- format with all the power of printf
- use conditions to print only lines you want
Exercises
Try to:
- only print the ‘Date’, ‘Volume’, ‘Open’, ‘Close’ columns, in that order
- only print lines where the stock price increased (‘Close’ > ‘Open’)
- print the ‘Date’ column and the stock price difference (‘Close’ - ‘Open’)
- print an empty line between each line – double-space the file
Answers are here.