Awk

Awk is one of those tools that have been around on UNIX systems forever. Its syntax feels a little aged, which may be the reason a lot of programmers tend to shy away from it.

Awk was created to perform a set of actions on each line in a text file. Think of it as a spreadsheet for programmers. Awk can read from files or from the standard input stream, so you can either specify a file or pipe the output of a command to awk and let it do its thing.

The most basic usage of awk could be:

printf '1\n2\n39\n' | awk '{sum += $1}; END{print sum}'

Here we send three lines containing a single number each to awk, then tell awk to add the number in the first column of each line to the variable sum, and in the end print the accumulated value. The variable $1 represents the first column, $2 represents the second, and so on. The variable $0 represents the entire row.
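To make the column variables a bit more concrete, here is a small sketch that prints $0, $1 and $2 for every line. The two input lines are made up for the illustration.

```shell
# Hypothetical two-column input, invented for this illustration.
out=$(printf 'apple 3\npear 5\n' |
  awk '{ print "row=[" $0 "] first=" $1 " second=" $2 }')
echo "$out"
```

Each output line shows the whole row in brackets, followed by its first and second column.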

Fields/columns

So awk has this notion of columns in a line. Think of each line in your file as a record, and each whitespace-separated value on that line as a field, or column. Our first example used a single column. Now let's say we have the names and incomes of our employees in a text file, employees.txt, that looks like this:

John 100000
Jane 50000
Bill 200000

If we want to sum the income of all employees, we could do this:

cat employees.txt|awk '{sum += $2};END{print sum}'

Awk has a printf statement too, so we could make this a little better:

cat employees.txt|awk '{sum += $2};END{printf "Sum: %d\n", sum}'

Which outputs:

Sum: 350000

Variables

In this example, we created a variable sum that we simply incremented using +=, which should feel natural to a programmer. You can create as many variables as you like, and apply most mathematical functions to them. Let's calculate the average income from our example file:

cat employees.txt|awk '{sum += $2};END{average = sum/NR;printf "Sum: %d\nAverage: %d\n", sum, average}'

which outputs:

Sum: 350000
Average: 116666
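Note that %d cuts off the decimals of the average. Awk numbers are floating point, so a format such as %.2f keeps them. A minimal sketch, feeding the same three salaries through a pipe so it does not depend on employees.txt being present:

```shell
# Same three salaries as in employees.txt, piped in to keep the sketch self-contained.
out=$(printf 'John 100000\nJane 50000\nBill 200000\n' |
  awk '{ sum += $2 } END { printf "Average: %.2f\n", sum / NR }')
echo "$out"
```

This should print Average: 116666.67 instead of the truncated 116666.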

"Magic" variables

In the example above, we used a variable that awk provides for us, namely NR, which is the Number of Records (lines, by default) read so far. In the END block, that is the total number of lines in our file. Awk provides several of these, some of which are:

  • NR: The Number of Records read so far
  • NF: The Number of Fields in the current record
  • FILENAME: The filename of the input file
  • FS: The Field Separator character. Set this variable to something other than whitespace to split columns based on this character
  • RS: The Record Separator character, which defaults to a newline. Set it to something else to split rows based on this character

The GNU Awk User's Guide describes other variables offered by gawk, which is the version of awk installed on many Linux systems.
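A quick way to get a feel for NR and NF is to print them for every line. The sketch below uses made-up input with a different number of fields on each line. It also uses $NF, which is not covered above: since NF holds the number of fields, $NF refers to the last field on the line.

```shell
# Hypothetical input with a varying number of fields per line.
out=$(printf 'a b c\nd e\n' |
  awk '{ print NR ": " NF " fields, last is " $NF }')
echo "$out"
```

The first line reports 3 fields ending in c, the second 2 fields ending in e.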

BEGIN/END blocks

Awk lets you add special blocks that are run before it starts analyzing each line and after it has completed. Most Awk programs will have an END block which prints the results of the processing - we used it to print the total and average incomes in the examples above.

The BEGIN block can be used to set default values for variables. A contrived example, perhaps, but if we want to start off with an income total of 1000000 we could enter:

awk 'BEGIN{sum=1000000};{sum += $2};END{print sum}' employees.txt
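A perhaps less contrived use of BEGIN is to set FS before the first line is read. The sketch below assumes a comma-separated variant of the employee data, which is invented here and not part of the original file:

```shell
# Hypothetical comma-separated version of the employee data.
out=$(printf 'John,100000\nJane,50000\nBill,200000\n' |
  awk 'BEGIN { FS = "," } { sum += $2 } END { print sum }')
echo "$out"
```

Because FS is set in BEGIN, it already applies when the first line is split into fields, and the sum comes out as 350000.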

Conditionals

A better example of using the BEGIN block is with conditionals. Let's say we want to find the name of the employee with the highest income:

awk 'BEGIN{highest=0}; $2 > highest {highest=$2; winner=$1}; END{print winner}' employees.txt

which tells us who has the highest income:

Bill

Simply put an expression that evaluates to true or false, such as $2 > highest, in front of a block. The code inside the curly braces is executed only for lines where the expression is true. Here the block also updates highest, so that later lines are compared against the highest income seen so far.
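As a further sketch of conditions in front of blocks, the following splits the employees into two groups. The threshold of 100000 is an arbitrary value picked for the illustration:

```shell
# Same data as employees.txt; the 100000 threshold is made up for this example.
out=$(printf 'John 100000\nJane 50000\nBill 200000\n' |
  awk '$2 >= 100000 { print $1 " earns at least 100000" }
       $2 <  100000 { print $1 " earns less" }')
echo "$out"
```

Every line is tested against both conditions, and only the block whose condition is true runs for that line.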

Several blocks

You may have as many blocks in an awk program as you wish. A block is run once for each line in the input file, and is enclosed in curly braces:

{ sum_income += $1 }
{ sum_tax += $2 }

This adds the value of the first column to the sum_income variable, and the value of the second column to the sum_tax variable.
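To see both blocks at work, here is a minimal sketch that runs them over two invented lines, with income in the first column and tax in the second, and prints the totals in an END block:

```shell
# Hypothetical input: income in column 1, tax in column 2.
out=$(printf '1000 250\n2000 500\n' |
  awk '{ sum_income += $1 }
       { sum_tax += $2 }
       END { printf "income=%d tax=%d\n", sum_income, sum_tax }')
echo "$out"
```

Both blocks run for every line, in the order they are written, so this prints income=3000 tax=750.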

Program files

You may be a Perl guy, in which case one-liners like these feel natural. If you're like me, however, at some point you'll want to edit your code in a tool better suited to the job than a command line. Awk lets you pass an awk program file using the -f switch:

awk -f average.awk employees.txt

The file average.awk looks like this:

BEGIN {
    highest = 0
}
# Runs only for lines whose income is higher than any seen so far
$2 > highest {
    highest = $2
    winner = $1
}
END {
    printf("And the winner is %s\n", winner)
}

Real world usage

As an example of real-world usage, we will build an awk program that reads an Apache web server's access log and counts the number of requests from each IP address.

The Apache access log uses a single line for each request, with a set of fields separated by whitespace - which makes it ideal for being used with awk. A line might look like this:

127.0.0.1 - - [31/Oct/2010:23:20:52 +0100] "GET /foo HTTP/1.1" 200 363 "-" "-"

The first "field" in this line is the IP address making the request. Awk lets us use associative arrays just like any other variable:

{ requests[$1] += 1 }

This adds 1 to the element of the requests associative array whose key is the IP address in the first field. An element that does not exist yet counts as 0. Furthermore, we can loop over the elements of the array using the for syntax:

for (ip in requests)
    printf("Number of requests for %s: %d\n", ip, requests[ip])

This prints the key of each element in the requests array along with its numerical value.

The complete awk script to sum the number of requests for each IP address looks like this:

{
    requests[$1] += 1
}
END {
    for (ip in requests)
        printf("Number of requests for %s: %d\n", ip, requests[ip])
}
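To try the counting logic without a real access log, here is a sketch that feeds it three made-up log lines in the format shown above. The second IP address is invented. Awk does not promise any particular order for a for (ip in requests) loop, so this variation prints the count first and pipes the result through sort to list the busiest address at the top.

```shell
# Hypothetical log lines in the Apache format shown above; 10.0.0.7 is made up.
out=$(printf '%s\n' \
  '127.0.0.1 - - [31/Oct/2010:23:20:52 +0100] "GET /foo HTTP/1.1" 200 363 "-" "-"' \
  '10.0.0.7 - - [31/Oct/2010:23:20:53 +0100] "GET /bar HTTP/1.1" 200 120 "-" "-"' \
  '127.0.0.1 - - [31/Oct/2010:23:20:54 +0100] "GET /foo HTTP/1.1" 200 363 "-" "-"' |
  awk '{ requests[$1] += 1 }
       END { for (ip in requests) printf "%d %s\n", requests[ip], ip }' |
  sort -rn)
echo "$out"
```

With this input, 127.0.0.1 is listed first with 2 requests, followed by 10.0.0.7 with 1.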

Have fun!