Awk is one of those tools that have been around on UNIX systems forever. Its syntax feels a little aged, which may be the reason a lot of programmers tend to shy away from it.
Awk was created to perform a set of actions on each line of a text file. Think of it as a spreadsheet for programmers. Awk can read from files or from the standard input stream, so you can either specify a file or pipe the output of a command to awk and let it do its thing.
The most basic usage of awk could be:
printf "1\n2\n39\n" | awk '{sum += $1}; END{print sum}'
Here we send three lines, each containing a single number, to awk, then tell awk to add the number in the first column of each line to the variable sum, and at the end print the accumulated value. The variable $1 represents the first column, $2 represents the second, and so on. The variable $0 represents the entire row.
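To make the field variables concrete, here is a small sketch that prints $1, $2 and $0 for a single line. The input text is made up purely for illustration.

```shell
# Feed one whitespace-separated line to awk and print individual fields.
# The sample line "alpha beta gamma" is invented for this sketch.
fields=$(printf 'alpha beta gamma\n' |
  awk '{ print "first=" $1; print "second=" $2; print "whole=" $0 }')
echo "$fields"
```

By default awk splits each line on runs of spaces and tabs, which is why no separator has to be configured here.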
So awk has this notion of columns in a line. Think of each line in your file as a record, and each column as a field. Our first example used a single column. Let's say we have a text file, employees.txt, containing the name and income of our employees, that looks like this:
John 100000
Jane 50000
Bill 200000
If we want to sum the income of all employees, we could do this:
cat employees.txt|awk '{sum += $2};END{print sum}'
Awk supports printf too, so we could make this a little better:
cat employees.txt|awk '{sum += $2};END{printf "Sum: %d\n", sum}'
Which outputs:
Sum: 350000
In this example, we created a variable sum that we simply incremented using +=, which should feel natural to a programmer. You can create as many variables as you like, and apply most mathematical functions to them. Let's calculate the average income from our example file:
cat employees.txt|awk '{sum += $2};END{average = sum/NR;printf "Sum: %d\nAverage: %d\n", sum, average}'
which outputs:
Sum: 350000
Average: 116666
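The average is cut off at 116666 because %d prints only the integer part. As a sketch, assuming you want two decimals instead, a floating-point format such as %.2f can be used. The employee data is fed inline here so the snippet does not depend on employees.txt being present.

```shell
# Same data as employees.txt, supplied inline for a self-contained sketch.
# %.2f keeps two decimals, where %d would drop the fractional part.
avg=$(printf 'John 100000\nJane 50000\nBill 200000\n' |
  awk '{ sum += $2 } END { if (NR > 0) printf "%.2f\n", sum / NR }')
echo "$avg"
```

The NR > 0 check is only a guard against dividing by zero on empty input.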
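In the example above, we used a variable that awk provides for us, namely NR, which is the number of records (lines) read so far; in the END block it equals the total number of lines in our file. Awk provides several of these, such as NF (the number of fields in the current record) and FS (the field separator).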
The GNU Awk User's Guide describes other variables offered by gawk, which is probably the version of awk on your system.
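As a rough illustration of built-in variables, the sketch below prints NR and NF for each line of some made-up input. The extra "part-time" field on the second line is invented only to show NF changing.

```shell
# NR is the current record number, NF the number of fields on that record.
# The input lines are sample data, not taken from a real file.
report=$(printf 'John 100000\nJane 50000 part-time\nBill 200000\n' |
  awk '{ print NR ": " NF " fields" } END { print "total records: " NR }')
echo "$report"
```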
Awk lets you add special blocks that are run before it starts analyzing each line and after it has completed. Most Awk programs will have an END block which prints the results of the processing - we used it to print the total and average incomes in the examples above.
The BEGIN block can be used to set default values for variables. A contrived example, perhaps, but if we want to start off with an income total of 1000000 we could enter:
awk 'BEGIN{sum=1000000};{sum += $2};END{print sum}' employees.txt
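Another common use of BEGIN, sketched here under the assumption that the data is comma-separated rather than space-separated, is setting the field separator FS before any line is read. The comma-separated variant of the employee data is hypothetical.

```shell
# Hypothetical comma-separated version of the employee data.
# FS is set in BEGIN so it already applies to the first line.
total=$(printf 'John,100000\nJane,50000\nBill,200000\n' |
  awk 'BEGIN { FS = "," } { sum += $2 } END { print sum }')
echo "$total"
```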
A better example of using the BEGIN block is with conditionals. Let's say we want to find the name of the employee with the highest income:
awk 'BEGIN{highest=0}; $2 > highest {highest=$2; winner=$1}; END{print winner}' employees.txt
which tells us who has the highest income:
Bill
Simply start a block with an expression evaluating to true or false, here $2 > highest, and put the code to be executed when the expression is true inside the braces that follow. Note that the block must also update highest, otherwise every later line would be compared against the initial value of 0.
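In standard awk a conditional block takes the form pattern { action }: the action in braces runs only for lines where the pattern is true. A minimal sketch, using the same employee data inline and an arbitrary threshold of 100000 chosen for illustration:

```shell
# Print the name of every employee earning at least 100000.
# The threshold is an arbitrary value picked for this sketch.
high=$(printf 'John 100000\nJane 50000\nBill 200000\n' |
  awk '$2 >= 100000 { print $1 }')
echo "$high"
```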
You may have as many blocks in an awk program as you wish. A block is executed once for each line in the input file, and is enclosed in braces:
{ sum_income += $1} { sum_tax += $2 }
This will add the value of the first column to the sum_income variable, and the value of the second column to the sum_tax variable.
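To see both blocks run against every line, here is a small sketch with an END block added to print the two totals. The income and tax figures are made up.

```shell
# Each input line has an income in column 1 and a tax amount in column 2.
# Both blocks run, in order, for every line; the numbers are sample data.
totals=$(printf '1000 200\n2000 400\n3000 600\n' |
  awk '{ sum_income += $1 } { sum_tax += $2 } END { print sum_income, sum_tax }')
echo "$totals"
```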
You may be a Perl guy, in which case one-liners like this feel natural. If you're like me, however, at some point you feel like editing your code in a tool better suited than a command line. Awk lets you pass an awk program file using the -f switch:
awk -f average.awk employees.txt
The file average.awk looks like this:
BEGIN { highest = 0 }
$2 > highest { highest = $2; winner = $1 }
END { printf("And the winner is %s\n", winner) }
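The whole -f workflow can be tried end to end with a throwaway directory. This is only a sketch: the file name mean.awk and its contents are hypothetical, and it assumes mktemp is available on your system.

```shell
# Create a scratch directory holding a program file and a data file.
workdir=$(mktemp -d)

# Hypothetical program file: sums the second field and reports the mean.
cat > "$workdir/mean.awk" <<'EOF'
{ sum += $2 }
END { if (NR > 0) printf("Average: %d\n", sum / NR) }
EOF

printf 'John 100000\nJane 50000\nBill 200000\n' > "$workdir/employees.txt"

# Run the program file against the data file with -f.
mean=$(awk -f "$workdir/mean.awk" "$workdir/employees.txt")
echo "$mean"

rm -r "$workdir"
```

Inside a program file, each pattern and block can sit on its own line, so the semicolons used in the one-liners are no longer needed.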
As an example of real-world usage, we will build an awk program that counts the number of requests for each IP address to an Apache web server from its access log.
The Apache access log uses a single line for each request, with a set of fields separated by whitespace - which makes it ideal for being used with awk. A line might look like this:
127.0.0.1 - - [31/Oct/2010:23:20:52 +0100] "GET /foo HTTP/1.1" 200 363 "-" "-"
The first "field" in this line is the IP address making the request. Awk lets us use associative arrays just like any other variable:
{ requests[$1] += 1 }
This will set the value of the element in the requests associative array with the key of the IP address in the first field to itself plus 1. Furthermore, we can loop over each element in the array using the for syntax:
for (ip in requests)
    printf("Number of requests for %s: %d\n", ip, requests[ip])
This will print the key for each element in the requests array along with its numerical value.
The complete awk script to sum the number of requests for each IP address looks like this:
{
    requests[$1] += 1
}
END {
    for (ip in requests)
        printf("Number of requests for %s: %d\n", ip, requests[ip])
}
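To try the counting logic without a real access log, the sketch below feeds three invented log lines to the same kind of program. Awk does not guarantee the order in which for (ip in requests) visits the keys, so the output is piped through sort to put the busiest addresses first. The second IP address and the extra requests are made up.

```shell
# Three sample access-log lines; only the first is from the article,
# the other two are invented for this sketch.
counts=$(printf '%s\n' \
  '127.0.0.1 - - [31/Oct/2010:23:20:52 +0100] "GET /foo HTTP/1.1" 200 363 "-" "-"' \
  '10.0.0.7 - - [31/Oct/2010:23:20:53 +0100] "GET /bar HTTP/1.1" 200 512 "-" "-"' \
  '127.0.0.1 - - [31/Oct/2010:23:20:54 +0100] "GET /baz HTTP/1.1" 404 209 "-" "-"' |
  awk '{ requests[$1] += 1 }
       END { for (ip in requests) printf("%d %s\n", requests[ip], ip) }' |
  sort -rn)
echo "$counts"
```

Printing the count first is a small design choice that lets sort -rn order the lines numerically.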
Have fun!