Friday, 29 April 2011

AWK - a simple tool for parsing files

Intro

Every once in a while, a programmer faces the challenge of collecting data from a structured file. To do that, we need to parse the file with some tool or language. You could do this in any language, but let's see how it's done with awk.

So let's start explaining awk:


FACTS
  1. Programming language
  2. Simple and Fast
  3. An inspiration for Perl
  4. C-like syntax
  5. Ideal for processing data files (e.g. CSV)

AWK file structure

BEGIN { code1 }
{ code2 }
END { code3 }

So:
  • code1 is run once, before any parsing starts. It can be useful for initializing variables, for example.
  • code2 is run every time a new line is read. Treat this as "what to do for each line".
  • code3 is run once, at the end of the parsing. You can use it to process the data collected in code2 and print it, for example.
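
To see the order in which the three blocks run, here is a minimal sketch. The file name fruits.txt and its contents are made up for illustration, and the variable name count is arbitrary:

```shell
# Create a small sample input (hypothetical file, not part of the examples below).
printf 'apple\nbanana\ncherry\n' > fruits.txt

awk '
BEGIN { count = 0; print "start" }      # runs once, before any input is read
      { count++; print "line:", $0 }    # runs once for every input line
END   { print "done,", count, "lines" } # runs once, after the last line
' fruits.txt
```

This should print "start", then one "line: ..." entry per line of the file, and finally "done, 3 lines".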

Built-In variables

  • $0 -> The current line
  • $N -> The Nth field of the current line (fields are separated by whitespace by default)
  • NR -> Number of the current line (count of lines read so far)
  • NF -> Number of fields in the current line
  • FILENAME -> Name of the file being parsed
  • ...
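
A quick way to get a feel for these variables is to print them for every line. The sketch below assumes a made-up file sample.txt whose second line has fewer fields than the first. It also uses $NF, which is simply "$N with N = NF", i.e. the last field. The second command uses the -F option to split on commas instead of whitespace, which is a naive way to read CSV (it does not handle quoted fields that contain commas):

```shell
# Hypothetical two-line input with a different number of fields per line.
printf 'rei 666 20 20\ndeus 876 17\n' > sample.txt

# For each line: file name, line number, field count, first field, last field.
awk '{ print FILENAME, NR, NF, $1, $NF }' sample.txt

# Split on commas instead of whitespace and print the second field.
printf 'rei,666,20\n' | awk -F, '{ print $2 }'
```

The first command should print "sample.txt 1 4 rei 20" and "sample.txt 2 3 deus 17"; the second should print "666".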

Example 1

Consider this simple file "data.txt":

12
50
12
35
12
12
...

These values could be anything: grades, how long a process took to complete a transaction, the size of the data each process saves to disk, etc.

Imagine you wanted the sum and average of these numbers. You could simply write this in the console:

~$ awk '{ s += $1 } END { print "sum:", s, "average:", s/NR }' data.txt

You don't even have to write a source code file to do this: simply write your code directly between single quotes. Of course, if what you want to do is a little more complex, this becomes troublesome. In that case you can write the code in a file, as we'll see in the next example.
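
As a slightly more defensive variation, the sketch below guards against an empty file (where NR would be 0 and the division would fail) and uses printf to control the number format. It runs on a made-up file numbers.txt that holds only the six values visible above, not the full data.txt:

```shell
# Hypothetical input holding just the six values shown in the listing above.
printf '12\n50\n12\n35\n12\n12\n' > numbers.txt

awk '
{ s += $1 }                 # accumulate the first field of every line
END {
    if (NR > 0)             # avoid dividing by zero on an empty file
        printf "sum: %d average: %.2f\n", s, s / NR
    else
        print "no data"
}' numbers.txt
```

For these six values this should print "sum: 133 average: 22.17".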


Example 2

Consider this simple file grades.txt:

name number grade1 grade2
rei 666 20 20
deus 876 17 15
norad 555 5 9
mbp 000 0 0

We have a simple file with each student's name, number and grades.

Let's calculate the average of each grade and save the results in a new file:



BEGIN {}
{
    if (NR != 1) { # skip the 1st line (the header)
        grades[0] += $3;
        grades[1] += $4;
    }
}
END {
    # a space between strings concatenates them
    name = FILENAME "_parsed.txt";
    # the comma separates the printed values with a space
    # NR-1 is the number of students (all lines except the header)
    print "Average grade 1:", grades[0]/(NR-1) >> name;
    print "Average grade 2:", grades[1]/(NR-1) >> name;
}

So this is the file parse.awk. To run it, simply type in the console:

~$ awk -f parse.awk grades.txt
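
The same file can also be processed per student instead of per column. The sketch below is a variation, not part of parse.awk: it recreates grades.txt as listed above, then loops over the grade fields of each line with a C-style for loop and prints every student's own average. The variable names total and i are arbitrary:

```shell
# Recreate grades.txt exactly as listed above.
cat > grades.txt <<'EOF'
name number grade1 grade2
rei 666 20 20
deus 876 17 15
norad 555 5 9
mbp 000 0 0
EOF

awk '
NR > 1 {                       # skip the header line
    total = 0
    for (i = 3; i <= NF; i++)  # fields 3 to NF hold the grades
        total += $i
    print $1, total / (NF - 2) # name, then the average of that line
}' grades.txt
```

This should print "rei 20", "deus 16", "norad 7" and "mbp 0". Because the loop runs up to NF, it keeps working if more grade columns are added.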


Conclusion

There you have it: two basic examples to get you started with AWK.
You can do much more, such as using loops.

The best way to learn more is to work on a real problem, so the next time you have to parse a file and do fairly basic stuff with it, give AWK a try.