Throw out your Perl: One-line aggregation in awk

I ran into a student from a class I taught last summer. He’s a really sharp guy, and when I first met him, I was impressed with just how much Perl he could stuff into his brain’s cache. He would write what he called ‘one-liners’ in Perl that, in reality, took up 5-10 lines in his terminal. Still, he’d type furiously, without skipping a beat. But he told me when we met that he no longer does this, because I covered awk in my class.

His one-liners were mostly for data munging. The data he needed to munge was mostly data that was pretty predictable. It had a fixed number of fields, a consistent delimiter — in short, it was perfect for munging in awk without using any kind of esoteric awk-ness.

One thing I cover in the learning module I’ve developed on Awk is aggregation of data using pretty simple awk one-liners. For example, here’s a pretty typical /etc/passwd file (we need some data to munge):

root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
news:x:9:13:news:/etc/news:
uucp:x:10:14:uucp:/var/spool/uucp:/sbin/nologin
operator:x:11:0:operator:/root:/sbin/nologin
games:x:12:100:games:/usr/games:/sbin/nologin
gopher:x:13:30:gopher:/var/gopher:/sbin/nologin
ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin
nobody:x:99:99:Nobody:/:/sbin/nologin
dbus:x:81:81:System message bus:/:/sbin/nologin
apache:x:48:48:Apache:/var/www:/sbin/nologin
avahi:x:70:70:Avahi daemon:/:/sbin/nologin
mailnull:x:47:47::/var/spool/mqueue:/sbin/nologin
smmsp:x:51:51::/var/spool/mqueue:/sbin/nologin
distcache:x:94:94:Distcache:/:/sbin/nologin
nscd:x:28:28:NSCD Daemon:/:/sbin/nologin
vcsa:x:69:69:virtual console memory owner:/dev:/sbin/nologin
rpc:x:32:32:Portmapper RPC user:/:/sbin/nologin
rpcuser:x:29:29:RPC Service User:/var/lib/nfs:/sbin/nologin
nfsnobody:x:65534:65534:Anonymous NFS User:/var/lib/nfs:/sbin/nologin
sshd:x:74:74:Privilege-separated SSH:/var/empty/sshd:/sbin/nologin
mysql:x:27:27:MySQL Server:/var/lib/mysql:/bin/bash
dovecot:x:97:97:dovecot:/usr/libexec/dovecot:/sbin/nologin
squid:x:23:23::/var/spool/squid:/sbin/nologin
ldap:x:55:55:LDAP User:/var/lib/ldap:/bin/false
pcap:x:77:77::/var/arpwatch:/sbin/nologin
ntp:x:38:38::/etc/ntp:/sbin/nologin
rpm:x:37:37::/var/lib/rpm:/sbin/nologin
haldaemon:x:68:68:HAL daemon:/:/sbin/nologin
named:x:25:25:Named:/var/named:/sbin/nologin
xfs:x:43:43:X Font Server:/etc/X11/fs:/sbin/nologin
jonesy:x:500:500::/home/jonesy:/bin/bash

It’s not exotic, cool data that we’re going to infer a lot of interesting things from, but it’ll do for pedagogical purposes.

Now, let’s write a super-simple awk aggregation routine that’ll count the number of users whose UID is > 100. It’ll look something like this:

awk -F: '$3 > 100 {x+=1} END {print x}' /etc/passwd

The important thing to remember is that awk will initialize your variables to 0 for you, which cuts down on some clutter.

Let’s abuse awk a bit further. What if we want to know how many users use each shell in /etc/passwd, whatever those shells may be? Here’s a one-liner that’ll take care of this for you:

awk -F: '{x[$7]+=1} END {for(z in x) {print z, x[z]} }' /etc/passwd

While awk doesn’t technically support multi-dimensional arrays, it also doesn’t have to be numerically indexed. So here, we tell awk to increment x[$7]. $7 is the field that holds the shell for each user, so if $7 on the current line is /bin/bash, then we’ve told awk to increment the value in the array indexed at x[/bin/bash]. So, if there’s only one line containing /bin/bash up to the current record, then x[/bin/bash]=1

There’s a lot of great things you can move onto from here. You can do things that others use Excel for right in awk. If you have checkbook information in a flat file, you can add up purchases only in a given category, or, using the technique above, in every category. If you store your stock purchase price on a line with the current price, you can use simple math to get the spread on each line and tell you whether your portfolio is up or down. Let’s have a look at something like that. Here’s some completely made up, hypothetical data representing a fictitious user’s stock portfolio:

ABC,100,12.14,19.12
FOO,100,24.01,17.45
BAR,50,88.90,94.33
BAZ,50,75.65,66.20
NI,23,33.12,43.32

Save that in a file called “stocks.txt”. The columns are stock symbol, number of shares, purchase price, and current price, in that order. This awk one-liner indexes the ‘x’ array using the stock symbol, and the value at that index is set to the amount gained or lost:

awk -F, '{x[$1]=($2*$4)-($2*$3)} END {for(z in x) {print z, x[z]}}' stocks.txt

Hm. Actually, that’s kind of inefficient. I realized while previewing this post that I can shorten it up a bit like this:

awk -F, '{x[$1]=($2*($4 - $3))} END {for(z in x) {print z, x[z]}}' stocks.txt

Glad I caught that before the nitpickers flamed me to a crisp. Always preview your posts! ;-P

Ah, but of course, that’s not enough. This spits out the gain and loss for each stock, but what about the net gain or loss across all of them? You only need to tweak a little bit:

awk -F, '{x[$1]=($2*($4-$3)); y+=x[$1]} END {for(z in x) {print z, x[z]}; print "Net: "y}' stocks.txt

We just added the assignment of the ‘y’ variable before the “END”, and then added a print statement after the “END”.

I hope this helps some folks out there. Also, if your team needs to know stuff like this, I do on-site training!

  • Peter

    You forgot to use the default shell when field 7 is undefined.

  • m0j0

    Thanks, but I didn’t forget – I decided not to care, because I’m hoping that people will be able to pick up on what’s going on when they see a number with no label attached. In the courses I teach, as in the posts I write, I assume some prior knowledge and experience. This isn’t snobbery, it’s really just to save more experienced users from having to read an explanation that they can very easily grasp for themselves. The solution is an exercise for the reader, and is pretty trivial besides.

  • http://olifante.blogs.com olifante

    The two last awk one-liners are identical, apart from a couple of spaces.

  • m0j0

    Fixed. Thank you.

  • Anonymous Rex

    perl -F: -alne’$x++ if $F[2]>100;END{print $x}’ /etc/passwd
    perl -F: -alne’$x{$F[6]}++;END{for(keys%x){print”$_ $x{$_}”}}’ /etc/passwd
    perl -F, -alne’$x{$F[0]}=$F[1]*($F[3]-$F[2]);END{for(keys%x){print”$_ $x{$_}”}}’ stocks.txt
    perl -F, -alne’$y+=$x{$F[0]}=$F[1]*($F[3]-$F[2]);END{for(keys%x){print”$_ $x{$_}”};print”Net: $y”}’ stocks.txt

    Better versions of the last two:

    perl -F, -alpe’$_=”$F[0] “.($F[1]*($F[3]-$F[2]))’ stocks.txt
    perl -F, -alpe’$y+=$_=$F[1]*($F[3]-$F[2]);$_=”$F[0] $_”;END{print”Net: $y”}’ stocks.txt

  • m0j0

    Great – ‘cos that’s so much cleaner, shorter, and more intuitive than the awk code, and I can really see why I would never, ever, ever choose anything over perl for anything now. Thanks for that.

    Seriously, those are nice one-liners. I should’ve noted that I also do know, use, and teach perl (only upon request, by gunpoint — but it happens). The article makes it pretty clear, though, that I’m talking about awk in a particular context, for a given purpose. You Perl hackers should know all about context. It’s not a blanket “awk beats perl” thing – I just think awk is cleaner and easier for these types of tasks. Those people I’ve been able to show awk solutions to, who were using perl, now don’t use perl for those tasks. Not because you can’t do it in perl (duh), but because you don’t have to, and awk is a bit more intuitive for certain tasks.

    I have plenty of other posts I have yet to write about why I hate perl, and why I think it will eventually become the next cobol — save your flames for those future posts.

  • rjp

    rjp> cat stocks.txt
    ABC,100,12.14,19.12
    FOO,100,24.01,17.45
    BAR,50,88.90,94.33
    BAZ,50,75.65,66.20
    NI,23,33.12,43.32
    ABC,200,13.12,19.12

    The “better” perl version for stocks doesn’t aggregate the profit/loss by stock:

    ABC 698
    FOO -656
    BAR 271.5
    BAZ -472.5
    NI 234.6
    ABC 1200
    Net: 1275.6

    Then again, the awk version gets it totally wrong for ABC:

    FOO -656
    ABC 1200
    BAZ -472.5
    NI 234.6
    BAR 271.5
    Net: 1275.6

  • ville vainio

    IPython string lists are also handy (and arguably more intuitive than awk) for this:

    http://ipython.scipy.org/moin/Cookbook/StringListProcessing

    So far, fields() does not support arbitrary separator (whitespace only) but than can be fixed ;-)

  • Agate

    I don’t think the Perl post is really a flame. But it’s a good example of how you can do the same thing in Perl.

    The title is the intentionally-inflammatory “Throw out your Perl,” so is it really surprising that someone felt the need to respond?

    To someone familiar with both languages, they are both about equivalently readable. Perl’s “$” are ugly, but then again, awk uses “$” for the column contents, so overall the one-liners seem to be about equally obfuscated to the casual viewer.

  • http://morlockhq.blogspot.com Jim

    Brian-

    That’s a good point about context. I look at it as having one more tool in the belt. Linux/Unix have so many specialized tools at the user’s disposal (as well as more generalized tools like perl, python, ruby, c, et..) that it is a shame to specialize in only one.