Data munging with Vim and AWK

So, I had some data in a file. It was temporal data. It looked like this:

100 4/15 16:50
143 4/15 16:51
121 4/15 16:52
209 4/15 16:53
105 4/15 16:54
321 4/15 16:55
173 4/15 16:56
205 4/15 16:57
197 4/15 16:58
211 4/15 16:59

But I needed it to be in ISO 8601 format so I could plot it with Timeplot. The data represents hits per minute from an Apache log file. I also needed the time to show up in the first column and the hits in the second column. Here’s what I needed the data to look like:

2008-04-15T16:54 105
2008-04-15T16:55 321
2008-04-15T16:56 173
2008-04-15T16:57 205
2008-04-15T16:58 197
2008-04-15T16:59 211

Well, I knew that the dates I had in the file were from 2008, and all of the other bits are there, just in the wrong format. Here’s what I did to get things in the right format for Timeplot:

:%s/4\//2008-04-/g # search for “4/” and replace it with “2008-04-”

Now my data looks like this:

173 2008-04-15 16:56

But not all of the minutes were two digits for some reason (I don’t remember how I parsed the log to get into this state; it was hurried and… well… wrong). I had times that looked like “17:9”, so I had to zero-pad the minutes that were only single digits.

:%s/:\(.\)$/:0\1/g # find “:” followed by a single character at the end of the line, and replace it with “:0” followed by whatever that character was.

So now my minutes look right.

144 2008-04-15 16:09

Now I needed to replace spaces between the date and time values with a “T” as per ISO 8601 rules for date and time representations in a single string:

:%s/\(-..\)\s/\1T/g # find a “-” followed by any two characters and then a space, and replace the space with a “T”, keeping the “-” and the two characters.

That worked well.

213 2008-04-15T16:45

At this point I had everything knocked, but I forgot that some of my *hours* were also single digits :-/

:%s/T\(.\):/T0\1:/g # find a “T” followed by a single character and a “:”, and replace that with “T0”, whatever that character was, and the “:”.

There. That did it. Now I just need to comma-separate the values, which is simple after all of this nonsense:

:%s/ /,/g # c’mon, you get this one, right?

Great! Except that the datetime string needs to be the *first* column. Here’s where awk comes in handy:

cat hitspermin_bad.txt | awk -F, '{print $2,$1}' > hitspermin_good.txt
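One detail of that awk step is easy to miss: -F, only sets the input separator. The comma inside print joins the fields with awk's output separator (OFS), which defaults to a single space, so the commas disappear again in the output. A minimal sketch on one sample line (the function name swap_columns is made up for illustration):

```shell
# Swap two comma-separated columns; the output is joined with awk's
# default OFS, a single space.
swap_columns() {
  awk -F, '{ print $2, $1 }'
}

printf '%s\n' '105,2008-04-15T16:54' | swap_columns
```

This prints "2008-04-15T16:54 105". To keep commas in the output instead, the separator can be set explicitly with awk -F, -v OFS=, '{ print $2, $1 }'.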

You’ll notice that, since I could see the data and know the source, I didn’t bother explicitly telling Vim to look for *numbers* – I just used “.” to say “find any character”. If I had less confidence in the data I would’ve used “\d” to make sure I had numeric digits there.
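As a rough illustration of the stricter matching, here are the two zero-padding steps sketched with sed, using [0-9] where the Vim commands used “.”. This is a sketch, not what I actually ran; the function name pad_time is made up, and in Vim the equivalent of [0-9] would be “\d”.

```shell
# Zero-pad single-digit minutes and hours, but only when they are digits.
pad_time() {
  sed -e 's/:\([0-9]\)$/:0\1/' \
      -e 's/T\([0-9]\):/T0\1:/'
}

# The first line gets padded; the second has a non-digit "minute" and is
# left untouched, where the "." version would have padded it anyway.
printf '%s\n' '144 2008-04-15T9:9' '7 2008-04-15T16:x' | pad_time
```

The first line comes out as "144 2008-04-15T09:09" and the second is unchanged, which makes bad rows easier to spot afterwards.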

Of course, the better solution is to properly parse the log file in the first place, but the log file in this case was 25GB!! Of course I’ll go back and change my script (I used loghetti with a custom (read: flawed) output filter), and test it on smaller data, and eventually get it to be more reliable, but to get a quick Timeplot graph together, this was a fast, if iterative and somewhat annoying, way to go. It also gave me a chance to exercise my Vim search and replace skillz.
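For a repeatable version of the same cleanup, the whole transformation could probably be done in one awk pass instead of several Vim substitutions. The sketch below assumes the three-column “hits M/D H:M” layout shown at the top of the post; the function name to_iso and the year argument are made up for illustration, and the year has to be supplied because it isn’t in the data.

```shell
# Reformat "hits M/D H:M" lines into "YYYY-MM-DDTHH:MM hits".
# Usage: to_iso [year] < input   (year defaults to 2008, as in the post)
to_iso() {
  awk -v year="${1:-2008}" '
    NF == 3 {                 # skip blank or malformed lines
      split($2, d, "/")       # d[1] = month, d[2] = day
      split($3, t, ":")       # t[1] = hour,  t[2] = minute
      printf "%04d-%02d-%02dT%02d:%02d %s\n", year, d[1], d[2], t[1], t[2], $1
    }'
}

printf '%s\n' '105 4/15 16:54' '144 4/15 9:9' | to_iso 2008
```

That prints "2008-04-15T16:54 105" and "2008-04-15T09:09 144". The %02d formats do the zero-padding that took two separate substitutions in Vim, and printing the hits last takes care of the column swap.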

  • Stephen P. Schaefer

    I’m curious why you went with awk for the last step. You could have done the same in vi with

    :%s/\(.*\),\(.*\)/\2,\1/

    …although I tested that with vim, not real vi, which isn’t handy to me at the moment, but it *ought* to work in original vi.

    This is all useful for a one-off session. Awk (or sed or perl, or …) for all the above is better if you need to do this sort of thing on a continuing basis – write the script once, run it as often as needed. I understand you corrected the problem at the source (loghetti filter), which is even better.

  • m0j0

    Hi Stephen,

    Thanks much for that comment. There’s no good reason whatsoever why I didn’t use your method. I guess beyond one backref my brain switches to ‘awk mode’. Only if awk weren’t available would I think “oh, I can just do that here”. I don’t know why that is.

    Of course, as I believe I stated, this *was* all a one-off session. It was an old data file from some unrelated testing that I was repurposing for another test. The loghetti filter has yet to be fixed, but that’s part of a slightly larger project that I’m not likely to get back to for another week.

  • Reynaldo

    The cat used with awk is also redundant; awk can read the file by itself.

    so it would be:
    awk -F, '{print $2,$1}' hitspermin_bad.txt > hitspermin_good.txt

  • http://standalone-sysadmin.blogspot.com Matt Simmons

    Speaking as someone who constantly writes 3-line (wrapped) command lines, there are always ways to “improve” or “condense” your code. That doesn’t mean your way isn’t better, at least for you.

    When you’ve just got to get something taken care of, who cares if the “cat” is redundant, or if you could have fixed it with line noise in vim? You got it done, and it worked.

    Neat solution.