I’m working on a new project that will be open sourced if I can ever get it to be generically useful. It’s called “sarviz”, and it’s a visualization tool for output from the “sar” UNIX system reporting utility. I know tools like this exist, but please read on, as I’m looking to do something a bit different from what I’ve seen.
A quick, simple explanation of sar
System administrators typically run sar as a cron job, and each day sar will generate a report that lists the values of various system counters for a specified time interval throughout the day. So you end up with a text file that lists, for example, the cpu iowait value every 10 minutes throughout the day. There are maybe a dozen different categories of counters enabled by default, and more that aren’t (like disk-related counters). Anyway, you wind up with a text file that looks something like this:
23:30:01          CPU     %user     %nice   %system   %iowait    %steal     %idle
23:40:02          all      0.32      0.00      0.32      6.57      0.49     92.29
23:40:02            0      0.32      0.00      0.32      6.57      0.49     92.29
23:50:01          all      0.74      0.00      0.82      7.14      0.55     90.76
23:50:01            0      0.74      0.00      0.82      7.14      0.55     90.76
Average:          all      0.82      0.00      0.72     13.54      0.78     84.14
Average:            0      0.82      0.00      0.72     13.54      0.78     84.14
This is just a small part of one section of the file (this box has only one cpu, which is why the ‘all’ and ‘0’ numbers are the same, btw). The whole file on one server, running with default configurations, is 4000 lines long.
There’s a ton of great information in here, but… it all looks like the above. There’s no graphical output to be had. This is bad, because it would be nice to use this (historical) monitoring output for things like capacity planning, problem tracking, etc. You would, of course, want to couple this type of monitoring with something else that’ll do real-time monitoring, alerts, dependencies, escalation, etc.
So I want to write an application that’ll generate graphs of all of this stuff. Furthermore, I thought it would be cool to do something like what planetplanet does, which is to say that I want sarviz to run as a cron job, parse all of this stuff, and generate static html files, with an index.html that’ll make it really easy to browse this information either by host, by date, by resource… whatever. Later on I can add features to actually do even more useful stuff like longer-term trending of resource usage (by aggregating across various ‘sar’ output files), and more.
Sar is not alone
Sar comes with some friends, and it turns out they can be extremely useful. The best one for my purposes here is called ‘sadf’, and it is used to basically format the sar output to make it more useful for programmatic processing. It can output the information in CSV format, or make it ready for insertion into a relational database, but what I’m currently using for sarviz (and it’s early, so this could change) is the XML output capability. With XML output, I won’t have to deal with parsing out column headers, scanning an entire file for information from a single sar run, dealing with the blank lines sar uses by default to make it easier to read on a console, etc. So with sadf I can get output that looks like this:
<timestamp date="2008-06-15" time="07:10:01" interval="600">
  <processes per="second" proc="0.93"/>
  <context-switch per="second" cswch="221.50"/>
  <cpu number="all" user="1.77" nice="0.00" system="0.56" iowait="0.04" steal="0.08" idle="97.55"/>
  <cpu number="0" user="1.77" nice="0.00" system="0.56" iowait="0.04" steal="0.08" idle="97.55"/>
This is a bit nicer to deal with, and I was excited to use Python’s (now built-in) ElementTree module to do something from scratch after having dealt with it being somewhat abstracted in the Python tools for the GData API (which I used to write a command line client for Google Spreadsheets, for example).
Doing Simple Things with ElementTree
Well, as it turns out, I had kind of a hard time getting started doing what I thought were simple things with ElementTree, so I want to post a few examples of how I did them so that I and others have something to refer to online.
The first thing to know about ElementTree is that there are Element objects and ElementTree objects. An ElementTree object wraps a hierarchical collection of Element objects, and the Element objects are the things you actually read attributes from. For whatever reason, I was a little confused starting out, because I wanted to get an ElementTree object and then ask that object to “scan the tree and give me all of the ‘time’ attributes of the ‘timestamp’ elements in the tree”. You might be able to do this with a one-liner, but I never found a document that said how.
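For what it's worth, here is a minimal sketch of what such a one-liner could look like. It is written for Python 3, where `iter()` is the newer name for what Python 2.5 calls `getiterator()`. The XML string is a made-up, trimmed-down stand-in for the sadf output (the `sysstat` root element and the second timestamp are assumptions for illustration only).

```python
from xml.etree import ElementTree as ET

# Hypothetical, trimmed-down stand-in for sadf's XML output.
SAMPLE = """
<sysstat>
  <host>
    <statistics>
      <timestamp date="2008-06-15" time="07:10:01" interval="600"/>
      <timestamp date="2008-06-15" time="07:20:01" interval="600"/>
    </statistics>
  </host>
</sysstat>
"""

# fromstring() returns the root Element; wrapping it in ElementTree gives
# the same kind of object that ET.parse() returns for a file.
root = ET.fromstring(SAMPLE)
tree = ET.ElementTree(root)

# The one-liner: walk the whole tree and collect each "time" attribute.
times = [ts.get("time") for ts in tree.iter("timestamp")]
print(times)  # ['07:10:01', '07:20:01']
```

Note that `iter()` matches on tag name anywhere in the tree, so it would also pick up `timestamp` elements nested somewhere unexpected. An explicit path is stricter about where the elements live.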
So here’s how to load in an XML file, parse it, and return all of the timestamp elements in that tree (or, rather, this is how I did it, which seems reasonable):
strudel:sa jonesy$ python
Python 2.5.1 (r251:54863, Jan 17 2008, 19:35:17)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from xml.etree import ElementTree as ET
>>> tree = ET.parse("sa15.xml")
>>> for ts in tree.findall("host/statistics/timestamp"):
...     isotime = ts.attrib["date"]+"T"+ts.attrib["time"]
...     print isotime
So, I imported the ElementTree module, fed my xml file to a function called “parse()”, and that gives me an ElementTree object. In that tree, I then ask for the timestamp elements which are under the root element at “host/statistics/timestamp”. You can then see that I create an ISO 8601-formatted timestamp by asking for the “date” and “time” attributes of the timestamp element, and put a “T” between them. I would’ve used something like “T”.join, but there are other attributes in that element, and I only needed two, so I took the easy way out here instead of creating a list first and then doing the join on the list.
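If you do prefer the join approach, one way to sketch it (Python 3 syntax, not what sarviz actually does) is to name just the two attributes you want and join those. The element below is the timestamp from the sadf sample, self-closed so it parses on its own.

```python
from xml.etree import ElementTree as ET

# A single timestamp element from the sadf sample, closed so that it
# is well-formed by itself.
ts = ET.fromstring('<timestamp date="2008-06-15" time="07:10:01" interval="600"/>')

# Pick out only the two attributes we need, in order, then join with "T".
# The "interval" attribute is simply never asked for.
isotime = "T".join(ts.attrib[key] for key in ("date", "time"))
print(isotime)  # 2008-06-15T07:10:01
```

For two attributes this is arguably no clearer than plain concatenation, which is why the easy way out is perfectly reasonable here.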
Of course, my real interest in the timestamps isn’t to print them, but to get the statistics for each sar run (represented by a timestamp, since sar records statistics for regular time intervals). So now let’s grab the 1-, 5-, and 15-minute load averages according to sar. I want all of this printed on one line along with the timestamp, because this output is going to be graphed using Timeplot, and that’s how Timeplot wants the data. Here goes:
>>> for ts in tree.findall("host/statistics/timestamp"):
...     isotime = ts.attrib["date"] + "T" + ts.attrib["time"]
...     for q in ts.findall("queue"):
...         qstat = [isotime, q.attrib["ldavg-1"], q.attrib["ldavg-5"], q.attrib["ldavg-15"]]
...         print ",".join(qstat)
The thing to note here, in case it escaped your eyeball, is that the second call to ‘findall’ feeds an argument relative to the ‘ts’ object rather than the ‘tree’ object.
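To make the relative lookup easier to see, here is a self-contained sketch in Python 3 syntax. The sample XML, its `sysstat` root, and the load average values are all made up for illustration, and `load_average_lines` is a hypothetical helper name, not something from sarviz.

```python
from xml.etree import ElementTree as ET

# Made-up sample: two sar runs, each with its own <queue> element.
SAMPLE = """
<sysstat>
  <host>
    <statistics>
      <timestamp date="2008-06-15" time="07:10:01" interval="600">
        <queue ldavg-1="0.10" ldavg-5="0.20" ldavg-15="0.30"/>
      </timestamp>
      <timestamp date="2008-06-15" time="07:20:01" interval="600">
        <queue ldavg-1="0.40" ldavg-5="0.50" ldavg-15="0.60"/>
      </timestamp>
    </statistics>
  </host>
</sysstat>
"""

def load_average_lines(root):
    """Return one 'timestamp,ldavg-1,ldavg-5,ldavg-15' string per sar run."""
    lines = []
    for ts in root.findall("host/statistics/timestamp"):
        isotime = ts.attrib["date"] + "T" + ts.attrib["time"]
        # Relative search: only <queue> children of *this* timestamp.
        for q in ts.findall("queue"):
            fields = [isotime, q.attrib["ldavg-1"],
                      q.attrib["ldavg-5"], q.attrib["ldavg-15"]]
            lines.append(",".join(fields))
    return lines

root = ET.fromstring(SAMPLE)
for line in load_average_lines(root):
    print(line)

# Searching from the root with the full path finds every <queue> at once,
# but you no longer know which timestamp each one belongs to.
all_queues = root.findall("host/statistics/timestamp/queue")
print(len(all_queues))
```

Calling `findall` on each `ts` keeps every queue reading paired with the timestamp it came from, which is what a one-line-per-sample format needs.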