Loghetti 0.9 Released: Now worthy of use!

The first released tarball of Loghetti was called the “IPO” release. This version warranted an actual version number. I chose 0.9, and we’ll be moving toward a 1.0 release in .01 increments, starting with 0.91. Later on I’ll try to detail a roadmap, but I haven’t had enough feedback for that yet (though I’ve had some feedback, and it’s going to be worked into Loghetti soon).

So why is it worthy of use now? Here is a list of key features in 0.9:

  • It can take input from stdin or from a log file named as an argument.
  • You can write your own output plugin without knowing anything at all about Loghetti’s internals, so doing things like formatting output for MapReduce is Mind Numbingly Easy(tm). An example plugin that formats output for insertion into a database is included in the tarball. You’ll see that there is nothing loghetti-specific in the code except the name of the defined function: munge()
  • A few simple code changes and some lazy evaluation later, Loghetti 0.9 is several times faster than the IPO release, which is nice. It can now serve as a reasonable troubleshooting tool on 250MB log files.
  • Loghetti can report/filter on the key=value pairs in the query string. Passing ‘--urldata=foo:bar’ will return lines where foo=bar in the query string found in the request field.
  • You don’t have to get the whole line back in the output. You can tell Loghetti to return only the fields you want. I’ll document the names of the fields shortly, but for now, you can find them all defined in the apachelogs.py file.
  • And much, much more!
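
To give a feel for the output plugin idea, here’s a minimal sketch of what such a plugin might look like. Only the function name munge() comes from Loghetti. The arguments it receives here (a parsed line object and a list of field names) and the field names themselves are assumptions made for illustration, so check the example plugin in the tarball for the real calling convention.

```python
# Hypothetical output plugin. Only the name munge() is taken from Loghetti;
# the signature and the field names (ip, code, url) are assumptions.
from types import SimpleNamespace


def munge(line, fields):
    """Format the requested fields of one parsed log line as a tab-separated record."""
    values = [str(getattr(line, name, "-")) for name in fields]
    return "\t".join(values)


if __name__ == "__main__":
    # Stand-in for a parsed log line; the real object comes from apachelogs.py.
    sample = SimpleNamespace(ip="192.168.1.2", code=404, url="/missing.html")
    print(munge(sample, ["ip", "code", "url"]))
```

Tab-separated output is the sort of thing a MapReduce job or a database bulk loader can usually consume directly, which is why the sketch uses it.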

Thanks to Kent Johnson and Doug Hellmann, who signed up and were each a tremendous help, both in improving the performance of Loghetti and in teaching me a thing or two along the way.

There is, so far, one outstanding issue that is not yet fixed in 0.9: although I’ve tested Loghetti against several million log lines by now, others have occasionally found that some broken (malicious?) client software causes log lines to be created which do not conform to the Apache ‘combined’ log format. These will (presently) cause Loghetti to exit with an error. This is bad, but apparently is relatively rare. 0.9 does *not* contain a fix for this, because I was unsure which way to go with a solution. At this point, I think that, rather than code for every special case, what might happen is Loghetti will continue processing, and keep lines like this aside in a loghetti.log file, and tell you there were ‘x non-conformant lines’, and to see the log for details. Other ideas on how to deal with this are welcome, of course.
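
Here’s a rough sketch of the “keep going and set the bad lines aside” idea. It is not what 0.9 does, and it is not the parser in apachelogs.py: the regular expression is a simplified stand-in for the ‘combined’ format, and the function name parse_tolerantly is made up for this example.

```python
import io
import re

# Simplified pattern for the Apache 'combined' format; an assumption for this
# sketch, not the parser used in apachelogs.py.
COMBINED = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$'
)


def parse_tolerantly(lines, reject_log):
    """Return (parsed, bad_count); write non-conformant lines to reject_log."""
    parsed, bad_count = [], 0
    for line in lines:
        match = COMBINED.match(line.rstrip("\n"))
        if match is None:
            # Set the line aside instead of exiting with an error.
            reject_log.write(line if line.endswith("\n") else line + "\n")
            bad_count += 1
            continue
        parsed.append(match.groups())
    return parsed, bad_count


if __name__ == "__main__":
    sample = [
        '127.0.0.1 - - [31/Jan/2008:16:00:00 -0500] "GET /a?foo=bar HTTP/1.1" 404 209 "-" "curl/7.16"\n',
        "garbage from a broken client\n",
    ]
    rejects = io.StringIO()  # stands in for an open loghetti.log file
    good, bad = parse_tolerantly(sample, rejects)
    print(f"{bad} non-conformant lines; see loghetti.log for details")
```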

A simple nanny script in Python

I have a support issue with a provider of mine, but was able to reverse engineer the problem and put in a stop-gap measure to keep it from ruining my weekend. The issue is a misconfigured daemon supplied by the provider, and occasionally, this daemon just goes away. I don’t know much about the daemon, but the underlying system is standard CentOS, so what I really needed was a way to detect whether the daemon had failed, and then restart it if so. The script that does this exists in every shop I’ve ever worked in, and is traditionally called a “nanny script”.

There are actually some nice looking projects that deal with this issue and others, but I didn’t really have time to read all the docs (yet), and I wasn’t sure it wasn’t overkill — but it might be nice to have a daemon instead of a script running from cron.

Anyway, I was shocked that I was unable to find a simple nanny script out on the web – in *any* language. Maybe my google-fu is out of whack. So I went ahead and wrote one up *very* quickly using Python. If you need a script to run every minute or few out of cron and restart a misbehaving daemon if it’s not running, feel free to use my nanny script.
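
For readers who just want the shape of the thing, here’s a minimal sketch of the nanny idea. It is an illustration, not the script linked above: it assumes pgrep is available (it is on a stock CentOS box), and the daemon name and restart command are whatever you pass in.

```python
import subprocess


def is_running(name):
    """Return True if a process with exactly this name exists (uses pgrep)."""
    result = subprocess.run(["pgrep", "-x", name], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def nanny(name, restart_cmd, check=is_running, run=subprocess.call):
    """Restart the daemon if it is not running; return True if a restart was attempted."""
    if check(name):
        return False
    run(restart_cmd)
    return True


if __name__ == "__main__":
    import sys

    # Usage: nanny.py <process-name> <restart command...>
    if len(sys.argv) >= 3:
        nanny(sys.argv[1], sys.argv[2:])
```

Run from cron every minute or so, this does nothing while the daemon is alive and fires the restart command when it isn’t. The check and run parameters exist only so the logic can be exercised without touching a real process table.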

A non-degree-holder’s view of hiring decisions

I get a good number of job offers without sending resumes around. I guess my name shows up in enough places, associated with enough buzzwords, that recruiters fire off emails first and read the fine print later. The “fine print”, in my case, says that I do not have a college degree.

99.999% of the time, recruiters, and even hiring managers, tell me that my experience more than makes up for any lack of a formal education (one manager said he had seen many less capable MS degree holders). However, there are a few little quirks I’ve found at some larger companies. Mainly, they fall into two categories:

  1. They just plain don’t hire anyone without a degree
  2. You can’t get past a certain “tier” of employment without a degree

I’ve worked in business. I grew up in family businesses. I understand that, in certain circumstances, corporations can have legitimate reasons for these stances. Probably the only one I’ve ever actually heard myself that seemed almost reasonable is “insurance”. Some positions in some companies can have a drastic effect on things that directly affect the bottom line of that company, and if the company has insurance to protect them against extremely costly one-time errors (like E&O insurance), the insurance company might give them a better rate if they take steps to decrease the likelihood of such errors… like requiring that employees in these positions have a degree. I think it’s kind of a twisted logic, really. Instead of developing processes and procedures to reduce the likelihood of a problem, they think that hiring someone with a degree by itself will help the issue. Like degree-holders are less prone to errors due to the simple human condition. Odd, that.

Oh, and there’s a third quirk, but not with the corporate policies – with the hiring managers themselves. The quirk is that certain hiring managers, without regard for stated policy, won’t hire someone who doesn’t have a degree, presumably because they fear they might be fired for hiring someone who fails to produce because they don’t have a degree. The other possible reasoning here is that they have the attitude that “I went through it, so why should I give someone a job who hasn’t?”

The *real* problem with these hiring managers, and with corporations who have (non-insurance-related) strict educational requirements of applicants, is that there’s a shortcoming in the business education curricula: they don’t teach the future middle managers of the world how to evaluate an applicant who doesn’t have a traditional, formal education.

This is a guess, of course, since I haven’t been to business school. But aren’t managers unwilling to hire those without formal educations also guessing? I would submit that they are. It’s the same kind of guess, too. It’s a guess based in part (maybe) on experience, and in part based on stereotypes or other preconceptions.

My experience with those who don’t, or won’t, hire non-degree holders is that they think of degree-holders as “more well-rounded”. Assuming the non-degree holder hasn’t resigned themselves to a life of flipping burgers, I don’t think this could be further from the truth. It is, in fact, an old wives’ tale with no basis in fact. We were all told as kids that college would make us more “well-rounded”, and so we all worked to attain this nebulous goal. In reality, a college degree, by itself, is simply not any kind of valuable indicator of “well-roundedness”. Colleges are businesses. They produce college graduates. They do it efficiently, with an eye toward the business end of things more than anything else. If a college graduate is well-rounded, it is as much in spite of their college experience as because of it. Most well-rounded people are probably predisposed to being well-rounded, and had a tendency toward things to help them become well-rounded by the time they arrived on campus.

Besides this somewhat lame view of non-degree holders, another assumption is that non-degree holders do not have *any* education, and so *cannot* be prepared to perform the tasks that a graduate can (allegedly) perform. This argument might hold water with me if I didn’t have some idea already how resumes are typically handled by HR departments. The short story there is that there are tons of resumes that a hiring manager never sees because they’re pre-qualified (read: filtered) on the basis of educational status.

My area of expertise is technology. I don’t have a degree. It would therefore be assumed by many a hiring manager that I have no idea what Big-O notation is, don’t know anything about object delegation or polymorphism, and can’t analyze problems the same way as a college grad. The manager would be wrong on the first two counts, because while I didn’t study in college, I *did* study. But what about that third bit about analyzing problems?

I can tell you that it’s absolutely true that I do not analyze problems the same way as a college grad. What’s a real shame, though, is that a lot of managers would assume that “not the same” means “not as well”. There’s no justification for this assumption. In fact, I would argue that it *has* been the case in the past that having one rogue non-degree holder in a room full of grads can help to avoid “group think”, and help the group turn a problem sideways for another look. It is unfortunate that a degree that is supposed to help people “think outside the box” seems to put everyone in the same exact spot outside of that box, looking at it from the same exact perspective, coming to the same exact conclusion.

Finally, there is a certain class of degree-holder that I think is never a win over hiring a young, hungry rogue like myself. This class of graduate has hung their degree on the wall and decided that they no longer have any obligation to continue to keep up with new developments in their field. They code the same way they’ve always coded, use the same collection of old trusty tools, deal with technology the same way they’ve always dealt with technology, and stood more or less completely still, failing to seek out (much less embrace) new tools, techniques, languages, paradigms for getting things done. How can you possibly think outside the box when your vision of the box is 10 years old and assumes that the box is completely static?

I believe it was Nietzsche (sp?) who wrote that truth is not static (of course, I’m paraphrasing, and I might be thinking of James). If you can see yourself subscribing to that idea at all (it seems counterintuitive at first glance, but deeper thought will probably get you there), then how can a person with a notion of “truth” that is tied to their college experience be any better at figuring out what to do with it than someone who doesn’t have a degree, but is forever seeking out interesting things that come out of an ever-evolving truth?

Anyway, that’s my diatribe for the evening. If you’re a hiring manager with preconceived notions about college degree holders (or not) that come from decades of brain-hammering by graybeards, then cling to that safety blanket all you want, but know that it’s old thinking. Learn to be (gasp!) creative about how you evaluate applicants, and how you build your teams, and how you execute on your visions. Try to find the other box. The one that doesn’t look anything like it did in college.

I’m interested in hearing feedback on these ideas. I’m sure some will take offense. I don’t mean any. I’m certainly not saying that not having a degree is better, or that degree-holders are all complacent or anything like that. I *am* saying that *formal* education *can* be an irrelevant point of comparison, and that relying solely on the existence (or not) of a *formal* education as the basis for hiring one applicant over another is ludicrous.

Also, my blog is subscribed to by various sites, and I decided to publish this to all of the categories, because I think it *could* be interesting to pretty much anyone. If this is spam in your eyes, let me know. If you find a lively discussion about this going on anywhere, I’d be really interested in that as well :-)

Amazon Adds Static IP and “Availability Zones” to EC2

This is cool. You can now associate a static IP address with your EC2 instances. No more mucking about with 10-minute DNS timeouts or dynamic DNS routines. You can also elect to start certain instances in multiple locations using “Availability Zones”.

These new features will make it a little easier for people to deploy larger web sites and services without quite as much management overhead. There are also some rumblings in the forums that Amazon is actually working on immutable storage for EC2 images, which would pretty much complete the puzzle for most. A good many of the custom scripts and routines people come up with for running on EC2 exist to get around this “limitation”, although, truth be told, a good part of dealing with that is having a good backup plan, which you really should have anyway – EC2 just forces the issue :-)

Hadoop, EC2, S3, and me

I’m playing with a lot of rather large data sets. I’ve just been informed recently that these data sets are child’s play, because I’ve only been exposed to the outermost layer of the onion. The amount of data I *will* have access to (a nice way of saying “I’ll be required to wrangle and munge”) is many times bigger. Someone read an article about how easy it is to get Hadoop up and running on Amazon’s EC2 service, and next thing you know, there’s an email saying “hey, we can move this data to S3, access it from EC2, run it through that cool Python code you’ve been working with, and distribute the processing through Hadoop! Yay! And it looks pretty straightforward! Get on that!”

Oh joyous day.

I’d like to ask that people who find success with Hadoop+EC2+S3 stop writing documentation that makes this procedure appear to be “straightforward”. It’s not.

One thing that *is* cool, for Python programmers, is that you actually don’t have to write Java to use Hadoop. You can write your map and reduce code in Python and use it just fine.
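
As a rough illustration, here’s what map and reduce code in Python can look like. This sketch assumes the Hadoop Streaming style of interface, where the mapper and reducer are plain programs that read lines on stdin and write tab-separated key/value lines to stdout. The job itself (counting HTTP response codes in combined-format logs) is just an example I picked, and the function names are mine.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit 'status<TAB>1' for each log line, keyed on the HTTP response code."""
    for line in lines:
        parts = line.split('"')
        # In combined format the status code follows the quoted request string.
        if len(parts) >= 3 and parts[2].split():
            yield f"{parts[2].split()[0]}\t1"


def reducer(lines):
    """Sum counts per key; input must arrive sorted by key, as Hadoop delivers it."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    # Run as 'script.py map' or 'script.py reduce', reading stdin.
    if len(sys.argv) > 1 and sys.argv[1] in ("map", "reduce"):
        step = mapper if sys.argv[1] == "map" else reducer
        for record in step(sys.stdin):
            print(record)
```

You can try the same pipeline locally, without Hadoop, by piping a log through the map step, then sort, then the reduce step.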

I’m not blaming Hadoop or EC2 really, because after a full day of banging my head on this I’m still not quite sure which one is at fault. I *did* read a forum post that someone had a similar problem to the one I wound up with, and it turned out to be a bug in Amazon’s SOAP API, which is used by the Amazon EC2 command line tools. So things just don’t work when that happens. Tip 1: if you have an issue, don’t assume you’re not getting something. Bugs appear to be fairly common.

Ok, so tonight I decided “I’ll just skip the whole hadoop thing, and let’s see how loghetti runs on some bigger iron than my macbook pro”. I moved a test log to S3, fired up an EC2 instance, ssh’d right in, and there I am… no data in sight, and no obvious way to get at it. This surprised me, because I thought that S3 and EC2 were much more closely related. After all, Amazon Machine Images (used to fire up said instance) are stored on S3. So where’s my “s3-copy” command? Or better yet, why can’t I just *mount* an s3 volume without having to install a bunch of stuff?

This goes down as one of the most frustrating things I’ve ever had to set up. It kinda reminds me of the time I had to set up a beowulf cluster of about 85 nodes using donated, out-of-warranty PC hardware. I spent what seemed like months just trying to get the thing to boot. Once I got over the hump, it ran like a top, but it was a non-trivial hump.

As of now, it looks like I’ll probably need to actually install my own image. A good number of the available public images are older versions of Linux distros for some reason. Maybe people have orphaned them and gone to greener pastures. Maybe they’re in production and haven’t seen a need to change them in any way. I’ll be registering a clean install image with the stuff I need and trudging onward.

The Power of Open Source

I think my very favorite aspect of the open source development model is that it allows me to practice the philosophies I use in my everyday personal life, and apply them to software development as well. In my teens and early 20s I read quite a lot of Aristotle and Plato, and a very major philosophy that I took away from all of that reading is “be conscious of your own ignorance”. And so I am.

There are just about a million reasons to start an open source project. In the case of loghetti, I made it a project because I know that there are things that other people know, which I do not know, but would probably like to know or benefit from knowing (we’ll not go into epistemological discussions – I’m just going to use the word “know” in the traditional sense here) ;-)

Turns out, just knowing that there’s stuff out there that I don’t know has proven useful. Within hours of launching the Google Code site for the project, Kent Johnson joined the project, changed maybe 5 lines of code in the apachelogs.py module, and according to my testing, that change resulted in a 6x speed increase. If you’re using loghetti from the SVN trunk, it’s gone from being sluggish for anything over 50MB, to being pretty darn quick even up to 250MB, at least for simple queries like --code=404 (which is what I do speed comparisons with). The changes will be in a tarball probably some time next week, for those who don’t want to use svn.

We haven’t even touched threading yet ;-)

Loghetti is now an open source project

I was getting feedback about loghetti, and it was all very useful, and it’s still coming in, and I can’t work full-time on it. At the same time, I’d love for some of the stuff I’ve read about to be implemented, because I certainly could make use of it myself.

So if anyone is interested, you can get loghetti, get more info about loghetti (it’s an apache log filter written in Python), or join the project here.

A GMail option I’d like to see: “Delayed Skip Inbox”

I use GMail extensively. In my main gmail account, I can send mail using a variety of accounts that I’ve authenticated to allow sending from. So, for example, mail from jonesy at pythonmagazine dot com is actually sent from a gmail interface, even though PyMag doesn’t officially use Google Mail for their domain (not that this would be a bad idea. Hrrmmmm…). I also administer a domain using Google Apps for your domain (well, two, but only one is what you’d call “production”). GMail has become a big part of my life in recent years.

This is not to say that it’s perfect by any stretch. In using it over the past few years (or however long it’s been around – I’ve been using it pretty much since it existed), I’ve come up with lots of ideas that would make it better, but I haven’t written many of them down, so I figured I would start doing that, starting now :-)

Here’s today’s “I wish gmail did…” entry:

Time-based filter application!

Oh yeah. Some of you know what I’m talking about already I’m sure. From the user-interface perspective, the change is dead simple. On the screen where you define what to do with messages that match a filter you’ve created, just add an option next to “Skip the Inbox” that says something like “Remove from Inbox after…” and then provide a way for the user to define some time period like “1 day” or “1 week”.

The effect should be that incoming messages that match the filter are placed in the inbox, and are then removed from the inbox after the user-specified time period. The user can still see the message after this time period by clicking on the label, just like you do to view messages that have skipped the inbox. It’s a “delayed skip the inbox”.

The idea here is that labels aren’t perfect, and neither are people. Labels aren’t perfect because labeling alone isn’t really enough to declutter your inbox unless you skip the inbox. That causes problems due to peoples’ imperfections: they don’t click on all of those filter labels every day, and messages fall through the cracks. There are still some filters that would be good to just have skip the inbox altogether, but for all the clubs and organizations I belong to, having even a 2- or 3-day delay would be great.

Feedback and Boredom Result in 35% Performance Boost for Loghetti

Well, I got some feedback on my last post, and I had some time on my hands tonight, and Python is pretty darn easy to use.

As a result, loghetti is making great strides in becoming a faster log filter. To test the performance in light of the actual changes I’ve made, I’m asking loghetti only to filter on http response code, and I’m only asking for a count of matching lines. I’m only asking for the response code because I happen to know that it will cause loghetti to skip a lot of processing which once was done per-line on every run, but which is now done lazily, on an as-requested basis. So, for example, there’s no reason to go parsing out dates and query strings (two costly operations when you’re dealing with large log files) if the user just wants to query on response codes.

Put another way “Hey, I only want response codes, why should I have to wait around while you process dates and query strings?”
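
To make the “as-requested” idea concrete, here’s a small sketch of lazy field parsing. It is not the code in apachelogs.py: the LogLine class and its attribute names are made up for illustration, and it leans on functools.cached_property (Python 3.8+) to do the “parse once, on first access” bookkeeping.

```python
from datetime import datetime
from functools import cached_property
from urllib.parse import parse_qs, urlsplit


class LogLine:
    """Hypothetical parsed line; not the class from apachelogs.py."""

    def __init__(self, status, raw_time, url):
        self.status = status        # cheap field: kept eagerly
        self._raw_time = raw_time   # expensive fields stay as raw strings
        self._url = url

    @cached_property
    def time(self):
        # Parsed only on first access, then cached on the instance.
        return datetime.strptime(self._raw_time, "%d/%b/%Y:%H:%M:%S %z")

    @cached_property
    def query(self):
        # Query string is only split into key/value pairs if someone asks.
        return parse_qs(urlsplit(self._url).query)


if __name__ == "__main__":
    line = LogLine(404, "31/Jan/2008:16:00:00 -0500", "/a?foo=bar")
    # Filtering on status alone never touches the date or query parsing.
    print(line.status == 404)
```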

So, here’s where I was when this little solo-sprint started:

strudel:loghetti jonesy$ time ./loghetti.py --code=404 --count 1MM.log
Matching lines: 10096


real 5m52.103s
user 5m35.196s
sys 0m3.214s

Almost 6 minutes to process one million lines. For reference, that “1MM.log” file is 246MB in size.

Here’s where I wound up as of about 5 minutes ago:


strudel:loghetti jonesy$ time ./loghetti.py --code=404 --count 1MM.log
Matching lines: 10096


real 3m53.350s
user 3m50.498s
sys 0m1.641s

Hey, looky, there! I even got the same result back. Nice!
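
For anyone who wants to check the arithmetic, here’s a quick sketch that works out the change from the two ‘real’ figures above. The one-million line count comes from the name of the 1MM.log file; the rest is just division, and it comes out to roughly a one-third cut in wall-clock time.

```python
def seconds(minutes, secs):
    """Convert the 'XmY.YYYs' figures printed by time into seconds."""
    return minutes * 60 + secs


before = seconds(5, 52.103)   # 'real' time at the start of the sprint
after = seconds(3, 53.350)    # 'real' time after the changes

reduction = (before - after) / before * 100
speedup = before / after

print(f"run time reduced by {reduction:.1f}%, speedup {speedup:.2f}x")
print(f"throughput: {1_000_000 / before:.0f} -> {1_000_000 / after:.0f} lines/sec")
```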

Ok, so it’s not what you’d call a ‘speed demon’, especially on larger log files. But testing with a 25MB log with 100k lines in it using the same arguments took 25 seconds, and at that point it’s at least usable. I’m actually going to be using it for offline processing and reporting on a machine larger than my first-generation Intel MacBook Pro, and for that type of thing this works just fine. It’s also easier to run this than to sit around thinking about regular expressions and shell scripts all day.

I’m still not pleased with the performance – especially for simple cases like the one I tested with. I just ran a quick ‘grep | wc -l’ on the file to get the same exact results and it worked in about one half of one second! Sure, I don’t mind trading off *some* performance for the flexibility this gives me, but I still think it can be better.

For now, though, I think I might rather support s’more features, like supporting a comparison operator other than “=”, or specifying ranges of dates and times.

Loghetti Beta – An Apache Log Filter

I’m thinking about just making this an open source project hosted someplace like Google Code or something, because there are folks much smarter than myself who can probably do wonders with the code I’ve put together here. Loghetti takes an Apache combined format access log and a few options as arguments, throws your log lines through a strainer, and leaves you with the bits you actually *want* (kinda like spaghetti, but for logs) ;-)

It’s written in Python, and the two dependencies it has are included in the tarball at the bottom. The dependencies are an altered version of Kevin Scott’s apachelogs.py file (I’ve added more granular log line parsing), and Doug Hellmann’s CommandLineApp, which really made creating a CLI application a breeze, since it handles things like autogenerating options, help output, etc automatically without me having to mess with optparse.

So right now, I use it for offline reporting on what’s in my log files, and it’s great for that. I can run, for example:

./loghetti.py --code=500 access.log

And get a listing of the log lines that have an http response code of 500. You can get fancier of course:

./loghetti.py --ip=192.168.1.2 --urldata=foo:bar --month=1 --day=31 --hour=16 access.log

And that’ll return lines where the client IP is 192.168.1.2, with the date specified using the date-related options. The “--urldata” option allows you to filter log lines on the query string part of the URL. So, in the above case, it’ll match if you have something like “&foo=bar” in the query string of the URL.
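
Here’s a minimal sketch of how that kind of query string match can be done with the standard library. It is not loghetti’s actual implementation: the function name matches_urldata is hypothetical, and it assumes the request field looks like ‘GET /path?query HTTP/1.1’.

```python
from urllib.parse import parse_qs, urlsplit


def matches_urldata(request, spec):
    """True if the request's query string has key=value as given by 'key:value'."""
    key, _, wanted = spec.partition(":")
    parts = request.split()
    # A request field normally looks like 'GET /path?query HTTP/1.1'.
    url = parts[1] if len(parts) >= 2 else request
    values = parse_qs(urlsplit(url).query).get(key, [])
    return wanted in values


if __name__ == "__main__":
    print(matches_urldata("GET /search?a=1&foo=bar HTTP/1.1", "foo:bar"))
```

Parsing the query string properly, rather than searching the line for the text “foo=bar”, avoids false matches on keys that merely end in “foo”.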

There are tons of features I’d like to support, but before I do, I feel compelled to address its performance on large log files. Once you throw this at a log file greater than about 50MB, it’s not a great real-time troubleshooting tool. I believe I’d be better off ripping some of the parsing out of apachelogs.py and making it conditional (for example, don’t bother parsing out all of that date information if the user hasn’t asked to filter on it).

Anyway, it’s still useful as it is, so let me know your thoughts on this, and if it’s something you have a use for or would like to help out with, I’ll set up a project for it. For now, you can Download Loghetti