10 Mistakes in Systems Management

I’ve seen the inside of lots and lots of businesses over the past decade or so. Though the technology has changed dramatically in many areas of the data center, the general, high-level methodologies for building a sane environment still apply. While the implementation is typically done by system administrators, it also helps if their managers, and the people who hire systems administrators, are at least moderately clueful.

And just as the high-level methodologies haven’t changed, the ways in which people abuse or neglect them haven’t really changed much either. Here’s a list of things to avoid, and things to jump on, when building and growing your environment.

  1. Don’t hire systems administrators: I have to say that this is a problem I used to see only in very small businesses that couldn’t afford a sysadmin, and perhaps didn’t warrant having one full-time, but that seems to be changing. I have now spoken with perhaps four managers in as many months who really should have at least one full-time sysadmin, and instead are just assigning systems tasks to volunteers from the engineering team. The results are disastrous, and get worse as time passes and the environment grows. At least develop an ongoing relationship with a systems guru and bring them in for projects as they arise — that’s worlds better than the results you get from doling out the tasks to people who don’t really have a systems background. The problem isn’t that any given task is “hard”. Most aren’t. The problem is actually multi-faceted: first, if something goes wrong, developers typically don’t have the background to understand its implications and impact. They also probably don’t have the experience to quickly fix the problem, and they may not know of a resource to get authoritative information on the topic (hint: online forums are not typically the best source of authoritative information for complex systems issues).
  2. Don’t automate: I’m an advocate of investing in automation very early and very often. If you’re just starting your business, and it relies heavily on technology, the first wave of systems you buy should include a machine to facilitate automated installation in some form. The reasons for this are many, but a couple of big ones are:
    • Consistency: if your system builds are easily repeatable, you can typically set up an automated install regime to make base installs identical, and alter only those parts of the install that support some unique service the system provides. You can then be sure that, to a very large degree, all of your machines are identical in terms of the packages installed, the base service configurations, the user environment, etc.
    • Server-to-Sysadmin ratio: automating just about anything in your system environment results in less overhead in terms of man hours devoted to that task. Automating the installation, backups, monitoring, log rotation, etc., means that each system administrator you hire can manage a larger number of machines.
  3. Make security an afterthought: Security should be a well-thought-out component of everything you do. People have been preaching this for eons, and yet it’s pretty clear to me that plenty (I might even say a majority) of businesses don’t even practice the basics, like keeping on top of security updates to systems and applications, removing/archiving user accounts when people leave the company, and setting up a secure means of remote access. Security breaches are typically a nightmare. There are a lot of questions to answer when one happens, and most of them take some time and manual drudgery to answer. In addition, machines need to be reimaged, data needs to be recovered, audits need to take place, and of course, the big one: everyone has to spend time in meetings to talk about all of this, and then, magically, projects are planned to immediately ensure that it never happens again… until it does, because the only time those projects gain priority is when a breach occurs. Get on top of it now, and save yourself the headaches, the costs, and other potential (and way bigger) disasters.
  4. Don’t plan for failure: “plan for failure” is presently a term bandied about in relation to building scalable, reliable services using large numbers of machines, but the phrase also applies to good old infrastructure services. For example, for some reason, there are managers out there who demand that DNS services be handled by in-house systems, and then they end up with only a single DNS server for their entire domain! I’m not kidding! I’ve seen that twice in the past year, and that’s too much. For every service you deploy, you should make a list of all of the interdependencies that arise from the use of that service. Then determine what your tolerance for downtime is for that particular service, taking into account services that might go away if this service is unavailable. Why? Because it’s going to fail. If the service itself doesn’t fail, something else will — like the hard drive — and your end users won’t know or care about the difference. They’ll just know the service is gone.
  5. Don’t communicate: as environments grow and become more complex, changes, tweaks, and modifications will be required. No systems environment I’ve ever seen is static. You’re going to want to implement services a little differently to offer more security, more reliability, a better overall user experience, or whatever. Communicating with the people you serve about these changes should be part of the planning for projects like this. Users should know that a change is happening, how and when it’s happening, how they’ll be affected during the change, and, hopefully, how their lives will be better because of it. In addition, you should be thinking about how you’ll communicate with your systems team, and your users, in the event of a catastrophe. Do you have an out-of-band mechanism for communication that doesn’t depend on your network being alive? If, I dunno, you lost commercial power, say, causing your UPS to kick itself on and immediately blow up, leaving your entire data center completely black, and you were the only person in the building, how would you get in touch with people to help you through the disaster? Note that this implies that your email server, automated phone routing, etc., are down as well.
  6. Don’t utilize metrics: “If you can’t measure it, you can’t manage it” holds true in systems management as well. How do you know when you’ll need more disk? How do you know when your database server will start to perform badly? How do you know when to think about that reverse proxy for your web servers? You need to monitor everything. Resource utilization metrics are key to growing your environment in a cost-effective way while still giving services the resources they need (see the first sketch just after this list for a trivially small way to start collecting them).
  7. Don’t utilize monitoring: not metrics in this case, but service and system availability monitoring. What’s the cost to you if your website is unavailable for 1 minute? One hour? One day? If your database server, which serves your web site, goes down at 10pm and nobody knows about it until 8am, how does that affect your business? In reality, you *always* have availability monitoring: your customers will be your alert mechanism in the absence of any other monitoring solution. And what’s the cost in terms of the perception of the service you provide as a result? Monitoring can be non-trivial, but it is absolutely essential in almost all environments.
  8. Don’t use revision control: Revision control can get you and your team out of so many headaches that I can’t list them all here. I’m not even going to tell you which tools to consider, because if you’re running an environment without revision control, almost anything is better than what you have. Revision control can be used to save different versions of all of the configuration files on your systems, documentation for all of your systems, all of the code written in your environment (i.e. those scripts used for system automation, etc.), and your automated installation template files. It can also be utilized in the chain of tools used to perform rollouts of new applications in a sane way (it can also be used to do rollouts in insane ways, but that’s another post). Revision control is equal parts disaster recovery, convenience, accountability, consistency, and control. To the extent that activity can be measured, it also provides metrics.
  9. Don’t use configuration management: Depending on the size of the environment, this can come down to something as simple as an NFS mount or a set of imaging templates, with maybe some rsync-ish scripts around (kept in a revision control system!), or it can get complex, involving things like Puppet or CFEngine, along with other tools depending on your platform and other restrictions (the second sketch after this list shows roughly what the simple end looks like). The idea, though, is to abstract away some of the low-level, manual drudgery that goes along with systems management, so that you can, to quote the Puppet website, “focus more on how things should be done and less on doing them.” This ties in nicely with revision control, recoverability, the goal of consistency, automation, and increasing your server-to-sysadmin ratio.
  10. Don’t be a people person: All of IT, not just systems management, has historically had issues communicating with business unit personnel. Likewise, business personnel often have no idea how to communicate with IT personnel. No matter which side of the fence you fall on, being good with people (communicating clearly, and perceiving and predicting the needs of others) will be very beneficial in winning consensus and gaining support for your projects. If nothing else, your career benefits from a reputation for good interpersonal skills and being an “all-around good guy.” I know this is a completely non-technical item, but in my experience, some of the biggest problems in systems management are not about technology, but about people.
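A couple of the items above lend themselves to quick illustrations. For the metrics item (#6), here’s a minimal sketch of something that gets you trend data with almost no effort. The log path is just a placeholder, and a real deployment would feed a proper trending tool instead, but even this answers the “when will I need more disk?” question:

#!/bin/bash
# Append a timestamped record of filesystem utilization to a flat log.
# Run it hourly from cron; grep or graph the log later to see growth trends.
LOG=/var/log/disk-trend.log   # hypothetical path -- put it wherever you keep metrics
df -P | awk -v ts="$(date +%F.%T)" 'NR > 1 {print ts, $6, $5}' >> "$LOG"

And for the configuration management item (#9), the “rsync-ish scripts” end of the spectrum can be as small as the following sketch. The host list, paths, and service name are made up for illustration; Puppet or CFEngine subsume this sort of thing as the environment grows:

#!/bin/bash
# Push a canonical, revision-controlled set of config files out to each
# managed host, then reload the affected service on that host.
CONFIG_SRC=/srv/configs/common/        # hypothetical checked-out copy of the config repo
HOSTS="web01 web02 db01"               # hypothetical host list
for h in $HOSTS; do
    rsync -av --delete "$CONFIG_SRC" "root@$h:/etc/common/" && \
        ssh "root@$h" '/sbin/service httpd reload'
done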

Throw out your Perl: One-line aggregation in awk

I ran into a student from a class I taught last summer. He’s a really sharp guy, and when I first met him, I was impressed with just how much Perl he could stuff into his brain’s cache. He would write what he called ‘one-liners’ in Perl that, in reality, took up 5-10 lines in his terminal. Still, he’d type furiously, without skipping a beat. But he told me when we met that he no longer does this, because I covered awk in my class.

His one-liners were mostly for data munging, and the data he needed to munge was mostly pretty predictable: it had a fixed number of fields and a consistent delimiter — in short, it was perfect for munging in awk without resorting to any kind of esoteric awk-ness.

One thing I cover in the learning module I’ve developed on Awk is aggregation of data using pretty simple awk one-liners. For example, here’s a pretty typical /etc/passwd file (we need some data to munge):

root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
news:x:9:13:news:/etc/news:
uucp:x:10:14:uucp:/var/spool/uucp:/sbin/nologin
operator:x:11:0:operator:/root:/sbin/nologin
games:x:12:100:games:/usr/games:/sbin/nologin
gopher:x:13:30:gopher:/var/gopher:/sbin/nologin
ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin
nobody:x:99:99:Nobody:/:/sbin/nologin
dbus:x:81:81:System message bus:/:/sbin/nologin
apache:x:48:48:Apache:/var/www:/sbin/nologin
avahi:x:70:70:Avahi daemon:/:/sbin/nologin
mailnull:x:47:47::/var/spool/mqueue:/sbin/nologin
smmsp:x:51:51::/var/spool/mqueue:/sbin/nologin
distcache:x:94:94:Distcache:/:/sbin/nologin
nscd:x:28:28:NSCD Daemon:/:/sbin/nologin
vcsa:x:69:69:virtual console memory owner:/dev:/sbin/nologin
rpc:x:32:32:Portmapper RPC user:/:/sbin/nologin
rpcuser:x:29:29:RPC Service User:/var/lib/nfs:/sbin/nologin
nfsnobody:x:65534:65534:Anonymous NFS User:/var/lib/nfs:/sbin/nologin
sshd:x:74:74:Privilege-separated SSH:/var/empty/sshd:/sbin/nologin
mysql:x:27:27:MySQL Server:/var/lib/mysql:/bin/bash
dovecot:x:97:97:dovecot:/usr/libexec/dovecot:/sbin/nologin
squid:x:23:23::/var/spool/squid:/sbin/nologin
ldap:x:55:55:LDAP User:/var/lib/ldap:/bin/false
pcap:x:77:77::/var/arpwatch:/sbin/nologin
ntp:x:38:38::/etc/ntp:/sbin/nologin
rpm:x:37:37::/var/lib/rpm:/sbin/nologin
haldaemon:x:68:68:HAL daemon:/:/sbin/nologin
named:x:25:25:Named:/var/named:/sbin/nologin
xfs:x:43:43:X Font Server:/etc/X11/fs:/sbin/nologin
jonesy:x:500:500::/home/jonesy:/bin/bash

It’s not exotic, cool data that we’re going to infer a lot of interesting things from, but it’ll do for pedagogical purposes.

Now, let’s write a super-simple awk aggregation routine that’ll count the number of users whose UID is > 100. It’ll look something like this:

awk -F: '$3 > 100 {x+=1} END {print x}' /etc/passwd

The important thing to remember is that awk initializes variables to zero (in a numeric context) the first time you reference them, which cuts down on some clutter: there’s no need to declare or zero out x before incrementing it.
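Against the sample passwd data above, that one-liner should print 2, since nfsnobody (UID 65534) and jonesy (UID 500) are the only accounts with UIDs greater than 100.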

Let’s abuse awk a bit further. What if we want to know how many users use each shell in /etc/passwd, whatever those shells may be? Here’s a one-liner that’ll take care of this for you:

awk -F: '{x[$7]+=1} END {for(z in x) {print z, x[z]} }' /etc/passwd

While awk doesn’t technically support multi-dimensional arrays, its arrays are associative, so they don’t have to be numerically indexed. So here, we tell awk to increment x[$7]. $7 is the field that holds the shell for each user, so if $7 on the current line is /bin/bash, then we’ve told awk to increment the value in the array indexed at x["/bin/bash"]. So, if there’s only one line containing /bin/bash up to the current record, then x["/bin/bash"] is 1.
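One caveat: awk’s for (z in x) construct returns the indices in no particular order. If you want the busiest shells listed first, a small extension of the same one-liner pipes the output through sort (printing the count first makes the numeric sort trivial):

awk -F: '{x[$7]+=1} END {for(z in x) {print x[z], z}}' /etc/passwd | sort -rn

Against the sample data above, /sbin/nologin should land at the top by a wide margin.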

There are a lot of great things you can move on to from here. You can do things that others use Excel for right in awk. If you have checkbook information in a flat file, you can add up purchases in a single category only, or, using the technique above, in every category. If you store your stock purchase price on a line with the current price, you can use simple math to get the spread on each line and see whether your portfolio is up or down. Let’s have a look at something like that. Here’s some completely made-up, hypothetical data representing a fictitious user’s stock portfolio:

ABC,100,12.14,19.12
FOO,100,24.01,17.45
BAR,50,88.90,94.33
BAZ,50,75.65,66.20
NI,23,33.12,43.32

Save that in a file called “stocks.txt”. The columns are stock symbol, number of shares, purchase price, and current price, in that order. This awk one-liner indexes the ‘x’ array using the stock symbol, and the value at that index is set to the amount gained or lost:

awk -F, '{x[$1]=($2*$4)-($2*$3)} END {for(z in x) {print z, x[z]}}' stocks.txt

Hm. Actually, that’s kind of inefficient. I realized while previewing this post that I can shorten it up a bit like this:

awk -F, '{x[$1]=($2*($4 - $3))} END {for(z in x) {print z, x[z]}}' stocks.txt

Glad I caught that before the nitpickers flamed me to a crisp. Always preview your posts! ;-P

Ah, but of course, that’s not enough. This spits out the gain and loss for each stock, but what about the net gain or loss across all of them? You only need to tweak a little bit:

awk -F, '{x[$1]=($2*($4-$3)); y+=x[$1]} END {for(z in x) {print z, x[z]}; print "Net: "y}' stocks.txt

We just added an accumulation into the ‘y’ variable in the main block (before the “END”), and then added a print statement inside the “END” block.
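As a sanity check, the numbers for stocks.txt work out to ABC 698, FOO -656, BAR 271.5, BAZ -472.5, and NI 234.6, for a net of 75.6, so this hypothetical portfolio is (barely) up. The per-symbol lines may come out in any order, since for (z in x) makes no ordering guarantees.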

I hope this helps some folks out there. Also, if your team needs to know stuff like this, I do on-site training!

Teaching a Course on Profiling and Debugging in Linux

Dear Lazyweb,

So, I’ve been in Chicago for a week teaching a beginner and an intermediate course on using and administering Linux machines. This week, I’ll teach an intermediate and an advanced course on Linux, and the advanced course will cover profiling and debugging. The main tools I’m covering will be valgrind and oprofile, though I’ll be going over lots of other stuff, like iostat, vmstat, strace, what’s under /proc, and some more basic stuff like sending signals and the like.

What makes me a bit nervous is that the advanced students are mostly CS-degree-holding systems developers, so they’ll probably expect me to know very low-level details of how things are implemented at the system/kernel level. I’d love to know more about that myself, and I actively try to increase my knowledge in that area! Alas, most of my experience with low-level tools like these comes from trying to understand how things like MySQL do their jobs.

They may also turn up their noses at the admin-centric coverage, which I believe is actually very important in order to get a complete view of the system and to reduce duplication of effort. Of course, I’ll use a bit of time at the beginning of day 1 to properly set expectations, and we’ll see how they respond. As they say in the hospitality industry, presentation is everything.

The portion of the course that covers valgrind and oprofile won’t be until Thursday, or perhaps even Friday, so I figured I’d take this opportunity to ping the lazyweb and find out a couple of things:

  • What tools do you use in conjunction with valgrind and/or oprofile?
  • What kinds of problems are you solving with these and similar tools?
  • What most annoys you about these and similar tools?
  • Do you use these tools for development, administration, or both?
  • If you have cool links, share!
  • If you’ve been able to make effective use of oprofile inside of a vmware instance, share (because my thinking is that this probably *should* be nearly impossible unless vmware actually simulates the hardware counters oprofile needs access to!)
  • This one is just for me, not the course: are there any demos/tutorials on using valgrind with Python? I’ve seen the standard suppression file, but it still seems like profiling a Python script would be difficult being that you are actually going to be profiling the interpreter (or so it seems).

Thanks!

Shell Scripting: Bash Arrays

I’m actually not a huge fan of shell scripting, in spite of the fact that I’ve been doing it for years, and am fairly adept at it. I guess because the shell wasn’t really intended to be used for programming per se, it has evolved into something that sorta kinda looks like a programming language from a distance, but gets to be really ugly and full of inconsistencies and spooky weirdness when viewed up close. This is why I now recode in Python where appropriate and practical, and just about all new code I write is in Python as well.

One of my least favorite things about Bash scripting is arrays, so here are a few notes for those who are forced to deal with them in bash. 

First, to declare an array variable, you can assign directly to a variable name, like this: 

myarr=('foo' 'bar' 'baz')

Or, you can use the ‘declare’ bash built-in: 

declare -a myarr=('foo' 'bar' 'baz')

The ‘-a’ flag says you want to declare an array. Notice that when you assign elements to an array like this, you separate the elements with spaces, not commas. 

Arrays in bash are zero-indexed, so to echo the value of the first element of myarr, we do this: 

echo ${myarr[0]}

Now that you have an array, and it has values, at some point you’ll want to loop over it and do something with each value in the array. Almost anyone who uses an array will at some point want to do this. There’s a little bit of confusion for the uninitiated in this area. For whatever reason, there is more than one way to list out all of the elements in an array. What’s more, the two different ways act differently when they’re used inside of double quotes (wtf?). To illustrate, cut-n-paste this into a script, and then run the script: 

#!/bin/bash
myarr=('foo' 'bar' 'baz')
echo ${myarr[*]}  # unquoted, * and @ behave the same
echo ${myarr[@]}
echo "${myarr[*]}"
echo "${myarr[@]}" # looks just like the previous line's output
for i in "${myarr[*]}"; do # echoes one line containing all three elements
   echo $i
done
for i in "${myarr[@]}"; do  # echoes one line for each element of the array.
   echo $i
done
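If you run that, the output should look like this (the first five lines are identical; only the last for loop splits the array back into individual elements):

foo bar baz
foo bar baz
foo bar baz
foo bar baz
foo bar baz
foo
bar
baz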

Odd but true. Inside double quotes, “@” expands each element of the array to its own “word”, while “*” joins the entire set of elements into a single word (using the first character of IFS, a space by default, as the separator). 

Another oddity — to get just a count of the elements in the array, you do this: 

echo ${#myarr[*]} 

Of course, this also works: 

echo ${#myarr[@]}

And the funny thing here is that these two forms don’t produce different results even inside of double quotes: you might expect “${#myarr[*]}” to treat the whole set as a single word and return 1, but both return the number of elements. I’d be hard pressed, of course, to figure out a use for counting the entire set of array elements as “1”, but it still seems a little inconsistent. 

Also note that you aren’t limited to counting the elements in the array – you can also get the length, in characters, of any single element in the array: 

echo ${#myarr[0]} 

That’ll return 3 for the array we defined above. 
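A few other operations come up constantly once you start using arrays, so here’s a quick sketch of appending, slicing, and looping over indices (the += append form requires bash 3.1 or later):

myarr+=('qux')               # append an element; myarr is now foo bar baz qux
echo "${myarr[@]:1:2}"       # slice: two elements starting at index 1 -> bar baz
for i in "${!myarr[@]}"; do  # ${!myarr[@]} expands to the indices, not the values
    echo "$i: ${myarr[$i]}"
done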

Have fun!

Heading to Chicago

I’ll be landing in Chicago tonight, assuming all goes well. I’ll be there through Jan 23. If there are any Linux user groups, LOPSA meetings, Python user groups, or anything else cool (a brewers’ club, maybe?) meeting while I’m there, find me on twitter (bkjones), or shoot me an email (same name, at gmail).

I’m teaching courses on beginner, intermediate, and advanced Linux administration while I’m there, with some coverage (by request!) of Python. I currently have no clients requesting coverage of Perl — just shell and Python. Sweet!

Advanced Linux Course… In Chicago… In January!

You heard it right, folks. I’ll be in lovely downtown Chicago for two weeks. Actually, I’m teaching four classes, each one consisting of a week’s worth of half-day sessions: one beginner course, two intermediate courses, and an advanced course. I’ll also be returning in February to do an intermediate and an advanced course. This was the result of a successful full-week course I delivered in NYC that consisted of five full days of advanced Linux training. Of course, what counts as “beginner” and what counts as “advanced”, I’ve learned, varies widely among training clients. The beginner course I’m teaching next week is geared toward power users of other OSes, so I can assume a lot of basic high-level knowledge, while another beginner course I’m doing in February or March for a different client assumes that the user is not even very advanced as a Windows end user!

What determines “advanced” differs, too. Once you are “advanced”, you can be advanced in different aspects and usage scenarios. The advanced course I’m teaching for one client deals very heavily with two main areas: scripting/data munging, and system profiling and performance. An odd mix, but I do custom content development for on-site training clients, unless I have existing modules covering the topics they need, in which case they can pick and choose to put together their course, or have me query them about their needs and put together a proposed package.

There’s still one nagging issue with my Linux training handout. The content is good. I’ve gotten good feedback on it from some sharp people. However, I’m using OpenOffice to put it all together, and I’m having a bear of a time putting together a good index. My belief is that all indexes, for all books, are lacking, but this goes beyond that into “wtf?”-space. The main problem is that the index generating tool in OOo lets you say that this word should be matched on the “Whole word”, but that’s the exact opposite of what I need. What that feature does is it only puts a page listing in the index if the *entire word* exists on that page. What I need is an option that says something like “standalone”, where the page isn’t listed unless the word is surrounded by a word boundary on either side. You’d be shocked at how many everyday words contain standard Linux commands in them. “rm” and “ls” are particularly troublesome. Almost every page would be listed in the index! If anyone has tips on external tools or other OOo techniques, definitely leave links or comments!!

At some point, probably while I’m in Chicago holed up in a hotel, I’ll post the modules I have put together so far on the web site of my business that I perform training out of (I have a one-man LLC these days).

The Bandwidth Delay Product

I’ve actually needed to perform this calculation in the past, but never knew it had a proper name! The value produced by this simple calculation will tell you how much of your network pipe you can actually fill at any given point in time. To figure this out, you need two values: the available bandwidth, and the latency (or delay) between the two communicating hosts.

In this example, I’ll figure out my BDP between my server in the basement, and this blog.

First, I’ll use ‘ping’ to figure out the “delay” value:

jonesy@canvas:~$ ping www.protocolostomy.com
PING www.protocolostomy.com (74.53.92.66) 56(84) bytes of data
64 bytes from web36.webfaction.com (74.53.92.66): icmp_seq=1 ttl=252 time=57.2 ms
64 bytes from web36.webfaction.com (74.53.92.66): icmp_seq=2 ttl=252 time=57.4 ms
64 bytes from web36.webfaction.com (74.53.92.66): icmp_seq=3 ttl=252 time=57.3 ms
64 bytes from web36.webfaction.com (74.53.92.66): icmp_seq=4 ttl=252 time=57.3 ms
64 bytes from web36.webfaction.com (74.53.92.66): icmp_seq=5 ttl=252 time=57.2 ms
64 bytes from web36.webfaction.com (74.53.92.66): icmp_seq=6 ttl=252 time=57.1 ms
64 bytes from web36.webfaction.com (74.53.92.66): icmp_seq=7 ttl=252 time=57.0 ms
64 bytes from web36.webfaction.com (74.53.92.66): icmp_seq=8 ttl=252 time=57.0 ms
64 bytes from web36.webfaction.com (74.53.92.66): icmp_seq=9 ttl=252 time=56.9 ms
64 bytes from web36.webfaction.com (74.53.92.66): icmp_seq=10 ttl=252 time=56.8 ms
64 bytes from web36.webfaction.com (74.53.92.66): icmp_seq=11 ttl=252 time=56.7 ms
64 bytes from web36.webfaction.com (74.53.92.66): icmp_seq=12 ttl=252 time=56.7 ms

This is a nice set of values, all hovering around 57ms, so I’ll use 57ms as my RTT. For my available bandwidth, I guess I could just use my ISP’s advertised speed of 6Mbps upload and 3Mbps download. Let’s see what my BDP looks like in that scenario.

(6,000,000 bits/s * 0.057 s) = ~342,000 bits = ~342 kb

(342 kb) / 8 = ~42 KB
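If you’d rather not push the numbers around by hand, the same calculation fits in an awk one-liner; the bandwidth and RTT below are just my values, so plug in your own:

awk 'BEGIN {bw=6000000; rtt=0.057; bdp=bw*rtt; printf "BDP: %d bits = %.2f KB\n", bdp, bdp/8/1000}'

That prints “BDP: 342000 bits = 42.75 KB” for this link.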

So, if there were no other components involved, roughly 42 kilobytes is the amount of data that would need to be in flight at any given moment to keep my connection full. However, there are other components involved, most notably the send and receive buffers in the Linux kernel. In kernels prior to 2.6.7, these buffers were configured by default to strike a balance between good performance for local network connections and low system overhead (e.g. CPU and memory used to process connections and packets). They were not optimized for moving large data sets over long-haul paths. More recent kernels, however, automatically tune the values, so you should see excellent performance on machines with over 1GB of RAM and a BDP under 4MB. I only just learned that this is the case – I had been searching around for my notes from 2001 about echoing values into files under /proc/sys/net and using the sysctl variables… no more!

If you want to see whether your system has autotuning enabled, check whether /proc/sys/net/ipv4/tcp_moderate_rcvbuf is set to “1” by cat’ing the file. There’s no need to hunt for a corresponding ‘sndbuf’ knob; sender-side autotuning has been enabled since the early 2.4 kernels.
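In other words (this is the standard path on recent 2.6 kernels):

cat /proc/sys/net/ipv4/tcp_moderate_rcvbuf
sysctl net.ipv4.tcp_moderate_rcvbuf

A value of 1 from either command means receive-buffer autotuning is on.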

Of course, I don’t actually see this kind of performance. My quick-and-dirty test used the worst possible tool for the job: scp. Tools like ssh and scp add a lot of overhead at various levels, thanks to protocol chatter and (especially) encryption. So, given that, my performance really isn’t bad. I’ll see how high I can get it and post my results next week, just for giggles. :-)