Nagaina - structured systems monitoring platform

Micro-monitoring system that I've actually continued to use this entire time.

nagaina: Fri Feb 9 02:19:00 2007

Still running multihomed, clearly need to do at least some tests in a channel-specific manner. The easy thing to start with is probe_ping, since it has a -S option for source address, which should do the right thing when combined with our iprules. However, since this is testing the interface, not the remote resource, it isn't an option to probe_ping, but a new probe_interface or something like that.
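
A rough sketch of what such a probe_interface might look like, assuming ping is simply run as a subprocess. The helper names, the (ok, detail) return shape, and the 30-second limit are illustrative, not what nagaina does. The -S flag is the one mentioned above; some ping implementations spell the source option differently (Linux iputils uses -I, for example), so it is left as a parameter.

```python
import subprocess

def build_ping_command(source_addr, target, count=1, source_flag="-S"):
    """Build a ping command line that pins the source address.

    source_flag defaults to -S as in the note above; it is overridable
    because ping implementations differ on this option.
    """
    return ["ping", "-c", str(count), source_flag, source_addr, target]

def probe_interface(source_addr, target, source_flag="-S"):
    """Hypothetical probe: return (ok, detail) for one outbound channel."""
    cmd = build_ping_command(source_addr, target, source_flag=source_flag)
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, f"could not run ping: {exc}"
    return result.returncode == 0, result.stdout.strip() or result.stderr.strip()
```

Each channel would get its own call with that channel's local address as source_addr and something just past the interface (the next-hop router, say) as the target.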

Later versions should probably do similar tests with TCP.

nagaina: Mon Dec 4 01:20:00 2006

"dependencies" break down into three categories:

class-wide: "ssh requires ping", "ntp requires ping". These can probably go into the probe implementations directly. (That also simplifies the "http requires ping" case, where the ping target is only part of the http target.)
instance-wide: "ping remote-host requires ping router". These clearly need to be in the config file, but need a clear non-repetitive syntax.
category-wide: "http thisurl requires this batch of afsXX tests to pass". This is trickier mostly in terms of grouping the afs tests.

Since the latter two require some naming cleverness, I'll let them stew and just implement class-wide requirements for now. It looks like ping is the only thing that actually works as a target, which simplifies things further...

So now we have a set() of results to test against, of which only "DOWN machinename" is currently implemented, and a decorator to fail a test if self.machine is listed as DOWN. This means the pings have to be explicitly run first for them to matter. Also, this underlying structure should be easy to use for instance prerequisites, once I figure out how to name them.
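
The set-plus-decorator arrangement might look roughly like this. It is a sketch only: the names results, requires_up, PingProbe and SshProbe are made up for illustration, and the probes are stand-ins that do no real network work.

```python
import functools

# Shared set of results from earlier probes; only "DOWN <machine>" is used here.
results = set()

def requires_up(probe_method):
    """Fail a probe immediately if its machine is already listed as DOWN."""
    @functools.wraps(probe_method)
    def wrapper(self, *args, **kwargs):
        if f"DOWN {self.machine}" in results:
            return False, f"skipped: {self.machine} is DOWN"
        return probe_method(self, *args, **kwargs)
    return wrapper

class PingProbe:
    def __init__(self, machine, reachable):
        self.machine = machine
        self.reachable = reachable  # stand-in for a real ping

    def run(self):
        if not self.reachable:
            results.add(f"DOWN {self.machine}")
            return False, "no reply"
        return True, "ok"

class SshProbe:
    def __init__(self, machine):
        self.machine = machine

    @requires_up
    def run(self):
        return True, "ssh banner seen"  # stand-in for a real check
```

The ordering caveat shows up directly: SshProbe only gets skipped if the PingProbe for the same machine has already run and recorded its DOWN entry.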

Future additions:

more bad_probes tests for the rest of the ping dependencies
modify notifier to complain differently (or not at all) for untested

nagaina: Fri Nov 17 02:06:00 2006

"any" worked well (and shows that there are still dns outages that appear to take out all three of their servers, but they're brief and not every day). Good enough for now; this is stable enough that I should clearly

release another snapshot
make a major change, like parallelism or dependencies

Straight dependencies probably come first, but should keep parallel probing in mind because "don't bother unless" dependencies are a strict constraint on ordering and grouping of parallel tests (though parallelism also adds "don't hammer related resources" constraints that a sequential probe doesn't need).
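
One way to picture the ordering constraint, as a sketch under assumptions: if each probe names the probes it requires, the probes fall into "waves", where everything in a wave could run in parallel once the earlier waves are done. The function name and the dict-of-sets input are invented for this example, and it does nothing about the "don't hammer related resources" side.

```python
def probe_waves(requires):
    """Group probes into waves; each wave only depends on earlier waves.

    requires maps probe name -> set of probe names it depends on.
    """
    remaining = {name: set(deps) for name, deps in requires.items()}
    done = set()
    waves = []
    while remaining:
        ready = sorted(n for n, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("dependency cycle or unknown prerequisite")
        waves.append(ready)
        done.update(ready)
        for name in ready:
            del remaining[name]
    return waves
```

A sequential runner can just flatten the waves in order, so the same structure serves both modes.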

nagaina: Tue Nov 7 17:01:00 2006

The point of probing like this is to notice system detail-level failures directly, instead of getting reports of abstract feature failures and having to diagnose them "downward" from the complaint. As such, it makes sense to think about where user complaints come from; this led to me monitoring the DNS servers for the network provider my mom uses - so if she calls with the (reasonable, end user) report that "email doesn't work" I can say right away that "yeah, your ISP has been having trouble, nothing you need to do, if they don't fix it soon bug them directly" instead of even looking for a local problem on her system.

This led me to notice that a couple of their servers bounce fairly often, which led me to look for ways to raise the timeout... which led me to discover that I couldn't get to http://pydns.sourceforge.net/ at all. (This turned out to just be a random sourceforge outage.)

The timeout change was easy enough (just add timeout=90 to the DNS.Request constructor...) but all it showed was that no, in fact, it wasn't slow - if it didn't answer promptly, it wasn't going to answer at all, so changing the timeout doesn't help.

This does suggest implementing another category of combined test - while in many cases we care if any members of a redundancy group fail, in a case like this we only really want to hear about it if 2/3 or 3/3 are out.
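
A minimal sketch of that kind of combined test, assuming each member probe has already produced a pass/fail result. The function name and the max_failures_tolerated parameter are hypothetical; with three servers and a tolerance of one, it stays quiet for 1/3 down and complains at 2/3 or 3/3.

```python
def group_status(member_ok, max_failures_tolerated=1):
    """Return (ok, detail); fail only when more members are down than tolerated.

    member_ok maps member name -> bool (True means the probe passed).
    """
    failed = sorted(name for name, ok in member_ok.items() if not ok)
    ok = len(failed) <= max_failures_tolerated
    detail = f"{len(failed)}/{len(member_ok)} down"
    if failed:
        detail += ": " + ", ".join(failed)
    return ok, detail
```

Setting max_failures_tolerated=0 gives back the "care if any member fails" behavior, so one helper could cover both kinds of redundancy group.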

nagaina: Mon Nov 6 21:14:00 2006

Noted today that probe_ntp doesn't work too well if ntpq isn't installed. (I should either start packaging nagaina instead of checking it out on the server, or implement an is-package-installed probe to check for the dependencies :-)

Karl points out that "having infrastructure so the probes can be clever about reporting subsidiary programs would be good" to which I responded that one of the "soon" features is test dependencies ("don't bother with Y if X failed") and it would fit well in that. (This particular test did "report file not found" on the ntp invocation, I just thought it was a path problem until I looked closer.)
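
The cheapest version of a "subsidiary program" check is probably just a PATH lookup, which could then act as a prerequisite for probe_ntp. A sketch, with a made-up probe name and the same (ok, detail) convention assumed elsewhere in these examples:

```python
import shutil

def probe_program_installed(program):
    """Return (ok, detail) depending on whether program is on the PATH."""
    path = shutil.which(program)
    if path is None:
        return False, f"{program} not found on PATH"
    return True, path
```

For the case above that would be probe_program_installed("ntpq"), run before any ntp test, so the failure reads as "ntpq missing" rather than as an ntp problem.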

Also, after this morning, I want a nagaina probe for "excess tire wear"...

nagaina: Sat Nov 4 18:36:00 2006

Putting run_probes in cron is a reasonable first-cut, but as it gets longer, it doesn't work out. Strictly speaking, cron is the wrong model; don't want to say "run every N minutes", do want to say "delay between runs". Might possibly want to say "maximum time to run" but that could be done directly in the tool. The concern is noticing if it ever stops running and restarting it.

This comes up enough that it surprises me that it isn't part of a standard tool - but that's true of lots of things lately. (Rather, keeping something running is the traditional job of init - but there isn't really a good user-level configuration for it.) Further digging turns up daemontools, which is unlicensable, and runit, which seems too focused on serving as the primary init.

So, nagaina now includes run_forever, and an @reboot crontab entry.
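
For illustration, the "delay between runs" model could be as small as the loop below. This is a guess at the shape, not the actual run_forever: the parameters are invented, and max_runs exists only so the loop can be stopped in a test.

```python
import subprocess
import time

def run_forever(command, delay_seconds, max_runtime=None, max_runs=None):
    """Run command repeatedly, sleeping delay_seconds after each run ends.

    max_runs exists only so the loop can be exercised in a test.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            subprocess.run(command, timeout=max_runtime)
        except subprocess.TimeoutExpired:
            pass  # the run overran max_runtime and was killed; go around again
        except OSError as exc:
            print(f"could not start {command!r}: {exc}")
        runs += 1
        time.sleep(delay_seconds)
    return runs
```

Unlike a cron schedule, the sleep starts when the previous run finishes, so a slow batch of probes never overlaps with the next one.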

nagaina: Fri Nov 3 19:38:00 2006

Just discovered the hard way that my domain anniversary is today :-}

It's a reminder that I should make my old "checkwhois" a nagaina test. However, that's really a fallback - really something should let me schedule the updates a month in advance... except that I've never really used a todo system for that long...

Also probing whois will be expensive, or rather shouldn't be done every five minutes - so that means adding a way to do explicitly cached probes, which means having state (which so far this doesn't...)
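
A sketch of what an explicitly cached probe could look like, assuming the state is a small JSON file and that probe results are JSON-friendly (lists rather than tuples). The function name, file layout, and the now parameter (there to make testing easy) are all hypothetical.

```python
import json
import os
import time

def cached_probe(name, probe, max_age_seconds, state_path, now=None):
    """Return probe()'s result, reusing a saved one younger than max_age_seconds."""
    now = time.time() if now is None else now
    state = {}
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)
    entry = state.get(name)
    if entry and now - entry["when"] < max_age_seconds:
        return entry["result"]
    result = probe()
    state[name] = {"when": now, "result": result}
    with open(state_path, "w") as f:
        json.dump(state, f)
    return result
```

With max_age_seconds set to a day, a whois check would hit the network once a day even if the rest of the probes run every few minutes.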

nagaina: Sun Oct 29 22:33:00 2006

Implemented zephyr_server; should add jabber_server once I'm running one.

nagaina: Sat Oct 28 21:03:00 2006

[done] Having handed off a version to kcr, actually arrange some release content here:

* index.html
* list the tarballs
* rss feed generated from the VERSION file
* tarball as attachment?
* link on interesting-stuff page

nagaina: Sun Oct 22 04:45:00 2006

Ok, the trivial notifier is done too, with zephyr. An afs-restart related outage showed that

probe_afs_anyserver output needs to be split up and rendered into multiple lines somehow. s/;/<br>/g might suffice...
nagios responded faster, partly due to not having to wait for other slow probes - so parallelism and a shorter recycle time are probably a good idea
notifier failed to send the OK, due to a stupid bug :-)
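
The s/;/<br>/g idea from the first item, done in Python with HTML escaping added so odd characters in the probe output can't break the page. The function name is made up, and this assumes the output really is ';'-separated.

```python
import html

def render_multiline(output):
    """Escape probe output for HTML and put each ';'-separated part on its own line."""
    parts = [part.strip() for part in output.split(";") if part.strip()]
    return "<br>".join(html.escape(part) for part in parts)
```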

nagaina: Sun Oct 22 02:46:00 2006

For now, we've got the simplest thing:

run_probes --html as above
cron job to run it.

Once that's in place, what we really need next isn't prettier html, but notification. We've got zephyr-via-mit now, but should really add jabber, possibly locally. This means the notifier should be pretty transport-agnostic.

The simplest thing would be to run a separate notifier of change after each run_probes, and send the delta. Possibly also send the baseline once an hour if there are any failures... parsing the html is a lazy way to do this.
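
A sketch of the delta idea, assuming each run can be reduced to a dict of probe name to status string (however that gets extracted). The function names are hypothetical; a transport here is just any callable that takes a message, which is one way to keep zephyr and jabber interchangeable.

```python
def status_delta(previous, current):
    """Return human-readable changes between two {probe name: status} dicts."""
    changes = []
    for name in sorted(set(previous) | set(current)):
        old, new = previous.get(name), current.get(name)
        if old != new:
            changes.append(f"{name}: {old or 'untested'} -> {new or 'untested'}")
    return changes

def notify(changes, transports):
    """Send the same message through every transport (any callable taking a str)."""
    if not changes:
        return
    message = "\n".join(changes)
    for send in transports:
        send(message)
```

The hourly baseline would be the same notify call fed with the current failures instead of the delta.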

nagaina: Sat Oct 21 04:59:00 2006

So we have parity with the existing thok.org nagios configuration, purely in terms of things tested, with effectively no rendering at all. So the next steps:

run_probes --html generating a .html file with a refresh line and not much else
- multiple (js?) display modes:
- verbose
- green-red status bar
- quiet unless broken
- groupby host?
wrapper to run-and-sleep that to keep the file fresh
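
The first step, a page with "a refresh line and not much else", could be sketched like this. The function name, the results shape, and the 60-second default are assumptions for the example, not the real --html output.

```python
import html

def render_report(results, refresh_seconds=60):
    """Render {probe name: (ok, detail)} as a bare page that reloads itself."""
    rows = []
    for name, (ok, detail) in sorted(results.items()):
        status = "OK" if ok else "FAIL"
        rows.append(f"<li>{html.escape(name)}: {status} {html.escape(detail)}</li>")
    return (
        "<html><head>"
        f'<meta http-equiv="refresh" content="{refresh_seconds}">'
        "<title>nagaina</title></head><body><ul>"
        + "".join(rows)
        + "</ul></body></html>"
    )
```

The display modes listed above (verbose, status bar, quiet unless broken) would all be different renderings of the same results dict.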

nagaina: Fri Oct 20 04:29:00 2006

new things we need:

combining named sets, so we can run anyserver on all of fileservers and dbservers once each; likewise, loginservers from everything
some kind of "if not, then don't" override, i.e. ping host failing -> don't run ssh host
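
Combining named sets is mostly a set union. A small sketch, with placeholder host names and a hypothetical named_sets table standing in for whatever the config file ends up providing:

```python
# Hypothetical named sets; the host names are placeholders.
named_sets = {
    "fileservers": {"afs1", "afs2"},
    "dbservers": {"afs2", "afs3"},
}

def combine(*set_names):
    """Union several named sets so each host is probed once."""
    combined = set()
    for set_name in set_names:
        combined |= named_sets[set_name]
    return sorted(combined)
```

Here afs2 is in both groups but appears once in combine("fileservers", "dbservers"), so anyserver runs against it a single time.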

nagaina: Thu Oct 19 02:19:00 2006

with afs set up and porting some more ping tests we could be done, as far as duplication goes.

of course, we also need to implement a /proc/mdstat check, of some sort...
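
One possible shape for that check, as a sketch: the Linux md driver reports array state in /proc/mdstat, where a marker like [UU] means all members are up and an underscore (as in [U_]) marks a missing one. The function names are invented, and the parsing is deliberately simple; it only handles the common "mdN : ..." header followed by a status line.

```python
import re

# Matches the member-state marker at the end of a status line, e.g. "[UU]" or "[U_]".
STATE_RE = re.compile(r"\[([U_]+)\]\s*$")

def degraded_arrays(mdstat_text):
    """Return names of md arrays whose state marker shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        state = STATE_RE.search(line)
        if current and state and "_" in state.group(1):
            degraded.append(current)
    return degraded

def probe_mdstat(path="/proc/mdstat"):
    """Return (ok, detail) for the local machine's software RAID arrays."""
    with open(path) as f:
        bad = degraded_arrays(f.read())
    return (not bad), ("degraded: " + ", ".join(bad) if bad else "all arrays healthy")
```

Since this only reads a local file, checking a remote machine's arrays would still need some kind of agent on that machine.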

nagaina: Wed Oct 18 01:13:00 2006

Since nagios-plugins are separate, support them. First, though, just do some basic ones directly in python, given how powerful check_system_health was...

Also - don't need cherrypy directly yet - just have a report page, and run out of cron (or some equivalent loop-runner.) So start with "generate one report", add --html, then add continuous-runner...

Run as a non-privileged user, as well.

nagaina: Sat Oct 14 06:24:00 2006

More constraints:

commandline interface - web page for cheap reporting only, don't even try the whole personnel-management thing nagios does.
document it more in terms of "service is working assertions"
immediate types:
afs
mdstats
- is "local file" enough for that, or do we need delegated agents?
thok, www-images, momrouter
- use twill? actually get an image, at least...
is python-logging a useful tool for the output stage? (given that there's a jabber hook for it, a zephyr hook is easy)
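
On the logging question: a custom handler for Python's standard logging module is only a few lines, which suggests a zephyr hook really would be easy. A sketch, where TransportHandler and the logger name are made up and send is any callable that delivers a string (no real zephyr or jabber client is assumed):

```python
import logging

class TransportHandler(logging.Handler):
    """Logging handler that hands each formatted record to a send callable."""

    def __init__(self, send, level=logging.WARNING):
        super().__init__(level)
        self.send = send  # e.g. a function wrapping a zephyr or jabber client

    def emit(self, record):
        try:
            self.send(self.format(record))
        except Exception:
            self.handleError(record)

log = logging.getLogger("nagaina.example")
```

Probes would then just call log.warning(...) on failure, and which transports hear about it becomes a matter of which handlers are attached.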

nagaina: Sun Oct 2 12:28:00 2005

Some basic tricks:

support new modules dynamically. Have a database of functions, the modules they come from, the last-reload-time of the module, and handle them.
use the BSD license, but maybe add some notes emphasizing the no-warranty aspects, since it could be used for critical monitoring.
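
The "database of functions" from the first item might be sketched as below, assuming file modification time is a good enough signal that a module changed. The registry layout and the register/call names are invented for the example.

```python
import importlib
import os

# function name -> {"module": module object, "mtime": mtime when last (re)loaded}
registry = {}

def register(function_name, module_name):
    """Import module_name and remember where function_name lives."""
    module = importlib.import_module(module_name)
    registry[function_name] = {
        "module": module,
        "mtime": os.path.getmtime(module.__file__),
    }

def call(function_name, *args, **kwargs):
    """Call a registered function, reloading its module first if the file changed."""
    entry = registry[function_name]
    module = entry["module"]
    mtime = os.path.getmtime(module.__file__)
    if mtime > entry["mtime"]:
        module = importlib.reload(module)
        entry["module"], entry["mtime"] = module, mtime
    return getattr(module, function_name)(*args, **kwargs)
```

Looking the function up by name at call time, rather than holding on to the function object, is what lets a reloaded module's new code take effect.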