Micro-monitoring system that I've actually continued to use this entire time.
Still running multihomed, clearly need to do at least some tests in a
channel-specific manner. The easy thing to start with is
since it has a
-S option for source address, which should do the right
thing when combined with our iprules. However, since this is testing
the interface, not the remote resource, it isn't an option to
probe_ping, but a new
probe_interface or something like that.
Later versions should probably do similar tests with TCP.
"dependencies" break down into three categories:
Since the latter two require some naming cleverness, I'll let them stew and just implement class-wide requirements for now. It looks like ping is the only thing that actually works as a target, which simplifies things further...
So now we have a
set() of results to test against, of which only "DOWN
machinename" is currently implemented, and a decorator to fail a test
if self.machine is listed as DOWN. This means the pings have to be
explicitly run first for them to matter. Also, this underlying
structure should be easy to use for instance prerequisites, once I
figure out how to name them.
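A minimal sketch of how that result set and decorator could fit together. The names results, skip_if_down, and HostProbes are made up for illustration (only self.machine and the "DOWN machinename" entry come from the notes above), and the ping itself is passed in as a plain callable so the example stays self-contained:

```python
import functools

# Hypothetical shared result set; "DOWN machinename" is the only entry
# type described in the notes.
results = set()


def skip_if_down(probe):
    """Fail a probe right away if its machine is already listed as DOWN."""
    @functools.wraps(probe)
    def wrapper(self, *args, **kwargs):
        if "DOWN %s" % self.machine in results:
            return (False, "%s is DOWN, not probing" % self.machine)
        return probe(self, *args, **kwargs)
    return wrapper


class HostProbes(object):
    def __init__(self, machine, ping):
        self.machine = machine
        self._ping = ping  # assumed callable: returns True if the host answers

    def probe_ping(self):
        ok = self._ping(self.machine)
        if not ok:
            results.add("DOWN %s" % self.machine)
        return (ok, "ping")

    @skip_if_down
    def probe_ssh(self):
        # Stand-in for a real check; only reached when the host is not DOWN.
        return (True, "ssh")
```

Note that probe_ssh only gets short-circuited if probe_ping ran earlier in the same pass, which is the ordering requirement mentioned above.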
bad_probes tests for the rest of the ping dependencies
"any" worked well (and shows that there are still DNS outages that appear to take out all three of their servers, but they're brief and not every day). Good enough for now; this is stable enough that I should clearly
Straight dependencies probably come first, but I should keep parallel probing in mind, because "don't bother unless" dependencies are a strict constraint on the ordering and grouping of parallel tests (though parallel probing also adds "don't hammer related resources" constraints that a sequential probe doesn't need).
The point of probing like this is to notice system detail-level failures directly, instead of getting reports of abstract feature failures and having to diagnose them "downward" from the complaint. As such, it makes sense to think about where user complaints come from; this led to me monitoring the DNS servers for the network provider my mom uses - so if she calls with the (reasonable, end user) report that "email doesn't work" I can say right away that "yeah, your ISP has been having trouble, nothing you need to do, if they don't fix it soon bug them directly" instead of even looking for a local problem on her system.
This led me to notice that a couple of their servers bounce fairly often, which led me to look for ways to raise the timeout... which led me to discover that I couldn't get to http://pydns.sourceforge.net/ at all. (This turned out to just be a random sourceforge outage.)
The timeout change was easy enough (just add
timeout=90 to the
DNS.Request constructor...) but all it showed was that no, in fact, it
wasn't slow - if it didn't answer promptly, it wasn't going to answer
at all, so changing the timeout doesn't help.
This does suggest implementing another category of combined test - while in many cases we care if any members of a redundancy group fail, in a case like this we only really want to hear about it if 2/3 or 3/3 are out.
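A rough sketch of what such a threshold test might look like. The function name group_status and the max_failures parameter are invented here; the idea is just to tolerate one failure in a group of three and only complain beyond that:

```python
def group_status(member_ok, max_failures=1):
    """Summarize a redundancy group.

    member_ok maps a member name to True (answered) or False (failed).
    Returns (ok, failed_names); ok is False only when more than
    max_failures members are out.
    """
    failed = sorted(name for name, ok in member_ok.items() if not ok)
    return (len(failed) <= max_failures, failed)
```

With max_failures=1 and three DNS servers, a single bouncing server stays quiet, while 2/3 or 3/3 out gets reported along with the names of the failed members.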
Noted today that
probe_ntp doesn't work too well if
installed. (I should either start packaging nagaina instead of
checking it out on the server, or implement an is-package-installed
probe to check for the dependencies :-)
Karl points out that "having infrastructure so the probes can be clever about reporting subsidiary programs would be good", to which I responded that one of the "soon" features is test dependencies ("don't bother with Y if X failed") and it would fit well in that. (This particular test did report "file not found" on the ntp invocation; I just thought it was a path problem until I looked closer.)
Also, after this morning, I want a nagaina probe for "excess tire wear"...
run_probes in cron is a reasonable first-cut, but as it gets
longer, it doesn't work out. Strictly speaking, cron is the wrong
model; don't want to say "run every N minutes", do want to say "delay
between runs". Might possibly want to say "maximum time to run" but
that could be done directly in the tool. The concern is noticing if
it ever stops running and restarting it.
This comes up enough that it surprises me that it isn't part of a standard tool - but that's true of lots of things lately. (Rather, keeping something running is the traditional job of init - but there isn't really a good user-level configuration for it.) Further digging turns up daemontools, which is unlicensable, and runit, which seems too focused on serving as the primary init.
So, nagaina now includes
run_forever, and an
@reboot crontab entry.
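The "delay between runs" model is small enough to sketch. This is not the actual run_forever from nagaina, just an illustration of the shape; the run_once callable, the sleep parameter, and max_runs (there only so the loop can be exercised without running forever) are all assumptions:

```python
import time


def run_forever(run_once, delay=300, sleep=time.sleep, max_runs=None):
    """Call run_once repeatedly, waiting `delay` seconds after each run ends.

    Unlike cron's "every N minutes", the next run never starts until the
    previous one has finished and the delay has passed.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            run_once()
        except Exception as exc:
            # One bad pass should not kill the loop; report and carry on.
            print("run failed: %r" % (exc,))
        runs += 1
        sleep(delay)
    return runs
```

The @reboot crontab entry then only has to start this loop once at boot; it does not by itself cover the case where the loop dies later.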
Just discovered the hard way that my domain anniversary is today :-}
It's a reminder that I should make my old "checkwhois" a nagaina test. However, that's really a fallback - really something should let me schedule the updates a month in advance... except that I've never really used a todo system for that long...
Also probing whois will be expensive, or rather shouldn't be done every five minutes - so that means adding a way to do explicitly cached probes, which means having state (which so far this doesn't...)
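One possible shape for an explicitly cached probe, assuming the state lives in a small JSON file. The cached_probe name, the file layout, and the now parameter are invented for this sketch, and the probe result has to be JSON-serializable:

```python
import json
import os
import time


def cached_probe(name, probe, max_age, cache_path, now=time.time):
    """Run probe() at most once per max_age seconds.

    Results are kept in a JSON file at cache_path, keyed by probe name,
    so the state survives between separate runs.
    """
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
    entry = cache.get(name)
    if entry is not None and now() - entry["when"] < max_age:
        return entry["result"]  # still fresh, skip the expensive probe
    result = probe()
    cache[name] = {"when": now(), "result": result}
    with open(cache_path, "w") as f:
        json.dump(cache, f)
    return result
```

A whois check could then use a max_age of a day while everything else keeps its five-minute cycle.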
zephyr_server; should add
jabber_server once I'm running one.
[done] Having handed off a version to kcr, actually arrange some release content here:
* index.html
* list the tarballs
* rss feed generated from the VERSION file
* tarball as attachment?
* link on interesting-stuff page
Ok, the trivial notifier is done too, with zephyr. An afs-restart related outage showed that
probe_afs_anyserver output needs to be split up and rendered into multiple lines somehow.
For now, we've got the simplest thing:
Once that's in place, what we really need next isn't prettier html, but notification. We've got zephyr-via-mit now, but should really add jabber, possibly locally. This means the notifier should be pretty transport-agnostic.
The simplest thing would be to run a separate notifier of change after each run_probes, and send the delta. Possibly also send the baseline once an hour if there are any failures... parsing the html is a lazy way to do this.
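A sketch of the delta step, assuming each run can be reduced to a dictionary of probe name to status string (that representation, and the status_delta name, are assumptions; nothing here parses the html). The output is plain text lines, which keeps the notifier independent of whether they go out over zephyr or jabber:

```python
def status_delta(previous, current):
    """Compare two {probe_name: status} dicts and describe what changed.

    Probes that appear or disappear between runs show up with None on
    the missing side.
    """
    changes = []
    for name in sorted(set(previous) | set(current)):
        old = previous.get(name)
        new = current.get(name)
        if old != new:
            changes.append("%s: %s -> %s" % (name, old, new))
    return changes
```

An empty list means there is nothing to send for this run; the hourly baseline would be a separate pass over the current dict looking for failures.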
So we have parity with the existing thok.org nagios configuration, purely in terms of things tested, with effectively no rendering at all. So the next steps:
--html generating a .html file with a refresh line and not much else
new things we need:
* combining named sets, so we can run anyserver on all of fileservers and dbservers once each; likewise, loginservers from everything
* some kind of "if not, then don't" override, i.e. ping host failing -> don't run ssh host
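Combining named sets is mostly a set union, which also takes care of the "once each" part when a host is in more than one group. A small sketch, with made-up host names and a hypothetical combine helper:

```python
# Hypothetical named sets; the host names are placeholders.
named_sets = {
    "fileservers": {"fs1", "fs2"},
    "dbservers": {"fs2", "db1"},
}


def combine(sets, *names):
    """Union the named sets so each host appears exactly once."""
    combined = set()
    for name in names:
        combined |= sets[name]
    return sorted(combined)
```

Here combine(named_sets, "fileservers", "dbservers") would list fs2 only once even though it is both a fileserver and a dbserver.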
with afs set up and porting some more ping tests we could be done, as far as duplication goes.
of course, we also need to implement a
/proc/mdstat check, of some sort...
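As a starting point, a check along these lines could look for arrays with a missing member in the text of /proc/mdstat (the Linux software RAID status file). The function takes the file contents as a string so it can be tried without a RAID setup; the function name and the two patterns are a sketch based on the usual "[2/2] [UU]" status format, not a complete parser:

```python
import re

ARRAY_RE = re.compile(r"^(md\d+) : ")
STATE_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return the names of md arrays that report a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        m = ARRAY_RE.match(line)
        if m:
            current = m.group(1)
            continue
        s = STATE_RE.search(line)
        if s and current is not None:
            # "[2/1] [U_]" means two members expected, one active.
            if "_" in s.group(3) or s.group(1) != s.group(2):
                degraded.append(current)
            current = None
    return degraded
```

A probe would read /proc/mdstat, pass the text in, and fail if the returned list is non-empty.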
Since nagios-plugins are separate, support them. First, though, just
do some basic ones directly in python, given how powerful
Also - don't need cherrypy directly yet - just have a report page, and
run out of cron (or some equivalent loop-runner.) So start with
"generate one report", add
--html, then add continuous-runner...
Run as a non-privileged user, as well.
Some basic tricks: