nagaina: Tue Nov 7 17:01:00 2006

The point of probing like this is to notice detail-level system failures directly, instead of getting reports of abstract feature failures and having to diagnose them "downward" from the complaint. As such, it makes sense to think about where user complaints come from; this led me to monitor the DNS servers for the network provider my mom uses - so if she calls with the (reasonable, end user) report that "email doesn't work" I can say right away that "yeah, your ISP has been having trouble, nothing you need to do, if they don't fix it soon bug them directly" instead of even looking for a local problem on her system.

This led me to notice that a couple of their servers bounce fairly often, which led me to look for ways to raise the timeout... which led me to discover that I couldn't get to http://pydns.sourceforge.net/ at all. (This turned out to just be a random sourceforge outage.)

The timeout change was easy enough (just add timeout=90 to the DNS.Request constructor...) but all it showed was that no, in fact, it wasn't slow - if it didn't answer promptly, it wasn't going to answer at all, so raising the timeout doesn't help.

This does suggest implementing another category of combined test - while in many cases we care if any member of a redundancy group fails, in a case like this we only really want to hear about it if 2/3 or 3/3 are out.
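
A rough sketch of what such a combined test might look like, assuming each probe already reduces to a simple up/down result per server. The function names, the min_failures parameter, and the ns1/ns2/ns3 labels are all made up for illustration; none of them come from the existing code.

```python
# Hypothetical sketch of a "redundancy group" test: alert only when
# enough members of the group are down at the same time.

def failed_members(results):
    """Return the sorted names of group members whose probe failed.

    `results` maps a member name to True (probe succeeded) or
    False (probe failed).
    """
    return sorted(name for name, ok in results.items() if not ok)


def group_should_alert(results, min_failures=1):
    """Return True if at least `min_failures` members are down.

    min_failures=1 is the usual "tell me about any failure" case;
    for three redundant DNS servers, min_failures=2 only reports
    when 2/3 or 3/3 are out.
    """
    return len(failed_members(results)) >= min_failures


if __name__ == "__main__":
    # Placeholder probe results for three redundant servers.
    probes = {"ns1": True, "ns2": False, "ns3": True}
    if group_should_alert(probes, min_failures=2):
        print("DNS group degraded:", ", ".join(failed_members(probes)))
    else:
        print("DNS group OK (down: %s)" % (failed_members(probes) or "none"))
```

Keeping the threshold as a per-group parameter means the strict "any failure" groups and the tolerant "most of them are out" groups can share the same code path.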