Most recently, switched to conkeror+firefox, and now hit c-c c-n in the web browser to get the next comic. The same queue is fed by a one-liner in ~/.snownews/browser:
so urls from that are queued up the same way (see ../firefox.)
Still should do some kind of auto-training; the simplest algorithm is to fetch the whole page, then fetch it again small-T (an hour? 5m?) later, see what changed in between, and ignore those parts from then on. Keep a record (vector?) of the whittling, and use that instead of check_maybe. Possibly pick a threshold of change and display it to me - and let me vote on same vs. different...
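A rough sketch of that auto-training idea, not something the script does today: fetch the page twice a short time apart, keep only the lines that survived both fetches, and later call the page "changed" only when one of those surviving lines goes missing. The function names and the line-based comparison are assumptions for illustration; a real version would probably want the threshold and the voting step too.

```python
import difflib

def stable_lines(first, second):
    """Lines common to two fetches taken a short time apart.

    Anything that differs between the fetches (rotating ads, timestamps)
    is whittled away and never recorded.
    """
    a, b = first.splitlines(), second.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    stable = []
    for block in matcher.get_matching_blocks():
        stable.extend(a[block.a:block.a + block.size])
    return stable

def has_changed(page, stable):
    """True if any line recorded as stable is missing from a later fetch."""
    present = set(page.splitlines())
    return any(line not in present for line in stable)
```

This only notices stable lines disappearing, so a page that merely gains new content would slip through; good enough to show the shape of the idea.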
Inspired by a discussion (on raw?) it occurred to me that running a single "common" aggregator might actually be useful for me and be sharable, saving polling bandwidth.
User interface: a "next comic" bookmarkable link. You hit it, it authenticates you and finds your set, and redirects you to the next available comic. You might have different links (cgi arguments) for subsets or priorities, maybe even for triggers (or at the very least, the "no more" page can have some features, like add, rearrange, "unread", force another scan, etc.)
Then the question is - should similar processing be used as an rss inflow mechanism... it'd be a good place to put the categorizer...
curl http://news.bbc.co.uk/2/hi/uk_news/magazine/default.stm | sgrep 'attvalue("") in attribute("HREF") in (elements parenting "10 things")' is a URL I'd like to fetch when it changes, which should be weekly.
Some fixes from kcr that deal with running this for the first time; some new comics; some quoting horror that I should work around (or get the upstream site to fix; '' are not valid attribute quotes...)
Switching to FancyURLopener had the unexpected side effect of not raising an exception for 304 anymore; worked around that, also treat etag-not-changed as a query, just like any other not-matched case.
Implement Jarno Virtanen's etag/last-modified example, since that causes even less load on the target server, if it supports them. Switched the db entries to hashes - even though we have to explicitly read and write them, that fits the commit model well enough, and lets us add new verbs later.
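For reference, the conditional-fetch idea looks roughly like this in modern Python. This is a sketch with urllib.request, not the code in the script; the entry keys "etag" and "last-modified" and both function names are made up for the example.

```python
import urllib.request
import urllib.error

def build_conditional_request(url, entry):
    """Attach the validators saved from the previous fetch, if any."""
    req = urllib.request.Request(url)
    if entry.get("etag"):
        req.add_header("If-None-Match", entry["etag"])
    if entry.get("last-modified"):
        req.add_header("If-Modified-Since", entry["last-modified"])
    return req

def fetch_if_changed(url, entry):
    """Return the page body, or None if the server answers 304 Not Modified.

    On a full response, the entry hash is updated in place with the new
    validators so the caller can write it back to the db.
    """
    req = build_conditional_request(url, entry)
    try:
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
            entry["etag"] = resp.headers.get("ETag")
            entry["last-modified"] = resp.headers.get("Last-Modified")
            return body
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None  # nothing new; no body was transferred
        raise
```

Servers that send neither header just get a plain GET every time, so the fallback checks still have to exist.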
Also implemented "summary" verb which tells us that we have 57 entries, 26 have etags, 28 have last-modified times - so this change makes a difference on more than half of the comics sites...
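The summary verb amounts to a count over the db hashes. A minimal sketch, assuming each entry is a dict with optional "etag" and "last-modified" keys (those key names and the function name are guesses for illustration, not the script's actual layout):

```python
def summarize(db):
    """Count entries overall and how many carry each kind of validator."""
    entries = list(db.values())
    return {
        "entries": len(entries),
        "etag": sum(1 for e in entries if e.get("etag")),
        "last-modified": sum(1 for e in entries if e.get("last-modified")),
    }
```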
added KeyboardInterrupt test, to cleanly abort the right way. Upgraded to NNWLite 1.0.5fc1, but it still seems to only notice one change per batch.
added lastBuildDate, since NNWLite is still not noticing changes.
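Sketch of producing that element with the standard library's RFC 822 style date formatting. The helper name is hypothetical, and whether NNWLite actually keys off this field is not something this shows.

```python
from email.utils import formatdate

def last_build_date(timestamp=None):
    """Render a channel-level lastBuildDate element; defaults to now."""
    return "<lastBuildDate>%s</lastBuildDate>" % formatdate(timestamp, usegmt=True)
```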
add explicit IOError case since most get_content errors are of that form. add a check for the-gadgeteer.com changes, based on a regexp.
looks like having multiple identical links with different titles isn't enough for nnw to call them different - so now we actually parse the old items, filter them against themselves (as a migration step, but no reason not to keep it), and then have write_item nuke anything it is supplying. This automatically bounds the size of the feed, too.
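The filtering described above boils down to "one item per link, newest first". A sketch under the assumption that items are dicts with a "link" key; merge_items is an illustrative name, not the script's write_item:

```python
def merge_items(old_items, new_items):
    """Put new items first, then old ones, keeping one item per link.

    Old items are filtered against themselves and against anything the
    new batch supplies, so the feed never exceeds one item per comic link.
    """
    seen = set()
    merged = []
    for item in list(new_items) + list(old_items):
        if item["link"] in seen:
            continue  # an earlier (newer) item already covers this link
        seen.add(item["link"])
        merged.append(item)
    return merged
```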
51 out of 56 - but a typo ate the rss, so now it needs to use a temp file. (Turns out to be a trivial change to the rssfile constructor and destructor.) Also looks like we need separate times for "last changed" and "last seen". This leads to implementing a "fix" command to update the database format. After commenting out a couple of pages that don't actually lead to comics anymore, we're now up to 53/53, 36/53 going through check_maybe and the rest being (currently) singletons. Good enough for now...
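The temp-file trick, sketched as a standalone function rather than as the rssfile constructor/destructor pair: write the new feed next to the real one, and only rename it over the old file once the write has finished, so a crash mid-write leaves the previous feed intact. write_atomically is a made-up name for the example.

```python
import os
import tempfile

def write_atomically(path, text):
    """Write text to a temp file in the same directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(text)
        os.replace(tmp_path, path)  # swaps in the new file in one step
    except BaseException:
        # Something went wrong before the rename: drop the partial file.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Keeping the temp file in the same directory matters, since a rename across filesystems is not a single step.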
Dropped the rest-time to 6 hours. Implemented insert-at-start mode of dealing with an existing RSS file. Implemented a few more handlers, also just repointed a few start-urls to help urlbase work (47/57 now, and the remaining ones are kind of odd.)
4.5 hours later, it handles 38/55 comic sites, including one with frames (helen.) It may pick up a few more if keenspace gets unhosed tomorrow. The ad-hoc ones could probably be unified a bit more, a few of them could probably use a "this year" match string. pillars turned out to be more easily handled just by checking the title, and I suspect some others fall to that as well.
It should probably do an insert-and-prune on the rss file, so that I can run it at will.
comics page to rss feed - not to scrape the comic itself, just to handle notification of changes.
given a comic page url, pick a regexp to find the "this comic" link. If that match changes, generate an rss "item" pointing to the comic top level, or maybe the archive page if we have a way to do that.
have generic "process_keenspace" functionality.
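The regexp idea in these notes can be sketched as follows. Everything here is illustrative: check_comic, its arguments, and the bare-bones item string are assumptions, and the pattern is expected to have one capture group holding the "this comic" link.

```python
import re
from xml.sax.saxutils import escape

def check_comic(page_html, pattern, last_match, top_url, title):
    """Return (match_to_store, rss_item_or_None) for one comic page.

    An item pointing at the comic's top-level page is produced only when
    the "this comic" match differs from the one recorded last time.
    """
    found = re.search(pattern, page_html)
    if not found:
        return last_match, None  # pattern missed; keep the old state
    current = found.group(1)
    if current == last_match:
        return current, None  # same comic as before, nothing to announce
    item = "<item><title>%s</title><link>%s</link></item>" % (
        escape(title), escape(top_url))
    return current, item
```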