Web Comics Reader/Helper

Going from comic to comic is more tedious than it should be, especially with a browser groaning under the load of the ad networks that support the comics (more relevant in 2003 than today, of course). Also, why impose expectations of a schedule on the author, even unvoiced, when the computer can tell me when there's an actual update?

Eventually piperka came along, and it does a good job of this, shared among many users (so it saves much more time in aggregate than any single-user personal effort ever could). Some of the code is documented here.


comics: Sun Nov 27 00:16:00 2005

Most recently, switched to conkeror+firefox, and now hit c-c c-n in the web browser to get the next comic. The same queue is fed by a one-liner in ~/.snownews/browser:

curl -Hurl:\ %s -d "" http://localhost:3383/push_url

so URLs from that are queued up the same way (see ../firefox).
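For context, the receiving end doesn't have to be anything fancy. A minimal sketch of a listener for that push - the port and /push_url path come from the curl line above, but the queue-file location and format are assumptions, not the real code:

# Sketch only: accept POSTs to /push_url and append the "url" header
# value to a plain-text queue file (location assumed).
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

QUEUE_FILE = Path.home() / ".comics_queue"   # hypothetical location

class PushHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/push_url":
            self.send_error(404)
            return
        url = self.headers.get("url", "").strip()
        if url:
            with QUEUE_FILE.open("a") as f:
                f.write(url + "\n")
        self.send_response(204)   # queued; nothing to say back
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("localhost", 3383), PushHandler).serve_forever()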

Still should do some kind of auto-training; the simplest algorithm is to fetch the whole page, fetch it again some small T later (an hour? 5m?), see what changed, and ignore those parts from then on. Keep a record (a vector?) of that whittling, and use it instead of check_maybe. Possibly pick a threshold of change and display it to me - and let me vote on same vs. different...
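Roughly what that training step could look like, as an untested sketch - the function names, the line-based diffing, and the threshold are invented here, not the existing check_maybe logic:

import time
import urllib.request

def fetch_lines(url):
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace").splitlines()

def train(url, delay=300):
    # Fetch twice, `delay` seconds apart; whatever stayed identical is
    # the page's "stable signature", and everything else (ad rotators,
    # counters) gets ignored from then on.
    first = fetch_lines(url)
    time.sleep(delay)
    second = fetch_lines(url)
    return set(first) & set(second)

def changed(url, stable_signature, threshold=0.02):
    # Re-fetch and report what fraction of the stable signature has
    # disappeared; above the threshold we'd call it a real update
    # (and could still ask for a same/different vote).
    current = set(fetch_lines(url))
    missing = stable_signature - current
    fraction = len(missing) / max(len(stable_signature), 1)
    return fraction >= threshold, fraction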


comics: Tue Dec 16 23:09:00 2003

Inspired by a discussion (on raw?), it occurred to me that running a single "common" aggregator might actually be useful for me and be sharable, saving polling bandwidth.

User interface: a "next comic" bookmarkable link. You hit it, it authenticates you, finds your set, and redirects you to the next available comic. You might have different links (CGI arguments) for subsets or priorities, maybe even for triggers (or at the very least, the "no more" page can have some features, like add, rearrange, "unread", force another scan, etc.).
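A sketch of that redirect link - the token-based auth, the in-memory per-user queues, and the URL shape are all made up for illustration:

from urllib.parse import parse_qs
from wsgiref.simple_server import make_server

# Hypothetical per-user unread queues, keyed by an auth token.
QUEUES = {"secret-token": ["http://example.com/comic1",
                           "http://example.com/comic2"]}

def app(environ, start_response):
    qs = parse_qs(environ.get("QUERY_STRING", ""))
    token = qs.get("token", [""])[0]
    queue = QUEUES.get(token)
    if queue is None:
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"unknown user\n"]
    if not queue:
        # The "no more" page - where add/rearrange/rescan links would live.
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"no more comics; come back later\n"]
    next_url = queue.pop(0)
    start_response("302 Found", [("Location", next_url)])
    return [b""]

if __name__ == "__main__":
    make_server("localhost", 8000, app).serve_forever()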

Then the question is - should similar processing be used as an RSS inflow mechanism? It'd be a good place to put the categorizer...


comics: Sat Nov 22 21:44:00 2003

Other scraping:

curl http://news.bbc.co.uk/2/hi/uk_news/magazine/default.stm | sgrep 'attvalue("*") in attribute("HREF") in (elements parenting "10 things")'

extracts a URL I'd like to fetch when it changes, which should be weekly.
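One way to poll that, as a sketch: run the pipeline above, remember its last output, and only act when the extracted link changes (the state-file location is an assumption):

import subprocess
from pathlib import Path

STATE = Path.home() / ".ten_things_last"   # hypothetical state file
PIPELINE = ("curl http://news.bbc.co.uk/2/hi/uk_news/magazine/default.stm"
            " | sgrep 'attvalue(\"*\") in attribute(\"HREF\")"
            " in (elements parenting \"10 things\")'")

def check():
    # Run the same curl|sgrep pipeline and keep only its stdout.
    current = subprocess.run(PIPELINE, shell=True, capture_output=True,
                             text=True).stdout.strip()
    previous = STATE.read_text().strip() if STATE.exists() else ""
    if current and current != previous:
        STATE.write_text(current + "\n")
        return current    # a new URL worth fetching this week
    return None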


comics: Tue Nov 11 03:16:00 2003

Some fixes from kcr that deal with running this for the first time; some new comics; some quoting horror that I should work around (or get the upstream site to fix; '' are not valid attribute quotes...)


comics: Mon Oct 6 01:05:00 2003

Switching to FancyURLparser had the unexpected side effect of not raising an exception for 304 anymore; worked around that, and also treat etag-not-changed as a query, just like any other not-matched case.


comics: Sun Oct 5 18:02:00 2003

Implemented Jarno Virtanen's etag/last-modified example, since that causes even less load on the target server, if it supports them. Switched the db entries to hashes - even though we have to explicitly read and write them, that fits the commit model well enough, and lets us add new verbs later.
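The conditional-GET idea, sketched with the modern stdlib for illustration - the entry-dict keys mirror the hash change above, but the names are assumptions, not the actual code:

import urllib.error
import urllib.request

def check_conditional(url, entry):
    # entry is a per-site dict along the lines of
    # {"etag": ..., "last-modified": ...}; updated in place.
    req = urllib.request.Request(url)
    if entry.get("etag"):
        req.add_header("If-None-Match", entry["etag"])
    if entry.get("last-modified"):
        req.add_header("If-Modified-Since", entry["last-modified"])
    try:
        with urllib.request.urlopen(req) as resp:
            entry["etag"] = resp.headers.get("ETag")
            entry["last-modified"] = resp.headers.get("Last-Modified")
            return True, resp.read()
    except urllib.error.HTTPError as e:
        if e.code == 304:
            return False, None   # not modified; nothing to fetch
        raise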

Also implemented a "summary" verb, which tells us that of 57 entries, 26 have etags and 28 have last-modified times - so this change makes a difference on more than half of the comics sites...


comics: Sat Oct 4 00:23:00 2003

added KeyboardInterrupt test, to cleanly abort the right way. Upgraded to NNWLite 1.0.5fc1, but it still seems to only notice one change per batch.


comics: Sun Sep 21 03:39:00 2003

added lastBuildDate, since NNWLite is still not noticing changes.
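For reference, lastBuildDate takes an RFC 822-style date, which the standard library formats directly; something like this sketch (the surrounding channel template is assumed):

import email.utils

def last_build_date_element():
    # formatdate() with no arguments stamps "now", so every rebuild of
    # the feed gets a fresh value for readers to notice.
    return "<lastBuildDate>%s</lastBuildDate>" % email.utils.formatdate()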


comics: Sat Sep 20 18:57:00 2003

add explicit IOError case since most get_content errors are of that form. add a check for the-gadgeteer.com changes, based on a regexp.


comics: Thu Sep 18 16:33:00 2003

looks like having multiple identical links with different titles isn't enough for nnw to call them different - so now we actually parse the old items, filter them against themselves (as a migration step, but no reason not to keep it), and then have write_item nuke anything it is supplying. This automatically bounds the size of the feed, too.
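The dedup step boils down to something like this sketch (names invented here; not the real write_item):

MAX_ITEMS = 50   # assumed bound on feed size

def write_item(items, new_item):
    # items: list of dicts with at least a "link" key, newest first.
    # Drop any existing item pointing at the same link, prepend the new
    # one, and cap the length so the feed stays bounded.
    filtered = [it for it in items if it["link"] != new_item["link"]]
    return ([new_item] + filtered)[:MAX_ITEMS]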


comics: Tue Sep 16 23:19:00 2003

51 out of 56 - but a typo ate the rss, so now it needs to use a temp file. (Turns out to be a trivial change to the rssfile constructor and destructor.) Also looks like we need separate times for "last changed" and "last seen". This leads to implementing a "fix" command to update the database format. After commenting out a couple of pages that don't actually lead to comics anymore, we're now up to 53/53, with 36/53 going through check_maybe and the rest being (currently) singletons. Good enough for now...
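The temp-file fix amounts to writing somewhere scratch and renaming into place only once the write succeeded; a sketch, with the class name made up:

import os

class RssFile:
    def __init__(self, path):
        self.path = path
        self.tmp = path + ".tmp"
        self.fh = open(self.tmp, "w")

    def write(self, text):
        self.fh.write(text)

    def close(self):
        # Only now does the new feed replace the old one, so a crash
        # mid-run can't eat the existing rss again.
        self.fh.close()
        os.rename(self.tmp, self.path)   # atomic on POSIX filesystems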


comics: Tue Sep 16 15:11:00 2003

Dropped the rest-time to 6 hours. Implemented an insert-at-start mode for dealing with an existing RSS file. Implemented a few more handlers, and also just repointed a few start-urls to help urlbase work (47/57 now, and the remaining ones are kind of odd).


comics: Tue Sep 16 04:42:00 2003

4.5 hours later, it handles 38/55 comic sites, including one with frames (helen). It may pick up a few more if keenspace gets unhosed tomorrow. The ad-hoc ones could probably be unified a bit more; a few of them could probably use a "this year" match string. pillars turned out to be more easily handled just by checking the title, and I suspect some others fall to that as well.

It should probably do an insert-and-prune on the rss file, so that I can run it at will.


comics: Mon Sep 15 23:19:00 2003

comics page to rss feed - not to scrape the comic itself, just to handle notification of changes.

given a comic page url, pick a regexp to find the "this comic" link. If that match changes, generate an rss "item" pointing to the comic top level, or maybe the archive page if we have a way to do that.

have generic "process_keenspace" functionality.
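
Putting that together, a minimal sketch of the core check - the names, the state format, and the assumption that each per-site regexp captures the link in group 1 are mine, for illustration only:

import re
import urllib.request

def check_comic(url, pattern, state):
    # state maps url -> the last matched "this comic" link.
    with urllib.request.urlopen(url) as resp:
        page = resp.read().decode("utf-8", errors="replace")
    m = re.search(pattern, page)
    if not m:
        return None    # no match at all: the handler needs fixing
    current = m.group(1)
    if current != state.get(url):
        state[url] = current
        # A real version would append an rss item to the feed; here we
        # just return what that item would point at.
        return {"title": "Updated: " + url, "link": url}
    return None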