Codes well with others.

Picked up a preference for tools I've coded myself, even when they're nowhere near the best choice; it's time to explicitly put some time in learning the great wealth of shared tools out there, and aim a little higher...


print, logging, upstart

2013-02-09 02:59:05 -0500

Originally posted on tumblr

Decided recently that using print directly instead of python logging was a minor instance of "coding when I should be using"; likewise daemonizing/restarting code shouldn't be done by hand either, so I finally figured out how to actually use upstart "user configs". (Together they made a good set of things to experiment with while the net was offline due to the northeast blizzard.)

Turns out that both of them were mostly straightforward, except for one impossible/undocumented step. Even on ubuntu, it's impossible to start user scripts explicitly (though you can trigger them on events, which almost counts) because these lines are missing from /etc/dbus-1/system.d/Upstart.conf:

    <allow send_destination="com.ubuntu.Upstart"
       send_type="method_call" send_member="Start" />
    <allow send_destination="com.ubuntu.Upstart"
       send_type="method_call" send_member="Stop" />

(Similar lines are needed for restart/reload.) I haven't studied the security implications (they don't matter for the machines I'm using this on, but they might matter for your machine) since I haven't checked if these lines are missing from Quantal, and if they are, what bug report I should file. (Filing bug reports instead of kludging around things is another kind of long term vs. short term tradeoff, but that's the subject for a later "codes well with others" post.)

Python logging has a similar "missing chunk of magic". To be fair, if all you're trying to do is code more cleanly, and have easy to use instrumentation for later, logging.basicConfig and changing your print calls to logging.warn with explicit format strings is enough. However, much of what I write should run as services (under upstart, see above) and should actually syslog... and that's supposed to be easy to do with logging. In the past, I've just used syslog directly, and now that I dig into it, I see why: in order to use logging with syslog, you need

    syslog_tag = "%s[%d]: " % (os.path.basename(sys.argv[0]), os.getpid())
    syshandler = SysLogHandler(address="/dev/log")
    sysformatter = logging.Formatter(syslog_tag + "%(message)s")

Turns out that since SysLogHandler doesn't use openlog (because when it logging was written, the syslog module apparently wasn't threadsafe) so you have to cook up your own process name/pid tagging. Then, even though /dev/log is pretty much the only way to access syslog in the past decade, the default is still UDP to port 514, so you have to override that as well. Once you drop in this 6-line blob, you can use logging in the normal fashion - figuring out a way to folded into one helper that gets pushed upstream would be nice.

New Years' Python Meme (2012-2013)

2012-12-25 23:52:37 -0500

Originally posted on tumblr

(inspired by who credits )

ipython-notebook. Of course, it's the second or third time I've "discovered" it, but it's gotten very sophisticated.

Pushed some limits for mock testing on Unix - a python script that creates Subversion and Git repositories and substitutes calls to sendmail, for example. Also spent a lot of quality time with boto...

A couple of patches to kivy and pypi-install - I've been doing a lot more bespoke automation and release engineering that are at the top of a tall pyramid of open source that actually works, so there have been fewer things for me to dive in and debug...

Reddit r/python and Python Planet - and anything that comes by on twitter; I'm not a hugely faithful blog reader this decade, but twitter is a great circulator of ideas...

A low-overhead mobile (Android, but Firefox OS would be tempting too...) native python app development environment.

Using nltk on the AI-Class mini-shredder problem

2012-01-03 00:13:10 -0500

Originally posted on tumblr

The Free On-line Stanford AI Class didn't have the programming problems that the in-person class did, due to lack of grading resources - but we did get a simple, optional, mini shredder challenge where it was suggested that we use NLP techniques to reassemble 19 vertical (2 character wide) slices of a paragraph back together in the right order. The popular approach was to simply produce some letter statistics (usually bigram) on english words, and then glue stripes together picking the addition that improves the score most.

Most of the approaches discussed afterwards on the aiclass Reddit involved getting a big chunk correct using machine techniques and then fixing up the last chunks by hand; my personal approach found "Claude" and "Shannon" recognizably, leaving very little hand-work. (Note that the actual shredder challenge was won by a team using a hybrid computer-vision/nlp/human approach, so one could argue that this is a state-of-the-art method.)

In my example, I grabbed /usr/share/dict/words and chewed it into statistics by hand; not very sentence like, but by pairing spaces with initial and terminal letters I got good word-assembly out of it. Later, I figured that a good example of "coding well with others" would be to actually use the widely acclaimed NLTK Natural Language Toolkit, especially if that showed that the additional effort of understanding an existing solution would have some measurable benefit towards executing that solution.


The first approach was to replace my dictionary extractor:

ascii_words = [word.lower().strip() for word in file("/usr/share/dict/words")
                    if word.strip() and not (set(word.strip()) - set(string.letters))]

with code that pulled the words from the Project Gutenberg sample:

import nltk"gutenberg")
ascii_words = nltk.corpus.gutenberg.words()

Then I reran the code (as a benchmark to see how much of a difference having "realistic" words-in-sentences probabilities would make.) Inconveniently (but educationally), it made so much of a difference that the first run produced a perfect reassembly of the 19 slices, with no need for further human work (either on the output data or on the code...) Still, there are a bunch of ad-hoc pieces of code that could be replaced by pre-existing functions - which should in theory be more robust, and perhaps also easier to share with others who are familiar with NLTK without forcing them to unravel the ad-hoc code directly.

Bigrams and Trigrams

A thread on nltk-users points out the master identifier index where one can find a bunch of tools that relate to word-pair bigrams, but also the basic nltk.util.bigrams that we're looking for. In fact, our existing code was a somewhat repetitive use of re.findall:

for ascii_word in ascii_words:
    for bi in re.findall("..", ascii_word):
        in_word_bigrams[bi] += 1
    for bi in re.findall("..", ascii_word[1:]):
        in_word_bigrams[bi] += 1

which turns into

for ascii_word in ascii_words:
    for bi in nltk.util.bigrams(ascii_word):
        in_word_bigrams["".join(bi)] += 1

(in_word_bigrams is a defaultdict(int); the join is necessary to keep this a piecewise-local change - otherwise I'd have to look at the later scoring code which uses substrings and change it to use tuples.)

The use of nltk.util.trigrams is nearly identical, replacing three more re.findall loops.


It turns out we can go another step (perhaps a questionable one) and replace the defaultdict above with nltk.collocations.FreqDist() giving this:

in_word_bigrams = nltk.collocations.FreqDist()
for ascii_word in ascii_words:
    for bi in nltk.util.bigrams(ascii_word):"".join(bi))

In practice, most Python developers should be more familiar with defaultdict (those with long term experience might even have a bit of jealousy about it, as it wasn't introduced until Python 2.5.) One benefit of using FreqDist is that it includes a useful diagnostic method, .tabulate, which lets us confirm that the statistics are what we expect from english-language text:

print "In Word Bigrams:"

In Word Bigrams:
  th   he   an   in   er   nd   re   ha   ou   at   en   hi   on   of   or
310732 284501 153567 147211 140285 134364 112683 103110 95761 88376 84756 82225 81422 77028 76968

and similarly for the other tables. Even better, adding one more line:


uses matplotlib to show us a nice graph that again confirms our suspicion that bigram statistics have a tail, without having to get distracted by the mechanics of doing the plotting ourselves - some past nltk developer decided that frequency distributions were sometimes worth visualizing, and we get to take advantage of that.


Since the actual task involves coming up with the probability that a given partial combination of strips of text is in english, as the completely unscrambled one will be - bigram frequency counts aren't directly useful, we need to normalize them into probabilities. Again, this isn't much Python code:

bi_total_weight = sum(all_bigrams.values())
norm_bigrams = dict(((k, v/bi_total_weight) for k,v in all_bigrams.items()))

but the nltk.probability module has a class that performs this conversion directly:

norm_bigrams_prob = nltk.probability.DictionaryProbDist(all_bigrams, normalize=True)

A quick loop to print out the top ten from both shows that

There is an interface difference; instead of dictionary lookup, DictionaryProbDist provides .prob() and .logprob() methods. Nothing else that leaps to the eye as helpful to this particular problem - but it's useful to notice the .generate() method, which is proportional sampling with replacement - which is basically the core of the Particle Filter algorithm.

Laplace Smoothing

For the final step, we need to apply our bigram (and trigram) probabilities to our partially assembled bits of text, so that we can greedily pick the combinations that are approaching english the most quickly. However, we have bigrams that appear in the target that don't appear in the model corpus; as explained in Unit 5, Section 21, if you assign those missing values a probability of zero, they falsely dominate the entire result. (If you use the model that way anyhow, it does in fact come up with a poor combination of strips.)

In the original implementation, I assigned an arbitrary and non-rigorous "10% of a single sample" value, which was at least effective:

for i in range(len(word) - 1):
    # go back and do the smoothing right!
    s *= bigrams.get(word[i:i+2], 0.1/bi_total_weight)
return math.log(s)

It turns out that nltk.probability has some more useful surprises in store - we can simply use LaplaceProbDist directly, since k=1 is sufficient. It turns out that the general form is also called Lidstone smoothing so we could use LidstoneProbDist(gamma=k) if we needed to experiment further with smoothing; in the mean time,

norm_bigrams = nltk.probability.LaplaceProbDist(all_bigrams)
for i in range(len(word) - 1):
    s += bigrams.logprob(word[i:i+2])
return s

simply (and convincingly) "multiplies" probabilities (adding logs, just to get numbers that are easier to handle - that is, easier to read in debugging output.)

Final Output

One interesting result of looking at the logs is that while the chosen strips are typically 2**15 to 2**90 more likely than the second best choice, once we've assembled "Claude Shannon founded" it jumps out that the best next choice (by a factor of "merely" 5) is actually to (incorrectly) add a strip on the left... so at least with the gutenberg corpus, the assemble-to-the-right bias actually helps reach the desired answer, so there's more room to improve the method. The numbers also tell us that it's a path problem, not a result problem - the evaluation function gives a 2**94 higher probability for the correct result than for the (still slightly-scrambled) result that the symmetric (but still greedy) version produces.


While it was certainly faster to code this up in raw Python the first time around - note that this was a single instance problem, with new but thoroughly explained material, where the code itself was not judged and would not be reused. Once you add any level of reuse - such as a second problem set on the same topic, other related NLP work, or even vaguely related probability work - or any level of additional complexity to the original problem, such as a result that couldn't be judged "by eye" - digging in to NLTK pays off rapidly, with dividends like

The last one is perhaps the most relevant benefit, because even though the code may perform identically, and the line counts are the same, the version that uses the NLTK-supplied abstractions should communicate more clearly to the next human reader of the code - an important "coding well with others" principle.