
Store your AWS keys in your Keychain.


So you're on your MacBook, and you want to run some AWS utility, or reference your AWS keys in your code. Of course, you could wire them into the environment, with something like:

export AWS_ACCESS_KEY_ID=my-aws-key
export AWS_SECRET_ACCESS_KEY=my-super-secret-key



But you're security-conscious, and you don't want to do that. Enter the power of the macOS Keychain: you can run a command to look up the keys from the Keychain.

First, add both the AWS key and the AWS secret to the keychain, as Passwords:


  • For the AWS key, use "AWS" as the name, "AWS_KEY" as the account, and put the key in the password.

  • For the AWS secret key, use "AWS" as the name, "AWS_SECRET_KEY" as the account, and put the secret key in the password.

Now define an alias in your .bash_profile:


alias with_aws='env AWS_ACCESS_KEY_ID=$(security find-generic-password -a AWS_KEY -w) AWS_SECRET_ACCESS_KEY=$(security find-generic-password -a AWS_SECRET_KEY -w) bash -c'

This alias lets you run any bash command in a subshell with those environment variables set, and when the command ends, the subshell exits and the values are forgotten.

To just run a subshell with the environment variables set, use with_aws bash.


If you'd rather have the variables defined for any bash session (which is not that secure, but it's your call), then add this to your .bash_profile:

export AWS_ACCESS_KEY_ID=$(security find-generic-password -a AWS_KEY -w)
export AWS_SECRET_ACCESS_KEY=$(security find-generic-password -a AWS_SECRET_KEY -w)
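If you'd rather read the keys from Python directly, the same security call can be wrapped in a small helper. This is a sketch (macOS only; the helper names are mine, not a standard API):

```python
import subprocess

def keychain_command(account):
    # Build the `security` invocation for a given Keychain account name
    return ["security", "find-generic-password", "-a", account, "-w"]

def keychain_password(account):
    # macOS only: run `security` and return the stored password
    return subprocess.check_output(keychain_command(account)).decode().strip()

# e.g. os.environ["AWS_ACCESS_KEY_ID"] = keychain_password("AWS_KEY")
```

As with the alias, nothing secret ever lands in a dotfile; the keys live only in the Keychain and in the process environment.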

More Grimacing


Following on from my previous post about grimace, a Python package that allows generation of regular expressions using fluent syntax, I've added some more tricks.

In the previous post, I showed how one can use grimace to write regular expressions in this form:


RE().any_number_of().digits().followed_by().dot().then().at_least_one().digit()


All good, but do we really need to invoke each method and have all those parentheses cluttering up the line? Because grimace is written using functional principles, each method returns a new RE object, based on the previous one but modified. So, in theory, each step must be a method call, because there's work to be done, not just an attribute to be returned.

Python descriptors will save us. A descriptor is a class that defines one or more of the __get__, __set__ or __delete__ methods. Like properties in C#, they allow code to be executed when a descriptor is accessed as though it were an attribute.

Previously in grimace, we had methods like:

def start(self):
    """Return a new RE with the start anchor '^' appended to the element list"""
    return RE(self, '^')


Now we can define an Extender class and use it to rewrite start().

class Extender(object):
    """An Extender is a descriptor intended for use with RE objects, whose __get__ method returns a
    new RE based on the RE on which it was invoked, but with the elements extended with a given
    element, set when the Extender is initialized. It exists so that methods like start() and end()
    can be invoked as attributes or methods."""
    def __init__(self, element=None):
        self.element = element

    def __get__(self, instance, owner):
        if isinstance(instance, RE):
            if self.element is not None:
                return RE(instance, self.element)
            else:
                return RE(instance)
        return None

#... and here's how start() is now defined in the RE class...
class RE(object):
    start = Extender('^')


So far so good. We can now use grimace as in this example:

RE().any_number_of.digits.followed_by.dot.then.at_least_one.digit


But what happens if a poor programmer gets confused and invokes digits or dot as methods? We can fix this easily. The result of getting digits or dot or any other Extender is a new RE object, so executing digits() will get that new RE and then try to call it. All we need to do is make an RE instance callable, by adding this __call__ method:


def __call__(self): return self


That's all we need. The execution of digits() becomes: get the digits attribute, which is a new RE instance, then call it, which returns itself.
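Pulling these pieces together, here's a minimal, self-contained sketch of the trick (a toy version, not the real grimace source; elements are held in a simple list):

```python
class Extender:
    """Descriptor: accessing the attribute returns a new, extended RE."""
    def __init__(self, element):
        self.element = element

    def __get__(self, instance, owner):
        if instance is None:
            return self
        return RE(instance.elements + [self.element])

class RE:
    start = Extender('^')
    end = Extender('$')

    def __init__(self, elements=None):
        self.elements = elements or []

    def __call__(self):
        # Calling an RE is harmless: it just returns itself
        return self

    def as_string(self):
        return ''.join(self.elements)

# Attribute style and method style both work:
print(RE().start.end.as_string())      # ^$
print(RE().start().end().as_string())  # ^$
```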

There is one more nicely functional trick. We can use an Extender to append more than just strings; we can also use it to append special objects, like the grimace Repeater that defines how many repetitions of a character are required. Because all objects in grimace are functional, they're immutable, and that means any instance may be shared between any number of threads or other bits of code. So I can write the not_a method (which adds a Not instance to the regular expression, to invert the meaning of the next match) as:


not_a = Extender(Not())


That instance of Not() is then shared between every RE() object that uses it. Functional programming is an excellent tool.

(The documentation for grimace is now in the github wiki page)

Expressing Oneself, Fluently


I use regular expressions[1] in Python often enough that I know many of the character classes and syntax tricks.

I use regular expressions in Python seldom enough that I forget many of the character classes and syntax tricks.

This is annoying, but I have lived with it. Then I came across a little article in Dr Dobb's Journal (which used to be great, many years ago, but is now a mere ghost of its former glory), in which Al Williams writes about (ab)using the C/C++ compiler to allow regular expressions to be written as:

start + space + zero_or_more + any_of("ABC") + literal(":") + group(digit + one_or_more)

rather than

^\s*[ABC]:(\d+)

I quite like the idea of fluent syntax (though I appreciate that it's not necessarily so appealing if your native language isn't English), but I spend marginally more time writing Python than C++ these days. Also, I like the idea of trying to build a fluent interface in a functional way. So, I started out by writing some examples of what I would want to be able to do:

start().end() would give the minimal empty string regex "^$". Easy enough.

Or how about

any_number_of().digits().followed_by().dot().then().at_least_one().digit()

You get the general idea: the fluent syntax describes the expression and results in a string that matches what it describes. One really good thing is that it avoids the "backslash plague" that can confuse those new to writing regular expressions in Python.

This now exists, and is on github at https://github.com/benlast/grimace and it will do the above and more. Time for some more intense code examples:

The grimace.RE object is our starting point; any method we call on it returns a new RE object. Let's get the regex to match an empty string.

>>> from grimace import RE
>>> print RE().start().end().as_string()
^$

The as_string() call turns the generated expression into a string that can then be used as the argument to the standard Python re module. There's also as_re() which will compile the regular expression for you and return the resulting pattern-matching object.


>>> #Extract the extension of a short DOS/FAT32 filename, using a group to capture that part of the string
>>> regex = RE().start().up_to(8).alphanumerics().dot().group(name="ext").up_to(3).alphanumerics().end_group().end().as_string()
>>> print regex
^\w{0,8}\.(?P<ext>\w{0,3})$
>>> #Use the re module to compile the expression and try a match
>>> import re
>>> pattern = re.compile(regex)
>>> pattern.match("abcd.exe").group("ext")
'exe'
>>>
>>> #We can do that even more fluently...
>>> RE().start().group(name="filename").up_to(8).alphanumerics().end_group() \
... .dot().group(name="ext").up_to(3).alphanumerics().end_group() \
... .end().as_re().match("xyz123.doc").groups()
('xyz123', 'doc')
>>>
>>> #The cool example that I wrote out as a use case
>>> print RE().any_number_of().digits().followed_by().dot().then().at_least_one().digit().as_string()
\d*\.\d+
>>>


You can also split a complex regex over several lines:

# My python module
import re
from grimace import RE

def is_legal_number(number):
    #Match a US/Canadian phone number - we put the RE() stuff in parentheses so that we don't
    #have to escape the ends of lines
    north_american_number_re = (RE().start()
      .literal('(').followed_by().exactly(3).digits().then().literal(')')
      .then().one().literal("-").then().exactly(3).digits()
      .then().one().dash().followed_by().exactly(4).digits().then().end()
      .as_string())
    number_re = re.compile(north_american_number_re)
    return number_re.match(number) is not None


There is more to do: control over greedy matching, and ways to express some of the more complex tricks like backreferences and multiple matching subexpressions. And I'll also package it properly for installation via pip. But for now, it's available and it works.

(The documentation for grimace is now in the github wiki page)

[1] I can't write any article on regular expressions without quoting Jamie Zawinski: Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Memento Mori


A discussion at work took place about design patterns, and it prompted me to go back and look at something. One of the behavioural patterns is Memento, which (from Wikipedia) "provides the ability to restore an object to its previous state (undo)". The implementation described uses three objects: a caretaker, an originator and a memento object that the caretaker can use to restore the originator to a previous state.

From a functional programming viewpoint, I find this overly complex. Consider a system in which an object follows the functional principle of immutability. To "change state", a caller calls a method on that object, but the original object is unchanged: the method returns a new object with a different state. There's then no need to have the complexity of undo - the previous state is available via the original instance.

Of course, this is already the case with immutable strings in many languages. Consider this C#:

    string s1="This String is Mixed Case";
    string s2=s1.ToLower();
    //Oops - we need to undo the ToLower() operation.
    s2=s1;  //This is effectively an "undo"
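The same shape works for richer objects than strings. A Python sketch (a hypothetical Document class; frozen dataclasses give us the immutability for free):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Document:
    title: str
    body: str

    def retitle(self, new_title):
        # Returns a new Document; the original is untouched
        return replace(self, title=new_title)

d1 = Document("Draft", "Hello, world")
d2 = d1.retitle("Final")
# Undo is just keeping the old reference - no caretaker, no memento object
assert d1.title == "Draft"
assert d2.body == d1.body
```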


The only principle to keep in mind is to save a reference to a previous version of the object if there is a potential need to undo it.

Benefits: huge saving of complexity, and therefore of testing. I find that the testing load is often forgotten when developers assess how much work is involved in an implementation, unless TDD is in the blood of the team.
Downside: potential extra use of memory to keep hold of previous objects. With intelligent sharing of expensive resources between objects, this can be minimal.

Scales And Patterns And Modes, Oh My!


I wrote this article a while back, for WholeNote.com, but I thought I'd revisit it. I've been doing some improvisation practice recently, and I wanted to refresh my memory on this quick way to find the scales for different modes.

This little chart gives you an easy fretboard pattern that'll help you quickly remember the scale for any of the modes listed below:

  • Min - natural minor

  • Dor - Dorian

  • Ion - Ionian (major)

  • Loc - Locrian

  • Mix - Mixolydian

  • Lyd - Lydian

  • Phr - Phrygian


 
|-----|-----|-----|-----|-----|-----|-----|
|-----|-----|-----|-----|-----|-----|-----|
|-----|-Min-|-Dor-|-----|-Ion-|-Loc-|-----|
|-----|-----|-Mix-|-----|-Lyd-|-Phr-|-----|
|-----|-----|--C--|-----|-----|-----|-----|
|-----|-----|-----|-----|-----|-----|-----|



First, you need to know the major (Ionian) scale patterns - you need to know how to play a major scale anywhere on the neck, given the root note. Take your time and come back when you know how to do that. It shouldn't take long: there's only one scale to learn.

Ready? Good. What the chart gives you is a way to find out which major scale contains the same notes as the mode you want to play. Confused? Let's try and explain it this way.

The Ionian mode (the major scale) starts at the root note and progresses up in steps as follows: tone, tone, semitone, tone, tone, tone, semitone. So the Ionian mode of C contains C (root), then a tone up to D, then a tone up to E, then a semitone to F, a tone to G, a tone to A, a tone to B and a final semitone to C again. So C Ionian contains the notes C, D, E, F, G, A, B.

Conveniently enough, those same notes also form the Aeolian (natural minor) scale for A, though you'd start with A as the root note, giving A, B, C, D, E, F, G. Those same notes also form the notes of the D Dorian scale: D, E, F, G, A, B, C. Get the idea? If you want to know the notes that make up a modal scale for a given note, they'll be the same as the major scale for another note.

For example, here are some more equivalents:

  • C Mixolydian has the same notes as F Ionian (major)

  • C Lydian has the same notes as G Ionian

  • C Phrygian has the same notes as G#/Ab Ionian

  • C Dorian has the same notes as A#/Bb Ionian

  • C Locrian has the same notes as C#/Db Ionian
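These equivalences are just modular arithmetic on semitones: each mode starts on a different degree of the major scale, so a fixed offset takes you from the mode's root to the root of the matching major scale. A quick Python sketch (the offset table and note spellings are my own choices):

```python
NOTES = ["C", "C#/Db", "D", "D#/Eb", "E", "F", "F#/Gb",
         "G", "G#/Ab", "A", "A#/Bb", "B"]

# Semitones up from a mode's root to the root of the major (Ionian)
# scale that contains the same notes
MODE_OFFSET = {"Ionian": 0, "Dorian": 10, "Phrygian": 8, "Lydian": 7,
               "Mixolydian": 5, "Aeolian": 3, "Locrian": 1}

def equivalent_major(root, mode):
    return NOTES[(NOTES.index(root) + MODE_OFFSET[mode]) % 12]

print(equivalent_major("C", "Mixolydian"))  # F
print(equivalent_major("D", "Lydian"))      # A
```

As a check, it reproduces the list above, and moving the root up two frets gives the D examples mentioned later.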

The chart gives you an easy way to remember these. See the C note? To find the scale for, say, the Mixolydian mode, you look to see where Mix is (it's on F, one string above). That tells you that the Mixolydian scale for C has the same notes as F major. So to play your groovy mixolydian improv, just play the notes in the F major scale.

Learn the pattern on the chart (it's easy, only two strings to think about). Remember the names (I use the order in which they occur if you played the notes ascending): "Mix, Lyd, Phryg, Dor, Ion, Loc". Or Mixolydian, Lydian, Phrygian, Dorian, Ionian, Locrian. Of course, the Ionian in there isn't that useful: it tells you that to play C Ionian, you play, er, C Ionian. I left it there because it makes the pattern easier to remember.

Of course, you can move the pattern up and down the fretboard. For example, if you do it based on D (two frets up from where it's shown), it'll tell you that D Mixolydian is G Ionian, D Lydian is A Ionian, etc.

For an extra trick, if you know the natural minor scale patterns, you can use those to give you alternate ways to play the modes. For this, you need to know how to find the relative minor for any major scale. For example, the relative minor of C is A (the Am scale has all the same notes as C). If you look at the chart, you can see the Min, which gives you a way to work out the relative minor: the chart shows that for C, the relative minor is Am.

So, if you want to play in, say C Dorian, you can play (look at the chart) either Bb Ionian, or its relative minor, which would be Gm. They both contain the same notes as C Dorian.

This plays a lot better than it reads.  If you truly want to get it, find yourself a backing track that's all on one note (if you have a keyboard, pick a note, hold the key down and let it play).  Then try different scales over the top and get used to how they sound.  Then you can go out and start playing improv jazz...

Immutability in C++


The guys who work with me know all too well that I have a bit of a thing for functional programming.  Given the slightest opportunity I can ramble on about the advantages of immutable value types and side-effect-free functions. But I find that too many introductions to FP focus on abstract problems like generating the Fibonacci series. Interesting, but I don't think too many developers face that challenge on an everyday basis.

So I thought I'd write about applying some FP principles to everyday languages, such as C++. I chose C++ because it's often considered amongst the least functional languages. Many of the same principles can apply in Java or C#, but we'll start here, and we'll start with immutable value types.

A type is immutable if its value (and for a class instance, that means the values of all its members) can't change after it's been created. This gives us several advantages:

  • No function that takes a value type as an input can change it, which reduces side-effects

  • There's no need to lock an immutable object for access by multiple threads, since no thread can change the value

  • You can use the value of an immutable object as a key in a hashed data structure, since the hash value can't ever change after creation

  • You can compare value types using simple value equivalence - if two value type instances have the same value, then for all practical purposes they are the same instance


    Incidentally, it can be easier for C# developers to understand immutability, because the System.String class is a good example. All the methods that "modify" a string return a new string - you can't actually modify the contents of a string (unless you get very clever with reflection, but that's just breaking the rules).

    Now let's take a C++ example and think about immutability.

    //A non-immutable Employee class
    
    #include <string>
    
    class EmployeeNF {
    public:
    	//The constructor body is in EmployeeNF.cpp so that
    	//there's something for the linker to use.
    	EmployeeNF(const std::string& aGivenName,
    		   const std::string& aFamilyName,
    		   const int	aNumber,
    		   const double &aSalary);
    
    	void SetNumber(const int aNumber) {
    		number=aNumber;
    	}
    
    	int Number() const {
    		return number;
    	}
    
    	void SetSalary(const double& aSalary) {
    		salary=aSalary;
    	}
    
    	double Salary() const {
    		return salary;
    	}
    
    	void SetGivenName(const std::string& aGivenName) {
    		givenName=aGivenName;
    	}
    
    	std::string GivenName() const {
    		return givenName;
    	}
    
    	void SetFamilyName(const std::string& aFamilyName) {
    		familyName=aFamilyName;
    	}
    
    	std::string FamilyName() const {
    		return familyName;
    	}
    
    private:
    	std::string	givenName;
    	std::string	familyName;
    	int		number;
    	double		salary;
    };


    Here we have a class that any C++ developer might write. It observes some good OO principles: the members are private and all access is via setters and getters. But it's fully mutable: any code can change any value.

    Now let's rewrite it so that it's immutable. Here's a first draft:

    //An immutable Employee class
    
    #include <string>
    
    class EmployeeF {
    public:
    	//The constructor body is in EmployeeF.cpp so that
    	//there's something for the linker to use.
    	EmployeeF(const std::string& aGivenName,
    		  const std::string& aFamilyName,
    		  const int	aNumber,
    		  const double&	aSalary);
    
    	const int Number() const {
    		return number;
    	}
    
    	const double& Salary() const {
    		return salary;
    	}
    
    	const std::string& GivenName() const {
    		return givenName;
    	}
    
    	const std::string& FamilyName() const {
    		return familyName;
    	}
    
    private:
    	const std::string	givenName;
    	const std::string	familyName;
    	const int		number;
    	const double		salary;
    };


    Some C++ points to note:
  • The member variables are now all marked const. This means that they can only be initialised in a constructor, using initialization lists. This is the most fundamental way of making them immutable.

  • The getter methods mostly return const references. There are long debates about whether this is a good idea, or whether it prevents compiler optimizations (RVO or NRVO, if you want to Google), but I've chosen to do so to illustrate that, since the member variables don't change, it's possible to return references to them.

  • The getters return const values or references - whether this is appropriate depends on design intentions, but it prevents some common errors (such as using = rather than == in a comparison). It also reflects the fact that we're dealing with an immutable type.


    Okay, this gives us immutability, but what if I need to change the salary for an Employee? To do this efficiently, we can write "setters" that copy most of the values for an instance, such as:
    	const EmployeeF SetNumber(const int aNumber) const {
    		return EmployeeF(givenName,familyName,aNumber,salary);
    	}
    
    	const EmployeeF SetSalary(const double& aSalary) const {
    		return EmployeeF(givenName,familyName,number,aSalary);
    	}
    
    

    (Here I'm assuming the default copy constructor will do a good enough job for us - in fact it will lead to duplication of the strings, but avoiding that would be a whole different problem and require strings with reference counting).

    So a "setter" like this allows me to get a new EmployeeF instance that is the same as the one on which I called the method, but with a new value for the number member.

    Let's dig a little further. How would we use such an immutable type?

    //Let's create our first EmployeeF
    EmployeeF e("Ben","Last",1,1000.0);
    
    //Great, let's now change the Number
    EmployeeF e2 = e.SetNumber(2);
    
    //Now let's alter the salary too
    e2 = e2.SetSalary(1100.00);


    But if you try this... that line won't compile. Why? Because you're trying to modify e2, and e2 is an immutable type. The compiler will try and build you a default operator= and it'll fail, because all the member fields of e2 are const. Instead, you have to adopt a functional programming pattern and use variables that you don't modify:

    //Let's create our first EmployeeF
    const EmployeeF e("Ben","Last",1,1000.0);
    
    //Great, let's now change the Number
    const EmployeeF e2 = e.SetNumber(2);
    
    //Now let's alter the salary too
    const EmployeeF e3 = e2.SetSalary(1100.00);


    Every time we make a change to an immutable EmployeeF, we need a new variable to hold the new value. This isn't actually a bad thing. A basic principle of functional programming is that you don't modify state. Your code takes values and applies transformations that result in new values - it doesn't modify the original ones. And that's exactly what this immutable class forces you to do, and why I declared the variables themselves as const in that example above.

    Okay, so what have I tried to show you here?

  • C++ supports you in defining types that are immutable once created

  • A key principle of immutable types is that functions operating on them return new values and leave the old ones unmodified

  • Variables holding immutable types are also immutable


    And that'll do for today...

Counting down in Python


    I have events coming up in my life; some are good and some are not so good. Anticipating the good ones is positive, and one way to do that is to have a little countdown visible on my desktop. I run Ubuntu Linux on my home laptop, and so I wanted a way to drop a countdown into the menu bar (where Unity shows indicator output).

    Unity (the Ubuntu window manager/interface) does this using indicators (yet another term to learn for gadgets). It took about 30 minutes to work out how to knock up a little indicator in Python, and here it is for anyone who needs it. I based it on a bare-bones clock indicator: the link to the original is in the comments.

    #!/usr/bin/env python
    # Unity indicator for countdown to a fixed date
    # author: ben last
    # based on Phil Ayres' clock indicator
    # (see https://gist.github.com/philayres/1309895)
    # 17 Mar 2013
     
    import gobject
    import gtk
    import appindicator
    import os, sys
    import time
    
    #Define the target time as a time_t
    TARGET=time.mktime((2013,12,24,23,59,0,0,0,0))
    
    def on_left_click(widget,ind):
        """Quit"""
        ind.set_label("")
        sys.exit(0)
      
    def on_timer(ind):
      """Update the label - we always use local time."""
    
      #Get the local time as a time_t (seconds since the epoch)
      t=time.mktime(time.localtime())
      
      #Get the offset to the target such that a positive value
      #means the target is in the future, in seconds.  Also save
      #in offset, so we can spot the end of the countdown.
      offset=seconds=TARGET-t
    
      #If we're at or past the target, reset to zero
      if seconds < 0: seconds=0
    
      #Convert to days, hours, minutes and seconds
      (days,seconds) = divmod(seconds,24*60*60)
      (hours,seconds) = divmod(seconds,60*60)
      (minutes,seconds) = divmod(seconds,60)
    
      #Generate the label
      label="%02d:%02d:%02d:%02d"%(days,hours,minutes,seconds)
    
      #Debug output
      #sys.stdout.write("%s %s\n"%(offset,label))
      
      #ind is the indicator - setting the label shows the countdown
      #in the indicator area.
      ind.set_label(label)
    
      #returning False means no more calls to this method
      return offset>=0
    
    if __name__ == "__main__":
        #Create the indicator with a clock icon
        ind = appindicator.Indicator("simple-countdown",
                                     "clock",
                                     appindicator.CATEGORY_APPLICATION_STATUS)
        ind.set_status (appindicator.STATUS_ACTIVE)
    
        #Set the label
        on_timer(ind)
    
        #Create a menu and an item
        menu = gtk.Menu()
        item = gtk.MenuItem("Quit")
    
        item.connect("activate", on_left_click, ind)
        item.show()
        menu.append(item)
        ind.set_menu(menu)
    
        #Start a one-second tick
        gtk.timeout_add(1000, on_timer, ind)
    
        gtk.main()
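    The divmod chain that builds the label is the only fiddly arithmetic, and it's easy to pull out as a pure function and check by hand (a hypothetical refactoring, not part of the indicator above):

```python
import time

def countdown_label(target, now=None):
    # Format seconds-until-target as DD:HH:MM:SS, clamped at zero
    now = time.mktime(time.localtime()) if now is None else now
    seconds = max(0, int(target - now))
    days, seconds = divmod(seconds, 24 * 60 * 60)
    hours, seconds = divmod(seconds, 60 * 60)
    minutes, seconds = divmod(seconds, 60)
    return "%02d:%02d:%02d:%02d" % (days, hours, minutes, seconds)

print(countdown_label(90061, now=0))  # 01:01:01:01
```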
    

The Care And Feeding Of Songs


    It is a truth universally acknowledged that some songs are better than others. And it ought to be pretty widely acknowledged that there are some people who can take a perfectly good song and perform it in a way that makes you wish you were temporarily deaf.

    The second thought came to me this morning. Our neighbour on one side is a perfectly nice old Aussie bloke who likes music from a certain period and/or genre. The genius of the Shadows frequently features on his playlist. Top hits of the late 60s and early 70s can often be heard, especially since he's a little deaf and likes to turn it up a bit. But this is all fine... or was, until he played Richard Clayderman whilst we were eating breakfast.

    To be fair, it may not have been Mr Clayderman. I'm sure there are other instrumentalists who make their livings by reducing good songs to feeble elevator muzak... in any event it got me thinking about what makes a good song. Or rather, a good "track", and by that I mean a performance of a song, rather than just the song itself.

    Here's my theory: in order to be listenable, a track has to have at least one of:
    (a) an interesting melody that's balanced between predictability and surprise
    (b) interesting lyrics that mean something to the listener
    (c) a genuine and moving performance

    Time for some examples. Let's take (c) first, because I'm feeling perverse. The Sex Pistols 'God Save The Queen' has no real melody and the lyrics aren't all that hot ("potential H-bomb"? What?). But it works (for me) because it's played with a sneering, raucous enthusiasm that makes a very simple song work. The same's true of, say, Iggy Pop's 'I Wanna Be Your Dog': four chords, not very many words and yet it can grab some people by the scruff of their brain. Or (c) can be achieved by just a voice; personally, I could listen to Cerys Matthews reciting the digits of pi for hours just to hear her voice.

    Moving away from tracks to songs: it's easy to find examples for (b). And it's this that Mr Clayderman's performance was making me ponder this morning. Since there were no words, all you could hear was the melody and when he performed Foreigner's 'I Wanna Know What Love Is', it showed up just how boring the melody of that song is. Not that it's a song I particularly like anyway, but if it works at all, it only works with the words. As a piece of music, it's tedious.

    Or we could go back a generation or two: the first verse of Cole Porter's 'Night And Day' is all sung on one note. The words are what's important. Or take almost anything by AC/DC: there's more melody in one bar of any of Angus's solos than in any verse, but (especially when Bon Scott was writing them) the lyrics carry the song. Or for an example that's a whole subculture: rap. No melody, all words.

    So finally, we come to (a): songs with melodies that can lift even simple lyrics. This is what (for me) separates real musicians from wannabees. Example one: Kate Bush's 'Wuthering Heights'. Now, I quite like Ms B's music, but I wouldn't argue that the words make the song. Or even that the words make sense half the time... but you could play the melody of that song without any words and it stands up as a damn good piece of music.

    Example two (and this is going to date me): any one of a whole bunch of tracks by Yes. Let's take 'Close To The Edge': the lyrics are incomprehensible stoned-hippy trash, and yet (assuming you're ok with progressive rock) they're carried by interesting music.

    And then there are the exceptions that prove the rule: tracks with none of the above. Well, you could turn on the radio and wait ten minutes and you're bound to hear an example or two. Anything by Good Charlotte would do: worthless, lazy songwriting. Churning out 'product' with as much attention as the average burger-flipper pays to the hundredth Big Mac of the day.

    But then again... there are the tracks with two or even three of the Key Attributes. In the 70s you could have heard Carole King produce a whole album of them (Tapestry). In the last few years, Elbow have done the same. The art of songwriting is far from dead, and one great song can remind me that music is worth persevering with.

    Of course, these are my examples. Yours will be different. But I reckon that in every track you really love, there will be at least one of (a), (b) or (c).

    Posted via LiveJournal app for iPad.

Ready-to-handness, and bed.


    So here's an interesting study done on a sample size of one (me). I'm pondering a blog entry, and I have two ways to write it. I can (a) get up, walk to the study, unlock one of the Linux laptops waiting patiently, fire up the LiveJournal web page and then type, on a proper keyboard. And it'll be a fast machine, with a sprightly web browser to find interesting and relevant links.

    Or, (b), I could use the LiveJournal app on the iPad. Which means two-finger typing on a screen keyboard, fighting the autocorrect. And a slow browser, with the sluggish copy-and-paste that iOS 5 has brought to the original iPad (thanks, Apple, I was worried the iPad was too fast before you released iOS 5). So, in terms of the actual task I want to do, no contest.

    But (b) means I don't have to get up yet. And the iPad's already here. And I'm comfy. Damn.

    Posted via LiveJournal app for iPad.

Hash and cache smash


    Working around the egregious Firefox cache collision defect.  Properly.

    It just occurred to me that the subject line of this post won't work nearly so well if you're Australian, or American, since natives of those nations pronounce "cache" in a novel colonial way; it should of course rhyme with "cash", which is how the Queen would pronounce it, if she knew what a cache was.

    Anyway... those of you who have to deal with the complex and complicated world of efficiently serving up HTTP responses may know about an issue with Firefox (or possibly Mozilla, depending on how you like to classify it).  Probably the most relevant bug report is Mozilla bug 290032, but here's a summary:
    • Firefox stores cached data by hashing the URL to a value.
    • The hash algorithm is weak, and generates duplicate values for similar URLs.
    • The cache can only store one cached URL per hash value.
    The upshot is that some URLs are always reloaded from the server and never cached.  Now, this may not seem like a very big deal, and if you're just a plain old Firefox user, you may not care if there's a little bit more network traffic.  But for anyone running a high-traffic server, it can be a serious issue.  Every time a user reloads data from your server rather than their cache, you have a server hit and a bandwidth hit.  Multiply that by a large number of users and it becomes a problem worth considering.

    Let's look at the problem in more detail.  First, we could do with a way to generate hashes from URLs so that we can do some tests.  Here, in the Programming Language Of The Gods, is an implementation of the Firefox/Mozilla cache hash algorithm:
    def hash(url):
        # The hash starts at zero
        h = 0
        # Iterate through the characters of the URL
        for c in url:
            # Take the ASCII value of each character
            cv = ord(c)
            # Rotate the hash value left four bits, as a 32-bit value
            bits1 = (h & 0xF0000000) >> 28
            bits2 = (h & 0x0FFFFFFF) << 4
            h = bits1 | bits2
            # XOR the ASCII character value into the hash
            h = h ^ cv
        return h
    

    There are various places online where you can find the fault with this algorithm summed up along these lines:
    the Firefox disk cache hash functions can generate collisions for URLs that differ only slightly, namely only on 8-character boundaries, and the hash algorithm generates collisions for URLs that are "similar enough". From the bug, "similar enough" seems to mean that every 4 characters (or perhaps 8) the URLs are the same.  In fact, it's more complex than that.  Let's take two example URLs provided by sembiance at Stack Overflow.
    ben@quad-14:~/working/hashing$ python
    Python 2.6.2 (release26-maint, Apr 19 2009, 01:58:18)
    [GCC 4.3.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from hashcheck import hash
    >>> print "0x%08x" % hash("http://worldofsolitaire.com/decks/paris/5/12c.png")
    0x40d8c8e9
    >>> print "0x%08x" % hash("http://worldofsolitaire.com/decks/paris/5/13s.png")
    0x40d8c8e9
    >>>
    
    Yow.  Both URLs generate the same hash, though they differ by two adjacent characters.  This means that only one of them can be cached by Firefox; if pages on the site regularly use both images, then at least one of them will be reloaded more often than it should. If you care to do the binary mathematics of the hash, you'll see that the reason for the clash is that differences between two characters in the two URLs can be cancelled out by differences in later characters.
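    To see that cancellation concretely, here's a minimal sketch (in the same Python as the hash code above) that feeds just the two differing characters through the rotate-and-XOR steps. The URLs differ only at '2' versus '3' and 'c' versus 's'; '2' and '3' differ in bit 0, and after one four-bit rotation that difference lands on bit 4, which is exactly where 'c' and 's' differ, so the final XOR wipes it out:

    ```python
    def rol4(h):
        # Rotate a 32-bit value left by four bits
        return ((h << 4) | (h >> 28)) & 0xFFFFFFFF

    # Run just the differing tail characters through the hash steps.
    # '2' (0x32) and '3' (0x33) differ only in bit 0; one rotation moves
    # that difference to bit 4, where 'c' (0x63) and 's' (0x73) differ,
    # so the XOR cancels it and both tails contribute the same value.
    a = rol4(rol4(0) ^ ord('2')) ^ ord('c')
    b = rol4(rol4(0) ^ ord('3')) ^ ord('s')
    print(hex(a), hex(b))  # both print 0x343
    ```

    Since the URLs are identical before and after those two characters, the hash states stay identical all the way through, and the final hashes clash.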

    Someone else who suffers from this bug: Google, or more specifically, Google Maps.  Like most mapping sites that are based on "image tiles" (which includes Google Maps, Bing Maps, OpenStreetMap, NearMap, etc), there is a URL to load any individual 256x256 pixel "tile" image from their servers.  Since there are many[1] possible tiles, there are many very similar URLs to load them.  Let's consider a couple of Google URLs that clash:
    >>> print "0x%08x"%hash("http://khm2.google.com/kh/v=52&x=118346&y=80466&z=17")
    0x198cc05f
    >>> print "0x%08x"%hash("http://khm2.google.com/kh/v=52&x=118350&y=80470&z=17")
    0x198cc05f
    >>>
    Because these two URLs clash, as with the other example above, only one of them can be cached by Firefox at any one time.  I picked these two also because they're URLs for two images that are quite likely to appear together; they're only 4 tiles away from each other horizontally and vertically, so it's entirely possible to have them displayed together.

    But if you crank up Firebug and watch Google URLs, you'll see that they use a trick to work around the issue: they append a substring of "Galileo" to each URL as the s parameter.  This is sometimes (incorrectly) referred to as a "safeword" and described as a security measure that Google use to restrict access to their site; it's nothing of the sort, as you can see by testing the URLs without the s parameter, or even with it set to something completely different.  The substring is there purely to make similar URLs sufficiently different that they don't generate the same hash value in the Firefox cache system.

    The length of the substring is generated from the x and y values using the algorithm length = ((3*x) + y) % 8 (where % means modulo).  So if we use the URLs with the s parameter added, we get:
    >>> print "0x%08x"%hash("http://khm2.google.com/kh/v=52&x=118346&y=80466&z=17&s=")
    0xcc05d095
    >>> print "0x%08x"%hash("http://khm2.google.com/kh/v=52&x=118350&y=80470&z=17&s=")
    0xcc05d095
    >>>
    Oops. What happened?  The answer is that in both cases the s parameter value is the same: zero characters from the string "Galileo":
    >>> ((3*118346)+80466)%8
    0
    >>> ((3*118350)+80470)%8
    0
    
    So there are still clashes. Google's extra parameter reduces the number of collisions, but certainly doesn't eliminate them.  In fact, it's extremely difficult to completely eliminate collisions[2] between URLs, but we can do better than Google's approach.
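    For reference, Google's scheme is simple enough to sketch in a couple of lines (the function name here is mine, not Google's):

    ```python
    def galileo_s(x, y):
        # Google's s parameter: the first ((3*x) + y) % 8 characters of "Galileo"
        return "Galileo"[:((3 * x) + y) % 8]

    # The two clashing tile URLs above both get a zero-length s parameter,
    # which is why appending it didn't break the collision.
    print(repr(galileo_s(118346, 80466)))  # ''
    print(repr(galileo_s(118350, 80470)))  # ''
    ```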
    At NearMap, we've been testing a new way to generate the substring, which uses the individual characters of the variant parts of the URL: the x and y values, and also the nmd (date) parameter that NearMap have (and Google don't). Here's a Python sample implementation:
    substring = "Vk52edzNRYKbGjF8Ur0WhmQlZs4wgipDETyL1oOMXIAvqtxJBuf7H36acCnS9P"  # non-repeating A-Z, a-z and digits

    def differenceEngine(s, a):
        """Use string a to return a deterministic but pseudo-random substring of s"""
        result = ""
        offset = 0
        # Walk the characters of a, which should all be digits
        for c in a:
            try:
                v = int(c)
            except ValueError:
                # The character isn't a digit, which is wrong, but
                # cope without crashing
                continue
            offset += v
            # Use the running total as an offset into the string, and take
            # the character at that position
            p = s[offset % len(s)]
            result += p
        return result

    def substringFor(x, y, nmd=None):
        """Return the substring parameter for a URL, given the x, y and nmd values.
        The x and y should be integers, the nmd should be a string - if no date is
        included in the URL, pass an empty string or None."""
        arg = str(x) + str(y) + str((3 * x) + y)
        if nmd:
            arg += nmd
        return differenceEngine(substring, arg)


    To test this, I took the logs for a day's worth of traffic on the NearMap site and analyzed the hash collisions.  With the Google "Galileo" algorithm, around 10% of all URLs collided.  With this alternative approach, the collision rate is around 0.02%.

    Okay, so this isn't a simple and clear chunk of code that you can use to modify your own URLs, but it demonstrates some principles:
    • The hash algorithm is based on characters, so if you use the 'add a substring parameter' approach, base that substring on the characters of the URL, not parameter values.
    • Use as many different characters as possible in your substring so that it varies as much as possible; "Galileo" doesn't allow for very many different substrings.
    • Build yourself a test setup (feel free to use the hash method above) that you can pass URLs through to spot collisions.  Use that to tune your URL construction code to minimize the collision rate.
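    As a starting point for that last suggestion, here's a minimal collision checker built around the same hash algorithm (repeated here, in an equivalent compact form, so the snippet stands alone):

    ```python
    def ff_hash(url):
        # Firefox/Mozilla cache hash: rotate left four bits (32-bit), XOR each character
        h = 0
        for c in url:
            h = (((h << 4) | (h >> 28)) & 0xFFFFFFFF) ^ ord(c)
        return h

    def find_collisions(urls):
        # Bucket URLs by hash value and keep only the buckets with clashes
        buckets = {}
        for u in urls:
            buckets.setdefault(ff_hash(u), []).append(u)
        return {h: us for h, us in buckets.items() if len(us) > 1}

    # Feed it the URLs from your access logs; the two solitaire images
    # from earlier clash, so they come back in the same bucket.
    clashes = find_collisions([
        "http://worldofsolitaire.com/decks/paris/5/12c.png",
        "http://worldofsolitaire.com/decks/paris/5/13s.png",
    ])
    for h, us in clashes.items():
        print("0x%08x" % h, us)
    ```

    Run your URL construction code's output through something like this and tune until the clashing buckets disappear (or at least become rare).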



    [1]Many? The standard tiling system that all these sites use divides the world up into square tiles. At zoom level 0, the whole world is shown in one tile. At zoom level 1, the world is shown in 4 tiles, at zoom level 2, 16 tiles, and so on. So at each zoom level z, the number of tiles is (2^z) squared. By the time you're at zoom level 21, where Google tend to stop, there are more than 4x10^12 tiles. Which is many, by most standards.
    [2]Actually, it's impossible, since to completely avoid them, you'd need to modify all the URLs used in the world. Which you can't. What I'm really aiming at here is to try and avoid collisions across the set of similar URLs used on one site.
    [3]Paolo Emilio Sfondrati was the Secretary of the Inquisition at the time that Galileo was denounced for heresy and brought to trial. So probably Sfondrati would be an invalid string to replace Galileo...