Lexi-Sortable Number Strings

When working with key-value databases, it’s common practice to structure key strings to take advantage of database-specific features. With Cassandra, for instance, you can perform range queries over lexicographically sorted row keys. As an example of how this feature might be used, imagine an index consisting of rowkeys representing attributes like “population” or “region” that map to cities:

    "population/0389625": "Tulsa"  
    "population/0815358": "San Francisco"  
    "population/2851268": "Chicago"
    "population/3831868": "Los Angeles"
    "population/8391881": "New York"
    "region/central": "Chicago", "Tulsa"
    "region/east": "New York"
    "region/west": "Los Angeles", "San Francisco"

To find cities with at least 1 million people, we would issue a range query from “population/1000000” to “population/9999999”.

We had to pad out our numbers with leading zeroes to get a fixed-width, but that’s easy enough, and this approach works fine for positive integers. But how do we deal with negative integers, or floating point numbers? We’ll focus on Java longs and doubles, but the code should be trivially modifiable to work with ints and floats, and the ideas should translate easily to most languages.

First up, negative numbers. We can’t simply concatenate a negative number to a string that’s supposed to sort lexicographically, because larger negative numbers sort as smaller strings:


In Java, longs are specified as signed two’s complement 64 bit integers, meaning that if we treat a long’s underlying bits as a string, we would solve this problem at the cost of storing negatives after positives (using 4 bits for demo purposes):

    0100 (4)
    0111 (7)
    1001 (-7)
    1100 (-4)

But this is easy to fix: just flip the sign bit:

    0001 (-7)
    0100 (-4)
    1100 (4)
    1111 (7)

When we use this method in production, we’ll want to do two more things: represent the number in hex to avoid ridiculously long keys, and prepend a type token to remind us that a given hex string is a long. Here’s the code:

What about floating point? The most common representation of floating point values, and the one that you’re almost certainly using, is standardized by IEEE 754. Conveniently enough, IEEE 754 is designed such that:

if two floating-point numbers in the same format are ordered (say x < y), then they are ordered the same way when their bits are reinterpreted as Sign-Magnitude integers.

Said another way, just like with longs, if we just treat the underlying bits of IEEE 754 floating point numbers as strings, we’ll get proper sorting with only minor complications.

What are the new minor complications? A sign-magnitude integer uses its highest bit to signal signedness, and the remainder of its bits to signal the magnitude of the number:

    0100 (4)
    0111 (7)
    1100 (-4)
    1111 (-7)

We see that positive numbers sort correctly, but negative numbers have some problems – they sort high-to-low, and they sort after positive numbers. To solve the latter problem, we’ll again flip the sign bit for all numbers:

    0100 (-4)
    0111 (-7)
    1100 (4)
    1111 (7)

And to make negative numbers sort low-to-high, we’ll flip all their other bits, as well:

    0000 (-7)
    0011 (-4)
    1100 (4)
    1111 (7)

Here’s the code, building on the utility methods defined earlier:

You can also check out the full code snippet, which includes tests.

Now that we can store floating point attributes, our city index can look like:

    "area/dC06759999999999A": "Tulsa"
    "area/dC06CFD70A3D70A3D": "San Francisco"
    "area/dC06D400000000000": "Chicago"
    "area/dC07D4E6666666666": "New York"
    "area/dC07F24CCCCCCCCCD": "Los Angeles"    
    "population/l800000000005F1F9": "Tulsa"
    "population/l80000000000C70FE": "San Francisco"
    "population/l80000000002B81C4": "Chicago"
    "population/l80000000003A783C": "Los Angeles"
    "population/l8000000000800CC9": "New York"
    "region/central": "Chicago", "Tulsa"
    "region/east": "New York"
    "region/west": "Los Angeles", "San Francisco"

There are a couple of areas left for improvement. First, hex was chosen because it’s easy to work with, but it produces strings that are longer than necessary. If the encoding were changed to Base 64, only 10 characters per key would be needed, and if the keys in the system we’re using can be given as raw byte arrays, we can get away with 8 bytes + n bits (where n varies based on how many types you want to handle).

Second, this doesn’t even pretend to work with arbitrary-precision integers or decimals. After typing this post up, I found an implementation of these ideas that does extend to BigDecimal values, available as part of Lily CMS, as described by Bruno Dumon.

You can also take a look at Lucene‘s NumericUtils class, though you should take care to note that the long returned from the doubleToSortableLong method is not suitable for direct translation to a lexicographically sortable string, as it does not alter the sign bit.

Will it optimize? Another awesome post from one of the best programming bloggers around.

How to Disable Accesskeys in Firefox

This took way too long for me to Google, so here you go:

If you need to disable accesskey support in Firefox, because, say, you’re learning vimperator but would still like to visit wikipedia without having the rage, set ui.key.contentAccess to 0 in your about:config (you’ll also need to make sure ui.key.generalAccessKey is set to -1, but it already was for me).

Woot calls out the AP.

Mariano Rivera’s cutter dissected. Even if you’re not a baseball fan, the video has some great visualizations. (via daringfireball).

You know how when you draw a map for someone, you just sketch out a few lines and labels, and they usually end up getting where you want them to go? Or how when you print a map from online, every time you look at it, you end up spending ten seconds just filtering out all the information you don’t want before focusing on what you do? Bing Destination Maps might be your happy place, assuming you’re willing to install Silverlight. (via kottke)

Clive Thompson has a great article in New York Times Magazine about IBM’s Watson question-answering AI.

The IRS released data that Forbes turned into a map of migrations for every county in the United States in 2008. It’s really weird to see that I was one of less than 10 total people who made the move from Norman to San Francisco that year.

Absolutely spot-on analysis of AT&T’s new data pricing plans, from Christopher Schanck. In summary: by looking at how people use the phone today and developing a pricing model to lock people into those habits, AT&T is screwing Apple and application developers who want the device and data to become ubiquitous parts of people’s lives, and devaluing their own product.

Aza Raskin describes a clever phishing attack. If you use a password manager instead of manually typing your passwords, you’re safe from this one.

← Before After →