Lexi-Sortable Number Strings
When working with key-value databases, it’s common practice to structure key strings to take advantage of database-specific features. With Cassandra, for instance, you can perform range queries over lexicographically sorted row keys. As an example of how this feature might be used, imagine an index consisting of rowkeys representing attributes like “population” or “region” that map to cities:
"population/0389625": "Tulsa" "population/0815358": "San Francisco" "population/2851268": "Chicago" "population/3831868": "Los Angeles" "population/8391881": "New York" "region/central": "Chicago", "Tulsa" "region/east": "New York" "region/west": "Los Angeles", "San Francisco"
To find cities with at least 1 million people, we would issue a range query from “population/1000000″ to “population/9999999″.
We had to pad out our numbers with leading zeroes to get a fixed-width, but that’s easy enough, and this approach works fine for positive integers. But how do we deal with negative integers, or floating point numbers? We’ll focus on Java longs and doubles, but the code should be trivially modifiable to work with ints and floats, and the ideas should translate easily to most languages.
First up, negative numbers. We can’t simply concatenate a negative number to a string that’s supposed to sort lexicographically, because larger negative numbers sort as smaller strings:
"-4" "-7" "04" "07"
In Java, longs are specified as signed two’s complement 64 bit integers, meaning that if we treat a long’s underlying bits as a string, we would solve this problem at the cost of storing negatives after positives (using 4 bits for demo purposes):
0100 (4) 0111 (7) 1001 (-7) 1100 (-4)
But this is easy to fix: just flip the sign bit:
0001 (-7) 0100 (-4) 1100 (4) 1111 (7)
When we use this method in production, we’ll want to do two more things: represent the number in hex to avoid ridiculously long keys, and prepend a type token to remind us that a given hex string is a long. Here’s the code:
What about floating point? The most common representation of floating point values, and the one that you’re almost certainly using, is standardized by IEEE 754. Conveniently enough, IEEE 754 is designed such that:
if two floating-point numbers in the same format are ordered (say x < y), then they are ordered the same way when their bits are reinterpreted as Sign-Magnitude integers.
Said another way, just like with longs, if we just treat the underlying bits of IEEE 754 floating point numbers as strings, we’ll get proper sorting with only minor complications.
What are the new minor complications? A sign-magnitude integer uses its highest bit to signal signedness, and the remainder of its bits to signal the magnitude of the number:
0100 (4) 0111 (7) 1100 (-4) 1111 (-7)
We see that positive numbers sort correctly, but negative numbers have some problems – they sort high-to-low, and they sort after positive numbers. To solve the latter problem, we’ll again flip the sign bit for all numbers:
0100 (-4) 0111 (-7) 1100 (4) 1111 (7)
And to make negative numbers sort low-to-high, we’ll flip all their other bits, as well:
0000 (-7) 0011 (-4) 1100 (4) 1111 (7)
Here’s the code, building on the utility methods defined earlier:
You can also check out the full code snippet, which includes tests.
Now that we can store floating point attributes, our city index can look like:
"area/dC06759999999999A": "Tulsa" "area/dC06CFD70A3D70A3D": "San Francisco" "area/dC06D400000000000": "Chicago" "area/dC07D4E6666666666": "New York" "area/dC07F24CCCCCCCCCD": "Los Angeles" "population/l800000000005F1F9": "Tulsa" "population/l80000000000C70FE": "San Francisco" "population/l80000000002B81C4": "Chicago" "population/l80000000003A783C": "Los Angeles" "population/l8000000000800CC9": "New York" "region/central": "Chicago", "Tulsa" "region/east": "New York" "region/west": "Los Angeles", "San Francisco"
There are a couple of areas left for improvement. First, hex was chosen because it’s easy to work with, but it produces strings that are longer than necessary. If the encoding were changed to Base 64, only 10 characters per key would be needed, and if the keys in the system we’re using can be given as raw byte arrays, we can get away with 8 bytes + n bits (where n varies based on how many types you want to handle).
Second, this doesn’t even pretend to work with arbitrary-precision integers or decimals. After typing this post up, I found an implementation of these ideas that does extend to BigDecimal values, available as part of Lily CMS, as described by Bruno Dumon.
You can also take a look at Lucene‘s NumericUtils class, though you should take care to note that the long returned from the doubleToSortableLong method is not suitable for direct translation to a lexicographically sortable string, as it does not alter the sign bit.