Maze62: a dense and speedy alphanumeric encoding for binary data

Abstract

There is a need to encode binary data into a text format, such that it can be used without further encoding where only a subset of ASCII text is normally used: identifiers, URL parameters, HTTP form content, HTTP cookie values, and so on.

There is number of existing solutions to this problem, but I believe there is room for improvement to all of them, and I propose a new encoding which I dub “Maze62″.

Existing solutions

Base64

Every three bytes (24 bits) of the input get regrouped into four 6-bit numbers, and each such number gets represented with one of the 62 alphanumerics and two other characters. Read more: http://en.wikipedia.org/wiki/Base64

Base16 (BinHex)

Each byte of the input is split into two 4-bit numbers, and each such numbers gets represented by one of the alphanuerics, typically constrained to the set of the hex digits [0-9a-f].

Base32

Every five bytes (40 bits) of input get split into a  group of eight 5-bit numbers, and each such number is encoded with an alphnumeric.

Base10

The input byte array gets represented as a a very large integer number, which is printed in decimal notation.

Base62

The input byte array gets represented as a a very large integer number, which is then converted to base-62 representation, similar to Base10 just with a different base. Entire set alphanumerics is used to represent the “digits”.

Shortcomings of existing solutions

While base64 is the usual choice, there are often problems with the two non-alphanumeric characters. What’s worse, when encoding algorithm is chosen, it is not known which symbols will pose problems in any of the future places where the data is to be used. For example, a forward slash is difficult to use in a URL, and a dash is illegal as a first character in a DNS name. Ideally, this mess would be excluded from consideration, which can only be achieved by sticking to the 62 alphanumerics.

An ideal solution therefore is one that only uses alphanumerics, is as dense as possible, and allows for speedy processing. None of the aforementioned solutions satisfy that criteria. In particular:

Base64 is dense with only 33% explosion in size, but is using more than the 62 alphanumerics and thus is prone to mangling.

Base16 is easy to code, but 100% explosion in size make it undesirable.

Base32 is still a 60% explosion. Absent any better ideas this would be my weapon of choice.

Base62 is almost as dense as Base64, and is neatly contained in its character set, but unlike other discussed solutions the cost of encoding is quadratic to the size of the input, which is clearly unacceptable.

Proposed solution

If you would like to try to solve the problem yourself, read no further. Otherwise, the algorithm below defines Maze62 encoding and presents it for your attention:

The algorithm

  1. The input data is represented as a stream of bits.
  2. Six bits are read from the input stream into current value CV. CV is therefore 63 or less.
  3. CV is divided by 62, and the remainder of the division, a value that is certain to be less than 62, is represented with alphanumeric and is written into the output stream of characters
  4. In theory, we must also store the quotient, which is either one or zero. However there is an important fact to be noted  – whenever the remainder of the division by 62 is larger than one, the quotient cannot be anything but zero. Given that the remainder is already emitted into the output stream, emitting the quotient would result into no new information being added, unless said remainder is either zero or one. Only in the latter case we also need to preserve and emit the quotient, which is also limited to single bit of data.
  5. Therefore, if the remainder is more than one, go to step 2
  6. If remainder is zero or one, memorize the one bit which represents the quotient
  7. Read five bits from the input stream into CV.
  8. Shift CV by one bit to the left, and insert the quotient from step 6 as the least significant bit.
  9. Go to step 3

The algorithm terminates after no more data can be read from the input and last piece of data is written out, including the potential stray quotient.

Analysis

The proposed encoding algorithm is linear in speed, nearly as dense as Base64, and fully constrained to alphanumeric set of characters. The only downside is a somewhat tricky implementation, but it’s not much worse than Base64. As I refine my python implementation of the algorithm I will post results on the comparative encoding efficiency.

The name

The name of the encoding is an amalgamation of “base62″ and the word amazing, hence “maze62″.

What’s next.

I will be publishing the python source code I have written to test this idea in the coming days. If there is interest, we could organize a public repository of Maze62 implementations in other languages, as well as a set of acceptance tests for independent implementations.

Discussion

In addition to the comments below there’s a discussion going on on hacker news: http://news.ycombinator.com/item?id=2553753

Further reading

In particular the reference to base85 encoding may prove useful: http://en.wikipedia.org/wiki/Base85

About these ads

18 Responses to “Maze62: a dense and speedy alphanumeric encoding for binary data”

  1. Denis Says:

    Hello Denis,

    Interesting idea. I would like to see some results.

    And while you’re at it, fix some typos for posterity:
    “Base62 is almost as dense as base62…”
    “…as an stream of bits.”
    “…a values that is…”
    “…we also need to preserver…”
    “If reminder is zero…”
    “…the one bit the represents…”

  2. Chris Gonnerman Says:

    Also, you called it “Maze64″ at the end there.

    Cool algorithm, by the way.

  3. William Swanson Says:

    I see a problem here. A lot of binary data contains runs of 0 bytes for padding or other purposes. Your algorithm penalizes 0’s by storing them 5 bits at a time rather than 6 bits at a time. If you are going to penalize certain numbers, penalizing 0 is about the worst possible choice.

    • Denis Altudov Says:

      I found this to work well with ASCII though :)

      We could certainly tune the algorithm for any given character set by adding/substracting a particular number to each input byte. If there are lots of zeroes we could simply add 2 to each byte before encoding, and that’ll take care of the problem. I might actually do this. :)

  4. bd_ Says:

    It sounds to me like this encoding is essentially equivalent to taking base64, then passing each character through a table-based replacement, in which the characters corresponding to 0, 1, 62, and 63 are simply replaced by some corresponding two-byte escape sequence. Given this, one can clearly do better, by only replacing three characters – that is, you’d have 61 normal characters, an escape character, and three acceptable characters after that. Defining it in this manner makes it easier to select less-common values for your escapes as well – 0 and 63 are far too common in real files.

    Of course, you might be able to do even better by assigning more escape sequences – the entropy of the byte following the escape is rather miserably low :)

    • Denis Altudov Says:

      While it may sound like, that’s not what it is. The occasional 0,1,62, or 63 results into one additional bit being stored, not one additional alphanumeric digit. It’s a lot less wasteful than would appear otherwise.

      I have purposefully avoided any attempt at compression. I fear I would get trampled by the compression experts for amateur-ism, and deservedly so. :)

      But yes, this is a fascinating subject too – if yo analyze the data distribution typical for your problem domain you might be able to come up with different encoding schemes.

      • bd_ Says:

        Ah, I see – yes, that’s not as bad as I thought, then. That said, there are a lot of ways to avoid the 12.5% expansion of repeated 0s – for example, you could XOR the 6-bit block with a simple, random function based on the offset into the output stream modulo some magic number. This would avoid penalizing any strings not specifically designed to exploit your random function.

      • Denis Altudov Says:

        Excellent idea about the function! Much better than my idea about adding a constant.

  5. Acorn Says:

    Here’s something similar I came up with a while ago:
    http://stackoverflow.com/questions/5940416/compress-a-series-of-1s-and-0s-into-the-shortest-possible-ascii-string/5941361#5941361

    It would be interesting to see how much shorter a result would be achievable using some kind of compression..

  6. Bob Foster Says:

    Very clever. In the worst case (and in the limit), this encoding converts every 5 bits of input to 8 bits of output. This is a 37.5% overhead. In the best case, it converts 6 bits of input to 8 bits of output, a 25% overhead. It is possible that for some input it would beat Base64.

    A couple of minor points:

    – In uncompressed binary data, the most likely 6-bit values will be 0 and 63, corresponding to the high-order bits of positive and negative numbers, respectively. You might get an advantage from swapping the input values 0, 1, 62 and 63 with values toward the middle of the range 2..61.

    – Some text editors and text tools have a fixed-size line buffer. Just passing encoded files through them will alter the data. It would be harmless to emit a new line every thousand or so characters and treat whitespace as noise in the decoder.

    – As I’m sure you’re well aware, t’s unlikely that the input is an even multiple of 5 or 6 bits, so there needs to be some extra hokey pokey at the end of input.

    • Denis Altudov Says:

      bd_’s comment at May 16, 2011 at 4:49 pm provides a way to solve the all-zeroes problem rather neatly, I think.

      Agree on the line breaking, that certainly won’t hurt anything.

      The byte boundary needs careful coding. The encoder implementation I am working on pads the last chunk of bits (between 1 and 5 of them) with zeroes to generate the last alphanumeric, while decoder then is sure to discard the trailing zero bits. The trick is to remember that the result must align at 8 bits. Similarly, base64 can be implemented without padding.

      • Bob Foster Says:

        Yes. I started writing my comment before bd_ posted. It’s a good idea.

        Another suggestion is some sort of version number, like 0, in the first output character. These things evolve.

        Love to see some code.

  7. Andrew Jennings Says:

    An optimal base-62 encoding could represent 5.954 bits of input data per output byte.

    Maze62, on the other hand, will turn 6 bits of input data into one output byte 60/64 of the time and turn it into one-and-one-sixth output bytes 4/64 of the time. That’s 5.938 bits of input data per output byte.

    So it’s 2.7% less efficient (on random data), but I’ll bet the algorithmic simplicity would be worth that in many cases.

    • Andrew Jennings Says:

      I should’ve said:

      With Maze62, on the other hand, one output byte will represent 5 input bits 4/64 of the time and 6 input bits 60/64 of the time.

      Still 5.938 bits of input data per output byte. Still 2.7% less efficient than optimal.

  8. Matt Briançon Says:

    It’s been about a week since you posted this and I’m wondering if you’ve had the time to post an implementation of Maze62 somewhere. I have a very specific need to encode binary data into an alphanumeric string and I’d love to give Maze62 a try.

    Thanks.

  9. Alexey Tourbin Says:

    Hello, there seems to be one problem which has to be addressed. Suppose the alphabet is 0-9A-Za-z and you need to encode 6 bits, all being zero. So you put ‘0’ character, and then there’s another q=0 bit to clarify that this was actually 0 and not 62. With 5 trailing zero bits, you put the second ‘0’ character but then again, by the rules, ‘0’ has to be complemented by the q=0 bit, and so on. Thus by trying to encode a sequence of zero bits, you enter an infinite loop.

    On the other hand, consider that you have to decode the string which consists only of ‘0’ and ‘1’ characters. What do you do? Suppose the first character is ‘0’. You then you need to decode the second character to decide whether the first character was 0 or 62. But to complete the second character, you need to peek at the third character, and so on. And so you may end up with unlimited lookahead.

    Thus the problem which has to be addressed is consecutive special characters. One possibility is to devise the algorithm which avoids them altogether (that is, guarantees that ‘0’ or ‘1’ is never followed by ‘0’ or ‘1’, so that only one-character lookahead is ever needed), but it would take slightly more bits. It may (or may not) be also possible, though, to tweak the algorithm so that multiple specials are handled gracefully.

Comments are closed.


Follow

Get every new post delivered to your Inbox.

%d bloggers like this: