Abstract
There is a need to encode binary data into a text format, such that it can be used without further encoding where only a subset of ASCII text is normally used: identifiers, URL parameters, HTTP form content, HTTP cookie values, and so on.
There is number of existing solutions to this problem, but I believe there is room for improvement to all of them, and I propose a new encoding which I dub “Maze62″.
Existing solutions
Base64
Every three bytes (24 bits) of the input get regrouped into four 6-bit numbers, and each such number gets represented with one of the 62 alphanumerics and two other characters. Read more: http://en.wikipedia.org/wiki/Base64
Base16 (BinHex)
Each byte of the input is split into two 4-bit numbers, and each such numbers gets represented by one of the alphanuerics, typically constrained to the set of the hex digits [0-9a-f].
Base32
Every five bytes (40 bits) of input get split into a group of eight 5-bit numbers, and each such number is encoded with an alphnumeric.
Base10
The input byte array gets represented as a a very large integer number, which is printed in decimal notation.
Base62
The input byte array gets represented as a a very large integer number, which is then converted to base-62 representation, similar to Base10 just with a different base. Entire set alphanumerics is used to represent the “digits”.
Shortcomings of existing solutions
While base64 is the usual choice, there are often problems with the two non-alphanumeric characters. What’s worse, when encoding algorithm is chosen, it is not known which symbols will pose problems in any of the future places where the data is to be used. For example, a forward slash is difficult to use in a URL, and a dash is illegal as a first character in a DNS name. Ideally, this mess would be excluded from consideration, which can only be achieved by sticking to the 62 alphanumerics.
An ideal solution therefore is one that only uses alphanumerics, is as dense as possible, and allows for speedy processing. None of the aforementioned solutions satisfy that criteria. In particular:
Base64 is dense with only 33% explosion in size, but is using more than the 62 alphanumerics and thus is prone to mangling.
Base16 is easy to code, but 100% explosion in size make it undesirable.
Base32 is still a 60% explosion. Absent any better ideas this would be my weapon of choice.
Base62 is almost as dense as Base64, and is neatly contained in its character set, but unlike other discussed solutions the cost of encoding is quadratic to the size of the input, which is clearly unacceptable.
Proposed solution
If you would like to try to solve the problem yourself, read no further. Otherwise, the algorithm below defines Maze62 encoding and presents it for your attention:
The algorithm
- The input data is represented as a stream of bits.
- Six bits are read from the input stream into current value CV. CV is therefore 63 or less.
- CV is divided by 62, and the remainder of the division, a value that is certain to be less than 62, is represented with alphanumeric and is written into the output stream of characters
- In theory, we must also store the quotient, which is either one or zero. However there is an important fact to be noted – whenever the remainder of the division by 62 is larger than one, the quotient cannot be anything but zero. Given that the remainder is already emitted into the output stream, emitting the quotient would result into no new information being added, unless said remainder is either zero or one. Only in the latter case we also need to preserve and emit the quotient, which is also limited to single bit of data.
- Therefore, if the remainder is more than one, go to step 2
- If remainder is zero or one, memorize the one bit which represents the quotient
- Read five bits from the input stream into CV.
- Shift CV by one bit to the left, and insert the quotient from step 6 as the least significant bit.
- Go to step 3
The algorithm terminates after no more data can be read from the input and last piece of data is written out, including the potential stray quotient.
Analysis
The proposed encoding algorithm is linear in speed, nearly as dense as Base64, and fully constrained to alphanumeric set of characters. The only downside is a somewhat tricky implementation, but it’s not much worse than Base64. As I refine my python implementation of the algorithm I will post results on the comparative encoding efficiency.
The name
The name of the encoding is an amalgamation of “base62″ and the word amazing, hence “maze62″.
What’s next.
I will be publishing the python source code I have written to test this idea in the coming days. If there is interest, we could organize a public repository of Maze62 implementations in other languages, as well as a set of acceptance tests for independent implementations.
Discussion
In addition to the comments below there’s a discussion going on on hacker news: http://news.ycombinator.com/item?id=2553753
Further reading
In particular the reference to base85 encoding may prove useful: http://en.wikipedia.org/wiki/Base85