An Encoding Scheme For 2048 Bit Integers

Overview

Number of Bytes Number of Data Bits First Value Last Value 1st Byte 2nd Byte All Other Bytes
1 0 0 0 00000000 N/A N/A
--- --- --- --- --- --- ---
0+2 (0*8)+0 2^0 2^1-1 00000001 00000000 N/A
0+2 (0*8)+1 2^1 2^2-1 0000001x 00000000 N/A
0+2 (0*8)+2 2^2 2^3-1 000001xx 00000000 N/A
0+2 (0*8)+3 2^3 2^4-1 00001xxx 00000000 N/A
0+2 (0*8)+4 2^4 2^5-1 0001xxxx 00000000 N/A
0+2 (0*8)+5 2^5 2^6-1 001xxxxx 00000000 N/A
0+2 (0*8)+6 2^6 2^7-1 01xxxxxx 00000000 N/A
0+2 (0*8)+7 2^7 2^8-1 1xxxxxxx 00000000 N/A
--- --- --- --- --- --- ---
1+2 (1*8)+0 2^8 2^9-1 00000001 00000001 xxxxxxxx
1+2 (1*8)+1 2^9 2^10-1 00000001 0000001x xxxxxxxx
1+2 (1*8)+2 2^10 2^11-1 00000001 000001xx xxxxxxxx
1+2 (1*8)+3 2^11 2^12-1 00000001 00001xxx xxxxxxxx
1+2 (1*8)+4 2^12 2^13-1 00000001 0001xxxx xxxxxxxx
1+2 (1*8)+5 2^13 2^14-1 00000001 001xxxxx xxxxxxxx
1+2 (1*8)+6 2^14 2^15-1 00000001 01xxxxxx xxxxxxxx
1+2 (1*8)+7 2^15 2^16-1 00000001 1xxxxxxx xxxxxxxx
--- --- --- --- --- --- ---
2+2 (2*8)+0 2^16 2^17-1 00000010 00000001 xxxxxxxx
2+2 (2*8)+1 2^17 2^18-1 00000010 0000001x xxxxxxxx
2+2 (2*8)+2 2^18 2^19-1 00000010 000001xx xxxxxxxx
2+2 (2*8)+3 2^19 2^20-1 00000010 00001xxx xxxxxxxx
2+2 (2*8)+4 2^20 2^21-1 00000010 0001xxxx xxxxxxxx
2+2 (2*8)+5 2^21 2^22-1 00000010 001xxxxx xxxxxxxx
2+2 (2*8)+6 2^22 2^23-1 00000010 01xxxxxx xxxxxxxx
2+2 (2*8)+7 2^23 2^24-1 00000010 1xxxxxxx xxxxxxxx
... ... ... ... ... ... ...
255+2 (255*8)+0 2^2040 2^2041-1 11111111 00000001 xxxxxxxx
255+2 (255*8)+1 2^2041 2^2042-1 11111111 0000001x xxxxxxxx
255+2 (255*8)+2 2^2042 2^2043-1 11111111 000001xx xxxxxxxx
255+2 (255*8)+3 2^2043 2^2044-1 11111111 00001xxx xxxxxxxx
255+2 (255*8)+4 2^2044 2^2045-1 11111111 0001xxxx xxxxxxxx
255+2 (255*8)+5 2^2045 2^2046-1 11111111 001xxxxx xxxxxxxx
255+2 (255*8)+6 2^2046 2^2047-1 11111111 01xxxxxx xxxxxxxx
255+2 (255*8)+7 2^2047 2^2048-1 11111111 1xxxxxxx xxxxxxxx

Comments

This encoding scheme is mostly just big-endian with an 8-bit length prefix, but there's at least one difference. Specifically: Most single-byte values are encoded by adding a trailing null byte, not an 8-bit length prefix. That makes overlong encodings impossible, and it frees up one possible value from the length prefix. We could use that extra value to encode larger numbers, but it's probably better if we can't encode values in the range [2^2048, 2^2056-1]. Since the extra value isn't otherwise useful, we simply allow 0x00 to be encoded as a single byte.

You could actually modify this encoding such that other values in the range [0x00, 0xFF] are encoded using a single byte. If you encoded values in the range [0x00, 0x80] using a single byte, then latin text (and many text-based file formats) could be encoded more efficiently. I believe that 2^1024-1 is the largest value that could be encoded, in that case.