base32cx

posted on 13 Nov 2019

base32cx is a base-32 encoding with letter-case checksums inspired by Ethereum’s EIP55.

It is designed for encoding short byte string identifiers as human-skimmable strings, with use cases like file hashes or cryptocurrency addresses in mind.

The alphabet maximizes the number of alpha characters to increase the average number of checksum bits per string.

base32cx has a variant, base32ux, which is the same alphabet without a checksum.

The unchecked variant base32ux has the property that a lexical sort of encoded data is a bitwise sort of decoded data, like base32hex.

This page (permalink) is the home of the specification.

An initial implementation is here.

(This post is undergoing a series of revisions as the spec is finalized. The latest revision was 2019-Nov-21.)

alphabet

                       base32cx alphabet

                   value:  0,1,2,3,4,5,[6..31]
                encoding:  4,5,6,7,8,9,[A..Z] 
                 lowered:  4,5,6,7,8,9,[a..z] 

value encoding  value encoding  value encoding  value encoding
----- --------  ----- --------  ----- --------  ----- --------
    0 4             8 C (c)        16 K (k)        24 S (s)
    1 5             9 D (d)        17 L (l)        25 T (t)
    2 6            10 E (e)        18 M (m)        26 U (u)
    3 7            11 F (f)        19 N (n)        27 V (v)
    4 8            12 G (g)        20 O (o)        28 W (w)
    5 9            13 H (h)        21 P (p)        29 X (x)
    6 A (a)        14 I (i)        22 Q (q)        30 Y (y)
    7 B (b)        15 J (j)        23 R (r)        31 Z (z)

The alphabet is selected so that:

  • all alpha characters are present to maximize the probability of getting a check bit ‘hit’
  • like base32hex, an ascii sort is a bitwise sort (true for base32ux only, not base32cx)

Uppercase letters are chosen for the unchecked variant because the numeric characters are “tall”. This makes unchecked data appear uniform while checked data appears mixed-height (see the example section).
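A minimal sketch of the unchecked variant, to make the alphabet and the sort property concrete. The names `ALPHABET` and `base32ux_encode` are illustrative and not taken from the initial implementation. The sketch assumes the usual base-32 bit packing: most-significant bit first, with the final 5-bit group zero-filled.

```python
ALPHABET = "456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # index = value 0..31


def base32ux_encode(data: bytes) -> str:
    """Unpadded base-32 encoding with the alphabet above (no checksum)."""
    bits = "".join(f"{byte:08b}" for byte in data)
    bits += "0" * (-len(bits) % 5)  # zero-fill the last 5-bit group
    return "".join(
        ALPHABET[int(bits[i:i + 5], 2)] for i in range(0, len(bits), 5)
    )


# The alphabet is already in ASCII order, so for equal-length inputs
# sorting the encoded strings gives the same order as sorting the bytes.
samples = [bytes([i, 255 - i]) for i in range(0, 256, 17)]
by_encoding = sorted(samples, key=base32ux_encode)
```

Here `by_encoding` comes out equal to `sorted(samples)`, and `base32ux_encode(b"Hello")` gives `D5MQSV7J`.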

checksum

To checksum, take the sha256 of the bytes to be encoded. Call this hash CHECK. Encode the bytes using the alphabet above, like any other base32 alphabet without padding. Lowercase the i'th character of the encoded string if the (i % 256)'th bit of CHECK is 0; keep it uppercase if the bit is 1. Digits have no case, so only letters carry a check bit.
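The checksum step can be sketched as follows. This is not the reference implementation: `base32cx_encode` is a hypothetical name. The sketch assumes that characters are indexed from 0 and that bit 0 of CHECK is the most significant bit of its first byte. The text above does not state either point.

```python
import hashlib

ALPHABET = "456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # index = value 0..31


def base32cx_encode(data: bytes) -> str:
    """Base-32 encode `data`, then set letter case from sha256(data)."""
    bits = "".join(f"{byte:08b}" for byte in data)
    bits += "0" * (-len(bits) % 5)  # zero-fill the last 5-bit group
    plain = "".join(
        ALPHABET[int(bits[i:i + 5], 2)] for i in range(0, len(bits), 5)
    )

    check = hashlib.sha256(data).digest()  # CHECK, 256 bits
    out = []
    for i, ch in enumerate(plain):
        k = i % 256
        # ASSUMPTION: bit k is counted from the most significant bit
        # of the first byte of CHECK.
        bit = (check[k // 8] >> (7 - k % 8)) & 1
        out.append(ch if bit else ch.lower())  # lower() leaves digits alone
    return "".join(out)
```

Under these assumptions `base32cx_encode(b"Hello")` returns `d5mQSv7j`.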

On average, this encoding gives (26/32 alphas per alphabet) * (8/5 chars per byte) = 1.3 bits of checksum per byte.
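The same arithmetic as a small sketch, useful for estimating how many check bits an identifier of a given size carries. The function name `expected_check_bits` is illustrative. The estimate treats every character as uniformly distributed over the alphabet and ignores the zero-filled final character, so it is only approximate.

```python
from fractions import Fraction

ALPHA_FRACTION = Fraction(26, 32)  # letters among the 32 symbols
CHARS_PER_BYTE = Fraction(8, 5)    # each character carries 5 bits


def expected_check_bits(n_bytes: int) -> Fraction:
    """Average number of check bits for an n-byte input (approximate)."""
    return ALPHA_FRACTION * CHARS_PER_BYTE * n_bytes
```

For a 20-byte identifier this gives 26 check bits on average. For a 32-byte hash it gives 41.6.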

max size

base32cx is only defined for byte sequences up to length 2^20 - 1, that is, one byte less than 1 MiB. It is most likely not an appropriate choice of encoding for larger data. For completeness, a standard method for hashing large data and applying the checksum in chunks will be specified in the future. Until then, base32cx is simply not defined if the data to be encoded is longer than 2^20 - 1 bytes.
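An encoder might enforce this limit with a guard like the one below. The names `MAX_INPUT_LEN` and `check_encodable` are hypothetical. Raising `ValueError` is one possible choice and is not something the spec mandates.

```python
MAX_INPUT_LEN = 2**20 - 1  # 1,048,575 bytes, one less than 1 MiB


def check_encodable(data: bytes) -> None:
    """Raise ValueError if `data` is too long for base32cx."""
    if len(data) > MAX_INPUT_LEN:
        raise ValueError(
            f"base32cx is not defined for {len(data)} bytes "
            f"(maximum is {MAX_INPUT_LEN})"
        )
```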

example

encode("Hello")

    encoding    result      note
   
    base32cx    d5mQSv7j    appears mixed / passes checksum
    base32ux    D5MQSV7J    appears uniform / fails checksum
      none      d5mqsv7j    appears mixed  / fails checksum; uppercased, it might be base32ux
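The rows above can be told apart mechanically: decode the string ignoring case, recompute the expected casing, and compare. The sketch below is one way to do that. `decode`, `apply_case` and `passes_checksum` are hypothetical names. The bit order of CHECK (most significant bit first) is an assumption, and the decoder does not check that the trailing pad bits are zero.

```python
import hashlib

ALPHABET = "456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
VALUE = {ch: v for v, ch in enumerate(ALPHABET)}


def decode(text: str) -> bytes:
    """Case-insensitive decode; raises KeyError on characters outside
    the alphabet and silently drops trailing pad bits."""
    bits = "".join(f"{VALUE[ch]:05b}" for ch in text.upper())
    usable = len(bits) - len(bits) % 8
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))


def apply_case(plain: str, data: bytes) -> str:
    """Lowercase character i of `plain` when bit (i % 256) of
    sha256(data) is 0 (ASSUMPTION: bits counted MSB-first)."""
    check = hashlib.sha256(data).digest()
    out = []
    for i, ch in enumerate(plain):
        k = i % 256
        bit = (check[k // 8] >> (7 - k % 8)) & 1
        out.append(ch if bit else ch.lower())
    return "".join(out)


def passes_checksum(text: str) -> bool:
    """True if the letter case of `text` matches its own checksum."""
    return apply_case(text.upper(), decode(text)) == text
```

With this sketch `passes_checksum("d5mQSv7j")` is true, while `D5MQSV7J` and `d5mqsv7j` both fail. All three decode to the same bytes.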

tests

These are generated from the first implementation. Please check them against your own.

ascii    : H
base32cx : d4
ascii    : He
base32cx : d5MK
ascii    : Hel
base32cx : D5MQs
ascii    : Hell
base32cx : D5mqSV4
ascii    : Hello
base32cx : d5mQSv7j
ascii    : Hello!
base32cx : d5MQsv7J88
ascii    : 000011112222333344445555666677778888
base32cx : A4s74g5La8sn6glMact7AGtNagu7cH5oAoUneHdqasv7GhTrAwvnki5Sb4
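To check an implementation against these vectors, a small harness such as the following may help. It is only a sketch: `VECTORS` copies the pairs listed above, and `failing_vectors` is a hypothetical helper that accepts any encoder taking bytes and returning a string.

```python
from typing import Callable, List, Tuple

# (input bytes, expected base32cx string), copied from the list above
VECTORS = [
    (b"H", "d4"),
    (b"He", "d5MK"),
    (b"Hel", "D5MQs"),
    (b"Hell", "D5mqSV4"),
    (b"Hello", "d5mQSv7j"),
    (b"Hello!", "d5MQsv7J88"),
    (b"000011112222333344445555666677778888",
     "A4s74g5La8sn6glMact7AGtNagu7cH5oAoUneHdqasv7GhTrAwvnki5Sb4"),
]


def failing_vectors(
    encode: Callable[[bytes], str],
) -> List[Tuple[bytes, str, str]]:
    """Return (input, expected, actual) for every vector that
    `encode` gets wrong; an empty list means all vectors pass."""
    failures = []
    for data, expected in VECTORS:
        actual = encode(data)
        if actual != expected:
            failures.append((data, expected, actual))
    return failures
```

Calling `failing_vectors(my_encode)` with your own encoder returns an empty list when every vector matches.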

multibase

Using base32cx as a multibase requires selecting a prefix that is not reserved in the multibase table. A candidate might be X/x for ‘chECKSum’. This section will be revised when a prefix is set in stone.