base32cx
posted on 13 Nov 2019
base32cx is a base-32 encoding with letter-case checksums inspired by Ethereum’s EIP55.
It is designed for encoding short byte string identifiers as human-skimmable strings, with use cases like file hashes or cryptocurrency addresses in mind.
The alphabet maximizes the number of alpha characters to increase the average number of checksum bits per string.
base32cx has a variant, base32ux, which is the same alphabet without a checksum.
The unchecked variant base32ux has the property that a lexical sort of encoded data is a bitwise sort of decoded data, like base32hex.
This page (permalink) is the home of the specification.
An initial implementation is here.
(This post is undergoing a series of revisions as the spec is finalized. The latest revision was 2019-Nov-21.)
alphabet
base32cx alphabet
value: 0,1,2,3,4,5,[6..31]
encoding: 4,5,6,7,8,9,[A..Z]
lowered: 4,5,6,7,8,9,[a..z]
value  encoding   value  encoding   value  encoding   value  encoding
-----  --------   -----  --------   -----  --------   -----  --------
0      4          8      C (c)      16     K (k)      24     S (s)
1      5          9      D (d)      17     L (l)      25     T (t)
2      6          10     E (e)      18     M (m)      26     U (u)
3      7          11     F (f)      19     N (n)      27     V (v)
4      8          12     G (g)      20     O (o)      28     W (w)
5      9          13     H (h)      21     P (p)      29     X (x)
6      A (a)      14     I (i)      22     Q (q)      30     Y (y)
7      B (b)      15     J (j)      23     R (r)      31     Z (z)
The alphabet is selected so that:
- all alpha characters are present to maximize the probability of getting a check bit ‘hit’
- like base32hex, an ASCII sort is a bitwise sort (true for base32ux only, not base32cx)
Uppercase letters are chosen for the unchecked variant because the numeric characters are “tall”. This makes unchecked data appear uniform while checked data appears mixed-height – see example section.
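As a rough illustration of the alphabet and the sort property, here is a minimal Python sketch of the unchecked variant. The name `encode_ux` is made up for this post and is not taken from the initial implementation. It assumes the final 5-bit group is zero-filled, as in other unpadded base32 encodings, and the sort check only compares inputs of equal length.

```python
# Index in this string = 5-bit value; characters are in ascending ASCII order.
ALPHABET = "456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def encode_ux(data: bytes) -> str:
    """Unpadded base-32 encoding with the alphabet above (hypothetical helper)."""
    bits = "".join(f"{b:08b}" for b in data)
    bits += "0" * (-len(bits) % 5)  # assumption: zero-fill the last 5-bit group
    return "".join(ALPHABET[int(bits[i:i + 5], 2)] for i in range(0, len(bits), 5))


# Sorting equal-length inputs by their encoded form gives the same order
# as sorting the raw bytes.
samples = [bytes([a, b]) for a in (0x00, 0x7F, 0xFF) for b in (0x00, 0x80, 0xFF)]
assert sorted(samples, key=encode_ux) == sorted(samples)
```

Because digits sort before uppercase letters in ASCII and the alphabet assigns them in that order, no custom comparator is needed for base32ux strings.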
checksum
To checksum, take the sha256 of the bytes to be encoded. Call this hash CHECK.
Encode the bytes using the alphabet above, like any other base32 alphabet without padding.
Lowercase the i’th character of the encoded string if the (i % 256)’th bit of CHECK is a 0.
Keep it uppercased if it is a 1.
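The steps above can be sketched in Python roughly as follows. This is not the initial implementation, and `encode_cx` is a made-up name. The text does not spell out bit numbering, so the sketch assumes that `i` counts from 0 and that bit 0 of CHECK is the most significant bit of its first byte.

```python
import hashlib

# Index in this string = 5-bit value.
ALPHABET = "456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def encode_cx(data: bytes) -> str:
    """Encode bytes with letter-case checksum bits (hypothetical helper)."""
    # Step 1: plain unpadded base-32 with the uppercase alphabet.
    bits = "".join(f"{b:08b}" for b in data)
    bits += "0" * (-len(bits) % 5)  # assumption: zero-fill the last 5-bit group
    encoded = "".join(ALPHABET[int(bits[i:i + 5], 2)] for i in range(0, len(bits), 5))

    # Step 2: CHECK = sha256 of the raw bytes, viewed as 256 bits.
    check = hashlib.sha256(data).digest()
    check_bits = "".join(f"{b:08b}" for b in check)  # assumption: MSB first

    # Step 3: keep a character uppercase on a 1 bit, lowercase it on a 0 bit.
    # Digits have no case, so lower() leaves them unchanged.
    return "".join(
        ch if check_bits[i % 256] == "1" else ch.lower()
        for i, ch in enumerate(encoded)
    )
```

Under those assumptions, `encode_cx(b"Hello")` gives `d5mQSv7j`, matching the example section.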
On average, this encoding gives (26/32 alphas per alphabet) * (8/5 chars per byte) = 1.3 bits of checksum per byte.
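To make that arithmetic concrete, here is a small sketch with exact fractions. `expected_check_bits` is a made-up helper. It assumes uniformly random input bytes and ignores the zero-filled bits in the final character, so it is exact only when the length is a multiple of 5 bytes.

```python
from fractions import Fraction

ALPHA_SHARE = Fraction(26, 32)   # 26 of the 32 symbols are letters
CHARS_PER_BYTE = Fraction(8, 5)  # each character carries 5 bits


def expected_check_bits(n_bytes: int) -> Fraction:
    """Average number of case-carrying characters for n_bytes of random input."""
    return n_bytes * ALPHA_SHARE * CHARS_PER_BYTE


print(expected_check_bits(1))   # 13/10
print(expected_check_bits(20))  # 26
```

So a 20-byte identifier, which encodes to exactly 32 characters, carries about 26 check bits on average.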
max size
base32cx is only defined for byte sequences up to length 2^20 - 1, that is, one byte less than 1 MiB.
It is most likely not an appropriate choice of encoding for larger data.
For completeness, a standard method for hashing large data and applying the checksum in chunks will be specified in the future.
Until then, base32cx is simply not defined if the data to be encoded is longer than 2^20 - 1 bytes.
example
encode("Hello")
encoding   result     note
base32cx   d5mQSv7j   appears mixed / passes checksum
base32ux   D5MQSV7J   appears uniform / fails checksum
none       d5mqsv7j   appears mixed / fails checksum, uppercased might be base32ux
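The three outcomes in the table can be told apart mechanically. The sketch below is one way to do it, not part of the spec: `decode`, `apply_case` and `classify` are made-up names, and it assumes `i` counts from 0 and that CHECK bits are read most-significant-first.

```python
import hashlib

# Index in this string = 5-bit value.
ALPHABET = "456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def decode(text: str) -> bytes:
    """Decode either variant, ignoring case and dropping leftover fill bits."""
    bits = "".join(f"{ALPHABET.index(ch):05b}" for ch in text.upper())
    usable = len(bits) - len(bits) % 8
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))


def apply_case(upper: str, data: bytes) -> str:
    """Re-derive the expected letter case from sha256(data)."""
    digest = hashlib.sha256(data).digest()
    check_bits = "".join(f"{b:08b}" for b in digest)  # assumption: MSB first
    return "".join(
        ch if check_bits[i % 256] == "1" else ch.lower()
        for i, ch in enumerate(upper)
    )


def classify(text: str) -> str:
    """Label a string the way the 'note' column of the example table does."""
    if apply_case(text.upper(), decode(text)) == text:
        return "passes checksum"
    if text == text.upper():
        return "fails checksum, might be base32ux"
    return "fails checksum"


for candidate in ("d5mQSv7j", "D5MQSV7J", "d5mqsv7j"):
    print(candidate, "->", classify(candidate))
```

A decoder that receives an all-uppercase string failing the checksum cannot tell unchecked data from corrupted checked data, so callers would need to know which variant they expect.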
tests
These are generated from the first implementation. Please check them against your own.
ascii : H
base32cx : d4
ascii : He
base32cx : d5MK
ascii : Hel
base32cx : D5MQs
ascii : Hell
base32cx : D5mqSV4
ascii : Hello
base32cx : d5mQSv7j
ascii : Hello!
base32cx : d5MQsv7J88
ascii : 000011112222333344445555666677778888
base32cx : A4s74g5La8sn6glMact7AGtNagu7cH5oAoUneHdqasv7GhTrAwvnki5Sb4
multibase
Using base32cx as a multibase requires selecting a prefix that is not reserved in the table.
A candidate might be X/x for ‘chECKSum’.
This section will be revised when a prefix is set in stone.