-

@ mleku
2025-06-06 14:12:44
a lexicon is a form of compression in itself. most compression algorithms build a lexicon of repeating symbols, but it's a thing that needs to be shared between participants on a consistent and timely basis
a fixed lexicon is less flexible but potentially more compact
also, if you use a variable length integer scheme with 8 bit units, and mark the end of a value with the 8th bit being on, each byte carries 7 bits of payload: you get 128 ciphers in 1 byte and 2^14 = 16384 ciphers in 2 bytes. if you order your lexicon so the most frequent words are the lower numbers, most of the time a cipher will be 1 or 2 bytes, for like 95% of messages, and for that last 5% you need 24 bits.
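a rough sketch of that varint idea in python, just to make the arithmetic concrete. assumptions that are mine and not from the message above: the 7 bit groups go most significant first, and `build_lexicon`, `encode_message` and `decode_message` are hypothetical helper names. the decoder reads one byte at a time and never looks ahead.

```python
from collections import Counter

END_FLAG = 0x80  # 8th bit on marks the last byte of a value


def encode_varint(n):
    """Encode a non-negative integer, 7 payload bits per byte, high group first."""
    if n < 0:
        raise ValueError("only non-negative integers")
    groups = [n & 0x7F]
    n >>= 7
    while n:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()          # assumption: most significant group is sent first
    groups[-1] |= END_FLAG    # flag only the final byte
    return bytes(groups)


def decode_stream(data):
    """Yield integers from an iterable of bytes, one byte at a time, no lookahead."""
    value = 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        if byte & END_FLAG:   # end of this cipher
            yield value
            value = 0


def build_lexicon(corpus):
    """Hypothetical helper: most frequent words get the lowest numbers."""
    return [word for word, _ in Counter(corpus.split()).most_common()]


def encode_message(msg, lexicon):
    index = {word: i for i, word in enumerate(lexicon)}
    return b"".join(encode_varint(index[word]) for word in msg.split())


def decode_message(data, lexicon):
    return " ".join(lexicon[i] for i in decode_stream(data))
```

with this layout, values 0..127 take 1 byte, 128..16383 take 2 bytes, and anything up to 2^21 - 1 takes 3 bytes (24 bits on the wire).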
this is a subject that is very dear to my heart and i have even written a novel scheme for variable integers that can be encoded and decoded over a stream process without forward scanning or going backwards to clarify what a message is.
this encoding is a trinary code though, so making a variable integer encoding with it would be a different procedure, because with binary you just use the 8th bit as a continue/end flag (so each segment has the same value at the 8th bit, except the last one, which flips it to mark the end of a cipher).
i would have to think about it a bit to convert this varint style to a trinary encoding. you could even just have a third state (like the black or the full bright) indicate the end of each cipher; this is the simplest form of compression that exists for general encoding. but instead you could change it up so you segment your atomic values: a trinary is based on 3, so probably you could join them into groups of 3 and then you get 27 (3^3) values, which is 26 letters plus one spare. so for this encoding you could have a simple all lower case alphabet where the all-three-in-bright represents the space, and use telegraphy words like "stop"
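both trinary ideas can be sketched in a few lines of python. this is only an illustration under my own assumptions: the three states are numbered 0, 1, 2 with 2 standing in for "full bright", the terminator variant carries plain binary digits in states 0 and 1, and all the function names are made up.

```python
END = 2  # assumption: the third state ("full bright") closes a cipher

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 27 symbols; space is index 26 = (2, 2, 2)


def encode_terminated(n):
    """Binary digits of n as trits 0/1, most significant first, then END."""
    return [int(bit) for bit in bin(n)[2:]] + [END]


def decode_terminated(trits):
    """Yield integers from a trit stream, one trit at a time, no lookahead."""
    value = 0
    for trit in trits:
        if trit == END:
            yield value
            value = 0
        else:
            value = (value << 1) | trit


def text_to_trits(text):
    """Map each of a-z and space to a fixed group of three trits."""
    trits = []
    for ch in text:
        n = ALPHABET.index(ch)  # raises ValueError outside a-z and space
        trits.extend((n // 9, (n // 3) % 3, n % 3))
    return trits


def trits_to_text(trits):
    """Inverse of text_to_trits; expects a whole number of 3-trit groups."""
    if len(trits) % 3:
        raise ValueError("trit stream is not a multiple of 3 long")
    chars = []
    for i in range(0, len(trits), 3):
        a, b, c = trits[i:i + 3]
        chars.append(ALPHABET[a * 9 + b * 3 + c])
    return "".join(chars)
```

the terminator version is variable length and spends one trit per cipher on the end marker. the grouped version is fixed length, so it needs no marker at all, at the cost of only having 27 symbols.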
it just occurs to me that morse code is almost exactly this, it's a binary code, with 3 bits per unit.
anyway, haha. the real thing is how do you design such an encoding scheme: whether you have a lexicon, or whether you have a protocol for appending entries to a lexicon. then you have single letter codes, and if the sign for that is missing, you mean a lexicon entry. but you have a consistency problem, in that messages can become indecipherable if all participants don't have all of the lexicon.
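here is one possible reading of that, as a python sketch, mostly to show where the consistency problem bites. everything in it is an assumption of mine: the low bit as the "single letter" sign, the append-only `Lexicon` class, and spelling out unknown words letter by letter with a space code between two spelled words.

```python
LETTERS = "abcdefghijklmnopqrstuvwxyz "  # space separates two spelled-out words
LETTER_FLAG = 1  # assumed sign: low bit set = single letter, clear = lexicon entry


class Lexicon:
    """Append-only word list: existing indices never change."""

    def __init__(self, words=()):
        self.words = list(words)

    def append(self, word):
        self.words.append(word)
        return len(self.words) - 1


def encode(msg, lexicon):
    codes = []
    prev_spelled = False
    for word in msg.split():
        if word in lexicon.words:
            codes.append(lexicon.words.index(word) << 1)
            prev_spelled = False
        else:
            if prev_spelled:  # keep two spelled words apart
                codes.append((LETTERS.index(" ") << 1) | LETTER_FLAG)
            codes.extend((LETTERS.index(ch) << 1) | LETTER_FLAG for ch in word)
            prev_spelled = True
    return codes


def decode(codes, lexicon):
    out, run = [], ""  # finished words, and the word currently being spelled
    for code in codes:
        if code & LETTER_FLAG:
            ch = LETTERS[code >> 1]
            if ch == " ":
                out.append(run)
                run = ""
            else:
                run += ch
            continue
        if run:
            out.append(run)
            run = ""
        idx = code >> 1
        if idx >= len(lexicon.words):
            # the sender's lexicon has entries this receiver never got
            raise LookupError(f"unknown lexicon entry {idx}")
        out.append(lexicon.words[idx])
    if run:
        out.append(run)
    return " ".join(out)
```

if the sender appends "dog" and starts using its index before the receiver has the new entry, `decode` on the receiver raises `LookupError`. spelled-out words always decode, which is why the letter codes work as a fallback.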