# Compressing a set of random numbers

https://www.5snb.club/posts/2023/compressing-random-number-set/

Despite how it may seem, compressing a *set* of random numbers is indeed possible: not needing to
care about the *order* lets you decrease the entropy by sorting the numbers ahead of time. (For
example, an ordered pair of decimal digits has 100 possibilities, but an unordered pair has only 55.)

One basic scheme for doing this is to just sort the numbers, and then compress the most significant
byte separately from the rest, by making an array of *counts*.

For example, in base 10, this would be turning

```
0123
0512
1312
2345
4444
```

into

```
2 1 1 0 1
123
512
312
345
444
```

And then on decoding, you just scan the counts list, taking that many values for each count and sticking the count's index back on the front of each number.

If you go and encode 65536 `u64`s with this scheme, you get 524 288 bytes used for the source data
and 459 264 bytes used for the encoded form (512 bytes of counts plus 65536 × 7 bytes of truncated
values), which is a compression ratio of 1.142.

Not great, but we’re working with completely random numbers here, so there is a limit to how well we can do.

Code for that is:

```
#![feature(array_chunks)]
#![feature(is_sorted)]

const ITEM_COUNT: usize = 1 << 16;

fn encode(data: &[u64]) -> Vec<u8> {
    assert!(data.is_sorted());

    // Count how many values fall into each most-significant-byte bucket.
    let mut lengths = [0u16; 1 << 8];
    for item in data {
        let item_arr = item.to_be_bytes();
        let bucket = item_arr[0];
        lengths[usize::from(bucket)] += 1;
    }

    // Write the 256 counts, then the low 7 bytes of every value.
    let mut output = Vec::new();
    for length in lengths {
        output.extend(length.to_be_bytes());
    }
    for item in data {
        let item_arr = item.to_be_bytes();
        let value: [u8; 7] = item_arr[1..].try_into().unwrap();
        output.extend(value);
    }
    output
}

fn decode(data: &[u8]) -> Vec<u64> {
    // The first 512 bytes are the 256 big-endian u16 counts.
    let lengths = data[0..(256 * 2)]
        .array_chunks::<2>()
        .copied()
        .map(u16::from_be_bytes);
    // The rest is the 7-byte tails, still in sorted order.
    let mut data = data[256 * 2..].array_chunks::<7>().copied();

    let mut output = Vec::new();
    for (idx, length) in lengths.enumerate() {
        for _ in 0..length {
            // Reattach the bucket index as the most significant byte.
            let mut value = [0_u8; 8];
            value[0] = u8::try_from(idx).unwrap();
            value[1..].copy_from_slice(&data.next().unwrap());
            output.push(u64::from_be_bytes(value));
        }
    }
    output
}

fn main() {
    let mut data: Vec<u64> = (0..ITEM_COUNT).map(|_| rand::random()).collect();
    data.sort();
    println!("Size of data: {}", data.len() * 8);

    let encoded = encode(&data);
    println!("Size of encoded data: {}", encoded.len());
    println!(
        "That's a compression ratio of: {:.3}",
        (data.len() * 8) as f64 / (encoded.len() as f64)
    );

    let decoded = decode(&encoded);
    assert_eq!(data, decoded);
}
```

I *believe* the working out for the actual amount of entropy for a sorted list of 65536 u64s is
`log2(256^524288 / 65536!) bits to bytes` (as entered into Qalculate!), which gives us 405 033
bytes. Which seems plausible? That'd be a compression ratio of 1.294, which is a bit better.
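
If you want to double-check that number without Qalculate, here's a minimal sketch (mine, not part of the original post) that computes the same quantity by subtracting the `log2(65536!)` bits of ordering information from the raw 65536 × 64 bits:

```
fn main() {
    let n: u64 = 1 << 16; // 65536 values
    let raw_bits = n as f64 * 64.0; // 65536 random u64s = 4 194 304 bits

    // log2(n!) = sum of log2(k) for k = 1..=n; this is the ordering
    // information we throw away by sorting.
    let log2_factorial: f64 = (1..=n).map(|k| (k as f64).log2()).sum();

    let entropy_bytes = (raw_bits - log2_factorial) / 8.0;
    println!("{entropy_bytes:.0}"); // prints roughly 405033
}
```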

Oh, right, use cases! This would work well for transmitting a set of hashes, say for git? Honestly, not sure. Probably not worth it, but a fun trick that has probably been done better elsewhere :)

How does lzma do?

```
const ITEM_COUNT: usize = 1 << 16;

fn main() {
    use std::io::Write;

    let mut data: Vec<u64> = (0..ITEM_COUNT).map(|_| rand::random()).collect();
    data.sort();

    // Dump the sorted values as big-endian bytes so lzma can have a go at them.
    for item in data {
        std::io::stdout().write_all(&item.to_be_bytes()).unwrap();
    }
}
```

```
# cargo run > output.bin
# lzma -9e --keep output.bin
# wc -c output.bin*
524288 output.bin
444228 output.bin.lzma
```

Which, if you’ve been playing along at home, is *better* than what I’ve gotten. Though my algorithm
is almost definitely faster. And deterministic in output size.

Oh well, at least I had fun :3