Subject: Compressing a set of random numbers

Date:
Message-Id: https://www.5snb.club/posts/2023/compressing-random-number-set/

Despite how it may seem, compressing a set of random numbers is indeed possible: since you don't care about the order, you can sort the numbers ahead of time, which throws away the ordering information and so decreases the entropy you actually need to store.

One basic scheme for doing this is to just sort the numbers, and then compress the most significant byte separately from the rest, by making an array of counts.

For example, in base 10, this would be turning

0123
0512
1312
2345
4444

into

2 1 1 0 1

123
512
312
345
444

And then on decoding, you just scan the counts list: for each index, you read that many values and stick the index back on the front of each one.

If you go and encode 65536 u64s with this scheme, the source data takes 524 288 bytes and the encoded form takes 459 264 bytes, which is a compression ratio of 1.142.
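To see where those numbers come from (a quick sanity-check sketch, not part of the encoder): the encoded form is 256 two-byte counts followed by a seven-byte tail for every value.

fn main() {
    let n = 1usize << 16;          // 65536 values
    let source = n * 8;            // 8 bytes per u64
    let encoded = 256 * 2 + n * 7; // 256 u16 counts, then 7 bytes per value
    println!("{source} -> {encoded} ({:.3})", source as f64 / encoded as f64);
    // Prints: 524288 -> 459264 (1.142)
}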

Not great, but we’re working with completely random numbers here, so there is a limit to how well we can do.

Code for that is:

#![feature(array_chunks)]
#![feature(is_sorted)]

const ITEM_COUNT: usize = 1 << 16;

fn encode(data: &[u64]) -> Vec<u8> {
    assert!(data.is_sorted());

    // One u16 count per possible most-significant byte.
    let mut lengths = [0u16; 1 << 8];

    for item in data {
        let item_arr = item.to_be_bytes();
        let bucket = item_arr[0];
        lengths[usize::from(bucket)] += 1;
    }

    let mut output = Vec::new();

    // First the 256 counts, as big-endian u16s...
    for length in lengths {
        output.extend(length.to_be_bytes());
    }

    // ...then the remaining 7 bytes of every value, in sorted order.
    for item in data {
        let item_arr = item.to_be_bytes();
        let value: [u8; 7] = item_arr[1..].try_into().unwrap();
        output.extend(value);
    }

    output
}

fn decode(data: &[u8]) -> Vec<u64> {
    // Read back the 256 big-endian u16 counts...
    let lengths = data[0..(256 * 2)]
        .array_chunks::<2>()
        .copied()
        .map(u16::from_be_bytes);

    // ...and iterate over the 7-byte tails that follow them.
    let mut data = data[256 * 2..].array_chunks::<7>().copied();

    let mut output = Vec::new();

    for (idx, length) in lengths.enumerate() {
        for _ in 0..length {
            // Re-attach the bucket index as the most significant byte.
            let mut value = [0_u8; 8];
            value[0] = u8::try_from(idx).unwrap();

            value[1..].copy_from_slice(&data.next().unwrap());

            output.push(u64::from_be_bytes(value));
        }
    }

    output
}

fn main() {
    let mut data: Vec<u64> = (0..ITEM_COUNT).map(|_| rand::random()).collect();
    data.sort();

    println!("Size of data: {}", data.len() * 8);

    let encoded = encode(&data);

    println!("Size of encoded data: {}", encoded.len());

    println!(
        "That's a compression ratio of: {:.3}",
        (data.len() * 8) as f64 / (encoded.len() as f64)
    );

    let decoded = decode(&encoded);
    assert_eq!(data, decoded);
}

I believe the actual amount of entropy in a sorted list of 65536 random u64s works out to log2(256^524288 / 65536!) bits, converted to bytes (as entered into Qalculate!), which gives us 405 033. Which seems plausible? That'd be a compression ratio of 1.294, which is a bit better.
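If you don't have Qalculate! handy, the same figure can be checked with a few lines of Rust (just a sketch of the arithmetic, using the fact that 256^524288 = 2^(65536·64) and that sorting discards the log2(65536!) bits of ordering information):

fn main() {
    let raw_bits = 65536.0 * 64.0; // entropy of 65536 unsorted random u64s
    // Sorting throws away the ordering, which is worth log2(65536!) bits.
    let order_bits: f64 = (2..=65536u32).map(|k| f64::from(k).log2()).sum();
    println!("{:.0} bytes", (raw_bits - order_bits) / 8.0); // ≈ 405033
}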

Oh, right, use cases! This would work well for transmitting a set of hashes, say for git? Honestly, not sure. Probably not worth it, but a fun trick that has probably been done better elsewhere :)

How does lzma do?

const ITEM_COUNT: usize = 1 << 16;

fn main() {
    use std::io::Write;

    let mut data: Vec<u64> = (0..ITEM_COUNT).map(|_| rand::random()).collect();
    data.sort();

    for item in data {
        std::io::stdout().write_all(&item.to_be_bytes()).unwrap();
    }
}
# cargo run > output.bin
# lzma -9e --keep output.bin
# wc -c output.bin*
524288 output.bin
444228 output.bin.lzma

Which, if you’ve been playing along at home, is better than what I’ve gotten: a compression ratio of about 1.180 versus my 1.142. Though my algorithm is almost definitely faster. And deterministic in output size.

Oh well, at least I had fun :3