Subject:

Order Preserving 2d Array To 1d Array Function


Date: Message-Id: https://www.5snb.club/posts/2020/order-preserving-2d-array-to-1d-array-function/

Short post here, but I was interested in a problem and think I found a good solution to it.

The problem is: Write a function f that takes a 2 dimensional byte array (Vec<Vec<u8>>), and converts it into a 1 dimensional byte array (Vec<u8>) in an order-preserving way. Empty outer or inner vectors are allowed, and must order correctly. As little overhead as possible is ideal.

Use case is if you have a B-tree and want to query it using tuples instead of a plain key, and have ordering work correctly (as in, a bytewise comparison should give the same results as ordering the tuples do).

Below I’ve put my solution.

If you want to solve it yourself, don’t scroll down. I’ve put some spacing so it’s lower down off the page.

 
pub fn route(parts: &[impl AsRef<[u8]>]) -> Vec<u8> {
    let mut result = Vec::new();

    for part in parts {
        let part = part.as_ref();

        for byte in part.into_iter().copied() {
            match byte {
                0 => result.extend(&[1, 1]),
                1 => result.extend(&[1, 2]),
                2..=255 => result.push(byte),
            }
        }

        result.push(0x00);
    }

    result
}

That’s it. I used to have it be a lot more complex, but this seems to work fine. I’ve ran it through a fuzzer and it seems to work fine.

The way it works is that we need to be able to uniquely identify the end of each list in such a way that orders before any possible byte inside the list, so we use a 0 byte. But then we have the issue of collisions with a valid 0 byte, so we need to remap that to a multi byte sequence, 1, 1. But then you could emit 2 1’s and it would be encoded identically to a 0 byte in the input. So we need to remap 1 to something that sorts after the encoding of 0, but otherwise can’t be produced. So I used 1, 2.

Note how you are able to produce 1, 1, 2 by giving 0, 2, but this is not a problem since you can’t create a single 1 on its own, so there’s no input that is not 0, 2 that can give you 1, 1, 2. Or, to put it another way, taken as a whole, the sequence is unambiguous, but there’s no error recovery if you need to undo the mapping from the middle. You could likely change the escape sequences to 1, 2 and 1, 3 to make parsing less ambiguous, but for my use case this is unimportant.