Subject:

In Place String Mapping


Date: Message-Id: https://www.5snb.club/posts/2021/in-place-string-mapping/

A while ago, I saw a post about how C did in-place string modification better than Rust. So, for example, you could take a string that is percent encoded, and decode it, perfectly in place, without external allocations.

This seemed rather neat, so I went out to implement it in Rust, so this can be done safely.

The C version that inspired this post is:

#include <stdlib.h>
#include <stdio.h>

void decode_inplace(char *s) {
    char *in = s, *out = s;
    char cur[3] = { 0 };

    while (*in) {
        if (*in == '%') {
            cur[0] = *(in + 1);
            cur[1] = *(in + 2);
            *out = strtol(cur, NULL, 16);
            in += 2;
        } else {
            *out = *in;
        }
        ++in;
        ++out;
    }

    *out = '\0';
}

int main() {
    char s[] = "abc%64%65fg";
    decode_inplace(s);
    printf("%s\n", s);
}

Let’s see what that looks like using my library.

use in_place_string_map::MapInPlace;

fn decode_percent(s: &mut str) -> &mut str {
    let mut m = MapInPlace::new(s);

    while let Some(c) = m.pop() {
        match c {
            '%' => {
                let num = m.pop_chars(2).expect("not enough chars");
                let n = u8::from_str_radix(num, 16).expect("invalid hex");
                m.push(n as char).expect("no more capacity");
            }
            _ => {
                m.push(c).expect("no more capacity");
            }
        }
    }

    m.into_mapped()
}

fn main() {
    let mut percent = "abc%64%65fg".to_string();
    let mapped_percent = decode_percent(&mut percent);
    assert_eq!(mapped_percent, "abcdefg");
}

Pretty neat!

The MapInPlace takes a &mut str of arbitrary lifetime, and lets you do operations such as pop_chars and push on it.

It works by internally converting the &mut str into a &mut [u8] and doing operations on that byte array. It needs to be careful to ensure that as soon as it gives control back to you, it’s valid UTF-8, since you cannot rely on Drop being called in order to ensure safety, since you could leak the object and regain access to the backing &mut str.

There are two offsets internally, mapped_head and unmapped_head. Anything from the start of the string to mapped_head is data that’s been pushed to the string. Anything from unmapped_head to the end of the string is data that is yet to be popped. And the in-between zone is unspecified but must still be valid UTF-8 (But the current implementation zeros it out after every operation, conservatively.)

If you try to push more data than you are popping, then you will get an error, as there’s no space. So this technique only works when the output string is always as long as or shorter than the input string.

After you do your string modifications, you can get a reference to the part of the slice that has been mapped using into_mapped(). Continuing to use the original slice that was passed in is a footgun, as it will also contain the unspecified section, as well as any unmapped characters. Originally the design took a &mut String to avoid that, and truncated the buffer on Drop, but taking a reference to a String means this library can’t be used in no_std crates.

The library is on my github at https://github.com/5225225/in-place-string-map. There’s definitely room for performance optimisations (in push_str, the zeroing step isn’t ideal, since it makes push_str O(n) where n is the number of characters in the in-between zone, but other solutions seemed to allow invalid UTF-8 to slip through).

I haven’t proven this code safe to myself yet, but it’s probably fine.