An introduction of Content-defined chunking

Date: Message-Id:

Content-defined chunking (CDC) is a method to more efficiently store various versions of the same file, while achieving deduplication both in the same file and across different files.

Content-defined chunking

So what is CDC and why do file versioning tools use it?

Say you wanted to version a big file that has had various modifications made to it over time, so that all 3 versions need to be stored. Here we’ll represent it as 3 distinct sentences.

  1. The quick brown fox jumps over the lazy dog!
  2. The quick brown dog jumps over the lazy fox?
  3. A quick brown dog jumped over a lazy dog.

One way of storing this would just be to store the whole file as many times as it exists in the repository. This works fine for small pieces of data (and is what git does initially, before it packs your data), but wastes a lot of space when you’re storing 50GB disk images that get changed slightly, and often. And even for small files such as source files, it’s still very wasteful to re-store the whole file every time a change is made.

Another method would be to compute a diff between the files. This is what git does, and works well for the kind of files git handles, namely smallish source files that get changed often and slightly. But trying to store very large files in git gets slow since computing diffs of very large files is slow, so mechanisms like Git LFS and git-annex were made to store the actual file content in a way that git doesn’t try to diff.

Finally, we can chunk the file into individual chunks, and store each chunk in the repository, and then store the main file as a list of chunk identifiers. This way, chunks that are the same between files can be deduplicated naturally, you don’t need to do any diffs to read off files, just read the chunks in the list you’re given.

But there’s multiple ways to do the chunks. The simplest way would be to split on fixed byte boundaries, so every some number of bytes bytes is a new chunk.

Let’s try that on the above 3 sentences, splitting every 8 bytes. I will bold the new chunks as they come up.

  1. The quick brown fox jumps over the lazy dog!
    The quic, k brown , fox jump, s over t, he lazy , dog

  2. The quick brown dog jumps over the lazy fox?
    The quic, k brown , dog jump, s over t, he lazy , fox?

  3. A quick brown dog jumped over a lazy dog.
    A quick , brown do, g jumped, over a, lazy dog, .

Here we can see that while the first two sentences worked fine, the last one was completely ruined, and shares no chunks with earlier sentences, even though it has a similar number of changes!

This is because we removed some bytes from the start, which shifted over all the byte boundaries in future chunks. Therefore, if we used this method to version a file, and we added a single byte to the very start of a very large file, the whole file would need to be stored again. That’s not good.

Content-defined chunking avoids this issue, by making the chunk points depend on the actual content. Therefore, if the content shifts, the chunks will follow. A small change in any place in the file will only cause at maximum, two chunks to be changed (If the change spans two chunks).

Content-defined chunking can be seen as a function that takes a specific window of data, and returns if now is a good time to cut.

Here we’ll simply split at the end of words. So "Hello, World!" would be split into ["Hello, ", "World!"]. And I’ll do the same thing as before, bold new chunks as they come up.

  1. The quick brown fox jumps over the lazy dog!
    The , quick , brown , fox , jumps , over , the , lazy , dog!

  2. The quick brown dog jumps over the lazy fox?
    The , quick , brown , dog , jumps , over , the , lazy , fox?

  3. A quick brown dog jumped over a lazy dog.
    A , quick , brown , dog , jumped , over , a , lazy , dog.

Here we can see that even the last sentence has managed to re-use many chunks from the earlier two, even though it has modifications that changed the length of chunks.

Rolling Hash (not a weed joke i promise)

Splitting on spaces works fine for text, but we want a more general solution that works for all file types.

So we’ll walk through the file, one byte at a time, and look at the last n bytes, where n is a window size. I’ll use a window size of 8, and we’ll only split when the first byte of the SHA256 hash of the window starts with 4 1 bits in a row.

So here we cut twice, one at the very start of the sentence, and one half way through. So we end up with

A quick , brown fox a, nd a dog..

This is similar to the way we split on spaces, but here we have much finer control over how often splits happen. If we want splits to happen half as often, we can simply check for five 1 bits in a row.

Now we have one more thing to do. Above, I used SHA256 to generate the hash, but you absolutely do not want to do a full invocation of a strong hash function for every byte you want to chunk, or it will be very slow. (You’ll need to be hashing 64 bytes per input byte if you have a window size of 64).

So instead, we use a rolling hash function, which is able to easily push bytes into it, and remove old bytes.

A very simple example of a rolling hash is simply a sum of the window. We can add bytes to it by adding, and we can remove bytes from it by subtracting.

Let’s do the 4-byte window rolling hash of 30 32 26 77 41 92 63 21 16.

We’ll start out by summing up the first 4 bytes and going from there.

30+32+26+77 = 165

Previous Hash ValueSubtractedAddedHash Value

As you can see, the amount of work done per byte is not very high, just one subtraction and one addition. If you had a 64 byte window and were recalculating the hash every time, you would need to be doing 64 additions per input byte.

Now this is a bad rolling hash, because if you re-order bytes in the window, the resulting hash value will be the same. Better hashes such as the Rabin fingerprint don’t have this issue.

In any case, that’s an introduction to content-defined chunking based on a rolling hash. This is what backup utilities like Borg and Restic use to deduplicate files in your repository.

Next week I’ll look at the privacy implications of this, and how it can go wrong.