My problem with git-annex


I have a few problems with git-annex which lead me to stop using it (I’m running git annex uninit as I write this).

It’s close to what I want, but not quite there yet.

These suggestions are for a new system, not one based on git. And it may be biased towards what I tried to use git-annex for.

This new system should be usable to manage all user data on a system. That means it needs to be able to sync data too, and act as a backup system. My ideal system is one where all user data is versioned and all system data is handled by NixOS, making the entire system reproducible with a single command.

The Positives


It’s just a git repo with some extra stuff. You can store it on a stock git server (not the file data itself, just the log of where the data lives).


Git-annex is very simple (the core is, at least; the copy tracking can get considerably more complicated).

But basically all it does is replace file content with the hash of that content, and then manage the actual files separately. That’s the primary difference between git-annex and git itself.
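The core mechanic fits in a few lines of Python. This is a hypothetical illustration of the idea only, not git-annex’s actual key format or object layout (git-annex uses its own backend naming and a `.git/annex/objects` tree):

```python
import hashlib
import os

def annex_add(path, store_dir):
    """Replace a file's content with a pointer to its hash.
    Sketch of the idea: hash the content, move the real bytes into a
    content-addressed store, and leave a symlink behind. Not the real
    git-annex layout."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    os.makedirs(store_dir, exist_ok=True)
    stored = os.path.join(store_dir, digest)
    os.replace(path, stored)   # move the content into the store
    os.symlink(stored, path)   # the working tree now holds a pointer
    return digest
```

Git then only ever sees the small pointer, while the store holds one copy of each distinct blob of content.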

The Negatives


It doesn’t scale well.

Not the size of files, but their number. If you’re dealing entirely with a few thousand 4GB media files, git-annex will work wonders. But some of the things I try to manage with it (lots of small files) make git status incredibly slow, sometimes taking a few minutes because it wants to re-index a few thousand files for whatever reason. God help you when you actually make sweeping changes.

Git-annex is the reason I made my shell’s git prompt async; otherwise I’d have to drop to bash/sh just to keep the shell usable.

How to fix?

Don’t use git. Git isn’t designed for this, even if you do remove the large file problem.

Backup tools can walk a whole file system in seconds. Do what they do.
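The trick backup tools use is to trust cheap metadata instead of re-reading content. Here’s a sketch of that scan (the index layout and the size+mtime heuristic are my assumptions, not any particular tool’s): `os.scandir` exposes stat info from the directory entries themselves, so a dirty-file check never has to hash anything.

```python
import os

def changed_files(root, index):
    """Return paths whose (size, mtime) differ from a previous index.
    Sketch of how backup tools find dirty files quickly: compare cheap
    stat metadata against the last scan instead of re-reading content."""
    changed = []
    stack = [root]
    while stack:
        with os.scandir(stack.pop()) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                else:
                    st = entry.stat(follow_symlinks=False)
                    if index.get(entry.path) != (st.st_size, st.st_mtime_ns):
                        changed.append(entry.path)
    return changed
```

A tool built on this only ever touches the content of files this scan flags, which is why it can cover tens of thousands of clean files in near-zero time.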

To be fair, git might be able to be modified to make this work, but using git to track possibly tens of thousands of dirty files feels outside the scope of a source control tool.

Small Changes

If you have a large file that changes many times, but only by a few bytes each time, git-annex needs to store the file in full each time. Again, this might be fine depending on the kind of files you store, but it is definitely not fine for me, since I use git-annex to manage my music (so a re-tagging means storing every file in full again).

How to fix?

Chunk each file into parts using a rolling hash (don’t try to make diffs), and store each chunk individually. You can build a tree of chunks this way, splitting a large file into roughly 16KB chunks while only needing to store an additional 60KB or so per modification (assuming a 4-5 layer tree). This idea is taken from bup, and is rather cool.
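Here’s a toy sketch of the content-defined chunking part (a simplified rolling hash of my own, not bup’s actual algorithm; the window size and boundary mask are arbitrary). The key property: a boundary depends only on the last few bytes of content, so editing a few bytes re-chunks only the region around the edit, and every other chunk deduplicates against the old version.

```python
def rolling_chunks(data, window=16, mask=0x3FF):
    """Split `data` at content-defined boundaries. A boundary is
    declared wherever the rolling hash of the last `window` bytes
    matches the mask, so boundaries are purely local: bytes far from
    an edit produce the exact same chunks as before."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = (h * 2 + b) % (1 << 32)
        if i >= window:
            # drop the byte leaving the window; its contribution has
            # been doubled `window` times since it was added
            h = (h - data[i - window] * (1 << window)) % (1 << 32)
        if i >= window - 1 and (h & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

With a 10-bit mask the chunks average around 1KB here; a real system would use larger chunks and a proper rolling checksum, but the dedup behaviour is the same.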

In addition, for locked files, allow a file to act as the storage for chunks (as in, say “if you want chunk ABCDEF, look at locked_file[1000..6000]”). A file should be able to act as the storage for more than one chunk. Unlocked files can’t be used to store chunks reliably, since they can be modified at any time.

This would greatly reduce disk usage without opening the door to data loss.

Data that doesn’t need to be directly available (the chunks backing unlocked files) can be compressed, which will further mitigate the excess space usage.

This “using a file as storage for chunks” is definitely additional complexity. It’s by no means mandatory for this system to work.
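To make the bookkeeping concrete, here’s a sketch of such a store (all names and the layout are hypothetical). A chunk is either held directly or recorded as a (path, offset, length) slice of a locked file; the duplicate copy can be dropped once a locked file backs it.

```python
import hashlib

class ChunkStore:
    """Sketch: chunks can live in the store itself, or be served out
    of a locked (immutable) file recorded as (path, offset, length).
    Hypothetical design, not git-annex's."""
    def __init__(self):
        self.blobs = {}    # hash -> bytes held directly
        self.backed = {}   # hash -> (path, offset, length) in a locked file

    def put(self, chunk):
        h = hashlib.sha256(chunk).hexdigest()
        self.blobs[h] = chunk
        return h

    def back_with_file(self, h, path, offset, length):
        # a locked file now serves this chunk; free the duplicate copy
        self.backed[h] = (path, offset, length)
        self.blobs.pop(h, None)

    def get(self, h):
        if h in self.blobs:
            return self.blobs[h]
        path, offset, length = self.backed[h]
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)
```

Unlocking a file would simply mean copying its chunks back into the store before the file becomes writable, which is what keeps the scheme safe.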


Locking gets in the way so often. Not a lot to add here: getting an error when I change my music library with beets, then having to go back, unlock the file, and repeat the operation is tedious.

How to fix?

Locking should be reserved for special circumstances where you have a large file that you know won’t change. It should be opt-in. I know locking lets git-annex store only a single copy of a file, but it’s still annoying.

Yes, you can use git add to add a file without locking it, but that’s slower than git annex add because git add has to go through git’s filter mechanism. It also gives no ETA. I’ve also run into issues where git add would write the file to git instead of filtering it, and suddenly you’ve committed a 4MB file directly into git. That shouldn’t even be possible.

Ideally, there should be a mechanism that automatically unlocks/relocks files when you need to write to them. This should be fail-safe: it must never lose data, but it’s okay if it occasionally fails to unlock a file it should have. (Perhaps an inotify watcher that unlocks on file open, or a FUSE file system?)
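The policy could look something like this (a sketch of the state machine only; the event source, whether inotify or FUSE, is left abstract, and unlock/relock are assumed callbacks). Note the fail-safe direction: a failed unlock is swallowed, since the worst outcome is the writing program seeing a permission error, never lost data.

```python
class AutoUnlocker:
    """Sketch of an auto-unlock policy: unlock a file when something
    opens it for writing, relock it on close. The event source (an
    inotify watcher or FUSE layer) and the unlock/relock callbacks are
    hypothetical."""
    def __init__(self, unlock, relock):
        self.unlock, self.relock = unlock, relock
        self.unlocked = set()

    def on_open_write(self, path):
        try:
            self.unlock(path)
            self.unlocked.add(path)
        except OSError:
            pass  # failing to unlock is acceptable; the write just errors

    def on_close(self, path):
        # only relock files we unlocked ourselves
        if path in self.unlocked:
            self.relock(path)
            self.unlocked.discard(path)
```

An inotify-based version of this is inherently racy (the open can beat the unlock), which is why a FUSE layer, sitting in the open path itself, would be the more reliable home for the same logic.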