#uri - a silly archive format

A little archive file format, just for fun. Inspired by this article: https://mort.coffee/home/tar/. Named after Dr. Uri Burstyn.

I realized 5 minutes after making it that the name is terrible, and have decided to keep it this way so that nobody ever takes this seriously. The file extension is .uri. This probably causes problems for some things. It's pronounced like "Yuri" though.

#Format description

As far as I can tell there's basically two ways to design an archive format: you either have some sort of index array that describes file metadata for many files in one place, or you just have what is essentially a linked list of header+data pieces. It's a tradeoff: listing, modifying and removing files is a lot more efficient if you can read/modify a table and just change where various index entries point. However, making this efficient for adding new files usually means having multiple places in the archive file an index table can be, and having multiple indices, so the details can get complex. It starts feeling a lot more like a file system than an archive file. On the other hand, if the metadata for each file is right before the file in the archive, extracting files and adding new files is basically trivial, but to read metadata about the archive contents you have to scan through the whole thing going from header to header, and removing files means compacting down everything after them. So, neither is uniformly better than the other.

The index style is used in zip, 7z, as well as probably rar archives, while the list style is used in tar, ar, cpio and RIFF. uri uses the list style, basically because it seems simpler and less common.

So a uri file starts with a file header, followed by one or more chunks. Each chunk is a chunk header followed by the file name and file data, if any, and then a CRC32 checksum of the file name+data at the end. So there's no single index to read but you can scan through the chunks fairly easily and also trivially stick new chunks onto the end of an existing file.

Graphically, the structure of a chunk is:

│ ...           │
├───────────────┤
│ Chunk header  │
│               │
├───────────────┤
│ File path     │
├───────────────┤
│ File data     │
│ ...           │
│               │
│               │
│               │
│               │
│               │
│               │
│               │
│               │
│               │
│               │
│               │
├───────────────┤
│ File checksum │
├───────────────┤
│ ...           │

(Asciiflow is just the best thing ever.)

Things we aren't gonna give a shit about for now:

  • Non-Unix-y file systems -- sorry but there's a lot of platform assumptions that go into these things and I'm not gonna try to shoehorn them all into one file format right now.
  • Special file types besides symlinks and dirs, such as hard links, fifos, device files
  • File creation and modification/access times
  • Files bigger than u64::MAX (Actually technically i64::MAX, since the seek() system call takes an i64. Consider nailing this down better someday.)
  • Compression, though it might be cool to play with it someday down the line so there's a field in the chunk header for it.

File header:

Magic: "uri\0"
Version number: u32, this is version 0

Chunk header:

File type (normal, dir, symlink) -- [u8;4]
File path length -- u32
File mode -- u32
Owner name -- [u8;32]
Group name -- [u8;32]
File length -- u64
Compression scheme -- [u8;4]
Chunk header checksum -- u32 (CRC32)

All numbers are little-endian. All sizes are in bytes. "Strings" such as the owner name or file path have no particular encoding, they are just bags of bytes, and they do not have zero terminators.

Total chunk header size: 92 bytes exactly (4 + 4 + 4 + 32 + 32 + 8 + 4 + 4), plus the file name and data after it.

The file type is a set of 4 ASCII characters, and is at the start of each chunk header. That way it acts as a human-readable marker for the start of each chunk, which is nice for debugging, and serves as a soft check to make sure that you are actually at the start of a chunk without needing an explicit magic number for the chunk header.

Per the Linux useradd man page, max username length is 32 characters. I presume they actually mean 32 bytes. Both strings, if less than 32 characters, will have unused data filled with 0 bytes, and will never contain a 0 byte before the unused portion.

The chunk header checksum is calculated by CRC32'ing the serialized bytes for all the fields preceding it in the header.
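
To make the layout concrete, here's a rough Rust sketch of serializing a chunk header. The struct and field names are mine, not necessarily what the actual code uses; the offsets just follow the field order above, and the CRC comes from the crc32fast crate.

```rust
// Illustrative only: struct and field names are not from the actual code.
struct ChunkHeader {
    file_type: [u8; 4],   // e.g. b"FILE"
    path_len: u32,
    mode: u32,
    owner: [u8; 32],      // assumed already zero-padded to 32 bytes
    group: [u8; 32],
    file_len: u64,
    compression: [u8; 4], // e.g. b"NONE"
}

impl ChunkHeader {
    /// Serialize to the 92-byte on-disk form: integers little-endian,
    /// with a CRC32 of the preceding 88 bytes appended at the end.
    fn to_bytes(&self) -> [u8; 92] {
        let mut buf = [0u8; 92];
        buf[0..4].copy_from_slice(&self.file_type);
        buf[4..8].copy_from_slice(&self.path_len.to_le_bytes());
        buf[8..12].copy_from_slice(&self.mode.to_le_bytes());
        buf[12..44].copy_from_slice(&self.owner);
        buf[44..76].copy_from_slice(&self.group);
        buf[76..84].copy_from_slice(&self.file_len.to_le_bytes());
        buf[84..88].copy_from_slice(&self.compression);
        let mut hasher = crc32fast::Hasher::new();
        hasher.update(&buf[0..88]);
        buf[88..92].copy_from_slice(&hasher.finalize().to_le_bytes());
        buf
    }
}
```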

The file path is stored directly after the chunk header. The file path must be relative, i.e. not start with /. It cannot contain . or .. directory entries. It cannot contain consecutive /'s. The path does not end with a zero byte, since its length is already in the chunk header.
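
A minimal sketch of those path checks in Rust. The function name, error handling, and the zero-byte check are my own choices, not spelled out by the spec, and whether a directory path carries a trailing / is left open here.

```rust
// Illustrative sketch of the path rules: names, error type, and the
// zero-byte check are my own choices, not spelled out by the spec.
fn validate_path(path: &[u8]) -> Result<(), &'static str> {
    if path.is_empty() || path[0] == b'/' {
        return Err("path must be relative and non-empty");
    }
    if path.contains(&0) {
        // Probably invalid; see the open question about zero bytes below.
        return Err("path contains a zero byte");
    }
    let mut components = path.split(|&b| b == b'/').peekable();
    while let Some(component) = components.next() {
        match component {
            b"." | b".." => return Err("path contains a . or .. entry"),
            // An empty component before the end means consecutive slashes;
            // a trailing empty one (a path ending in /) is tolerated here.
            b"" if components.peek().is_some() => {
                return Err("path contains consecutive slashes")
            }
            _ => {}
        }
    }
    Ok(())
}
```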

After the file path comes the file contents. If it's a symlink, it's the destination of the symlink. If it's a directory, it's nothing. If it's a normal file, it's the data. There is no padding.

The file checksum includes the file path string. Basically it's the checksum of everything between the chunk header and itself. Hence files have a checksum for the file path and file data, directories have a checksum of just the file path, and symlinks have a checksum of the file path and the symlink destination bytes.
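
Putting the pieces together, here's a hedged sketch of walking an archive and verifying both checksums. Names and error handling are illustrative, the offsets follow the header layout above, and it buffers each chunk body in memory for brevity, which the real code shouldn't do for big files (see Known bugs).

```rust
use std::convert::TryInto;
use std::io::{self, Read};

// Illustrative sketch only: names, asserts, and the in-memory buffering are
// simplifications, not necessarily how the real code works.
fn walk_archive<R: Read>(mut r: R) -> io::Result<()> {
    // File header: magic "uri\0" followed by a little-endian u32 version.
    let mut magic = [0u8; 4];
    let mut version = [0u8; 4];
    r.read_exact(&mut magic)?;
    r.read_exact(&mut version)?;
    assert_eq!(&magic, b"uri\0");
    assert_eq!(u32::from_le_bytes(version), 0);

    let mut header = [0u8; 92];
    loop {
        // A clean EOF here means we've read the last chunk. (A truncated
        // header also looks like EOF in this sketch; real code should care.)
        match r.read_exact(&mut header) {
            Err(ref e) if e.kind() == io::ErrorKind::UnexpectedEof => break,
            other => other?,
        }

        // Verify the chunk header checksum over its first 88 bytes.
        let stored = u32::from_le_bytes(header[88..92].try_into().unwrap());
        let mut crc = crc32fast::Hasher::new();
        crc.update(&header[0..88]);
        assert_eq!(crc.finalize(), stored, "chunk header checksum mismatch");

        let path_len = u32::from_le_bytes(header[4..8].try_into().unwrap()) as usize;
        let file_len = u64::from_le_bytes(header[76..84].try_into().unwrap()) as usize;

        // Path, then data (file contents / symlink target / nothing for dirs),
        // then the CRC32 of path + data.
        let mut body = vec![0u8; path_len + file_len];
        r.read_exact(&mut body)?;
        let mut stored_crc = [0u8; 4];
        r.read_exact(&mut stored_crc)?;
        let mut crc = crc32fast::Hasher::new();
        crc.update(&body);
        assert_eq!(crc.finalize(), u32::from_le_bytes(stored_crc), "file checksum mismatch");

        println!("{}", String::from_utf8_lossy(&body[..path_len]));
    }
    Ok(())
}
```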

If there's a file entry for some file at path foo/bar.txt, a directory entry for directory foo/ must precede it somewhere in the archive.

Multiple entries with the same path are an error.

File types:

  • FILE
  • DIR (including a space, so the tag is still 4 bytes)
  • SLNK (symlink)

Compression types:

  • NONE
  • Maybe more later: GZIP, ZSTD or XZ may be candidates
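
One way the type tags might map to a Rust enum; the enum and method names are illustrative, and DIR's padding space is assumed to be trailing. The compression tags could get the same treatment.

```rust
// Illustrative enum for the 4-byte type tags; only the byte values come
// from the spec, and DIR's padding space is assumed to be trailing.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
enum FileType {
    File,
    Dir,
    Symlink,
}

impl FileType {
    fn tag(self) -> [u8; 4] {
        match self {
            FileType::File => *b"FILE",
            FileType::Dir => *b"DIR ",
            FileType::Symlink => *b"SLNK",
        }
    }

    fn from_tag(tag: &[u8; 4]) -> Option<FileType> {
        match tag {
            b"FILE" => Some(FileType::File),
            b"DIR " => Some(FileType::Dir),
            b"SLNK" => Some(FileType::Symlink),
            _ => None,
        }
    }
}
```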

#License

For both code and spec:

SPDX-License-Identifier: GPL-3.0-only

#Future considerations

For compressed data it might or might not be useful to store the uncompressed data length, so you can create a buffer once and then extract the file into it all in one go? Not actually sure, we're presumably going to be streaming this onto the disk anyway. I'm kinda leaning towards "no" for now?

Do we actually care about the compression type per chunk, or can we just put the compression type in the file header and use the same algorithm for all chunks? The idea of tailoring the compression algo to the data is neat, but really you aren't gonna need it.

For device files etc we might need to store more metadata, either in the chunk header or in the file body itself.

Right now the file format version is 0 so we can change whatever we feel like.

I sooooorta want to have the last byte of the magic number indicate the file format version, so next version would be uri\1, then uri\2, etc. However having the magic number be a zero-terminated C string feels a little too cute to let go. Not actually useful (I hate zero-terminated strings), but cute.

A cute goal would be to be able to have a decompression program decompress multiple files in parallel. Should be doable, even though doing a lot of seeking makes my HDD-accustomed brain wince. Doing parallel compression is harder 'cause we don't know how long the compressed data is and copying shit is expensive. We'd basically be compressing multiple files into chunks at once, either in memory or writing them to disk, and then concatenating them all together in one big copy.

A cute feature would be to be able to delete chunks by replacing their data with 0's and their chunk header with a certain dummy header (also all 0's?), thus you don't need to move/compact all the data after it. Then if you DON'T do any inline compression but rather just run the whole archive through a compression program, the empty space will basically disappear. Inline compression basically makes this feature worthless though, you can delete files but not recover the space without compacting the chunks after it. We are not going to put up with dead space between chunks or anything like that. Dealing with fragmentation is a game I choose not to play. So no, we're not going to do this.

Which CRC32 do we use, exactly? Whichever one the crc32fast crate uses. There's several common variations though, so this should get pinned down explicitly at some point.

Maybe we can make this do something sensible on Windows? I honestly have not tested. It's probably possible, I've tried to leave stubs for platform functionality.

Can file paths have a zero byte in them? I don't THINK so... Usernames certainly cannot, but also neither usernames nor file paths have to be valid UTF-8. Hnyrn. Do we really care if they aren't? ...maybe. It'd make life a little simpler if we only dealt with real Rust strings rather than OS bullshit, but as noted above, there's a lot of OS bullshit involved in correctly archiving files so handling that is kinda our job here.

#Known bugs

See "things we won't worry about" above.

There's a time or two when we read the whole file being compressed into memory, which we shouldn't do. But it's convenient so for now it's fine.

There's probably edge cases in path handling we don't catch, because paths are Hard. I want to make it impossible to create an archive that doesn't start with a relative toplevel dir, but I haven't really gotten there yet, and even if I do the extractor still has to check for evil paths.

Need moar integration tests, fuzzing, code coverage, etc.

#Wontfix

The CLI will zip up a single directory of stuff; you can't tell it to archive multiple files or multiple directories. This is intentional, so that when you unzip it everything is in a single directory.

#Lessons learned

  • Archiving files is actually a horribly platform-dependent operation, and no portable tool will do it perfectly. A lot of the weird stuff in tar is there because it was intended as literally a backup program, so it needed to know weird things like device numbers and such.
  • The POSIX API sucks and nobody knows how to use it. I spent two days looking for getpwuid(), and neither search engines nor Stack Overflow could help. Everything I found said "parse /etc/passwd by hand, oh and you'll have to read /etc/nsswitch.conf and talk to the things it says manually, and BSD has its own API for it, etc." Of course, I don't know how to use it either: I spent about 2 hours looking for a bug that was caused by me checking for hard links incorrectly.
  • Designing file formats is fun! I am inordinately pleased about using a human-readable file_type field as a marker to start chunk headers. Does it actually matter though, to anyone other than people writing tools like this one? Not really.
  • Making a file format better than tar is real easy. But making a program better than tar is a shitload of very precise and fiddly work.

#Future work

Inspired mainly by https://www.cyphar.com/blog/post/20190121-ociv2-images-i-tar, which is in large part a hilariously and disgustingly complete recounting of all of tar's warts. Its goals are different from uri's: it focuses on content-addressed, layered and deduplicated container images rather than general-purpose file archives for shuffling piles of data around. But still there's some useful lessons there:

  • Compressing the individual files is better than compressing the whole archive tar.gz style, 'cause it enables parallel decompression.
  • Parallel extraction is actually pretty easy if your files are in some kind of topological order, probably worth doing -- we already require that a directory entry precede any files in that directory.
  • Parallel compression is probably harder but not too hard if your total archive is not too big to fit into memory. If only some files in it are too big to fit into memory then it's harder but still possible. If it is a giant archive of giant files then you're probably slightly hosed no matter what, but having a proper ordering is probably still useful.
  • You will need some kind of extension method or else people will reinvent it badly. I don't really want to reinvent RIFF here, but having a format for variable-length fields for metadata or such is probably inevitable.
  • Content-addressing wasn't an original goal for this but might be a good thing to have anyway. This requires storing everything in a known order and such, disallowing stray-but-technically-valid variations such as duplicated files, empty metadata fields or deleted files, and generally leaving nothing up to chance. But if you do all of that, you know that if you have two valid archives, each with a (cryptographic, not crc) data checksum in their header, then identical header == identical archives. I do love me some content-addressing, even though it's not really necessary for this kind of format.
  • We probably need checksums for chunk headers, not just data.
  • If we do content-addressing, do we generate content addresses for individual chunks as well as the file as a whole? Right now I feel like the answer is "no", though it does have some interesting possibilities in terms of deduplication.
  • Ignoring file attributes like creation date might be a mistake but it sure makes life easier. Same with things like device files and hard links.

This actually slightly vindicates my decision to go with a header+data chunk model and not an index+archive model, 'cause all of this is still quite easy to do. However, it does still have the problem that going through all the metadata requires skimming through the whole file, even if it happens in large jumps. You could arrange it so that you have a header, a pile of data, and then the index at the end of the file, or maybe the index at some offset pointed to in the header. That would make reading the index easier, and you could keep life simple by still not allowing rearranging or deletion -- since those operations spoil content-addressing anyway. In reality, I don't know enough about how filesystems work in practice to know what the performance tradeoffs are actually like. Maybe I should benchmark it?

So if I do decide to make another version of this, the questions to answer will be:

  • Parallelization?
  • Content addressing?
  • Coalesce chunk headers into a single index somewhere?
  • Still keep the focus on creating/extracting archives, rather than making it easy to do random access? In particular, deleting files without recreating the archive remains out of scope.