A little archive file format, just for fun. Inspired by this article: https://mort.coffee/home/tar/ Named after Dr. Uri Burstyn.
I realized 5 minutes after making it that the name is terrible, and have decided to keep it this way so that nobody ever takes this seriously. The file extension is .uri. This probably causes problems for some things. It's pronounced like "Yuri" though.
As far as I can tell there's basically two ways to design an archive format: you either have some sort of index array that describes file metadata for many files in one place, or you just have what is essentially a linked list of header+data pieces. It's a tradeoff: listing, modifying and removing files is a lot more efficient if you can read/modify a table and just change where various index entries point. However, making this efficient for adding new files usually means having multiple places in the archive file where an index table can be, and having multiple indices, so the details can get complex. It starts feeling a lot more like a file system than an archive file. On the other hand, if the metadata for each file is right before the file in the archive, extracting files and adding new files is basically trivial, but to read metadata about the archive contents you have to scan through the whole thing going from header to header, and removing files means compacting down everything after them. So, neither is uniformly better than the other.
The index style is used in zip, 7z, as well as probably rar archives, while the list style is used in tar, ar, cpio and RIFF. uri uses the list style, basically because it seems simpler and less common.
So a uri file starts with a file header, followed by one or more chunks. Each chunk is a chunk header followed by the file name and file data, if any, and then a CRC32 checksum of the file name+data at the end. So there's no single index to read, but you can scan through the chunks fairly easily and also trivially stick new chunks onto the end of an existing file.
Graphically, the structure of a chunk is:
│ ... │
├───────────────┤
│ Chunk header │
│ │
├───────────────┤
│ File path │
├───────────────┤
│ File data │
│ ... │
│ │
│ │
│ │
│ │
│ │
│ │
│ │
│ │
│ │
│ │
│ │
├───────────────┤
│ File checksum │
├───────────────┤
│ ... │
(Asciiflow is just the best thing ever.)
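To make "scan through the chunks fairly easily" concrete, here's a rough sketch of listing an archive by hopping from chunk header to chunk header. It's an illustration, not the real code, and it hardcodes the 92-byte chunk header layout spelled out below (path length at offset 4, file length at offset 76).

    use std::fs::File;
    use std::io::{self, Read, Seek, SeekFrom};

    // Sketch: list the paths in a .uri archive by walking chunk headers.
    fn list_archive(path: &str) -> io::Result<()> {
        let mut f = File::open(path)?;

        // File header: 4-byte magic "uri\0" + u32 version, little-endian.
        let mut file_header = [0u8; 8];
        f.read_exact(&mut file_header)?;
        assert_eq!(&file_header[0..4], b"uri\0", "not a uri archive");

        loop {
            // Chunk header is 92 bytes; stop cleanly at end-of-file.
            let mut hdr = [0u8; 92];
            match f.read_exact(&mut hdr) {
                Ok(()) => (),
                Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => break,
                Err(e) => return Err(e),
            }

            // Offsets: type 0..4, path len 4..8, mode 8..12, owner 12..44,
            // group 44..76, file len 76..84, compression 84..88, crc 88..92.
            let path_len = u32::from_le_bytes(hdr[4..8].try_into().unwrap()) as usize;
            let file_len = u64::from_le_bytes(hdr[76..84].try_into().unwrap());

            let mut name = vec![0u8; path_len];
            f.read_exact(&mut name)?;
            println!("{}", String::from_utf8_lossy(&name));

            // Skip the file data and the trailing u32 checksum.
            f.seek(SeekFrom::Current(file_len as i64 + 4))?;
        }
        Ok(())
    }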
Things we aren't gonna give a shit about for now:
u64::MAX
(Actually technically i64::MAX
, since
the seek()
system call takes an i64
. Consider nailing this down
better someday.)File header:
Magic: "uri\0"
Version number: u32, this is version 0
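Writing that header is about as boring as it sounds. A sketch, with the little-endian encoding noted further down:

    use std::io::{self, Write};

    // Sketch: the 8-byte file header is just the magic plus a u32 version.
    fn write_file_header<W: Write>(w: &mut W) -> io::Result<()> {
        w.write_all(b"uri\0")?;             // magic
        w.write_all(&0u32.to_le_bytes())?;  // version 0, little-endian
        Ok(())
    }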
Chunk header:
File type (normal, dir, symlink) -- [u8;4]
File path length -- u32
File mode -- u32
Owner name -- [u8;32]
Group name -- [u8;32]
File length -- u64
Compression scheme -- [u8;4]
Chunk header checksum -- u32 (CRC32)
All numbers are little-endian. All sizes are in bytes. "Strings" such as the owner name or file path have no particular encoding, they are just bags of bytes, and they do not have zero terminators.
Total chunk header size: 92 bytes (4+4+4+32+32+8+4+4), plus the file name and data after it.
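For reference, the same layout as a rough Rust struct plus a serializer. This is a sketch, not the actual implementation; the checksum leans on the crc32fast crate mentioned further down.

    // Sketch of the 92-byte chunk header described above. All integers are
    // little-endian on disk; owner/group are zero-padded to 32 bytes.
    struct ChunkHeader {
        file_type: [u8; 4],   // "FILE", "DIR " or "SLNK"
        path_len: u32,        // length of the file path following the header
        mode: u32,            // file mode bits
        owner: [u8; 32],      // owner name, zero-padded, no terminator
        group: [u8; 32],      // group name, zero-padded, no terminator
        file_len: u64,        // length of the file data
        compression: [u8; 4], // "NONE" for now
    }

    impl ChunkHeader {
        // Serialize to the on-disk form, appending the header checksum,
        // which covers the 88 bytes of fields before it.
        fn to_bytes(&self) -> [u8; 92] {
            let mut buf = [0u8; 92];
            buf[0..4].copy_from_slice(&self.file_type);
            buf[4..8].copy_from_slice(&self.path_len.to_le_bytes());
            buf[8..12].copy_from_slice(&self.mode.to_le_bytes());
            buf[12..44].copy_from_slice(&self.owner);
            buf[44..76].copy_from_slice(&self.group);
            buf[76..84].copy_from_slice(&self.file_len.to_le_bytes());
            buf[84..88].copy_from_slice(&self.compression);
            let mut crc = crc32fast::Hasher::new();
            crc.update(&buf[0..88]);
            buf[88..92].copy_from_slice(&crc.finalize().to_le_bytes());
            buf
        }
    }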
The file type is a set of 4 ASCII characters, and is at the start of each chunk header. That way it acts as a human-readable marker for the start of each chunk, which is nice for debugging, and serves as a soft check to make sure that you are actually at the start of a chunk without needing an explicit magic number for the chunk header.
Per the Linux useradd man page, max username length is 32 characters. I presume they actually mean 32 bytes. Both strings, if less than 32 characters, will have unused data filled with 0 bytes, and will never contain a 0 byte before the unused portion.
The chunk header checksum is calculated by CRC32'ing the serialized bytes for all the fields preceding it in the header.
The file path is stored directly after the chunk header. The file path must be relative, i.e. not start with /. It cannot contain . or .. directory entries in it. It cannot contain consecutive /'s. The path does not end with a zero byte, since its length is already in the chunk header.
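A sketch of what checking those rules could look like; path_is_valid is a made-up helper name, and paths are treated as plain bytes since they have no particular encoding:

    // Sketch: enforce the path rules above on a path treated as raw bytes.
    fn path_is_valid(path: &[u8]) -> bool {
        if path.is_empty() || path[0] == b'/' {
            return false; // must be relative
        }
        if path.windows(2).any(|w| w == b"//") {
            return false; // no consecutive slashes
        }
        // No "." or ".." components anywhere in the path.
        path.split(|&b| b == b'/')
            .all(|part| part != b"." && part != b"..")
    }

    #[test]
    fn path_rules() {
        assert!(path_is_valid(b"foo/bar.txt"));
        assert!(!path_is_valid(b"/etc/passwd")); // absolute
        assert!(!path_is_valid(b"foo//bar"));    // consecutive slashes
        assert!(!path_is_valid(b"foo/../bar"));  // ".." component
    }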
After the file path comes the file contents. If it's a symlink, it's the destination of the symlink. If it's a directory, it's nothing. If it's a normal file, it's the data. There is no padding.
The file checksum includes the file path string. Basically it's the checksum of everything between the chunk header and itself. Hence files have a checksum for the file path and file data, directories have a checksum of just the file path, and symlinks have a checksum of the file path and the symlink destination bytes.
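In code, a writer can just feed the same bytes it writes into the hasher. A sketch, where write_chunk_body is a hypothetical helper and crc32fast again stands in for the CRC:

    use std::io::{self, Write};

    // Sketch: write the path and data of a chunk, then the CRC32 of exactly
    // those bytes. For a directory `data` is empty; for a symlink it's the
    // link target.
    fn write_chunk_body<W: Write>(w: &mut W, path: &[u8], data: &[u8]) -> io::Result<()> {
        let mut crc = crc32fast::Hasher::new();
        w.write_all(path)?;
        crc.update(path);
        w.write_all(data)?;
        crc.update(data);
        w.write_all(&crc.finalize().to_le_bytes())?;
        Ok(())
    }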
If there's a file entry for some file at path foo/bar.txt, a directory entry for directory foo/ must precede it somewhere in the archive. Multiple entries with the same path are an error.
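A sketch of how a reader could enforce both rules while walking chunks in order. This is illustrative only, and it assumes directory paths are stored without a trailing /:

    use std::collections::HashSet;

    // Sketch: remember every path seen so far; complain about duplicates and
    // about files whose parent directory hasn't shown up yet.
    fn check_entry(seen: &mut HashSet<Vec<u8>>, path: &[u8]) -> Result<(), String> {
        if let Some(idx) = path.iter().rposition(|&b| b == b'/') {
            // Everything before the last '/' is the parent directory, which
            // must have its own chunk earlier in the archive.
            if !seen.contains(&path[..idx]) {
                return Err(format!(
                    "no directory entry for parent of {}",
                    String::from_utf8_lossy(path)
                ));
            }
        }
        if !seen.insert(path.to_vec()) {
            return Err(format!("duplicate entry {}", String::from_utf8_lossy(path)));
        }
        Ok(())
    }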
File types:
FILE
DIR (including the trailing space)
SLNK (symlink)
Compression types:
NONE
GZIP, ZSTD or XZ may be candidates
For both code and spec: SPDX-License-Identifier: GPL-3.0-only
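As Rust constants, those 4-byte tags might look something like this (names invented for illustration):

    // Sketch: the 4-byte ASCII tags for file type and compression scheme.
    const TYPE_FILE: [u8; 4] = *b"FILE";
    const TYPE_DIR: [u8; 4] = *b"DIR "; // note the trailing space
    const TYPE_SYMLINK: [u8; 4] = *b"SLNK";

    const COMPRESSION_NONE: [u8; 4] = *b"NONE";
    // GZIP/ZSTD/XZ tags would only show up if those ever get implemented.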
For compressed data it might or might not be useful to store the uncompressed data length, so you can create a buffer once and then extract the file into it all in one go? Not actually sure, we're presumably going to be streaming this onto the disk anyway. I'm kinda leaning towards "no" for now?
Do we actually care about the compression type per chunk, or can we just put the compression type in the file header and use the same algorithm for all chunks? The idea of tailoring the compression algo to the data is neat, but really you aren't gonna need it.
For device files etc we might need to store more metadata, either in the chunk header or in the file body itself.
Right now the file format version is 0, so we can change whatever we feel like.
I sooooorta want to have the last byte of the magic number indicate the file format version, so next version would be uri\1, then uri\2, etc. However, having the magic number be a zero-terminated C string feels a little too cute to let go. Not actually useful (I hate zero-terminated strings), but cute.
A cute goal would be to be able to have a decompression program decompress multiple files in parallel. Should be doable, even though doing a lot of seeking makes my HDD-accustomed brain wince. Doing parallel compression is harder 'cause we don't know how long the compressed data is and copying shit is expensive. We'd basically be compressing multiple files into chunks at once, either in memory or writing them to disk, and then concatenating them all together in one big copy.
A cute feature would be to be able to delete chunks by replacing their data with 0's and their chunk header with a certain dummy header (also all 0's?), thus you don't need to move/compact all the data after it. Then if you DON'T do any inline compression but rather just run the whole archive through a compression program, the empty space will basically disappear. Inline compression basically makes this feature worthless though, you can delete files but not recover the space without compacting the chunks after it. We are not going to put up with dead space between chunks or anything like that. Dealing with fragmentation is a game I choose not to play. So no, we're not going to do this.
Which CRC32 do we use, exactly? Whichever one the crc32fast crate uses. There's several common variations though. For more info see:
Maybe we can make this do something sensible on Windows? I honestly have not tested. It's probably possible, I've tried to leave stubs for platform functionality.
Can file paths have a zero byte in them? I don't THINK so... Usernames certainly cannot, but also neither usernames nor file paths have to be valid UTF-8. Hnyrn. Do we really care if they aren't? ...maybe. It'd make life a little simpler if we only dealt with real Rust strings rather than OS bullshit, but as noted above, there's a lot of OS bullshit involved in correctly archiving files so handling that is kinda our job here.
See "things we won't worry about" above.
There's a time or two when we read the whole file being compressed into memory, which we shouldn't do. But it's convenient so for now it's fine.
There's probably edge cases in path handling we don't catch, because paths are Hard. I want to make it impossible to create an archive that doesn't start with a relative toplevel dir, but I haven't really gotten there yet, and even if I do the extractor still has to check for evil paths.
Need moar integration tests, fuzzing, code coverage, etc.
The CLI will zip up a single directory of stuff; you can't tell it to archive multiple files or multiple directories. This is intentional, so that when you unzip it everything is in a single directory.
A lot of tar's weirdness is there because it was intended as literally a backup program, so it needed to know weird things like device numbers and such.
getpwuid() is apparently how you turn a UID into a user name (for the owner/group fields), and search engines and Stack Overflow both couldn't help. Everything I found said "parse /etc/passwd by hand, oh and you'll have to read /etc/nsswitch.conf and talk to the things it says manually, and BSD has its own API for it, etc." Of course, I don't know how to use it either: I spent about 2 hours looking for a bug that was caused by me checking for hard links incorrectly.
Using the file_type field as a marker to start chunk headers is kinda nice. Does it actually matter though, to anyone other than people writing tools like this one? Not really.
Making a program that does basically what tar does is real easy. But making a program better than tar is a shitload of very precise and fiddly work.
Inspired mainly by https://www.cyphar.com/blog/post/20190121-ociv2-images-i-tar, which is in large part a hilariously and disgustingly complete recounting of all of tar's warts. Its goals are different from uri's: it focuses on content-addressed, layered and deduplicated container images rather than general-purpose file archives for shuffling piles of data around. But still there's some useful lessons there:
Compress files individually instead of compressing the whole archive tar.gz style, 'cause it enables parallel decompression.
This actually slightly vindicates my decision to go with a header+data chunk model and not an index+archive model, 'cause all of this is still quite easy to do. However, it does still have the problem that going through all the metadata requires skimming through the whole file, even if it happens in large jumps. You could arrange it so that you have a header, a pile of data, and then the index at the end of the file, or maybe the index at some offset pointed to in the header. That would make reading the index easier, and you could keep life simple by still not allowing rearranging or deletion -- since those operations spoil content-addressing anyway. In reality, I don't know enough about how filesystems work in practice to know what the performance tradeoffs are actually like. Maybe I should benchmark it?
So if I do decide to make another version of this, the questions to answer will be: