A little archive file format, just for fun. Inspired by this article: https://mort.coffee/home/tar/ Named after Dr. Uri Burstyn.

I realized 5 minutes after making it that the name is …, and have decided to keep it this way so that nobody ever takes this seriously. The file extension is `.uri`. This probably causes problems for some things. It's pronounced like "Yuri" though.
As far as I can tell there's basically two ways to design an archive format: you either have some sort of index array that describes file metadata for many files in one place, or you just have what is essentially a linked list of header+data pieces. It's a tradeoff: listing, modifying and removing files is a lot more efficient if you can read/modify an index table and just change where various index entries point. However, making this efficient for adding new files usually means having multiple places in the archive file an index table can be, and having multiple indices, so the details can get complex. On the other hand, if the metadata for each file is right before the file in the archive, extracting files and adding new files is basically trivial, but to read metadata about the archive contents you have to scan through the whole thing going from header to header, and removing files means compacting down everything after them. So, neither is uniformly better than the other.
The index style is used in `7z`, as well as probably `zip` archives, while the list style is used in `tar`, `cpio` and RIFF.
`uri` uses the list style, basically because it seems simpler.
A `uri` file starts with a file header, followed by one or more chunks. Each chunk is a chunk header followed by the file name and file data, if any, and then a CRC32 checksum of the file name+data at the end. So there's no single index to read, but you can scan through the chunks fairly easily, and also trivially stick new chunks onto the end of an existing file.
Graphically, the structure of a chunk is:
```
│      ...      │
├───────────────┤
│ Chunk header  │
│               │
├───────────────┤
│   File path   │
├───────────────┤
│   File data   │
│      ...      │
│               │
│               │
├───────────────┤
│ File checksum │
├───────────────┤
│      ...      │
```
(Asciiflow is just the best thing ever.)
Things we aren't gonna give a shit about for now:
(Maximum file size is `i64::MAX`, since the `seek()` system call takes an `i64`. Consider nailing this down better someday.)
- Magic: `"uri\0"`
- Version number: `u32`, this is version 0
- File type (normal, dir, symlink) -- `[u8; 4]`
- File path length -- `u32`
- File mode -- `u32`
- Owner name -- `[u8; 32]`
- Group name -- `[u8; 32]`
- File length -- `u64`
- Compression scheme -- `[u8; 4]`
- Chunk header checksum -- `u32` (CRC32)
All numbers are little-endian. All sizes are in bytes. "Strings" such as the owner name or file path have no particular encoding, they are just bags of bytes, and they do not have zero terminators.
Total header size: 92 bytes (the fields above sum to exactly 92), plus the file name and data.
The file type is a set of 4 ASCII characters, and is at the start of each chunk header. That way it acts as a human-readable marker for the start of each chunk, which is nice for debugging, and serves as a soft check to make sure that you are actually at the start of a chunk without needing an explicit magic number for the chunk header.
Per the Linux `useradd` man page, max username length is 32 characters.
I presume they actually mean 32 bytes. Both strings, if less than 32
characters, will have unused data filled with 0 bytes, and will never
contain a 0 byte before the unused portion.
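The chunk header above maps naturally onto a plain struct. A sketch in Rust; the field names and the example type/compression tags are my own (the spec fixes only the field order, sizes, and little-endian byte order):

```rust
/// Chunk header per the layout above. Field names are mine, not the spec's.
pub struct ChunkHeader {
    pub file_type: [u8; 4],   // 4 ASCII chars marking the chunk start
    pub path_len: u32,        // file path length in bytes
    pub mode: u32,            // file mode
    pub owner: [u8; 32],      // owner name, zero-padded
    pub group: [u8; 32],      // group name, zero-padded
    pub file_len: u64,        // file data length in bytes
    pub compression: [u8; 4], // compression scheme tag
    pub checksum: u32,        // CRC32 of the preceding 88 bytes
}

impl ChunkHeader {
    pub const SIZE: usize = 92;

    /// Serialize the fields in spec order, all numbers little-endian.
    pub fn to_bytes(&self) -> [u8; Self::SIZE] {
        let mut buf = [0u8; Self::SIZE];
        buf[0..4].copy_from_slice(&self.file_type);
        buf[4..8].copy_from_slice(&self.path_len.to_le_bytes());
        buf[8..12].copy_from_slice(&self.mode.to_le_bytes());
        buf[12..44].copy_from_slice(&self.owner);
        buf[44..76].copy_from_slice(&self.group);
        buf[76..84].copy_from_slice(&self.file_len.to_le_bytes());
        buf[84..88].copy_from_slice(&self.compression);
        buf[88..92].copy_from_slice(&self.checksum.to_le_bytes());
        buf
    }
}
```

Serializing field-by-field like this, rather than transmuting the struct, sidesteps any questions about struct padding and host endianness.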
The chunk header checksum is calculated by CRC32'ing the serialized bytes for all the fields preceding it in the header.
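The exact CRC32 variant is still an open question (see below), so as an assumption this sketch uses the common zlib/PNG one: reflected polynomial `0xEDB88320`, init and final XOR of `0xFFFFFFFF`. A dependency-free bitwise implementation:

```rust
/// Bitwise CRC32, the zlib/PNG variant (an assumption -- the spec
/// doesn't pin down which CRC32 flavor it means). Reflected
/// polynomial 0xEDB88320, init and final XOR 0xFFFFFFFF.
pub fn crc32(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // If the low bit is set, shift and XOR in the polynomial.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}
```

Under that assumption, the chunk header checksum is `crc32` of the first 88 header bytes, and the file checksum is `crc32` of the path bytes followed by the data bytes.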
The file path is stored directly after the chunk header. The file path must be relative, i.e. not start with `/`. It cannot contain `.` or `..` directory entries. It cannot contain consecutive `/`'s. The path does not end with a zero byte, since its length is already in the chunk header.
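Those rules are easy to check mechanically. A sketch of a validator, treating paths as raw bytes per the spec; rejecting empty paths and trailing `/`'s is my own assumption, not something the spec states:

```rust
/// Validate an archive path: relative, no "." or ".." components,
/// no consecutive '/'. Rejecting empty paths and trailing slashes
/// is an assumption beyond the spec text.
pub fn validate_path(path: &[u8]) -> bool {
    if path.is_empty() || path[0] == b'/' {
        return false;
    }
    // split() yields an empty component for a leading, trailing, or
    // doubled '/', so one emptiness check covers all three cases.
    path.split(|&b| b == b'/')
        .all(|comp| !comp.is_empty() && comp != b"." && comp != b"..")
}
```

Working on `&[u8]` rather than `&str` matches the "paths are just bags of bytes" stance above: no UTF-8 validity is required.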
After the file path comes the file contents. If it's a symlink, it's the destination of the symlink. If it's a directory, it's nothing. If it's a normal file, it's the data. There is no padding.
The file checksum includes the file path string. Basically it's the checksum of everything between the chunk header and itself. Hence files have a checksum for the file path and file data, directories have a checksum of just the file path, and symlinks have a checksum of the file path and the symlink destination bytes.
If there's a file entry for some file at path `foo/bar.txt`, a directory entry for directory `foo/` must precede it somewhere in the archive.
Multiple entries with the same path are an error.
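Both rules (parent directory first, no duplicate paths) can be checked in one pass over the chunk headers. A sketch, using `&str` paths for brevity even though the format treats paths as bytes, and assuming directory paths are stored without a trailing `/`:

```rust
use std::collections::HashSet;

/// One-pass check of archive ordering: every entry's parent directory
/// must already have appeared as a directory entry, and no path may
/// appear twice. `entries` is (path, is_dir) in archive order.
pub fn check_entry_order(entries: &[(&str, bool)]) -> bool {
    let mut dirs: HashSet<&str> = HashSet::new();
    let mut seen: HashSet<&str> = HashSet::new();
    for &(path, is_dir) in entries {
        if !seen.insert(path) {
            return false; // duplicate path
        }
        if let Some((parent, _)) = path.rsplit_once('/') {
            if !dirs.contains(parent) {
                return false; // parent directory entry missing or too late
            }
        }
        if is_dir {
            dirs.insert(path);
        }
    }
    true
}
```

This is the kind of check an extractor would run while scanning chunks, so a malformed or malicious archive is rejected before anything touches the filesystem.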
For compression, `XZ` may be a candidate.
For both code and spec:
For compressed data it might or might not be useful to store the uncompressed data length, so you can create a buffer once and then extract the file into it all in one go? Not actually sure, we're presumably going to be streaming this onto the disk anyway. I'm kinda leaning towards "no" for now?
Do we actually care about the compression type per chunk, or can we just put the compression type in the file header and use the same algorithm for all chunks? The idea of tailoring the compression algo to the data is neat, but really you aren't gonna need it.
For device files etc we might need to store more metadata, either in the chunk header or in the file body itself.
Right now the file format version is `0`, so we can change whatever we want. I sooooorta want to have the last byte of the magic number indicate the file format version, so the next version would be `"uri\1"`, etc. However, having the magic number be a zero-terminated C string feels a little too cute to let go. Not actually useful (I hate zero-terminated strings), but cute.
A cute goal would be to be able to have a decompression program decompress multiple files in parallel. Should be doable, even though doing a lot of seeking makes my HDD-accustomed brain wince. Doing parallel compression is harder 'cause we don't know how long the compressed data is and copying shit is expensive. We'd basically be compressing multiple files into chunks at once, either in memory or writing them to disk, and then concatenating them all together in one big copy.
A cute feature would be to be able to delete chunks by replacing their data with 0's and their chunk header with a certain dummy header (also all 0's?), thus you don't need to move/compact all the data after it. Then if you DON'T do any inline compression but rather just run the whole archive through a compression program, the empty space will basically disappear. Inline compression basically makes this feature worthless though, you can delete files but not recover the space without compacting the chunks after it. We are not going to put up with dead space between chunks or anything like that. Dealing with fragmentation is a game I choose not to play. So no, we're not going to do this.
Which CRC32 do we use, exactly? Whichever one … uses. There's several common variations though. For more info see:
Maybe we can make this do something sensible on Windows? I honestly have not tested. It's probably possible, I've tried to leave stubs for platform functionality.
Can file paths have a zero byte in them? I don't THINK so... Usernames certainly cannot, but also neither usernames nor file paths have to be valid UTF-8. Hnyrn. Do we really care if they aren't? ...maybe. It'd make life a little simpler if we only dealt with real Rust strings rather than OS bullshit, but as noted above, there's a lot of OS bullshit involved in correctly archiving files so handling that is kinda our job here.
See "things we won't worry about" above.
There's a time or two when we read the whole file being compressed into memory, which we shouldn't do. But it's convenient so for now it's fine.
There's probably edge cases in path handling we don't catch, because paths are Hard. I want to make it impossible to create an archive that doesn't start with a relative toplevel dir, but I haven't really gotten there yet, and even if I do the extractor still has to check for evil paths.
Need moar integration tests, fuzzing, code coverage, etc.
The CLI will zip up a single directory of stuff; you can't tell it to archive multiple files or multiple directories. This is intentional, so that when you unzip it everything is in a single directory.
All that extra metadata in `tar` is there because it was intended as literally a backup program, so it needed to know weird things like device numbers and such.
`getpwuid()` — search engines and Stack Overflow both couldn't help. Everything I found said "parse `/etc/passwd` by hand, oh and you'll have to read `/etc/nsswitch.conf` and talk to the things it says manually, and BSD has its own API for it, etc." Of course, I don't know how to use it either: I spent about 2 hours looking for a bug that was caused by me checking for hard links incorrectly.
Using the `file_type` field as a marker to start chunk headers: does it actually matter though, to anyone other than people writing tools like this one? Not really.
Making a program like `tar` is real easy. But making a program better than `tar` is a shitload of very precise and fiddly work.