I was poking around some repos and your file format caught my eye. I have done some work trying to push a reimplementation of tar in Rust with uutils, to very little avail. That effort has forced me to learn more about tar and archiving than I ever really wanted to know. One of the major pain points, and one of the most fragile parts of the tar format, is the header. So much logic and so many parsing considerations go into ensuring backwards compatibility that it dominates a lot of library and standards discussions.
I really like the flexibility of appending the needed information to the end of an archive or metadata file. As long as readers stay backwards compatible with the feature set, you are never bound by the initial decision about header size. The trailer can absorb all future modifications.
The only reason a tar header is the way it is, is that tar was written for tape drives, where sequential access was the only way to do things, so the header needed to be at the beginning of the tape. It has stuck around because it is so widely supported and free. It is also a very simple format on its surface.
Which is why I think the box format could be a spiritual successor to tar.
So, enough exposition. I think moving everything besides the magic and the trailer pointer into the trailer would eliminate the growing pains that riddle tar. In essence, have the trailer contain the header information: feature flags, versioning, archive attributes, etc.
Any operation already needs the entry metadata, which lives in the trailer, so the archive metadata that the header used to hold would already be local to that operation. This way the format is free to add flags and versioning without any fear of running out of space in the first 32 bytes. The header would never change, so the trailer is free to change.
Prefixing the trailer with the version and feature flags would also mean the trailer carries the information needed to interpret the rest of the trailer, such as version enum dispatch or compression loading. That takes advantage of the locality of the options and the control flow. I could go on, but here are my crayon-drawing thoughts on what the structs could look like:
Original:
pub struct BoxHeader {
    /// Format version (currently 1).
    pub version: u8,
    /// Allow `\xNN` escape sequences in paths.
    pub allow_escapes: bool,
    /// Allow external symlinks pointing outside the archive.
    pub allow_external_symlinks: bool,
    /// Alignment for file data (0 = no alignment).
    pub alignment: u32,
    /// Offset to the trailer (metadata), or None if not yet written.
    pub trailer: Option<NonZeroU64>,
}
Proposed:
pub struct BoxHeader {
    /// Offset to the trailer (metadata), or None if not yet written.
    pub trailer: Option<NonZeroU64>,
}
/* - SNIP - */
pub struct BoxMetadata<'a> {
    /* ================================
     * Original Header Meta
     * =============================== */
    /// Format version (currently 1).
    pub version: u8,
    /// Allow `\xNN` escape sequences in paths.
    pub allow_escapes: bool,
    /// Allow external symlinks pointing outside the archive.
    pub allow_external_symlinks: bool,
    /// Alignment for file data (0 = no alignment).
    pub alignment: u32,
    // ===============================
    /// Root "directory" keyed by record indices
    pub(crate) root: Vec<RecordIndex>,
    /// Keyed by record index (offset by -1). This means if a `RecordIndex` has the value 1, its index in this vector is 0.
    /// This is to provide compatibility with platforms such as Linux, and allow for error checking a box file.
    pub(crate) records: Vec<Record<'a>>,
    /// Interned attribute keys with type information.
    pub(crate) attr_keys: Vec<AttrKey>,
    /// The global attributes that apply to this entire box file.
    pub(crate) attrs: AttrMap,
    /// Zstd dictionary for compression/decompression.
    /// When present, all Zstd-compressed content (files and chunked blocks) uses this dictionary.
    /// None means no dictionary training was performed.
    pub(crate) dictionary: Option<Box<[u8]>>,
    /// Parsed FST for O(key_length) path lookups and iteration.
    /// None for old archives without FST support.
    pub(crate) fst: Option<box_fst::Fst<Cow<'a, [u8]>>>,
    /// Block FST for seeking within chunked files.
    /// Keys are 16 bytes: record_index (u64 BE) + logical_offset (u64 BE).
    /// Values are physical offsets within the compressed data.
    /// None for archives without chunked files or block index.
    pub(crate) block_fst: Option<box_fst::Fst<Cow<'a, [u8]>>>,
}
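To make the dispatch idea a bit more concrete, here is a minimal sketch of how a reader might work with a frozen header and a version-prefixed trailer. Everything about the byte layout is an assumption for illustration only, not Box's actual on-disk format: the `BOX\0` magic, the little-endian offset at bytes 4..12, the 32-byte reserved header, the flag bit positions, and the names `parse_header`, `parse_prelude`, `TrailerPrelude`, and `ParseError` are all hypothetical.

```rust
use std::num::NonZeroU64;

// Hypothetical magic bytes and header size; the real format may differ.
const MAGIC: [u8; 4] = *b"BOX\0";
const HEADER_LEN: usize = 32;

/// The frozen header: magic plus a pointer to the trailer, nothing else.
#[derive(Debug, PartialEq)]
pub struct BoxHeader {
    pub trailer: Option<NonZeroU64>,
}

/// Archive-level options that now live at the start of the trailer.
#[derive(Debug, PartialEq)]
pub struct TrailerPrelude {
    pub version: u8,
    pub allow_escapes: bool,
    pub allow_external_symlinks: bool,
    pub alignment: u32,
}

#[derive(Debug, PartialEq)]
pub enum ParseError {
    TooShort,
    BadMagic,
    UnsupportedVersion(u8),
}

/// Read the fixed header. Assumed layout: magic (4 bytes), trailer offset
/// (u64 little-endian, 0 = not yet written), remaining bytes reserved.
pub fn parse_header(bytes: &[u8]) -> Result<BoxHeader, ParseError> {
    if bytes.len() < HEADER_LEN {
        return Err(ParseError::TooShort);
    }
    if bytes[..4] != MAGIC[..] {
        return Err(ParseError::BadMagic);
    }
    let mut raw = [0u8; 8];
    raw.copy_from_slice(&bytes[4..12]);
    Ok(BoxHeader {
        trailer: NonZeroU64::new(u64::from_le_bytes(raw)),
    })
}

/// Read the start of the trailer. The first byte is the version, and it
/// decides how every following byte is interpreted.
pub fn parse_prelude(trailer: &[u8]) -> Result<TrailerPrelude, ParseError> {
    let version = *trailer.first().ok_or(ParseError::TooShort)?;
    match version {
        // Assumed v1 layout: flags (1 byte), alignment (u32 little-endian).
        1 => {
            if trailer.len() < 6 {
                return Err(ParseError::TooShort);
            }
            let flags = trailer[1];
            let mut raw = [0u8; 4];
            raw.copy_from_slice(&trailer[2..6]);
            Ok(TrailerPrelude {
                version,
                allow_escapes: flags & 0b01 != 0,
                allow_external_symlinks: flags & 0b10 != 0,
                alignment: u32::from_le_bytes(raw),
            })
        }
        // A future version would add its own arm here without touching the header.
        other => Err(ParseError::UnsupportedVersion(other)),
    }
}
```

A reader would call `parse_header` on the first 32 bytes, seek to the returned offset, and hand the bytes from there to `parse_prelude`. Only the `match` grows as versions are added, while the header parser stays fixed.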
I know more goes into parsing, reading, and writing with this change, but I'm happy to lend a hand if this idea sounds interesting, or if you have other ideas on where you want to go with Box.
I think it's awesome, and I'm super interested!