Note that this is a draft, not a finished article. Also keep in mind that the .yar format doesn't exist yet; the basic design ideas are done, but the detailed file format specification is at an early stage.
To highlight the problems the .yar format tries to solve, this article compares the .yar format to a few popular archive formats.
Some of the formats being compared are plain archivers, and some also integrate compression and possibly even encryption into the same file format. Both approaches have advantages and disadvantages. We concentrate mostly on the archiving features, because the .yar format is an archiving-only format.
Note that this section compares the features of the archive formats, not the features of the tools using these formats. This is an important distinction, because sometimes the most popular tools for a given file format don't take full advantage of the format, or the user interface of the tool is too limiting.
The .tar format is the most popular archive format used on POSIX systems.
There are several variants of the .tar file format, all of which are extensions made on top of the original .tar format designed in the 1970s. Some of the .tar variants are not compatible with each other. Distinguishing between the variants is not always easy.
The original .tar format had a 100-character limit for filename length. The ustar (POSIX.1-1988) format extended this to a maximum of 255 characters (including the pathname). In practice the limit is smaller, because the filename is split at a directory separator into two parts of at most 155 and 100 bytes. The GNU tar and PAX (POSIX.1-2001) formats have no arbitrary limits on filename length.
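The splitting rule explains why the practical limit is below 255 characters. The following is a minimal sketch of it, assuming byte lengths are measured on the UTF-8 encoding of the path; the function name `split_ustar_path` is made up for this illustration and is not taken from any tar implementation.

```python
def split_ustar_path(path):
    """Split a path into ustar (prefix, name) fields.

    Returns None when no directory separator gives a valid split.
    Field sizes: prefix is at most 155 bytes, name at most 100 bytes.
    """
    raw = path.encode("utf-8")  # assumption: lengths are counted in bytes
    if len(raw) <= 100:
        return b"", raw  # fits in the name field alone
    # The split has to land on a '/', which itself is not stored:
    # the reader rebuilds the path as prefix + '/' + name.
    for i in range(len(raw)):
        if raw[i:i + 1] != b"/":
            continue
        prefix, name = raw[:i], raw[i + 1:]
        if 0 < len(prefix) <= 155 and 0 < len(name) <= 100:
            return prefix, name
    return None
```

A 230-byte path such as ten characters of directory followed by a 219-character filename is under the nominal limit but still cannot be stored, because the last component alone exceeds 100 bytes.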
The .tar format is one of the very few file formats for which the “file” command has a detection routine written in C. The routine first checks the header checksum, and if that matches, looks for variant-specific magic bytes in the middle of the header. Older .tar variants have no magic bytes at all, thus the only way to detect these files from the file contents is the header checksum.
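The detection logic described above can be sketched in a few lines. This is not the code used by the “file” command, only an illustration of the same idea: the checksum is the sum of all 512 header bytes with the checksum field itself counted as eight spaces, and the magic bytes sit at offset 257. The function names and the returned labels are chosen for this example; the sample headers are produced with Python's standard `tarfile` module.

```python
import tarfile

def tar_header_checksum(header):
    # Sum of all header bytes, with the 8-byte checksum field (offset 148)
    # treated as if it were filled with ASCII spaces.
    return sum(header[:148]) + 8 * 0x20 + sum(header[156:512])

def looks_like_tar(header):
    """Return a variant label, or None if the block is not a tar header."""
    if len(header) < 512:
        return None
    field = header[148:156].strip(b" \0")  # octal digits, NUL/space padded
    try:
        stored = int(field, 8)
    except ValueError:
        return None
    if stored != tar_header_checksum(header):
        return None
    magic = header[257:265]  # magic (6 bytes) + version (2 bytes)
    if magic == b"ustar\x0000":
        return "POSIX ustar/pax"
    if magic == b"ustar  \x00":
        return "GNU"
    return "old (no magic)"

# Sample headers for a file named hello.txt in two variants.
ustar_header = tarfile.TarInfo("hello.txt").tobuf(format=tarfile.USTAR_FORMAT)
gnu_header = tarfile.TarInfo("hello.txt").tobuf(format=tarfile.GNU_FORMAT)
```

Because the checksum is only a simple byte sum, it is a weak signature compared to real magic bytes, which is part of why detecting the older variants is unreliable.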
No popular variant of the .tar format supports storing a central index of the archive contents. Thus, to get the list of the archived files, the archiver needs to read through the whole archive file, which is slow, especially if the archive is compressed.
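To make the cost concrete, here is a rough sketch of what listing a .tar file involves: the reader has to hop from header to header across the whole file. It is deliberately simplified and handles only plain ustar members with short names (no prefix field, no PAX or GNU extension headers, no base-256 sizes). The function name `list_tar_members` is invented for this example, and the sample archive is built in memory with Python's `tarfile` module.

```python
import io
import tarfile

def list_tar_members(stream):
    """Walk a plain ustar stream and return [(name, size), ...]."""
    members = []
    while True:
        header = stream.read(512)
        # Short read or an all-zero block marks the end of the archive.
        if len(header) < 512 or header == bytes(512):
            break
        name = header[:100].split(b"\0", 1)[0].decode("utf-8", "replace")
        size = int(header[124:136].strip(b" \0") or b"0", 8)  # octal text
        members.append((name, size))
        # File data is padded to a multiple of 512 bytes; skip past it.
        stream.seek((size + 511) // 512 * 512, io.SEEK_CUR)
    return members

# Build a small sample archive in memory.
sample = io.BytesIO()
with tarfile.open(fileobj=sample, mode="w", format=tarfile.USTAR_FORMAT) as tf:
    for member_name, payload in [("a.txt", b"abc"), ("b.bin", b"x" * 1000)]:
        info = tarfile.TarInfo(member_name)
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))
sample.seek(0)
listing = list_tar_members(sample)
```

On an uncompressed file the skips are cheap seeks, but on a compressed stream every skipped byte still has to be decompressed, which is where the slowness comes from.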
There is no generic support for multiple forks in any of the .tar variants. The tar command on Mac OS X is able to store resource forks in .tar files. It does this by storing the resource fork in a separate file with a special name. When such an archive is extracted with a tar that is unaware of resource forks, two files are created for each file that has a resource fork.
At least GNU tar and Jörg Schilling's star have their own extensions to store sparse files. Both support only a sparse format that is comparable to the Central Sparse format of .yar. Thus, these tar implementations need to read twice through the sparse file being archived, unless the operating system supports a better way to locate the sparse “holes”. Reading through the file twice is slow, and could be eliminated by using an encoding like the Scatter Sparse encoding supported by the .yar format.
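The two passes can be illustrated with a small sketch. It assumes that the map of data regions has to be written before any file data, which is how the text characterizes these formats, and that holes are found by scanning for all-zero blocks. The function names, the block size, and the dictionary layout are illustrative only and do not correspond to GNU tar, star, or the .yar draft.

```python
import io

def find_data_extents(f, block_size=4096):
    """Pass 1: return ([(offset, length), ...], total_size) for non-zero runs."""
    extents = []
    offset = 0
    f.seek(0)
    while True:
        block = f.read(block_size)
        if not block:
            break
        if block.count(0) != len(block):  # block has at least one non-zero byte
            if extents and extents[-1][0] + extents[-1][1] == offset:
                # Extend the previous extent when the blocks are adjacent.
                extents[-1] = (extents[-1][0], extents[-1][1] + len(block))
            else:
                extents.append((offset, len(block)))
        offset += len(block)
    return extents, offset

def archive_sparse(f, block_size=4096):
    """Return (header, payload); the header carries the complete sparse map."""
    extents, total = find_data_extents(f, block_size)  # first read of the file
    chunks = []
    for off, length in extents:                        # second read of the file
        f.seek(off)
        chunks.append(f.read(length))
    return {"size": total, "map": extents}, b"".join(chunks)
```

An encoding that interleaves each region's position with its data would let the archiver emit output during the first scan, so the second read would not be needed.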
GNU tar and star provide extensions for incremental backups. The format used by star uses less space, but it supports incremental backups only partially, since it cannot indicate which files are obsolete and should be deleted. The basic idea of the incremental backup support in the .yar format has been borrowed from the GNU tar format.
Most of the “advanced” extensions of the .tar format work by creating “fake” files in the archive. This allows good backwards compatibility with old tools that don't support the new extensions. At the same time, extensions of this kind waste quite a bit of (uncompressed) space, which is significant even if the .tar file is compressed. For example, storing a one-byte file in the PAX format uses 2048 bytes of uncompressed space: 1024 bytes for backwards compatibility, 512 for the PAX header, and 512 for the actual data block.
There are several variants of the .cpio file format. They are incompatible with each other, but easy to distinguish by checking the magic bytes.
The .cpio variants support storing only a basic set of file attributes. There is no support for efficient handling of sparse files, and no support for incremental backups or multi-volume archives.
The advantage of .cpio is that its headers are quite small. It is also easy to parse if support for hard links is not needed.
Nowadays the most popular uses of the .cpio format are in the RPM package manager and in Linux initramfs.
The .zip format is probably the most popular archive format on Windows and DOS operating systems. On other systems it has become more popular via the OpenDocument format, which uses the .zip format with different filename suffixes.
The .zip format supports not only archiving itself, but also compression and encryption. The files are compressed independently, which guarantees fast random-access reading, but the lack of solid compression prevents getting maximum compression ratios.
The .zip format has a central index at the end of the archive. Information about archived files is also stored between the actual file data. This makes recovery possible in case the index gets corrupted.
Creating .zip files is streamable. Secure extraction isn't streamable, because the archive index is at the end of the archive. Streamed extraction is insecure: if the user first checked the file list from the index and then tried to extract the archive in streamed mode, the files actually being extracted could be named completely differently from what the index claims, and the index cannot be verified until it is too late. For this reason, the .zip format is not considered to be really streamable when reading the archive.
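The mismatch between the two copies of the metadata is easy to demonstrate. The sketch below builds a one-file archive with Python's `zipfile` module, then overwrites the name in the local header (the copy stored next to the file data) while leaving the central index untouched. The helper names `first_local_name` and `central_names` and the replacement filename are made up for this example.

```python
import io
import struct
import zipfile

def first_local_name(data):
    """Read the filename from the first local file header (30 fixed bytes)."""
    (sig, _ver, _flags, _method, _mtime, _mdate,
     _crc, _csize, _usize, name_len, _extra_len) = struct.unpack_from(
        "<4s5H3I2H", data, 0)
    if sig != b"PK\x03\x04":
        raise ValueError("not a local file header")
    return data[30:30 + name_len]

def central_names(data):
    """Names as listed by the central index at the end of the archive."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.namelist()

# Build a one-file archive in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", b"hi")
data = buf.getvalue()

# Overwrite the 9-byte name in the local header only.
tampered = data[:30] + b"evil1.txt" + data[39:]
```

A streaming extractor sees only the local header when it reaches the file data, so it would write `evil1.txt`, while a listing taken from the index still shows `hello.txt`.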
The .7z format is the native format of 7-Zip.
The central index is stored at the end of the archive, which makes reading .7z files non-streamable. Writing the .7z format is not streamable either, because the archiver tool must be able to update some information at the beginning of the archive at the end of the archiving process.
The XAR format is an archiving-compression format like .zip in the sense that all the files are compressed independently. The advantages and disadvantages of this are the same as for the .zip format.
XAR files are almost always streamable. It's possible to create archives that are not streamable, but this is useful only if there are duplicate files on the disk that are not hard linked. The command line tool doesn't create non-streamable archives without a special command line option.
XAR uses XML to store the index of the archived files. While some people see XML as a somewhat bloated solution for such a simple problem as an archive format header, it is very flexible and allows storing an arbitrary set of metadata such as ACLs and extended attributes.
Writing .dar files is streamable unless a multi-volume archive is created: the archiving tool needs to update the beginning of each volume once the end of the volume has been written. Reading .dar files is not streamable, but the dar tools provide good workarounds that are sufficient in many real-world situations.
Descriptions of the tabulated features:
| Format | Archiving only | Header | Unicode filenames | Central index |
|---|---|---|---|---|
| .yar | yes | UTF-8 text | yes (UTF-8) | usually |
| .tar (ustar) | yes | Binary | no (raw 8-bit) | no |
| .tar (PAX) | yes | Binary | yes (UTF-8) | no |
| .cpio (newc) | yes | ASCII text | no (raw 8-bit) | no |
| .xar | no | Binary+XML | yes | yes |
| .dar | no | Binary | no (raw 8-bit) | yes |
| .zip | no | Binary | yes (*) | yes |
| .7z | no | Binary | yes (UTF-16LE) | yes |
(*) This feature was added in 2006, so probably only a few tools support it yet. Earlier the .zip format supported only raw 8-bit filenames.
| Format | Streamable | Forks | 1-pass sparse | 2-pass sparse |
|---|---|---|---|---|
| .yar | yes | yes | yes | yes |
| .tar | yes | no (*) | no | GNU, star |
| .cpio | yes | no | no | no |
| .xar | usually | yes | no | no |
| .dar | writing usually is | ? | ? | ? |
| .zip | writing only | no | no | no |
| .7z | no | no | no | no |
(*) Some tar archiver tools can store forks by storing the fork as another regular file in the archive. This is just a hack in the archiver tool, not a feature of the archive format, because in the worst case the filename used for the fork could conflict with another file.
| Format | Incremental backups | Multi-volume | Volumes are independent |
|---|---|---|---|
| .yar | yes | yes | yes |
| .tar | GNU | GNU, star | yes |
| .cpio | no | no | - |
| .xar | no | no | - |
| .dar | yes | yes | no |
| .zip | no | yes? | no? |
| .7z | no | no (*) | no |
(*) 7-Zip supports raw splitting of .7z archives. While this is effectively a multi-volume archive, the feature is not part of the archive format but of the archiving tool (a similar thing could be done with any archive format).
| Format | Owner and group names | UID and GID | ACLs and EAs |
|---|---|---|---|
| .yar | yes | yes | yes |
| .tar (ustar) | yes | yes | no |
| .tar (PAX) | yes | yes | yes |
| .cpio | no | yes | no |
| .xar | yes | yes | yes |
| .dar | no (*) | yes | yes |
| .zip | no | yes | no? |
| .7z | no | no | no |
(*) The dar tool doesn't store the owner and group names, but probably the file format would support storing them e.g. as extended attributes.