Comparing .yar to other archive formats

Note that this is a draft, not a finished article. Also keep in mind that the .yar format doesn't exist yet; the basic design ideas are done, but the detailed file format specification is at an early stage.

To highlight the problems the .yar format tries to solve, this article compares the .yar format to a few popular archive formats.

Some of the formats being compared are plain archivers, and some also integrate compression and sometimes even encryption into the same file format. Both approaches have advantages and disadvantages. We concentrate mostly on the archiving features, because the .yar format is an archiving-only format.

Note that this section compares the features of the archive formats, not the features of the tools using these formats. This is an important distinction, because sometimes the most popular tools for handling a given file format don't take full advantage of the format, or the user interface of the tool is too limiting.

Detailed comparisons

Tape Archive (.tar)

The .tar format is the most popular archive format used on POSIX systems.

There are several variants of the .tar file format, all of which are extensions made on top of the original .tar format designed in the 1970s. Some of the .tar variants are not compatible with each other. Distinguishing between the variants is not always easy.

The original .tar format had a 100-character limit for filename length. The ustar (POSIX.1-1988) format extended this to a maximum of 255 characters for the filename, including the path. In practice the limit is smaller, because the filename has to be split at a directory separator into two parts of at most 155 and 100 bytes. The GNU tar and PAX (POSIX.1-2001) formats place no arbitrary limit on filename length.
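
As a rough illustration of why the practical limit is smaller than the nominal one, the following Python sketch splits a path the way the two ustar fields require. The function name split_ustar_path is made up for this example, and real tar implementations may choose the split point differently.

```python
def split_ustar_path(path: bytes):
    """Split a path into (prefix, name) for a ustar header, or return None.

    Assumes the field sizes quoted above: at most 155 bytes of prefix and
    100 bytes of name, with the split falling on a '/' separator.
    """
    NAME_MAX, PREFIX_MAX = 100, 155
    if len(path) <= NAME_MAX:
        return b"", path  # fits in the name field alone
    # Try each '/' as a split point; the separator itself is not stored.
    for i in range(len(path)):
        if path[i:i + 1] != b"/":
            continue
        prefix, name = path[:i], path[i + 1:]
        if 0 < len(prefix) <= PREFIX_MAX and 0 < len(name) <= NAME_MAX:
            return prefix, name
    return None  # no separator in a usable position
```

With this sketch, a 131-byte path made of a 10-byte directory and a 120-byte file name cannot be stored at all, although it is far below 255 bytes, because neither field can hold the 120-byte component.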

The .tar format is one of the very few file formats for which the “file” command has a detection routine written in C. The routine first checks the header checksum, and if that matches, looks for variant-specific magic bytes in the middle of the header. Older .tar variants have no magic bytes at all, thus the only way to detect these files from the file contents is the header checksum.
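
The detection logic described above can be sketched in a few lines of Python. This is not the code used by the “file” command, only a minimal approximation: it assumes the usual header layout (checksum field at offset 148, magic at offset 257), checks only the unsigned form of the checksum, and uses made-up function names.

```python
def tar_header_checksum_ok(block: bytes) -> bool:
    """Check the checksum of the first 512-byte tar header block."""
    if len(block) < 512:
        return False
    header = block[:512]
    field = header[148:156].strip(b" \0")
    try:
        stored = int(field.decode("ascii"), 8)  # checksum is stored as octal text
    except ValueError:
        return False
    # The checksum field itself is counted as eight ASCII spaces.
    computed = sum(header[:148]) + 8 * 0x20 + sum(header[156:])
    return stored == computed


def tar_variant(block: bytes):
    """Return a rough variant name, or None if the checksum does not match."""
    if not tar_header_checksum_ok(block):
        return None
    if block[257:263] == b"ustar\x00":
        return "ustar"
    if block[257:265] == b"ustar  \x00":
        return "GNU tar"
    return "old tar (no magic)"
```

A header without any magic still passes the checksum test, which is why the last branch can only say that the block looks like some older .tar variant.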

No popular variant of the .tar format supports storing a central index of the archive contents. Thus, to get the list of the archived files, the archiver needs to read through the whole archive file, which is slow, especially if the archive is compressed.

There is no generic support for multiple forks in any of the .tar variants. The tar command on Mac OS X is able to store resource forks in .tar files. It does this by storing the resource fork in a separate file with a special name. When such an archive is extracted with a tar that is unaware of resource forks, two files are created for each file that has a resource fork.

At least GNU tar and Jörg Schilling's star have their own extensions to store sparse files. Both support only a sparse format that is comparable to the Central Sparse format of .yar. Thus, these tar implementations need to read twice through the sparse file being archived, unless the operating system supports a better way to locate the sparse “holes”. Reading through the file twice is slow, and could be avoided by using an encoding like the Scatter Sparse encoding supported by the .yar format.
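
The difference between the two approaches can be sketched in Python. The record layout below is purely hypothetical and is not the .yar encoding (which has no specification yet), and detecting holes by looking for all-zero blocks is an assumption made for this example. The point is only that hole records interleaved with the data can be produced while the file is read once, whereas a map stored in front of the data requires a complete scan before any data can be written.

```python
import io

BLOCK = 4096  # assumed block size for hole detection


def scatter_records(stream, block=BLOCK):
    """Yield ('hole', length) or ('data', bytes) records in a single pass."""
    hole = 0
    while True:
        chunk = stream.read(block)
        if not chunk:
            break
        if chunk.count(0) == len(chunk):  # all-zero block: extend the hole
            hole += len(chunk)
            continue
        if hole:
            yield ("hole", hole)
            hole = 0
        yield ("data", chunk)
    if hole:
        yield ("hole", hole)


def central_map(stream, block=BLOCK):
    """First pass of a two-pass scheme: (offset, length) of each data extent."""
    extents, offset = [], 0
    for kind, payload in scatter_records(stream, block):
        length = payload if kind == "hole" else len(payload)
        if kind == "data":
            if extents and sum(extents[-1]) == offset:
                extents[-1] = (extents[-1][0], extents[-1][1] + length)
            else:
                extents.append((offset, length))
        offset += length
    return extents


sample = b"A" * BLOCK + b"\0" * (2 * BLOCK) + b"B" * 100
records = list(scatter_records(io.BytesIO(sample)))
extents = central_map(io.BytesIO(sample))
```

After central_map has consumed the whole input, an archiver using a central map would still have to read the data extents again to store them, which is the second pass mentioned above.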

GNU tar and star provide extensions for incremental backups. The format used by star uses less space, but it supports incremental backups only partially, since it cannot indicate which files are obsolete and should be deleted. The basic idea of the incremental backup support in the .yar format has been borrowed from the GNU tar format.

Most of the “advanced” extensions of the .tar format work by creating “fake” files in the archive. This allows good backwards compatibility with old tools that don't support the new extensions. At the same time, these kinds of extensions waste quite a bit of (uncompressed) space, which is significant even if the .tar file is compressed. For example, storing a one-byte file in the PAX format uses 2048 bytes of uncompressed space: 1024 bytes for backwards compatibility, 512 for the PAX header, and 512 for the actual data block.
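
The arithmetic behind this example can be written out as a small Python sketch. The block counts for PAX are taken from the paragraph above (1024 + 512 bytes of header overhead); the comparison with a plain ustar entry assumes a single 512-byte header block, and the end-of-archive marker is ignored in both cases.

```python
BLOCK = 512  # tar stores headers and data in 512-byte blocks


def padded(size: int) -> int:
    """Round a data size up to a whole number of blocks."""
    return -(-size // BLOCK) * BLOCK


def stored_size(data_len: int, header_blocks: int) -> int:
    """Bytes used by one archive member: header blocks plus padded data."""
    return header_blocks * BLOCK + padded(data_len)


pax_one_byte = stored_size(1, header_blocks=3)    # 1024 + 512 bytes of headers
ustar_one_byte = stored_size(1, header_blocks=1)  # a single header block
```

Under these assumptions the one-byte file takes 2048 bytes in PAX and 1024 bytes in ustar, so the extended headers double the space used by every small file.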

.cpio

There are several variants of the .cpio file format. They are incompatible with each other, but easy to distinguish by checking the magic bytes.
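
A minimal Python sketch of such a check is shown below. The magic values are the commonly documented ones (the octal number 070707, either as a 16-bit binary value or as ASCII digits, and the ASCII strings 070701 and 070702 for the newer headers); the function name and the returned labels are made up for this example.

```python
def cpio_variant(head: bytes):
    """Guess the .cpio variant from the first bytes of the archive."""
    if head[:6] == b"070701":
        return "newc"
    if head[:6] == b"070702":
        return "newc with checksum"
    if head[:6] == b"070707":
        return "portable ASCII (odc)"
    # Old binary format: 16-bit value 0o070707 (0x71C7) in either byte order.
    if head[:2] == b"\xc7\x71":
        return "old binary, little-endian"
    if head[:2] == b"\x71\xc7":
        return "old binary, big-endian"
    return None
```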

The .cpio variants support storing only a basic set of file attributes. There is no support for efficient handling of sparse files, and no support for incremental backups or multi-volume archives.

The advantage of the .cpio format is that its headers are quite small. It is also easy to parse if support for hard links is not needed.

Nowadays the most popular uses for the .cpio format are in the RPM package manager and Linux initramfs.

.zip

The .zip format is probably the most popular archive format on Windows and DOS operating systems. On other systems it has become more popular via the OpenDocument format, which uses the .zip format with different filename suffixes.

The .zip format supports not only archiving itself, but also compression and encryption. The files are compressed independently, which guarantees fast random access reading, but the lack of solid compression prevents getting maximum compression ratios.

The .zip format has a central index at the end of the archive. Information about archived files is also stored between the actual file data. This makes recovery possible in case the index gets corrupted.

Creating .zip files is streamable. Secure extraction isn't streamable, because the archive index is at the end of the archive. Streamed extraction is insecure because, if the user first checked the file list from the index and then tried to extract the archive in streamed mode, the files actually being extracted could have completely different names from what the index claims; the index cannot be verified until it is too late. For this reason, the .zip format is not considered to be really streamable when reading the archive.
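
The following Python sketch illustrates that the two copies of a filename are independent. It uses the standard zipfile module to read the index and parses the fixed 30-byte part of each local file header by hand. The function name is made up for this example, and for simplicity it assumes the names are ASCII or UTF-8.

```python
import io
import struct
import zipfile

# Fixed part of a local file header: signature, five 16-bit fields,
# three 32-bit fields (CRC and sizes), then name length and extra length.
LOCAL_HEADER = struct.Struct("<4s5H3L2H")


def mismatched_names(raw: bytes):
    """Return (index_name, local_name) pairs where the two copies disagree."""
    bad = []
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        for info in zf.infolist():
            fields = LOCAL_HEADER.unpack_from(raw, info.header_offset)
            if fields[0] != b"PK\x03\x04":
                raise ValueError("no local file header at the recorded offset")
            start = info.header_offset + LOCAL_HEADER.size
            local = raw[start:start + fields[9]].decode("utf-8", "replace")
            if local != info.filename:
                bad.append((info.filename, local))
    return bad


# Build a small archive, then rewrite only the first (local) copy of the name.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("safe.txt", b"hi")
good = buf.getvalue()
tampered = good.replace(b"safe.txt", b"evil.txt", 1)
```

Here mismatched_names(tampered) reports that the index still says safe.txt while the local header says evil.txt. A tool with random access can make this comparison before extracting anything; a streaming extractor sees the local header first and reaches the index only after the data has already been written.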

.7z

The .7z format is the native format of 7-Zip.

The central index is stored at the end of the archive, which makes reading .7z files non-streamable. Writing the .7z format is not streamable either, because the archiver tool must be able to update some information at the beginning of the archive at the end of the archiving process.

Extensible Archive (.xar)

The XAR format is an archiving-compression format like .zip in the sense that all the files are compressed independently. The advantages and disadvantages of this are the same as for the .zip format.

XAR files are almost always streamable. It is possible to create archives that are not streamable, but this is useful only if there are duplicate files on the disk that are not hard linked. The command line tool doesn't create non-streamable archives unless a special command line option is given.

XAR uses XML to store the index of the archived files. While some people see XML as a somewhat bloated solution for such a simple problem as an archive format header, it is very flexible and allows storing an arbitrary set of metadata such as ACLs and extended attributes.

Disk Archive (.dar)

Writing .dar files is streamable unless a multi-volume archive is created: the archiving tool needs to update the beginning of each volume once the end of the volume has been written. Reading .dar files is not streamable, but the dar tools provide good workarounds that are sufficient in many real-world situations.

Summary

Descriptions of the tabulated features:

  • Archiving only indicates whether the file format is strictly an archiving format. Some formats also support compression and encryption.
  • The format of the header doesn't usually matter, but a human-readable header can be easier to debug and repair in case of problems.
  • Using Unicode to store filenames ensures that the filenames will show up correctly when the archive is moved between different types of systems.
  • A central index makes it fast to list the archive contents and to locate arbitrary files in the archive.
  • Streamability means that the archive can be read and written in a single pass without doing any seeks (without random access). In the POSIX world, streamability is often an essential feature.
  • Some file systems support multiple data streams (forks) in a single file. On NTFS, forks are known as Alternate Data Streams (ADS), and on HFS+ they are called Resource Forks. In the table below, support for forks indicates whether the archive format can store multiple forks per file.
  • 1-pass sparse file support makes it fast to store sparse files in an archive even when the operating system doesn't support a fast way to get information about sparse holes.
  • 2-pass sparse file support wastes less space than the 1-pass method, but it is slow if the operating system doesn't support a fast way to get information about sparse holes, because the archiver will need to read through the file twice.
  • Incremental backup support makes it possible to use the archive format for large scale backups. This feature requires that the archive format can store information about renamed, moved, and deleted files.
  • Multi-volume support allows splitting the archive across multiple (usually physical) media.
  • Independent volumes are useful when one of the volumes gets corrupted: each volume can be extracted independently, thus only the data on the corrupted volume is lost.

Format         Archiving only   Header       Unicode filenames   Central index
.yar           yes              UTF-8 text   yes (UTF-8)         usually
.tar (ustar)   yes              Binary       no (raw 8-bit)      no
.tar (PAX)     yes              Binary       yes (UTF-8)         no
.cpio (newc)   yes              ASCII text   no (raw 8-bit)      no
.xar           no               Binary+XML   yes                 yes
.dar           no               Binary       no (raw 8-bit)      yes
.zip           no               Binary       yes (*)             yes
.7z            no               Binary       yes (UTF-16LE)      yes

(*) This feature was added in 2006, so probably only a few tools support it yet. Earlier the .zip format supported only raw 8-bit filenames.

Format  Streamable           Forks    1-pass sparse   2-pass sparse
.yar    yes                  yes      yes             yes
.tar    yes                  no (*)   no              GNU, star
.cpio   yes                  no       no              no
.xar    usually              yes      no              no
.dar    writing usually is   ?        ?               ?
.zip    writing only         no       no              no
.7z     no                   no       no              no

(*) Some tar archiver tools can store forks by storing the fork as another regular file into the archive. This is just a hack in the archiver tool, not a feature of the archive format, because in the worst case situation the filename used for the fork could conflict with another file.

Format  Incremental backups   Multi-volume   Volumes are independent
.yar    yes                   yes            yes
.tar    GNU                   GNU, star      yes
.cpio   no                    no             -
.xar    no                    no             -
.dar    yes                   yes            no
.zip    no                    yes?           no?
.7z     no                    no (*)         no

(*) 7-Zip supports raw splitting of .7z archives. While this is effectively a multi-volume archive, the feature is not part of the archive format, but of the archiving tool (a similar thing could be done with any archive format).

Format         Owner and group names   UID and GID   ACLs and EAs
.yar           yes                     yes           yes
.tar (ustar)   yes                     yes           no
.tar (PAX)     yes                     yes           yes
.cpio          no                      yes           no
.xar           yes                     yes           yes
.dar           no (*)                  yes           yes
.zip           no                      yes           no?
.7z            no                      no            no

(*) The dar tool doesn't store the owner and group names, but probably the file format would support storing them e.g. as extended attributes.

 
yar/comparison.txt · Last modified: 2010/01/01 20:02 (external edit)
 