[Rpm-ecosystem] Proposed zchunk file format - V4

Jonathan Dieter jdieter at gmail.com
Mon Apr 16 12:47:58 UTC 2018


Here's version four, which swaps fixed-length integers for
variable-length compressed integers.  This lets us skip compressing
the index, since the non-integer data is all incompressible checksums.
I've also added the uncompressed size of each chunk to the index to
make it easier to figure out how much space to allocate for the
uncompressed chunk.

+-+-+-+-+-+====================+=================+=======================+
|   ID    | Checksum type (ci) | Header checksum | Compression type (ci) |
+-+-+-+-+-+====================+=================+=======================+

+=================+=======+=================+
| Index size (ci) | Index | Compressed Dict |
+=================+=======+=================+

+===========+===========+
|   Chunk   |   Chunk   | ==> More chunks
+===========+===========+

(ci)
 Compressed (unsigned) integer - A variable-length little-endian
 integer where the lowest seven bits of the number are stored in the
 first byte, followed by the next seven bits in the next byte, and so
 on.  The top bit of all bytes except the final byte must be zero, and
 the top bit of the final byte must be one, indicating the end of the
 number.
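
 As a rough illustration of the rule above, here is a minimal Python
 sketch of encoding and decoding a compressed integer.  The function
 names encode_ci and decode_ci are made up for this example and are
 not part of the format.

```python
def encode_ci(value: int) -> bytes:
    """Encode a non-negative integer as a compressed integer."""
    if value < 0:
        raise ValueError("compressed integers are unsigned")
    out = bytearray()
    while True:
        low7 = value & 0x7F      # lowest seven bits go out first
        value >>= 7
        if value == 0:
            out.append(low7 | 0x80)  # final byte: top bit is one
            return bytes(out)
        out.append(low7)             # more bytes follow: top bit is zero


def decode_ci(buf: bytes, offset: int = 0) -> tuple:
    """Return (value, offset of the first byte after the integer)."""
    value = 0
    shift = 0
    while True:
        if offset >= len(buf):
            raise ValueError("truncated compressed integer")
        byte = buf[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if byte & 0x80:              # top bit set marks the last byte
            return value, offset
```

 With this sketch, 300 encodes to the two bytes 0x2C 0x82, and any
 value below 128 fits in a single byte with its top bit set.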

ID
 '\0ZCK1', identifies file as zchunk version 1 file

Checksum type
 This is an integer containing the type of checksum used to generate
 the header checksum and the total data checksum, but *not* the chunk
 checksums.

 Current values:
   0 = SHA-1
   1 = SHA-256

Header checksum
 This is the checksum of everything from the beginning of the file
 until the end of the index, computed with the header checksum field
 set to all \0's.
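
 A minimal Python sketch of that calculation follows.  It assumes the
 header checksum field is exactly as long as the digest of the chosen
 checksum type, and that the caller already knows where the field
 starts; header_checksum and its parameters are hypothetical names
 used only for illustration.

```python
import hashlib

# Checksum type values from this proposal, mapped to hashlib constructors.
HASHES = {0: hashlib.sha1, 1: hashlib.sha256}


def header_checksum(header: bytes, checksum_type: int,
                    checksum_offset: int) -> bytes:
    """Hash the header with the checksum field replaced by zero bytes.

    `header` is assumed to hold everything from the start of the file
    through the end of the index.  `checksum_offset` is assumed to be
    the position where the header checksum field begins.
    """
    hasher = HASHES[checksum_type]
    digest_size = hasher().digest_size
    zeroed = bytearray(header)
    # Assumption: the field length equals the digest size.
    zeroed[checksum_offset:checksum_offset + digest_size] = bytes(digest_size)
    return hasher(zeroed).digest()
```

 Because the field is zeroed before hashing, the same function works
 both when writing a file (the field is still empty) and when
 verifying one (the field already holds the stored checksum).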

Compression type
 This is an integer containing the type of compression used to
 compress dict and chunks.

 Current values:
   0 = Uncompressed
   2 = zstd

Index size
 This is an integer containing the size of the index.

Index
 This is the index, which is described in the next section.

Compressed Dict (optional)
 This is a custom dictionary used when compressing each chunk.
 Because each chunk is compressed completely separately from the
 others, the custom dictionary gives us much better overall
 compression.  The custom dictionary is compressed without a custom
 dictionary (for obvious reasons).

Chunk
 This is a chunk of data, compressed with the custom dictionary
 provided above.


The index:

+==========================+==================+===============+
| Chunk checksum type (ci) | Chunk count (ci) | Data checksum |
+==========================+==================+===============+

+===============+==================+===============================+
| Dict checksum | Dict length (ci) | Uncompressed dict length (ci) |
+===============+==================+===============================+

+================+===================+==========================+
| Chunk checksum | Chunk length (ci) | Uncompressed length (ci) | ...
+================+===================+==========================+

Chunk checksum type
 This is an integer containing the type of checksum used to generate
 the chunk checksums.

 Current values:
   0 = SHA-1
   1 = SHA-256

Chunk count
 This is a count of the number of chunks in the zchunk file.

Data checksum
 This is the checksum of everything after the index, including the
 compressed dict and all the compressed chunks.  This checksum is
 generated using the overall checksum type, *not* the chunk checksum
 type.

Dict checksum
 This is the checksum of the compressed dict, used to detect whether
 two dicts are identical.  If there is no dict, the checksum must be
 all zeros.

Dict length
 This is an integer containing the length of the dict.  If there is no
 dict, this must be a zero.

Uncompressed dict length
 This is an integer containing the length of the dict after it has
 been decompressed.  If there is no dict, this must be a zero.

Chunk checksum
 This is the checksum of the compressed chunk, used to detect whether
 any two chunks are identical.

Chunk length
 This is an integer containing the length of the chunk.

Uncompressed length
 This is an integer containing the length of the chunk after it has
 been decompressed.
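
 To show how the index fields fit together, here is a minimal Python
 sketch of a parser for the layout above.  It makes several
 assumptions that are not stated in this proposal: that each checksum
 field is exactly as long as its digest, that the dict checksum uses
 the chunk checksum type, and that the chunk count does not include
 the dict entry.  The names read_ci and parse_index are hypothetical.

```python
import hashlib

# Digest sizes for the checksum type values listed in this proposal.
DIGEST_SIZES = {0: hashlib.sha1().digest_size,
                1: hashlib.sha256().digest_size}


def read_ci(buf, pos):
    """Decode one compressed integer; return (value, next position)."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if byte & 0x80:  # top bit set marks the final byte
            return value, pos


def parse_index(index, overall_checksum_type):
    """Parse the index body into a dictionary of its fields."""
    pos = 0
    chunk_checksum_type, pos = read_ci(index, pos)
    chunk_count, pos = read_ci(index, pos)

    data_size = DIGEST_SIZES[overall_checksum_type]
    chunk_size = DIGEST_SIZES[chunk_checksum_type]

    # The data checksum uses the overall checksum type.
    data_checksum = index[pos:pos + data_size]
    pos += data_size

    # Assumption: the dict checksum uses the chunk checksum type.
    dict_checksum = index[pos:pos + chunk_size]
    pos += chunk_size
    dict_length, pos = read_ci(index, pos)
    dict_uncompressed, pos = read_ci(index, pos)

    chunks = []
    for _ in range(chunk_count):
        checksum = index[pos:pos + chunk_size]
        pos += chunk_size
        length, pos = read_ci(index, pos)
        uncompressed, pos = read_ci(index, pos)
        chunks.append((checksum, length, uncompressed))

    return {
        "chunk_checksum_type": chunk_checksum_type,
        "data_checksum": data_checksum,
        "dict": (dict_checksum, dict_length, dict_uncompressed),
        "chunks": chunks,
    }
```

 The uncompressed length in each entry is what lets a reader allocate
 the output buffer for a chunk before decompressing it.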

The index is designed so that it can be extracted from the file on the
server and downloaded separately, which makes it possible to download
only the parts of the file that are needed.  It must then be
re-embedded when assembling the file, so the user only needs to keep
one file.
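
As a sketch of how a client might use a separately downloaded index,
the Python below works out which byte ranges still need to be fetched.
It assumes the chunks follow the compressed dict back to back in index
order, as in the layout at the top, and reports offsets relative to
the end of the index.  The function missing_ranges and its arguments
are hypothetical and only meant to illustrate the idea.

```python
def missing_ranges(dict_length, chunks, have):
    """Return (offset, length) pairs for chunks not already held.

    `chunks` is a list of (checksum, length, uncompressed_length)
    tuples in index order.  `have` is a set of chunk checksums that
    are already available locally.  Offsets are counted from the first
    byte after the index.
    """
    ranges = []
    offset = dict_length  # chunks start right after the compressed dict
    for checksum, length, _uncompressed in chunks:
        if checksum not in have:
            ranges.append((offset, length))
        offset += length
    return ranges
```

A client could turn each pair into an HTTP range request once it adds
the absolute position of the end of the index.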

