[Rpm-ecosystem] Proposed zchunk file format - V4
Jonathan Dieter
jdieter at gmail.com
Mon Apr 16 12:47:58 UTC 2018
Here's version four, which swaps fixed-length integers for variable-
length compressed integers. That lets us skip compression of the
index (since the non-integer data is all incompressible checksums).
I've also added the uncompressed size of each chunk to the index to
make it easier to figure out how much space to allocate for the
uncompressed chunk.
+-+-+-+-+-+====================+=================+========================+
|   ID    | Checksum type (ci) | Header checksum | Compression type (ci)  |
+-+-+-+-+-+====================+=================+========================+
+=================+=======+=================+
| Index size (ci) | Index | Compressed Dict |
+=================+=======+=================+
+===========+===========+
|   Chunk   |   Chunk   | ==> More chunks
+===========+===========+
(ci)
Compressed (unsigned) integer - A variable-length little-endian
integer where the first seven bits of the number are stored in the
first byte, followed by the next seven bits in the next byte, and so
on. The top bit of all bytes except the final byte must be zero, and
the top bit of the final byte must be one, indicating the end of the
number.
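A minimal sketch of the (ci) encoding described above, in Python. The
function names encode_ci and decode_ci are made up for illustration and
are not part of the proposal. Note that the end marker is the reverse
of LEB128-style varints: here the top bit is set on the *final* byte.

```python
from typing import Tuple


def encode_ci(value: int) -> bytes:
    """Encode a non-negative integer as a compressed integer (ci)."""
    if value < 0:
        raise ValueError("compressed integers are unsigned")
    out = bytearray()
    while True:
        low7 = value & 0x7F          # next seven bits, least significant first
        value >>= 7
        if value == 0:
            out.append(low7 | 0x80)  # final byte: top bit set
            return bytes(out)
        out.append(low7)             # non-final byte: top bit clear


def decode_ci(buf: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode one ci from buf at offset; return (value, next offset)."""
    value = 0
    shift = 0
    pos = offset
    while True:
        if pos >= len(buf):
            raise ValueError("truncated compressed integer")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if byte & 0x80:              # top bit set marks the last byte
            return value, pos
```

With this sketch, 300 encodes to the two bytes 0x2C 0x82, and any value
below 128 takes a single byte.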
ID
'\0ZCK1', identifies file as zchunk version 1 file
Checksum type
This is an 8-bit unsigned integer containing the type of checksum
used to generate the header checksum and the total data checksum, but
*not* the chunk checksums.
Current values:
0 = SHA-1
1 = SHA-256
Header checksum
This is the checksum of everything from the beginning of the file
until the end of the index, calculated with the header checksum field
set to all \0's.
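As a rough sketch of that calculation in Python: zero out the checksum
field, then hash the whole header. The function name, its parameters,
and the assumption that the field holds the full digest (20 bytes for
SHA-1, 32 for SHA-256) are mine, not part of the proposal. The caller
is expected to know where the checksum field starts.

```python
import hashlib

# Checksum type values from the proposal mapped to hashlib constructors.
DIGESTS = {0: hashlib.sha1, 1: hashlib.sha256}


def header_checksum(header: bytes, checksum_offset: int,
                    checksum_type: int) -> bytes:
    """Hash the header (file start through end of index) with the
    header checksum field replaced by \\0 bytes.

    Assumes the field is exactly one full digest long.
    """
    new_hash = DIGESTS[checksum_type]
    size = new_hash().digest_size
    zeroed = (header[:checksum_offset]
              + b"\0" * size
              + header[checksum_offset + size:])
    return new_hash(zeroed).digest()
```

To verify a file, compare the returned digest with the bytes stored in
the header checksum field.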
Compression type
This is an integer containing the type of compression used to
compress dict and chunks.
Current values:
0 = Uncompressed
2 = zstd
Index size
This is an integer containing the size of the index.
Index
This is the index, which is described in the next section.
Compressed Dict (optional)
This is a custom dictionary used when compressing each chunk.
Because each chunk is compressed completely separately from the
others, the custom dictionary gives us much better overall
compression. The custom dictionary is compressed without a custom
dictionary (for obvious reasons).
Chunk
This is a chunk of data, compressed with the custom dictionary
provided above.
The index:
+==========================+==================+===============+
| Chunk checksum type (ci) | Chunk count (ci) | Data checksum |
+==========================+==================+===============+
+===============+==================+===============================+
| Dict checksum | Dict length (ci) | Uncompressed dict length (ci) |
+===============+==================+===============================+
+================+===================+==========================+
| Chunk checksum | Chunk length (ci) | Uncompressed length (ci) | ...
+================+===================+==========================+
Chunk checksum type
This is an integer containing the type of checksum used to generate
the chunk checksums.
Current values:
0 = SHA-1
1 = SHA-256
Chunk count
This is a count of the number of chunks in the zchunk file.
Data checksum
This is the checksum of everything after the index, including the
compressed dict and all the compressed chunks. This checksum is
generated using the overall checksum type, *not* the chunk checksum
type.
Dict checksum
This is the checksum of the compressed dict, used to detect whether
two dicts are identical. If there is no dict, the checksum must be
all zeros.
Dict length
This is an integer containing the length of the dict. If there is no
dict, this must be a zero.
Uncompressed dict length
This is an integer containing the length of the dict after it has
been decompressed. If there is no dict, this must be a zero.
Chunk checksum
This is the checksum of the compressed chunk, used to detect whether
any two chunks are identical.
Chunk length
This is an integer containing the length of the chunk.
Uncompressed length
This is an integer containing the length of the chunk after it has
been decompressed.
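The index layout above could be parsed along these lines. This is a
sketch under several assumptions that the proposal does not spell out:
every checksum field holds a full digest (20 bytes for SHA-1, 32 for
SHA-256), the dict checksum uses the chunk checksum type, and the dict
entry is not included in the chunk count. All names are hypothetical.

```python
import hashlib
from typing import List, NamedTuple, Tuple

# Assumed field widths: full digest size for each checksum type value.
DIGEST_SIZE = {0: hashlib.sha1().digest_size,
               1: hashlib.sha256().digest_size}


class Entry(NamedTuple):
    checksum: bytes
    length: int               # compressed length
    uncompressed_length: int


def read_ci(buf: bytes, pos: int) -> Tuple[int, int]:
    """Read one compressed integer; the top bit marks the final byte."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if byte & 0x80:
            return value, pos


def read_entry(buf: bytes, pos: int, size: int) -> Tuple[Entry, int]:
    """Read checksum, length (ci) and uncompressed length (ci)."""
    checksum = buf[pos:pos + size]
    pos += size
    length, pos = read_ci(buf, pos)
    uncompressed, pos = read_ci(buf, pos)
    return Entry(checksum, length, uncompressed), pos


def parse_index(index: bytes, overall_type: int):
    """Return (chunk checksum type, data checksum, dict entry, chunks)."""
    chunk_type, pos = read_ci(index, 0)
    count, pos = read_ci(index, pos)
    # The data checksum uses the overall checksum type, not the chunk type.
    data_size = DIGEST_SIZE[overall_type]
    data_checksum = index[pos:pos + data_size]
    pos += data_size
    chunk_size = DIGEST_SIZE[chunk_type]
    dict_entry, pos = read_entry(index, pos, chunk_size)
    chunks: List[Entry] = []
    for _ in range(count):
        entry, pos = read_entry(index, pos, chunk_size)
        chunks.append(entry)
    return chunk_type, data_checksum, dict_entry, chunks
```

The uncompressed lengths let a reader allocate each output buffer
before decompressing the corresponding chunk.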
The index is designed so that it can be extracted from the file on the
server and downloaded separately, which makes it possible to download
only the parts of the file that are needed. It must then be re-embedded
when assembling the file, so the user only needs to keep one file.
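To illustrate how a separately downloaded index could drive a partial
download, here is a sketch that turns the chunk lengths into byte
ranges for the chunks a client does not already have. The function name
and the tuple layout are hypothetical, and offsets are relative to the
first byte after the index (the start of the compressed dict); the
proposal itself does not define a download protocol.

```python
from typing import Iterable, List, Set, Tuple


def missing_ranges(dict_length: int,
                   chunks: Iterable[Tuple[bytes, int]],
                   have: Set[bytes]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) byte ranges of chunks to fetch.

    chunks is a sequence of (chunk checksum, compressed length) pairs
    in index order; have holds checksums of chunks already available
    locally. Adjacent missing chunks are merged into one range.
    """
    offset = dict_length  # chunks start right after the compressed dict
    ranges: List[Tuple[int, int]] = []
    for checksum, length in chunks:
        if checksum not in have and length > 0:
            if ranges and ranges[-1][1] == offset:
                # Extend the previous range instead of starting a new one.
                ranges[-1] = (ranges[-1][0], offset + length)
            else:
                ranges.append((offset, offset + length))
        offset += length
    return ranges
```

Merging adjacent ranges keeps the number of separate requests down when
several consecutive chunks have changed.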