[Rpm-ecosystem] Proposed zchunk file format - V2

Mon Feb 19 08:06:50 UTC 2018

Neal, thanks for the feedback.  After taking your comments into
consideration, here's version 2.  

+-+-+-+-+-+------------------+-+-+-+-+-+-+-+-+
|    ID   | Compression type |  Index size   |
+-+-+-+-+-+------------------+-+-+-+-+-+-+-+-+

+==================+=================+
| Compressed Index | Compressed Dict |
+==================+=================+

+===========+===========+
|   Chunk   |   Chunk   | ==> More chunks
+===========+===========+

ID
 '\0ZCK1' (5 bytes), identifies the file as a zchunk version 1 file

Compression type
 Type of compression used to compress dict and chunks

 Current values:
   0 - Uncompressed
   2 - zstd

Index size
 This is a 64-bit unsigned integer containing the size of the
 compressed index.

Compressed Index
 This is the index, which is described in the next section.  The index 
 is compressed without a custom dictionary.

Compressed Dict (optional)
 This is a custom dictionary used when compressing each chunk.
 Because each chunk is compressed completely separately from the
 others, the custom dictionary gives us much better overall
 compression.  The custom dictionary is compressed without a custom
 dictionary (for obvious reasons).

Chunk
 This is a chunk of data, compressed with the custom dictionary
 provided above.
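
To make the header layout concrete, here is a rough Python sketch of
reading the fixed-size part of the file.  It is an illustration, not
the reference implementation.  The proposal does not fix a byte order
or an exact width for the compression type, so the sketch assumes a
one-byte compression type and a little-endian index size; the names
parse_header, MAGIC and COMPRESSION are made up for the example.

```python
import struct

MAGIC = b"\0ZCK1"                      # 5-byte ID from the layout above
COMPRESSION = {0: "uncompressed", 2: "zstd"}


def parse_header(buf):
    """Return (compression type, index size, offset where the index starts)."""
    if buf[:len(MAGIC)] != MAGIC:
        raise ValueError("not a zchunk version 1 file")
    # ASSUMPTION: 1-byte compression type followed by a little-endian
    # 64-bit unsigned index size; the proposal does not specify either.
    comp_type, index_size = struct.unpack_from("<BQ", buf, len(MAGIC))
    if comp_type not in COMPRESSION:
        raise ValueError("unknown compression type %d" % comp_type)
    index_start = len(MAGIC) + struct.calcsize("<BQ")
    return comp_type, index_size, index_start
```

Under those assumptions the compressed index occupies bytes
index_start to index_start + index_size, and the dict (if any) and the
chunks follow immediately after it.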

The index:

+---------------+======================+
| Checksum type | Checksum of all data |
+---------------+======================+

+================+-+-+-+-+-+-+-+-+
| Dict checksum  |  End of dict  |
+================+-+-+-+-+-+-+-+-+

+================+-+-+-+-+-+-+-+-+
| Chunk checksum | End of chunk  |  ==> More
+================+-+-+-+-+-+-+-+-+

Checksum type
 This is the type of checksum used to generate the checksums in the 
 index.

 Current values:
   0 - SHA-256

Checksum of all data
 This is the checksum of the compressed dict and all the compressed
 chunks, used to verify that the assembled file really is identical
 to the original, even in the unlikely event of a hash collision for
 one of the chunks.

Dict checksum
 This is the checksum of the compressed dict, used to detect whether 
 two dicts are identical.  If there is no dict, the checksum must be
 all zeros.

End of dict
 This is the location of the end of the dict, measured from the end
 of the index.  This gives us the information we need to find and
 decompress the dict.  If there is no dict, this value must be zero.

Chunk checksum
 This is the checksum of the compressed chunk, used to detect whether 
 any two chunks are identical.

End of chunk
 This is the location of the end of the chunk, measured from the end
 of the index.  Each chunk starts where the previous entry (the dict
 or the preceding chunk) ends, so this gives us the information we
 need to find and decompress each chunk.
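
As a sketch of how the index could be consumed, the Python below walks
an already-decompressed index, turns the "end of ..." offsets into
(start, end) ranges, and checks every checksum against the data that
follows the index.  It is only an illustration under assumptions: a
one-byte checksum type, little-endian 64-bit offsets, and 32-byte
SHA-256 digests.  Entry, parse_index and verify are hypothetical names,
not part of the proposal.

```python
import hashlib
import struct
from collections import namedtuple

# Hypothetical helper type; start/end are relative to the end of the index.
Entry = namedtuple("Entry", "checksum start end")

HASHES = {0: hashlib.sha256}           # checksum type 0 - SHA-256


def parse_index(index):
    """Return (hash constructor, checksum of all data, dict entry, chunk entries)."""
    new_hash = HASHES[index[0]]        # ASSUMPTION: 1-byte checksum type
    n = new_hash().digest_size
    total = index[1:1 + n]
    body = index[1 + n:]
    record = n + 8                     # one checksum plus one 64-bit offset
    if not body or len(body) % record:
        raise ValueError("truncated index")
    entries, start = [], 0
    for pos in range(0, len(body), record):
        # ASSUMPTION: offsets are little-endian 64-bit unsigned integers.
        (end,) = struct.unpack_from("<Q", body, pos + n)
        entries.append(Entry(body[pos:pos + n], start, end))
        start = end                    # next entry begins where this one ends
    return new_hash, total, entries[0], entries[1:]


def verify(index, data):
    """Check all checksums; data is everything after the index in the file."""
    new_hash, total, dict_entry, chunks = parse_index(index)
    if new_hash(data).digest() != total:
        return False
    # An all-zero dict checksum means there is no dict to check.
    parts = ([dict_entry] if any(dict_entry.checksum) else []) + chunks
    return all(new_hash(data[e.start:e.end]).digest() == e.checksum
               for e in parts)
```

Because every entry only stores its end, the start of each range falls
out of the previous entry, which is why the entries have to be read in
order.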

The index is designed so that it can be extracted from the file on the
server and downloaded separately, which makes it possible to download
only the parts of the file that are needed.  It must then be
re-embedded when assembling the file, so that the user only needs to
keep one file.
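
To illustrate the partial download idea, here is a small Python sketch
that compares the chunk entries of a local file with those of a freshly
downloaded index and works out which chunks can be copied locally and
which byte ranges still need to be fetched.  It is one possible
approach, not a description of any existing client; Entry and
plan_download are hypothetical names, and the ranges are relative to
the end of the index in the new file.

```python
from collections import namedtuple

# Hypothetical helper type; start/end are relative to the end of the index.
Entry = namedtuple("Entry", "checksum start end")


def plan_download(old_chunks, new_chunks):
    """Return (reusable, ranges): chunks we already have, and ranges to fetch."""
    have = {c.checksum: c for c in old_chunks}
    reusable, ranges = [], []
    for chunk in new_chunks:
        if chunk.checksum in have:
            # Same checksum: copy the compressed bytes from the old file.
            reusable.append((chunk, have[chunk.checksum]))
        elif ranges and ranges[-1][1] == chunk.start:
            # Adjacent to the previous missing chunk: extend that range.
            ranges[-1] = (ranges[-1][0], chunk.end)
        else:
            ranges.append((chunk.start, chunk.end))
    return reusable, ranges
```

Merging adjacent missing chunks keeps the number of ranges small, which
should matter if they are fetched with something like HTTP range
requests.  After assembly, the checksum of all data can be used to
confirm the result.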