[Rpm-ecosystem] Proposed zchunk file format

Sat Feb 17 11:17:37 UTC 2018

On Fri, Feb 16, 2018 at 1:52 PM, Jonathan Dieter <jdieter at gmail.com> wrote:
> So here's my proposed file format for the zchunk file.  Should I add
> some flags to facilitate possible different compression formats?
>

I think it would be smart to make sure other compression formats are
easy to add if they are ever needed, so flags to facilitate that would
be a good idea.

> +-+-+-+-+-+-+-+-+-+-+-+-+==================+=================+
> |  ID   |  Index size   | Compressed Index | Compressed Dict |
> +-+-+-+-+-+-+-+-+-+-+-+-+==================+=================+
>
> +===========+===========+
> |   Chunk   |   Chunk   | ==> More chunks
> +===========+===========+
>
> ID
>  '\0ZCK', identifies file as zchunk file
>
> Index size
>  This is a 64-bit unsigned integer containing the size of the
>  compressed index.
>
> Compressed Index
>  This is the index, which is described in the next section.  The index
>  is compressed using standard zstd compression without a custom
>  dictionary.
>
> Compressed Dict
>  This is a custom dictionary used when compressing each chunk.
>  Because each chunk is compressed completely separately from the
>  others, the custom dictionary gives us much better overall
>  compression.  The custom dictionary is compressed using standard zstd
>  compression without using a separate custom dictionary (for obvious
>  reasons).
>
> Chunk
>  This is a chunk of data, compressed using zstd with the custom
>  dictionary provided above.
>
>
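
To check my reading of the layout above, here is a rough sketch of how
a reader might locate the compressed index. The byte order is my
assumption (little-endian); the proposal only says "64-bit unsigned
integer". parse_header is a made-up name, not part of any existing
tool.

```python
import struct

MAGIC = b"\0ZCK"  # the ID field from the proposal


def parse_header(data: bytes):
    """Return (index_size, index_start, index_end) for a zchunk blob."""
    if data[:4] != MAGIC:
        raise ValueError("not a zchunk file")
    # Assumption: little-endian. The proposal does not state a byte order.
    (index_size,) = struct.unpack_from("<Q", data, 4)
    index_start = 4 + 8  # 4-byte ID followed by the 8-byte index size
    index_end = index_start + index_size
    if index_end > len(data):
        raise ValueError("truncated index")
    return index_size, index_start, index_end


# A fake file with a 5-byte stand-in for the compressed index.
sample = MAGIC + struct.pack("<Q", 5) + b"INDEX" + b"dict and chunks"
print(parse_header(sample))  # (5, 12, 17)
```

The compressed dict would then start at index_end, which is also the
"0" that the index offsets are measured from.
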
> The index:
>
> +++++++++++++++++++++++++++++++-+-+-+-+-+-+-+-+
> |          sha256sum          |  End of dict  |
> +++++++++++++++++++++++++++++++-+-+-+-+-+-+-+-+
>
> +++++++++++++++++++++++++++++++-+-+-+-+-+-+-+-+
> |          sha256sum          | End of chunk  |  ==> More
> +++++++++++++++++++++++++++++++-+-+-+-+-+-+-+-+
>
> sha256sum of compressed dict
>  This is a binary sha256sum of the compressed dict, used to detect
>  whether two dicts are identical.
>
> End of dict
>  This is the location of the end of the dict with 0 being the end of
>  the index.  This gives us the information we need to find and
>  decompress the dict.
>
> sha256sum of compressed chunk
>  This is a binary sha256sum of the compressed chunk, used to detect
>  whether any two chunks are identical.
>

I suggest adding a field that indicates which kind of checksum is
used. If the algorithm ever has to change, readers need a way to tell
which one a given file was written with.
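
Something like the following is what I have in mind. The type IDs and
the names CHECKSUM_TYPES and checksum_for are invented for
illustration; the proposal does not define any of them.

```python
import hashlib

# Hypothetical type IDs. The proposal currently hard-codes sha256.
CHECKSUM_TYPES = {0: "sha256", 1: "sha512"}


def checksum_for(type_id: int, payload: bytes) -> bytes:
    """Hash payload with the algorithm named by the type ID."""
    try:
        name = CHECKSUM_TYPES[type_id]
    except KeyError:
        raise ValueError(f"unknown checksum type {type_id}") from None
    return hashlib.new(name, payload).digest()


def digest_size(type_id: int) -> int:
    """Bytes per checksum, so a reader knows the width of index entries."""
    return hashlib.new(CHECKSUM_TYPES[type_id]).digest_size


print(digest_size(0), digest_size(1))  # 32 64
```

With a type field in the index, a reader that sees an unknown ID can
fail cleanly instead of misreading the entry width.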

> End of chunk
>  This is the location of the end of the chunk with 0 being the end of
>  the index.  This gives us the information we need to find and
>  decompress each chunk.
>
>
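
For what it's worth, this is how I read the index once it has been
decompressed: a run of fixed-size entries, the first one describing the
dict. The sketch assumes little-endian 64-bit end offsets and 32-byte
sha256 digests packed with no padding; neither detail is spelled out in
the proposal, and parse_index is a made-up name.

```python
import struct

# Assumption: 32-byte binary sha256 followed by a little-endian uint64.
ENTRY = struct.Struct("<32sQ")


def parse_index(raw: bytes):
    """Turn the decompressed index into (digest, start, end) tuples.

    Offsets are relative to the end of the index (offset 0), as in the
    proposal. Entry 0 is the dict; the rest are chunks.
    """
    if len(raw) % ENTRY.size:
        raise ValueError("index length is not a whole number of entries")
    entries = []
    start = 0
    for digest, end in ENTRY.iter_unpack(raw):
        if end < start:
            raise ValueError("end offsets must not decrease")
        entries.append((digest, start, end))
        start = end  # the next item begins where this one ends
    return entries


raw = (ENTRY.pack(b"d" * 32, 100)    # dict:    bytes 0..100
       + ENTRY.pack(b"a" * 32, 250)  # chunk 0: bytes 100..250
       + ENTRY.pack(b"b" * 32, 400)) # chunk 1: bytes 250..400
for digest, start, end in parse_index(raw):
    print(digest[:1], start, end)
```

Since only end offsets are stored, each item's start is the previous
item's end, and the dict starts at 0.
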
> The index is designed to be able to be extracted from the file on the
> server and downloaded separately, to facilitate downloading only the
> parts of the file that are needed, but must then be re-embedded when
> assembling the file so the user only needs to keep one file.
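
As I understand the download side, a client would fetch the new index
first, compare its checksums against the chunks it already has, and
request only the missing byte ranges. A minimal sketch, with made-up
names (plan_download) and short stand-in digests in the sample data:

```python
def plan_download(remote_entries, local_digests):
    """Return the (start, end) byte ranges that still need fetching.

    remote_entries: (digest, start, end) tuples from the new index.
    local_digests:  set of digests for chunks already held locally.
    Adjacent missing ranges are merged to keep the request count down.
    """
    needed = []
    for digest, start, end in remote_entries:
        if digest in local_digests:
            continue  # identical chunk already present, reuse it
        if needed and needed[-1][1] == start:
            needed[-1] = (needed[-1][0], end)  # extend the previous range
        else:
            needed.append((start, end))
    return needed


remote = [(b"d", 0, 100), (b"a", 100, 250), (b"b", 250, 400), (b"c", 400, 520)]
print(plan_download(remote, {b"d", b"b"}))  # [(100, 250), (400, 520)]
print(plan_download(remote, {b"d"}))        # [(100, 520)]
```

The ranges are relative to the end of the index, so a client would add
that position before issuing range requests.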

Overall, it looks pretty good to me.

-- 
真実はいつも一つ！/ Always, there's only one truth!