[Rpm-ecosystem] Proposed zchunk file format
Neal Gompa
ngompa13 at gmail.com
Sat Feb 17 11:17:37 UTC 2018
On Fri, Feb 16, 2018 at 1:52 PM, Jonathan Dieter <jdieter at gmail.com> wrote:
> So here's my proposed file format for the zchunk file. Should I add
> some flags to facilitate possible different compression formats?
>
I think it'd be smart to make sure that if other compression formats
were needed, it would be easy to implement. So flags for facilitating
that would be a good idea.
> +-+-+-+-+-+-+-+-+-+-+-+-+==================+=================+
> | ID | Index size | Compressed Index | Compressed Dict |
> +-+-+-+-+-+-+-+-+-+-+-+-+==================+=================+
>
> +===========+===========+
> | Chunk | Chunk | ==> More chunks
> +===========+===========+
>
> ID
> '\0ZCK', identifies file as zchunk file
>
> Index size
> This is a 64-bit unsigned integer containing the size of the
> compressed index.
>
> Compressed Index
> This is the index, which is described in the next section. The index
> is compressed using standard zstd compression without a custom
> dictionary.
>
> Compressed Dict
> This is a custom dictionary used when compressing each chunk.
> Because each chunk is compressed completely separately from the
> others, the custom dictionary gives us much better overall
> compression. The custom dictionary is compressed using standard zstd
> compression without using a separate custom dictionary (for obvious
> reasons).
>
> Chunk
> This is a chunk of data, compressed using zstd with the custom
> dictionary provided above.
>
>
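To check my reading of the layout above, here's a rough sketch of how a
reader might locate the compressed index. The little-endian byte order
and the parse_header name are my assumptions; the proposal only says
"64-bit unsigned integer". Decompression is left out on purpose.

```python
import struct

MAGIC = b"\0ZCK"   # ID field from the proposal
HEADER_SIZE = 12   # 4-byte ID + 8-byte index size

def parse_header(data):
    """Return (index_size, index_start, index_end) for a zchunk blob.

    Assumption: the index size is stored little-endian. The proposal
    does not specify a byte order.
    """
    if len(data) < HEADER_SIZE:
        raise ValueError("truncated header")
    if data[:4] != MAGIC:
        raise ValueError("not a zchunk file")
    (index_size,) = struct.unpack_from("<Q", data, 4)
    index_start = HEADER_SIZE
    index_end = index_start + index_size
    return index_size, index_start, index_end
```

The compressed dict would then start at index_end, which is the "0"
position that the index offsets are measured from.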
> The index:
>
> +++++++++++++++++++++++++++++++-+-+-+-+-+-+-+-+
> | sha256sum | End of dict |
> +++++++++++++++++++++++++++++++-+-+-+-+-+-+-+-+
>
> +++++++++++++++++++++++++++++++-+-+-+-+-+-+-+-+
> | sha256sum | End of chunk | ==> More
> +++++++++++++++++++++++++++++++-+-+-+-+-+-+-+-+
>
> sha256sum of compressed dict
> This is a binary sha256sum of the compressed dict, used to detect
> whether two dicts are identical.
>
> End of dict
> This is the location of the end of the dict with 0 being the end of
> the index. This gives us the information we need to find and
> decompress the dict.
>
> sha256sum of compressed chunk
> This is a binary sha256sum of the compressed chunk, used to detect
> whether any two chunks are identical.
>
I suggest you add a field that indicates what kind of checksum it is.
If the algorithm ever has to be changed for whatever reason, we need a
way to tell which checksum format a file is using.
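Something like a small type code in front of the checksums would do.
As a purely hypothetical sketch (the codes below are made up for
illustration, nothing in the proposal defines them):

```python
import hashlib

# Hypothetical type codes; the proposed format does not define any.
CHECKSUM_TYPES = {0: "sha1", 1: "sha256"}

def digest_size(type_code):
    """Number of bytes a reader should expect for this checksum type."""
    return hashlib.new(CHECKSUM_TYPES[type_code]).digest_size

def make_checksum(type_code, data):
    """Compute the binary digest of data with the selected algorithm."""
    return hashlib.new(CHECKSUM_TYPES[type_code], data).digest()
```

With that, a reader knows how wide each checksum field is before it
starts walking the index, and an unknown code can be rejected cleanly.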
> End of chunk
> This is the location of the end of the chunk with 0 being the end of
> the index. This gives us the information we need to find and
> decompress each chunk.
>
>
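For what it's worth, this is how I picture walking the decompressed
index. It's only a sketch: I'm assuming each entry is a 32-byte sha256
followed by an 8-byte little-endian end offset, and that the first
entry is the dict. The byte order and the parse_index name are my
assumptions, not part of the proposal.

```python
import struct

# Assumed entry layout: 32-byte sha256 digest + 64-bit end offset.
ENTRY = struct.Struct("<32sQ")

def parse_index(raw):
    """Turn a decompressed index into (digest, start, end) tuples.

    Offsets are relative to the end of the index (0), so each entry
    starts where the previous one ended. Entry 0 is the dict.
    """
    if len(raw) % ENTRY.size:
        raise ValueError("index length is not a whole number of entries")
    entries = []
    start = 0
    for digest, end in ENTRY.iter_unpack(raw):
        if end < start:
            raise ValueError("end offsets must not decrease")
        entries.append((digest, start, end))
        start = end
    return entries
```

Since only end offsets are stored, the size of each chunk is just
end - start of the resulting tuple.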
> The index is designed to be able to be extracted from the file on the
> server and downloaded separately, to facilitate downloading only the
> parts of the file that are needed, but must then be re-embedded when
> assembling the file so the user only needs to keep one file.
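If I understand the intent, a client would compare the downloaded
index against the one from the file it already has and fetch only the
byte ranges it's missing. A minimal sketch, assuming both indexes have
already been parsed into (digest, start, end) tuples (that tuple shape
and the function name are mine, for illustration only):

```python
def ranges_to_fetch(local_entries, remote_entries):
    """Return (start, end) ranges of remote chunks we don't have locally.

    Each entry is a (digest, start, end) tuple, with offsets relative
    to the end of the index as in the proposal.
    """
    have = {digest for digest, _, _ in local_entries}
    return [(start, end)
            for digest, start, end in remote_entries
            if digest not in have]
```

The resulting ranges could then be turned into whatever partial
download requests the transport supports.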
Overall, it looks pretty good to me.
--
真実はいつも一つ!/ Always, there's only one truth!