[Rpm-ecosystem] Some points about zchunk

Jonathan Dieter jdieter at gmail.com
Thu Jul 5 17:07:58 UTC 2018


Michael, thank you so much for your detailed review!  I really
appreciate the time you took to look at this in such detail!

I'm currently waiting to board a flight, so I'll make this brief and
I'll probably be unavailable until Monday.

Comments inline

On Thu, 2018-07-05 at 14:18 +0000, Michael Schroeder wrote:
> Hi,
> 
> here are some of my thoughts about Jonathan's zchunk compression:
<snip>
> 
> Thoughts:
> ---------
> The basic algorithms and implementation are sound and work nicely.
> Kudos to Jonathan for doing such an amazing job.

Coming from you, this means a lot to me!  Thank you so much!

> Here's some points I have: (Please correct me if I'm wrong anywhere)
> 
>  1) The current implementation can't reuse chunks when the dictionary
>     changes. That's a rather big limitation. A dictionary is a must
>     if we want to go with small chunks.
> 
>     We can also go with no dictionary and large chunks; this is somewhat
>     the zchunk default. For the example above the buzhash algorithm would
>     split the file into 193 chunks instead of the "package level" 1844
>     chunks. Large chunks mean good compression, but the amount of data
>     that can get reused will probably be much less. In that case we (SUSE)
>     might as well stay with zsync and gzip -9 --rsyncable ;)
>     
>     From an algorithmic point of view, having different dictionaries is
>     not a problem: you'd just need to store the checksum over the
>     uncompressed chunks instead. But there's a big drawback: you can't
>     reconstruct the identical file. That's because you need to
>     re-compress the chunks you reuse with the new dictionary, and this
>     may lead to different data if the zstd algorithm is different from
>     the one used when creating the repository.
> 
>     We have the same problem with deltarpms, where the recompression is
>     the weak step. Repository creation is usually done on a system that
>     runs a different distribution version than the target, which makes
>     this even more likely.
> 
>     So we can reconstruct a zchunk file that gets the same data when
>     uncompressed, but it might not be the identical zchunk file. But this
>     may not be a problem at all, we just need to be sure that the
>     verification step works.

My plan was to just keep the same dictionaries (a different one for
each metadata file) for at least a whole release, if not more.  My
dictionary generation script
(https://www.jdieter.net/downloads/zchunk-dicts/split.py)
removes checksums before running zstd -D, so the dictionary should
remain effective for a minimum of one release.
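
For illustration, the pipeline is roughly this (a minimal sketch using
the python-zstandard bindings, not split.py itself; the sample file
names, the checksum-stripping regex, and the 100 KB dictionary size are
all assumptions):

# Sketch of a dictionary-training pipeline: strip checksum values from the
# metadata samples, then train a zstd dictionary on the cleaned samples.
# File names and the dictionary size are made up for the example.
import glob
import re
import zstandard

CHECKSUM_RE = re.compile(rb"<checksum[^>]*>[^<]*</checksum>")

samples = []
for path in glob.glob("samples/primary-*.xml"):   # hypothetical sample set
    with open(path, "rb") as f:
        data = f.read()
    # Drop the checksum contents so they don't pollute the dictionary
    samples.append(CHECKSUM_RE.sub(b"<checksum/>", data))

# Train a ~100 KB dictionary (size is an arbitrary choice for the sketch)
zdict = zstandard.train_dictionary(100 * 1024, samples)
with open("primary.dict", "wb") as f:
    f.write(zdict.as_bytes())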

At the point where the dictionary changes, everybody just downloads the
full metadata again with the new dictionary and gets good deltas from
then on.

I'm planning to package up the optimal Fedora dictionaries, make them
Recommends: in createrepo_c, and only change them in Rawhide once,
somewhere around branching.

By using the same dictionaries, we are able to validate the checksums
before decompression, which keeps zchunk from decompressing unverified
data, a possible attack vector.
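
In sketch form, that ordering looks like this (not the real zchunk code;
the chunk boundaries, the shared dictionary, and the expected digest are
assumed to come from an already-verified header):

# Sketch: because the dictionary is fixed, the per-chunk checksums are
# taken over the *compressed* chunks, so they can be checked before any
# decompression happens.  Not the actual zchunk implementation.
import hashlib
import zstandard

def verify_and_decompress(compressed_chunk: bytes,
                          expected_digest: bytes,
                          shared_dict: zstandard.ZstdCompressionDict) -> bytes:
    # 1. Verify the chunk checksum first (sha1 is zchunk's current default)
    if hashlib.sha1(compressed_chunk).digest() != expected_digest:
        raise ValueError("chunk checksum mismatch; refusing to decompress")
    # 2. Only then decompress with the shared dictionary
    dctx = zstandard.ZstdDecompressor(dict_data=shared_dict)
    return dctx.decompress(compressed_chunk)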

>  2) What to put into repomd.xml? We'll need the old primary.xml.gz for
>     compatibility reasons. It's good security practice to minimize the
>     attack surface, so we should put the zchunk header checksum into
>     repomd.xml so that it can be verified before running the zchunk
>     code. So primary.xml.zck with extra attributes for the header? Or an
>     extra element that describes the zchunk header?

My proposal is here:
https://www.jdieter.net/downloads/zchunk/repomd.dtd

In summary, I'm just adding extra zchunk attributes to the main file
element:
zck-location
header-checksum
header-size
zck-timestamp
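
As a rough sketch of how a consumer might pull those fields out
(assuming they end up as plain attributes on the existing <data>
element; the DTD linked above is the authoritative layout):

# Sketch: read the proposed zchunk fields from repomd.xml.  Assumes they
# are attributes on the <data> element, which may not match the final DTD
# exactly -- see the linked repomd.dtd for the real layout.
import xml.etree.ElementTree as ET

NS = {"repo": "http://linux.duke.edu/metadata/repo"}

def zchunk_fields(repomd_path: str, data_type: str = "primary") -> dict:
    root = ET.parse(repomd_path).getroot()
    for data in root.findall("repo:data", NS):
        if data.get("type") != data_type:
            continue
        return {
            "zck_location": data.get("zck-location"),
            "header_checksum": data.get("header-checksum"),
            "header_size": int(data.get("header-size", "0")),
            "zck_timestamp": data.get("zck-timestamp"),
        }
    raise KeyError(f"no {data_type} entry in {repomd_path}")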

librepo first downloads header-size bytes of the file and then verifies
that the header checksum matches and that the header is valid.

librepo then grabs any common chunks from already downloaded metadata,
downloads the remaining chunks, and verifies the body checksum that's
embedded in the header.
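
Roughly, in sketch form (not librepo's actual code, and assuming a
sha256 header checksum; chunk reuse and reassembly are only described in
comments):

# Sketch of the first steps of the download flow described above.
import hashlib
import requests

def fetch_and_verify_header(url: str, header_size: int,
                            header_checksum: str) -> bytes:
    # 1. Download only the zchunk header via an HTTP range request
    #    (assuming the mirror honors Range)
    resp = requests.get(url, headers={"Range": f"bytes=0-{header_size - 1}"})
    resp.raise_for_status()
    header = resp.content

    # 2. Verify it against the header checksum taken from repomd.xml,
    #    before any zchunk parsing happens
    if hashlib.sha256(header).hexdigest() != header_checksum.lower():
        raise ValueError("zchunk header failed verification")

    # 3. From here, the (now trusted) header tells us which chunks can be
    #    copied from already-downloaded metadata and which still need to
    #    be fetched; the reassembled body is then checked against the
    #    body checksum embedded in the header.
    return header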

>  3) I don't think signature support in zchunk is useful ;)

Fair enough.  ;)  It doesn't actually work yet, and I suspect that
you're right in the librepo context, but I think it could be useful in
other contexts.

>  4) Nitpick: Why does zchunk use sha1 checksums for the chunks? Either
>     it's something that needs to be cryptographically sound, in which
>     case sha1 is the wrong choice. Or it's just meant for identifying
>     chunks, in which case md5 is probably faster/smaller. Or some other
>     checksum. But you really don't need 20 bytes like with sha1.

It doesn't need to be cryptographically sound because we have a body
checksum that is sha256.  I'll look at adding MD5 support and
defaulting to it for the chunk checksum type.
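
Just to put numbers on the per-chunk overhead, using the 1844 "package
level" chunk count from your example above:

# Rough per-chunk checksum overhead for the candidate hashes
import hashlib

chunks = 1844
for name in ("md5", "sha1", "sha256"):
    size = hashlib.new(name).digest_size
    print(f"{name}: {size} bytes/chunk, {chunks * size} bytes total")
# md5: 16 bytes/chunk, 29504 bytes total
# sha1: 20 bytes/chunk, 36880 bytes total
# sha256: 32 bytes/chunk, 59008 bytes total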

> Ok, that's enough for now.

Thanks again for looking at this!

Jonathan
