[Rpm-ecosystem] Some points about zchunk

Sun Jul 8 18:45:36 UTC 2018

On Fri, 2018-07-06 at 11:48 +0000, Michael Schroeder wrote:
> On Thu, Jul 05, 2018 at 08:07:58PM +0300, Jonathan Dieter wrote:
> > My plan was to just keep the same dictionaries (a different one for
> > each metadata file) for at least a whole release, if not more.  My
> > dictionary generation script
> > (https://www.jdieter.net/downloads/zchunk-dicts/split.py)
> > removes checksums before running zstd -D, so the dictionary should
> > remain effective for a minimum of one release.
> 
> Yes, I guessed that you would say something like this. And it's
> also a reasonable thing to do.
> 
> It's just a shame that we can't generate the dictionary when creating
> the repository metadata, it would be such a nice feature to have.

Yep.

> > At the point where the dictionary changes, everybody just downloads the
> > full metadata again with the new dictionary and gets good deltas from
> > then on.
> > 
> > I'm planning to package up the optimal Fedora dictionaries, make them
> > Recommended: in createrepo_c, and only change them in Rawhide once,
> > somewhere around branching.
> > 
> > By using the same dictionaries, we are able to validate the checksums
> > before decompression, which keeps zchunk from decompressing unverified
> > data, a possible attack vector.
> 
> That depends. Maybe I'll implement dictionary transcoding for zchunk
> just in case the zstd algorithms don't change. Even if that's pretty
> unlikely.

Sure, that would be great!

> > >  2) What to put into repomd.xml? We'll need to old primary.xml.gz for
> > >     compatibility reasons. It's a good security practice to minimize the 
> > >     attack vector, so we should put the zchunk header checksum into
> > >     the repodata.xml so that it can be verified before running the zchunk
> > >     code. So primary.xml.zck with extra attributes for the header?
> > My proposal is here:
> > https://www.jdieter.net/downloads/zchunk/repomd.dtd
> > 
> > In summary, I'm just adding extra zchunk attributes to the main file
> > element:
> > zck-location
> > header-checksum
> > header-size
> > zck-timestamp
> > 
> > librepo first downloads header-size of the file and then verifies that
> > the header checksum matches and is valid.
> 
> Please use zck-header-checksum and zck-header-size instead.

Ok, will do.

> > librepo then grabs any common chunks from already downloaded metadata,
> > downloads the remaining chunks, and verifies the body checksum that's
> > embedded in the header.
> 
> And libzypp will never use librepo, so I have to implement all this
> myself ;)
> 
> The good thing is that libzypp already supports range downloads from
> multiple mirrors in parallel, because we already support delta
> metadata downloads. So I just need an libzchunk api that "fills"
> the target file with the data from the old metadata and returns a
> list of ranges that need to be downloaded.

There's zck_copy_chunks(zckCtx *src, zckCtx *tgt) that copies any
needed chunks from src to tgt.  You can run it multiple times with
multiple sources, if you so choose.

There's zck_get_missing_range(zckCtx *zck, int max_ranges) which will
return a zckRange* of the needed chunks.

Finally, there's zck_get_range_char(zckRange *range) which returns a
char* of the range returned from zck_get_missing_range().

The download process itself is a bit complex, because we don't actually
do the download in libzck.  There's an example of how to download at:
https://github.com/zchunk/zchunk/blob/master/src/zck_dl.c

<snip>
> > >  4) Nitpick: Why does zchunk use sha1 checksums for the chunks? Either
> > >     it's something that needs to be cryptographic sound, then sha1 is the
> > >     wrong choice. Or it's just meant for identifying chunks, then
> > >     md5 is probably faster/smaller. Or some other checksum. But you
> > >     really don't need 20 bytes like with sha1.
> > 
> > It doesn't need to be cryptographically sound because we have a body
> > checksum that is sha256.  I'll look at adding MD5 support and
> > defaulting to it for the chunk checksum type.
> 
> Ah, no, I think you misunderstood. Do *not* add md5 support. In fact,
> I'd ask you to remove sha1 support as well to make your code smaller.
> 
> My point is that you shouldn't use 20 bytes just for chunk identification
> purposes. As you said, it doesn't need to be cryptographically sound, we
> don't have to make sure it withstands an attacker.
> Just use the first 8 bytes of the sha256 sum instead (or sha512, as
> it's a bit faster than sha256 IIRC).

Ok, that makes sense.  I'll add it as a new hash type (SHA512_64?) and
make it the default for the chunk checksum.

Jonathan