[Rpm-ecosystem] Proposed zchunk file format - V3

Jonathan Dieter jdieter at gmail.com
Mon Mar 12 19:21:50 UTC 2018


On Mon, 2018-03-12 at 15:42 +0100, Michal Domonkos wrote:
> Hi Jonathan,
> 
> To me, the zchunk idea looks good.
> 
> Incidentally, for the last couple of months, I have been trying to
> rethink the way we cache metadata on the clients, as part of the
> libdnf (re)design efforts. My goal was to de-duplicate the data
> between similar repos in the cache as well as decrease the size that
> needs to be downloaded every time (inevitably leading to this topic).
> 
> I came up with two different strategies:
> 
> 1) Chunking
<snip>
> That made me think that either using git (libgit2) directly or doing a
> small, lightweight implementation of the core concepts might be the
> way to go. I even played with the latter a bit (I didn't get to
> breaking down primary.xml, though):
> https://github.com/dmnks/rhs-proto
> 
> In the context of this thread, this is basically what you do with
> zchunk (just much better) :)

Yes, I guess. ;)  The git concept sounds interesting, but it would be
a lot of work and would require some huge changes in how we deal with
metadata.  For the moment, I think I'll focus on more evolutionary
changes.

> 2) Deltas
> 
> Later, during this year's devconf, I had a few "brainstorming"
> sessions with Florian Festi who pointed out that the differences in
> metadata updates might often be on the sub-package level (e.g. NEVRA
> in the version tag) so chunking on the package boundaries might not
> give us the best results possible. Perhaps instead, we could generate
> deltas on the binary level.

My current tests were chunked on srpm boundaries, not package
boundaries (not sure if we're using the same terminology here).  The
problem with chunking on the package boundary in the xml is that it
creates many more chunks than grouping by srpm, and the smaller chunks
hurt compression (even with a dictionary) while the larger number of
chunks increases the size of the index.  In Fedora, I think it's
impossible to
only change the metadata for one sub-package belonging to an srpm, and,
even if it *is* possible, it's very rare.
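
To make the grouping concrete, here's a rough sketch in Python (not
the actual zchunk code) of cutting chunks on srpm boundaries; the
package list and its fields are made up for illustration:

    # Group primary.xml package entries by source rpm before cutting
    # chunk boundaries, so sub-packages of one srpm share a chunk.
    from itertools import groupby

    packages = [
        # (srpm_name, package_xml_fragment) -- hypothetical records
        ("glibc", "<package>glibc ...</package>"),
        ("glibc", "<package>glibc-common ...</package>"),
        ("zsh", "<package>zsh ...</package>"),
    ]

    def srpm_chunks(packages):
        # Sort so sub-packages of the same srpm are adjacent, then
        # group; one chunk comes out per source rpm.
        ordered = sorted(packages, key=lambda p: p[0])
        for srpm, group in groupby(ordered, key=lambda p: p[0]):
            yield srpm, "".join(fragment for _, fragment in group)

Three binary packages collapse into two chunks here, and a version
bump in one sub-package only dirties its srpm's chunk.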

> An alternative would be to pre-generate (compressed) binary deltas for
> the last N versions and let clients download an index file that will
> tell them what deltas they're missing and should download. This is
> basically what debian's pdiff format does. One downside to this
> approach is that it doesn't give us the de-duplication on clients
> consuming multiple repos with similar content (probably quite common
> with RHEL subscriptions at least).

I'm not a huge fan of pre-generated binary deltas.  They might give
smaller deltas than a chunking solution, but at the cost of generating
and maintaining the deltas.  It's been enough of a pain for rpms that
(at least on an individual basis) change infrequently.  For metadata
that changes every day, I think it would be too much.

The beauty of the zchunk format (or zsync, or any other chunked format)
is that we don't have to download different files based on what we
already have; instead, we download fewer or more parts of the same
file.  From the server side, we don't have to
worry about the deltas, and the clients just get what they need.
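
To sketch that idea (with an invented index layout; the real zchunk
index is different), the client's side boils down to something like:

    def plan_download(new_index, local_chunks):
        # new_index: list of (checksum, offset, length) describing the
        # target file; local_chunks: dict of checksum -> bytes we have.
        reuse, fetch = [], []
        for checksum, offset, length in new_index:
            if checksum in local_chunks:
                reuse.append(checksum)  # already on disk, no download
            else:
                fetch.append((offset, length))  # byte range to request
        return reuse, fetch

Every client works from the same index of the same file; the only
thing that varies is how many entries end up in the fetch list.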

Having said that, the current zchunk proposal doesn't really address
deduplication from multiple repos with similar content.  I suppose some
way of extending zchunk to allow you to specify multiple local sources
would be a way of fixing that, but I'm still working on getting it to
build from *one* local source (getting very close, but not quite
there).
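
If that extension ever happens, I'd expect it to look something like
this purely hypothetical sketch, where chunk lookup is widened across
the indexes of several local files (none of this exists in zchunk
today):

    def find_chunk(checksum, local_sources):
        # local_sources: list of (path, {checksum: (offset, length)})
        # built from the indexes of other local zchunk files.
        for path, index in local_sources:
            if checksum in index:
                offset, length = index[checksum]
                with open(path, "rb") as f:
                    f.seek(offset)
                    return f.read(length)
        return None  # not found locally; has to be downloaded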

> Then I stumbled upon casync which combines the benefits of both
> strategies; it chunks based on the shape of the data (arguably giving
> better results than chunking on the package boundaries), and it
> doesn't require a smart protocol. However, it involves a lot of HTTP
> requests as you already mentioned.

Just to be clear, zchunk is casync with smart boundaries (or, better
put, application-specified boundaries) and a single file backend so
requests can be sent as http range requests rather than hundreds of
individual http requests.
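
Roughly, the single-file layout means the fetch step can coalesce the
missing chunks' byte ranges and send them all in one Range header; a
sketch using only the Python stdlib (the URL and ranges are
placeholders):

    import urllib.request

    def merge_ranges(ranges):
        # Coalesce adjacent/overlapping (offset, length) pairs into
        # [start, end) intervals.
        merged = []
        for offset, length in sorted(ranges):
            if merged and offset <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], offset + length)
            else:
                merged.append([offset, offset + length])
        return merged

    def fetch_chunks(url, ranges):
        spec = ",".join("%d-%d" % (start, end - 1)
                        for start, end in merge_ranges(ranges))
        req = urllib.request.Request(url,
                                     headers={"Range": "bytes=" + spec})
        # A 206 reply carries just the requested bytes (as
        # multipart/byteranges when there's more than one range).
        return urllib.request.urlopen(req).read()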
 
> Despite that, I'm still leaning towards chunking as being the better
> solution of the two. The question is, how much granularity we want.
> You made a good point: the repodata format is fixed (be it xml or
> solv), so we might as well take advantage of it to detect boundaries
> for chunking, rather than using a rolling hash (but I have no data to
> back it up). I'm not sure how to approach the many-GET-requests (or
> the lack of range support) problem, though.

If you've been following the conversation between myself and Michael
Schroeder on rpm-ecosystem (starting with
http://lists.rpm.org/pipermail/rpm-ecosystem/2018-March/000553.html),
I did some comparisons between
zsync (which uses a rolling hash) and zchunk.  There's also the fact
that zsync uses gzip compression, while zchunk uses zstd by default,
but quite a bit of the difference is due to the larger window size a
rolling hash provides.
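
For contrast, here's a toy version of the rolling-hash chunking that
zsync- and casync-style tools use: a boundary is cut wherever the hash
of a sliding window matches a pattern, so boundaries follow the data
rather than application knowledge (the hash and constants here are
simplistic, not what zsync actually uses):

    WINDOW = 48    # bytes in the sliding window
    MASK = 0xFFF   # on average one boundary every ~4096 bytes
    BASE = 257

    def rolling_chunks(data):
        chunks, start, h = [], 0, 0
        power = pow(BASE, WINDOW - 1, 1 << 32)
        for i, byte in enumerate(data):
            if i >= WINDOW:
                # Slide the window: drop the oldest byte, add the new.
                h = (h - data[i - WINDOW] * power) * BASE + byte
            else:
                h = h * BASE + byte
            h &= 0xFFFFFFFF
            if i - start + 1 >= WINDOW and (h & MASK) == 0:
                chunks.append(data[start:i + 1])
                start = i + 1
        if start < len(data):
            chunks.append(data[start:])
        return chunks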

With zchunk, if you get a server with no range support, you just
download the full zchunk file.  I did run a check on the Fedora
mirrors, and I think there were two or three that didn't support range
requests at all.
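
The fallback check itself is trivial; a sketch (placeholder URL, no
error handling):

    import urllib.request

    def supports_ranges(url):
        # Ask for the first byte; a 206 reply means the server honors
        # Range headers, anything else means grab the whole file.
        req = urllib.request.Request(url,
                                     headers={"Range": "bytes=0-0"})
        return urllib.request.urlopen(req).status == 206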

> As part of my efforts, I created this "small" git repo that contains
> metadata snapshots since ~February, which can be useful for seeing
> what typical metadata updates look like. Feel free to use it (e.g. for
> testing out zchunk):
> https://pagure.io/mdhist

This looks great!  I've got some random days in January and February,
so this looks far more consistent.

Thanks much for your insight,
Jonathan

