[Rpm-ecosystem] Proposal: Zchunked rpms to reduce compose time and eliminate need for deltarpms

Jonathan Dieter jdieter at gmail.com
Sun Nov 18 15:53:33 UTC 2018


Michael, thanks so much for your feedback!

On Sun, 2018-11-18 at 14:09 +0000, Michael Schroeder wrote:
> On Sat, Nov 17, 2018 at 06:10:56PM +0000, Jonathan Dieter wrote:
> > In Fedora, there was a call for ideas on, among other things, reducing
> > the compose time.  Currently, a good chunk of Fedora's compose time is
> > spent generating deltarpms,
> 
> Why's that? Delta generation is actually pretty fast *if* you
> limit the memory usage with the '-m' option.

I'm actually not sure of the answer to that.  Out of curiosity, what
limit do you use at SUSE, and how did you decide on it?

> > and I've been thinking about a way to use
> > zchunk as rpm's compression payload, which would make deltarpms
> > redundant.  Neal suggested I bring it up here, so here's my proposal:
> > 
> > <tl;dr>
> > Have rpm use zchunk as its compression format, removing the need for
> > deltarpms, and thus reducing compose time.  This will require changes
> > to both the rpm format and new features in the zchunk format.
> > </tl;dr>
> 
> You'll need to provide some real life numbers to make a convincing
> argument for this. A delta algorithm is way different to chunking:
> 
> - the minimal chunk size is *really* small (16 bytes iirc)
> - there is also an additional "offset" stream that improves the delta
>   sizes significantly
> 
> So zchunk will be much worse than a deltarpm. Deltarpms also work
> with files from the filesystem, so you don't have to keep the
> installed rpms around for the update.

You're absolutely right on the first part: a delta algorithm will
produce smaller downloads than chunking (for what it's worth, I wrote
about how the offset stream works nine years ago at
https://www.jdieter.net/posts/2009/11/06/on-binary-delta-algorithms).

As for working off files from the filesystem, my plan is to achieve the
same with zchunked rpms.  The key difference is that a zchunked rpm
will have checksums of both the compressed and uncompressed chunk data,
and rebuilding the rpm from locally installed data will include that
data uncompressed, so the rebuilt rpm won't be byte-for-byte identical
to the original (though it will be verifiable without decompressing
anything).
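
A minimal sketch of the dual-checksum idea described above, not the
actual zchunk or rpm implementation.  Everything here is an assumption
made for illustration: the ChunkEntry layout and the function names are
hypothetical, SHA-256 is an arbitrary choice of digest, and zlib stands
in for zstd only because it ships with Python.  It shows how a rebuilt
rpm that mixes locally sourced (uncompressed) chunks with downloaded
(compressed) chunks could be verified without decompressing anything,
even though its bytes differ from the original.

```python
import hashlib
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkEntry:
    """Hypothetical header entry: one digest per representation of a chunk."""
    compressed_sha256: str
    uncompressed_sha256: str


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def make_entry(raw: bytes):
    """Build-time step: compress a chunk and record both digests."""
    compressed = zlib.compress(raw)  # stand-in for zstd
    return ChunkEntry(sha256(compressed), sha256(raw)), compressed


def verify_chunk(entry: ChunkEntry, stored: bytes, is_compressed: bool) -> bool:
    """Check a stored chunk against the digest matching its representation.

    No decompression happens here: a compressed chunk is hashed as-is and
    compared with the compressed digest, an uncompressed one with the
    uncompressed digest.
    """
    expected = entry.compressed_sha256 if is_compressed else entry.uncompressed_sha256
    return sha256(stored) == expected


def verify_rpm(entries, chunks) -> bool:
    """chunks is a list of (stored_bytes, is_compressed) pairs."""
    return len(entries) == len(chunks) and all(
        verify_chunk(entry, stored, is_compressed)
        for entry, (stored, is_compressed) in zip(entries, chunks)
    )
```

In this sketch, an rpm rebuilt as [(raw_chunk_0, False),
(compressed_chunk_1, True)] passes verify_rpm against the same header
entries as the all-compressed original, while not being byte-for-byte
identical to it.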

As I mentioned in the drawbacks section of my proposal, deltarpms are
definitely more efficient than zchunked rpms.  The two advantages that
zchunked rpms give you are (1) they're more general since you don't
need a specific old version of an rpm installed, and (2) they're
generated at build time, so the step of generating deltarpms is
eliminated.

The real questions here are (1) *how much* less efficient will zchunk
be than deltarpm, and (2) are the advantages listed above worth the
loss of efficiency.  Unfortunately, (2) can't really be determined
until I have real-world numbers for (1).
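
One way question (1) could be measured, as a rough sketch only: for a
given update, add up the compressed sizes of the chunks that are not
already available locally, and compare that total with the size of the
corresponding deltarpm.  The function names and the (digest, size)
input layout below are hypothetical, and no real measurements are
implied.

```python
def zchunk_download_size(new_chunks, local_digests):
    """Bytes that would need to be fetched for a zchunked rpm.

    new_chunks:    iterable of (uncompressed_digest, compressed_size)
                   pairs describing the new rpm's chunks.
    local_digests: set of uncompressed digests that can be reproduced
                   from locally installed data.
    """
    return sum(size for digest, size in new_chunks if digest not in local_digests)


def relative_overhead(zchunk_bytes, deltarpm_bytes):
    """How many times larger the zchunk download is than the deltarpm."""
    if deltarpm_bytes <= 0:
        raise ValueError("deltarpm size must be positive")
    return zchunk_bytes / deltarpm_bytes
```

Running this over a set of real updates, with real deltarpm sizes for
the same package pairs, would give the numbers needed to weigh (2).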

But before I come up with real-world numbers, I'd like to make sure
that the concept is at least workable.  I can think of two major
reasons it may not be.

   1. Rebuilt zchunked rpms may not be byte-for-byte identical to the
      original.  Because of how zchunk signing and verification work,
      the rpm can still be verified *without* decompressing anything,
      but this may be a show-stopper anyway.
   2. zchunked rpms will require some major changes to the RPM file
      format, and the format will *not* be able to be read by older
      versions of RPM.  This may be a show-stopper.

There may also be other reasons I'm missing.

I am very aware that we're nowhere near ready to determine whether this
proposal is the right way forward, but hopefully, after this
conversation, we'll at least know whether it's worth investigating.

Jonathan


