[Rpm-ecosystem] Proposal: Zchunked rpms to reduce compose time and eliminate need for deltarpms

Jonathan Dieter jdieter at gmail.com
Sat Nov 17 18:10:56 UTC 2018


In Fedora, there was a call for ideas[1] on, among other things, reducing
the compose time.  Currently, a good chunk of Fedora's compose time is
spent generating deltarpms, and I've been thinking about a way to use
zchunk as rpm's payload compression format, which would make deltarpms
redundant.  Neal suggested I bring it up here, so here's my proposal:

<tl;dr>
Have rpm use zchunk as its compression format, removing the need for
deltarpms, and thus reducing compose time.  This will require both
changes to the rpm format and new features in the zchunk format.
</tl;dr>

*deltarpm background*
As part of Fedora's compose process, deltarpms are generated between
each new rpm and both the GA version of the rpm and the previous
version.  This process is very CPU and memory intensive, especially for
large rpms.

This also means that deltarpms are only useful for an end user if they
are either updating from GA or have been diligent about keeping their
system up-to-date.  If a user is updating a package from N-2 to N,
there will be no deltarpm and the full rpm will be downloaded.
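
To make the N-2 case concrete, here is a minimal sketch of the rule
described above (deltas only from GA and from the previous version).
The function names and version strings are made up for illustration
and are not part of any compose tooling.

```python
def available_deltas(versions):
    """Return the (old, new) pairs for which a deltarpm would exist.

    `versions` is ordered oldest to newest; versions[0] is the GA build
    and versions[-1] is the newest build.
    """
    new = versions[-1]
    sources = {versions[0]}            # delta from the GA version
    if len(versions) >= 2:
        sources.add(versions[-2])      # delta from the previous version
    sources.discard(new)               # no delta from a build to itself
    return {(old, new) for old in sources}


def needs_full_download(installed, versions):
    """True if no deltarpm leads from `installed` to the newest build."""
    return (installed, versions[-1]) not in available_deltas(versions)


# Hypothetical package history: GA plus three updates.
versions = ["1.0-1", "1.0-2", "1.0-3", "1.0-4"]
from_ga = needs_full_download("1.0-1", versions)        # updating from GA
from_previous = needs_full_download("1.0-3", versions)  # updating from N-1
from_n_minus_2 = needs_full_download("1.0-2", versions) # updating from N-2
```

Only the user on N-2 ends up downloading the full rpm in this sketch.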

*zchunk background*
As most readers on this list are aware, I've been working on zchunk[2],
a compression format that's designed for highly efficient deltas, and
using it to minimize metadata downloads[3].

The core idea behind zchunk is that a file is split into independently
compressed chunks and the checksum of each compressed chunk is stored
in the zchunk header.  When downloading a new version of the file, you
download the zchunk header first, check which chunks you already have,
and then download the rest.
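
The idea can be sketched in a few lines of Python.  This is only an
illustration under simplifying assumptions: it uses fixed-size chunks,
zlib and SHA-256 as stand-ins, a plain list of checksums as the
"header", and made-up function names.  It is not the zchunk library's
API or on-disk format.

```python
import hashlib
import zlib

CHUNK_SIZE = 16  # unrealistically small, for illustration only


def make_chunks(data, size=CHUNK_SIZE):
    """Split data into pieces and compress each piece independently."""
    return [zlib.compress(data[i:i + size]) for i in range(0, len(data), size)]


def make_header(chunks):
    """The 'header' lists the checksum of every compressed chunk, in order."""
    return [hashlib.sha256(chunk).hexdigest() for chunk in chunks]


def chunks_to_download(new_header, local_chunks):
    """Return the indexes of chunks in new_header that are missing locally."""
    have = set(make_header(local_chunks))
    return [i for i, digest in enumerate(new_header) if digest not in have]


# The old and new files differ only in their middle 16 bytes.
old = b"A" * 16 + b"B" * 16 + b"C" * 16
new = b"A" * 16 + b"X" * 16 + b"C" * 16

old_chunks = make_chunks(old)              # what the client already has
new_header = make_header(make_chunks(new)) # what the client downloads first
missing = chunks_to_download(new_header, old_chunks)
```

Here the client would fetch only the middle chunk; the first and last
chunks are reused from the old file.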

*Proposal*
My proposal would be to make zchunk the rpm compression format.  This
would involve a few additions to the zchunk format[4] (something the
format has been designed to accommodate), and would require some
changes to the rpm file format, thus probably necessitating a major
version bump in rpm.

*Benefit*
The benefit of zchunked rpms is that, when downloading an updated rpm,
you would only need to download the chunks that have changed from
what's on your system.

The uncompressed local chunks would be combined with the downloaded
compressed chunks to create a local rpm that will pass signature
verification without needing to recompress the uncompressed local
chunks, making this computationally much faster than rebuilding an rpm
from a deltarpm, a win for users.

The savings wouldn't be as good as what deltarpm can achieve, but
deltarpms would be redundant and could be removed, completely
eliminating a large step from the compose process.

*Drawbacks*
   1. Downloading a new release of a zchunked rpm would be larger than
      downloading the equivalent deltarpm.  This is offset by the fact
      that the client is able to work out which chunks it needs no matter
      what the original rpm is, rather than needing a specific deltarpm
      from the original rpm to the new one.
   2. The rebuilt rpm may not be byte-for-byte identical to the original,
      but can still be validated without decompression, as explained
      in the next section.

*Changes*
The zchunk format would need to be extended to allow for a zchunked rpm
to contain both the uncompressed chunks that were already on the local
system and the newly downloaded compressed chunks while still passing
signature verification.  This would also require moving signature
verification to zchunk.
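
One way such an extension could work is sketched below.  The
assumption here, which is not something the zchunk format defines in
this proposal, is that the signed index records two checksums per
chunk, one for the compressed form and one for the uncompressed form,
so each stored chunk can be checked in whichever form it happens to be
in.  All names are hypothetical, and zlib and SHA-256 are stand-ins.

```python
import hashlib
import zlib
from dataclasses import dataclass


def digest(data):
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class IndexEntry:
    compressed_sum: str     # checksum of the chunk as shipped by the server
    uncompressed_sum: str   # checksum of the same chunk before compression


@dataclass(frozen=True)
class StoredChunk:
    data: bytes
    compressed: bool        # False for chunks reused from the local system


def build_index(pieces):
    """Built once by the packager; a signature would cover this index."""
    return [IndexEntry(digest(zlib.compress(p)), digest(p)) for p in pieces]


def verify(index, stored):
    """Check every chunk in its stored form; nothing is (de)compressed."""
    if len(index) != len(stored):
        return False
    for entry, chunk in zip(index, stored):
        expected = entry.compressed_sum if chunk.compressed else entry.uncompressed_sum
        if digest(chunk.data) != expected:
            return False
    return True


pieces = [b"rpm header", b"unchanged file", b"changed file"]
index = build_index(pieces)

rebuilt = [
    StoredChunk(zlib.compress(pieces[0]), compressed=True),   # downloaded
    StoredChunk(pieces[1], compressed=False),                 # local copy
    StoredChunk(zlib.compress(pieces[2]), compressed=True),   # downloaded
]
rebuilt_ok = verify(index, rebuilt)

tampered = [rebuilt[0], StoredChunk(b"tampered file!", compressed=False), rebuilt[2]]
tampered_ok = verify(index, tampered)
```

The rebuilt file differs byte-for-byte from the one on the server, yet
it validates against the same index, and a modified local chunk is
still caught.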
 
The rpm file format has to be changed because the zchunk header needs
to be at the beginning of the file in order for the zchunk library to
figure out which chunks it needs to download.  My suggestions for
changes to the rpm file format are as follows:

   1. Signing should be moved to the zchunk format as described at the
      beginning of this section.
   2. The rpm header should be stored in one stream inside the zchunk
      file.  This allows it to be easily extracted separately from the
      data.
   3. The rpm cpio should be stored in a second stream inside the zchunk
      file.
   4. At minimum, an optional zchunk element should be set to identify
      zchunk rpms as rpms rather than regular zchunk files.  If desired,
      optional elements could also be set containing %{name}, %{version},
      %{release}, %{arch} and %{epoch}.  This would allow this information
      to be read easily by programs such as 'file' without needing to
      extract the rpm header stream.  These rpm attributes would also be
      stored in the rpm header as usual.
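
As a rough model of points 2 to 4, the layout could look something
like the following.  The stream ids, the "type" key and the class
names are invented for this sketch; the real encoding would be up to
the zchunk format.

```python
from dataclasses import dataclass, field

# Hypothetical stream ids for the two streams suggested above.
RPM_HEADER_STREAM = 1
RPM_PAYLOAD_STREAM = 2


@dataclass
class Chunk:
    stream: int
    data: bytes


@dataclass
class ZchunkedRpm:
    optional_elements: dict = field(default_factory=dict)
    chunks: list = field(default_factory=list)

    def is_rpm(self):
        """Identify the file as an rpm from an optional element alone."""
        return self.optional_elements.get("type") == "rpm"

    def stream(self, stream_id):
        """Concatenate the chunks of one stream, ignoring the other."""
        return b"".join(c.data for c in self.chunks if c.stream == stream_id)


pkg = ZchunkedRpm(
    optional_elements={
        "type": "rpm",
        "name": "foo", "version": "1.0", "release": "1",
        "arch": "x86_64", "epoch": "0",
    },
    chunks=[
        Chunk(RPM_HEADER_STREAM, b"rpm header "),
        Chunk(RPM_PAYLOAD_STREAM, b"cpio part 1 "),
        Chunk(RPM_HEADER_STREAM, b"continued"),
        Chunk(RPM_PAYLOAD_STREAM, b"cpio part 2"),
    ],
)

header_only = pkg.stream(RPM_HEADER_STREAM)
```

A tool like 'file' would only need optional_elements, and extracting
the rpm header never touches the cpio stream.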

*Final notes*
I realize this is a massive proposal, that zchunk is still very young,
and that we're still working on getting the dnf zchunk pull requests
reviewed.  Even so, I do think it's feasible and provides an
opportunity to eliminate a pain point from our compose process while
still reducing the download size for our users.

[1]: https://fedoraproject.org/wiki/Objectives/Lifecycle/Problem_statements#Challenge_.231:_Faster.2C_more_scalable_composes
[2]: https://github.com/zchunk/zchunk
[3]: https://fedoraproject.org/wiki/Changes/Zchunk_Metadata
[4]: https://github.com/zchunk/zchunk/blob/master/zchunk_format.txt


