[Rpm-ecosystem] A proof-of-concept for delta'ing repodata

Igor Gnatenko ignatenkobrain at fedoraproject.org
Tue Feb 13 09:52:14 UTC 2018


CCing the rpm-ecosystem@ ML since it's the main location where this message
should have gone 😉

On Mon, 2018-02-12 at 23:53 +0200, Jonathan Dieter wrote:
> <tl;dr>
> I've come up with a method of splitting repodata into chunks that can
> be downloaded and combined with chunks that are already on the local
> system to create a byte-for-byte copy of the compressed repodata. 
> Tools and scripts are at:
> https://www.jdieter.net/downloads/
> </tl;dr>
> 
> Background:
> With DNF, we're currently downloading ~20MB of repository data every
> time the updates repository changes.
> 
> When casync was released, I wondered if we could use it to only
> download the deltas for the repodata.  At Flock last summer, I ran some
> tests against the uncompressed repodata and saw a reduction of 30-40%
> from one day to the next, which seemed low, but was a good starting
> point.
> 
> Unfortunately, due to the way casync separates each file into thousands
> of compressed chunks, building each file required thousands of (serial)
> downloads which, even on a decent internet connection, took *forever*.
> 
> When I talked through the idea with Kevin and Patrick, they also
> pointed out that our mirrors might not be too keen on the idea of
> adding thousands of tiny files that change every day.
> 
> 
> The Solution(?):
> One potential solution to the "multitude of files" problem is to merge
> the chunks back into a single file, and use HTTP ranges to only
> download the parts of the file we want.  An added bonus is that most
> web servers are configured to support hundreds of ranges in one
> request, which greatly reduces the number of requests we have to make.
> 
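
To make the range trick concrete, here is a minimal sketch (not taken from the
zchunk scripts) of asking for several byte ranges of one file in a single
request, using nothing but the Python standard library; the URL and offsets
are placeholders, and real code would still have to cope with mirrors that
ignore the Range header:

import email.parser
import urllib.request

def fetch_ranges(url, ranges):
    """ranges is a list of (start, end) byte offsets, end inclusive."""
    spec = ",".join("%d-%d" % (start, end) for start, end in ranges)
    req = urllib.request.Request(url, headers={"Range": "bytes=" + spec})
    with urllib.request.urlopen(req) as resp:
        ctype = resp.headers.get("Content-Type", "")
        body = resp.read()
    if ctype.startswith("multipart/byteranges"):
        # Several ranges come back as multipart/byteranges, each part with
        # its own Content-Range header, so the email parser can split them.
        msg = email.parser.BytesParser().parsebytes(
            b"Content-Type: " + ctype.encode() + b"\r\n\r\n" + body)
        return [part.get_payload(decode=True) for part in msg.get_payload()]
    # A single range (206) or a server that ignored the header (200).
    return [body]
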
> The other problem with casync is that its chunk separation is naïve,
> which is why we were only achieving 30-40% savings.  But we know what
> the XML file is supposed to look like, so we can separate the chunks on
> the tag boundaries in the XML.
> 
> So I've ditched casync altogether and put together a proof-of-concept
> (tentatively named zchunk) that takes an XML file, compresses each tag
> separately, and then concatenates all of them into one file.  The tool
> also creates an index file that tells you the sha256sum for each
> compressed chunk and the location of the chunk in the file.
> 
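
The way I read that, the writer side is something like the sketch below; the
index layout here is made up for illustration, and the custom zdict mentioned
further down is left out:

import hashlib
import zlib

def write_zchunk(chunks, data_path, index_path):
    """chunks: list of bytes, e.g. one serialized <package> element each."""
    offset = 0
    with open(data_path, "wb") as data, open(index_path, "w") as index:
        for chunk in chunks:
            compressed = zlib.compress(chunk, 9)
            digest = hashlib.sha256(compressed).hexdigest()
            data.write(compressed)
            # One index line per chunk: checksum of the *compressed* chunk,
            # its offset in the concatenated file, and its length.
            index.write("%s %d %d\n" % (digest, offset, len(compressed)))
            offset += len(compressed)
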
> I've also written a small script that will download a zchunk off the
> internet.  If you don't specify an old file, it will just download
> everything, but if you specify an old file, it will download the index
> of the new file and compare the sha256sums of each chunk.  Any
> checksums that match will be taken from the old file, and the rest will
> be downloaded.
> 
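
The reuse step could then look roughly like this (same made-up index layout as
in the sketch above; the real dl_zchunk.py may well differ):

def plan_download(old_index, new_index):
    """Both indexes are lists of (sha256_hex, offset, length) tuples."""
    old = {digest: (offset, length) for digest, offset, length in old_index}
    reuse, fetch = [], []
    for digest, offset, length in new_index:
        if digest in old:
            reuse.append((digest, old[digest]))          # copy from old .zck
        else:
            fetch.append((offset, offset + length - 1))  # inclusive HTTP range
    return reuse, fetch
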
> In testing, I've seen savings ranging from 10% (December 17 to today)
> to 95% (yesterday to today).
> 
> 
> Remaining problems:
>  * Zchunk files are bigger than their gzip equivalents.  This ranges
>    from 5% larger for filelists.xml to 300% larger for primary.xml. 
>    This can be greatly reduced by chunking primary.xml based on srpm
>    rather than rpm, which brings the size increase for primary.xml down
>    to roughly 30%.
> 
>  * Many changes to the metadata can mean a large number of ranges
>    requested.  I ran a check on our mirrors, and three (out of around
>    150 that had the file I was testing) don't honor range requests at
>    all, and three others only honor a small number in a single request.
>    A further seven didn't respond at all (not sure if that had
>    anything to do with the range requests), and the rest supported
>    between 256 and 512 ranges in a single request.  We can reduce the
>    number of ranges requested by always ordering our packages by date. 
>    This would ensure that new packages are grouped at the end of the
>    xml where they will be grabbed in one contiguous range.

This would "break" DNF, because libsolv assigns Ids based on the order of
packages in the metadata. So if something requires "webserver" and both "nginx"
and "httpd" provide it (without versions), the lowest Id is picked (not going
into the details of this). That means users would get different results
depending on when the last update for one or the other was submitted. This is
unacceptable from my POV.

>  * Zchunk files use zlib (it gives better compression than xz with such
>    small chunks), but, because they use a custom zdict, they are not gz
>    files.  This means that we'll need new tools to read and write them.
>    (And I am volunteering to do the work here)

What about zstd? The latest version of lz4 also has support for dictionaries.
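
For comparison, a zstd variant with a trained dictionary would look roughly
like this with the python-zstandard binding (just a sketch, not benchmarked
against the repodata; the chunks passed in are whatever samples the dictionary
should be trained on):

import zstandard

def compress_chunks(chunks, dict_size=16384, level=19):
    # Train a shared dictionary on the chunks, then compress each chunk on
    # its own with that dictionary (analogous to zlib's custom zdict).
    dict_data = zstandard.train_dictionary(dict_size, chunks)
    cctx = zstandard.ZstdCompressor(level=level, dict_data=dict_data)
    return dict_data.as_bytes(), [cctx.compress(c) for c in chunks]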

> The tools:
> The proof-of-concept tools are all sitting in
> https://www.jdieter.net/downloads/zchunk-scripts/
> 
> They are full of ugly hacks, especially when it comes to parsing the
> XML, there's little to no error reporting, and I didn't comment them
> well at all, but they should work.
> 
> If all you want to do is download zchunks, you need to run dl_zchunk.py
> with the url you want to download (ending in .zck) as the first
> parameter.  Repodata for various days over the last few weeks is at:
> https://www.jdieter.net/downloads/zchunk-test/  You may need to hover
> over the links to see which is which.  The downloads directory is also
> available over rsync at rsync://jdieter.net/downloads/zchunk-test.
> 
> dl_zchunk.py doesn't show anything if you download the full file, but
> if you run the command with an old file as the second parameter, it
> will show four numbers: bytes taken from the old file, bytes downloaded
> from the new, total downloaded bytes and total uploaded bytes.
> 
> zchunk.py creates a .zck file.  To group chunks by source rpm in
> primary.xml, run
> ./zchunk.py <file> rpm:sourcerpm
> 
> unzchunk.py decompresses a .zck file to stdout
> 
> I realize that there's a lot to digest here, and it's late, so I know I
> missed something.  Please let me know if you have any suggestions,
> criticisms or flames, though it might be a few hours before I respond.

As someone who has tried to work on this problem, I really appreciate what you
have done here. We started out with zsync and the results were quite good, but
zsync is dead and has a ton of bugs. It also requires archives to be
`--rsyncable`. So my question is: why not add the idx file as an additional
file alongside the existing ones, instead of inventing a new format? The
problem is that we will have to keep distributing the old format anyway (for
compatibility reasons).

I'm not sure that optimizing by XML tags is a very good idea, especially
because I hope that in the future we will stop distributing XMLs and start
distributing solv/solvx.
-- 
-Igor Gnatenko


