[Rpm-ecosystem] A proof-of-concept for delta'ing repodata

Vít Ondruch vondruch at redhat.com
Fri Mar 2 12:18:54 UTC 2018


Have you experimented with casync [1]?


[1] https://github.com/systemd/casync

On 13.2.2018 at 10:52, Igor Gnatenko wrote:
> CCing the rpm-ecosystem@ ML, since it's the main location where this
> message should have gone 😉
> On Mon, 2018-02-12 at 23:53 +0200, Jonathan Dieter wrote:
> > <tl;dr>
> > I've come up with a method of splitting repodata into chunks that can
> > be downloaded and combined with chunks that are already on the local
> > system to create a byte-for-byte copy of the compressed repodata.
> > Tools and scripts are at:
> > https://www.jdieter.net/downloads/
> > </tl;dr>
> > Background:
> > With DNF, we're currently downloading ~20MB of repository data every
> > time the updates repository changes.
> > When casync was released, I wondered if we could use it to only
> > download the deltas for the repodata.  At Flock last summer, I ran some
> > tests against the uncompressed repodata and saw a reduction of 30-40%
> > from one day to the next, which seemed low, but was a good starting
> > point.
> > Unfortunately, due to the way casync separates each file into thousands
> > of compressed chunks, building each file required thousands of (serial)
> > downloads which, even on a decent internet connection, took *forever*.
> > When I talked through the idea with Kevin and Patrick, they also
> > pointed out that our mirrors might not be too keen on the idea of
> > adding thousands of tiny files that change every day.
> > The Solution(?):
> > One potential solution to the "multitude of files" problem is to merge
> > the chunks back into a single file, and use HTTP ranges to only
> > download the parts of the file we want.  An added bonus is that most
> > web servers are configured to support hundreds of ranges in one
> > request, which greatly reduces the number of requests we have to make.
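A minimal sketch of how the byte ranges for the wanted chunks could be packed into multi-range HTTP `Range` headers. The function name `build_range_headers` and the default limit of 256 ranges per request are assumptions for illustration, not taken from the zchunk scripts.

```python
def build_range_headers(chunks, max_ranges=256):
    """Turn (offset, length) pairs into HTTP Range header values.

    Adjacent or overlapping chunks are merged into one range, and the
    result is split into batches of at most max_ranges ranges each
    (max_ranges is an assumed per-request server limit).
    """
    merged = []
    for offset, length in sorted(chunks):
        end = offset + length - 1  # HTTP byte ranges are inclusive
        if merged and offset <= merged[-1][1] + 1:
            # Touches or overlaps the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([offset, end])

    headers = []
    for i in range(0, len(merged), max_ranges):
        batch = merged[i:i + max_ranges]
        headers.append("bytes=" + ",".join(f"{s}-{e}" for s, e in batch))
    return headers
```

Each returned string would be sent as the value of one `Range:` request header. Merging neighbouring chunks first keeps the number of ranges per request as low as possible.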
> > The other problem with casync is that its chunk separation is naïve,
> > which is why we were only achieving 30-40% savings.  But we know what
> > the XML file is supposed to look like, so we can separate the chunks on
> > the tag boundaries in the XML.
> > So I've ditched casync altogether and put together a proof-of-concept
> > (tentatively named zchunk) that takes an XML file, compresses each tag
> > separately, and then concatenates all of them into one file.  The tool
> > also creates an index file that tells you the sha256sum for each
> > compressed chunk and the location of the chunk in the file.
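The idea described above (compress each chunk separately, concatenate, and record a checksum and location per chunk) can be sketched roughly as follows. This is not the actual zchunk.py code: the function names and the in-memory index layout of `(sha256, offset, length)` tuples are assumptions, and plain zlib without a custom dictionary is used for brevity.

```python
import hashlib
import zlib


def build_zchunk(chunks):
    """Compress each chunk on its own and concatenate the results.

    Returns (body, index), where index holds one
    (sha256 hex digest, offset, length) tuple per compressed chunk.
    """
    body = bytearray()
    index = []
    for raw in chunks:
        comp = zlib.compress(raw)
        digest = hashlib.sha256(comp).hexdigest()
        index.append((digest, len(body), len(comp)))
        body += comp
    return bytes(body), index


def read_chunk(body, entry):
    """Extract, verify, and decompress one chunk using its index entry."""
    digest, offset, length = entry
    comp = body[offset:offset + length]
    if hashlib.sha256(comp).hexdigest() != digest:
        raise ValueError("checksum mismatch for chunk at offset %d" % offset)
    return zlib.decompress(comp)
```

Because every chunk is an independent compressed stream, any chunk can be fetched and decompressed without touching its neighbours, which is what makes range downloads possible.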
> > I've also written a small script that will download a zchunk off the
> > internet.  If you don't specify an old file, it will just download
> > everything, but if you specify an old file, it will download the index
> > of the new file and compare the sha256sums of each chunk.  Any
> > checksums that match will be taken from the old file, and the rest will
> > be downloaded.
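A minimal sketch of the checksum comparison described above, assuming both indexes are available as lists of `(sha256, offset, length)` tuples. The name `plan_download` and that index layout are hypothetical and not taken from dl_zchunk.py.

```python
def plan_download(old_index, new_index):
    """Decide which chunks of the new file can come from the old file.

    Returns (reuse, fetch):
      reuse: (old_offset, length) pairs to copy from the local old file
      fetch: (new_offset, length) pairs that must be downloaded
    """
    old_by_digest = {digest: (offset, length)
                     for digest, offset, length in old_index}
    reuse, fetch = [], []
    for digest, offset, length in new_index:
        if digest in old_by_digest:
            # Same checksum: the compressed bytes already exist locally.
            reuse.append(old_by_digest[digest])
        else:
            fetch.append((offset, length))
    return reuse, fetch
```

The `fetch` list is what would be translated into HTTP range requests against the new file.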
> > In testing, I've seen savings ranging from 10% (December 17 to today)
> > to 95% (yesterday to today).
> > Remaining problems:
> >  * Zchunk files are bigger than their gzip equivalents.  This ranges
> >    from 5% larger for filelists.xml to 300% larger for primary.xml.
> >    This can be greatly reduced by chunking primary.xml based on srpm
> >    rather than rpm, which brings the size increase for primary.xml down
> >    to roughly 30%.
> >  * Many changes to the metadata can mean a large number of ranges
> >    requested.  I ran a check on our mirrors, and three (out of around
> >    150 that had the file I was testing) don't honor range requests at
> >    all, and three others only honor a small number in a single request.
> >    A further seven didn't respond at all (not sure if that had
> >    anything to do with the range requests), and the rest supported
> >    between 256 and 512 ranges in a single request.  We can reduce the
> >    number of ranges requested by always ordering our packages by date.
> >    This would ensure that new packages are grouped at the end of the
> >    xml where they will be grabbed in one contiguous range.
> This would "break" DNF, because libsolv assigns Ids by the order of
> packages in the metadata. So if something requires "webserver" and both
> "nginx" and "httpd" provide it (without versions), then the lowest Id is
> picked (not going into the details here). That means users will get
> different results depending on when the last update for one or the
> other was submitted. This is unacceptable from my POV.
> >  * Zchunk files use zlib (it gives better compression than xz with such
> >    small chunks), but, because they use a custom zdict, they are not gz
> >    files.  This means that we'll need new tools to read and write them.
> >    (And I am volunteering to do the work here)
> What about zstd? Also, the latest version of lz4 supports
> dictionaries too.
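The custom-dictionary point above can be illustrated with Python's standard `zlib` module, which accepts a preset dictionary through the `zdict` argument. This is only a sketch: the helper names and the sample dictionary are made up for illustration, and the real zchunk dictionary is not shown in this thread.

```python
import zlib


def compress_with_dict(data, zdict):
    """Compress data as a zlib stream that uses a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()


def decompress_with_dict(blob, zdict):
    """Decompress a stream produced by compress_with_dict().

    The same dictionary must be supplied, otherwise decompression fails.
    """
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


# Hypothetical dictionary made of strings that recur in every chunk.
SAMPLE_ZDICT = b'<package type="rpm"><name></name><arch></arch></package>'
```

A stream written this way cannot be read by a tool that does not know the dictionary, which is why standard tools cannot open such files and new readers and writers are needed.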
> > The tools:
> > The proof-of-concept tools are all sitting in
> > https://www.jdieter.net/downloads/zchunk-scripts/
> > They are full of ugly hacks, especially when it comes to parsing the
> > XML, there's little to no error reporting, and I didn't comment them
> > well at all, but they should work.
> > If all you want to do is download zchunks, you need to run dl_zchunk.py
> > with the url you want to download (ending in .zck) as the first
> > parameter.  Repodata for various days over the last few weeks is at:
> > https://www.jdieter.net/downloads/zchunk-test/  You may need to hover
> > over the links to see which is which.  The downloads directory is also
> > available over rsync at rsync://jdieter.net/downloads/zchunk-test.
> > dl_zchunk.py doesn't show anything if you download the full file, but
> > if you run the command with an old file as the second parameter, it
> > will show four numbers: bytes taken from the old file, bytes downloaded
> > from the new, total downloaded bytes and total uploaded bytes.
> > zchunk.py creates a .zck file.  To group chunks by source rpm in
> > primary.xml, run
> > ./zchunk.py <file> rpm:sourcerpm
> > unzchunk.py decompresses a .zck file to stdout
> > I realize that there's a lot to digest here, and it's late, so I know I
> > missed something.  Please let me know if you have any suggestions,
> > criticisms or flames, though it might be a few hours before I respond.
> As someone who has tried to work on this problem, I very much
> appreciate what you have done here. We started out using zsync and the
> results were quite good, but zsync is dead and has a ton of bugs. It
> also requires archives to be `--rsyncable`. So my question is: why not
> add an idx file alongside the existing files instead of inventing a new
> format? The problem is that we will have to distribute the old format
> too (for compatibility reasons).
> I'm not sure that optimizing by XML tags is a very good idea,
> especially because I hope that in the future we will stop distributing
> XML and start distributing solv/solvx.
