[Rpm-ecosystem] A proof-of-concept for delta'ing repodata
Vít Ondruch
vondruch at redhat.com
Fri Mar 2 12:18:54 UTC 2018
Jonathan,
Have you experimented with casync [1]?
Vít
[1] https://github.com/systemd/casync
On 13. 2. 2018 at 10:52, Igor Gnatenko wrote:
> CCing the rpm-ecosystem@ ML since it's the main location where this
> message should have gone 😉
>
> On Mon, 2018-02-12 at 23:53 +0200, Jonathan Dieter wrote:
> > <tl;dr>
> > I've come up with a method of splitting repodata into chunks that can
> > be downloaded and combined with chunks that are already on the local
> > system to create a byte-for-byte copy of the compressed repodata.
> > Tools and scripts are at:
> > https://www.jdieter.net/downloads/
> > </tl;dr>
>
> > Background:
> > With DNF, we're currently downloading ~20MB of repository data every
> > time the updates repository changes.
>
> > When casync was released, I wondered if we could use it to only
> > download the deltas for the repodata. At Flock last summer, I ran some
> > tests against the uncompressed repodata and saw a reduction of 30-40%
> > from one day to the next, which seemed low, but was a good starting
> > point.
>
> > Unfortunately, due to the way casync separates each file into thousands
> > of compressed chunks, building each file required thousands of (serial)
> > downloads which, even on a decent internet connection, took *forever*.
>
> > When I talked through the idea with Kevin and Patrick, they also
> > pointed out that our mirrors might not be too keen on the idea of
> > adding thousands of tiny files that change every day.
>
>
> > The Solution(?):
> > One potential solution to the "multitude of files" problem is to merge
> > the chunks back into a single file, and use HTTP ranges to only
> > download the parts of the file we want. An added bonus is that most
> > web servers are configured to support hundreds of ranges in one
> > request, which greatly reduces the number of requests we have to make.
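
For illustration, a single HTTP Range header can carry several byte ranges
at once; a minimal sketch in Python (the URL and offsets below are
placeholders, not real repodata):

    import requests

    # Placeholder URL and byte ranges of the chunks we still need.
    url = "https://example.org/repodata/primary.zck"
    needed = [(0, 1023), (52000, 57999), (910000, 914095)]

    ranges = ",".join("%d-%d" % (start, end) for start, end in needed)
    resp = requests.get(url, headers={"Range": "bytes=" + ranges})

    # 206 Partial Content -> multipart/byteranges body, one part per range;
    # 200 -> the server ignored the Range header and sent the whole file.
    print(resp.status_code, resp.headers.get("Content-Type"))
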
>
> > The other problem with casync is that its chunk separation is naïve,
> > which is why we were only achieving 30-40% savings. But we know what
> > the XML file is supposed to look like, so we can separate the chunks on
> > the tag boundaries in the XML.
>
> > So I've ditched casync altogether and put together a proof-of-concept
> > (tentatively named zchunk) that takes an XML file, compresses each tag
> > separately, and then concatenates all of them into one file. The tool
> > also creates an index file that tells you the sha256sum for each
> > compressed chunk and the location of the chunk in the file.
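
In outline, the construction looks something like the following sketch (not
the real zchunk.py; the index layout and field order here are simplified
assumptions):

    import hashlib
    import zlib

    def build_zchunk(chunks, data_path, index_path):
        """Compress each XML chunk separately, concatenate the results,
        and record the sha256sum, offset and length of every chunk."""
        offset = 0
        with open(data_path, "wb") as data, open(index_path, "w") as index:
            for chunk in chunks:              # chunks: list of bytes, one per XML tag
                compressed = zlib.compress(chunk, 9)
                digest = hashlib.sha256(compressed).hexdigest()
                data.write(compressed)
                index.write("%s %d %d\n" % (digest, offset, len(compressed)))
                offset += len(compressed)
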
>
> > I've also written a small script that will download a zchunk off the
> > internet. If you don't specify an old file, it will just download
> > everything, but if you specify an old file, it will download the index
> > of the new file and compare the sha256sums of each chunk. Any
> > checksums that match will be taken from the old file, and the rest will
> > be downloaded.
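
The reuse logic boils down to a comparison of chunk checksums, roughly like
this sketch (the index format and function name are assumptions, not the
actual dl_zchunk.py):

    def plan_download(new_index, old_index):
        """Split the new file's chunks into ones we can copy from the old
        file and byte ranges we still have to fetch over HTTP."""
        # Both indexes: lists of (sha256, offset, length) tuples.
        old_by_hash = {digest: (off, length) for digest, off, length in old_index}
        reuse, fetch = [], []
        for digest, offset, length in new_index:
            if digest in old_by_hash:
                # (new offset, (old offset, length)) -> copy from the old file
                reuse.append((offset, old_by_hash[digest]))
            else:
                # inclusive HTTP byte range to request from the mirror
                fetch.append((offset, offset + length - 1))
        return reuse, fetch

Adjacent entries in fetch can then be coalesced into contiguous ranges before
the request goes out, which is also why grouping new packages together in the
XML (discussed below) reduces the number of ranges.
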
>
> > In testing, I've seen savings ranging from 10% (December 17 to today)
> > to 95% (yesterday to today).
>
>
> > Remaining problems:
> > * Zchunk files are bigger than their gzip equivalents. This ranges
> > from 5% larger for filelists.xml to 300% larger for primary.xml.
> > This can be greatly reduced by chunking primary.xml based on srpm
> > rather than rpm, which brings the size increase for primary.xml down
> > to roughly 30%.
>
> > * Many changes to the metadata can mean a large number of ranges
> > requested. I ran a check on our mirrors, and three (out of around
> > 150 that had the file I was testing) don't honor range requests at
> > all, and three others only honor a small number in a single request.
> > A further seven didn't respond at all (not sure if that had
> > anything to do with the range requests), and the rest supported
> > between 256 and 512 ranges in a single request. We can reduce the
> > number of ranges requested by always ordering our packages by date.
> > This would ensure that new packages are grouped at the end of the
> > xml where they will be grabbed in one contiguous range.
>
> This would "break" DNF, because libsolv assigns Ids based on the order of
> packages in the metadata. So if something requires "webserver" and there is
> "nginx" and "httpd" providing it (without versions), then the lowest Id is
> picked (not going into the details of this). Which means that, depending on
> when the last update for one or the other was submitted, users will get
> different results. This is unacceptable from my POV.
>
> > * Zchunk files use zlib (it gives better compression than xz with such
> > small chunks), but, because they use a custom zdict, they are not gz
> > files. This means that we'll need new tools to read and write them.
> > (And I am volunteering to do the work here)
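
For reference, Python's zlib exposes the custom-dictionary mechanism
directly; a minimal sketch (the dictionary contents are a made-up example):

    import zlib

    # Made-up shared dictionary: boilerplate that recurs in every chunk.
    ZDICT = b'<package type="rpm"><name></name><arch>x86_64</arch>'

    def compress_chunk(chunk):
        co = zlib.compressobj(level=9, zdict=ZDICT)
        return co.compress(chunk) + co.flush()

    def decompress_chunk(data):
        do = zlib.decompressobj(zdict=ZDICT)
        return do.decompress(data) + do.flush()

Because the stream depends on an external dictionary, stock gzip tools cannot
unpack it on their own, which is the compatibility cost noted above.
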
>
> What about zstd? Also, the latest version of lz4 has support for
> dictionaries too.
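
Assuming the python-zstandard binding, a quick dictionary test could look
like this sketch (dictionary and chunk contents are made up):

    import zstandard

    # Made-up shared dictionary, just to exercise the API.
    zdict = zstandard.ZstdCompressionDict(
        b'<package type="rpm"><name></name><arch>x86_64</arch>')

    cctx = zstandard.ZstdCompressor(level=19, dict_data=zdict)
    dctx = zstandard.ZstdDecompressor(dict_data=zdict)

    chunk = b'<package type="rpm"><name>foo</name><arch>x86_64</arch></package>'
    compressed = cctx.compress(chunk)
    assert dctx.decompress(compressed) == chunk
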
>
> > The tools:
> > The proof-of-concept tools are all sitting in
> > https://www.jdieter.net/downloads/zchunk-scripts/
>
> > They are full of ugly hacks, especially when it comes to parsing the
> > XML, there's little to no error reporting, and I didn't comment them
> > well at all, but they should work.
>
> > If all you want to do is download zchunks, you need to run dl_zchunk.py
> > with the URL you want to download (ending in .zck) as the first
> > parameter. Repodata for various days over the last few weeks is at:
> > https://www.jdieter.net/downloads/zchunk-test/ You may need to hover
> > over the links to see which is which. The downloads directory is also
> > available over rsync at rsync://jdieter.net/downloads/zchunk-test.
>
> > dl_zchunk.py doesn't show anything if you download the full file, but
> > if you run the command with an old file as the second parameter, it
> > will show four numbers: bytes taken from the old file, bytes downloaded
> > from the new, total downloaded bytes and total uploaded bytes.
>
> > zchunk.py creates a .zck file. To group chunks by source rpm in
> > primary.xml, run
> > ./zchunk.py <file> rpm:sourcerpm
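
Conceptually, the rpm:sourcerpm grouping amounts to something like this
sketch (using xml.etree and the usual createrepo namespaces; the helper
itself is hypothetical, not taken from zchunk.py):

    import xml.etree.ElementTree as ET
    from collections import OrderedDict

    NS = {
        "common": "http://linux.duke.edu/metadata/common",
        "rpm": "http://linux.duke.edu/metadata/rpm",
    }

    def group_by_sourcerpm(primary_xml_path):
        """Group <package> elements by their rpm:sourcerpm value so that all
        sub-packages of one source rpm land in the same compressed chunk."""
        groups = OrderedDict()
        root = ET.parse(primary_xml_path).getroot()
        for pkg in root.findall("common:package", NS):
            srpm = pkg.findtext("common:format/rpm:sourcerpm",
                                default="", namespaces=NS)
            groups.setdefault(srpm, []).append(pkg)
        return groups
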
>
> > unzchunk.py decompresses a .zck file to stdout
>
> > I realize that there's a lot to digest here, and it's late, so I know I
> > missed something. Please let me know if you have any suggestions,
> > criticisms or flames, though it might be a few hours before I respond.
>
> As someone who has tried to work on this problem, I really appreciate what
> you have done here. We started out using zsync and the results were quite
> good, but zsync is dead and has a ton of bugs. It also requires archives to
> be `--rsyncable`. So my question is: why not add the idx file as an
> additional file alongside the existing files instead of inventing a new
> format? The problem is that we will have to keep distributing the old
> format too (for compatibility reasons).
>
> I'm not sure that trying to optimize by XML tags is a very good idea,
> especially because I hope that in the future we will stop distributing XMLs
> and start distributing solv/solvx.