<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
Jonathan,<br>
<br>
Have you experimented with casync [1]?<br>
<br>
Vít<br>
<br>
<br>
[1] <a class="moz-txt-link-freetext" href="https://github.com/systemd/casync">https://github.com/systemd/casync</a><br>
<br>
<br>
<br>
On 13.2.2018 at 10:52, Igor Gnatenko wrote:<br>
<blockquote type="cite">CCing the rpm-ecosystem@ ML since it's the main location where this
message should have gone 😉<br>
<br>
On Mon, 2018-02-12 at 23:53 +0200, Jonathan Dieter wrote:<br>
> &lt;tl;dr&gt;<br>
> I've come up with a method of splitting repodata into chunks
that can<br>
> be downloaded and combined with chunks that are already on
the local<br>
> system to create a byte-for-byte copy of the compressed
repodata. <br>
> Tools and scripts are at:<br>
> <a class="moz-txt-link-freetext" href="https://www.jdieter.net/downloads/">https://www.jdieter.net/downloads/</a><br>
> &lt;/tl;dr&gt;<br>
<br>
> Background:<br>
> With DNF, we're currently downloading ~20MB of repository
data every<br>
> time the updates repository changes.<br>
<br>
> When casync was released, I wondered if we could use it to
only<br>
> download the deltas for the repodata. At Flock last summer,
I ran some<br>
> tests against the uncompressed repodata and saw a reduction
of 30-40%<br>
> from one day to the next, which seemed low, but was a good
starting<br>
> point.<br>
<br>
> Unfortunately, due to the way casync separates each file into
thousands<br>
> of compressed chunks, building each file required thousands
of (serial)<br>
> downloads which, even on a decent internet connection, took
*forever*.<br>
<br>
> When I talked through the idea with Kevin and Patrick, they
also<br>
> pointed out that our mirrors might not be too keen on the
idea of<br>
> adding thousands of tiny files that change every day.<br>
<br>
<br>
> The Solution(?):<br>
> One potential solution to the "multitude of files" problem is
to merge<br>
> the chunks back into a single file, and use HTTP ranges to
only<br>
> download the parts of the file we want. An added bonus is
that most<br>
> web servers are configured to support hundreds of ranges in
one<br>
> request, which greatly reduces the number of requests we have
to make.<br>
<br>
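For illustration only, a multi-range request of the kind described above might look like this with the third-party python-requests library (the URL and byte offsets are made up):<br>
<pre>
import requests  # third-party HTTP library, used here only for brevity

# Ask the mirror for several non-contiguous pieces of one file in a single
# request.  A server that honours multi-range requests answers with
# "206 Partial Content" and a multipart/byteranges body.
url = "https://mirror.example.org/repodata/primary.xml.zck"   # hypothetical URL
ranges = "bytes=0-1023,52000-60999,731000-742499"             # made-up offsets

resp = requests.get(url, headers={"Range": ranges})
print(resp.status_code)                   # 206 if the ranges were honoured
print(resp.headers.get("Content-Type"))   # multipart/byteranges; boundary=...
</pre>
<br>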
> The other problem with casync is that its chunk separation
is naïve,<br>
> which is why we were only achieving 30-40% savings. But we
know what<br>
> the XML file is supposed to look like, so we can separate the
chunks on<br>
> the tag boundaries in the XML.<br>
<br>
> So I've ditched casync altogether and put together a
proof-of-concept<br>
> (tentatively named zchunk) that takes an XML file, compresses
each tag<br>
> separately, and then concatenates all of them into one file.
The tool<br>
> also creates an index file that tells you the sha256sum for
each<br>
> compressed chunk and the location of the chunk in the file.<br>
<br>
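In sketch form, the layout described above amounts to something like the following (this is not the actual zchunk.py code; the chunk splitting and the JSON index are only stand-ins for whatever the real format is):<br>
<pre>
import hashlib
import json
import zlib

def write_zchunk(chunks, data_path, index_path):
    """Compress each chunk (e.g. one XML tag) separately, concatenate the
    results into one file, and record each chunk's sha256sum and byte range
    in a separate index file."""
    index = []
    offset = 0
    with open(data_path, "wb") as out:
        for chunk in chunks:                  # chunks: an iterable of bytes
            compressed = zlib.compress(chunk)
            out.write(compressed)
            index.append({"sha256": hashlib.sha256(compressed).hexdigest(),
                          "offset": offset,
                          "length": len(compressed)})
            offset += len(compressed)
    with open(index_path, "w") as idx:
        json.dump(index, idx, indent=1)       # the real .zck index is not JSON
</pre>
<br>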
> I've also written a small script that will download a zchunk
off the<br>
> internet. If you don't specify an old file, it will just
download<br>
> everything, but if you specify an old file, it will download
the index<br>
> of the new file and compare the sha256sums of each chunk.
Any<br>
> checksums that match will be taken from the old file, and the
rest will<br>
> be downloaded.<br>
<br>
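The reuse decision itself is just a comparison of the two indexes; roughly (a sketch reusing the invented index layout above, with the missing chunks then fetched as HTTP ranges):<br>
<pre>
def split_chunks(new_index, old_index):
    """Compare the sha256sums recorded in the two indexes and decide which
    chunks can be copied from the old file and which must be downloaded."""
    old_sums = {entry["sha256"] for entry in old_index}
    reuse = [e for e in new_index if e["sha256"] in old_sums]
    download = [e for e in new_index if e["sha256"] not in old_sums]
    return reuse, download
</pre>
<br>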
> In testing, I've seen savings ranging from 10% (December 17
to today)<br>
> to 95% (yesterday to today).<br>
<br>
<br>
> Remaining problems:<br>
> * Zchunk files are bigger than their gzip equivalents. This
ranges<br>
> from 5% larger for filelists.xml to 300% larger for
primary.xml. <br>
> This can be greatly reduced by chunking primary.xml based
on srpm<br>
> rather than rpm, which brings the size increase for
primary.xml down<br>
> to roughly 30%.<br>
<br>
> * Many changes to the metadata can mean a large number of
ranges<br>
> requested. I ran a check on our mirrors, and three (out
of around<br>
> 150 that had the file I was testing) don't honor range
requests at<br>
> all, and three others only honor a small number in a
single request.<br>
> A further seven didn't respond at all (not sure if that
had<br>
> anything to do with the range requests), and the rest
supported<br>
> between 256 and 512 ranges in a single request. We can
reduce the<br>
> number of ranges requested by always ordering our packages
by date. <br>
> This would ensure that new packages are grouped at the end
of the<br>
> XML, where they will be grabbed in one contiguous range.<br>
<br>
This would "break" DNF, because libsolv assigns Ids based on the order of<br>
packages in the metadata. So if something requires "webserver" and both "nginx"<br>
and "httpd" provide it (without versions), the lowest Id is picked (not going<br>
into the details of this). That means that, depending on when the last update<br>
for one or the other was submitted, users will get different results. This is<br>
unacceptable from my POV.<br>
<br>
> * Zchunk files use zlib (it gives better compression than xz
with such<br>
> small chunks), but, because they use a custom zdict, they
are not gz<br>
> files. This means that we'll need new tools to read and
write them.<br>
> (And I am volunteering to do the work here)<br>
<br>
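(For reference, Python's zlib module exposes the preset-dictionary feature directly; a minimal sketch with made-up dictionary contents, not taken from the zchunk scripts:)<br>
<pre>
import zlib

# A preset dictionary seeds the compressor with byte strings it may refer
# back to, which matters a lot when each chunk is only a few hundred bytes.
zdict = b"name= arch=x86_64 epoch=0 ver= rel=.fc27 "   # contents made up

comp = zlib.compressobj(level=9, zdict=zdict)
blob = comp.compress(b"name=zchunk arch=x86_64 epoch=0 ver=0.1 rel=1.fc27 ")
blob += comp.flush()

# Decompression has to supply the same dictionary; plain gunzip has no way
# to do that, which is why the chunks are not ordinary .gz members.
decomp = zlib.decompressobj(zdict=zdict)
original = decomp.decompress(blob) + decomp.flush()
</pre>
<br>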
What about zstd? The latest version of lz4 also supports dictionaries.<br>
<br>
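(For comparison, zstd supports the same preset-dictionary idea; a sketch assuming the third-party python-zstandard bindings are available:)<br>
<pre>
import zstandard  # third-party python-zstandard bindings

zdict = zstandard.ZstdCompressionDict(b"name= arch=x86_64 epoch=0 ")  # made up
cctx = zstandard.ZstdCompressor(level=19, dict_data=zdict)
blob = cctx.compress(b"name=zchunk arch=x86_64 epoch=0 ver=0.1 ")

dctx = zstandard.ZstdDecompressor(dict_data=zdict)
original = dctx.decompress(blob)
</pre>
<br>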
> The tools:<br>
> The proof-of-concept tools are all sitting in<br>
> <a class="moz-txt-link-freetext" href="https://www.jdieter.net/downloads/zchunk-scripts/">https://www.jdieter.net/downloads/zchunk-scripts/</a><br>
<br>
> They are full of ugly hacks, especially when it comes to
parsing the<br>
> XML, there's little to no error reporting, and I didn't
comment them<br>
> well at all, but they should work.<br>
<br>
> If all you want to do is download zchunks, you need to run
dl_zchunk.py<br>
> with the URL you want to download (ending in .zck) as the
first<br>
> parameter. Repodata for various days over the last few weeks
is at:<br>
> <a class="moz-txt-link-freetext" href="https://www.jdieter.net/downloads/zchunk-test/">https://www.jdieter.net/downloads/zchunk-test/</a>.  You may need
to hover<br>
> over the links to see which is which. The downloads
directory is also<br>
> available over rsync at
rsync://jdieter.net/downloads/zchunk-test.<br>
<br>
> dl_zchunk.py doesn't show anything if you download the full
file, but<br>
> if you run the command with an old file as the second
parameter, it<br>
> will show four numbers: bytes taken from the old file, bytes
downloaded<br>
> from the new, total downloaded bytes and total uploaded
bytes.<br>
<br>
> zchunk.py creates a .zck file. To group chunks by source rpm
in<br>
> primary.xml, run<br>
> ./zchunk.py &lt;file&gt; rpm:sourcerpm<br>
<br>
> unzchunk.py decompresses a .zck file to stdout<br>
<br>
> I realize that there's a lot to digest here, and it's late,
so I know I<br>
> missed something. Please let me know if you have any
suggestions,<br>
> criticisms or flames, though it might be a few hours before I
respond.<br>
<br>
As someone who has tried to work on this problem, I very much appreciate what<br>
you have done here. We started out with zsync and the results were quite good,<br>
but zsync is dead and has a ton of bugs. It also requires archives to be<br>
`--rsyncable`. So my question is: why not add the idx file as an additional<br>
file alongside the existing ones, instead of inventing a new format? The<br>
problem is that we will have to distribute the old format too (for<br>
compatibility reasons).<br>
<br>
I'm not sure that optimizing by XML tags is a very good idea, especially<br>
because I hope that in the future we will stop distributing XMLs and start<br>
distributing solv/solvx.<br>
</blockquote>
<span style="white-space: pre-wrap; display: block; width: 98vw;">>
> _______________________________________________
> Rpm-ecosystem mailing list
> <a class="moz-txt-link-abbreviated" href="mailto:Rpm-ecosystem@lists.rpm.org">Rpm-ecosystem@lists.rpm.org</a>
> <a class="moz-txt-link-freetext" href="http://lists.rpm.org/mailman/listinfo/rpm-ecosystem">http://lists.rpm.org/mailman/listinfo/rpm-ecosystem</a>
</span><br>
<br>
</body>
</html>