[Rpm-maint] RFC: RPMTAG_IDENTITY calculation

Alexey Tourbin alexey.tourbin at gmail.com
Fri Mar 30 04:50:09 UTC 2018


On Thu, Mar 29, 2018 at 7:55 PM, Vladimir D. Seleznev
<vseleznv at altlinux.org> wrote:
> Hello, rpm-maint@!
>
> There are RFC patches which implement RPMTAG_IDENTITY calculation.
>
> The main idea is that RPMTAG_IDENTITY contains a hash of as many as possible,
> ideally all RPMTAGs, with exception of that that principally cannot be
> reproducible and that we don't want to make it reproducible. Another exception
> is for these tags that we want to use in certain cases, but only for these tags
> that aren't relevant to result of package build. So value of RPMTAG_IDENTITY is
> calculating by blacklist filtered tags for each package.

Hello,

So previously you outlined two use cases for RPMTAG_IDENTITY:
- stricter dependencies between subpackages, presumably to replace
Requires: %name = %EVR with something like %name = %EVR-%{NameBuildID}
or %name = id:%{NameBuildId};
- verifying if a build is reproducible, in which case %{NameBuildID}s
should stay the same across rebuilds.

Let's start with stricter dependencies between subpackages.  My first
observation is that end users, who update only from
centralized/verified repositories, do not benefit from stricter
dependencies.  That's because either of the following holds:
- rpm does not permit updating packages with %EVR unchanged; or even if it does,
- the build system which serves a centralized repo does not permit
inplace package updates; or even if it does,
- the package manager on top of rpm doesn't pull packages with unchanged
%EVR; or even if it does, it can pull and update all installed
subpackages at once.

So there is no plausible way for an end user to end up with
subpackages (no pun intended) from different build sets.  There is a
way to end up with subpackages from different build sets for
developers who do incremental package builds without bumping the
release.  But said developers, I perhaps among them, must somehow
learn to solve their problems without involving everybody else.  So my
second observation is that there indeed exist some facilities which
only beg to be used and render the whole issue a non-issue.

For example, if you do incremental builds in a Mandriva-based Russian
distro, you should try this:

$ sudo rpm -Fv RPMS.hasher/*.rpm

This will "freshen" all installed packages, and that I think is the
best way to handle different build sets.  The only case where it
doesn't work nearly as perfectly is when the set of subpackage names
can change in an arbitrary way. But neither will RPMTAG_IDENTITY
handle perfectly all such situations!  Thus my third observation is
that the problem has not been examined properly from a mathematical
standpoint.  Using subpackages from a single build set is a stronger
requirement which cannot be satisfied by simply producing stricter
dependencies within connected components.
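
To make the last point concrete, here is a minimal sketch in Python.  The
package names, the build set labels and the helper functions are all made
up for illustration; none of this is rpm code.  It models "stricter
dependencies" as "requirer and required package come from the same build
set", and shows that a subpackage sitting in its own connected component
can still come from a different build set while every strict dependency
is satisfied.

```python
from collections import defaultdict

def components(packages, requires):
    """Connected components of the undirected subpackage dependency graph."""
    adj = defaultdict(set)
    for a, b in requires:
        adj[a].add(b)
        adj[b].add(a)
    seen, result = set(), []
    for p in packages:
        if p in seen:
            continue
        comp, stack = set(), [p]
        while stack:
            q = stack.pop()
            if q in comp:
                continue
            comp.add(q)
            stack.extend(adj[q] - comp)
        seen |= comp
        result.append(comp)
    return result

def strict_deps_satisfied(installed, requires):
    """True if every requirer comes from the same build set as its target."""
    return all(installed[a] == installed[b] for a, b in requires)

# Hypothetical source package with three subpackages; foo-doc has no
# dependency relation to the other two.
packages = ["foo", "libfoo", "foo-doc"]
requires = [("foo", "libfoo")]

# Installed state after a partial update: foo-doc is left over from build1.
installed = {"foo": "build2", "libfoo": "build2", "foo-doc": "build1"}

comps = components(packages, requires)
deps_ok = strict_deps_satisfied(installed, requires)
single_build_set = len(set(installed.values())) == 1

print(len(comps), deps_ok, single_build_set)
```

Stricter dependencies only constrain packages within one component, so
the "single build set" property has to be enforced by some other means.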

By the way, I believe there might be legitimate reasons for partial
upgrades, on the premise that one knows what he or she is doing.  For
example, if I make changes to a library, I may want to update only the
library subpackage.  Or, if there is a big noarch subpackage with
data, I have every reason to leave it alone.

This further brings up the problem of noarch subpackages.  They are
supposed to be installable on any architecture, but stricter
subpackage dependencies can change that.  Let's do some case analysis:
- arch->noarch, i.e. a binary package requires its base noarch
subpackage; this will result in very rigorous requirements to noarch
subpackages: they must hatch byte-to-byte identical on every
architecture, or else the dependency will be broken.  This might
actually make sense, or it might not.  I'm inclined towards the
latter, and here's why: strict dependencies between subpackages are a very
basic mechanism, while the identity of noarch packages, the right
amount of it, is subject to interpretation, and is a matter of policy.
So the build system should orchestrate synchronous builds across
architectures and then check if noarch subpackages are identical
enough, according to its policy.  Shifting the responsibility down to
rpm would compromise the mechanism/policy distinction.
- noarch->arch, i.e. a noarch subpackage requires its base binary
package; with stricter dependencies, that would be outright wrong,
because noarch subpackage can't know byte-to-byte specifics of binary
packages, and the dependency will be broken one way or the other, most
of the time;
- further amendments to how strict dependencies are propagated between
subpackages must be made to build/interdep.c; since interdep.c is not
part of rpm.org, I'll omit the details.

> Previously I wrote that RPMTAG_IDENTITY value will be used to generate more
> strict interpackage dependencies, but we turn away from it because identity of
> binary packages of two builds from one source package can be same for some
> packages and differ for others, and it brings collision for them.

So I actually was intrigued and waited to see your patches, in
particular how you handle dependencies, before expressing my opinion,
but it turns out there will be no patches regarding dependencies.

Well, this leaves the case of build id.  Suppose you built a package,
and you want to know its build id.  So you open the package and read
its build id from the header.  Further suppose that the package is
stored on a hard drive (a very plausible assumption indeed).  Further
suppose the drive makes about 6,000 revolutions per minute, so it
takes about 0.01s to start reading the header.  About a megabyte can
be read in another 0.01s, an average header being much smaller.
According to blake2.net, data can be hashed at a speed of about 900
MB/s, so it will take about 0.001s or less to recalculate the build id
on the fly.  The thing is, it's just reading the header that is
already expensive; once you have the data, calculating the hash is
cheap, and the difference is more than an order of magnitude.  The
difference in speed will be less pronounced with SSD.  Still, you need
to read at least 4K, because that's how filesystems work.  So putting
RPMTAG_IDENTITY into the signature header won't reduce nearly as much
overhead as you might hope.
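
The back-of-the-envelope arithmetic above can be written down as a short
Python sketch.  All constants are the rough estimates from this message,
not measurements, and the 1 MB header size is a deliberately generous
upper bound.

```python
# Rough estimates taken from the text; none of these are measured values.
RPM = 6000           # drive revolutions per minute
READ_RATE = 100e6    # bytes per second, i.e. about 1 MB per 0.01 s
HASH_RATE = 900e6    # bytes per second, approximate BLAKE2 throughput
HEADER_SIZE = 1e6    # bytes; an average header is much smaller

latency = 60.0 / RPM                   # one revolution before reading starts
read_time = HEADER_SIZE / READ_RATE    # time to read the header data
hash_time = HEADER_SIZE / HASH_RATE    # time to hash the same data

io_time = latency + read_time
ratio = io_time / hash_time

print(f"latency   {latency:.4f} s")
print(f"read      {read_time:.4f} s")
print(f"hash      {hash_time:.4f} s")
print(f"I/O is about {ratio:.0f}x the hashing cost")
```

With these figures the I/O cost comes out well over ten times the
hashing cost, which is the "more than an order of magnitude" mentioned
above.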

