[Rpm-maint] [PATCH] Add RPMTAG_IDENTITY calculation as tag extension
Jeff Johnson
n3npq at me.com
Fri Apr 6 12:26:54 UTC 2018
> On Apr 6, 2018, at 1:52 AM, Panu Matilainen <pmatilai at redhat.com> wrote:
>
>> On 04/05/2018 03:42 PM, Vladimir D. Seleznev wrote:
>>> On Thu, Apr 05, 2018 at 11:41:33AM +0300, Panu Matilainen wrote:
>>>> On 04/03/2018 10:31 PM, Vladimir D. Seleznev wrote:
>>>> RPMTAG_IDENTITY is calculating as digest of part of package header that
>>>> does not contain irrelevant to package build tag entries.
>>>>
>>>> Mathematically RPMTAG_IDENTITY value is a result of function of two
>>>> variable: a package header and an rpm utility, thus this value can
>>>> differ for same package and different version of rpm.
>>>
>>> Before proceeding with further work on this, we need to define what is
>>> it that we're trying to identify. The above definition is very
>>> ambiguous, and it's impossible to properly review + discuss the patch
>>> when my idea of package identity might be entirely different from
>>> somebody else's idea, that'll only cause unnecessary work and frustration.
>> Agree, that commit message isn't clear.
>>> Starting with, what is a "package"? Are we talking about the source
>>> package, or binary packages?
>> Originally it was about binary packages, but is there really a difference?
>> Source packages are built just as binary ones are, and something can
>> change after a rebuild.
>
> Source *packages* are built too, yes, but there's a vast difference between reproducibility of src.rpm and binary rpm.
>
> However while reviewing the patch yesterday, I realized I've been increasingly thinking about *source* identity (note the lack of "package"), which is something quite different: you'd calculate a digest over the unparsed spec + all the sources and patches etc the spec refers to [*] and save it in the header of binaries and sources on build. This would let you identify all the packages that have been built from the same source, ie whether the package was built eg on Fedora or RHEL (it's fairly common to share specs between them) or whatever it'd have the same source id.
>
> [*] obviously you need to parse the spec to get those references and it's possible to create specs where this differs between arches, but sane specs use same sources + patches between archs etc
>
>>> If it's binaries, then we're always ultimately talking about a *build*,
>>> and a line needs to be drawn somewhere.
>> OK.
>>> There are any number of ways to draw such a line, so it needs to be
>>> explicitly stated. One example of such line could be something like
>>> "package id must match between a package built on different instances
>>> of the same operating system, version and architecture". That clearly
>>> is NOT the line that this version of the patch tries to draw, but then
>>> it's not at all clear to me what that line is supposed to be.
>> I think the line should be drawn from the other side: if the package
>> identity matches between packages built in the same build environment,
>> then the build is reproducible.
>> A possible new version of the commit message is below:
>> Add RPMTAG_IDENTITY calculation as tag extension
>> RPMTAG_IDENTITY is calculating as digest of values of significant
>> package header tag entries and represents package build characteristics.
>> The main purpose of package identity is reproducible build verification:
>> if package identity is matched between package build on same build
>> environment, then the package build is reproducible for this
>> environment.
>
> Right, reproducibility is one such line and that'd be a much better description.
>
> I do think that RPMTAG_IDENTITY is an overly broad name for such a narrow purpose though - note how it led me to think about the source level identity instead. Something towards "build id" maybe, but we don't want to mix it up with debuginfo buildid. No need to get hung up on it right now though, just something to think about.
>
To handle multiple types of reproducibility, IDENTITY needs to be computed across an array of hashes of elements, not the elements themselves.
One reason for the extra layer of an array of hashes of elements is to diagnose reproducibility failures: when two IDENTITY values fail to match, one can compare the arrays and identify which element differs.
For the (overly simple) case of tags in a header, the array holds one hash per tag, computed over that tag's data, and IDENTITY is then the digest of the array. For diagnostic purposes, the tag name should also be appended to each tag data hash.
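A minimal sketch of the header case in Python (not rpm code; the function names, the SHA-256 choice, the "hash-name" element format and the sorted ordering are all assumptions made for illustration):

```python
import hashlib


def element_hash(tag_name, data):
    """Hash one tag's data and append the tag name for diagnostics."""
    # Hex digests never contain "-", so "-" is a safe separator here.
    return hashlib.sha256(data).hexdigest() + "-" + tag_name


def identity(tags):
    """Return (IDENTITY, array of labelled tag hashes) for a dict of tag data."""
    # Sorting by tag name is an assumption, so tag order cannot change the result.
    elements = [element_hash(name, tags[name]) for name in sorted(tags)]
    digest = hashlib.sha256("\n".join(elements).encode()).hexdigest()
    return digest, elements


def differing(elements_a, elements_b):
    """List the tag names whose labelled hash differs between two arrays."""
    by_name_a = {e.split("-", 1)[1]: e for e in elements_a}
    by_name_b = {e.split("-", 1)[1]: e for e in elements_b}
    names = by_name_a.keys() | by_name_b.keys()
    return sorted(n for n in names if by_name_a.get(n) != by_name_b.get(n))


# Hypothetical tag data from two builds of the same package.
build1 = {"Name": b"foo", "Version": b"1.0", "Requires": b"libc"}
build2 = {"Name": b"foo", "Version": b"1.0", "Requires": b"libc, libm"}
id1, elems1 = identity(build1)
id2, elems2 = identity(build2)
mismatch = differing(elems1, elems2)  # names of the tags that broke the match
```

When id1 and id2 differ, the two arrays are what you compare: the appended tag names point straight at the element that changed.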
By using a hash on an array of hashes, the mechanism can be chained/extended.
Consider, say, a TRANSACTION_IDENTITY computed over an array of package IDENTITY hashes, each of which is in turn computed over tag data hashes as in the (overly simple) example we have been considering.
See how Nix (the functional package manager that NixOS is built on) appends human readable strings to hashes, which gives an easily readable format for the array elements in the IDENTITY value.
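The chaining can be sketched the same way in Python (again an illustration under assumptions, not rpm code: the function names, SHA-256, the "hash-label" format and the order-independent sorting are all hypothetical choices):

```python
import hashlib


def digest_of(elements):
    """Digest of an array of labelled hashes; sorted so order does not matter."""
    return hashlib.sha256("\n".join(sorted(elements)).encode()).hexdigest()


def package_identity(pkg_name, tags):
    """Package-level IDENTITY: digest over per-tag hashes, labelled with the name."""
    tag_hashes = [
        hashlib.sha256(data).hexdigest() + "-" + tag for tag, data in tags.items()
    ]
    return digest_of(tag_hashes) + "-" + pkg_name


def transaction_identity(packages):
    """Transaction-level IDENTITY: digest over the package IDENTITY elements."""
    return digest_of(package_identity(n, t) for n, t in packages.items())


# Hypothetical transaction made of two packages.
txn = {
    "foo": {"Name": b"foo", "Version": b"1.0"},
    "bar": {"Name": b"bar", "Version": b"2.3"},
}
txn_id = transaction_identity(txn)
```

Each level only ever hashes the labelled hashes of the level below, so a further level on top of transactions could reuse digest_of unchanged.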
hth
73 de Jeff
> - Panu -
> _______________________________________________
> Rpm-maint mailing list
> Rpm-maint at lists.rpm.org
> http://lists.rpm.org/mailman/listinfo/rpm-maint