[Rpm-maint] [PATCH] Add RPMTAG_IDENTITY calculation as tag extension

Panu Matilainen pmatilai at redhat.com
Thu Apr 5 06:12:14 UTC 2018


On 04/04/2018 05:32 PM, Vladimir D. Seleznev wrote:
> On Wed, Apr 04, 2018 at 09:18:35AM -0400, Jeff Johnson wrote:
>>
>>
>>> On Apr 4, 2018, at 5:47 AM, Panu Matilainen <pmatilai at redhat.com> wrote:
>>>
>>> On 04/04/2018 11:01 AM, Panu Matilainen wrote:
>>> [...]
>>>>> +static int identityTag(Header h, rpmtd td, headerGetFlags hgflags)
>>>>> +{
>>>>> +    int rc = 1;
>>>>> +    unsigned int bsize;
>>>>> +    void * hb;
>>>>> +    DIGEST_CTX ctx;
>>>>> +    char * identity = NULL;
>>>>> +
>>>>> +    struct rpmtd_s ttd;
>>>>> +    Header ih = headerNew();
>>>>> +    HeaderIterator hi = headerInitIterator(h);
>>>>> +
>>>>> +    while (headerNext(hi, &ttd) && rc) {
>>>>> +    if (rpmtdCount(&ttd) > 0) {
>>>>> +        if (!identityFilter(ttd.tag))
>>>>> +        rc = headerPut(ih, &ttd, HEADERPUT_DEFAULT);
>>>>> +    }
>>>>> +    rpmtdFreeData(&ttd);
>>>>> +    }
>>>>> +    headerFreeIterator(hi);
>>>>> +
>>>>> +    if (!rc)
>>>>> +    return 0;
>>>>> +
>>>>> +    hb = headerExport(ih, &bsize);
>>>
>>> Forgot to bring this up on the first round: another question is
>>> whether it actually makes sense to go through all this trouble of
>>> copying the header, then exporting it and calculating the digest
>>> from that.
>>>
>>> It'd be considerably cheaper to calculate a digest on the go while
>>> iterating over the data (from the immutable region, see my other
>>> email). A newly imported header is guaranteed to be sorted (by tags)
>>> so it's consistent.
>>>
>>
>> There is no particular reason why the IDENTITY digest should need/use
>> a header blob.
>>
>> Any faithful transformation (e.g. using sprintf or hex strings) on the
>> data for the set of tags can be used for an IDENTITY digest. The
>> header blob implicitly includes padding, which can/will change
>> depending on the definition ordering even when the tag data is
>> identical/reproducible.
> 
> Ok.
> 
>> The filtering should also be positive/inclusive rather than
>> negative/exclusive. While a positive list of tags to include is harder
>> to enumerate than the shorter list of tags to exclude, the IDENTITY
>> tag set will then be closed and well known.
> 
> I still don't understand why filtering should be inclusive rather than
> exclusive? The only reason I see is arbitrary tags but the code
> currently does not include tags beyond RPMTAG_FIRSTFREE_TAG.

With negative filtering you'll never quite know which tags were 
included. Older packages have things that newer packages don't, and 
coming from different sources who knows what patches the building rpm 
was carrying. And you'll be playing "what we forgot" for quite some time 
this way. After sleeping over it, for example:

+    case RPMTAG_CHANGELOG:

This is not an actual tag in packages, it's only used by the spec parser 
internally. To filter out the changelog from identity you'll need these 
three:

     RPMTAG_CHANGELOGTIME        = 1080, /* i[] */
     RPMTAG_CHANGELOGNAME        = 1081, /* s[] */
     RPMTAG_CHANGELOGTEXT        = 1082, /* s[] */

Other tags you'd need to filter out but current patch misses:
- RPMTAG_RPMVERSION (reflects the version used to build rpm, which is 
not a part of a packages identity)
- RPMTAG_OPTFLAGS (reflects building rpm configuration, not the package id)
- RPMTAG_PAYLOADFORMAT, RPMTAG_PAYLOADCOMPRESSOR, RPMTAG_PAYLOADFLAGS 
(these reflect rpm configuration, not package id)
- RPMTAG_OS, RPMTAG_PLATFORM (if filtering arch then surely these too)
- RPMTAG_RHNPLATFORM (older packages can have this)
- RPMTAG_VENDOR, RPMTAG_PACKAGER (typically set from configuration)
- RPMTAG_FILEDIGESTS, RPMTAG_FILEDIGESTALGO (these depend on rpm 
configuration, and also compiler versions, build time encoded in files etc)

...and still I didn't go through each and every tag, which is what will 
be needed either way. Only with blacklisting, you'll never *really* 
know. With whitelisting, identity instantly becomes well-defined, even 
if it turns out it's missing something it should have.

A random remark of the day:
RPMTAG_RELEASE you clearly want to include, but then for example in 
Fedora/RHEL ecosystem part of it is very commonly partially defined by 
%dist macro which comes from rpm config. So the package id would change 
depending on which distro it was built on, even if it otherwise did not 
change. ID as a build-time digest of the unparsed spec (and all the 
associated sources) would avoid those issues...

> 
>> The simplest way to mark the positive/inclusive IDENTITY tag set is to
>> change the awk script that generates the tag table to add a marker
>> (like the array return marker) to identify which tags to include.
>> The members (and ordering) of the IDENTITY tag set might also need to
>> be configurable without recompiling.
> 
> May be it would be easier to add marker that tells that marked tag
> should be filtered from IDENTITY by calculation? But what is best way to
> do that? Should I add a new field to struct headerTagTableEntry_s? Or I
> miss something?

AIUI that's exactly what Jeff was suggesting. But adding a new field to 
headerTagTableEntry_s would require also adding an API to access it 
which seems a bit weird I think. An alternative would be constructing 
the identityFilter() function into a .C file with eg an awk script 
that's then included into tagexts.c.

	- Panu -

> 
>> But overall, dynamically generating the IDENTITY tag set withe a tag
>> extension can be deployed (and back ported and maintained) far more
>> easily than changing header code.
>>
>> Nice job!
> 



More information about the Rpm-maint mailing list