[Rpm-maint] [PATCH] Add RPMTAG_IDENTITY calculation as tag extension

Jeff Johnson n3npq at me.com
Thu Apr 5 21:00:08 UTC 2018



> On Apr 5, 2018, at 2:12 AM, Panu Matilainen <pmatilai at redhat.com> wrote:
> 
>> On 04/04/2018 05:32 PM, Vladimir D. Seleznev wrote:
>>> On Wed, Apr 04, 2018 at 09:18:35AM -0400, Jeff Johnson wrote:
>>> 
>>> 
>>>> On Apr 4, 2018, at 5:47 AM, Panu Matilainen <pmatilai at redhat.com> wrote:
>>>> 
>>>> On 04/04/2018 11:01 AM, Panu Matilainen wrote:
>>>> [...]
>>>>>> +static int identityTag(Header h, rpmtd td, headerGetFlags hgflags)
>>>>>> +{
>>>>>> +    int rc = 1;
>>>>>> +    unsigned int bsize;
>>>>>> +    void * hb;
>>>>>> +    DIGEST_CTX ctx;
>>>>>> +    char * identity = NULL;
>>>>>> +
>>>>>> +    struct rpmtd_s ttd;
>>>>>> +    Header ih = headerNew();
>>>>>> +    HeaderIterator hi = headerInitIterator(h);
>>>>>> +
>>>>>> +    while (headerNext(hi, &ttd) && rc) {
>>>>>> +    if (rpmtdCount(&ttd) > 0) {
>>>>>> +        if (!identityFilter(ttd.tag))
>>>>>> +        rc = headerPut(ih, &ttd, HEADERPUT_DEFAULT);
>>>>>> +    }
>>>>>> +    rpmtdFreeData(&ttd);
>>>>>> +    }
>>>>>> +    headerFreeIterator(hi);
>>>>>> +
>>>>>> +    if (!rc)
>>>>>> +    return 0;
>>>>>> +
>>>>>> +    hb = headerExport(ih, &bsize);
>>>> 
>>>> Forgot to bring this up on the first round: another question is
>>>> whether it actually makes sense to go through all this trouble of
>>>> copying the header, then exporting it and calculating the digest
>>>> from that.
>>>> 
>>>> It'd be considerably cheaper to calculate a digest on the go while
>>>> iterating over the data (from the immutable region, see my other
>>>> email). A newly imported header is guaranteed to be sorted (by tags)
>>>> so it's consistent.
>>> 
>>> There is no particular reason why the IDENTITY digest should need/use
>>> a header blob.
>>> 
>>> Any faithful transformation (e.g. using sprintf or hex strings) on the
>>> data for the set of tags can be used for an IDENTITY digest. The
>>> header blob implicitly includes padding, which can/will change
>>> depending on the definition ordering even when the tag data is
>>> identical/reproducible.
>> Ok.
>>> The filtering should also be positive/inclusive rather than
>>> negative/exclusive. While a positive list of tags to include is harder
>>> to enumerate than the shorter list of tags to exclude, the IDENTITY
>>> tag set will then be closed and well known.
>> I still don't understand why filtering should be inclusive rather than
>> exclusive? The only reason I see is arbitrary tags but the code
>> currently does not include tags beyond RPMTAG_FIRSTFREE_TAG.
> 
> With negative filtering you'll never quite know which tags were included. Older packages have things that newer packages don't, and coming from different sources who knows what patches the building rpm was carrying. And you'll be playing "what we forgot" for quite some time this way. After sleeping over it, for example:
> 
> +    case RPMTAG_CHANGELOG:
> 
> This is not an actual tag in packages, it's only used by the spec parser internally. To filter out the changelog from identity you'll need these three:
> 
>    RPMTAG_CHANGELOGTIME        = 1080, /* i[] */
>    RPMTAG_CHANGELOGNAME        = 1081, /* s[] */
>    RPMTAG_CHANGELOGTEXT        = 1082, /* s[] */
> 
> Other tags you'd need to filter out but current patch misses:
> - RPMTAG_RPMVERSION (reflects the version used to build rpm, which is not a part of a packages identity)
> - RPMTAG_OPTFLAGS (reflects building rpm configuration, not the package id)
> - RPMTAG_PAYLOADFORMAT, RPMTAG_PAYLOADCOMPRESSOR, RPMTAG_PAYLOADFLAGS (these reflect rpm configuration, not package id)
> - RPMTAG_OS, RPMTAG_PLATFORM (if filtering arch then surely these too)
> - RPMTAG_RHNPLATFORM (older packages can have this)
> - RPMTAG_VENDOR, RPMTAG_PACKAGER (typically set from configuration)
> - RPMTAG_FILEDIGESTS, RPMTAG_FILEDIGESTALGO (these depend on rpm configuration, and also compiler versions, build time encoded in files etc)
> 
> ...and still I didn't go through each and every tag, which is what will be needed either way. Only with blacklisting, you'll never *really* know. With whitelisting, identity instantly becomes well-defined, even if it turns out it's missing something it should have.
> 
> A random remark of the day:
> RPMTAG_RELEASE you clearly want to include, but then for example in Fedora/RHEL ecosystem part of it is very commonly partially defined by %dist macro which comes from rpm config. So the package id would change depending on which distro it was built on, even if it otherwise did not change. ID as a build-time digest of the unparsed spec (and all the associated sources) would avoid those issues...
> 
>>> The simplest way to mark the positive/inclusive IDENTITY tag set is to
>>> change the awk script that generates the tag table to add a marker
>>> (like the array return marker) to identify which tags to include.
>>> The members (and ordering) of the IDENTITY tag set might also need to
>>> be configurable without recompiling.
>> May be it would be easier to add marker that tells that marked tag
>> should be filtered from IDENTITY by calculation? But what is best way to
>> do that? Should I add a new field to struct headerTagTableEntry_s? Or I
>> miss something?
> 
> AIUI that's exactly what Jeff was suggesting. But adding a new field to headerTagTableEntry_s would require also adding an API to access it which seems a bit weird I think. An alternative would be constructing the identityFilter() function into a .C file with eg an awk script that's then included into tagexts.c.
> 

Um -- unless you have changed something (or perhaps I have since way back when ;-) -- there was a uint32 with (iirc) (1 << 17) masked on to indicate arrays and a simple getter for the uint32 value. There are/were 14 more unused bits.

(aside)
Expect an RFE issue soon to add a RPM_MAP_TYPE based on arbitrary tagno's in 0x40000000 -> 0x7ffffffff with a uint32 arbitrary tagno generated from a mapping array canonical plaintext string like, say, "Description(en_US)" instead of the (linear search!) RPM_I18NSTRING_TYPE bogosity. A "map" within headers is (of course) far more generally useful.

But RPM_MAP_TYPE support would need a flag bit like arrays, and IDENTITY inclusion.

hth

73 de Jeff

>    - Panu -
> 
>>> But overall, dynamically generating the IDENTITY tag set withe a tag
>>> extension can be deployed (and back ported and maintained) far more
>>> easily than changing header code.
>>> 
>>> Nice job!
> 


More information about the Rpm-maint mailing list