[Rpm-maint] Rpm Database musings

Michael Schroeder mls at suse.de
Mon Mar 4 11:22:31 UTC 2013


More numbers ahead:

rpm's scanned: 28423

uncompressed:
-------------
unc:   sum: 777290960, avg: 27348, median: 10600
clunc: sum: 168381220, avg:  5925, median:  2707
flunc: sum: 553276838, avg: 19466, median:  2102
xxunc: sum:  56542438, avg:  1990, median:  1684
bnunc: sum:  69125101,  avg: 2433, median:   176

lzo compressed:
---------------
lzo:   sum: 305711769, avg: 10756, median:  4805
cllzo: sum:  66466108, avg:  2339, median:  1149
fllzo: sum: 215489823, avg:  7582, median:  1214
xxlzo: sum:  30510359, avg:  1074, median:   989
bnlzo: sum:  19241190, avg:   677, median:   131

entry count:
------------
clnum: sum:   1199508, avg:    43, median:    23
flnum: sum:   3673589, avg:   130, median:    11


cl: header just containing the changelog tags
fl: header just containing the filelist tags
xx: header with no changelog/filelist tags
bn: header just containing the basename tag

So assuming 2000 installed packages, the headers without
the filelist/changelog are just about 2 MByte LZO compressed.


On to something completely different, let's speak about hashes,
be it in memory or on disk:

For 2000 packages we have about... ugh, that's actually hard
to tell as the avg and the median differ that much. Let's
use the average: 2000 * 130 = 260000 files.

I would hash them using just a 32-bit number for each hash entry,
the number would be a combination of the hash value and a header
number. Thus:

260000  ->  2^19 = 524288 hash entries (49% fill size)
hash(basename) = xxxx...xxxx yyyy....yyyy
                 --13 bits-- -- 19 bits--

Thus we store 13 bits of the hash value together with the header
number (we may need a header number -> header db idx array to keep
the header number small, but that just needs 2000 * 4 bytes).

I would not store the real string on disk, in most cases we
need to retrieve the real header anyway, our db function
can simply fetch the header and make sure that it's not a false
positive. Chances for that are pretty slim anyway, as we have the
extra 13 bits in the entry. (It may turn out that we need some extra
bits so that we can resize the hash without needing to read all
headers.)
So we need 4 * 524288 = 2 MByte for the Basename hash.

In conclusion, I think it wouldn't be hard to replace BerkeleyDB
with something much more lightweight for the index dbs. The
Packages database needs more work, though.

Cheers,
  Michael.

-- 
Michael Schroeder                                   mls at suse.de
SUSE LINUX Products GmbH,  GF Jeff Hawn, HRB 16746 AG Nuernberg
main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);}


More information about the Rpm-maint mailing list