[Rpm-maint] Rpm Database musings
Panu Matilainen
pmatilai at laiskiainen.org
Sun Mar 3 15:46:10 UTC 2013
On 03/01/2013 06:32 PM, Michael Schroeder wrote:
>
> Hi Panu et al,
>
> here are some numbers/musings about changing the database
> implementation to just one single packages file:
>
> - I assume that we still want to store all the headers (in some
> format) anyway.
Nod, I think the headers need to stay, the exact format is another, open
question.
>
> - I checked all the headers of the i586/noarch packages from FC18
> to get some understanding how big they are and if it makes
> sense to compress them. Here's the result:
>
> scanned: 28423 rpms
> uncompressed: sum: 777290960, avg: 27348, median: 10600
> lzo: sum: 305711769, avg: 10756, median: 4805
> gzip: sum: 255995670, avg: 9007, median: 4154
> xz: sum: 215564872, avg: 7585, median: 3728
>
> (the median is quite different from the avg, that means that
> some packages are quite big.)
>
> As you can see, compression about halves the size of the headers.
> LZO seems to be "good enough" and has the advantage that it's
> really fast.
>
> - That means, if I have 2000 packages installed on my system
> (which is about the real number), the concatenated headers will
> use 20 MByte (using the median), 10 MByte when using LZO
> compression, 7.5 with xz.
>
> - So if we want to drop all index files and just scan the
> packages database, we would need (assuming disk IO throughput
> of 50 MByte/s and the 10 MByte LZO case) about 0.2 seconds to
> create the in-memory index data. Which maybe is too much, I dunno.
Right, in this context compression does indeed seem quite attractive.
When we talked about this at devconf, I was thinking about the way
rpm itself currently keeps (re)loading the headers from Packages, and
adding repeated decompression to the other costs of header loading
didn't seem like a way to make it faster. But for roughly halving the
amount of I/O needed for scanning through it exactly once (which is of
course the way libsolv operates) it's quite a different thing.
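
The back-of-the-envelope numbers quoted above can be reproduced with a few lines of Python. This is only a sketch: the per-format figures are copied from the FC18 scan in the quoted mail, "50 M/s" is read as 50 * 10^6 bytes per second (an assumption), and the estimate covers raw read time only, not decompression or parsing cost.

```python
# Figures copied from the quoted FC18 header scan (all in bytes).
MEDIAN_HEADER_BYTES = {
    "uncompressed": 10600,
    "lzo": 4805,
    "gzip": 4154,
    "xz": 3728,
}
TOTAL_HEADER_BYTES = {
    "uncompressed": 777290960,
    "lzo": 305711769,
    "gzip": 255995670,
    "xz": 215564872,
}

INSTALLED_PACKAGES = 2000        # package count used in the quoted mail
DISK_BYTES_PER_SEC = 50 * 10**6  # assumption: "50 M/s" means 50e6 bytes/s


def estimate(fmt):
    """Return (total_bytes, scan_seconds, size_ratio) for one format.

    total_bytes uses the median header size, as the quoted mail does;
    size_ratio compares the summed sizes against the uncompressed sum.
    """
    total = INSTALLED_PACKAGES * MEDIAN_HEADER_BYTES[fmt]
    seconds = total / DISK_BYTES_PER_SEC
    ratio = TOTAL_HEADER_BYTES[fmt] / TOTAL_HEADER_BYTES["uncompressed"]
    return total, seconds, ratio


if __name__ == "__main__":
    for fmt in MEDIAN_HEADER_BYTES:
        total, seconds, ratio = estimate(fmt)
        print(f"{fmt:>12}: {total / 1e6:5.1f} MB, "
              f"{seconds:.2f} s, {ratio:.0%} of uncompressed")
```

With these inputs the LZO case comes out just under 10 MB and about 0.19 s of pure read time, which matches the "about .2 seconds" figure; the uncompressed case is roughly twice that.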
0.2s is not a whole lot, for many operations absolutely nothing really,
but I'd think some kind of cache would be in order to avoid having to
read through all of Packages just for those simple 'rpm -qf /foo' kind
of queries. For example, store the in-memory index structures in a
memory-mapped cache file. The cache could perhaps be write-once and
read-only for other uses, so there's no need for locking within the
cache: e.g. recreate it from scratch at the end of each transaction and
atomically replace the old one, so the cache itself is always coherent.
Or something... this isn't that far from libsolv's .solv files.
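
A minimal sketch of the write-once, atomically replaced cache idea, in Python rather than rpm's C and not based on any existing rpm code. The function names, the JSON encoding and the "file path to package names" index shape are all hypothetical, chosen only to keep the example short; a real cache would presumably use a binary layout that can be used directly from the mapping.

```python
import json
import mmap
import os
import tempfile


def write_cache(path, index):
    """Rebuild the cache from scratch and atomically swap it into place.

    The new content goes to a temporary file in the same directory, so
    the final rename stays on one filesystem. Readers see either the
    old file or the new one, never a partial write.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".cache-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(json.dumps(index).encode("utf-8"))
            f.flush()
            os.fsync(f.fileno())  # make sure the data is on disk first
        os.replace(tmp, path)     # atomic rename over the old cache
    except BaseException:
        os.unlink(tmp)            # do not leave a stale temp file behind
        raise


def read_cache(path):
    """Map the cache read-only and decode it; no locking is needed."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return json.loads(m[:])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        cache = os.path.join(d, "index.cache")
        # Hypothetical index: file path -> owning package names.
        write_cache(cache, {"/foo": ["pkg-a"]})
        print(read_cache(cache)["/foo"])
```

A reader that already has the old file mapped keeps a coherent view of the old data until it reopens the cache, which is the property the write-once scheme relies on.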
Speaking of which... a funny little idea I got at the end of
devconf: regardless of future rpmdb format changes, it should now be
possible to write an rpm plugin that creates + updates a .solv file for
the rpmdb, so you should never have to actually read through the entire
rpmdb in libsolv and its users like libzypp, dnf etc.
- Panu -