[Rpm-maint] Rpm Database musings

Sun Mar 3 15:46:10 UTC 2013

On 03/01/2013 06:32 PM, Michael Schroeder wrote:
>
> Hi Panu et al,
>
> here are some numbers/musings about changing the database
> implementation to just one single packages file:
>
> - I assume that we still want to store all the headers (in some
>    format) anyway.

Nod, I think the headers need to stay; the exact format is another, open 
question.

>
> - I checked all the headers of the i586/noarch packages from FC18
>    to get some understanding how big they are and if it makes
>    sense to compress them. Here's the result:
>
>      scanned: 28423 rpms
>      uncompressed: sum: 777290960, avg: 27348, median: 10600
>      lzo:          sum: 305711769, avg: 10756,  median: 4805
>      gzip:         sum: 255995670, avg:  9007,  median: 4154
>      xz:           sum: 215564872, avg:  7585,  median: 3728
>
>    (the median is quite different from the avg, that means that
>    some packages are quite big.)
>
>    As you can see, compression about halves the size of the headers.
>    LZO seems to be "good enough" and has the advantage that it's
>    really fast.
>
> - That means, if I have 2000 packages installed on my system
>    (which is about the real number), the concatenated headers will
>    use 20 MByte (using the median), 10 MByte when using LZO
>    compression, 7.5 with xz.
>
> - So if we want to drop all index files and just scan the
>    packages database, we would need (assuming disk IO throughput
>    of 50 M/s) about .2 seconds to create the in-memory index
>    data. Which maybe is too much, I dunno.

Right, in this context compression does indeed seem quite attractive. 
When we talked about this at the devconf, I was thinking about the way 
rpm itself currently keeps (re)loading the headers from Packages, and 
adding repeated decompression to the other costs of header loading 
didn't seem like a way to make it faster. But for roughly halving the 
amount of io needed for scanning through it exactly once (which is of 
course the way libsolv operates) it's quite a different thing.
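
For reference, the back-of-envelope arithmetic behind those numbers can be
reproduced with a few lines of Python. This is only a sketch: the median
header sizes and the 50 MB/s throughput are the figures quoted above, the
function name scan_estimate is made up for illustration, and CPU time for
decompression is ignored entirely.

```python
# Median header sizes in bytes, from the FC18 figures quoted in this thread.
MEDIAN_HEADER_BYTES = {
    "uncompressed": 10600,
    "lzo": 4805,
    "gzip": 4154,
    "xz": 3728,
}


def scan_estimate(n_packages, disk_bytes_per_sec=50_000_000):
    """Return {format: (total_bytes, io_seconds)} for one full sequential scan.

    Assumes every installed package has a median-sized header and that the
    scan is purely I/O bound (an assumption, not a measurement).
    """
    estimate = {}
    for name, median in MEDIAN_HEADER_BYTES.items():
        total = n_packages * median
        estimate[name] = (total, total / disk_bytes_per_sec)
    return estimate


if __name__ == "__main__":
    for name, (total, secs) in scan_estimate(2000).items():
        print(f"{name:>12}: {total / 1e6:5.1f} MB, {secs:.2f} s")
```

With 2000 packages the LZO case comes out at roughly 9.6 MB and just under
0.2 s of io, which is where the 0.2 s figure comes from.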

0.2s is not a whole lot, for many operations absolutely nothing really, 
but I'd think some kind of cache would be in order to avoid having to 
read through all of Packages just for those simple 'rpm -qf /foo' kind 
of queries. For example, the in-memory index structures could be stored 
in a memory-mapped cache file. The cache could perhaps be write-once and 
read-only for other uses so there's no need for locking within the 
cache: e.g. recreate it from scratch at the end of transactions and 
atomically replace the old one so the cache itself is always coherent. 
Or something... this isn't that far from libsolv's .solv files.
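
A minimal sketch of that write-once, atomically replaced cache, in Python
rather than anything rpm-specific. Everything here is an assumption for
illustration: the function names write_cache and read_cache, the JSON
encoding, and the file-to-package index layout are hypothetical, and a real
cache would use a binary layout that can be queried in place instead of
being decoded in full.

```python
import json
import mmap
import os
import tempfile


def write_cache(path, index):
    """Write index to a temp file, then atomically rename it over path.

    Readers see either the complete old cache or the complete new one,
    so no locking is needed inside the cache file itself.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for rename to be atomic.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".cache-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(json.dumps(index).encode("utf-8"))
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic replacement on POSIX filesystems
    except BaseException:
        os.unlink(tmp)
        raise


def read_cache(path):
    """Map the cache read-only and decode it."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return json.loads(m[:])
```

The writer would call write_cache once at the end of a transaction, e.g.
write_cache("index.cache", {"/usr/bin/foo": ["foo-1.0-1"]}), and a query
like 'rpm -qf /foo' would only need read_cache plus a lookup. A reader that
already has the old file mapped keeps seeing the old, still coherent
contents until it reopens the path.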

Speaking of which... a funny little idea I got at the end of the 
devconf: regardless of future rpmdb format changes, it should now be 
possible to write an rpm plugin that creates + updates a .solv file for 
the rpmdb, so you should never have to actually read through the entire 
rpmdb in libsolv and its users like libzypp, dnf etc.

	- Panu -