[Rpm-maint] Rpm Database musings

Panu Matilainen pmatilai at laiskiainen.org
Fri Mar 8 19:21:33 UTC 2013


On 03/08/2013 04:37 PM, Michael Schroeder wrote:
> On Thu, Mar 07, 2013 at 10:28:41PM +0200, Panu Matilainen wrote:
>> Right now I'm more interested in what the overall design of this all might
>> look like. Like I said, I'd like to see the cache be "read-only media" so
>> that no locking is needed for queries that only need data from the
>> cache. It'll undoubtedly penalize writers (ie transactions) as the entire
>> cache probably needs to be regenerated even if just one package is
>> installed/removed, but then we're not in the "millions of transactions per
>> second" database business at all. In rpm's case, painless (say, without
>> having a library steal your signals and blow up in all directions if you
>> miss a single iterator free, etc) and fast reads are what really count, I
>> think.
>
> OTOH, using a simple reader/writer flock doesn't cost much.

Sure, and in any case flock() locks are far more decent than BDB 
environment ones in that they get mopped up by the OS when the process 
goes away without cleaning up after itself. Entirely avoiding locks for 
the cache readers might not be feasible in practice, it's just an idea.
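
Something like this would be about all the locking code needed, as a 
rough sketch (pkgdb_lock() and the lock file are placeholders, not from 
any actual code):

    #include <sys/file.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Shared lock for readers, exclusive for writers. Unlike a BDB
     * environment, the kernel releases the lock automatically when the
     * process exits, so nothing stale is left behind. */
    static int pkgdb_lock(const char *lockpath, int for_write)
    {
        int fd = open(lockpath, O_RDONLY);
        if (fd < 0)
            return -1;
        if (flock(fd, for_write ? LOCK_EX : LOCK_SH) != 0) {
            close(fd);
            return -1;
        }
        return fd;    /* closing the fd releases the lock */
    }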

>> One possibility for (supposedly :) hassle-free Packages replacement might
>> be just storing the headers as individual files (named by their "instance"
>> or sha1 hash or something) in a directory of their own. That'd eliminate
>> the need for complex book-keeping for free slots and stuff which is pretty
>> much required for a database-style single file, plus adding and removing
>> things should be very low-cost. Dunno about looping through them all,
>> compared to eg Packages, but then for the added cost of separate open+close
>> etc for each there are no middle-layers adding cost of their own. I'm sure
>> there are downsides too, such as making the whole thing more exposed and
>> easier for outside abuse than manipulating a BDB database file but dunno if
>> it matters in reality - root is required to mess with it, and root can
>> already rm -f /var/lib/rpm/Packages or replace it with whatever and so on.
>
> I kind of like to have all the data in one file.

It has its advantages of course. Having headers spread across separate 
files would probably make some things easier but also slower, so you'd 
really want to avoid having to go to the headers at all. I did a quick 
test case in Python yesterday: reading through all the ~2160 headers in 
my rpmdb with the current libdb implementation (with no signature 
checking) takes about 0.11s, while loading them from separate files 
takes about 0.15s. Small numbers, but in percentage terms that's quite a 
lot.
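
The separate-files side of that test is essentially just a loop over 
something like the following, here sketched in C with librpm's header 
I/O rather than Python (assumes each file was written with 
headerWrite() and HEADER_MAGIC_YES; error handling trimmed):

    #include <rpm/rpmio.h>
    #include <rpm/header.h>

    /* Load one header stored as a plain file, no signature checking. */
    static Header read_header_file(const char *path)
    {
        Header h = NULL;
        FD_t fd = Fopen(path, "r.ufdio");
        if (fd != NULL) {
            if (!Ferror(fd))
                h = headerRead(fd, HEADER_MAGIC_YES);
            Fclose(fd);
        }
        return h;
    }

The extra ~0.04s is presumably mostly the per-file open+close overhead.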

> Anyway, attached is a little Packages database implementation I did yesterday
> and today. The code is very careful not to destroy things if the database
> is corrupt, i.e. it makes sure that it does not overwrite data.

Wow, that didn't take long. One might get the idea that you're even more 
eager to get rid of BDB than I am :D Can't blame you for that...

> The basic design is like this:
>
> First there's a little header to hold the generation count needed for
> the locking.
>
> Then comes the slot space. Each slot consists of:
> - a magic word
> - the pkgidx of the header
> - an offset
> - a length
> Empty slots have offset=length=pkgidx=0.
>
> Then come the data blobs containing the headers. Each blob consists of:
> - a start magic word
> - the pkgidx of the header
> - a timestamp when the header was written into the database
> - the length of the header in bytes
> - the header data
> - the md5sum over the data starting with the start magic word
> - the length of the header in bytes again
> - an end magic word
>
> It's designed so that you can easily recover headers from a corrupt
> database.

I've only had a cursory look over the code so far, but I could certainly 
live with something like this. Heck, that's fewer lines than the BDB 
support code in lib/backend/ alone, never mind all the BDB stuff that 
"leaks" into lib/rpmdb.c. And it brings our data under our own control. Etc.

Some random notes and ideas (I'm likely missing various details so 
pardon me if something is already there):

The generation counter alone provides something we don't have with BDB 
(apart from trusting file timestamps): a simple way to tell whether the 
rpmdb has changed since we last looked at it. Ditto for the number of 
packages (I guess pkgdb->nslots equals that, and if not it should be 
trivial to add anyway), without thrashing through the entire db just to 
come up with an integer. Both are rather common needs, and there are 
probably many more similar things that can be achieved far more easily 
with a custom format tailored to rpm's needs.
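
Just to illustrate, with something like this in the little db header a 
client never needs to open a single package header to answer either 
question (completely hypothetical names, nothing of this exists yet):

    #include <stdint.h>

    /* Hypothetical: the db header already carries enough to answer
     * "did the rpmdb change?" and "how many packages?" cheaply. */
    struct pkgdb_info {
        uint32_t generation;   /* bumped by every write transaction */
        uint32_t nslots;       /* package count (or a separate counter
                                  if nslots includes free slots) */
    };

    static int pkgdb_changed(const struct pkgdb_info *now,
                             uint32_t gen_at_open)
    {
        return now->generation != gen_at_open;
    }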

We could perhaps take some advantage of knowing how rpm does 
transactions: erases always come after installs, so on upgrades there 
are never free slots originating from the same transaction. So we could 
just do lazy deletion: flag the removed headers for erasure but don't 
actually bother deleting and zeroing them; the next transaction that 
comes along will do that. That should reduce the amount of data needing 
fdatasync() as well.
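
Roughly like this, purely as an illustration (the flags word is 
invented, the current slot layout doesn't have one, and a timestamp 
would do just as well, as below):

    #include <stdint.h>

    /* Erase just flips a flag in the slot; the actual zeroing of slot
     * and blob is left for the next transaction to do before it writes
     * its own data. */
    struct lazy_slot {
        uint32_t magic, pkgidx, offset, length;
        uint32_t flags;
    };
    #define SLOT_ERASED 0x1

    static void slot_lazy_erase(struct lazy_slot *s)
    {
        s->flags |= SLOT_ERASED;        /* one tiny write to sync */
    }

    static void slot_reclaim(struct lazy_slot *s)
    {
        if (s->flags & SLOT_ERASED) {
            s->pkgidx = s->offset = s->length = 0;
            s->flags = 0;
        }
    }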

Kinda related to the above: I don't see the header timestamp actually 
being used for anything (but then I might've missed something). The 
timestamp might work as a very simple transaction mechanism of sorts: 
allow readers to only see headers whose timestamp is older than the time 
the db handle was opened. Or probably a more central timestamp that's 
only updated at the end of a transaction, so other readers don't see the 
transaction-in-progress changes. With lazy deletion (marked with a 
timestamp, maybe) readers could continue to operate without having to 
care a whole lot whether a transaction occurs while they're reading.
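
The visibility rule readers would apply is about this much code (again 
just illustrating the idea; the erase timestamp here is the 
lazy-deletion mark from above, and the names are made up):

    #include <stdint.h>

    /* A reader remembers the time it opened its db handle and ignores
     * anything added or erased after that, so a transaction committing
     * underneath it stays invisible. */
    static int header_visible(uint32_t added, uint32_t erased,
                              uint32_t handle_open_time)
    {
        return added <= handle_open_time &&
               (erased == 0 || erased > handle_open_time);
    }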

> Performance seems to be not bad, most of the time seems to be
> spent in qsort and md5 calculation. The qsort part can easily be
> removed; we may want to switch to something faster than the
> md5sum, though, for example an adler32 checksum. Or maybe
> have both adler32 and md5sum and just check the adler32 for
> reading and use the md5sum for database recovery.

MD5 and other cryptographic hashes are probably overkill here, 
especially considering the upper layers perform further checking (SHA1 
or signature) on the retrieved headers. Something like adler32 sounds 
fine, and since we already link to zlib anyway we could just use the 
implementation from there (it is exported by zlib).
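
Using it is about as simple as checksumming gets (blob_checksum() is 
just a made-up wrapper name):

    #include <stddef.h>
    #include <zlib.h>

    /* Adler-32 over a header blob, using the checksum zlib exports. */
    static unsigned long blob_checksum(const unsigned char *buf, size_t len)
    {
        unsigned long sum = adler32(0L, Z_NULL, 0);  /* canonical seed */
        return adler32(sum, buf, (uInt)len);
    }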

> BUT: what's killing it is the fdatasync. Without any syncing
> and cache dropping my little test program reports the following
> numbers (2099 headers):
>
>      writing took 1050 ms
>      reading took 375 ms
>      upgrade took 5157 ms
>
> When I enable cache dropping I get:
>
>      writing took 1288 ms
>      reading took 3702 ms
>      upgrade took 9283 ms
>
> As you can see the reading performance suffers because all headers
> have to be read from my slow disk.
>
> LZO helps here; without LZO I get:
>      writing took 1418 ms
>      reading took 6119 ms
>      upgrade took 13203 ms
>
> When I enable fdatasync, it gets much worse (LZO is used):
>
>      writing took 83794 ms
>      reading took 3712 ms
>      upgrade took 158699 ms
>
> Uh oh, 83 seconds for writing? That's 40 ms for one single header,
> which maybe is acceptable. It's still a bit painful, though.

Yup, fdatasync() sucks, but then it sucks with BDB too. I don't remember 
the details offhand, but a ridiculous proportion of an average rpm 
(upgrade) transaction is spent in fdatasync(). Florian might remember 
better; at least I remember him measuring and cursing, measuring and 
cursing it :)

For use in rpm we'd probably want to use the rpmio FD_t stuff for this. 
I actually spent the beginning of the week cleaning up the cruft there 
in preparation (to protect my own sanity :) for adding LZO as a 
supported compression format. And since FD_t supports calculating 
digests on the fly too, adding adler32 as one possibility shouldn't be 
hard. How much of a win (or not) these things are I dunno, just thinking 
out loud...

I'll have a closer look + try to test it next week, but big thanks for 
taking the initiative here!

	- Panu -

