[Rpm-maint] Rpm Database musings
Panu Matilainen
pmatilai at laiskiainen.org
Fri Mar 8 19:21:33 UTC 2013
On 03/08/2013 04:37 PM, Michael Schroeder wrote:
> On Thu, Mar 07, 2013 at 10:28:41PM +0200, Panu Matilainen wrote:
>> Right now I'm more interested in what the overall design of this all might
>> look like. Like said, I'd like to see the cache be a "read-only media" so
>> there are zero locking needed for queries that only need data from the
>> cache. It'll undoubtedly penalize writers (ie transactions) as the entire
>> cache probably needs to be regenerated even if just one package is
>> installed/removed, but then we're not in the "millions of transactions per
>> second" database business at all, in rpm's case painless (say, without
>> having a library steal your signals and blow up in all directions if you
>> miss a single iterator free, etc) and fast reads are what really counts I
>> think.
>
> OTOH Using a simple reader/writer flock doesn't cost much.
Sure, and in any case flock() locks are far more decent than BDB
environment ones in that they get mopped out by the os when the process
goes away without. Entirely avoiding locks for the cache readers might
not be feasible in practise, its just an idea.
>> One possibility for (supposedly :) hassle-free Packages replacement might
>> be just storing the headers as individual files (named by their "instance"
>> or sha1 hash or something) in a directory of their own. That'd eliminate
>> the need for complex book-keeping for free slots and stuff which is pretty
>> much required for a database-style single file, plus adding and removing
>> things should be very low-cost. Dunno about looping through them all,
>> compared to eg Packages, but then for the added cost of separate open+close
>> etc for each there are no middle-layers adding cost of their own. I'm sure
>> there are downsides too, such as making the whole thing more exposed and
>> easier for outside abuse than manipulating a BDB database file but dunno if
>> it matters in reality - root is required to mess with it, and root can
>> already rm -f /var/lib/rpm/Packages or replace it with whatever and so on.
>
> I kind of like to have all the data in one file.
It has its advantages of course. Having headers spread in different
files would probably make some things easier but also slower, so you'd
really want to avoid having to go to the headers. I did a quick
test-case in python yesterday: reading through all the ~2160 headers in
my rpmdb with the current libdb implementation (with no signature
checking) takes about 0.11s, loading them from separate files takes
about 0.15s. Small numbers but in percentages thats quite a lot.
> Anyway, attached is a little Packages database implementation I did yesterday
> and today. The code is very careful not to destroy things if the database
> is corrupt, i.e. it makes sure that it does not overwrite data.
Wow, that didn't take long. One might get the idea that you're even more
eager to get rid of BDB than I am :D Can't blame you for that...
> The basic design is like this:
>
> First there's a little header to hold the generation count needed for
> the locking.
>
> Then comes the slot space. Each slot consists of:
> - a magic word
> - the pkgidx of the header
> - an offset
> - a length
> Empty slots have offset=length=pkgidx=0.
>
> Then come the data blobs containing the headers. Each blob consists of:
> - a start magic word
> - the pkgidx of the header
> - a timestamp when the header was writtenm into the database
> - the length of the header in bytes
> - the header data
> - the md5sum over the data starting with the start magic word
> - the length of the header in bytes again
> - an end magic word
>
> It's designed so that you can easily recover headers from a corrupt
> database.
I've only had a cursory look over the code so far but I could certainly
live with something like this. Heck, that's less lines than the BDB
support code in lib/backend/ alone, never mind all the BDB stuff that
"leaks" into lib/rpmdb.c. And bringing our data under our own control. Etc.
Some random notes and ideas (I'm likely missing various details so
pardon me if something is already there):
The generation counter alone provides something we dont have with BDB
(apart from trusting file timestamps): a simple way to tell whether
rpmdb was changed since we last looked at it. Ditto for number of
packages (I guess pkgdb->nslots equals that, and if not should be
trivial to add anyway), without trashing through the entire db just to
come up with an integer. Both rather common needs, there are probably
many more similar things that can be far easier achieved with a custom
format tailored to rpm's needs.
We could perhaps take some advantage of knowing the way how rpm does
transactions: erases always come after installs, so on upgrades there
are never free slots originating from the same transaction. So we could
just do lazy deletion: just flag the removed headers for erasure but
dont actually bother deleting and zeroing them, the next transaction
that occurs will do that. Should reduce the amount of data needing
fdatasync() as well.
Kinda related to the above: I dont see the header timestamp being
actually used for anything (but then I might've missed something). The
timestamp might work as a very simple transaction kinda thingie: allow
readers to only see headers that have older timestamp than the time of
opening a db handle. Or probably a more central timestamp thats only
updated at end of a transaction so other readers dont see the
transaction-in-progress changes. With lazy deletion (a timestamp maybe)
readers could continue to operate without having to care a whole lot if
a transaction occurs while they're readin.
> Performance seems to be not bad, most of the time seems to be
> spend in qsort and md5 calculation. The qsort part can easily be
> removed, we may want to switch to something faster that the
> md5sum, though, for example an adler32 checksum. Or maybe
> have both adler32 and md5sum and just check the adler32 for
> reading and use the md5sum for database recovery.
MD5 and other cryptographic hashes are probably an overkill here,
especially considering the upper layers perform further checking (SHA1
or signature) for the retrieved headers. Something like adler32 sounds
fine, and considering we already link to zlib anyway we could just use
the implementation from there (it is exported by zlib).
> BUT: what's killing it is the fdatasync. Without any syncing
> and cache dropping my little test program reports the following
> numbers (2099 headers):
>
> writing took 1050 ms
> reading took 375 ms
> upgrade took 5157 ms
>
> When I enable cache dropping I get:
>
> writing took 1288 ms
> reading took 3702 ms
> upgrade took 9283 ms
>
> As you can see the reading performance suffers because all headers
> have to be read from my slow disk.
>
> LZO helps here, without LZO I get:
> writing took 1418 ms
> reading took 6119 ms
> upgrade took 13203 ms
>
> When I enable fdatasync, it gets much worse (LZO is used):
>
> writing took 83794 ms
> reading took 3712 ms
> upgrade took 158699 ms
>
> Uh oh, 83 seconds for writing? That's 40 ms for one single header,
> which maybe is acceptable. It's still a bit painful, though.
Yup, fdatasync() sucks, but then it sucks with BDB too. I dont remember
the details off-hand but a ridiculous proportion of an average rpm
(upgrade) transaction is spent in fdatasync(). Florian might remember
better, at least I remember him measuring and cursing, measuring and
cursing it :)
For use in rpm, we'd probably want to use the rpmio FD_t stuff for this.
I actually spent the beginning of the week cleaning the cruft up in
preparation (to protect my own sanity :) to adding LZO as a supported
compression format there. And since FD_t supports calculating digests on
the fly too, adding adler32 as one possibility shouldn't be hard. How
much of a win (or not) these things are I dunno, just thinking out loud...
I'll have a closer look + try to test it next week, but big thanks for
taking the initiative here!
- Panu -
More information about the Rpm-maint
mailing list