[Rpm-maint] Rpm Database musings

Panu Matilainen pmatilai at laiskiainen.org
Thu Mar 7 20:28:41 UTC 2013


On 03/05/2013 08:11 PM, Michael Schroeder wrote:
> On Mon, Mar 04, 2013 at 12:22:31PM +0100, Michael Schroeder wrote:
>> For 2000 packages we have about... ugh, that's actually hard
>> to tell as the avg and the median differ that much. Let's
>> use the average: 2000 * 130 = 260000 files.
>>
>> I would hash them using just a 32-bit number for each hash entry,
>> the number would be a combination of the hash value and a header
>> number. Thus:
>>
>> 260000  ->  2^19 = 524288 hash entries (49% fill size)
>> hash(basename) = xxxx...xxxx yyyy....yyyy
>>                   --13 bits-- -- 19 bits--
>>
>> Thus we store 13 bits of the hash value together with the header
>> number (we may need a header number -> header db idx array to keep
>> the header number small, but that just needs 2000 * 4 bytes).
>>
>> I would not store the real string on disk, in most cases we
>> need to retrieve the real header anyway, our db function
>> can simply fetch the header and make sure that it's not a false
>> positive. Chances for that are pretty slim anyway, as we have the
>> extra 13 bits in the entry.
>
> Hmm, turns out that it works only well if we have a really good
> hash function. The simple hashing I do in libsolv can't be used
> as too many files map to the same hash value. I tried the
> popular "MurMurHash", it seems to do a good job. Disadvantages
> of the scheme are the excess header fetches and there's also no
> fast way to get all the keys from the hash (needed for rpm's new
> IndexIterator).
>
>
> If we don't want to access excess headers, we need to put the
> strings into the hash. Here's the data for my machine:
>
> filelist entries: 238032
> -> hash entries: 524288
> -> hash memory used: 4 * 524288 = 2 Mbyte
>
> unique basenames: 207686
> basename stringspace needed: 3355117 bytes = 3.2 MByte
>
> This assumes that we can somehow address the string from the
> 32bit hash value. I have 2099 package installed, this needs
> 12 bits. That leaves us with 20 bits for the string offset, which
> is not enough (2^20 = 1048576). We might put the strings at
> a somewhat aligned offsets, thus we would need max 3 excess bytes
> for aligning, making the needed space 3355117 + 3 * 207686 =
> 3.8 MBytes, which is smalled than 1048576 * 4.
>
> It's probably easier to just use 2 * 32bit for the hashing, i.e.
> the needed size is 2 * 2 MByte hash + 3.2 MByte strings.

I wouldn't worry too much about hash algorithms and storage optimization 
at this point: that's something that can be tweaked and tuned over time 
as long as the cache structure is internally versioned so we know when 
we need to rebuild it.

Right now I'm more interested in what the overall design of this all 
might look like. Like said, I'd like to see the cache be a "read-only 
media" so there are zero locking needed for queries that only need data 
from the cache. It'll undoubtedly penalize writers (ie transactions) as 
the entire cache probably needs to be regenerated even if just one 
package is installed/removed, but then we're not in the "millions of 
transactions per second" database business at all, in rpm's case 
painless (say, without having a library steal your signals and blow up 
in all directions if you miss a single iterator free, etc) and fast 
reads are what really counts I think.

One possibility for (supposedly :) hassle-free Packages replacement 
might be just storing the headers as individual files (named by their 
"instance" or sha1 hash or something) in a directory of their own. 
That'd eliminate the need for complex book-keeping for free slots and 
stuff which is pretty much required for a database-style single file, 
plus adding and removing things should be very low-cost. Dunno about 
looping through them all, compared to eg Packages, but then for the 
added cost of separate open+close etc for each there are no 
middle-layers adding cost of their own. I'm sure there are downsides 
too, such as making the whole thing more exposed and easier for outside 
abuse than manipulating a BDB database file but dunno if it matters in 
reality - root is required to mess with it, and root can already rm -f 
/var/lib/rpm/Packages or replace it with whatever and so on.

	- Panu -


More information about the Rpm-maint mailing list