[Rpm-maint] File fingerprinting

Florian Festi ffesti at redhat.com
Mon Apr 28 14:39:43 UTC 2008


Hi!

I had some look into the file fingerprinting code. The purpose of this 
code is to find file conflicts. The main problem it has to solve is the 
fact that a file can be visible at different place in the directory tree 
if there are symlinks pointing to any of the parent directories. To 
solve this it does check for identical fingerprints of all files with 
the same basename.

A finger print consists of the device and inode number of the parent 
directory and if this directory doesn't exist yet the finger print of 
that down most directory that does exist and the path from that 
directory to the file.

Finger prints are used for two things:

1. Check all files in the transaction against each other
 To do that the finger prints of all files in the transaction are put 
into a big hash table.
2. Check the files in the transaction against those in the rpmdb
 For each pkg:
  For each file:
   search the db for files with the same basename and create the 
fingerprints for them

Finger printing actually doesn't work as soon as there are some symlinks 
involved that are not installed yet. I attached 3 spec files that show that.
Install the README rpm that then install FOO and FOO-DOC at once. As you 
can see /usr/share/README-1/README gets over writen without warning.
If README and FOO are already installed installing FOO-DOC raises an 
error as expected.

The packages contain:
README: /usr/share/README-1/README
FOO: /usr/share/FOO-1 -> README-1
FOO-DOC: /usr/share/FOO-1/README

The 2. part is also quite inefficient as it checks the same basename 
over and over again for each pkg containing a file with the same 
basename (Good candidates are README, Copying, ...). This leads to a 
kind of quadratic runtime behavior (installed files * to be installed 
files with the same basename). To get rid of this the file finger prints 
need to be checked for the whole transaction and not only per pkg. This 
requires changing the data structures involved (pointing back from the 
installed files to the "to be installed files") and reversing the way 
the files are traversed.

This will require that we keep all matching files in memory at once. As 
they can be represented (pkg num, file num) + (*pkg, file num) of the to 
be installed pkgs this should be possible in a few dozen bytes per file 
which means a few MB for some 100,000 files.

Florian

-------------- next part --------------
A non-text attachment was scrubbed...
Name: README.spec
Type: text/x-rpm-spec
Size: 689 bytes
Desc: not available
Url : http://lists.rpm.org/pipermail/rpm-maint/attachments/20080428/ad79b17b/attachment-0003.bin 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: FOO.spec
Type: text/x-rpm-spec
Size: 508 bytes
Desc: not available
Url : http://lists.rpm.org/pipermail/rpm-maint/attachments/20080428/ad79b17b/attachment-0004.bin 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: FOO-DOC.spec
Type: text/x-rpm-spec
Size: 565 bytes
Desc: not available
Url : http://lists.rpm.org/pipermail/rpm-maint/attachments/20080428/ad79b17b/attachment-0005.bin 


More information about the Rpm-maint mailing list