[Rpm-ecosystem] Using librpm directly to parse RPM databases

Fri Mar 26 10:18:24 UTC 2021

On 3/25/21 5:28 PM, Richard W.M. Jones wrote:
> Libguestfs is a library for inspecting disk images.  It has long
> offered a way to find out what packages are installed in a Linux
> virtual machine by inspection of the disk image.  For guests
> identified as RPM-based it currently does this by some hokey parsing
> of /var/lib/rpm/Name etc.  Since RPM moved away from BDB to sqlite
> this has obviously stopped working, and now I'm trying to fix this.
> 
> An important thing here is that we still need to be able to parse out
> the RPM databases from very old guests (eg. back to RHEL 5 / 2007 era
> would be a good starting point).
> 
> The main operations of interest are:
> 
>   - List all installed RPMs.
> 
>   - Get some of the standard fields of an installed RPM, eg. name,
>     version, arch, URL, summary, description.
> 
> Based on playing with the programs here:
> https://blog.fpmurphy.com/2011/08/programmatically-retrieve-rpm-package-details.html
> I think we can probably do this using librpm directly.  My questions
> below are about this option, but if using librpm directly is _not_
> advisable for some reason then that would be good to know too.
> 
> * How stable is the librpm API?  If we switched to this method then
> we'd probably be using it for years.  The programs above date from
> 2011 and still compile and work even with sqlite, which is
> encouraging.

The main parts of the librpm API are rather stable; the relevant APIs
for this sort of use haven't really changed since 2008 (rpm 4.6.0,
which was something of a watershed release).

Note that the examples in the above link are doing things the hard way - 
if all you want from rpmdb is formatted strings then you'll want to use 
headerFormat() (headerSprintf() prior to 4.6) which is the API 
equivalent of rpm --queryformat and far nicer and simpler for that purpose.
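
To make the query-format approach concrete, here is a minimal Python
sketch (Python only so that the parsing half can be tried without
librpm installed). It builds a format string of the kind that
rpm --queryformat takes, and splits the resulting text back into one
record per package. The field list, the separator characters and the
function names are assumptions made up for this example; rpm does not
prescribe any of them.

```python
# Illustrative only: FIELDS, the separators and both helpers are
# hypothetical names chosen for this sketch.
FIELDS = ("NAME", "VERSION", "RELEASE", "ARCH", "URL", "SUMMARY", "DESCRIPTION")

# Descriptions can contain newlines and tabs, so use ASCII control
# characters that are unlikely to occur in package metadata.
FIELD_SEP = "\x1f"   # unit separator, between fields
RECORD_SEP = "\x1e"  # record separator, after each package


def build_queryformat(fields=FIELDS):
    """Return a query format string such as '%{NAME}<US>%{ARCH}<RS>'."""
    return FIELD_SEP.join("%{" + name + "}" for name in fields) + RECORD_SEP


def parse_query_output(text, fields=FIELDS):
    """Split formatted output back into a list of dicts, one per package."""
    packages = []
    for record in text.split(RECORD_SEP):
        if record == "":
            continue  # empty chunk after the final record separator
        values = record.split(FIELD_SEP)
        if len(values) != len(fields):
            continue  # malformed record (possibly a damaged database): skip
        # rpm prints "(none)" for a tag the package does not carry.
        cleaned = [None if v == "(none)" else v for v in values]
        packages.append(dict(zip((f.lower() for f in fields), cleaned)))
    return packages
```

The same kind of format string would be what gets handed to
headerFormat() when going through the library instead of the command
line.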

> 
> * Will downstream versions of librpm maintain the ability to at least
> read BDB databases forever?  Not interested in writing.

Forever is such a long time :D

Upstream will need to support reading BDB databases as long as there are 
supported releases using BDB in the wild. For a ballpark figure, I would 
say the read-only support will remain at least for the lifetime of 
RHEL-8 and then some. Which puts us somewhere beyond 2030 - too far out
for further predictions.

Downstream support is a distro matter: the bdb-ro support is strictly
optional in the rpm build. As for Fedora, I'd estimate the same as
upstream, but I can't speak for others.

> * Are there security implications to reading an RPM database, ie.  if
> the database has been corrupted, perhaps deliberately, do we need to
> confine or time-box the librpm process?  (We propose to confine it to
> a VM so this is more about DoS attacks.)

I don't know about security implications, but confining librpm access to
a process of its own is not a bad idea at all, even if only to protect
*yourself* from librpm side-effects such as the signal delivery hijack
while the rpmdb is open. The other thing is that with native BDB,
access by a process with write permission to the db directory can have
side-effects on the database environment. To guard against that, you
might want to run it as some sort of nobody user.
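
As a rough illustration of that kind of confinement, the sketch below
runs a query command in a separate child process, gives it a time
limit, and drops to an unprivileged account when started as root. It
is Python for brevity; run_confined(), the 30 second default and the
"nobody" account name are assumptions for this example, and the
user= argument of subprocess.run() needs Python 3.9 or later.

```python
import os
import subprocess
import sys


def run_confined(argv, timeout_s=30.0, drop_to_user="nobody"):
    """Run argv in a child process with a time limit.

    Returns (returncode, stdout), or None if the child exceeded the
    time limit (subprocess.run() kills it in that case).
    """
    kwargs = {}
    # Changing user only works when we are root; otherwise run as-is.
    if drop_to_user is not None and os.geteuid() == 0:
        kwargs["user"] = drop_to_user  # Python 3.9+
    try:
        proc = subprocess.run(
            argv,
            stdin=subprocess.DEVNULL,   # the child never reads our stdin
            capture_output=True,
            text=True,
            errors="replace",           # tolerate undecodable bytes in output
            timeout=timeout_s,
            check=False,
            **kwargs,
        )
    except subprocess.TimeoutExpired:
        return None
    return proc.returncode, proc.stdout


if __name__ == "__main__":
    # Stand-in command; a real caller would pass the rpm query here.
    print(run_confined([sys.executable, "-c", "print('ok')"], drop_to_user=None))
```

Any signal handling or BDB environment side-effects then stay inside
the child, and a query that hangs on a damaged database is cut off by
the timeout rather than blocking the caller.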

> A question about the API:
> 
> * The example programs call rpmReadConfigFiles().  I wonder if we
> should _not_ do that because of security or other considerations?

Most of rpm will simply not work at all without calling 
rpmReadConfigFiles(), so it's not really optional.

Note that I'm quite sure I don't fully understand the libguestfs
use-case. If you can link to librpm and isolate it in its own process,
couldn't you just exec the actual rpm binary for queries? The command
line query interface has been essentially compatible since the
beginning of time.
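
For what it's worth, the command line for such a query could be put
together along these lines (a Python sketch; rpm_query_argv() and the
example path are hypothetical, while --dbpath, --query, --all and
--queryformat are existing rpm options):

```python
import os


def rpm_query_argv(dbpath, queryformat, rpm_binary="rpm"):
    """Build the argv for listing every package in the rpmdb at dbpath.

    dbpath is the directory holding the database, e.g. the guest's
    var/lib/rpm as seen from wherever the query runs. It is made
    absolute here as a precaution.
    """
    return [
        rpm_binary,
        "--dbpath", os.path.abspath(dbpath),
        "--query", "--all",
        "--queryformat", queryformat,
    ]
```

Passing the arguments as a list rather than a shell string avoids any
quoting problems with the format string.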

	- Panu -

> 
> Rich.
>