[Rpm-maint] [rpm-software-management/rpm] RPM with Copy on Write (#1470)

malmond77 notifications at github.com
Tue Dec 29 20:54:14 UTC 2020


This is part of https://fedoraproject.org/wiki/Changes/RPMCoW

The majority of changes are in two new programs:

# rpm2extents

Modelled as a 'stream processor'. It reads a regular .rpm file on stdin,
and produces a modified .rpm file on stdout. The lead, signature and
headers are preserved 1:1 to allow all the normal metadata inspection,
signature verification to work as expected. Only the 'payload' is
modified.

The primary motivation for this tool is to re-organize the payload as a
sequence of raw file extents (hence the name). The files are organized
by their digest identity instead of path/filename. If any digest is
repeated, then the file is skipped/de-duped. Only regular files are
represented. All other entries like directories, symlinks, devices are
fully described in the headers and are omitted.

The files are padded so they start on `sysconf(_SC_PAGESIZE)` boundries
to permit 'reflink' syscalls to work in the `reflink` plugin.

At the end of the file is a footer with 3 sections:

1. List of calculated digests of the input stream. This is used in
   `librepo` because the file *written* is a derivative, and not the
   same as the repo metadata describes. `rpm2extents` takes one or more
   positional arguments that described which digest algorithms are
   desired. This is often just `SHA256`. This program is only measuring
   and recording the digest - it does not express an opinion on whether
   the file is correct. Due to the API on most compression libraries
   directly reading the source file, the whole file digest is measured
   using a subprocess and pipes. I don't love it, but it works.
2. Sorted List of file content digests + offset pairs. This is used in
   the plugin with a trivial binary search to locate the start of file
   content. The size is not needed because it's part of normal headers.
3. (offset of 1., offset of 2., 8 byte MAGIC value) triple

# reflink plugin

Looks for the 8 byte magic value at the end of the rpm file. If present
it alters the `RPMTAG_PAYLOADFORMAT` in memory to `clon`, and reads in
the digest-> offset table.

`rpmPackageFilesInstall()` in `fsm.c` is
modified to alter the enumeration strategy from
`rpmfiNewArchiveReader()` to `rpmfilesIter()` if not `cpio`. This is
needed because there is no cpio to enumerate. In the same function, if
`rpmpluginsCallFsmFilePre()` returns `RPMRC_PLUGIN_CONTENTS` then
`fsmMkfile()` is skipped as it is assumed the plugin did the work.

The majority of the work is in `reflink_fsm_file_pre()` - the per file
hook for RPM plugins. If the file enumerated in
`rpmPackageFilesInstall()` is a regular file, this function will look up
the offset in the digest->offset table and will try to reflink it, then
fall back to a regular copy. If reflinking does work: we will have
reflinked a whole number of pages, so we truncate the file to the
expected size. Therefore installing most files does involve two writes:
the reflink of the full size, then a fork/copy on write for the last
page worth.

If the file passed to `reflink_fsm_file_pre()` is anything other than a
regular file, it return `RPMRC_OK` so the normal mechanics of
`rpmPackageFilesInstall()` are used. That handles directories, symlinks
and other non file types.

# New API for internal use

1. `rpmReadPackageRaw()` is used within `rpm2extents` to read all the
   headers without trying to validate signatures. This eliminates the
   runtime dependency on rpmdb.
2. `rpmteFd()` exposes the Fd behind the rpmte, so plugins can interact
   with the rpm itself.
3. `RPMRC_PLUGIN_CONTENTS` in `rpmRC_e` for use in
   `rpmpluginsCallFsmFilePre()` specifically.
4. `pgpStringVal()` is used to help parse the command line in
   `rpm2extents` - the positional arguments are strings, and this
   converts the values back to the values in the table.

Nothing has been removed, and none of the changes are intended to be
used externally, so I don't think a soname bump is warranted here.
You can view, comment on, or merge this pull request online at:

  https://github.com/rpm-software-management/rpm/pull/1470

-- Commit Summary --

  * RPM with Copy on Write

-- File Changes --

    M Makefile.am (6)
    M lib/depends.c (2)
    M lib/fsm.c (50)
    M lib/package.c (40)
    M lib/rpmlib.h (9)
    M lib/rpmplugins.c (21)
    M lib/rpmte.c (5)
    M lib/rpmte.h (2)
    M lib/rpmtypes.h (3)
    M macros.in (1)
    M plugins/Makefile.am (4)
    A plugins/reflink.c (340)
    A rpm2extents.c (519)
    M rpmio/rpmpgp.c (10)
    M rpmio/rpmpgp.h (9)

-- Patch Links --

https://github.com/rpm-software-management/rpm/pull/1470.patch
https://github.com/rpm-software-management/rpm/pull/1470.diff

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/rpm-software-management/rpm/pull/1470
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.rpm.org/pipermail/rpm-maint/attachments/20201229/d5ae0af8/attachment.html>


More information about the Rpm-maint mailing list