[Rpm-ecosystem] DNF's use cases of CAShe

Radek Holy rholy at redhat.com
Thu Jul 2 05:20:51 UTC 2015

----- Original Message -----
> From: "James Antill" <james at fedoraproject.org>
> To: "rpm-ecosystem" <rpm-ecosystem at lists.rpm.org>
> Sent: Thursday, July 2, 2015 5:56:33 AM
> Subject: Re: [Rpm-ecosystem] DNF's use cases of CAShe

Thanks for the quick response.

> On Wed, 2015-07-01 at 09:10 -0400, Radek Holy wrote:
> > Hi James (and others),
> > 
> > I've identified several DNF use cases that, I believe, can be
> > supported by the new CAShe project.
> > 
> > 
> > Makecache
> > ---------
> > 
> > DNF's metadata cache is often refreshed; on demand, regularly or lazily...
> > 1) download the checksums of the current metadata
> > 2) if they do not match the cache, download the current metadata and store
> > them in the cache
> > 3) remove the old metadata of the given repositories from the cache
> > 
> > In this case, I think that DNF should remember the checksums of the
> > metadata stored in CAShe for each repository (including the
> > $RELEASEVER etc.). Then it could store the new metadata in CAShe, make
> > the local hardlinks, remove the old hardlinks, trigger the cleanup and
> > update the list of the checksums.
>  huh? I'm not sure why you want to remember the checksums of metadata?

Well, making the hardlinks is actually "remembering the checksums". I need to keep the hardlinks as long as the data are valid so that an accidental cleanup does not remove the data. But I agree that the paragraph is confusing. There is no need to have an additional list of the hardlinks in the internal cache. The directory is enough.

>  The point of the CAShe is that when you are about to download something
> with a checksum of XYZ you do:
> 1. does object with checksum XYZ exist in the CAShe.
> 2a. If yes, then get it into my program cache.
> 2b. If no, then I download it. When done I can put it into the CAShe.
> ...where hopefully 2a and 2b can use hardlinks, to save diskspace and do
> automatic tracking of inuse objects.
>  I wouldn't trigger the CAShe cleanup here, esp. as you didn't do
> anything that would need cleaning up.

That step removes the hardlinks to the obsoleted metadata, if there are any. You wouldn't trigger the cleanup even after that? (I don't insist on it; I just expected that this is what users would prefer.) I agree that if nothing was obsoleted, there is no need for a cleanup.
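
To make sure we mean the same flow, here is how I read steps 1, 2a and 2b quoted above, as a minimal Python sketch. This is not the CAShe API: the `fetch` function, the flat store layout (one file named after its SHA-256 checksum) and the `download` callback are all assumptions made up for illustration.

```python
import hashlib
import os
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch(store, checksum, dest, download):
    """Place the object `checksum` at `dest`, downloading only on a miss.

    `store` is a hypothetical shared directory, `download(dest)` is a
    caller-supplied function that writes the file. Returns "hit" or "miss".
    """
    store = Path(store)
    dest = Path(dest)
    obj = store / checksum
    if obj.exists():
        # 2a: the object is already stored, hardlink it into our own cache.
        os.link(obj, dest)
        return "hit"
    # 2b: not stored yet, download it and verify it before sharing it.
    download(dest)
    if sha256_of(dest) != checksum:
        dest.unlink()
        raise ValueError("checksum mismatch for %s" % dest)
    store.mkdir(parents=True, exist_ok=True)
    os.link(dest, obj)
    return "miss"
```

With hardlinks, the link count of the stored object is what tracks whether anybody still uses it, so no separate list of checksums is needed. The sketch ignores races between concurrent callers and assumes the store and the destination are on the same filesystem.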

>  The only time you need to think about the checksums are when you are
> downloading (when you'd have them anyway).
>
> > Download
> > --------
> > 
> > People sometimes want to download an RPM.
> > 1) "makecache"
> > 2) is the package stored in the cache?
> >     yes) copy it to the destination directory
> >     no) just download it (without storing it in the cache)
> > It may be configured to make a hardlink instead of copying the file
> > and to store the package in the cache.
> > 
> > The default case is just about retrieving the RPM from CAShe. No rocket
> > science.
>  Sure, this seems fine.
>
> > Install/upgrade etc.
> > --------------------
> > 
> > These are the core use cases of DNF.
> > 1) "makecache"
> > 2) if the packages are not stored in the cache, download them and
> > store them in the cache
> > 3) once the transaction is successfully performed (after N attempts),
> > all the downloaded packages are removed from the cache
> > 
> > Also people want to download the packages in advance and perform the
> > upgrade later. The procedure is the same. The only difference is that
> > there is a delay between (2) and (3) for sure.
> > 
> > In this case, DNF should persist the list of the checksums to be
> > removed after the successful transaction in addition. Then, it will
> > remove the hardlinks, trigger the cleanup and clean the list.
>  Why do you want to explicitly/manually remove things from the CAShe?
> That defeats the purpose. Just run the cleanup code, and if the packages
> you just installed are chosen to be deleted so be it ... no reason to
> manually delete them though.

I'd just remove the hardlinks from DNF's cache and trigger the cleanup. Is this wrong? So, how can a user with very little disk space make sure that nothing unneeded is left in CAShe at any given moment?

>  Indeed the installroot usecases break if you do this, as do the
> multi-machine use cases. And while less likely there are use cases where
> a user would want the packages again.
>  In yum (and I assume dnf) we have to remove the packages from the
> application cache after they are installed in a transaction, this is
> because there's no other good time to find and delete them. So there is
> only one lifetime configuration, delete when used or keep forever. CAShe
> allows the user to set more useful lifetime and size constraints.
>
> > Upgrade other devices, virtual machines and containers from a single cache
> > --------------------------------------------------------------------------
> > 
> > People want to download the data once and reuse them among the whole LAN.
> > 1) "Install/upgrade etc." on all representative systems
> > 2) but don't remove anything, or remove just those packages that are
> > not needed any more (based on access times or using a depsolver)
> >
> > Here I think that it shouldn't be needed to hardlink the packages into
> > DNF's cache since the instance which downloaded them probably does not
> > need them any more.
>  I'm not sure what you mean here. For each system just download them as
> you would, and delete them from DNF's cache as you would.
>  Then for each machine either have the CAShe mounted over NFS, or use
> the cashe rsync-to/rsync-from commands.

So, how would you set up a package manager and CAShe (and potentially any other software that uses CAShe) to make sure that every single package is downloaded only once in a potentially heterogeneous network?

> > Undo/downgrade
> > --------------
> > 
> > Fedora removes the old packages from the repositories but people
> > sometimes need to undo a transaction or downgrade a package.
> > 1) "Install/upgrade etc."
> > 2) but remove only those packages that were persisted on the list but
> > were *not* installed during the last successful transaction
> > 
> > This is the same as the "Install/upgrade etc." case, just the DNF's
> > logic includes an additional condition. Also this may be a task for
> > another tool.
>  I'm not sure if you are trying to implement some kind of hidden/shadow
> repos. with the CAShe data here, or something?
>  If you want to be able to download/downgrade upgraded Fedora packages
> then you also want to implement something similar to the "yum-local"
> plugin. I wouldn't recommend using CAShe as a backend for this though.

Yes, that's it. I wanted to make the "local" plugin use CAShe.

>  Again, just treat it as it works in DNF now. If the package is
> available from a repo. with a checksum, then you don't need to download
> it if you can look it up in the CAShe.

In this case, you need to have every single package that has ever been installed in CAShe. How would you achieve that, if not the way I proposed?

> > What do you think? Are these use cases reasonable from the CAShe's
> > POV? What do you think about the brief implementation proposals?
> > 
> > 
> > 
> > And some ideas:
> > - I think that DNF should, by default, trigger the cleanup just for
> > those packages of which it removed the hardlinks (since the operation
> > may not be cheap).
>  No, don't explicitly remove anything.
>  You can decide not to call the cleanup operation unless you have
> removed packages from DNF's cache (presumably due to a transaction), to
> not do the "expensive" operation.

Sure, I don't want to call the cleanup operation if I don't remove any hardlinks. But if I do unlink something, then I believe I should be able to ask CAShe to check whether the given content is needed somewhere else and, if not, to clean it. Since there can be many unneeded items in CAShe, I don't want to force the user to wait for a general cleanup after every successful "dnf upgrade".
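
To make the request concrete, here is a minimal Python sketch of such a targeted check. It is purely hypothetical: I am not claiming that CAShe offers anything like `release`, and the flat store layout (one file per checksum) is assumed only for illustration. It relies on the hardlink count of the stored object to tell whether somebody else still needs it.

```python
import os
from pathlib import Path


def release(store, checksum, local_link):
    """Drop our hardlink; remove the stored object if nobody else links it.

    Hypothetical helper, not part of CAShe. Returns True if the object
    was removed from the store, False if it was kept or already gone.
    """
    # Remove the link from the application's own cache first.
    Path(local_link).unlink()
    obj = Path(store) / checksum
    try:
        # A link count above 1 means another cache still references it.
        if obj.stat().st_nlink > 1:
            return False
        obj.unlink()
    except FileNotFoundError:
        return False
    return True
```

This touches one object only, so it costs a couple of syscalls instead of a readdir+stat over the whole store. It obviously cannot see users on other machines (NFS, rsync), which is one reason why the general cleanup may still be the safer default.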

>  Note that "expensive" here is relative, in the best (hopefully normal)
> case getting a file from the CAShe is a single syscall, and putting a
> file into it is 2 syscalls. By comparison the cleanup operation requires
> at least reading the config. and readdir+stat'ing all the files in the
> CAShe. If you have to load from disk that readdir'ing+stat'ing is
> noticeable when someone runs "list foo", but not so much when someone
> runs "upgrade firefox".
>
> > - I think that there should be some other way how to mark that a file
> > should stay in CAShe. E.g. in the "makecache" case, a user may clean
> > the DNF's cache for some reason. Then something may trigger the
> > automatic cleanup. It will remove the metadata even though they are
> > still useful for DNF and for the other package managers that were not
> > executed for some time, if the data are fresh.
>  No, that's not how it works. Even if you manage to run "cashe cleanup"
> just after you ran "dnf clean all" it won't delete everything, it just
> obeys the limits (either by default or changed by the user).
>  By default that's the last 500MB of stuff used (up to another 1500MB of
> stuff, if it's been accessed in the last 8 days). Obviously those can be
> changed.

OK, this is probably not a problem for package managers and metadata, since it is assumed that they are run at least once a week or that the data will become obsolete after 8 days anyway. But the sysadmin should then know not to set CAShe's time limit below the longest expiration period of all the enabled repositories if they don't want to re-download the metadata (in case they are short of disk space and run the CAShe cleanup often). What I mean is that CAShe may hold data less important than the repository metadata (even if those data were accessed more recently), which should be removed first when the limits are exceeded, but CAShe currently cannot recognize the priority of the content.
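
For reference, this is how I understand the default limits quoted above, as a rough Python sketch. It is only my reading, not the actual CAShe cleanup code: the `Entry` type, the `select_for_removal` function and the exact way the two size limits combine are assumptions.

```python
from dataclasses import dataclass

MB = 1024 * 1024
DAY = 24 * 60 * 60


@dataclass
class Entry:
    name: str
    size: int      # bytes
    atime: float   # last access, seconds since the epoch


def select_for_removal(entries, now, keep_bytes=500 * MB,
                       extra_bytes=1500 * MB, max_age=8 * DAY):
    """Return the names of the entries a limit-based cleanup would drop.

    Assumed policy: walk the entries from the most recently used one,
    always keep the first `keep_bytes`, keep up to `extra_bytes` more if
    the entry was accessed within `max_age`, and drop everything else.
    """
    doomed = []
    used = 0
    for entry in sorted(entries, key=lambda e: e.atime, reverse=True):
        fits_base = used + entry.size <= keep_bytes
        recent = now - entry.atime <= max_age
        fits_extra = used + entry.size <= keep_bytes + extra_bytes and recent
        if fits_base or fits_extra:
            used += entry.size
        else:
            doomed.append(entry.name)
    return doomed
```

The only inputs are size and access time, so a policy of this kind cannot prefer repository metadata over less important content that happened to be accessed more recently.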

> >  The same goes for the "multiple devices" and "undo/downgrade" cases.
> > Packages can be installed using multiple package managers and all of
> > them should be able to contribute to these repositories for the other
> > devices or to allow the future downgrades. If the metadata were marked
> > as "latest metadata for given repository" and the packages as
> > "packages for random/remote usage", all the package managers could
> > better collaborate to achieve the goal.
>  That's what the hardlinking does, without having to put any knowledge
> about repos./packages/URLs/etc. in the CAShe layer. And in the cases
> where hardlinking doesn't work all it needs to care about is data =>
> checksum mapping and what is the most recently used data, and it should
> mostly just work anyway (depending on what you set the limits to).
Radek Holý
Associate Software Engineer
Software Management Team
Red Hat Czech
