[Rpm-ecosystem] DNF's use cases of CAShe

James Antill james at fedoraproject.org
Fri Jul 3 03:27:47 UTC 2015


On Thu, 2015-07-02 at 01:20 -0400, Radek Holy wrote:
> > On Wed, 2015-07-01 at 09:10 -0400, Radek Holy wrote:
> >  The point of the CAShe is that when you are about to download something
> > with a checksum of XYZ you do:
> > 
> > 1. Does an object with checksum XYZ exist in the CAShe?
> > 
> > 2a. If yes, then get it into my program cache.
> > 
> > 2b. If no, then I download it. When done I can put it into the CAShe.
> > 
> > ...where hopefully 2a and 2b can use hardlinks, to save disk space and
> > do automatic tracking of in-use objects.
> >  I wouldn't trigger the CAShe cleanup here, esp. as you didn't do
> > anything that would need cleaning up.
> 
> It removes the hardlinks to the obsoleted metadata if there are any.
> Even after that you wouldn't trigger the cleanup? (I don't insist on
> it, I just expected that this is what users would prefer.)

 I'm not sure if it's a good idea to do a cleanup when metadata is
deleted. I didn't do it on the yum side because it's very likely that in
Fedora you'll be installing an update at some point in the near future,
and we can clean up a few extra metadata files then. It might be a good
idea to do it more often though, if only to stop weird edge cases.
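
 For clarity, here's a rough sketch of the lookup/put flow described in
the quoted steps above. The store layout and the helper name are just
assumptions for illustration, not CAShe's actual API:

    # Hypothetical helper, not CAShe's real API: look the object up by
    # checksum, hardlink it into the program's own cache if present,
    # otherwise download it and put it into the store as well.
    import os
    import shutil

    def fetch_via_cashe(store_dir, checksum, dest, download):
        obj = os.path.join(store_dir, checksum[:2], checksum)  # assumed layout
        if os.path.exists(obj):            # 1. + 2a. already in the store
            os.link(obj, dest)             # hardlink, no extra disk space
            return dest
        download(dest)                     # 2b. really download it
        os.makedirs(os.path.dirname(obj), exist_ok=True)
        try:
            os.link(dest, obj)             # put the new object into the store
        except OSError:
            shutil.copy2(dest, obj)        # e.g. store on another filesystem
        return dest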

[...]
> I'd just remove the hardlinks from DNF's cache and trigger the
> cleanup. Is this wrong?

 Ok, yeah, that's good. I thought you meant manually removing data from
the CAShe after you deleted it in DNF.

> > > Upgrade other devices, virtual machines and containers from a single cache
> > > --------------------------------------------------------------------------
> > > 
> > > People want to download the data once and reuse it across the whole LAN.
> > > 1) "Install/upgrade etc." on all representative systems
> > > 2) but don't remove anything, or remove just those packages that are
> > > not needed any more (based on access times or using a depsolver)
> > >
> > > Here I think it shouldn't be necessary to hardlink the packages into
> > > DNF's cache, since the instance which downloaded them probably does
> > > not need them any more.
> > 
> >  I'm not sure what you mean here. For each system just download them as
> > you would, and delete them from DNF's cache as you would.
> >  Then for each machine either have the CAShe mounted over NFS, or use
> > the cashe rsync-to/rsync-from commands.
> 
> So, how would you set up a package manager and CAShe (and potentially
> every other piece of software which uses CAShe) to make sure that every
> single package is downloaded only once in a potentially inhomogeneous
> network?

 As I said above, you just set it up as normal and use a big NFS store
or use the rsync mirroring. The objects are accessed by checksum, so as
long as that doesn't change, all the programs/hosts can access the same
data.
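
 To illustrate the NFS variant (the mount point and on-disk layout below
are assumptions, not CAShe's real format): because objects are addressed
purely by checksum, every host that mounts the same store resolves the
same path, so only the first host actually downloads a given object:

    import os

    SHARED_STORE = "/net/cashe"        # hypothetical shared NFS mount point

    def object_path(checksum):
        # Same checksum -> same path on every host sharing the mount, so a
        # lookup like the one sketched earlier finds objects that any other
        # host has already put there.
        return os.path.join(SHARED_STORE, checksum[:2], checksum)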

> > > Undo/downgrade
> > > --------------
> > > 
> > > Fedora removes the old packages from the repositories but people
> > > sometimes need to undo a transaction or downgrade a package.
> > > 1) "Install/upgrade etc."
> > > 2) but remove only those packages that were persisted on the list but
> > > were *not* installed during the last successful transaction
> > > 
> > > This is the same as the "Install/upgrade etc." case, just DNF's
> > > logic includes an additional condition. Also this may be a task for
> > > another tool.
> > 
> >  I'm not sure if you are trying to implement some kind of hidden/shadow
> > repos. with the CAShe data here, or something?
> >  If you want to be able to download/downgrade upgraded Fedora packages
> > then you also want to implement something similar to the "yum-local"
> > plugin. I wouldn't recommend using CAShe as a backend for this though.
> 
> Yes, that's it. I wanted to make the "local" plugin use CAShe.

 Having the plugin use it as well is ok (but I'm not sure it provides
much benefit, as the local repo. is local anyway).
 The plugin can't use only the CAShe though, as it needs its own primary
storage to control the lifetimes of the packages in the local repo. (e.g.
the last N versions of each package).
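
 A hedged sketch of the kind of lifetime policy that primary storage
would implement; the function and the crude filename split are made up
for illustration, and a real tool would compare EVRs properly:

    # Keep only the last N versions of each package in a local repo. dir.
    import os
    from collections import defaultdict

    def prune_local_repo(repo_dir, keep=3):
        by_name = defaultdict(list)
        for fname in os.listdir(repo_dir):
            if not fname.endswith(".rpm"):
                continue
            # Crude "name-version-release.arch.rpm" split, illustration only.
            by_name[fname.rsplit("-", 2)[0]].append(fname)
        for name, files in by_name.items():
            files.sort(key=lambda f: os.path.getmtime(os.path.join(repo_dir, f)),
                       reverse=True)        # newest first, by mtime
            for old in files[keep:]:
                os.unlink(os.path.join(repo_dir, old))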

> >  Again, just treat it as it works in DNF now. If the package is
> > available from a repo. with a checksum, then you don't need to download
> > it if you can look it up in the CAShe.
> > 
> 
> In this case, you need to have every single package which has ever
> been installed in the CAShe. How would you achieve that, if not the way
> I proposed?

 All the packages go into the CAShe; if the user configures the storage
to be big enough then they'll stay there ... if not, they get removed.

> >  No, don't explicitly remove anything.
> >  You can decide not to call the cleanup operation unless you have
> > removed packages from DNF's cache (presumably due to a transaction), to
> > avoid doing the "expensive" operation.
> 
> If I do unlink something, then I believe that I should be able to ask
> CAShe to check whether the given content is needed somewhere else and,
> if not, clean it. But since there can be many unneeded items in CAShe,
> I don't want to force the user to wait for the general cleanup after
> every successful "dnf upgrade".

 The question mostly isn't "is this needed anymore?"; the question is "if
we need to delete something, which of the items we have are the least
likely to be needed?", and to answer that we need to look at everything
and at what the user's configured limits/policy are.
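
 Roughly, that decision looks like the sketch below. This is illustrative
only, not CAShe's actual algorithm: walk the whole store, and only if the
configured size limit is exceeded start deleting the least recently used
objects, skipping anything still hardlinked from a consumer's cache:

    import os

    def cleanup(store_dir, max_bytes):
        objs, total = [], 0
        for root, _dirs, files in os.walk(store_dir):
            for fname in files:
                path = os.path.join(root, fname)
                st = os.stat(path)
                total += st.st_size
                objs.append((st.st_atime, st.st_size, st.st_nlink, path))
        objs.sort()                        # least recently used first
        for _atime, size, nlink, path in objs:
            if total <= max_bytes:
                break                      # back under the configured limit
            if nlink > 1:
                continue                   # still hardlinked by a consumer
            os.unlink(path)
            total -= size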

[...]
>  But the sysadmin should know then that they shouldn't set the CAShe's
> time limit below the longest expiration period of all the enabled
> repositories if they don't want to re-download the metadata (in case
> they are out of disk space and run the CAShe cleanup often).

 One thing here is that CAShe doesn't have a time limit in a way that
would do that; data isn't deleted _just because_ it's N days old.

>  I mean, there might be less important data in CAShe than the
> repository metadata (even if that data was accessed later) which
> should be removed first if the limits are exceeded, and CAShe currently
> cannot recognize the priority of the content.

 I mean ... this is a problem with all caches that aren't clairvoyant,
and any priorities will be different for different use cases, so I didn't
try that at the moment (in theory you could hack it using utime, but
again ...).
 I'm assuming that LRU is going to be better than MRU, and if you want
to keep a lot of stuff you can always configure the storage size to be
bigger etc. (disk is cheap in this case).
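
 For what it's worth, the utime hack mentioned above would just be
something like this (again, only a sketch; the function name is made up):

    import os

    def protect(obj_path):
        # Bump the object's access/modification time to "now" so an
        # LRU-style cleanup treats it as recently used and evicts it last.
        os.utime(obj_path)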


