[PATCH 4/7] drm/ttm: move LRU walk defines into new internal header

Thu Aug 22 06:47:48 UTC 2024

On Wed, 2024-08-21 at 14:00 +0200, Thomas Hellström wrote:
> Hi,
> 
> On Wed, 2024-08-21 at 11:48 +0200, Christian König wrote:
> > Am 21.08.24 um 10:57 schrieb Thomas Hellström:
> > > On Wed, 2024-08-21 at 10:14 +0200, Christian König wrote:
> > > > Am 20.08.24 um 18:00 schrieb Thomas Hellström:
> > > > > > Or why exactly should shrinking fail?
> > > > > A common example would be not having runtime pm and the
> > > > > particular
> > > > > bo
> > > > > needs it to unbind, we want to try the next bo. Example: i915
> > > > > GGTT
> > > > > bound bos and Lunar Lake PL_TT bos.
> > > > WHAT? So you basically block shrinking BOs because you can't
> > > > unbind
> > > > them
> > > > because the device is powered down?
> > > > 
> > > > I would say that this is a serious NO-GO. It basically means
> > > > that
> > > > powered down devices can lock down system memory for undefined
> > > > amount
> > > > of
> > > > time.
> > > > 
> > > > In other words an application can allocate memory, map it into
> > > > GGTT
> > > > and
> > > > then suspend or even get killed and we are not able to recover
> > > > the
> > > > memory because there is no activity on the GPU any more?
> > > > 
> > > > That really sounds like a bug in the driver design to me.
> > > It's bad but it's not as bad as it sounds.
> > > 
> > > Problem is we can't wake up during direct reclaim IIRC due to
> > > runtime
> > > pm lockdep violations, but we can and do fire up a thread to wake
> > > the
> > > device and after the wakeup delay have subsequent shrink calls
> > > succeed,
> > > or punt to kswapd or the oom handler.
> > 
> > Yeah that is obvious. The runtime PM is an interface designed to be
> > used 
> > from a very high level IOCTL/system call.
> > 
> > And delegating that from a shrinker to a worker is not valid as far
> > as I 
> > can see, instead of reducing the memory pressure the shrinker would
> > then 
> > increase it.
> 
> Perhaps an ignorant question, but aren't IO devices potentially woken
> up for swapout at memory pressure time using runtime PM, thereby
> increasing memory pressure in a similar way?
> 
> > 
> > > I think that's an orthogonal discussion, though. There are other
> > > reasons shrinking might fail, like the bo being busy in direct
> > > reclaim
> > > (shouldn't wait for idle there but ok in kswapd), Other points of
> > > failure is ofc shmem radix tree allocations (not seen one yet,
> > > though)
> > > which might succeed with a smaller bo.
> > > (Not saying, though, that there isn't more to be done with the xe
> > > runtime pm implementation).
> > 
> > I don't think that argumentation is valid.
> > 
> > When a BO is locked then that it is ok to not shrink it, but TTM
> > should 
> > be able to determine all those prerequisites.
> > 
> > In other words the idea of a function returning a BO to the driver
> > is
> > that the driver is obligated to shrink that one.
> 
> But that doesn't take potential driver deadlock avoidance into
> account,
> so it would restrict the driver shrinkers by assuming that all
> deadlock
> avoidance needed is known by TTM.
> 
> And since small memory allocations are allowed even at reclaim time
> to
> be able so reclaim or shrink even more memory, IMO the pm resume
> problem reduces into a deadlock avoidance problem where we use
> pm_tryget similar to bo_trylock. Important thing, though, in both
> cases, that the device is woken or similarly the lock is released in
> reasonable time for the reclaim to retry and succeed.
> 
> > 
> > That other necessary allocation can fail like shmen for example is 
> > obvious as well, but that's why we discussed to allow shrinking BOs
> > partially as well.
> 
> Good point, that eliminates that problem.
> 
> > 
> > And I really don't think this discussion is orthogonal. We are
> > basically 
> > discussing what drivers should do and not should do. And as far as
> > I
> > can 
> > see the requirement to expose the LRUs to drivers comes up only
> > because 
> > the driver wants to do something it shouldn't.
> 
> Currently we have the purgeable vs non-purgeable and driver-level
> trylock-type deadlock avoidance not falling into those categories.
> And
> I'd like to categorize runtime pm as a trylock-type deadlock
> avoidance.
> 
> bo trylock and idle could, as you say, be handled in the TTM helper,
> so
> can purgeable and non-purgeable at lack of swap-space, provided that
> we
> provide such a flag in the ttm_bo ofc?
> 
> > 
> > > > > And again, all other drm bo shrinkers do this. We just want
> > > > > to
> > > > > do
> > > > > the
> > > > > same.
> > > > Do you have pointers?
> > > As Sima said, this is complicated but not beyond comprehension:
> > > i915
> > > https://elixir.bootlin.com/linux/v6.11-rc4/source/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c#L317
> > 
> > As far as I can tell what i915 does here is extremely questionable.
> > 
> >      if (sc->nr_scanned < sc->nr_to_scan && current_is_kswapd()) {
> > ....
> >          with_intel_runtime_pm(&i915->runtime_pm, wakeref) {
> > 
> > with_intel_runtime_pm() then calls pm_runtime_get_sync().
> > 
> > So basically the i915 shrinker assumes that when called from
> > kswapd()
> > that it can synchronously wait for runtime PM to power up the
> > device
> > again.
> > 
> > As far as I can tell that means that a device driver makes strong
> > and
> > completely undocumented assumptions how kswapd works internally.
> 
> Admittedly that looks weird
> 
> But I'd really expect a reclaim lockdep splat to happen there if the
> i915 pm did something not-allowed. IIRC, the design direction the
> i915
> people got from mm people regarding the shrinkers was to avoid any
> sleeps in direct reclaim and punt it to kswapd. Need to ask i915
> people
> how they can get away with that.
> 
> 

So it turns out that Xe integrated pm resume is reclaim-safe, and I'd
expect i915's to be as well. Xe discrete pm resume isn't.

So that means that, at least for integrated, the i915 shrinker should
be ok from that POW, and punting certain bos to kswapd is not AFAICT
abusing any undocumented features of kswapd but rather a way to avoid
resuming the device during direct reclaim, like documented.

/Thomas