[PATCH 4/7] drm/ttm: move LRU walk defines into new internal header

Thu Aug 22 09:29:24 UTC 2024

Am 22.08.24 um 10:21 schrieb Thomas Hellström:
> On Thu, 2024-08-22 at 09:55 +0200, Christian König wrote:
>> Am 22.08.24 um 08:47 schrieb Thomas Hellström:
>>>>>> As Sima said, this is complicated but not beyond
>>>>>> comprehension:
>>>>>> i915
>>>>>> https://elixir.bootlin.com/linux/v6.11-rc4/source/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c#L317
>>>>> As far as I can tell what i915 does here is extremely
>>>>> questionable.
>>>>>
>>>>>        if (sc->nr_scanned < sc->nr_to_scan &&
>>>>> current_is_kswapd()) {
>>>>> ....
>>>>>            with_intel_runtime_pm(&i915->runtime_pm, wakeref) {
>>>>>
>>>>> with_intel_runtime_pm() then calls pm_runtime_get_sync().
>>>>>
>>>>> So basically the i915 shrinker assumes that when called from
>>>>> kswapd()
>>>>> that it can synchronously wait for runtime PM to power up the
>>>>> device
>>>>> again.
>>>>>
>>>>> As far as I can tell that means that a device driver makes
>>>>> strong
>>>>> and
>>>>> completely undocumented assumptions how kswapd works
>>>>> internally.
>>>> Admittedly that looks weird
>>>>
>>>> But I'd really expect a reclaim lockdep splat to happen there if
>>>> the
>>>> i915 pm did something not-allowed. IIRC, the design direction the
>>>> i915
>>>> people got from mm people regarding the shrinkers was to avoid
>>>> any
>>>> sleeps in direct reclaim and punt it to kswapd. Need to ask i915
>>>> people
>>>> how they can get away with that.
>>>>
>>>>
>>> So it turns out that Xe integrated pm resume is reclaim-safe, and
>>> I'd
>>> expect i915's to be as well. Xe discrete pm resume isn't.
>>>
>>> So that means that, at least for integrated, the i915 shrinker
>>> should
>>> be ok from that POW, and punting certain bos to kswapd is not
>>> AFAICT
>>> abusing any undocumented features of kswapd but rather a way to
>>> avoid
>>> resuming the device during direct reclaim, like documented.
>> The more I think about this the more I disagree to this driver
>> design.
>> In my opinion device drivers should *never* resume runtime PM in a
>> shrinker callback in the first place.
> Runtime PM resume is allowed even from irq context if carefully
> implemented by the driver and flagged as such to the core.
>
> https://docs.kernel.org/power/runtime_pm.html
>
> Resuming runtime PM from reclaim therefore shouldn't be an issue IMO,
> and really up to the driver.

Mhm when it's up to the driver on which level to use runtime PM then 
that at least explains why the framework doesn't have lockdep annotations.

Ok, that is at least convincing the what i915 does here should work somehow.

>> When the device is turned off it means that all of it's operations
>> are
>> stopped and eventually power to caches etc turned off as well. So I
>> don't see any ongoing writeback operations or similar either.
>>
>> So the question is why do we need to power on the device in a
>> shrinker
>> in the first place?
>>
>> What could be is that the device needs to flush GART TLBs or similar
>> when it is turned on, e.g. that you grab a PM reference to make sure
>> that during your HW operation the device doesn't suspend.
> Exactly why the i915 needs to flush the GART I'm not sure of but I
> suspect the gart TLB might be forgotten while suspended.

Well that is unproblematic. Amdgpu and I think nouveau does something 
similar.

But you don't need to resume the hardware for this, just grabbing the 
reference to make sure that it doesn't suspend is sufficient.

The assumption I make here is that you don't need to do anything when 
the hardware is power down anyway. That seems to be true for at least 
the hardware designs I'm aware of.

>> But that doesn't mean that you should resume the device. In other
>> words
>> when the device is powered down you shouldn't power it up again.
>>
>> And for GART we already have the necessary move callback implemented
>> in
>> TTM. This is done by radeon, amdgpu and nouveu in a common way as far
>> as
>> I can see.
>>
>> So why should Xe be special and follow the very questionable approach
>> of
>> i915 here?
> For Xe, Lunar Lake (integrated) has the interesting design that each bo
> carries compression metadata that needs to be blitted to system pages
> during shrinking. The alternative is to resolve all buffer objects at
> device runtime suspend...

That's the same for amdgpu as well, but when the device is powered down 
those compression data needs to be evacuated anyway.

> But runtime PM aside, with a one-bo only approach we still have the
> drawbacks that it
>
> * eliminates possibility for driver deadlock avoidance
> * Requires TTM knowledge of "purgeable" bos
> * Requires an additional LRU to avoid O(n2) traversal of already
> shrunken objects
> * Drivers with legitimate shrinker designs that don't fit in the TTM-
> enforced model will have frustrated maintainers.

I still find that only halve-convincing. The real question is if it's a 
good idea to give drivers the power to decide what to shrink and what 
not to shrink.

And at least with the arguments and experience at hand I would vote for 
not doing that. We have added the eviction_valuable callback for amdgpu 
and ended up in quite a mess with that.

Background is that some eviction decision done by the driver where not 
as optimal as we hoped it to be.

On the other hand keeping track of all the swapped out objects should be 
TTMs job anyway, e.g. having a TTM_PL_SWAPPED domain.

So in my mind the ideal solution still looks like this:

driver_specific_shrinker_scan(...)
{
     driver_specific_preparations(...);
     bo = ttm_reserve_next_bo_to_shrink(...);
     ttm_bo_validate(bo, TTM_PL_SWAPPED);
     ttm_bo_unreserver(bo);
     driver_specific_cleanups(...);
}

When there is a potential deadlock because the shrinker might be called 
from driver code which holds locks the driver needs to it's specific 
preparation or cleanup then those would apply to all BOs and not just 
the one returned from TTM.

The only use case I can see were the driver would need to filter out the 
BOs to shrink would be if TTM doesn't know about all the information to 
make a decision what to shrink and exactly that is what I try to avoid.

Regards,
Christian.

>
> Thanks,
> Thomas
>
>
>> Regards,
>> Christian.
>>
>>
>>> /Thomas
>>>
>>>
>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/dri-devel/attachments/20240822/9fa6d29e/attachment-0001.htm>