<!DOCTYPE html><html><head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> </head> <body> Am 22.08.24 um 10:21 schrieb Thomas Hellström: <blockquote type="cite" cite="mid:75bce0097d86896fa70d6dba4c8ddb429a5bc1bc.camel@linux.intel.com"> <pre class="moz-quote-pre" wrap="">On Thu, 2024-08-22 at 09:55 +0200, Christian König wrote: </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">Am 22.08.24 um 08:47 schrieb Thomas Hellström: </pre> <blockquote type="cite"> <blockquote type="cite"> <blockquote type="cite"> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">As Sima said, this is complicated but not beyond comprehension: i915 <a class="moz-txt-link-freetext" href="https://elixir.bootlin.com/linux/v6.11-rc4/source/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c#L317">https://elixir.bootlin.com/linux/v6.11-rc4/source/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c#L317</a> </pre> </blockquote> <pre class="moz-quote-pre" wrap="">As far as I can tell what i915 does here is extremely questionable. if (sc->nr_scanned < sc->nr_to_scan && current_is_kswapd()) { .... with_intel_runtime_pm(&i915->runtime_pm, wakeref) { with_intel_runtime_pm() then calls pm_runtime_get_sync(). So basically the i915 shrinker assumes that when called from kswapd() that it can synchronously wait for runtime PM to power up the device again. As far as I can tell that means that a device driver makes strong and completely undocumented assumptions how kswapd works internally. </pre> </blockquote> <pre class="moz-quote-pre" wrap="">Admittedly that looks weird But I'd really expect a reclaim lockdep splat to happen there if the i915 pm did something not-allowed. IIRC, the design direction the i915 people got from mm people regarding the shrinkers was to avoid any sleeps in direct reclaim and punt it to kswapd. Need to ask i915 people how they can get away with that. </pre> </blockquote> <pre class="moz-quote-pre" wrap="">So it turns out that Xe integrated pm resume is reclaim-safe, and I'd expect i915's to be as well. Xe discrete pm resume isn't. So that means that, at least for integrated, the i915 shrinker should be ok from that POW, and punting certain bos to kswapd is not AFAICT abusing any undocumented features of kswapd but rather a way to avoid resuming the device during direct reclaim, like documented. </pre> </blockquote> <pre class="moz-quote-pre" wrap=""> The more I think about this the more I disagree to this driver design. In my opinion device drivers should *never* resume runtime PM in a shrinker callback in the first place. </pre> </blockquote> <pre class="moz-quote-pre" wrap=""> Runtime PM resume is allowed even from irq context if carefully implemented by the driver and flagged as such to the core. <a class="moz-txt-link-freetext" href="https://docs.kernel.org/power/runtime_pm.html">https://docs.kernel.org/power/runtime_pm.html</a> Resuming runtime PM from reclaim therefore shouldn't be an issue IMO, and really up to the driver.</pre> </blockquote> Mhm when it's up to the driver on which level to use runtime PM then that at least explains why the framework doesn't have lockdep annotations. Ok, that is at least convincing the what i915 does here should work somehow. <blockquote type="cite" cite="mid:75bce0097d86896fa70d6dba4c8ddb429a5bc1bc.camel@linux.intel.com"> <blockquote type="cite"> <pre class="moz-quote-pre" wrap=""> When the device is turned off it means that all of it's operations are stopped and eventually power to caches etc turned off as well. So I don't see any ongoing writeback operations or similar either. So the question is why do we need to power on the device in a shrinker in the first place? What could be is that the device needs to flush GART TLBs or similar when it is turned on, e.g. that you grab a PM reference to make sure that during your HW operation the device doesn't suspend. </pre> </blockquote> <pre class="moz-quote-pre" wrap=""> Exactly why the i915 needs to flush the GART I'm not sure of but I suspect the gart TLB might be forgotten while suspended.</pre> </blockquote> Well that is unproblematic. Amdgpu and I think nouveau does something similar. But you don't need to resume the hardware for this, just grabbing the reference to make sure that it doesn't suspend is sufficient. The assumption I make here is that you don't need to do anything when the hardware is power down anyway. That seems to be true for at least the hardware designs I'm aware of. <blockquote type="cite" cite="mid:75bce0097d86896fa70d6dba4c8ddb429a5bc1bc.camel@linux.intel.com"> <blockquote type="cite"> <pre class="moz-quote-pre" wrap=""> But that doesn't mean that you should resume the device. In other words when the device is powered down you shouldn't power it up again. And for GART we already have the necessary move callback implemented in TTM. This is done by radeon, amdgpu and nouveu in a common way as far as I can see. So why should Xe be special and follow the very questionable approach of i915 here? </pre> </blockquote> <pre class="moz-quote-pre" wrap=""> For Xe, Lunar Lake (integrated) has the interesting design that each bo carries compression metadata that needs to be blitted to system pages during shrinking. The alternative is to resolve all buffer objects at device runtime suspend...</pre> </blockquote> That's the same for amdgpu as well, but when the device is powered down those compression data needs to be evacuated anyway. <blockquote type="cite" cite="mid:75bce0097d86896fa70d6dba4c8ddb429a5bc1bc.camel@linux.intel.com"> <pre class="moz-quote-pre" wrap="">But runtime PM aside, with a one-bo only approach we still have the drawbacks that it * eliminates possibility for driver deadlock avoidance * Requires TTM knowledge of "purgeable" bos * Requires an additional LRU to avoid O(n2) traversal of already shrunken objects * Drivers with legitimate shrinker designs that don't fit in the TTM- enforced model will have frustrated maintainers.</pre> </blockquote> I still find that only halve-convincing. The real question is if it's a good idea to give drivers the power to decide what to shrink and what not to shrink. And at least with the arguments and experience at hand I would vote for not doing that. We have added the eviction_valuable callback for amdgpu and ended up in quite a mess with that. Background is that some eviction decision done by the driver where not as optimal as we hoped it to be. On the other hand keeping track of all the swapped out objects should be TTMs job anyway, e.g. having a TTM_PL_SWAPPED domain. So in my mind the ideal solution still looks like this: driver_specific_shrinker_scan(...) { driver_specific_preparations(...); bo = ttm_reserve_next_bo_to_shrink(...); ttm_bo_validate(bo, TTM_PL_SWAPPED); ttm_bo_unreserver(bo); driver_specific_cleanups(...); } When there is a potential deadlock because the shrinker might be called from driver code which holds locks the driver needs to it's specific preparation or cleanup then those would apply to all BOs and not just the one returned from TTM. The only use case I can see were the driver would need to filter out the BOs to shrink would be if TTM doesn't know about all the information to make a decision what to shrink and exactly that is what I try to avoid. Regards, Christian. <blockquote type="cite" cite="mid:75bce0097d86896fa70d6dba4c8ddb429a5bc1bc.camel@linux.intel.com"> <pre class="moz-quote-pre" wrap=""> Thanks, Thomas </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap=""> Regards, Christian. </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap=""> /Thomas </pre> </blockquote> </blockquote> <pre class="moz-quote-pre" wrap=""> </pre> </blockquote> </body> </html>