[Intel-gfx] [PATCH] drm/i915: Ditch i915 globals shrink infrastructure

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Thu Jul 22 14:02:32 UTC 2021


On 22/07/2021 14:37, Jason Ekstrand wrote:
> On Thu, Jul 22, 2021 at 5:34 AM Tvrtko Ursulin
> <tvrtko.ursulin at linux.intel.com> wrote:
>> On 22/07/2021 11:16, Daniel Vetter wrote:
>>> On Thu, Jul 22, 2021 at 11:02:55AM +0100, Tvrtko Ursulin wrote:
>>>> On 21/07/2021 19:32, Daniel Vetter wrote:
>>>>> This essentially reverts
>>>>>
>>>>> commit 84a1074920523430f9dc30ff907f4801b4820072
>>>>> Author: Chris Wilson <chris at chris-wilson.co.uk>
>>>>> Date:   Wed Jan 24 11:36:08 2018 +0000
>>>>>
>>>>>        drm/i915: Shrink the GEM kmem_caches upon idling
>>>>>
>>>>> mm/vmscan.c:do_shrink_slab() is a thing; if there's an issue with it,
>>>>> then we need to fix it there, not hand-roll our own slab-shrinking
>>>>> code in i915.
>>>>
>>>> This is a somewhat incomplete statement which ignores a couple of
>>>> angles, so I wish there had been a bit more time to respond before
>>>> steamrolling it in. :(
>>>>
>>>> The removed code was not a hand-rolled shrinker, but was about managing
>>>> slab sizes in the face of bursty workloads. Core code does not know when
>>>> i915 is active and when it is idle, so calling kmem_cache_shrink() after
>>>> going idle was supposed to help by doing the housekeeping work outside
>>>> of the latency-sensitive phase.
>>>>
>>>> To "fix" (improve really) it in core as you suggest, would need some method
>>>> of signaling when a slab user feels is an opportunte moment to do this house
>>>> keeping. And kmem_cache_shrink is just that so I don't see the problem.
>>>>
>>>> Granted, the argument that kmem_cache_shrink() is not much used is a
>>>> valid one, so the discussion overall is definitely valid. At a higher
>>>> level we could certainly talk about which workloads actually benefit
>>>> from this code and by how much, which probably no one knows at this
>>>> point.
> 
> Pardon me for being a bit curt here, but that discussion should have
> happened 3.5 years ago when this landed.  The entire justification we
> have on record for this change is, "When we finally decide the gpu is
> idle, that is a good time to shrink our kmem_caches."  We have no
> record of any workloads which benefit from this and no recorded way to
> reproduce any supposed benefits, even if it requires a microbenchmark.
> But we added over 100 lines of code for it anyway, including a bunch
> of hand-rolled RCU juggling.  Ripping out unjustified complexity is
> almost always justified, IMO.  The burden of proof here isn't on
> Daniel to show he isn't regressing anything; it was on you and Chris
> to show that the complexity was worth something back in 2018 when
> this landed.

It feels like there is a lot of knee-jerk reaction when looking at code 
added by Chris, which often results in not properly reading what I wrote.

For instance, I did not ask for any proof of no regressions, nor did I 
claim any regressions. In fact, I said clearly that at this point it is 
not known what benefited from it. The statement at the time wasn't 
clear, so you would need to ask Chris whether he remembers any better 
than what I can find in the mailing list archives. Heck, I even said 
the argument for removal is completely valid.

The point is, the process used to be more inclusive, and IMO there is 
no technical justification to fast-track this type of change. Compared 
to other work in progress, the maintenance cost of this was approaching 
zero.

Besides, the mm folks may still say that it is good hygiene to tidy 
one's own slabs at opportune moments; roughly the pattern sketched 
below. Maybe it is a stretch, but we won't know if we don't ask. There 
are certainly online references to slab reclaim having been problematic 
in the past. There was nothing urgent in this "revert" which couldn't 
have waited a bit longer, or at least until _some_ of the involved 
people were copied.

Regards,

Tvrtko

