[Intel-gfx] [PATCH v8 3/7] drm/i915: Define and use GuC and CTB TLB invalidation routines

Tue Oct 10 09:00:42 UTC 2023

On 09/10/2023 20:14, John Harrison wrote:
> On 10/9/2023 01:56, Tvrtko Ursulin wrote:
>> On 06/10/2023 19:20, Jonathan Cavitt wrote:
>>> From: Prathap Kumar Valsan <prathap.kumar.valsan at intel.com>
>>>
>>> The GuC firmware had defined the interface for Translation Look-Aside
>>> Buffer (TLB) invalidation.  We should use this interface when
>>> invalidating the engine and GuC TLBs.
>>> Add additional functionality to intel_gt_invalidate_tlb, invalidating
>>> the GuC TLBs and falling back to GT invalidation when the GuC is
>>> disabled.
>>> The invalidation is done by sending a request directly to the GuC
>>> tlb_lookup that invalidates the table.  The invalidation is submitted as
>>> a wait request and is performed in the CT event handler.  This means we
>>> cannot perform this TLB invalidation path if the CT is not enabled.
>>> If the request isn't fulfilled in two seconds, this would constitute
>>> an error in the invalidation as that would constitute either a lost
>>> request or a severe GuC overload.
>>>
>>> With this new invalidation routine, we can perform GuC-based GGTT
>>> invalidations.  GuC-based GGTT invalidation is incompatible with
>>> MMIO invalidation so we should not perform MMIO invalidation when
>>> GuC-based GGTT invalidation is expected.
>>>
>>> Purpose of xarray:
>>> The tlb_lookup table is allocated as an xarray because the set of
>>> pending TLB invalidations may have no upper bound.  The consequence of
>>> this is that all actions interfacing with this table need to use the
>>> xarray functions, such as xa_alloc_cyclic_irq for array insertion.
>>>
>>> Purpose of must_wait_woken:
>>> Our wait for the G2H ack for the completion of a TLB invalidation is
>>> mandatory; we must wait for the HW to confirm that the physical
>>> addresses are no longer accessible before we return those to the system.
>>>
>>> On switching to using the wait_woken() convenience routine, we
>>> introduced ourselves to an issue where wait_woken() may complete early
>>> under a kthread that is stopped. Since we send a TLB invalidation when
>>> we try to release pages from the shrinker, we can be called from any
>>> process; including kthreads.
>>>
>>> Using wait_woken() from any process context causes another issue. The
>>> use of is_kthread_should_stop() assumes that any task with PF_KTHREAD
>>> set was made by kthread_create() and has called set_kthread_struct().
>>> This is not true for the raw kernel_thread():
>>
>> This explanation misses the main point of my ask - which is to explain 
>> why a simpler scheme isn't sufficient. Simpler scheme aka not needed 
>> the xarray or any flavour of wait_token().
>>
>> In other words it is obvious we have to wait for the invalidation ack, 
>> but not obvious why we need a complicated scheme.
> The alternative being to simply serialise all TLB invalidation requests? 
> Thus, no complex tracking required as there is only one in flight at a 
> time? That seems inefficient and a potential performance impact if a 
> bunch of invalidations are required back to back. But given that the 
> current scheme is global invalidation only (no support for ranges / per 
> page invalidation yet), is it possible to get multiple back-to-back 
> requests?

It is possible to get a stream of invalidation requests but all that 
come with put_pages() are serialized on the gt->tlb.invalidate_lock 
anyway. So for them only benefit with the complicated approach versus 
the dumb single wait queue is avoiding wake up storms.

Then the second source of TLB invalidations is ggtt->invalidate(). I am 
not sure if those are frequent enough to warrant parallelism. Definitely 
shouldn't be for things like context images and ringbuffers. So I was 
asking if maybe framebuffers but don't know.

Regards,

Tvrtko