[Intel-gfx] [PATCH] drm/i915/execlists: Poison the CSB after use

Tue Oct 30 09:59:18 UTC 2018

Chris Wilson <chris at chris-wilson.co.uk> writes:

> Quoting Mika Kuoppala (2018-10-30 09:31:56)
>> Chris Wilson <chris at chris-wilson.co.uk> writes:
>> 
>> > After reading the event status from the CSB, write back 0 (an invalid
>> > value) so we can detect if the HW should signal a new event without
>> > writing the event in the future.
>> >
>> > References: https://bugs.freedesktop.org/show_bug.cgi?id=108315
>> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>> > Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
>> > ---
>> >  drivers/gpu/drm/i915/intel_lrc.c | 3 +++
>> >  1 file changed, 3 insertions(+)
>> >
>> > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
>> > index 22b57b8926fc..126efe20d2d6 100644
>> > --- a/drivers/gpu/drm/i915/intel_lrc.c
>> > +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> > @@ -910,6 +910,9 @@ static void process_csb(struct intel_engine_cs *engine)
>> >                         execlists->active);
>> >  
>> >               status = buf[2 * head];
>> > +             GEM_BUG_ON(!status);
>> 
>> Assuming we still have a timing issue in here, how about
>> we poll a little until status != 0 and then continue with warning?
>
> If there's any race condition here, we definitely do not want to paper
> over it.
>  
>> We could recover by finding the 'bit late' status, instead of
>> oopsing out.
>
> Oopsing out tells us where the problem is very concisely.

It would deliver the same information, so it is not papering over. The
only benefit is that with this signalling the event won't be lost.
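
For illustration only, the "poll a little until status != 0 and then
continue with warning" idea could look roughly like the sketch below. It
is not driver code: csb_read_status_polled() and CSB_POLL_RETRIES are
hypothetical names, the retry budget is an arbitrary assumption, and a
plain volatile read stands in for whatever accessor the driver would use.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical retry budget; the thread does not suggest a value. */
#define CSB_POLL_RETRIES 100

/*
 * Re-read a CSB status slot until it is non-zero or the retry budget is
 * spent. Returns the status (0 if it never arrived) and reports the
 * number of re-reads via *spins, warning whenever a re-read was needed.
 */
static uint32_t csb_read_status_polled(const volatile uint32_t *slot,
				       unsigned int *spins)
{
	uint32_t status = *slot;
	unsigned int n = 0;

	while (status == 0 && n < CSB_POLL_RETRIES) {
		n++;
		status = *slot;
	}

	*spins = n;
	if (n)
		fprintf(stderr, "CSB status late: %u re-reads, status=0x%x\n",
			n, (unsigned int)status);

	return status;
}
```

A caller would still see status == 0 when the budget runs out, so the
late write is reported either way rather than silently skipped.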

>  
>> > +             GEM_DEBUG_EXEC(WRITE_ONCE(*(u32 *)(buf + 2 * head), 0));
>> 
>> What I am afraid of here is that we change the timing and cache dynamics
>> for our debug builds so that we bury the pesky thing.
>
> That too is a result.

Agreed, so you want to observe behaviour with and without.

>> Perhaps I am wandering too far but let's consider for the csb loop:
>> 
>> read head,tail;
>> rmb();
>> 
>> for_each_csb() {
>>   64 bit read 
>>   64 bit write to zero it, unconditionally 
>>   act_on_it()
>> }
>> 
>> Too heavy?
>
> Too papery - shouts that we don't know what we or the hw is doing. We
> want to pretend that we know what we are doing at least.

Fair enough. My concern was mainly the number of reads, with and without
debug, changing inside the csb loop. But that view should be static to
the cpu at this point regardless.
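
To make the loop under discussion concrete, here is a minimal userspace
sketch of read, check, then poison. It is an assumption-laden stand-in,
not the i915 code: csb_drain() and CSB_ENTRIES are names made up for the
example, assert() plays the role of GEM_BUG_ON(), the zeroing is
unconditional rather than wrapped in GEM_DEBUG_EXEC(), only the 32-bit
status dword is touched (not the full 64-bit entry), and the rmb() after
reading head and tail is omitted.

```c
#include <assert.h>
#include <stdint.h>

#define CSB_ENTRIES 6	/* assumed ring size for this sketch */

/*
 * Walk entries (head, tail] of a CSB-like ring of 2-dword events.
 * Each status dword is read, checked against the poison value 0, then
 * zeroed so that a slot the HW has not rewritten is detectable on the
 * next pass. Consumed statuses are stored in out[]; returns the count.
 */
static unsigned int csb_drain(volatile uint32_t *buf, unsigned int head,
			      unsigned int tail, uint32_t *out)
{
	unsigned int n = 0;

	while (head != tail) {
		uint32_t status;

		if (++head == CSB_ENTRIES)
			head = 0;

		status = buf[2 * head];
		assert(status != 0);	/* stand-in for GEM_BUG_ON(!status) */
		buf[2 * head] = 0;	/* poison the slot after use */

		out[n++] = status;	/* stand-in for acting on the event */
	}

	return n;
}
```

If the HW advanced the write pointer without writing a new event, the
next pass would read back the 0 left behind here and trip the assert.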

So let's try to find out exactly how the hardware writes
the csb entries.

This patch does give us more details,
Reviewed-by: Mika Kuoppala <mika.kuoppala at linux.intel.com>