[Intel-gfx] [igt-dev] [PATCH i-g-t 1/2] tests/gem_exec_await: Relax the busy spinner

Mon Nov 19 19:34:37 UTC 2018

Quoting Tvrtko Ursulin (2018-11-19 19:18:52)
> 
> On 19/11/2018 16:18, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-11-19 15:33:56)
> >>
> >> On 19/11/2018 15:28, Chris Wilson wrote:
> >>> Quoting Tvrtko Ursulin (2018-11-19 15:22:28)
> >>>> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> >>>>
> >>>> Add some nop instructions between recursive batch buffer start calls to
> >>>> give the system some breathing room. Without these, especially when coupled
> >>>> with memory pressure, false GPU hangs can be observed caused by the
> >>>> inability of the chip to cope.
> >>>
> >>> Doesn't seem to be required. And the machines most susceptible to timer
> >>> errors due to busyspin have not shown the issue.
> >>
> >> With the memory pressure subtest, the second patch in this series, it
> >> was make it or break it to have the nops. Without them it was GPU hangs
> >> all around, and with them so far all clean.
> > 
> > First machine bsw, just applying patch 2/2,
> > 
> > IGT-Version: 1.23-gb6b8d829 (x86_64) (Linux: 4.20.0-rc2+ x86_64)
> > Using Execlists submission
> > Ring size: 131 batches
> > Starting subtest: wide-all
> > wide: 420 cycles: 24121.034us
> > Subtest wide-all: SUCCESS (47.060s)
> > Starting subtest: wide-contexts
> > wide: 340 cycles: 24893.896us
> > Subtest wide-contexts: SUCCESS (22.265s)
> > Starting subtest: wide-contexts-mempressure
> > wide: 232 cycles: 25153.899us
> > Subtest wide-contexts-mempressure: SUCCESS (23.141s)
> > 
> > :|
> 
> Yes, I think what I had before I cleaned up the test case was more 
> copy&paste of the memory pressure thread from gem_syslatency - including 
> the rtprio and multithreadedness. So I was possibly starving the 
> tasklets and who knows what else, as well as applying memory pressure.
> However, the fact remains that adding nops to the spinner made even that
> monster pass repeatedly. I'll play with it more tomorrow.

I am a bit nervous about using the noops to avoid the issue, as I
presume that there is a more realistic workload that could generate
similar system latencies, i.e. that there exists a pathological case
that users will hit for similar stalls (gem shrinker perchance).
-Chris
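
For readers following the thread, the padding under discussion can be
sketched roughly as below. This is only an illustration, not the code from
the patch: the helper name emit_relaxed_spinner() and its parameters are
hypothetical, the opcode values are assumed to match the usual i915 MI
command definitions, and the gen8+ style 64-bit address layout (including
the address-space bit) is an assumption. It only fills a CPU-side buffer
and does not submit anything to the GPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed MI command encodings (as commonly defined for i915). */
#define MI_NOOP               0u
#define MI_BATCH_BUFFER_START (0x31u << 23)

/*
 * Hypothetical helper: write `nops` MI_NOOP dwords into `batch`, followed
 * by a gen8+ style MI_BATCH_BUFFER_START that jumps back to `self_addr`
 * (assumed to be the GPU address of the start of this same batch).
 * The nops give the command streamer some work between successive jumps
 * instead of re-executing the jump back to back.
 * Returns the number of dwords written.
 */
static size_t emit_relaxed_spinner(uint32_t *batch, size_t len,
                                   size_t nops, uint64_t self_addr)
{
    size_t i = 0;

    /* Room for the padding plus the 3-dword jump command. */
    assert(len >= nops + 3);

    while (i < nops)
        batch[i++] = MI_NOOP;

    /* Bit 8 (address space select) and length field 1: assumed gen8+ layout. */
    batch[i++] = MI_BATCH_BUFFER_START | 1u << 8 | 1u;
    batch[i++] = (uint32_t)self_addr;          /* lower 32 bits of target */
    batch[i++] = (uint32_t)(self_addr >> 32);  /* upper 32 bits of target */

    return i;
}
```

With nops == 0 this degenerates to the tight busy spinner, which makes it
easy to compare the two behaviours in a test by varying a single parameter.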