[Intel-gfx] [PATCH 4/5] drm/i915: Disable semaphore busywaits on saturated systems

Tue Apr 30 09:04:43 UTC 2019

Quoting Tvrtko Ursulin (2019-04-30 09:55:59)
> 
> On 29/04/2019 19:00, Chris Wilson wrote:
> > Asking the GPU to busywait on a memory address, perhaps not unexpectedly
> > in hindsight for a shared system, leads to bus contention that affects
> > CPU programs trying to concurrently access memory. This can manifest as
> > a drop in transcode throughput on highly over-saturated workloads.
> > 
> > The only clue offered by perf is that the bus-cycles (perf stat -e
> > bus-cycles) jumped by 50% when enabling semaphores. This corresponds
> > with extra CPU active cycles being attributed to intel_idle's mwait.
> > 
> > This patch introduces a heuristic to try and detect when more than one
> > client is submitting to the GPU pushing it into an oversaturated state.
> > As we already keep track of when the semaphores are signaled, we can
> > inspect their state on submitting the busywait batch and if we planned
> > to use a semaphore but were too late, conclude that the GPU is
> > overloaded and not try to use semaphores in future requests. In
> > practice, this means we optimistically try to use semaphores for the
> > first frame of a transcode job split over multiple engines, fail if
> > there are multiple clients active, and continue not to use semaphores
> > for the subsequent frames in the sequence. Periodically, we
> > optimistically try switching semaphores back on, whenever the client
> > waits to catch up with the transcode results.
> > 
> 
> [snipped long benchmark results]
> 
> > Indicating that we've recovered the regression from enabling semaphores
> > on this saturated setup, with a hint towards an overall improvement.
> > 
> > Very similar, but of smaller magnitude, results are observed on both
> > Skylake(gt2) and Kabylake(gt4). This may be due to the reduced impact of
> > bus-cycles: where we see a 50% hit on Broxton, it is only 10% on the big
> > core in this particular test.
> > 
> > One observation to make here is that for a greedy client trying to
> > maximise its own throughput, using semaphores is the right choice. It is
> > only from the holistic system-wide view, where the semaphores of one
> > client impact another and reduce the overall throughput, that we would
> > choose to disable semaphores.
> 
> Since we acknowledge the problem is the shared nature of the iGPU, my 
> concern is that we still cannot account for both partners here when 
> deciding to omit semaphore emission. In other words we trade bus 
> throughput for submission latency.
> 
> Assuming a light GPU task (in the sense of not oversubscribing, but with 
> ping-pong inter-engine dependencies), simultaneous to a heavier CPU 
> task, our latency improvement still imposes a performance penalty on the 
> latter.

Maybe, maybe not. I think you have to be in the position where there is
no GPU latency to be gained for the increased bus traffic to lose.

> For instance, a consumer-level single-stream transcoding session with a 
> CPU-heavy part of the pipeline, or a CPU-intensive game.
> 
> (Ideally we would need a bus saturation signal to feed into our logic, 
> not just engine saturation. Which I don't think is possible.)
> 
> So I am still leaning towards being cautious and just abandoning 
> semaphores for now.

Being greedy, the single consumer case is compelling. The same
benchmarks see 5-10% throughput improvement for the single client
(depending on machine).
-Chris
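
For readers following the heuristic described in the quoted commit message, here is a minimal sketch of the idea in plain C. It is a toy model, not the patch itself: every name in it (toy_engine, toy_request, toy_plan_semaphore, toy_submit, toy_client_wait) is hypothetical, and the per-engine bitmask layout is an assumption made for illustration. It shows only the three steps the message names: plan a busywait optimistically, notice at submission that the semaphore was already signaled (we were too late), and stop using semaphores until the client next waits on its results.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: these types and names are hypothetical, not the i915 API. */
struct toy_engine {
	uint32_t mask;      /* single bit identifying this engine */
	uint32_t saturated; /* signaling engines for which a busywait came too late */
};

struct toy_request {
	struct toy_engine *engine; /* engine this request will run on */
	uint32_t semaphores;       /* engines this request planned to busywait on */
	bool signaler_done;        /* semaphore already signaled by submission time */
};

/* Would we emit a busywait on `signaler` for this request? */
static bool toy_use_semaphore(const struct toy_request *rq,
			      const struct toy_engine *signaler)
{
	/* Skip if already busywaiting on that engine, or if saturation was seen. */
	return !((rq->semaphores | rq->engine->saturated) & signaler->mask);
}

/* Request construction: optimistically plan a busywait if allowed. */
static void toy_plan_semaphore(struct toy_request *rq,
			       const struct toy_engine *signaler)
{
	assert(signaler->mask != 0);
	if (toy_use_semaphore(rq, signaler))
		rq->semaphores |= signaler->mask;
}

/*
 * Submission: if a busywait was planned but the signal had already
 * arrived, conclude the GPU is oversaturated and remember it.
 */
static void toy_submit(struct toy_request *rq)
{
	if (rq->semaphores && rq->signaler_done)
		rq->engine->saturated |= rq->semaphores;
}

/* The client waits to catch up: optimistically allow semaphores again. */
static void toy_client_wait(struct toy_engine *engine)
{
	engine->saturated = 0;
}
```

In this model, a first request that plans a semaphore but finds it already signaled at submit marks its engine as saturated, so later requests on that engine skip the busywait until toy_client_wait() clears the flag. Note that this tracks engine saturation only, which is exactly the limitation Tvrtko raises: nothing here observes bus saturation.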