[Intel-gfx] [PATCH 2/2] drm/i915/guc: default to using GuC submission where possible

Daniel Vetter daniel at ffwll.ch
Tue Apr 26 14:00:15 UTC 2016


On Mon, Apr 25, 2016 at 09:29:42AM +0100, Chris Wilson wrote:
> On Mon, Apr 25, 2016 at 08:31:07AM +0100, Dave Gordon wrote:
> > On 22/04/16 19:51, Chris Wilson wrote:
> > >On Fri, Apr 22, 2016 at 07:45:15PM +0100, Chris Wilson wrote:
> > >>On Fri, Apr 22, 2016 at 07:22:55PM +0100, Dave Gordon wrote:
> > >>>This patch simply changes the default value of "enable_guc_submission"
> > >>>from 0 (never) to -1 (auto). This means that GuC submission will be
> > >>>used if the platform has a GuC, the GuC supports the request submission
> > >>>protocol, and any required GuC firmware was successfully loaded. If any
> > >>>of these conditions are not met, the driver will fall back to using
> > >>>execlist mode.
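
For reference, the "auto" fallback described above amounts to something
like the sketch below (the helper and the individual checks are
illustrative names, not the actual i915 ones):

    /* Minimal sketch of the -1 (auto) logic; names are illustrative. */
    static bool use_guc_submission(struct drm_i915_private *dev_priv)
    {
            if (i915.enable_guc_submission >= 0)
                    return i915.enable_guc_submission; /* explicit 0/1 */

            /* auto: all three preconditions must hold */
            return HAS_GUC(dev_priv) &&                 /* platform has a GuC */
                   guc_supports_submission(dev_priv) && /* protocol supported */
                   guc_fw_load_succeeded(dev_priv);     /* firmware loaded */
    }

If any of the checks fail, the driver falls back to execlist mode.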
> > >
> > >I just remembered something else.
> > >
> > >  * Work Items:
> > >  * There are several types of work items that the host may place into a
> > >  * workqueue, each with its own requirements and limitations. Currently only
> > >  * WQ_TYPE_INORDER is needed to support legacy submission via GuC, which
> > >  * represents an in-order queue. The kernel driver packs the ring tail pointer
> > >  * and an ELSP context descriptor dword into each Work Item.
> > >
> > >Is this right? You only allocate a single client covering all engines and
> > >specify INORDER. We expect parallel execution between engines; is this
> > >supported? Empirically, it seems the GuC only executes commands in
> > >series across engines, not in parallel.
> > >-Chris
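
To make the packing described in that comment concrete, a work item
would look roughly like this (struct layout and macro names are
assumptions inferred from the doc comment, not quoted from the driver):

    /* Sketch of a WQ_TYPE_INORDER work item; illustrative names only. */
    struct guc_wq_item {
            u32 header;       /* type, length, target engine */
            u32 context_desc; /* ELSP context descriptor dword */
            u32 ring_tail;    /* new ring tail for that context */
            u32 fence_id;     /* unused for in-order submission */
    };

    static void wq_item_fill(struct guc_wq_item *wqi, u32 engine_id,
                             u32 ctx_desc, u32 tail)
    {
            wqi->header = WQ_TYPE_INORDER |
                          (WQ_ITEM_LEN << WQ_LEN_SHIFT) |
                          (engine_id << WQ_TARGET_SHIFT);
            wqi->context_desc = ctx_desc;
            wqi->ring_tail = tail;
            wqi->fence_id = 0;
    }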
> > 
> > AFAIK, INORDER represents in-order execution of the elements in the
> > GuC's (internal) submission queue, which is per-engine; i.e. this
> > option bypasses the GuC's internal scheduling algorithms and makes
> > the GuC behave as a simple dispatcher. It demultiplexes work-queue
> > items into the per-engine submission queues, then executes them in
> > order from there.
> > 
> > Alex can probably confirm this in the GuC code, but I really think
> > we'd have noticed if execution were serialised across engines. For a
> > start, the validation tests that have one engine busy-spin while
> > waiting for a batch on a different engine to update a buffer
> > wouldn't ever finish.
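
In other words, the model being described is roughly the following
(conceptual pseudocode for the firmware side, not actual GuC source):

    /* Conceptual model only: the client's shared work queue is
     * demultiplexed into per-engine queues, and ordering is enforced
     * within each engine's queue, not across engines. */
    while (!wq_empty(client->wq)) {
            struct guc_wq_item *wqi = wq_pop(client->wq);

            queue_push(&submit_q[wq_target(wqi->header)], wqi);
    }

    for_each_engine(engine) {
            if (!elsp_busy(engine) && !queue_empty(&submit_q[engine]))
                    elsp_write(engine, queue_pop(&submit_q[engine]));
    }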
> 
> That doesn't seem to be the issue; we can run in parallel (a busy-spin
> on one engine doesn't prevent a write on the second). It's just the
> latency. Overall, execution latency goes up substantially with the GuC,
> and in this case the second execbuf on the second ring does not appear
> to execute until after the first completes.

That sounds like a sizeable bug in the GuC code, and it defeats the
point of all the work going on right now to speed up execlist
submission.

Can we have a non-slow GuC somehow? Do we need to escalate this to the
firmware folks and first make sure they release a firmware that doesn't
like to twiddle its thumbs (assuming it is indeed a GuC issue and not a
problem in how we submit things)?

Afaiui the point of the GuC was to reduce submission latency by again
having a real queue to submit to, instead of the 1.5 submit ports we get
with execlists. There are other reasons on top, but if the firmware
engineers butchered that, it doesn't look good.
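
Schematically, the two submission paths being compared look like this
(all names here are made up for illustration):

    /* execlists: at most two contexts in flight per engine; the CPU
     * must take a context-switch interrupt to refill a free port. */
    if (!elsp_port_free(engine))
            wait_for_context_switch_irq(engine); /* CPU in the loop */
    elsp_write(engine, ctx_desc);

    /* GuC: append to a deep work queue and ring a doorbell; the
     * firmware keeps the hardware fed without further CPU work. */
    wq_append(client, wqi);
    doorbell_ring(client);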
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

