[Intel-gfx] [PATCH 2/2] drm/i915/guc: default to using GuC submission where possible

Wed Apr 27 17:53:25 UTC 2016

On 26/04/16 15:00, Daniel Vetter wrote:
> On Mon, Apr 25, 2016 at 09:29:42AM +0100, Chris Wilson wrote:
>> On Mon, Apr 25, 2016 at 08:31:07AM +0100, Dave Gordon wrote:
>>> On 22/04/16 19:51, Chris Wilson wrote:
>>>> On Fri, Apr 22, 2016 at 07:45:15PM +0100, Chris Wilson wrote:
>>>>> On Fri, Apr 22, 2016 at 07:22:55PM +0100, Dave Gordon wrote:
>>>>>> This patch simply changes the default value of "enable_guc_submission"
>>>>>> from 0 (never) to -1 (auto). This means that GuC submission will be
>>>>>> used if the platform has a GuC, the GuC supports the request submission
>>>>>> protocol, and any required GuC firmware was successfully loaded. If any
>>>>>> of these conditions are not met, the driver will fall back to using
>>>>>> execlist mode.
>>>>
>>>> I just remembered something else.
>>>>
>>>>   * Work Items:
>>>>   * There are several types of work items that the host may place into a
>>>>   * workqueue, each with its own requirements and limitations. Currently only
>>>>   * WQ_TYPE_INORDER is needed to support legacy submission via GuC, which
>>>>   * represents in-order queue. The kernel driver packs ring tail pointer and an
>>>>   * ELSP context descriptor dword into Work Item.
>>>>
>>>> Is this right? You only allocate a single client covering all engines and
>>>> specify INORDER. We expect parallel execution between engines, is this
>>>> supported? Empirically it seems like guc is only executing commands in
>>>> series across engines and not in parallel.
>>>> -Chris
>>>
>>> AFAIK, INORDER represents in-order executions of elements in the
>>> GuC's (internal) submission queue, which is per-engine; i.e. this
>>> option bypasses the GuC's internal scheduling algorithms and makes
>>> the GuC behave as a simple dispatcher. It demultiplexes work queue
>>> items into the multiple submission queues, then executes them in
>>> order from there.
>>>
>>> Alex can probably confirm this in the GuC code, but I really think
>>> we'd have noticed if execution were serialised across engines. For a
>>> start, the validation tests that have one engine busy-spin while
>>> waiting for a batch on a different engine to update a buffer
>>> wouldn't ever finish.
>>
>> That doesn't seem to be the issue; we can run in parallel, it seems
>> (busy-spin on one engine doesn't prevent a write on the second). It's
>> just the latency it seems. Overall the execution latency goes up
>> substantially with guc, and in this case it does not seem to be executing
>> the second execbuf on the second ring until after the first completes.
>
> That sounds like a decent bug in guc code, and defeats the point of all
> the work to speed up execlist submission going on right now.
>
> Can we have non-slow guc somehow? Do we need to escalate this to the
> firmware folks and first make sure they have a firmware released which
> doesn't like to twiddle thumbs (assuming it's a guc issue indeed and not
> in how we submit things)?

According to the numbers I was getting yesterday, GuC submission is now 
slightly faster than execlists on the render engine (because execlists 
is slower on that engine), but still a bit slower on the others. See

http://www.spinics.net/lists/intel-gfx/msg94140.html

> Afaiui the point of guc was to reduce submission latency by again having a
> queue to submit to, instead of the 1.5 submit ports with execlist. There's
> other reasons on top, but if firmware engineers butchered that it doesn't
> look good.
> -Daniel

I don't think it was ever about latency. I think the GuC was added to 
reduce the overhead of fielding context-switch interrupts on the CPU.

.Dave.