[Intel-gfx] [PATCH 55/56] drm/i915: Remove 'faked' request from LRC submission
Jesse Barnes
jbarnes at virtuousgeek.org
Wed Mar 11 09:44:24 PDT 2015
On 03/11/2015 09:14 AM, Daniel Vetter wrote:
> On Wed, Mar 11, 2015 at 02:53:39PM +0000, John Harrison wrote:
>> On 05/03/2015 14:49, Daniel Vetter wrote:
>>> On Thu, Mar 05, 2015 at 01:57:31PM +0000, John.C.Harrison at Intel.com wrote:
>>>> From: John Harrison <John.C.Harrison at Intel.com>
>>>>
>>>> The LRC submission code requires a request for tracking purposes. It does not
>>>> actually require that request to 'complete' it simply uses it for keeping hold
>>>> of reference counts on contexts and such like.
>>>>
>>>> In the case where the ring buffer is completely full, the LRC code looks for a
>>>> pending request that would free up sufficient space upon completion and waits
>>>> for it. If no such request can be found it resorts to simply polling the free
>>>> space count until it is big enough. This situation should only occur when the
>>>> entire buffer is filled with the request currently being generated. I.e., the
>>>> user is trying to submit a single piece of work that is large than the ring
>>>> buffer itself (which should be impossible because very large batch buffers don't
>>>> consume any more ring buffer space). Before starting to poll, a submit call is
>>>> made to make sure that the currently queued up work in the buffer will actually
>>>> be submtted and thus the poll will eventually succeed.
>>>>
>>>> The problem here is that the 'official' request cannot be used as that could
>>>> lead to multiple LRC submissions being tagged to a single request structure.
>>>> Instead, the code fakes up a private request structure and uses that.
>>>>
>>>> This patch moves the faked request allocation higher up in the call stack to the
>>>> wait code itself (rather than being at the very lowest submission level). Thus
>>>> it is now obvious where the faked request is coming from and why it is
>>>> necessary. The patch also replaces it with a call to the official request
>>>> allocation code rather than attempting to duplicate that code. This becomes
>>>> especially important in the future when the request allocation changes to
>>>> accommodate a conversion to struct fence.
>>>>
>>>> For: VIZ-5115
>>>> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
>>> This is only possible if you pile up tons of olr. Since your patch series
>>> fixes this design issue by removing olr I think we can just put a WARN_ON
>>> in here if this ever happens and bail out with -ELOSTMYMARBLES or
>>> something. And then rip out all this complexity.
>>>
>>> Or do I miss something important?
>>> -Daniel
>>
>> Yeah, you missed the extremely important bug in the free space calculation
>> that meant this impossible code path was being hit on a regular basis. The
>> LRC wait_request code differed from the legacy wait_request code in the the
>> latter was updated with request->postfix changes and the former was not.
>> Thus the LRC one would happily find a request that frees up enough space,
>> wait on it, retire it and then find there was still not enough space!
>>
>> New patches to fix the space calculation bug and to completely remove the
>> polling path will be forth coming...
>
> Hm, Jesse did dig out a regression where gem_ringfill blows up on
> execlist. That igt is specifically to exercise this cornercases. I'm not
> sure whith bugzilla Jesse meant, there's two:
>
> https://bugs.freedesktop.org/show_bug.cgi?id=89001
> https://bugs.freedesktop.org/show_bug.cgi?id=88865
>
> Can you please check whether your fixes help for those issues two?
>
> Also adding Jesse since he's chasing these atm.
Ah cool, sounds related at least. 89001 was the one I looked at
yesterday. The worrying thing was that this bug caused a hard system
hang. Not even the reset button worked... I guess we should have an
HSD for that too so we can root cause the system part of it.
Jesse
More information about the Intel-gfx
mailing list