[Intel-gfx] [CI-ping 15/15] drm/i915: Late request cancellations are harmful

Tue Apr 19 12:35:32 UTC 2016

On 13/04/16 15:21, John Harrison wrote:
> On 13/04/2016 10:57, Daniel Vetter wrote:
>> On Tue, Apr 12, 2016 at 09:03:09PM +0100, Chris Wilson wrote:
>>> Conceptually, each request is a record of a hardware transaction - we
>>> build up a list of pending commands and then either commit them to
>>> hardware, or cancel them. However, whilst building up the list of
>>> pending commands, we may modify state outside of the request and make
>>> references to the pending request. If we do so and then cancel that
>>> request, external objects then point to the deleted request leading to
>>> both graphical and memory corruption.
>>>
>>> The easiest example is to consider object/VMA tracking. When we mark an
>>> object as active in a request, we store a pointer to this, the most
>>> recent request, in the object. Then we want to free that object, we wait
>>> for the most recent request to be idle before proceeding (otherwise the
>>> hardware will write to pages now owned by the system, or we will attempt
>>> to read from those pages before the hardware is finished writing). If
>>> the request was cancelled instead, that wait completes immediately. As a
>>> result, all requests must be committed and not cancelled if the external
>>> state is unknown.
>>>
>>> All that remains of i915_gem_request_cancel() users are just a couple of
>>> extremely unlikely allocation failures, so remove the API entirely.
>>>
>>> A consequence of committing all incomplete requests is that we generate
>>> excess breadcrumbs and fill the ring much more often with dummy work. We
>>> have completely undone the outstanding_last_seqno optimisation.
>>>
>>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=93907
>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>> Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin at linux.intel.com>
>>> Cc: stable at vger.kernel.org
>> Cc: John Harrison <John.C.Harrison at Intel.com>
>>
>> I'd like John's ack on this on too, but patch itself looks sound. Fast
>> r-b
>> since we've discussed this a while ago already ...
>
> I think this is going to cause a problem with the scheduler. You are
> effectively saying that the execbuf call cannot fail beyond the point of
> allocating a request. If it gets that far then it must go all the way
> and submit the request to the hardware. With a scheduler, that means
> adding it to the scheduler's queues and tracking it all the way through
> the system to completion. If nothing else, that sounds like a lot of
> extra overhead for no actual work. Or worse if the failure is at a point
> where the request cannot be sent further through the system because it
> was something critical that failed then you are really stuffed.
>
> I'm not sure what the other option would be though, short of being able
> to undo the last read/write object tracking updates.

With the chained-ownership code we have in the scheduler, it becomes 
perfectly possible to undo the last-read/write tracking changes.

I'd much rather see any failure during submission rewound and undone, so 
we can just return -EAGAIN at any point and let someone retry if required.

This just looks like a hack to work around not having a properly 
transactional model of request submission :(

.Dave.