[Intel-gfx] [PATCH 07/32] drm/i915: Store the reset counter when constructing a request

Dave Gordon david.s.gordon at intel.com
Mon Jan 4 07:58:21 PST 2016


On 16/12/15 10:19, Chris Wilson wrote:
> On Wed, Dec 16, 2015 at 10:44:19AM +0100, Daniel Vetter wrote:
>> On Fri, Dec 11, 2015 at 11:33:03AM +0000, Chris Wilson wrote:
>>> As the request is only valid during the same global reset epoch, we can
>>> record the current reset_counter when constructing the request and reuse
>>> it when waiting upon that request in future. This removes a very hairy
>>> atomic check serialised by the struct_mutex at the time of waiting and
>>> allows us to transfer those waits to a central dispatcher for all
>>> waiters and all requests.
>>>
>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>> Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
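
[For readers following the thread: the hunks quoted below only show the wait
side of the change. The construction side is just a matter of sampling the
global counter once when the request is created, roughly along these lines --
the function body and error handling here are an assumption for illustration,
not the quoted patch text:

	/* Construction-side sketch, not part of the quoted hunks: sample the
	 * global reset epoch once when the request is built, so waiters can
	 * later compare req->reset_counter against the live counter without
	 * re-deriving it under struct_mutex. */
	int i915_gem_request_alloc(struct intel_engine_cs *ring,
				   struct intel_context *ctx,
				   struct drm_i915_gem_request **req_out)
	{
		struct drm_i915_private *dev_priv = to_i915(ring->dev);
		struct drm_i915_gem_request *req;

		req = kmem_cache_zalloc(dev_priv->requests, GFP_KERNEL);
		if (req == NULL)
			return -ENOMEM;

		/* The request is only valid within the reset epoch in which
		 * it was constructed; record that epoch now. */
		req->reset_counter = i915_reset_counter(&dev_priv->gpu_error);

		req->i915 = dev_priv;
		req->ring = ring;

		*req_out = req;
		return 0;
	}
]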
>>
>> "global reset epoch" got me thinking about what the implications for TDR
>> are. Now the current TDR patches iirc are still somewhat tacked on the
>> side of the existing reset handling, and so don't end up incrementing the
>> reset counter.
>>
>> But I don't agree with that design decision, and I think any reset must
>> increment the counter (and do the reset-in-progress flag dance), since
>> that's the only way to guarantee that waiters will get off locks. Even
>> though we have fewer of those with every kernel release.
>>
>> But then there's the problem that at least for some requests they'll still
>> be valid after the reset, and would break with this design here. Which can
>> easily be fixed by incrementing the reset epoch for every request we
>> decide should keep running (or which gets re-scheduled/emitted for another
>> attempt), and doing that explicitly seems like a good idea.
>>
>> The only implication I see is that for per-ring reset we might want to go
>> to a per-engine reset epoch counter.
>
> Yes, I think per-engine epochs will be required. (I've been trying to
> choose my words carefully so that it is clear that it only applies to
> today and not in a TDR future!)
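
[To make the per-engine idea concrete, a TDR-flavoured future might look
something like the sketch below. Every name in it -- the per-engine counter,
the hung-request marker -- is hypothetical; nothing like it exists in this
patch:

	/* Hypothetical per-engine epoch handling: after an engine-only reset,
	 * bump that engine's epoch and re-stamp every request that is allowed
	 * to keep running, so their waiters do not bail out with -EAGAIN. */
	static void engine_adopt_surviving_requests(struct intel_engine_cs *ring)
	{
		struct drm_i915_gem_request *req;

		ring->reset_counter++;		/* hypothetical per-engine field */

		list_for_each_entry(req, &ring->request_list, list) {
			if (req == ring->hung_request)	/* hypothetical marker */
				continue;
			req->reset_counter = ring->reset_counter;
		}
	}

The wait path would then compare req->reset_counter against its engine's
counter rather than the global one.]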
>
>> So given all that I think this is a solid idea. But one comment below as a
>> fallout.
>>
>>> ---
>>>   drivers/gpu/drm/i915/i915_drv.h         |  2 +-
>>>   drivers/gpu/drm/i915/i915_gem.c         | 40 +++++++++++----------------------
>>>   drivers/gpu/drm/i915/intel_display.c    |  7 +-----
>>>   drivers/gpu/drm/i915/intel_lrc.c        |  7 ------
>>>   drivers/gpu/drm/i915/intel_ringbuffer.c |  6 -----
>>>   5 files changed, 15 insertions(+), 47 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
>>> index 1043ddd670a5..f30c305a6889 100644
>>> --- a/drivers/gpu/drm/i915/i915_drv.h
>>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>>> @@ -2178,6 +2178,7 @@ struct drm_i915_gem_request {
>>>   	/** On Which ring this request was generated */
>>>   	struct drm_i915_private *i915;
>>>   	struct intel_engine_cs *ring;
>>> +	unsigned reset_counter;
>>>
>>>   	 /** GEM sequence number associated with the previous request,
>>>   	  * when the HWS breadcrumb is equal to this the GPU is processing
>>> @@ -3059,7 +3060,6 @@ void __i915_add_request(struct drm_i915_gem_request *req,
>>>   #define i915_add_request_no_flush(req) \
>>>   	__i915_add_request(req, NULL, false)
>>>   int __i915_wait_request(struct drm_i915_gem_request *req,
>>> -			unsigned reset_counter,
>>>   			bool interruptible,
>>>   			s64 *timeout,
>>>   			struct intel_rps_client *rps);
>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>> index 27e617b76418..b17cc0e42a4f 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>> @@ -1214,7 +1214,6 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
>>>   /**
>>>    * __i915_wait_request - wait until execution of request has finished
>>>    * @req: duh!
>>> - * @reset_counter: reset sequence associated with the given request
>>>    * @interruptible: do an interruptible wait (normally yes)
>>>    * @timeout: in - how long to wait (NULL forever); out - how much time remaining
>>>    *
>>> @@ -1229,7 +1228,6 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
>>>    * errno with remaining time filled in timeout argument.
>>>    */
>>>   int __i915_wait_request(struct drm_i915_gem_request *req,
>>> -			unsigned reset_counter,
>>>   			bool interruptible,
>>>   			s64 *timeout,
>>>   			struct intel_rps_client *rps)
>>> @@ -1288,7 +1286,7 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
>>>
>>>   		/* We need to check whether any gpu reset happened in between
>>>   		 * the caller grabbing the seqno and now ... */
>>> -		if (reset_counter != i915_reset_counter(&dev_priv->gpu_error)) {
>>> +		if (req->reset_counter != i915_reset_counter(&dev_priv->gpu_error)) {
>>
>> READ_ONCE(req->reset) as a future proofing maybe for TDR? Or punt that to
>> TDR? Could be confusing to READ_ONCE something that's (with the current
>> code at least) invariant.
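
[For what it's worth, the future-proofed check Daniel is suggesting would only
differ in the load; a sketch, assuming a TDR world in which req->reset_counter
can be re-stamped while a waiter sleeps:

		/* Fetch the epoch once per check so the compiler cannot cache
		 * the load across the wait loop; with today's code the field
		 * is written once at construction, so this is equivalent to
		 * the plain read in the hunk above. */
		if (READ_ONCE(req->reset_counter) !=
		    i915_reset_counter(&dev_priv->gpu_error)) {
			ret = -EAGAIN;
			break;
		}
]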
>
> Right, my plan was to solve this in TDR! At the moment, I am thinking
> that we regenerate a new request for each one resubmitted (that requires
> patching the seqno embedded into the ring before restarting the ring),
> but otherwise we can just duplicate the contents of the request.
> -Chris

With the scheduler, there's no need to patch the contents of the 
ringbuffer; we just reset it (so it's empty) and then have the scheduler 
re-emit the requests, starting after the one that hung.
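
Roughly like this -- every function name below is a placeholder for whatever
the scheduler ends up providing, not existing code:

	/* Sketch of the recovery path described above: empty the ring rather
	 * than patch seqnos in place, then resubmit everything that was
	 * queued after the hung request. */
	static void scheduler_recover_engine(struct intel_engine_cs *ring,
					     struct drm_i915_gem_request *hung)
	{
		struct drm_i915_gem_request *req;

		intel_ring_reset(ring);		/* placeholder: discard ring contents */

		list_for_each_entry(req, &ring->request_list, list) {
			if (i915_seqno_passed(hung->seqno, req->seqno))
				continue;	/* completed or hung: leave it alone */
			scheduler_reemit_request(req);	/* placeholder resubmission hook */
		}
	}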

.Dave.

