[Intel-gfx] [PATCH 15/15] drm/i915/guc: Split hw submission for replay after GPU reset

Chris Wilson chris at chris-wilson.co.uk
Mon Nov 28 14:19:32 UTC 2016


On Mon, Nov 28, 2016 at 02:02:33PM +0000, Tvrtko Ursulin wrote:
> 
> On 25/11/2016 09:30, Chris Wilson wrote:
> >Something I missed before sending off the partial series was that the
> >non-scheduler GuC reset path was broken (in the full series, this is
> >pushed to the execlists reset handler). The issue is that after a reset,
> >we have to refill the GuC workqueues, which we do by resubmitting the
> >requests. However, if we have already submitted them, the fences within
> >them have already been used and triggering them again is an error.
> >Instead, just repopulate the GuC workqueue.
> >
> >[  115.858560] [IGT] gem_busy: starting subtest hang-render
> >[  135.839867] [drm] GPU HANG: ecode 9:0:0xe757fefe, in gem_busy [1716], reason: Hang on render ring, action: reset
> >[  135.839902] drm/i915: Resetting chip after gpu hang
> >[  135.839957] [drm] RC6 on
> >[  135.858351] ------------[ cut here ]------------
> >[  135.858357] WARNING: CPU: 2 PID: 45 at drivers/gpu/drm/i915/i915_sw_fence.c:108 i915_sw_fence_complete+0x25/0x30
> >[  135.858357] Modules linked in: rfcomm bnep binfmt_misc nls_iso8859_1 input_leds snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core btusb btrtl snd_hwdep snd_pcm 8250_dw snd_seq_midi hid_lenovo snd_seq_midi_event snd_rawmidi iwlwifi x86_pkg_temp_thermal coretemp snd_seq crct10dif_pclmul snd_seq_device hci_uart snd_timer crc32_pclmul ghash_clmulni_intel idma64 aesni_intel virt_dma btbcm snd btqca aes_x86_64 btintel lrw cfg80211 bluetooth gf128mul glue_helper ablk_helper cryptd soundcore intel_lpss_pci intel_pch_thermal intel_lpss_acpi intel_lpss acpi_als mfd_core kfifo_buf acpi_pad industrialio autofs4 hid_plantronics usbhid dm_mirror dm_region_hash dm_log sdhci_pci ahci sdhci libahci i2c_hid hid
> >[  135.858389] CPU: 2 PID: 45 Comm: kworker/2:1 Tainted: G        W       4.9.0-rc4+ #238
> >[  135.858389] Hardware name:                  /NUC6i3SYB, BIOS SYSKLi35.86A.0024.2015.1027.2142 10/27/2015
> >[  135.858392] Workqueue: events_long i915_hangcheck_elapsed
> >[  135.858394]  ffffc900001bf9b8 ffffffff812bb238 0000000000000000 0000000000000000
> >[  135.858396]  ffffc900001bf9f8 ffffffff8104f621 0000006c00000000 ffff8808296137f8
> >[  135.858398]  0000000000000a00 ffff8808457a0000 ffff880845764e60 ffff880845760000
> >[  135.858399] Call Trace:
> >[  135.858403]  [<ffffffff812bb238>] dump_stack+0x4d/0x65
> >[  135.858405]  [<ffffffff8104f621>] __warn+0xc1/0xe0
> >[  135.858406]  [<ffffffff8104f748>] warn_slowpath_null+0x18/0x20
> >[  135.858408]  [<ffffffff813f8c15>] i915_sw_fence_complete+0x25/0x30
> >[  135.858410]  [<ffffffff813f8fad>] i915_sw_fence_commit+0xd/0x30
> >[  135.858412]  [<ffffffff8142e591>] __i915_gem_request_submit+0xe1/0xf0
> >[  135.858413]  [<ffffffff8142e5c8>] i915_gem_request_submit+0x28/0x40
> >[  135.858415]  [<ffffffff814433e7>] i915_guc_submit+0x47/0x210
> >[  135.858417]  [<ffffffff81443e98>] i915_guc_submission_enable+0x468/0x540
> >[  135.858419]  [<ffffffff81442495>] intel_guc_setup+0x715/0x810
> >[  135.858421]  [<ffffffff8142b6b4>] i915_gem_init_hw+0x114/0x2a0
> >[  135.858423]  [<ffffffff813eeaa8>] i915_reset+0xe8/0x120
> >[  135.858424]  [<ffffffff813f3937>] i915_reset_and_wakeup+0x157/0x180
> >[  135.858426]  [<ffffffff813f79db>] i915_handle_error+0x1ab/0x230
> >[  135.858428]  [<ffffffff812c760d>] ? scnprintf+0x4d/0x90
> >[  135.858430]  [<ffffffff81435985>] i915_hangcheck_elapsed+0x275/0x3d0
> >[  135.858432]  [<ffffffff810668cf>] process_one_work+0x12f/0x410
> >[  135.858433]  [<ffffffff81066bf3>] worker_thread+0x43/0x4d0
> >[  135.858435]  [<ffffffff81066bb0>] ? process_one_work+0x410/0x410
> >[  135.858436]  [<ffffffff81066bb0>] ? process_one_work+0x410/0x410
> >[  135.858438]  [<ffffffff8106bbb4>] kthread+0xd4/0xf0
> >[  135.858440]  [<ffffffff8106bae0>] ? kthread_park+0x60/0x60
> >
> >v2: Only resubmit submitted requests
> >v3: Don't forget the pending requests have reserved space.
> >
> >Fixes: d55ac5bf97c6 ("drm/i915: Defer transfer onto execution timeline to actual hw submission")
> >Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> >Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> >---
> > drivers/gpu/drm/i915/i915_guc_submission.c | 37 ++++++++++++++++--------------
> > 1 file changed, 20 insertions(+), 17 deletions(-)
> >
> >diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
> >index 00b5fa871644..e12ff881d38d 100644
> >--- a/drivers/gpu/drm/i915/i915_guc_submission.c
> >+++ b/drivers/gpu/drm/i915/i915_guc_submission.c
> >@@ -602,12 +602,9 @@ static int guc_ring_doorbell(struct i915_guc_client *gc)
> > }
> >
> > /**
> >- * i915_guc_submit() - Submit commands through GuC
> >+ * __i915_guc_submit() - Submit commands through GuC
> >  * @rq:		request associated with the commands
> >  *
> >- * Return:	0 on success, otherwise an errno.
> >- * 		(Note: nonzero really shouldn't happen!)
> >- *
> >  * The caller must have already called i915_guc_wq_reserve() above with
> >  * a result of 0 (success), guaranteeing that there is space in the work
> >  * queue for the new request, so enqueuing the item cannot fail.
> >@@ -619,7 +616,7 @@ static int guc_ring_doorbell(struct i915_guc_client *gc)
> >  * The only error here arises if the doorbell hardware isn't functioning
> >  * as expected, which really shouldn't happen.
> >  */
> >-static void i915_guc_submit(struct drm_i915_gem_request *rq)
> >+static void __i915_guc_submit(struct drm_i915_gem_request *rq)
> > {
> > 	struct drm_i915_private *dev_priv = rq->i915;
> > 	struct intel_engine_cs *engine = rq->engine;
> >@@ -628,17 +625,6 @@ static void i915_guc_submit(struct drm_i915_gem_request *rq)
> > 	struct i915_guc_client *client = guc->execbuf_client;
> > 	int b_ret;
> >
> >-	/* We keep the previous context alive until we retire the following
> >-	 * request. This ensures that any the context object is still pinned
> >-	 * for any residual writes the HW makes into it on the context switch
> >-	 * into the next object following the breadcrumb. Otherwise, we may
> >-	 * retire the context too early.
> >-	 */
> >-	rq->previous_context = engine->last_context;
> >-	engine->last_context = rq->ctx;
> >-
> >-	i915_gem_request_submit(rq);
> >-
> > 	spin_lock(&client->wq_lock);
> > 	guc_wq_item_append(client, rq);
> >
> >@@ -658,6 +644,23 @@ static void i915_guc_submit(struct drm_i915_gem_request *rq)
> > 	spin_unlock(&client->wq_lock);
> > }
> >
> >+static void i915_guc_submit(struct drm_i915_gem_request *rq)
> >+{
> >+	struct intel_engine_cs *engine = rq->engine;
> >+
> >+	/* We keep the previous context alive until we retire the following
> >+	 * request. This ensures that the context object is still pinned
> >+	 * for any residual writes the HW makes into it on the context switch
> >+	 * into the next object following the breadcrumb. Otherwise, we may
> >+	 * retire the context too early.
> >+	 */
> >+	rq->previous_context = engine->last_context;
> >+	engine->last_context = rq->ctx;
> >+
> >+	i915_gem_request_submit(rq);
> >+	__i915_guc_submit(rq);
> >+}
> >+
> > /*
> >  * Everything below here is concerned with setup & teardown, and is
> >  * therefore not part of the somewhat time-critical batch-submission
> >@@ -1556,7 +1559,7 @@ int i915_guc_submission_enable(struct drm_i915_private *dev_priv)
> > 		/* Replay the current set of previously submitted requests */
> > 		list_for_each_entry(rq, &engine->timeline->requests, link) {
> > 			client->wq_rsvd += sizeof(struct guc_wq_item);
> >-			i915_guc_submit(rq);
> >+			__i915_guc_submit(rq);
> > 		}
> > 	}
> >
> >
> 
> The only thing bothering me here is how this interacts with the
> common, or "execlist", part of the reset.

Also applies to ringbuffer (some parts at least).
 
> Copying Mika in hope he can help here. :)
> 
> I see that i915_gem_reset_engine runs before this GuC wq refill and
> does some magic with the requests on the engine->timeline->requests
> lists.
> 
> Would it be possible to somehow better integrate the two things?

I thought they were nicely separated. The common work to disable the
guilty context is handled by the core reset function. The specific work
required to restart and replay the requests is handled by the different
backends.
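
To make that concrete, here is a rough sketch of the replay half as it
stands after this patch. guc_replay_requests() is just an illustrative
name; in the patch the loop lives inline in i915_guc_submission_enable(),
which i915_gem_init_hw() calls once the core reset work has run:

    static void guc_replay_requests(struct drm_i915_private *dev_priv)
    {
        struct i915_guc_client *client = dev_priv->guc.execbuf_client;
        struct intel_engine_cs *engine;
        struct drm_i915_gem_request *rq;
        enum intel_engine_id id;

        for_each_engine(engine, dev_priv, id) {
            list_for_each_entry(rq, &engine->timeline->requests, link) {
                /* These requests were submitted before the hang,
                 * so their submit fences have already fired.
                 * Re-reserve the wq space they consumed and refill
                 * the GuC workqueue directly, rather than going
                 * through i915_guc_submit() and triggering the
                 * fences a second time.
                 */
                client->wq_rsvd += sizeof(struct guc_wq_item);
                __i915_guc_submit(rq);
            }
        }
    }

i915_guc_submit() stays installed as the engine's submit_request hook
for the normal path; it does the context bookkeeping and fence
signaling and only then calls __i915_guc_submit().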

i915_gem_reset_engine: disables the guilty context
engine->reset: updates the context / ring for restarting past a hung
batch (if required)
engine->init: starts the engine, processing the pending requests, if
any (see the sketch below)
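
A hedged sketch of how I expect those to compose (i915_reset_sketch()
is a made-up name, engine->reset is the proposed hook rather than
something that exists at this point in the series, and engine->init
above corresponds to today's engine->init_hw; the real code is spread
across i915_reset() and i915_gem_init_hw()):

    static void i915_reset_sketch(struct drm_i915_private *dev_priv)
    {
        struct intel_engine_cs *engine;
        enum intel_engine_id id;

        /* Core: find the guilty request and disable its context */
        i915_gem_reset(dev_priv);

        for_each_engine(engine, dev_priv, id) {
            /* Backend: fix up context/ring state so that we
             * restart past the hung batch (if required)
             */
            if (engine->reset)
                engine->reset(engine);

            /* Backend: restart the engine; for the GuC this is
             * where the workqueue gets repopulated so that the
             * pending requests are processed
             */
            engine->init_hw(engine);
        }
    }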

engine->reset is a bad name, since we have per-engine resets as well,
but I haven't a better verb yet for handling the request bumping after a
reset.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

