[Intel-gfx] [PATCH 1/4] drm/i915: handle teardown of HWSP when context is freed

Mon Jan 25 10:08:08 PST 2016

On Fri, Jan 22, 2016 at 02:51:50PM +0000, Dave Gordon wrote:
> On 22/01/16 10:28, Mika Kuoppala wrote:
> >Dave Gordon <david.s.gordon at intel.com> writes:
> >
> >>Existing code did a kunmap() on the wrong page, and didn't account for the
> >>fact that the HWSP and the default context are different offsets within the
> >>same GEM object.
> >>
> >
> >Just to note here that usually we try to split bug fixes early in the
> >series with minimal code changes. This helps cherrypickups
> >to fixes/stable. And also lessens the probability that we accidentally
> >revert bugfix when some other problem occurs with the patch.
> 
> OK, let's drop this one from the series (they're orthogonal code changes,
> only related by the fact they're all to do with init/fini), and I'll
> resubmit it separately.
> 
> Please continue this series from 2/4 ...
> 
> >>This patch improves symmetry by creating lrc_teardown_hardware_status_page()
> >>to complement lrc_setup_hardware_status_page(), making sure that they both
> >>take the same argument (pointer to engine). They encapsulate all the details
> >>of the relationship between the default context and the status page, so the
> >>rest of the LRC code doesn't have to know about it.
> >>
> >>Signed-off-by: Dave Gordon <david.s.gordon at intel.com>
> >>---
> >>  drivers/gpu/drm/i915/intel_lrc.c | 57 ++++++++++++++++++++++++++++------------
> >>  1 file changed, 40 insertions(+), 17 deletions(-)
> >>
> >>diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> >>index 73d4347..3914120 100644
> >>--- a/drivers/gpu/drm/i915/intel_lrc.c
> >>+++ b/drivers/gpu/drm/i915/intel_lrc.c
> >>@@ -226,9 +226,8 @@ enum {
> >>  #define CTX_RCS_INDIRECT_CTX_OFFSET_DEFAULT  0x17
> >>
> >>  static int intel_lr_context_pin(struct drm_i915_gem_request *rq);
> >>-static void lrc_setup_hardware_status_page(struct intel_engine_cs *ring,
> >>-		struct drm_i915_gem_object *default_ctx_obj);
> >>-
> >>+static void lrc_setup_hardware_status_page(struct intel_engine_cs *ring);
> >>+static void lrc_teardown_hardware_status_page(struct intel_engine_cs *ring);
> >>
> >>  /**
> >>   * intel_sanitize_enable_execlists() - sanitize i915.enable_execlists
> >>@@ -1536,8 +1535,7 @@ static int gen8_init_common_ring(struct intel_engine_cs *ring)
> >>  	struct drm_i915_private *dev_priv = dev->dev_private;
> >>  	u8 next_context_status_buffer_hw;
> >>
> >>-	lrc_setup_hardware_status_page(ring,
> >>-				dev_priv->kernel_context->engine[ring->id].state);
> >>+	lrc_setup_hardware_status_page(ring);
> >>
> >>  	I915_WRITE_IMR(ring, ~(ring->irq_enable_mask | ring->irq_keep_mask));
> >>  	I915_WRITE(RING_HWSTAM(ring->mmio_base), 0xffffffff);
> >>@@ -1992,10 +1990,9 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *ring)
> >>  	i915_cmd_parser_fini_ring(ring);
> >>  	i915_gem_batch_pool_fini(&ring->batch_pool);
> >>
> >>-	if (ring->status_page.obj) {
> >>-		kunmap(sg_page(ring->status_page.obj->pages->sgl));
> >>-		ring->status_page.obj = NULL;
> >>-	}
> >>+	/* Status page should have gone already */
> >>+	WARN_ON(ring->status_page.page_addr);
> >>+	WARN_ON(ring->status_page.obj);
> >>
> >>  	ring->disable_lite_restore_wa = false;
> >>  	ring->ctx_desc_template = 0;
> >>@@ -2434,6 +2431,11 @@ void intel_lr_context_free(struct intel_context *ctx)
> >>  			continue;
> >>
> >>  		if (ctx == ctx->i915->kernel_context) {
> >>+			/*
> >>+			 * The HWSP is part of the default context
> >>+			 * object in LRC mode.
> >>+			 */
> >>+			lrc_teardown_hardware_status_page(ringbuf->ring);
> >>  			intel_unpin_ringbuffer_obj(ringbuf);
> >>  			i915_gem_object_ggtt_unpin(ctx_obj);
> >>  		}
> >>@@ -2482,24 +2484,45 @@ uint32_t intel_lr_context_size(struct intel_engine_cs *ring)
> >>  	return ret;
> >>  }
> >>
> >>-static void lrc_setup_hardware_status_page(struct intel_engine_cs *ring,
> >>-		struct drm_i915_gem_object *default_ctx_obj)
> >>+static void lrc_setup_hardware_status_page(struct intel_engine_cs *ring)
> >>  {
> >>-	struct drm_i915_private *dev_priv = ring->dev->dev_private;
> >>+	struct drm_i915_private *dev_priv = to_i915(ring->dev);
> >>+	struct intel_context *dctx = dev_priv->kernel_context;
> >>+	struct drm_i915_gem_object *dctx_obj = dctx->engine[ring->id].state;
> >>+	u64 dctx_addr = i915_gem_obj_ggtt_offset(dctx_obj);
> >>  	struct page *page;
> >>
> >>-	/* The HWSP is part of the default context object in LRC mode. */
> >>-	ring->status_page.gfx_addr = i915_gem_obj_ggtt_offset(default_ctx_obj)
> >>-			+ LRC_PPHWSP_PN * PAGE_SIZE;
> >>-	page = i915_gem_object_get_page(default_ctx_obj, LRC_PPHWSP_PN);
> >>+	/*
> >>+	 * The HWSP is part of the default context object in LRC mode.
> >>+	 * Note that it doesn't contribute to the refcount!
> >>+	 */
> >>+	page = i915_gem_object_get_page(dctx_obj, LRC_PPHWSP_PN);
> >>  	ring->status_page.page_addr = kmap(page);
> >>-	ring->status_page.obj = default_ctx_obj;
> >>+	ring->status_page.gfx_addr = dctx_addr + LRC_PPHWSP_PN * PAGE_SIZE;
> >>+	ring->status_page.obj = dctx_obj;
> >>
> >>  	I915_WRITE(RING_HWS_PGA(ring->mmio_base),
> >>  			(u32)ring->status_page.gfx_addr);
> >>  	POSTING_READ(RING_HWS_PGA(ring->mmio_base));
> >>  }
> >>
> >>+/* This should be called before the default context is destroyed */
> >>+static void lrc_teardown_hardware_status_page(struct intel_engine_cs *ring)
> >>+{
> >>+	struct drm_i915_gem_object *dctx_obj = ring->status_page.obj;
> >>+	struct page *page;
> >>+
> >>+	WARN_ON(!dctx_obj);
> >>+
> >>+	if (ring->status_page.page_addr) {
> >>+		page = i915_gem_object_get_page(dctx_obj, LRC_PPHWSP_PN);
> >>+		kunmap(page);
> >>+		ring->status_page.page_addr = NULL;
> >>+	}
> >>+
> >
> >As the page is always kmapped along with setting of status_page.obj,
> >I fail to see why we need the check here for the page_addr.
> 
> The WARN_ON() tells the developer something unexpected happened.
> 
> The page_addr test avoids doing something that may cause an OOPS so that the
> driver can continue (to unload!).
> 
> IMHO, teardown code *in particular* should be written to anticipate being
> called in all sorts of inconsistent states, because a common error handling
> strategy is just to say,
> 
>     oops, something bad happened at some stage of setup!
>     let's tear down everything and pass the error back
> 
> which is certainly easier than tracking exactly where the error occurred and
> undoing only the steps known to have succeeded -- that leads to lots of
> "goto err_1", "goto err_2", etc :(

Russian dolls error code unwinding is pretty much standard kernel
practice. And since it's also mostly dead code (well in practice) it's not
too much of a validation horror show. This will likely change a lot if we
start using EPROBE_DEFER in earnest.

This comment is just for context, i.e. I think this is totally fine. It's
a bit fragile, but you'll be fighting against a _lot_ of inertia with this
;-)

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch