[Intel-gfx] [PATCH 1/4] drm/i915: handle teardown of HWSP when context is freed
Daniel Vetter
daniel at ffwll.ch
Mon Jan 25 10:08:08 PST 2016
On Fri, Jan 22, 2016 at 02:51:50PM +0000, Dave Gordon wrote:
> On 22/01/16 10:28, Mika Kuoppala wrote:
> >Dave Gordon <david.s.gordon at intel.com> writes:
> >
> >>Existing code did a kunmap() on the wrong page, and didn't account for the
> >>fact that the HWSP and the default context are different offsets within the
> >>same GEM object.
> >>
> >
> >Just to note here that usually we try to split bug fixes early in the
> >series with minimal code changes. This helps cherrypickups
> >to fixes/stable. And also lessens the probability that we accidentally
> >revert bugfix when some other problem occurs with the patch.
>
> OK, let's drop this one from the series (they're orthogonal code changes,
> only related by the fact they're all to do with init/fini), and I'll
> resubmit it separately.
>
> Please continue this series from 2/4 ...
>
> >>This patch improves symmetry by creating lrc_teardown_hardware_status_page()
> >>to complement lrc_setup_hardware_status_page(), making sure that they both
> >>take the same argument (pointer to engine). They encapsulate all the details
> >>of the relationship between the default context and the status page, so the
> >>rest of the LRC code doesn't have to know about it.
> >>
> >>Signed-off-by: Dave Gordon <david.s.gordon at intel.com>
> >>---
> >> drivers/gpu/drm/i915/intel_lrc.c | 57 ++++++++++++++++++++++++++++------------
> >> 1 file changed, 40 insertions(+), 17 deletions(-)
> >>
> >>diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> >>index 73d4347..3914120 100644
> >>--- a/drivers/gpu/drm/i915/intel_lrc.c
> >>+++ b/drivers/gpu/drm/i915/intel_lrc.c
> >>@@ -226,9 +226,8 @@ enum {
> >> #define CTX_RCS_INDIRECT_CTX_OFFSET_DEFAULT 0x17
> >>
> >> static int intel_lr_context_pin(struct drm_i915_gem_request *rq);
> >>-static void lrc_setup_hardware_status_page(struct intel_engine_cs *ring,
> >>- struct drm_i915_gem_object *default_ctx_obj);
> >>-
> >>+static void lrc_setup_hardware_status_page(struct intel_engine_cs *ring);
> >>+static void lrc_teardown_hardware_status_page(struct intel_engine_cs *ring);
> >>
> >> /**
> >> * intel_sanitize_enable_execlists() - sanitize i915.enable_execlists
> >>@@ -1536,8 +1535,7 @@ static int gen8_init_common_ring(struct intel_engine_cs *ring)
> >> struct drm_i915_private *dev_priv = dev->dev_private;
> >> u8 next_context_status_buffer_hw;
> >>
> >>- lrc_setup_hardware_status_page(ring,
> >>- dev_priv->kernel_context->engine[ring->id].state);
> >>+ lrc_setup_hardware_status_page(ring);
> >>
> >> I915_WRITE_IMR(ring, ~(ring->irq_enable_mask | ring->irq_keep_mask));
> >> I915_WRITE(RING_HWSTAM(ring->mmio_base), 0xffffffff);
> >>@@ -1992,10 +1990,9 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *ring)
> >> i915_cmd_parser_fini_ring(ring);
> >> i915_gem_batch_pool_fini(&ring->batch_pool);
> >>
> >>- if (ring->status_page.obj) {
> >>- kunmap(sg_page(ring->status_page.obj->pages->sgl));
> >>- ring->status_page.obj = NULL;
> >>- }
> >>+ /* Status page should have gone already */
> >>+ WARN_ON(ring->status_page.page_addr);
> >>+ WARN_ON(ring->status_page.obj);
> >>
> >> ring->disable_lite_restore_wa = false;
> >> ring->ctx_desc_template = 0;
> >>@@ -2434,6 +2431,11 @@ void intel_lr_context_free(struct intel_context *ctx)
> >> continue;
> >>
> >> if (ctx == ctx->i915->kernel_context) {
> >>+ /*
> >>+ * The HWSP is part of the default context
> >>+ * object in LRC mode.
> >>+ */
> >>+ lrc_teardown_hardware_status_page(ringbuf->ring);
> >> intel_unpin_ringbuffer_obj(ringbuf);
> >> i915_gem_object_ggtt_unpin(ctx_obj);
> >> }
> >>@@ -2482,24 +2484,45 @@ uint32_t intel_lr_context_size(struct intel_engine_cs *ring)
> >> return ret;
> >> }
> >>
> >>-static void lrc_setup_hardware_status_page(struct intel_engine_cs *ring,
> >>- struct drm_i915_gem_object *default_ctx_obj)
> >>+static void lrc_setup_hardware_status_page(struct intel_engine_cs *ring)
> >> {
> >>- struct drm_i915_private *dev_priv = ring->dev->dev_private;
> >>+ struct drm_i915_private *dev_priv = to_i915(ring->dev);
> >>+ struct intel_context *dctx = dev_priv->kernel_context;
> >>+ struct drm_i915_gem_object *dctx_obj = dctx->engine[ring->id].state;
> >>+ u64 dctx_addr = i915_gem_obj_ggtt_offset(dctx_obj);
> >> struct page *page;
> >>
> >>- /* The HWSP is part of the default context object in LRC mode. */
> >>- ring->status_page.gfx_addr = i915_gem_obj_ggtt_offset(default_ctx_obj)
> >>- + LRC_PPHWSP_PN * PAGE_SIZE;
> >>- page = i915_gem_object_get_page(default_ctx_obj, LRC_PPHWSP_PN);
> >>+ /*
> >>+ * The HWSP is part of the default context object in LRC mode.
> >>+ * Note that it doesn't contribute to the refcount!
> >>+ */
> >>+ page = i915_gem_object_get_page(dctx_obj, LRC_PPHWSP_PN);
> >> ring->status_page.page_addr = kmap(page);
> >>- ring->status_page.obj = default_ctx_obj;
> >>+ ring->status_page.gfx_addr = dctx_addr + LRC_PPHWSP_PN * PAGE_SIZE;
> >>+ ring->status_page.obj = dctx_obj;
> >>
> >> I915_WRITE(RING_HWS_PGA(ring->mmio_base),
> >> (u32)ring->status_page.gfx_addr);
> >> POSTING_READ(RING_HWS_PGA(ring->mmio_base));
> >> }
> >>
> >>+/* This should be called before the default context is destroyed */
> >>+static void lrc_teardown_hardware_status_page(struct intel_engine_cs *ring)
> >>+{
> >>+ struct drm_i915_gem_object *dctx_obj = ring->status_page.obj;
> >>+ struct page *page;
> >>+
> >>+ WARN_ON(!dctx_obj);
> >>+
> >>+ if (ring->status_page.page_addr) {
> >>+ page = i915_gem_object_get_page(dctx_obj, LRC_PPHWSP_PN);
> >>+ kunmap(page);
> >>+ ring->status_page.page_addr = NULL;
> >>+ }
> >>+
> >
> >As the page is always kmapped along with setting of status_page.obj,
> >I fail to see why we need the check here for the page_addr.
>
> The WARN_ON() tells the developer something unexpected happened.
>
> The page_addr test avoids doing something that may cause an OOPS so that the
> driver can continue (to unload!).
>
> IMHO, teardown code *in particular* should be written to anticipate being
> called in all sorts of inconsistent states, because a common error handling
> strategy is just to say,
>
> oops, something bad happened at some stage of setup!
> let's tear down everything and pass the error back
>
> which is certainly easier than tracking exactly where the error occurred and
> undoing only the steps known to have succeeded -- that leads to lots of
> "goto err_1", "goto err_2", etc :(
Russian dolls error code unwinding is pretty much standard kernel
practice. And since it's also mostly dead code (well in practice) it's not
too much of a validation horror show. This will likely change a lot if we
start using EPROBE_DEFER in earnest.
This comment is just for context, i.e. I think this is totally fine. It's
a bit fragile, but you'll be fighting against a _lot_ of inertia with this
;-)
Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
More information about the Intel-gfx
mailing list