[Intel-gfx] [PATCH] drm/i915: cleanup lrc engine state when lrc is freed
Ville Syrjälä
ville.syrjala at linux.intel.com
Mon Jan 12 08:24:38 PST 2015
On Fri, Jan 09, 2015 at 06:47:41PM +0000, Dave Gordon wrote:
> On 08/01/15 13:40, Mika Kuoppala wrote:
> > i915_gem_validate_context() will check the engine->state to see if it can
> > submit into a ringbuffer. But when we are releasing the context we leave the
> > engine state to a non null value. Thus after a successful hang recovery
> > we might mistakenly submit to a non initialized ringbuffer resulting in:
> >
> > [ 1991.356418] ------------[ cut here ]------------
> > [ 1991.359192] WARNING: CPU: 1 PID: 2335 at lib/iomap.c:43 bad_io_access+0x3d/0x40()
> > [ 1991.361966] Bad IO access at port 0x24 (outl(val,port))
> > [ 1991.364750] Modules linked in: snd_hda_codec_hdmi i915 x86_pkg_temp_thermal coretemp kvm_intel kvm snd_hda_intel snd_hda_controller snd_hda_codec crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hwdep snd_pcm aesni_intel aes_x86_64 glue_helper lrw i2c_algo_bit gf128mul ablk_helper drm_kms_helper cryptd snd_seq_midi snd_seq_midi_event serio_raw drm snd_rawmidi snd_seq snd_seq_device snd_timer video snd soundcore mei_me lpc_ich bnep mac_hid acpi_pad mei rfcomm bluetooth parport_pc ppdev lp parport nls_iso8859_1 e1000e ptp ahci libahci pps_core sdhci_acpi sdhci
> > [ 1991.370827] CPU: 1 PID: 2335 Comm: gem_ringfill Tainted: G W 3.19.0-rc3+ #50
> > [ 1991.373838] Hardware name: Intel Corporation Broadwell Client platform/SawTooth Peak, BIOS BDW-E1R1.86C.0092.R00.1408311942 08/31/2014
> > [ 1991.376902] ffffffff81aa1a46 ffff88014910fac8 ffffffff8173dbcf 0000000000000001
> > [ 1991.379978] ffff88014910fb18 ffff88014910fb08 ffffffff8107007a ffff88014910fb28
> > [ 1991.383037] ffff880147209940 ffff8800aafa8718 ffff8800aafa0000 ffff8800aafa1918
> > [ 1991.386094] Call Trace:
> > [ 1991.389140] [<ffffffff8173dbcf>] dump_stack+0x45/0x57
> > [ 1991.392207] [<ffffffff8107007a>] warn_slowpath_common+0x8a/0xc0
> > [ 1991.395268] [<ffffffff810700f6>] warn_slowpath_fmt+0x46/0x50
> > [ 1991.398330] [<ffffffffa053290c>] ? intel_logical_ring_begin+0x3c/0x240 [i915]
> > [ 1991.401395] [<ffffffff813985bd>] bad_io_access+0x3d/0x40
> > [ 1991.404462] [<ffffffff81398763>] iowrite32+0x33/0x40
> > [ 1991.407529] [<ffffffffa0533585>] gen8_init_rcs_context+0xd5/0x170 [i915]
> > [ 1991.410605] [<ffffffffa0533d17>] intel_lr_context_deferred_create+0x657/0x8e0 [i915]
> > [ 1991.413668] [<ffffffffa050eff1>] i915_gem_do_execbuffer.isra.22+0xed1/0xf60 [i915]
> > [ 1991.416736] [<ffffffff811c0125>] ? __kmalloc+0x55/0x1b0
> > [ 1991.419801] [<ffffffffa051029c>] ? i915_gem_execbuffer2+0x6c/0x2c0 [i915]
> > [ 1991.422772] [<ffffffffa05102e1>] i915_gem_execbuffer2+0xb1/0x2c0 [i915]
> > [ 1991.425632] [<ffffffffa01b8ab4>] drm_ioctl+0x1a4/0x630 [drm]
> > [ 1991.428454] [<ffffffff811258bc>] ? acct_account_cputime+0x1c/0x20
> > [ 1991.431255] [<ffffffff811ee378>] do_vfs_ioctl+0x2f8/0x510
> > [ 1991.434009] [<ffffffff8109f834>] ? vtime_account_user+0x54/0x60
> > [ 1991.436778] [<ffffffff811ee611>] SyS_ioctl+0x81/0xa0
> > [ 1991.439553] [<ffffffff81745cb4>] ? int_check_syscall_exit_work+0x34/0x3d
> > [ 1991.442306] [<ffffffff81745a2d>] system_call_fastpath+0x16/0x1b
> >
> > Fix this by setting all the engine fields properly when lrc is freed.
> >
> > Cc: Thomas Daniel <thomas.daniel at intel.com>
> > Signed-off-by: Mika Kuoppala <mika.kuoppala at intel.com>
> > ---
> > drivers/gpu/drm/i915/intel_lrc.c | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> > index 7670a0f..32684d9 100644
> > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > @@ -1777,6 +1777,10 @@ void intel_lr_context_free(struct intel_context *ctx)
> > intel_destroy_ringbuffer_obj(ringbuf);
> > kfree(ringbuf);
> > drm_gem_object_unreference(&ctx_obj->base);
> > + WARN_ON(ctx->engine[i].unpin_count != 0);
> > + ctx->engine[i].unpin_count = 0;
> > + ctx->engine[i].ringbuf = NULL;
> > + ctx->engine[i].state = NULL;
> > }
> > }
> > }
>
> Hi,
>
> I don't quite see how this can fix the problem illustrated by the stack
> trace above. AFAICS intel_lr_context_free() is called /only/ from
> i915_gem_context_free(), which should mean that the refcount on the
> intel_context object is already zero, and that it will be freed on
> return. So the contents of ctx->engine[] should be irrelevant ...
I've now managed to hit this fairly reliably with
'gem_ringfill --r render' + ctrl-c straight after boot.
To me it looks like some kind of bigger mess with olr. If olr was already there,
logical_ring_alloc_request() will do absolutely nothing, which means the ringbuf
won't even be pinned/mapped and ringbug->virtual_start will remain NULL. And that
leads to the w/a init to call iowrite() with some close to 0 iomem addressses.
I'm not sure how this is supposed to actually work. To me it looks olr should never
be allowed to be != NULL when creating the context/ringbuf to avoid this problem. And
yet somehow the presence of olr doesn't always trigger the 'bad IO access' warns. In
fact I can't seem to trigger it more than once until I reboot the machine. I tried to
add a warning [1] to catch all cases when olr is already there, and it will trigger in
most cases when I hit ctrl-c while running gem_ringfill. And yet the 'Bad IO access'
doesn't follow it except once after boot.
[1]
@@ -1906,6 +1926,8 @@ int intel_lr_context_deferred_create(struct intel_context *ctx,
}
+ WARN_ON(ring->outstanding_lazy_request);
+
ret = populate_lr_context(ctx, ctx_obj, ring, ringbuf);
if (ret) {
DRM_DEBUG_DRIVER("Failed to populate LRC: %d\n", ret);
>
> void i915_gem_context_free(struct kref *ctx_ref)
> {
> struct intel_context *ctx = container_of(ctx_ref,
> typeof(*ctx), ref);
>
> trace_i915_context_free(ctx);
>
> if (i915.enable_execlists)
> intel_lr_context_free(ctx);
>
> i915_ppgtt_put(ctx->ppgtt);
>
> if (ctx->legacy_hw_ctx.rcs_state)
> drm_gem_object_unreference(&ctx->legacy_hw_ctx.rcs_state->base);
>
> list_del(&ctx->link);
> kfree(ctx);
> }
>
> ... unless something is trying to reuse the context while it is still in
> the process of being deleted :(
>
> In addition, the stack trace above implies that ctx->engine[].state WAS
> NULL when i915_gem_validate_context() was called, otherwise it would not
> have called intel_lr_context_deferred_create()
>
> if (i915.enable_execlists && !ctx->engine[ring->id].state) {
> int ret = intel_lr_context_deferred_create(ctx, ring);
> if (ret) {
> DRM_DEBUG("Could not create LRC %u: %d\n", ctx_id, ret);
> return ERR_PTR(ret);
> }
> }
>
> and likewise that function would not have called gen8_init_rcs_context()
> unless this was a new context:
>
> if (ctx == ring->default_context)
> lrc_setup_hardware_status_page(ring, ctx_obj);
> else if (ring->id == RCS && !ctx->rcs_initialized) {
> if (ring->init_context) {
> ret = ring->init_context(ring, ctx);
> if (ret) {
> DRM_ERROR("ring init context: %d\n", ret);
> ctx->engine[ring->id].ringbuf = NULL;
> ctx->engine[ring->id].state = NULL;
> goto error;
> }
> }
>
> ctx->rcs_initialized = true;
> }
>
> Note that rcs_initialized is never cleared, even with your change, so
> that in a use-after-free situation we wouldn't end up in this path. So I
> think the mystery is how this context ended up in an inconsistent state:
> has it been partially freed and then reused, or has some part of the new
> context allocation path failed but not been unwound correctly?
>
> And if setting to NULL a pointer that's inside a structure that's in the
> process of being freed actually makes a difference, doesn't that mean
> there's a use-after-free issue somewhere?
>
> .Dave.
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
--
Ville Syrjälä
Intel OTC
More information about the Intel-gfx
mailing list