[Intel-gfx] [PATCH] drm/i915: cleanup lrc engine state when lrc is freed

Dave Gordon david.s.gordon at intel.com
Fri Jan 9 10:47:41 PST 2015


On 08/01/15 13:40, Mika Kuoppala wrote:
> i915_gem_validate_context() will check the engine->state to see if it can
> submit into a ringbuffer. But when we are releasing the context we leave the
> engine state to a non null value. Thus after a successful hang recovery
> we might mistakenly submit to a non initialized ringbuffer resulting in:
> 
> [ 1991.356418] ------------[ cut here ]------------
> [ 1991.359192] WARNING: CPU: 1 PID: 2335 at lib/iomap.c:43 bad_io_access+0x3d/0x40()
> [ 1991.361966] Bad IO access at port 0x24 (outl(val,port))
> [ 1991.364750] Modules linked in: snd_hda_codec_hdmi i915 x86_pkg_temp_thermal coretemp kvm_intel kvm snd_hda_intel snd_hda_controller snd_hda_codec crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hwdep snd_pcm aesni_intel aes_x86_64 glue_helper lrw i2c_algo_bit gf128mul ablk_helper drm_kms_helper cryptd snd_seq_midi snd_seq_midi_event serio_raw drm snd_rawmidi snd_seq snd_seq_device snd_timer video snd soundcore mei_me lpc_ich bnep mac_hid acpi_pad mei rfcomm bluetooth parport_pc ppdev lp parport nls_iso8859_1 e1000e ptp ahci libahci pps_core sdhci_acpi sdhci
> [ 1991.370827] CPU: 1 PID: 2335 Comm: gem_ringfill Tainted: G        W      3.19.0-rc3+ #50
> [ 1991.373838] Hardware name: Intel Corporation Broadwell Client platform/SawTooth Peak, BIOS BDW-E1R1.86C.0092.R00.1408311942 08/31/2014
> [ 1991.376902]  ffffffff81aa1a46 ffff88014910fac8 ffffffff8173dbcf 0000000000000001
> [ 1991.379978]  ffff88014910fb18 ffff88014910fb08 ffffffff8107007a ffff88014910fb28
> [ 1991.383037]  ffff880147209940 ffff8800aafa8718 ffff8800aafa0000 ffff8800aafa1918
> [ 1991.386094] Call Trace:
> [ 1991.389140]  [<ffffffff8173dbcf>] dump_stack+0x45/0x57
> [ 1991.392207]  [<ffffffff8107007a>] warn_slowpath_common+0x8a/0xc0
> [ 1991.395268]  [<ffffffff810700f6>] warn_slowpath_fmt+0x46/0x50
> [ 1991.398330]  [<ffffffffa053290c>] ? intel_logical_ring_begin+0x3c/0x240 [i915]
> [ 1991.401395]  [<ffffffff813985bd>] bad_io_access+0x3d/0x40
> [ 1991.404462]  [<ffffffff81398763>] iowrite32+0x33/0x40
> [ 1991.407529]  [<ffffffffa0533585>] gen8_init_rcs_context+0xd5/0x170 [i915]
> [ 1991.410605]  [<ffffffffa0533d17>] intel_lr_context_deferred_create+0x657/0x8e0 [i915]
> [ 1991.413668]  [<ffffffffa050eff1>] i915_gem_do_execbuffer.isra.22+0xed1/0xf60 [i915]
> [ 1991.416736]  [<ffffffff811c0125>] ? __kmalloc+0x55/0x1b0
> [ 1991.419801]  [<ffffffffa051029c>] ? i915_gem_execbuffer2+0x6c/0x2c0 [i915]
> [ 1991.422772]  [<ffffffffa05102e1>] i915_gem_execbuffer2+0xb1/0x2c0 [i915]
> [ 1991.425632]  [<ffffffffa01b8ab4>] drm_ioctl+0x1a4/0x630 [drm]
> [ 1991.428454]  [<ffffffff811258bc>] ? acct_account_cputime+0x1c/0x20
> [ 1991.431255]  [<ffffffff811ee378>] do_vfs_ioctl+0x2f8/0x510
> [ 1991.434009]  [<ffffffff8109f834>] ? vtime_account_user+0x54/0x60
> [ 1991.436778]  [<ffffffff811ee611>] SyS_ioctl+0x81/0xa0
> [ 1991.439553]  [<ffffffff81745cb4>] ? int_check_syscall_exit_work+0x34/0x3d
> [ 1991.442306]  [<ffffffff81745a2d>] system_call_fastpath+0x16/0x1b
> 
> Fix this by setting all the engine fields properly when lrc is freed.
> 
> Cc: Thomas Daniel <thomas.daniel at intel.com>
> Signed-off-by: Mika Kuoppala <mika.kuoppala at intel.com>
> ---
>  drivers/gpu/drm/i915/intel_lrc.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 7670a0f..32684d9 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1777,6 +1777,10 @@ void intel_lr_context_free(struct intel_context *ctx)
>  			intel_destroy_ringbuffer_obj(ringbuf);
>  			kfree(ringbuf);
>  			drm_gem_object_unreference(&ctx_obj->base);
> +			WARN_ON(ctx->engine[i].unpin_count != 0);
> +			ctx->engine[i].unpin_count = 0;
> +			ctx->engine[i].ringbuf = NULL;
> +			ctx->engine[i].state = NULL;
>  		}
>  	}
>  }

Hi,

I don't quite see how this can fix the problem illustrated by the stack
trace above.  AFAICS intel_lr_context_free() is called /only/ from
i915_gem_context_free(), which should mean that the refcount on the
intel_context object is already zero, and that it will be freed on
return. So the contents of ctx->engine[] should be irrelevant ...

void i915_gem_context_free(struct kref *ctx_ref)
{
    struct intel_context *ctx = container_of(ctx_ref,
                                             typeof(*ctx), ref);

    trace_i915_context_free(ctx);

    if (i915.enable_execlists)
        intel_lr_context_free(ctx);

    i915_ppgtt_put(ctx->ppgtt);

    if (ctx->legacy_hw_ctx.rcs_state)
        drm_gem_object_unreference(&ctx->legacy_hw_ctx.rcs_state->base);

    list_del(&ctx->link);
    kfree(ctx);
}

... unless something is trying to reuse the context while it is still in
the process of being deleted :(

In addition, the stack trace above implies that ctx->engine[].state WAS
NULL when i915_gem_validate_context() was called, otherwise it would not
have called intel_lr_context_deferred_create()

    if (i915.enable_execlists && !ctx->engine[ring->id].state) {
        int ret = intel_lr_context_deferred_create(ctx, ring);
        if (ret) {
             DRM_DEBUG("Could not create LRC %u: %d\n", ctx_id, ret);
             return ERR_PTR(ret);
        }
    }

and likewise that function would not have called gen8_init_rcs_context()
unless this was a new context:

    if (ctx == ring->default_context)
        lrc_setup_hardware_status_page(ring, ctx_obj);
    else if (ring->id == RCS && !ctx->rcs_initialized) {
        if (ring->init_context) {
            ret = ring->init_context(ring, ctx);
            if (ret) {
                 DRM_ERROR("ring init context: %d\n", ret);
                 ctx->engine[ring->id].ringbuf = NULL;
                 ctx->engine[ring->id].state = NULL;
                 goto error;
            }
        }

        ctx->rcs_initialized = true;
    }

Note that rcs_initialized is never cleared, even with your change, so
that in a use-after-free situation we wouldn't end up in this path. So I
think the mystery is how this context ended up in an inconsistent state:
has it been partially freed and then reused, or has some part of the new
context allocation path failed but not been unwound correctly?

And if setting to NULL a pointer that's inside a structure that's in the
process of being freed actually makes a difference, doesn't that mean
there's a use-after-free issue somewhere?

.Dave.


More information about the Intel-gfx mailing list