[Mesa-dev] [PATCH] i965: Unbind deleted shaders from brw_context, fixing malloc heisenbug.
Pohjolainen, Topi
topi.pohjolainen at gmail.com
Thu Jan 26 07:41:24 UTC 2017
On Wed, Jan 25, 2017 at 11:14:32PM -0800, Kenneth Graunke wrote:
> Applications may delete a shader program, create a new one, and bind it
> before the next draw. With terrible luck, malloc may randomly return a
> chunk of memory for the new gl_program that happened to be the exact
> same pointer as our previously bound gl_program. In this case, our
> logic to detect new programs in brw_upload_pipeline_state() would break:
>
> if (brw->vertex_program != ctx->VertexProgram._Current) {
> brw->vertex_program = ctx->VertexProgram._Current;
> brw->ctx.NewDriverState |= BRW_NEW_VERTEX_PROGRAM;
> }
>
> Because the pointer is the same, we'd think it was the same program.
> But it could be wildly different - a different stage altogether,
> different sets of resources, and so on. This causes utter chaos.
>
> As unlikely as this seems, I believe I hit this when running a subset
> of the CTS in a loop, in a group of tests that churns through simple
> programs, deleting and rebuilding them. Presumably malloc uses a
> bucketing cache of sorts, and so freeing up a gl_program and allocating
> a new one fairly quickly causes it to reuse that memory.
>
> The result was that brw->vertex_program->info.num_ssbos claimed the
> program had SSBOs, while brw->vs.base.prog_data.binding_table claimed
> that there were none. This was crazy, because the binding table is
> calculated from info.num_ssbos - the shader info appeared to change
> between shader compile time and draw time. Careful use of watchpoints
> revealed that it was being clobbered by rzalloc's memset when building
> an entirely different program...
>
> Fortunately, our 0xd0d0d0d0 canary for unused binding table entries
> caused us to crash out of bounds when trying to upload SSBOs, or we
> may have never discovered this heisenbug.
>
> Fixes crashes in GL45-CTS.compute_shader.sso-case2 when using a hacked
> cts-runner that only runs GL45-CTS.compute_shader.s* in EGL config ID 5
> at 64x64 in a loop with 100 iterations.
>
> Cc: "17.0 13.0 12.0" <mesa-stable at lists.freedesktop.org>
> Signed-off-by: Kenneth Graunke <kenneth at whitecape.org>
Really good work, must have taken quite a bit of time!!!
Reviewed-by: Topi Pohjolainen <topi.pohjolainen at intel.com>
I've noticed the same malloc pattern with miptree's. Gave me a
headache when just dumping miptrees and not paying attention to
deallocations in between.
> ---
> src/mesa/drivers/dri/i965/brw_program.c | 43 +++++++++++++++++++++++++++++++++
> 1 file changed, 43 insertions(+)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_program.c b/src/mesa/drivers/dri/i965/brw_program.c
> index e81f6b15c0a..673dc502ad4 100644
> --- a/src/mesa/drivers/dri/i965/brw_program.c
> +++ b/src/mesa/drivers/dri/i965/brw_program.c
> @@ -177,6 +177,49 @@ static struct gl_program *brwNewProgram(struct gl_context *ctx, GLenum target,
> static void brwDeleteProgram( struct gl_context *ctx,
> struct gl_program *prog )
> {
> + struct brw_context *brw = brw_context(ctx);
> +
> + /* Beware! prog's refcount has reached zero, and it's about to be freed.
> + *
> + * In brw_upload_pipeline_state(), we compare brw->foo_program to
> + * ctx->FooProgram._Current, and flag BRW_NEW_FOO_PROGRAM if the
> + * pointer has changed.
> + *
> + * We cannot leave brw->foo_program as a dangling pointer to the dead
> + * program. malloc() may allocate the same memory for a new gl_program,
> + * causing us to see matching pointers...but totally different programs.
> + *
> + * We cannot set brw->foo_program to NULL, either. If we've deleted the
> + * active program, Mesa may set ctx->FooProgram._Current to NULL. That
> + * would cause us to see matching pointers (NULL == NULL), and fail to
> + * detect that a program has changed since our last draw.
> + *
> + * So, set it to a bogus gl_program pointer that will never match,
> + * causing us to properly reevaluate the state on our next draw.
> + *
> + * Getting this wrong causes heisenbugs which are very hard to catch,
> + * as you need a very specific allocation pattern to hit the problem.
> + */
> + static const struct gl_program deleted_program;
> +
> + if (brw->vertex_program == prog)
> + brw->vertex_program = &deleted_program;
> +
> + if (brw->tess_ctrl_program == prog)
> + brw->tess_ctrl_program = &deleted_program;
> +
> + if (brw->tess_eval_program == prog)
> + brw->tess_eval_program = &deleted_program;
> +
> + if (brw->geometry_program == prog)
> + brw->geometry_program = &deleted_program;
> +
> + if (brw->fragment_program == prog)
> + brw->fragment_program = &deleted_program;
> +
> + if (brw->compute_program == prog)
> + brw->compute_program = &deleted_program;
> +
> _mesa_delete_program( ctx, prog );
> }
>
> --
> 2.11.0
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
More information about the mesa-dev
mailing list