[Mesa-dev] [PATCH] i965: Unbind deleted shaders from brw_context, fixing malloc heisenbug.

Pohjolainen, Topi topi.pohjolainen at gmail.com
Thu Jan 26 07:41:24 UTC 2017


On Wed, Jan 25, 2017 at 11:14:32PM -0800, Kenneth Graunke wrote:
> Applications may delete a shader program, create a new one, and bind it
> before the next draw.  With terrible luck, malloc may randomly return a
> chunk of memory for the new gl_program that happened to be the exact
> same pointer as our previously bound gl_program.  In this case, our
> logic to detect new programs in brw_upload_pipeline_state() would break:
> 
>       if (brw->vertex_program != ctx->VertexProgram._Current) {
>          brw->vertex_program = ctx->VertexProgram._Current;
>          brw->ctx.NewDriverState |= BRW_NEW_VERTEX_PROGRAM;
>       }
> 
> Because the pointer is the same, we'd think it was the same program.
> But it could be wildly different - a different stage altogether,
> different sets of resources, and so on.  This causes utter chaos.
> 
> As unlikely as this seems, I believe I hit this when running a subset
> of the CTS in a loop, in a group of tests that churns through simple
> programs, deleting and rebuilding them.  Presumably malloc uses a
> bucketing cache of sorts, and so freeing up a gl_program and allocating
> a new one fairly quickly causes it to reuse that memory.
> 
> The result was that brw->vertex_program->info.num_ssbos claimed the
> program had SSBOs, while brw->vs.base.prog_data.binding_table claimed
> that there were none.  This was crazy, because the binding table is
> calculated from info.num_ssbos - the shader info appeared to change
> between shader compile time and draw time.  Careful use of watchpoints
> revealed that it was being clobbered by rzalloc's memset when building
> an entirely different program...
> 
> Fortunately, our 0xd0d0d0d0 canary for unused binding table entries
> caused us to crash out of bounds when trying to upload SSBOs, or we
> may have never discovered this heisenbug.
> 
> Fixes crashes in GL45-CTS.compute_shader.sso-case2 when using a hacked
> cts-runner that only runs GL45-CTS.compute_shader.s* in EGL config ID 5
> at 64x64 in a loop with 100 iterations.
> 
> Cc: "17.0 13.0 12.0" <mesa-stable at lists.freedesktop.org>
> Signed-off-by: Kenneth Graunke <kenneth at whitecape.org>

Really good work, must have taken quite a bit of time!!!

Reviewed-by: Topi Pohjolainen <topi.pohjolainen at intel.com>


I've noticed the same malloc pattern with miptree's. Gave me a
headache when just dumping miptrees and not paying attention to
deallocations in between.

> ---
>  src/mesa/drivers/dri/i965/brw_program.c | 43 +++++++++++++++++++++++++++++++++
>  1 file changed, 43 insertions(+)
> 
> diff --git a/src/mesa/drivers/dri/i965/brw_program.c b/src/mesa/drivers/dri/i965/brw_program.c
> index e81f6b15c0a..673dc502ad4 100644
> --- a/src/mesa/drivers/dri/i965/brw_program.c
> +++ b/src/mesa/drivers/dri/i965/brw_program.c
> @@ -177,6 +177,49 @@ static struct gl_program *brwNewProgram(struct gl_context *ctx, GLenum target,
>  static void brwDeleteProgram( struct gl_context *ctx,
>  			      struct gl_program *prog )
>  {
> +   struct brw_context *brw = brw_context(ctx);
> +
> +   /* Beware!  prog's refcount has reached zero, and it's about to be freed.
> +    *
> +    * In brw_upload_pipeline_state(), we compare brw->foo_program to
> +    * ctx->FooProgram._Current, and flag BRW_NEW_FOO_PROGRAM if the
> +    * pointer has changed.
> +    *
> +    * We cannot leave brw->foo_program as a dangling pointer to the dead
> +    * program.  malloc() may allocate the same memory for a new gl_program,
> +    * causing us to see matching pointers...but totally different programs.
> +    *
> +    * We cannot set brw->foo_program to NULL, either.  If we've deleted the
> +    * active program, Mesa may set ctx->FooProgram._Current to NULL.  That
> +    * would cause us to see matching pointers (NULL == NULL), and fail to
> +    * detect that a program has changed since our last draw.
> +    *
> +    * So, set it to a bogus gl_program pointer that will never match,
> +    * causing us to properly reevaluate the state on our next draw.
> +    *
> +    * Getting this wrong causes heisenbugs which are very hard to catch,
> +    * as you need a very specific allocation pattern to hit the problem.
> +    */
> +   static const struct gl_program deleted_program;
> +
> +   if (brw->vertex_program == prog)
> +      brw->vertex_program = &deleted_program;
> +
> +   if (brw->tess_ctrl_program == prog)
> +      brw->tess_ctrl_program = &deleted_program;
> +
> +   if (brw->tess_eval_program == prog)
> +      brw->tess_eval_program = &deleted_program;
> +
> +   if (brw->geometry_program == prog)
> +      brw->geometry_program = &deleted_program;
> +
> +   if (brw->fragment_program == prog)
> +      brw->fragment_program = &deleted_program;
> +
> +   if (brw->compute_program == prog)
> +      brw->compute_program = &deleted_program;
> +
>     _mesa_delete_program( ctx, prog );
>  }
>  
> -- 
> 2.11.0
> 
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev


More information about the mesa-dev mailing list