[Mesa-dev] [Mesa-stable] [PATCH] i965/gen6: Fix HiZ hang in WebGL Google maps

Fri Dec 20 15:09:08 PST 2013

On 12/20/2013 01:36 PM, Kenneth Graunke wrote:
> On 12/20/2013 04:47 AM, Chad Versace wrote:
>> We need to emit depth stall flushes before depth and hiz resolves.
>> Placing them at the top of blorp's state emission fixes the hang.
>>
>> Fixes HiZ hang in the new WebGL Google maps on Sandybridge Chrome OS.
>> Tested by zooming in and out continuously for 2 hours.
>>
>> This patch is based on
>> https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/8bc07bb70163c3706fb4ba5f980e57dc942f56dd
>>
>> CC: mesa-stable at lists.freedesktop.org
>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=70740
>> Signed-off-by: Stéphane Marchesin <marcheu at chromium.org>
>> Signed-off-by: Chad Versace <chad.versace at linux.intel.com>
>> ---
>>   src/mesa/drivers/dri/i965/gen6_blorp.cpp | 12 ++++++++++++
>>   1 file changed, 12 insertions(+)
>>
>> diff --git a/src/mesa/drivers/dri/i965/gen6_blorp.cpp b/src/mesa/drivers/dri/i965/gen6_blorp.cpp
>> index 6a5841f..3a0e7ec 100644
>> --- a/src/mesa/drivers/dri/i965/gen6_blorp.cpp
>> +++ b/src/mesa/drivers/dri/i965/gen6_blorp.cpp
>> @@ -1012,6 +1012,16 @@ gen6_blorp_emit_primitive(struct brw_context *brw,
>>      ADVANCE_BATCH();
>>   }
>>
>> +static void
>> +gen6_emit_hiz_workaround(struct brw_context *brw, enum gen6_hiz_op hiz_op)
>> +{
>> +   if (hiz_op == GEN6_HIZ_OP_DEPTH_RESOLVE ||
>> +       hiz_op == GEN6_HIZ_OP_HIZ_RESOLVE) {
>> +      brw->batch.need_workaround_flush = true;
>> +      intel_emit_post_sync_nonzero_flush(brw);
>> +      intel_emit_depth_stall_flushes(brw);
>> +   }
>> +}
>>
>>   /**
>>    * \brief Execute a blit or render pass operation.
>> @@ -1034,6 +1044,8 @@ gen6_blorp_exec(struct brw_context *brw,
>>      uint32_t wm_bind_bo_offset = 0;
>>
>>      uint32_t prog_offset = params->get_wm_prog(brw, &prog_data);
>> +
>> +   gen6_emit_hiz_workaround(brw, params->hiz_op);
>>      gen6_emit_3dstate_multisample(brw, params->dst.num_samples);
>>      gen6_emit_3dstate_sample_mask(brw,
>>                                    params->dst.num_samples > 1 ?
>>
>
> I'm fine with landing this as is, since it's trivial and getting
> Sandybridge stable is really critical.  I'm glad to see this won't
> happen on Gen7+.
>
> Reviewed-by: Kenneth Graunke <kenneth at whitecape.org>

Ok. Your's is the second r-b for the patch as-is. And this fixes hangs
affecting users on all distros.  So I'll commit it as-is, with refinements
to follow as I validate them.

Validation takes a long time, multiple hours, so I think it unwise to postpone
committing this bugfix until its perfect. The needed build-test-refine cycles to
attain a perfect fix may take a long time. (Thankfully the reproduction steps
are automated now).

>
> That said, it would be nice to refine it a little.
 >
> The intel_emit_post_sync_nonzero_flush(brw) call should not be
> necessary.  The very first BLORP call, gen6_emit_3dstate_multisample,
> already calls it.  However, it may not have taken effect in the past,
> since intel_emit_post_sync_nonzero_flush checks need_workaround_flush.
>
> I also am dubious whether we need the depth stall flushes, since we
> already do them before 3DSTATE_DEPTH_BUFFER, which is documented as
> necessary.
>
> So here's what I think is actually going on:
>
> We've seen strong evidence that it's 100% required to do the post-sync
> nonzero workaround before non-pipelined commands.  (My patches which
> added missing post-sync workarounds greatly reduced the number of GPU
> hangs.)
>
> I believe that the root of the problem is that needs_workaround_flush is
> not getting set to true sufficiently often, so the post-sync non-zero
> workaround isn't happening when necessary.  Currently, we flag it at the
> start of each batch, and /after/ each BLORP operation.
>
> That means it's missing in two cases:
> 1. Switching from normal rendering to BLORP drawing.
>     (...fixed by this patch.)
> 2. Between successive draws (3DPRIMITIVEs) within a batch.
>     (I believe this is also necessary, and is not fixed by this patch.)
>

After thinking about this more this morning, I also suspected that the
bug's root problem was that need_workaround_flush needed to be set
at the top of blorp (your #1).

Why do you suspect #2? Experimentation? Hardware docs?

> I would love to see an alternate patch which sets need_workaround_flush
> in two places:
> 1. Batchbuffer init (for the very first draw)
> 2. After each 3DPRIMITIVE command.
>
> I believe that would fix this hang and potentially other future hangs.

Agreed. I'll follow-up with patches for that.