[Intel-gfx] [Mesa-dev] [PATCH] drm/i915: Enable the HiZ RAW Stall Optimization on Gen8.

Mon Jan 12 10:09:12 PST 2015

On 12/01/15 18:02, Ben Widawsky wrote:
> On Mon, Jan 12, 2015 at 02:02:34PM +0200, Ville Syrjälä wrote:
>> On Sun, Jan 11, 2015 at 07:14:57PM -0800, Ben Widawsky wrote:
>>> On Sun, Jan 11, 2015 at 07:05:21PM -0800, Kenneth Graunke wrote:
>>>> On Sunday, January 11, 2015 05:46:09 PM Ben Widawsky wrote:
>>>>> On Sun, Jan 11, 2015 at 04:05:25PM -0800, Kenneth Graunke wrote:
>>>>>> On Sunday, January 11, 2015 01:49:41 PM Ben Widawsky wrote:
>>>>>>> On Sat, Jan 10, 2015 at 06:44:49PM -0800, Kenneth Graunke wrote:
>>>>>>>> This is an important optimization for avoiding read-after-write (RAW)
>>>>>>>> stalls in the HiZ buffer.  Certain workloads would run very slowly with
>>>>>>>> HiZ enabled, but run much faster with the "hiz=false" driconf option.
>>>>>>>> With this patch, they run at full speed even with HiZ.
>>>>>>>>
>>>>>>>> Improves performance in OglVSInstancing by 3.2x on Broadwell GT3e
>>>>>>>> (Iris Pro 6200).
>>>>>>>>
>>>>>>>> Thanks to Jesse Barnes for finding this missing bit!
>>>>>>>> Thanks to Chris Wilson for helping me find where to set it.
>>>>>>>>
>>>>>>>> Signed-off-by: Kenneth Graunke <kenneth at whitecape.org>
>>>>>>>> Cc: Jesse Barnes <jbarnes at virtuousgeek.org>
>>>>>>>> ---
>>>>>>>>  drivers/gpu/drm/i915/intel_ringbuffer.c | 15 +++++++++++++++
>>>>>>>>  1 file changed, 15 insertions(+)
>>>>>>>>
>>>>>>>> Here's an alternate patch which implements the workaround in the kernel
>>>>>>>> instead of Mesa.  It's probably better to do it there, since the kernel
>>>>>>>> does it on Haswell already.
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
>>>>>>>> index dabc1d8..23020d6 100644
>>>>>>>> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
>>>>>>>> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
>>>>>>>> @@ -796,6 +796,16 @@ static int bdw_init_workarounds(struct intel_engine_cs *ring)
>>>>>>>>  			  HDC_DONOT_FETCH_MEM_WHEN_MASKED |
>>>>>>>>  			  (IS_BDW_GT3(dev) ? HDC_FENCE_DEST_SLM_DISABLE : 0));
>>>>>>>>  
>>>>>>>> +	/* From the Haswell PRM, Command Reference: Registers, CACHE_MODE_0:
>>>>>>>> +	 * "The Hierarchical Z RAW Stall Optimization allows non-overlapping
>>>>>>>> +	 *  polygons in the same 8x4 pixel/sample area to be processed without
>>>>>>>> +	 *  stalling waiting for the earlier ones to write to Hierarchical Z
>>>>>>>> +	 *  buffer."
>>>>>>>> +	 *
>>>>>>>> +	 * This optimization is off by default for Broadwell; turn it on.
>>>>>>>> +	 */
>>>>>>>> +	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
>>>>>>>> +
>>>>>>>>  	/* Wa4x4STCOptimizationDisable:bdw */
>>>>>>>>  	WA_SET_BIT_MASKED(CACHE_MODE_1,
>>>>>>>>  			  GEN8_4x4_STC_OPTIMIZATION_DISABLE);
>>>>>>>> @@ -836,6 +846,11 @@ static int chv_init_workarounds(struct intel_engine_cs *ring)
>>>>>>>>  			  HDC_FORCE_NON_COHERENT |
>>>>>>>>  			  HDC_DONOT_FETCH_MEM_WHEN_MASKED);
>>>>>>>>  
>>>>>>>> +	/* According to the CACHE_MODE_0 default value documentation, some
>>>>>>>> +	 * CHV platforms disable this optimization by default.  Turn it on.
>>>>>>>> +	 */
>>>>>>>> +	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
>>>>>>>> +
>>>>>>>>  	/* Improve HiZ throughput on CHV. */
>>>>>>>>  	WA_SET_BIT_MASKED(HIZ_CHICKEN, CHV_HZ_8X8_MODE_IN_1X);
>>>>>>>>  
>>>>>>>
>>>>>>> I think you should do this as two separate patches, 1 per platform. For the BSW
>>>>>>> patch (given that I had the same functionality in the kernel patch I asked you
>>>>>>> to look at ;-) and FWIW, Jordan has numbers on BSW B-step with my kernel patch
>>>>>>> which we can use for the commit):
>>>>>>> Signed-off-by: Ben Widawsky <ben at bwidawsk.net>
>>>>>>
>>>>>> Huh, I don't recall seeing that kernel patch.  Sorry.  I guess I'll split it
>>>>>> and resubmit...
>>>>>>
>>>>>
>>>>> It's not my call, it's just nice to have platform specific bisection. And the
>>>>> patch wasn't on the list, it was the one I kept asking you to look at in my
>>>>> branch :-)
>>>>>
>>>>>>> I haven't looked at Broadwell docs, so I'll let someone else take care of that.
>>>>>>>
>>>>>>> I don't know if I agree with Chris that we should call these in the workaround
>>>>>>> section, but whatever. init_clock_gating is equally sucky.
>>>>>>
>>>>>> init_clock_gating doesn't work.  The register writes don't stick and they have
>>>>>> no effect at all.  Setting them here makes them actually take effect in the
>>>>>> context.
>>>>>>
>>>>>> --Ken
>>>>>
>>>>> Separate thread now, but are you sure? We're setting at least two context
>>>>> specific registers in there today, among them: GEN7_FF_THREAD_MODE (which is
>>>>> important to performance).
>>>>
>>>> It looks like we're setting:
>>>> - [BDW] RC_SLEEP_PSMI_CONTROL - 0x2050
>>>
>>> dword offset 0x1c in the context image
>>
>> power context, not logical context
>>
>>>> - [BDW, CHV] FF_THREAD_MODE - 0x20a0
>>>
>>> dword offset 0x2a in the context image
>>
>> Also power context
>>
>>>> - [CHV] RC_SLEEP_PSMI_CONTROL - 0x12050
>>>
>>> Kinda surprised this one isn't there. I'm not sure how it can work correctly.
>>
>> We're not frobbing with this anywhere but gen6_bsd_ring_write_tail(). In any
>> case it's a VCS register. Sadly I've not found any documentation for !RCS
>> power context, but I'm assuming every engine has a power context of some sort.
>>
> 
> Yeah, Ken and I resolved this offline. Any idea why the bits don't stick when
> written via MMIO?
> 
>> -- 
>> Ville Syrjälä
>> Intel OTC

Doesn't BSpec say writing via MMIO is unreliable if the engine is
running (for some definition of 'running')? Do they stick even
temporarily? If we read them back just after writing, do the MMIO
registers have the expected values? Or are they lost only after a
context switch (or some other event)?

.Dave.