[Mesa-dev] [PATCH v2 10/42] i965: Calculate appropriate L3 partition weights for the current pipeline state.

Thu Nov 19 11:51:06 PST 2015

On Thu, Nov 19, 2015 at 4:24 AM, Francisco Jerez <currojerez at riseup.net> wrote:
> Kristian Høgsberg <krh at bitplanet.net> writes:
>
>> On Tue, Nov 17, 2015 at 9:54 PM, Jordan Justen
>> <jordan.l.justen at intel.com> wrote:
>>> From: Francisco Jerez <currojerez at riseup.net>
>>>
>>> This calculates a rather conservative partitioning of the L3 cache
>>> based on the shaders currently bound to the pipeline and whether they
>>> use SLM, atomics, images or scratch space.  The result is intended to
>>> be fine-tuned later on based on other pipeline state.
>>> ---
>>>  src/mesa/drivers/dri/i965/brw_compiler.h  |  1 +
>>>  src/mesa/drivers/dri/i965/gen7_l3_state.c | 53 +++++++++++++++++++++++++++++++
>>>  2 files changed, 54 insertions(+)
>>>
>>> diff --git a/src/mesa/drivers/dri/i965/brw_compiler.h b/src/mesa/drivers/dri/i965/brw_compiler.h
>>> index 8f147d3..ef8bddb 100644
>>> --- a/src/mesa/drivers/dri/i965/brw_compiler.h
>>> +++ b/src/mesa/drivers/dri/i965/brw_compiler.h
>>> @@ -300,6 +300,7 @@ struct brw_stage_prog_data {
>>>
>>>     unsigned curb_read_length;
>>>     unsigned total_scratch;
>>> +   unsigned total_shared;
>>>
>>>     /**
>>>      * Register where the thread expects to find input data from the URB
>>> diff --git a/src/mesa/drivers/dri/i965/gen7_l3_state.c b/src/mesa/drivers/dri/i965/gen7_l3_state.c
>>> index 4d0cfcd..1a88261 100644
>>> --- a/src/mesa/drivers/dri/i965/gen7_l3_state.c
>>> +++ b/src/mesa/drivers/dri/i965/gen7_l3_state.c
>>> @@ -258,6 +258,59 @@ get_l3_config(const struct brw_device_info *devinfo, struct brw_l3_weights w0)
>>>  }
>>>
>>>  /**
>>> + * Return a reasonable default L3 configuration for the specified device based
>>> + * on whether SLM and DC are required.  In the non-SLM non-DC case the result
>>> + * is intended to approximately resemble the hardware defaults.
>>> + */
>>> +static struct brw_l3_weights
>>> +get_default_l3_weights(const struct brw_device_info *devinfo,
>>> +                       bool needs_dc, bool needs_slm)
>>> +{
>>> +   struct brw_l3_weights w = {{ 0 }};
>>> +
>>> +   w.w[L3P_SLM] = needs_slm;
>>> +   w.w[L3P_URB] = 1.0;
>>> +
>>> +   if (devinfo->gen >= 8) {
>>> +      w.w[L3P_ALL] = 1.0;
>>> +   } else {
>>> +      w.w[L3P_DC] = needs_dc ? 0.1 : 0;
>>> +      w.w[L3P_RO] = devinfo->is_baytrail ? 0.5 : 1.0;
>>> +   }
>>> +
>>> +   return norm_l3_weights(w);
>>> +}
>>> +
>>> +/**
>>> + * Calculate the desired L3 partitioning based on the current state of the
>>> + * pipeline.  For now this simply returns the conservative defaults calculated
>>> + * by get_default_l3_weights(), but we could probably do better by gathering
>>> + * more statistics from the pipeline state (e.g. guess of expected URB usage
>>> + * and bound surfaces), or by using feed-back from performance counters.
>>> + */
>>> +static struct brw_l3_weights
>>> +get_pipeline_state_l3_weights(const struct brw_context *brw)
>>> +{
>>> +   const struct brw_stage_state *stage_states[] = {
>>> +      &brw->vs.base, &brw->gs.base, &brw->wm.base, &brw->cs.base
>>> +   };
>>> +   bool needs_dc = false, needs_slm = false;
>>
>> This doesn't seem optimal - we should evaluate the 3D pipe and the
>> compute pipe separately depending on which  one is active. For
>> example, if we have a current compute program that uses SLM, but are
>> using the 3D pipeline, we'll get a partition that includes SLM even
>> for the 3D pipe.
>>
> The intention of this patch is not to provide an optimal heuristic, but
> to implement a simple heuristic that calculates conservative defaults in
> order to guarantee functional correctness.  It would be possible to base
> the result on the currently active pipeline with minimal changes to this
> function (and making sure that the L3 config atom is invalidated while
> switching pipelines), but I don't think we want to switch back and forth
> between SLM and non-SLM configurations if the application interleaves
> draw and compute operations, because the L3 partitioning is global
> rather than per-pipeline and transitions are expensive -- Switching
> between L3 configuration requires a full pipeline stall and flushing and
> invalidation of all L3-backed caches, doing what you suggest would
> likely lead to cache-thrashing and might prevent pipelining of compute
> and render workloads [At least on BDW+ -- On Gen7 it might be impossible
> to achieve pipelining of render and compute workloads due to hardware
> issues].  If the result of the heuristic is based on the currently
> active pipeline some sort of hysteresis will likely be desirable, which
> in this patch I get for free by basing the calculation on the context
> state rather than on the currently active pipeline.  I agree with you
> that we'll probably need a more sophisticated heuristic in the future,
> but that falls outside the scope of this series -- If anything we need
> to be able to run benchmarks exercising CS and SLM before we can find
> out which kind of heuristic we want and what degree of hysteresis is
> desirable.

Switching between compute and 3d pipeline (PIPELINE_SELECT) also
requires flush+stall followed by invalidate PIPE_CONTROLs. It's
similar to the BRW_NEW_BATCH condition that's already in there in that
it makes reprogramming the L3 partition free.

This series is already way past what I would call a simple and correct
correct solution. If you want 'simple and correct' we can scale this
back to something that just enables DC and SLM when needed and do all
the weighting and heuristics when we have more data. On the other
hand, if you think the weighting machinery here is required, I don't
think you can argue that evaluating 3D and compute L3 requirements
separately is too complicated.

Kristian

>> Kristian
>>
>>> +   for (unsigned i = 0; i < ARRAY_SIZE(stage_states); i++) {
>>> +      const struct gl_shader_program *prog =
>>> +         brw->ctx._Shader->CurrentProgram[stage_states[i]->stage];
>>> +      const struct brw_stage_prog_data *prog_data = stage_states[i]->prog_data;
>>> +
>>> +      needs_dc |= (prog && prog->NumAtomicBuffers) ||
>>> +         (prog_data && (prog_data->total_scratch || prog_data->nr_image_params));
>>> +      needs_slm |= prog_data && prog_data->total_shared;
>>> +   }
>>> +
>>> +   return get_default_l3_weights(brw->intelScreen->devinfo,
>>> +                                 needs_dc, needs_slm);
>>> +}
>>> +
>>> +/**
>>>   * Program the hardware to use the specified L3 configuration.
>>>   */
>>>  static void
>>> --
>>> 2.6.2
>>>