[Mesa-dev] [PATCH v2 10/42] i965: Calculate appropriate L3 partition weights for the current pipeline state.

Fri Nov 20 04:46:37 PST 2015

Kristian Høgsberg <krh at bitplanet.net> writes:

> On Thu, Nov 19, 2015 at 4:24 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>> Kristian Høgsberg <krh at bitplanet.net> writes:
>>
>>> On Tue, Nov 17, 2015 at 9:54 PM, Jordan Justen
>>> <jordan.l.justen at intel.com> wrote:
>>>> From: Francisco Jerez <currojerez at riseup.net>
>>>>
>>>> This calculates a rather conservative partitioning of the L3 cache
>>>> based on the shaders currently bound to the pipeline and whether they
>>>> use SLM, atomics, images or scratch space.  The result is intended to
>>>> be fine-tuned later on based on other pipeline state.
>>>> ---
>>>>  src/mesa/drivers/dri/i965/brw_compiler.h  |  1 +
>>>>  src/mesa/drivers/dri/i965/gen7_l3_state.c | 53 +++++++++++++++++++++++++++++++
>>>>  2 files changed, 54 insertions(+)
>>>>
>>>> diff --git a/src/mesa/drivers/dri/i965/brw_compiler.h b/src/mesa/drivers/dri/i965/brw_compiler.h
>>>> index 8f147d3..ef8bddb 100644
>>>> --- a/src/mesa/drivers/dri/i965/brw_compiler.h
>>>> +++ b/src/mesa/drivers/dri/i965/brw_compiler.h
>>>> @@ -300,6 +300,7 @@ struct brw_stage_prog_data {
>>>>
>>>>     unsigned curb_read_length;
>>>>     unsigned total_scratch;
>>>> +   unsigned total_shared;
>>>>
>>>>     /**
>>>>      * Register where the thread expects to find input data from the URB
>>>> diff --git a/src/mesa/drivers/dri/i965/gen7_l3_state.c b/src/mesa/drivers/dri/i965/gen7_l3_state.c
>>>> index 4d0cfcd..1a88261 100644
>>>> --- a/src/mesa/drivers/dri/i965/gen7_l3_state.c
>>>> +++ b/src/mesa/drivers/dri/i965/gen7_l3_state.c
>>>> @@ -258,6 +258,59 @@ get_l3_config(const struct brw_device_info *devinfo, struct brw_l3_weights w0)
>>>>  }
>>>>
>>>>  /**
>>>> + * Return a reasonable default L3 configuration for the specified device based
>>>> + * on whether SLM and DC are required.  In the non-SLM non-DC case the result
>>>> + * is intended to approximately resemble the hardware defaults.
>>>> + */
>>>> +static struct brw_l3_weights
>>>> +get_default_l3_weights(const struct brw_device_info *devinfo,
>>>> +                       bool needs_dc, bool needs_slm)
>>>> +{
>>>> +   struct brw_l3_weights w = {{ 0 }};
>>>> +
>>>> +   w.w[L3P_SLM] = needs_slm;
>>>> +   w.w[L3P_URB] = 1.0;
>>>> +
>>>> +   if (devinfo->gen >= 8) {
>>>> +      w.w[L3P_ALL] = 1.0;
>>>> +   } else {
>>>> +      w.w[L3P_DC] = needs_dc ? 0.1 : 0;
>>>> +      w.w[L3P_RO] = devinfo->is_baytrail ? 0.5 : 1.0;
>>>> +   }
>>>> +
>>>> +   return norm_l3_weights(w);
>>>> +}
>>>> +
>>>> +/**
>>>> + * Calculate the desired L3 partitioning based on the current state of the
>>>> + * pipeline.  For now this simply returns the conservative defaults calculated
>>>> + * by get_default_l3_weights(), but we could probably do better by gathering
>>>> + * more statistics from the pipeline state (e.g. guess of expected URB usage
>>>> + * and bound surfaces), or by using feed-back from performance counters.
>>>> + */
>>>> +static struct brw_l3_weights
>>>> +get_pipeline_state_l3_weights(const struct brw_context *brw)
>>>> +{
>>>> +   const struct brw_stage_state *stage_states[] = {
>>>> +      &brw->vs.base, &brw->gs.base, &brw->wm.base, &brw->cs.base
>>>> +   };
>>>> +   bool needs_dc = false, needs_slm = false;
>>>
>>> This doesn't seem optimal - we should evaluate the 3D pipe and the
>>> compute pipe separately depending on which  one is active. For
>>> example, if we have a current compute program that uses SLM, but are
>>> using the 3D pipeline, we'll get a partition that includes SLM even
>>> for the 3D pipe.
>>>
>> The intention of this patch is not to provide an optimal heuristic, but
>> to implement a simple heuristic that calculates conservative defaults in
>> order to guarantee functional correctness.  It would be possible to base
>> the result on the currently active pipeline with minimal changes to this
>> function (and making sure that the L3 config atom is invalidated while
>> switching pipelines), but I don't think we want to switch back and forth
>> between SLM and non-SLM configurations if the application interleaves
>> draw and compute operations, because the L3 partitioning is global
>> rather than per-pipeline and transitions are expensive -- Switching
>> between L3 configuration requires a full pipeline stall and flushing and
>> invalidation of all L3-backed caches, doing what you suggest would
>> likely lead to cache-thrashing and might prevent pipelining of compute
>> and render workloads [At least on BDW+ -- On Gen7 it might be impossible
>> to achieve pipelining of render and compute workloads due to hardware
>> issues].  If the result of the heuristic is based on the currently
>> active pipeline some sort of hysteresis will likely be desirable, which
>> in this patch I get for free by basing the calculation on the context
>> state rather than on the currently active pipeline.  I agree with you
>> that we'll probably need a more sophisticated heuristic in the future,
>> but that falls outside the scope of this series -- If anything we need
>> to be able to run benchmarks exercising CS and SLM before we can find
>> out which kind of heuristic we want and what degree of hysteresis is
>> desirable.
>
> Switching between compute and 3d pipeline (PIPELINE_SELECT) also
> requires flush+stall followed by invalidate PIPE_CONTROLs. It's
> similar to the BRW_NEW_BATCH condition that's already in there in that
> it makes reprogramming the L3 partition free.
>

The stall is only necessary on Gen7, since Gen8 it's not strictly
required AFAIK and even if you do stall there's no need to drain any
L3-backed caches.

> This series is already way past what I would call a simple and correct
> correct solution. If you want 'simple and correct' we can scale this
> back to something that just enables DC and SLM when needed and do all
> the weighting and heuristics when we have more data. On the other
> hand, if you think the weighting machinery here is required, I don't
> think you can argue that evaluating 3D and compute L3 requirements
> separately is too complicated.
>
I didn't mean that making the heuristic depend on the currently active
pipeline the naive way would be too complicated, as I said the changes
to this patch would be minimal.  I'm just saying that I have no evidence
that it would be an optimization rather than a pessimization -- Otherwise
I'd be glad to implement it.

> Kristian
>
>>> Kristian
>>>
>>>> +   for (unsigned i = 0; i < ARRAY_SIZE(stage_states); i++) {
>>>> +      const struct gl_shader_program *prog =
>>>> +         brw->ctx._Shader->CurrentProgram[stage_states[i]->stage];
>>>> +      const struct brw_stage_prog_data *prog_data = stage_states[i]->prog_data;
>>>> +
>>>> +      needs_dc |= (prog && prog->NumAtomicBuffers) ||
>>>> +         (prog_data && (prog_data->total_scratch || prog_data->nr_image_params));
>>>> +      needs_slm |= prog_data && prog_data->total_shared;
>>>> +   }
>>>> +
>>>> +   return get_default_l3_weights(brw->intelScreen->devinfo,
>>>> +                                 needs_dc, needs_slm);
>>>> +}
>>>> +
>>>> +/**
>>>>   * Program the hardware to use the specified L3 configuration.
>>>>   */
>>>>  static void
>>>> --
>>>> 2.6.2
>>>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20151120/ca908bf7/attachment-0001.sig>