[Mesa-dev] [PATCH v2 10/42] i965: Calculate appropriate L3 partition weights for the current pipeline state.
Francisco Jerez
currojerez at riseup.net
Fri Nov 20 04:46:37 PST 2015
Kristian Høgsberg <krh at bitplanet.net> writes:
> On Thu, Nov 19, 2015 at 4:24 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>> Kristian Høgsberg <krh at bitplanet.net> writes:
>>
>>> On Tue, Nov 17, 2015 at 9:54 PM, Jordan Justen
>>> <jordan.l.justen at intel.com> wrote:
>>>> From: Francisco Jerez <currojerez at riseup.net>
>>>>
>>>> This calculates a rather conservative partitioning of the L3 cache
>>>> based on the shaders currently bound to the pipeline and whether they
>>>> use SLM, atomics, images or scratch space. The result is intended to
>>>> be fine-tuned later on based on other pipeline state.
>>>> ---
>>>> src/mesa/drivers/dri/i965/brw_compiler.h | 1 +
>>>> src/mesa/drivers/dri/i965/gen7_l3_state.c | 53 +++++++++++++++++++++++++++++++
>>>> 2 files changed, 54 insertions(+)
>>>>
>>>> diff --git a/src/mesa/drivers/dri/i965/brw_compiler.h b/src/mesa/drivers/dri/i965/brw_compiler.h
>>>> index 8f147d3..ef8bddb 100644
>>>> --- a/src/mesa/drivers/dri/i965/brw_compiler.h
>>>> +++ b/src/mesa/drivers/dri/i965/brw_compiler.h
>>>> @@ -300,6 +300,7 @@ struct brw_stage_prog_data {
>>>>
>>>> unsigned curb_read_length;
>>>> unsigned total_scratch;
>>>> + unsigned total_shared;
>>>>
>>>> /**
>>>> * Register where the thread expects to find input data from the URB
>>>> diff --git a/src/mesa/drivers/dri/i965/gen7_l3_state.c b/src/mesa/drivers/dri/i965/gen7_l3_state.c
>>>> index 4d0cfcd..1a88261 100644
>>>> --- a/src/mesa/drivers/dri/i965/gen7_l3_state.c
>>>> +++ b/src/mesa/drivers/dri/i965/gen7_l3_state.c
>>>> @@ -258,6 +258,59 @@ get_l3_config(const struct brw_device_info *devinfo, struct brw_l3_weights w0)
>>>> }
>>>>
>>>> /**
>>>> + * Return a reasonable default L3 configuration for the specified device based
>>>> + * on whether SLM and DC are required. In the non-SLM non-DC case the result
>>>> + * is intended to approximately resemble the hardware defaults.
>>>> + */
>>>> +static struct brw_l3_weights
>>>> +get_default_l3_weights(const struct brw_device_info *devinfo,
>>>> + bool needs_dc, bool needs_slm)
>>>> +{
>>>> + struct brw_l3_weights w = {{ 0 }};
>>>> +
>>>> + w.w[L3P_SLM] = needs_slm;
>>>> + w.w[L3P_URB] = 1.0;
>>>> +
>>>> + if (devinfo->gen >= 8) {
>>>> + w.w[L3P_ALL] = 1.0;
>>>> + } else {
>>>> + w.w[L3P_DC] = needs_dc ? 0.1 : 0;
>>>> + w.w[L3P_RO] = devinfo->is_baytrail ? 0.5 : 1.0;
>>>> + }
>>>> +
>>>> + return norm_l3_weights(w);
>>>> +}
>>>> +
>>>> +/**
>>>> + * Calculate the desired L3 partitioning based on the current state of the
>>>> + * pipeline. For now this simply returns the conservative defaults calculated
>>>> + * by get_default_l3_weights(), but we could probably do better by gathering
>>>> + * more statistics from the pipeline state (e.g. guess of expected URB usage
>>>> + * and bound surfaces), or by using feed-back from performance counters.
>>>> + */
>>>> +static struct brw_l3_weights
>>>> +get_pipeline_state_l3_weights(const struct brw_context *brw)
>>>> +{
>>>> + const struct brw_stage_state *stage_states[] = {
>>>> + &brw->vs.base, &brw->gs.base, &brw->wm.base, &brw->cs.base
>>>> + };
>>>> + bool needs_dc = false, needs_slm = false;
>>>
>>> This doesn't seem optimal - we should evaluate the 3D pipe and the
>>> compute pipe separately depending on which one is active. For
>>> example, if we have a current compute program that uses SLM, but are
>>> using the 3D pipeline, we'll get a partition that includes SLM even
>>> for the 3D pipe.
>>>
>> The intention of this patch is not to provide an optimal heuristic, but
>> to implement a simple heuristic that calculates conservative defaults in
>> order to guarantee functional correctness. It would be possible to base
>> the result on the currently active pipeline with minimal changes to this
>> function (and making sure that the L3 config atom is invalidated while
>> switching pipelines), but I don't think we want to switch back and forth
>> between SLM and non-SLM configurations if the application interleaves
>> draw and compute operations, because the L3 partitioning is global
>> rather than per-pipeline and transitions are expensive -- Switching
>> between L3 configurations requires a full pipeline stall and flushing and
>> invalidation of all L3-backed caches, so doing what you suggest would
>> likely lead to cache-thrashing and might prevent pipelining of compute
>> and render workloads [At least on BDW+ -- On Gen7 it might be impossible
>> to achieve pipelining of render and compute workloads due to hardware
>> issues]. If the result of the heuristic is based on the currently
>> active pipeline some sort of hysteresis will likely be desirable, which
>> in this patch I get for free by basing the calculation on the context
>> state rather than on the currently active pipeline. I agree with you
>> that we'll probably need a more sophisticated heuristic in the future,
>> but that falls outside the scope of this series -- If anything we need
>> to be able to run benchmarks exercising CS and SLM before we can find
>> out which kind of heuristic we want and what degree of hysteresis is
>> desirable.
>
> Switching between compute and 3d pipeline (PIPELINE_SELECT) also
> requires flush+stall followed by invalidate PIPE_CONTROLs. It's
> similar to the BRW_NEW_BATCH condition that's already in there in that
> it makes reprogramming the L3 partition free.
>
The stall is only necessary on Gen7. Since Gen8 it's not strictly
required AFAIK, and even if you do stall there's no need to drain any
L3-backed caches.

> This series is already way past what I would call a simple and correct
> solution. If you want 'simple and correct' we can scale this
> back to something that just enables DC and SLM when needed and do all
> the weighting and heuristics when we have more data. On the other
> hand, if you think the weighting machinery here is required, I don't
> think you can argue that evaluating 3D and compute L3 requirements
> separately is too complicated.
>
I didn't mean that making the heuristic depend on the currently active
pipeline the naive way would be too complicated; as I said, the changes
to this patch would be minimal. I'm just saying that I have no evidence
that it would be an optimization rather than a pessimization -- Otherwise
I'd be glad to implement it.
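
To make the hysteresis idea discussed above concrete, here is a minimal
sketch. It is not part of this series: the struct, the function and the
idle_limit threshold are all hypothetical, and a suitable threshold could
only be chosen after benchmarking CS and SLM workloads. The sketch keeps
the SLM partition enabled until a number of consecutive evaluations have
not needed it, so interleaved draw and compute operations do not cause a
reconfiguration on every switch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tracker; nothing like it exists in this series. */
struct slm_hysteresis {
   bool slm_enabled;     /* Current partitioning includes SLM. */
   unsigned idle_count;  /* Consecutive evaluations that did not need SLM. */
   unsigned idle_limit;  /* Evaluations to wait before dropping SLM. */
};

/*
 * Update the tracker with the SLM requirement of the current evaluation.
 * Returns true if the L3 partitioning would have to be reprogrammed.
 */
static bool
slm_hysteresis_update(struct slm_hysteresis *h, bool needs_slm)
{
   assert(h);

   if (needs_slm) {
      /* Any SLM use resets the idle counter. */
      h->idle_count = 0;

      if (!h->slm_enabled) {
         h->slm_enabled = true;
         return true;
      }

      return false;
   }

   /* Only drop SLM after it has been unused for idle_limit evaluations. */
   if (h->slm_enabled && ++h->idle_count >= h->idle_limit) {
      h->slm_enabled = false;
      h->idle_count = 0;
      return true;
   }

   return false;
}
```

With idle_limit set to 3, a compute dispatch using SLM followed by one or
two draws and another dispatch triggers a single reconfiguration (the
initial enable), whereas three draws in a row would drop SLM again.
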
> Kristian
>
>>> Kristian
>>>
>>>> + for (unsigned i = 0; i < ARRAY_SIZE(stage_states); i++) {
>>>> + const struct gl_shader_program *prog =
>>>> + brw->ctx._Shader->CurrentProgram[stage_states[i]->stage];
>>>> + const struct brw_stage_prog_data *prog_data = stage_states[i]->prog_data;
>>>> +
>>>> + needs_dc |= (prog && prog->NumAtomicBuffers) ||
>>>> + (prog_data && (prog_data->total_scratch || prog_data->nr_image_params));
>>>> + needs_slm |= prog_data && prog_data->total_shared;
>>>> + }
>>>> +
>>>> + return get_default_l3_weights(brw->intelScreen->devinfo,
>>>> + needs_dc, needs_slm);
>>>> +}
>>>> +
>>>> +/**
>>>> * Program the hardware to use the specified L3 configuration.
>>>> */
>>>> static void
>>>> --
>>>> 2.6.2
>>>>
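
As a worked example of the default weights computed by
get_default_l3_weights() in the quoted patch, the following standalone
sketch reproduces the calculation outside the driver. The enum layout,
the struct and function names, and the plain gen/is_baytrail parameters
are stand-ins for illustration. It also assumes that norm_l3_weights()
scales the weights so that they sum to 1, which the quoted hunk does not
show.

```c
#include <assert.h>
#include <stdbool.h>

/* Partition names mirror the patch; the enum layout itself is assumed. */
enum l3_partition { L3P_SLM, L3P_URB, L3P_ALL, L3P_DC, L3P_RO, NUM_L3P };

struct l3_weights {
   float w[NUM_L3P];
};

/* Assumed behaviour of norm_l3_weights(): scale so the weights sum to 1. */
static struct l3_weights
norm_weights(struct l3_weights w)
{
   float sum = 0;

   for (unsigned i = 0; i < NUM_L3P; i++)
      sum += w.w[i];

   assert(sum > 0);

   for (unsigned i = 0; i < NUM_L3P; i++)
      w.w[i] /= sum;

   return w;
}

/* Same logic as get_default_l3_weights(), with devinfo flattened out. */
static struct l3_weights
default_weights(int gen, bool is_baytrail, bool needs_dc, bool needs_slm)
{
   struct l3_weights w = {{ 0 }};

   w.w[L3P_SLM] = needs_slm;
   w.w[L3P_URB] = 1.0f;

   if (gen >= 8) {
      /* Gen8+ uses a single unified partition for everything else. */
      w.w[L3P_ALL] = 1.0f;
   } else {
      w.w[L3P_DC] = needs_dc ? 0.1f : 0;
      w.w[L3P_RO] = is_baytrail ? 0.5f : 1.0f;
   }

   return norm_weights(w);
}
```

Under these assumptions, default_weights(8, false, false, true) gives
SLM, URB and ALL a weight of 1/3 each, while
default_weights(7, false, true, false) gives URB and RO roughly 0.476
each and DC roughly 0.048 (1/2.1, 1/2.1 and 0.1/2.1).
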