[Mesa-dev] [PATCH 5/6] gallium/radeon: add a heuristic dynamically enabling DCC for scanout surfaces

Mon Jun 27 18:41:28 UTC 2016

On Mon, Jun 27, 2016 at 10:35 AM, Nicolai Hähnle <nhaehnle at gmail.com> wrote:
> On 24.06.2016 20:48, Marek Olšák wrote:
>>
>> On Fri, Jun 24, 2016 at 1:09 PM, Nicolai Hähnle <nhaehnle at gmail.com>
>> wrote:
>>>
>>> On 22.06.2016 20:29, Marek Olšák wrote:
>>>>
>>>>
>>>> From: Marek Olšák <marek.olsak at amd.com>
>>>>
>>>> DCC for displayable surfaces is allocated in a separate buffer and is
>>>> enabled or disabled based on PS invocations from 2 frames ago (to let
>>>> queries go idle) and the number of slow clears from the current frame.
>>>>
>>>> At least an equivalent of 5 fullscreen draws or slow clears must be done
>>>> to enable DCC. (PS invocations / (width * height) + num_slow_clears >=
>>>> 5)
>>>>
>>>> Pipeline statistic queries are always active if a color buffer that can
>>>> have separate DCC is bound, even if separate DCC is disabled. That means
>>>> the window color buffer is always monitored and DCC is enabled only when
>>>> the situation is right.
>>>>
>>>> The tracking of per-texture queries in r600_common_context is quite
>>>> ugly,
>>>> but I don't see a better way.
>>>>
>>>> The first fast clear always enables DCC. DCC decompression can disable
>>>> it.
>>>> A later fast clear can enable it again. Enable/disable typically happens
>>>> only once per frame.
>>>>
>>>> The impact is expected to be negligible because games usually don't have
>>>> a high level of overdraw. DCC usually activates when too much blending
>>>> is happening (smoke rendering) or when testing glClear performance and
>>>> CMASK isn't supported (Stoney).
>>>
>>>
>>>
>>> Nice stuff. One corner case to think of: what happens when DCC is enabled
>>> for a texture that is currently bound? Needs the same treatment as when
>>> DCC
>>> is disabled, right?
>>
>>
>> Not sure what you mean, can you be more specific? Note that it can't
>> be bound as a sampler by apps because it's a window framebuffer.
>
>
> Okay, that makes sense.
>
>
>>>
>>> More comments below...
>>>
>>>
>>>> ---
>>>>    src/gallium/drivers/radeon/r600_pipe_common.c |  15 ++
>>>>    src/gallium/drivers/radeon/r600_pipe_common.h |  40 +++++
>>>>    src/gallium/drivers/radeon/r600_texture.c     | 239
>>>> ++++++++++++++++++++++++++
>>>>    src/gallium/drivers/radeonsi/si_blit.c        |  14 +-
>>>>    src/gallium/drivers/radeonsi/si_state.c       |  15 ++
>>>>    src/gallium/drivers/radeonsi/si_state_draw.c  |   5 +-
>>>>    6 files changed, 326 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/src/gallium/drivers/radeon/r600_pipe_common.c
>>>> b/src/gallium/drivers/radeon/r600_pipe_common.c
>>>> index 5d4a679..66afcfa 100644
>>>> --- a/src/gallium/drivers/radeon/r600_pipe_common.c
>>>> +++ b/src/gallium/drivers/radeon/r600_pipe_common.c
>>>> @@ -397,6 +397,21 @@ bool r600_common_context_init(struct
>>>> r600_common_context *rctx,
>>>>
>>>>    void r600_common_context_cleanup(struct r600_common_context *rctx)
>>>>    {
>>>> +       unsigned i,j;
>>>> +
>>>> +       /* Release DCC stats. */
>>>> +       for (i = 0; i < ARRAY_SIZE(rctx->dcc_stats); i++) {
>>>> +               assert(!rctx->dcc_stats[i].query_active);
>>>> +
>>>> +               for (j = 0; j < ARRAY_SIZE(rctx->dcc_stats[i].ps_stats);
>>>> j++)
>>>> +                       if (rctx->dcc_stats[i].ps_stats[j])
>>>> +                               rctx->b.destroy_query(&rctx->b,
>>>> +
>>>> rctx->dcc_stats[i].ps_stats[j]);
>>>> +
>>>> +               pipe_resource_reference((struct pipe_resource**)
>>>> +                                       &rctx->dcc_stats[i].tex, NULL);
>>>> +       }
>>>> +
>>>>          if (rctx->gfx.cs)
>>>>                  rctx->ws->cs_destroy(rctx->gfx.cs);
>>>>          if (rctx->dma.cs)
>>>> diff --git a/src/gallium/drivers/radeon/r600_pipe_common.h
>>>> b/src/gallium/drivers/radeon/r600_pipe_common.h
>>>> index 92cba13..cdec907 100644
>>>> --- a/src/gallium/drivers/radeon/r600_pipe_common.h
>>>> +++ b/src/gallium/drivers/radeon/r600_pipe_common.h
>>>> @@ -272,6 +272,25 @@ struct r600_texture {
>>>>           * dcc_offset contains the absolute GPUVM address, not the
>>>> relative one.
>>>>           */
>>>>          struct r600_resource            *dcc_separate_buffer;
>>>> +       /* When DCC is temporarily disabled, the separate buffer is
>>>> here.
>>>> */
>>>> +       struct r600_resource            *last_dcc_separate_buffer;
>>>> +       /* We need to track DCC dirtiness, because st/dri usually calls
>>>> +        * flush_resource twice per frame (not a bug) and we don't wanna
>>>> +        * decompress DCC twice. Also, the dirty tracking must be done
>>>> even
>>>> +        * if DCC isn't used, because it's required by the DCC usage
>>>> analysis
>>>> +        * for a possible future enablement.
>>>> +        */
>>>> +       bool                            separate_dcc_dirty;
>>>> +       /* Statistics gathering for the DCC enablement heuristic. */
>>>> +       bool                            dcc_gather_statistics;
>>>> +       /* Estimate of how much this color buffer is written to in units
>>>> of
>>>> +        * full-screen draws: ps_invocations / (width * height)
>>>> +        * Shader kills, late Z, and blending with trivial discards make
>>>> it
>>>> +        * inaccurate (we need to count CB updates, not PS invocations).
>>>> +        */
>>>> +       unsigned                        ps_draw_ratio;
>>>> +       /* The number of clears since the last DCC usage analysis. */
>>>> +       unsigned                        num_slow_clears;
>>>>
>>>>          /* Counter that should be non-zero if the texture is bound to a
>>>>           * framebuffer. Implemented in radeonsi only.
>>>> @@ -536,6 +555,21 @@ struct r600_common_context {
>>>>          float                           sample_locations_8x[8][2];
>>>>          float                           sample_locations_16x[16][2];
>>>>
>>>> +       /* Statistics gathering for the DCC enablement heuristic. It
>>>> can't
>>>> be
>>>> +        * in r600_texture because r600_texture can be shared by
>>>> multiple
>>>> +        * contexts. This is for back buffers only. We shouldn't get too
>>>> many
>>>> +        * of those.
>>>> +        */
>>>> +       struct {
>>>> +               struct r600_texture             *tex;
>>>> +               /* Query queue: 0 = usually active, 1 = waiting, 2 =
>>>> readback. */
>>>> +               struct pipe_query               *ps_stats[3];
>>>> +               /* If all slots are used and another slot is needed,
>>>> +                * the least recently used slot is evicted based on
>>>> this.
>>>> */
>>>> +               int64_t                         last_use_timestamp;
>>>> +               bool                            query_active;
>>>> +       } dcc_stats[2];
>>>> +
>>>>          /* The list of all texture buffer objects in this context.
>>>>           * This list is walked when a buffer is invalidated/reallocated
>>>> and
>>>>           * the GPU addresses are updated. */
>>>> @@ -688,6 +722,12 @@ struct pipe_surface
>>>> *r600_create_surface_custom(struct pipe_context *pipe,
>>>>                                                  const struct
>>>> pipe_surface
>>>> *templ,
>>>>                                                  unsigned width,
>>>> unsigned
>>>> height);
>>>>    unsigned r600_translate_colorswap(enum pipe_format format, bool
>>>> do_endian_swap);
>>>> +void vi_separate_dcc_start_query(struct pipe_context *ctx,
>>>> +                                struct r600_texture *tex);
>>>> +void vi_separate_dcc_stop_query(struct pipe_context *ctx,
>>>> +                               struct r600_texture *tex);
>>>> +void vi_separate_dcc_analyze_stats(struct pipe_context *ctx,
>>>> +                                  struct r600_texture *tex);
>>>>    void vi_dcc_clear_level(struct r600_common_context *rctx,
>>>>                          struct r600_texture *rtex,
>>>>                          unsigned level, unsigned clear_value);
>>>> diff --git a/src/gallium/drivers/radeon/r600_texture.c
>>>> b/src/gallium/drivers/radeon/r600_texture.c
>>>> index 23be5ed..7295ab6 100644
>>>> --- a/src/gallium/drivers/radeon/r600_texture.c
>>>> +++ b/src/gallium/drivers/radeon/r600_texture.c
>>>> @@ -26,9 +26,11 @@
>>>>     */
>>>>    #include "r600_pipe_common.h"
>>>>    #include "r600_cs.h"
>>>> +#include "r600_query.h"
>>>>    #include "util/u_format.h"
>>>>    #include "util/u_memory.h"
>>>>    #include "util/u_pack_color.h"
>>>> +#include "os/os_time.h"
>>>>    #include <errno.h>
>>>>    #include <inttypes.h>
>>>>
>>>> @@ -567,6 +569,7 @@ static void r600_texture_destroy(struct pipe_screen
>>>> *screen,
>>>>          }
>>>>          pb_reference(&resource->buf, NULL);
>>>>          r600_resource_reference(&rtex->dcc_separate_buffer, NULL);
>>>> +       r600_resource_reference(&rtex->last_dcc_separate_buffer, NULL);
>>>>          FREE(rtex);
>>>>    }
>>>>
>>>> @@ -1017,6 +1020,7 @@ r600_texture_create_object(struct pipe_screen
>>>> *screen,
>>>>          rtex->non_disp_tiling = rtex->is_depth &&
>>>> rtex->surface.level[0].mode >= RADEON_SURF_MODE_1D;
>>>>          /* Applies to GCN. */
>>>>          rtex->last_msaa_resolve_target_micro_mode =
>>>> rtex->surface.micro_tile_mode;
>>>> +       rtex->ps_draw_ratio = 100; /* start with a sufficiently high
>>>> number */
>>>>
>>>>          if (rtex->is_depth) {
>>>>                  if (!(base->flags & (R600_RESOURCE_FLAG_TRANSFER |
>>>> @@ -1705,6 +1709,224 @@ unsigned r600_translate_colorswap(enum
>>>> pipe_format
>>>> format, bool do_endian_swap)
>>>>          return ~0U;
>>>>    }
>>>>
>>>> +/* PIPELINE_STAT-BASED DCC ENABLEMENT FOR DISPLAYABLE SURFACES */
>>>> +
>>>> +/**
>>>> + * Return the per-context slot where DCC statistics queries for the
>>>> texture live.
>>>> + */
>>>> +static unsigned vi_get_context_dcc_stats_index(struct
>>>> r600_common_context
>>>> *rctx,
>>>> +                                              struct r600_texture *tex)
>>>> +{
>>>> +       int i, empty_slot = -1;
>>>> +
>>>> +       for (i = 0; i < ARRAY_SIZE(rctx->dcc_stats); i++) {
>>>> +               /* Return if found. */
>>>> +               if (rctx->dcc_stats[i].tex == tex) {
>>>> +                       rctx->dcc_stats[i].last_use_timestamp =
>>>> os_time_get();
>>>> +                       return i;
>>>> +               }
>>>> +
>>>> +               /* Record the first seen empty slot. */
>>>> +               if (empty_slot == -1 && !rctx->dcc_stats[i].tex)
>>>> +                       empty_slot = i;
>>>> +       }
>>>> +
>>>> +       /* Not found. Remove the oldest member to make space in the
>>>> array.
>>>> */
>>>> +       if (empty_slot == -1) {
>>>> +               int oldest_slot = 0;
>>>> +
>>>> +               /* Find the oldest slot. */
>>>> +               for (i = 1; i < ARRAY_SIZE(rctx->dcc_stats); i++)
>>>> +                       if
>>>> (rctx->dcc_stats[oldest_slot].last_use_timestamp >
>>>> +                           rctx->dcc_stats[i].last_use_timestamp)
>>>> +                               oldest_slot = i;
>>>> +
>>>> +               /* Clean up the oldest slot. */
>>>> +               if (rctx->dcc_stats[oldest_slot].query_active)
>>>> +                       vi_separate_dcc_stop_query(&rctx->b,
>>>> +
>>>> rctx->dcc_stats[oldest_slot].tex);
>>>> +
>>>> +               for (i = 0; i <
>>>> ARRAY_SIZE(rctx->dcc_stats[oldest_slot].ps_stats); i++)
>>>> +                       if (rctx->dcc_stats[oldest_slot].ps_stats[i]) {
>>>> +                               rctx->b.destroy_query(&rctx->b,
>>>> +
>>>> rctx->dcc_stats[oldest_slot].ps_stats[i]);
>>>> +                               rctx->dcc_stats[oldest_slot].ps_stats[i]
>>>> =
>>>> NULL;
>>>> +                       }
>>>> +
>>>> +               pipe_resource_reference((struct pipe_resource**)
>>>> +
>>>> &rctx->dcc_stats[oldest_slot].tex,
>>>> NULL);
>>>> +               empty_slot = oldest_slot;
>>>> +       }
>>>> +
>>>> +       /* Add the texture to the new slot. */
>>>> +       pipe_resource_reference((struct
>>>> pipe_resource**)&rctx->dcc_stats[empty_slot].tex,
>>>> +                               &tex->resource.b.b);
>>>> +       rctx->dcc_stats[empty_slot].last_use_timestamp = os_time_get();
>>>> +       return empty_slot;
>>>> +}
>>>> +
>>>> +static struct pipe_query *
>>>> +vi_create_resuming_pipestats_query(struct pipe_context *ctx)
>>>> +{
>>>> +       struct r600_query_hw *query = (struct r600_query_hw*)
>>>> +               ctx->create_query(ctx, PIPE_QUERY_PIPELINE_STATISTICS,
>>>> 0);
>>>> +
>>>> +       query->flags |= R600_QUERY_HW_FLAG_BEGIN_RESUMES;
>>>> +       return (struct pipe_query*)query;
>>>> +}
>>>> +
>>>> +/**
>>>> + * Called when binding a color buffer.
>>>> + */
>>>> +void vi_separate_dcc_start_query(struct pipe_context *ctx,
>>>> +                                struct r600_texture *tex)
>>>> +{
>>>> +       struct r600_common_context *rctx = (struct
>>>> r600_common_context*)ctx;
>>>> +       unsigned i = vi_get_context_dcc_stats_index(rctx, tex);
>>>> +
>>>> +       assert(!rctx->dcc_stats[i].query_active);
>>>> +
>>>> +       if (!rctx->dcc_stats[i].ps_stats[0])
>>>> +               rctx->dcc_stats[i].ps_stats[0] =
>>>> vi_create_resuming_pipestats_query(ctx);
>>>> +
>>>> +       /* begin or resume the query */
>>>> +       ctx->begin_query(ctx, rctx->dcc_stats[i].ps_stats[0]);
>>>> +       rctx->dcc_stats[i].query_active = true;
>>>> +}
>>>> +
>>>> +/**
>>>> + * Called when unbinding a color buffer.
>>>> + */
>>>> +void vi_separate_dcc_stop_query(struct pipe_context *ctx,
>>>> +                               struct r600_texture *tex)
>>>> +{
>>>> +       struct r600_common_context *rctx = (struct
>>>> r600_common_context*)ctx;
>>>> +       unsigned i = vi_get_context_dcc_stats_index(rctx, tex);
>>>> +
>>>> +       assert(rctx->dcc_stats[i].query_active);
>>>> +       assert(rctx->dcc_stats[i].ps_stats[0]);
>>>> +
>>>> +       /* pause or end the query */
>>>> +       ctx->end_query(ctx, rctx->dcc_stats[i].ps_stats[0]);
>>>> +       rctx->dcc_stats[i].query_active = false;
>>>> +}
>>>> +
>>>> +static bool vi_can_enable_separate_dcc(struct r600_texture *tex)
>>>> +{
>>>> +       /* The minimum number of fullscreen draws per frame that is
>>>> required
>>>> +        * to enable DCC. */
>>>> +       return tex->ps_draw_ratio + tex->num_slow_clears >= 5;
>>>> +}
>>>
>>>
>>>
>>> Rename this function to vi_should_enable_separate_dcc or similar.
>>
>>
>> OK.
>>
>>>
>>>
>>>> +
>>>> +/* Called by fast clear. */
>>>> +static void vi_separate_dcc_try_enable(struct r600_common_context
>>>> *rctx,
>>>> +                                      struct r600_texture *tex)
>>>> +{
>>>> +       /* The intent is to use this with shared displayable back
>>>> buffers,
>>>> +        * but it's not strictly limited only to them.
>>>> +        */
>>>> +       if (!tex->resource.is_shared ||
>>>> +           !(tex->resource.external_usage &
>>>> PIPE_HANDLE_USAGE_EXPLICIT_FLUSH) ||
>>>> +           tex->resource.b.b.target != PIPE_TEXTURE_2D ||
>>>> +           tex->surface.last_level > 0 ||
>>>> +           !tex->surface.dcc_size)
>>>> +               return;
>>>> +
>>>> +       if (tex->dcc_offset)
>>>> +               return; /* already enabled */
>>>> +
>>>> +       if (!vi_can_enable_separate_dcc(tex))
>>>> +               return; /* stats show that DCC decompression is too
>>>> expensive */
>>>> +
>>>> +       assert(tex->surface.level[0].dcc_enabled);
>>>> +       assert(!tex->dcc_separate_buffer);
>>>> +
>>>> +       r600_texture_discard_cmask(rctx->screen, tex);
>>>> +
>>>> +       /* Get a DCC buffer. */
>>>> +       if (tex->last_dcc_separate_buffer) {
>>>> +               tex->dcc_separate_buffer =
>>>> tex->last_dcc_separate_buffer;
>>>> +               tex->last_dcc_separate_buffer = NULL;
>>>> +       } else {
>>>> +               tex->dcc_separate_buffer = (struct r600_resource*)
>>>> +                       r600_aligned_buffer_create(rctx->b.screen, 0,
>>>> +                                                  PIPE_USAGE_DEFAULT,
>>>> +
>>>> tex->surface.dcc_size,
>>>> +
>>>> tex->surface.dcc_alignment);
>>>> +               if (!tex->dcc_separate_buffer)
>>>> +                       return;
>>>> +
>>>> +               /* Enabling for the first time, so start the query. */
>>>> +               tex->dcc_gather_statistics = true;
>>>> +               vi_separate_dcc_start_query(&rctx->b, tex);
>>>
>>>
>>>
>>> I think it would be cleaner to put the statistics enablement above the
>>> vi_can_enable_separate_dcc/vi_should_enable_separate_dcc check, since
>>> then
>>> the code doesn't rely on a magic initialization of ps_ratio. Generally it
>>> would separate the statistics gathering more cleanly from the
>>> enable/disable.
>>
>>
>> The magic initilization of ps_ratio is required for tests that don't
>> call SwapBuffers or call it too late. I understand the idea about
>> separating the statistics gathering, but do we need it separate? There
>> is no reason to gather statistics before the first clear.
>
>
> Hmm, so the idea is that you already want DCC to be enabled "optimistically"
> on the first clear? That makes sense. I think it would be good to clarify
> that in the comment where ps_ratio is initialized, and then the code here
> could stay as-is.
>
>
>>>
>>>> +       }
>>>> +
>>>> +       /* dcc_offset is the absolute GPUVM address. */
>>>> +       tex->dcc_offset = tex->dcc_separate_buffer->gpu_address;
>>>> +
>>>> +       /* no need to flag anything since this is called by fast clear
>>>> that
>>>> +        * flags framebuffer state
>>>> +        */
>>>> +}
>>>> +
>>>> +/**
>>>> + * Called by pipe_context::flush_resource, the place where DCC
>>>> decompression
>>>> + * takes place.
>>>> + */
>>>> +void vi_separate_dcc_analyze_stats(struct pipe_context *ctx,
>>>> +                                  struct r600_texture *tex)
>>>
>>>
>>>
>>> Bike-shedding: I'd prefer a name like _gather_stats or _collect_stats.
>>>
>>>> +{
>>>> +       struct r600_common_context *rctx = (struct
>>>> r600_common_context*)ctx;
>>>> +       unsigned i = vi_get_context_dcc_stats_index(rctx, tex);
>>>> +       bool query_active = rctx->dcc_stats[i].query_active;
>>>> +       bool disable = false;
>>>> +
>>>> +       if (rctx->dcc_stats[i].ps_stats[2]) {
>>>> +               union pipe_query_result result;
>>>> +
>>>> +               /* Read the results. */
>>>> +               ctx->get_query_result(ctx,
>>>> rctx->dcc_stats[i].ps_stats[2],
>>>> +                                     true, &result);
>>>
>>>
>>>
>>> What if this stalls?
>>
>>
>> If flush_resource / DCC decompression is done only once per frame,
>> there is a latency of two frames. That should be more than enough time
>> even if it stalls. Do you have any concerns about that?
>
>
> I have to admit it's more a matter of principle. People do sometimes look
> (stupidly, I know) at very high frame rates of what are essentially
> swapbuffers tests. It would be good to know that such tests aren't
> accidentally slowed down by this heuristic, but I'm also not familiar enough
> with the swapbuffers timing and any throttling that happens there to know
> for sure - that's why it made me a bit uneasy. If you think it's too much
> effort I don't mind too much.

At the end of each frame, st/dri waits for frame - 1. Given that, this
get_query_result call should never stall. No other in-tree state
trackers can be used in combination with st/mesa and radeonsi.

Now a couple of things:

1) I have some fixes that will improve performance a little. Those
will be sent as a separate series.

2) The current difference with DRI3 is 2.2% lower performance for
glxgears and GLMark2. It seems like a lot, but glxgears is not a
benchmark (and GLMark2 is almost as bad as glxgears). I think i can't
make it better than that.

3) DRI2 will see a bigger hit, because Mesa can't reuse back buffers
like DRI3. (DRI2 back buffers are allocated and selected by the DDX)
Do we care? Do we expect to have any DRI2 users that will use Mesa
12.1-dev or later?

Marek