[Mesa-dev] [PATCH v2 20/24] anv/cmd_buffer: Rework aux tracking

Sat Feb 3 01:36:47 UTC 2018

On Fri, Feb 02, 2018 at 02:39:25PM -0800, Jason Ekstrand wrote:
> On Fri, Feb 2, 2018 at 1:47 PM, Nanley Chery <nanleychery at gmail.com> wrote:
> 
> > On Fri, Jan 19, 2018 at 03:47:37PM -0800, Jason Ekstrand wrote:
> > > This commit completely reworks aux tracking.  This includes a number of
> > > somewhat distinct changes:
> > >
> > >  1) Since we are no longer fast-clearing multiple slices, we only need
> > >     to track one fast clear color and one fast clear type.
> > >
> > >  2) We store two bits for fast clear instead of one to let us
> > >     distinguish between zero and non-zero fast clear colors.  This is
> > >     needed so that we can do full resolves when transitioning to
> > >     PRESENT_SRC_KHR with gen9 CCS images where we allow zero clear
> > >     values in all sorts of places wouldn't normally.
> >                                    ^
> > Missing word?                      we ?
> >
> 
> Yup.  Fixed.
> 
> 
> > >
> > >  3) We now track compression state as a boolean separate from fast clear
> > >     type and this is tracked on a per-slice granularity.
> > >
> > > The previous scheme had some issues when it came to individual slices of
> > > > a multi-LOD image.  In particular, we only tracked "needs resolve"
> > > per-LOD but you could do a vkCmdPipelineBarrier that would only resolve
> > > a portion of the image and would set "needs resolve" to false anyway.
> > > Also, any transition from an undefined layout would reset the clear
> > > color for the entire LOD regardless of whether or not there was some
> > > clear color on some other slice.
> > >
> > > > As far as full/partial resolves go, the assumptions of the previous
> > > scheme held because the one case where we do need a full resolve when
> > > CCS_E is enabled is for window-system images.  Since we only ever
> > > allowed X-tiled window-system images, CCS was entirely disabled on gen9+
> > > and we never got CCS_E.  With the advent of Y-tiled window-system
> > > buffers, we now need to properly support doing a full resolve of images
> > > marked CCS_E.
> > > ---
> > >  src/intel/vulkan/anv_blorp.c       |   3 +-
> > >  src/intel/vulkan/anv_image.c       |  96 ++++++-----
> > >  src/intel/vulkan/anv_private.h     |  53 +++---
> > > >  src/intel/vulkan/genX_cmd_buffer.c | 340 +++++++++++++++++++++++++++----------
> > >  4 files changed, 331 insertions(+), 161 deletions(-)
> > >
> > > diff --git a/src/intel/vulkan/anv_blorp.c b/src/intel/vulkan/anv_blorp.c
> > > index 3698543..594b0d8 100644
> > > --- a/src/intel/vulkan/anv_blorp.c
> > > +++ b/src/intel/vulkan/anv_blorp.c
> > > @@ -1757,8 +1757,7 @@ anv_image_ccs_op(struct anv_cmd_buffer *cmd_buffer,
> > >         * particular value and don't care about format or clear value.
> > >         */
> > >        const struct anv_address clear_color_addr =
> > > -         anv_image_get_clear_color_addr(cmd_buffer->device, image,
> > > -                                        aspect, level);
> > > +         anv_image_get_clear_color_addr(cmd_buffer->device, image,
> > aspect);
> > >        surf.clear_color_addr = anv_to_blorp_address(clear_color_addr);
> > >     }
> > >
> > > diff --git a/src/intel/vulkan/anv_image.c b/src/intel/vulkan/anv_image.c
> > > index 94b9ecb..d5f8dcf 100644
> > > --- a/src/intel/vulkan/anv_image.c
> > > +++ b/src/intel/vulkan/anv_image.c
> > > @@ -190,46 +190,54 @@ all_formats_ccs_e_compatible(const struct
> > gen_device_info *devinfo,
> > >   * fast-clear values in non-trivial cases (e.g., outside of a render
> > pass in
> > >   * which a fast clear has occurred).
> > >   *
> > > - * For the purpose of discoverability, the algorithm used to manage
> > this buffer
> > > - * is described here. A clear value in this buffer is updated when a
> > fast clear
> > > - * is performed on a subresource. One of two synchronization operations
> > is
> > > - * performed in order for a following memory access to use the
> > fast-clear
> > > - * value:
> > > - *    a. Copy the value from the buffer to the surface state object
> > used for
> > > - *       reading. This is done implicitly when the value is the clear
> > value
> > > - *       predetermined to be the default in other surface state
> > objects. This
> > > - *       is currently only done explicitly for the operation below.
> > > - *    b. Do (a) and use the surface state object to resolve the
> > subresource.
> > > - *       This is only done during layout transitions for decent
> > performance.
> > > + * In order to avoid having multiple clear colors for a single plane of
> > an
> > > + * image (hence a single RENDER_SURFACE_STATE), we only allow
> > fast-clears on
> > > + * the first slice (level 0, layer 0).  At the time of our testing (Jan
> > 17,
> > > + * 2018), there were known applications which would benefit from
> > fast-clearing
> > > + * more than just the first slice.
> > >   *
> > > - * With the above scheme, we can fast-clear whenever the hardware
> > allows except
> > > - * for two cases in which synchronization becomes impossible or
> > undesirable:
> > > - *    * The subresource is in the GENERAL layout and is cleared to a
> > value
> > > - *      other than the special default value.
> > > + * The fast clear portion of the image is laid out in the following
> > order:
> > >   *
> > > - *      Performing a synchronization operation in order to read from the
> > > - *      subresource is undesirable in this case. Firstly, b) is not an
> > option
> > > - *      because a layout transition isn't required between a write and
> > read of
> > > - *      an image in the GENERAL layout. Secondly, it's undesirable to
> > do a)
> > > - *      explicitly because it would require large infrastructural
> > changes. The
> > > - *      Vulkan API supports us in deciding not to optimize this layout
> > by
> > > - *      stating that using this layout may cause suboptimal
> > performance. NOTE:
> > > - *      the auxiliary buffer must always be enabled to support a)
> > implicitly.
> > > + *  * 1 or 4 dwords (depending on hardware generation) for the clear
> > color
> > > + *  * 1 dword for the anv_fast_clear_type of the clear color
> > > + *  * On gen9+, 1 dword per level and layer of the image (3D levels
> > count as
> > > + *    having a single layer) in level-major order for compression state.
> > >   *
> > > + * For the purpose of discoverability, the algorithm used to manage
> > > + * compression and fast-clears is described here:
> > >   *
> > > - *    * For the given miplevel, only some of the layers are cleared at
> > once.
> > > + *  * On a transition from UNDEFINED or PREINITIALIZED to a defined
> > layout,
> > > + *    all of the values in the fast clear portion of the image are
> > initialized
> > > + *    to default values.
> > >   *
> > > - *      If the user clears each layer to a different value, then tries
> > to
> > > - *      render to multiple layers at once, we have no ability to
> > perform a
> > > - *      synchronization operation in between. a) is not helpful because
> > the
> > > - *      object can only hold one clear value. b) is not an option
> > because a
> > > - *      layout transition isn't required in this case.
> > > + *  * On fast-clear, the clear value is written into surface state and
> > also
> > > + *    into the buffer and the fast clear type is set appropriately.
> > Both
> > > + *    setting the fast-clear value in the buffer and setting the
> > fast-clear
> > > + *    type happen from the GPU using MI commands.
> > > + *
> > > + *  * On pipeline barrier transitions, the worst-case transition is
> > computed
> > > + *    from the image layouts.  The command streamer inspects the fast
> > clear
> > > + *    type and compression state dwords and constructs a predicate.  The
> > > + *    worst-case resolve is performed with the given predicate and the
> > fast
> > > + *    clear and compression state is set accordingly.
> > > + *
> > > + * See anv_layout_to_aux_usage and anv_layout_to_fast_clear_type
> > functions for
> > > + * details on exactly what is allowed in what layouts.
> > > + *
> > > + * On gen7-9, we do not have a concept of indirect clear colors in
> > hardware.
> > > + * In order to deal with this, we have to do some clear color
> > management.
> > > + *
> > > + *  * For LOAD_OP_LOAD at the top of a renderpass, we have to copy the
> > clear
> > > + *    value from the buffer into the surface state with MI commands.
> > > + *
> > > + *  * For any blorp operations, we pass the address to the clear value
> > into
> > > + *    blorp and it knows to copy the clear color.
> > >   */
> > >  static void
> > > -add_fast_clear_state_buffer(struct anv_image *image,
> > > -                            VkImageAspectFlagBits aspect,
> > > -                            uint32_t plane,
> > > -                            const struct anv_device *device)
> > > +add_aux_state_tracking_buffer(struct anv_image *image,
> > > +                              VkImageAspectFlagBits aspect,
> > > +                              uint32_t plane,
> > > +                              const struct anv_device *device)
> > >  {
> > >     assert(image && device);
> > >     assert(image->planes[plane].aux_surface.isl.size > 0 &&
> > > @@ -251,20 +259,20 @@ add_fast_clear_state_buffer(struct anv_image
> > *image,
> > >               (image->planes[plane].offset + image->planes[plane].size));
> > >     }
> > >
> > > -   const unsigned entry_size = anv_fast_clear_state_entry_size(device);
> > > -   /* There's no padding between entries, so ensure that they're always
> > a
> > > -    * multiple of 32 bits in order to enable GPU memcpy operations.
> > > -    */
> > > -   assert(entry_size % 4 == 0);
> > > +   /* Clear color and fast clear type */
> > > +   unsigned state_size = device->isl_dev.ss.clear_value_size + 4;
> > >
> > > -   const unsigned plane_state_size =
> > > -      entry_size * anv_image_aux_levels(image, aspect);
> > > +   /* We only need to track compression on CCS_E surfaces.  We don't
> > consider
> > > +    * 3D images as actually having multiple array layers.
> > > +    */
> > > +   if (image->planes[plane].aux_usage == ISL_AUX_USAGE_CCS_E)
> > > +      state_size += image->levels * image->array_size;
> > >
> > >     image->planes[plane].fast_clear_state_offset =
> > >        image->planes[plane].offset + image->planes[plane].size;
> > >
> > > -   image->planes[plane].size += plane_state_size;
> > > -   image->size += plane_state_size;
> > > +   image->planes[plane].size += state_size;
> > > +   image->size += state_size;
> > >  }
> > >
> > >  /**
> > > @@ -439,7 +447,7 @@ make_surface(const struct anv_device *dev,
> > >              }
> > >
> > >              add_surface(image, &image->planes[plane].aux_surface,
> > plane);
> > > -            add_fast_clear_state_buffer(image, aspect, plane, dev);
> > > +            add_aux_state_tracking_buffer(image, aspect, plane, dev);
> > >
> > >              /* For images created without MUTABLE_FORMAT_BIT set, we
> > know that
> > >               * they will always be used with the original format.  In
> > > @@ -463,7 +471,7 @@ make_surface(const struct anv_device *dev,
> > > >                                   &image->planes[plane].aux_surface.isl);
> > >        if (ok) {
> > >           add_surface(image, &image->planes[plane].aux_surface, plane);
> > > -         add_fast_clear_state_buffer(image, aspect, plane, dev);
> > > +         add_aux_state_tracking_buffer(image, aspect, plane, dev);
> > >           image->planes[plane].aux_usage = ISL_AUX_USAGE_MCS;
> > >        }
> > >     }
> > > diff --git a/src/intel/vulkan/anv_private.h
> > b/src/intel/vulkan/anv_private.h
> > > index f0251e2..3d3a773 100644
> > > --- a/src/intel/vulkan/anv_private.h
> > > +++ b/src/intel/vulkan/anv_private.h
> > > @@ -2483,50 +2483,51 @@ anv_image_aux_layers(const struct anv_image *
> > const image,
> > >     }
> > >  }
> > >
> > > -static inline unsigned
> > > -anv_fast_clear_state_entry_size(const struct anv_device *device)
> > > -{
> > > -   assert(device);
> > > -   /* Entry contents:
> > > -    *   +--------------------------------------------+
> > > -    *   | clear value dword(s) | needs resolve dword |
> > > -    *   +--------------------------------------------+
> > > -    */
> > > -
> > > -   /* Ensure that the needs resolve dword is in fact dword-aligned to
> > enable
> > > -    * GPU memcpy operations.
> > > -    */
> > > -   assert(device->isl_dev.ss.clear_value_size % 4 == 0);
> > > -   return device->isl_dev.ss.clear_value_size + 4;
> > > -}
> > > -
> > >  static inline struct anv_address
> > >  anv_image_get_clear_color_addr(const struct anv_device *device,
> > >                                 const struct anv_image *image,
> > > -                               VkImageAspectFlagBits aspect,
> > > -                               unsigned level)
> > > +                               VkImageAspectFlagBits aspect)
> > >  {
> > > +   assert(image->aspects & VK_IMAGE_ASPECT_ANY_COLOR_BIT_ANV);
> > > +
> > >     uint32_t plane = anv_image_aspect_to_plane(image->aspects, aspect);
> > >     return (struct anv_address) {
> > >        .bo = image->planes[plane].bo,
> > >        .offset = image->planes[plane].bo_offset +
> > > -                image->planes[plane].fast_clear_state_offset +
> > > -                anv_fast_clear_state_entry_size(device) * level,
> > > +                image->planes[plane].fast_clear_state_offset,
> > >     };
> > >  }
> > >
> > >  static inline struct anv_address
> > > -anv_image_get_needs_resolve_addr(const struct anv_device *device,
> > > -                                 const struct anv_image *image,
> > > -                                 VkImageAspectFlagBits aspect,
> > > -                                 unsigned level)
> > > +anv_image_get_fast_clear_type_addr(const struct anv_device *device,
> > > +                                   const struct anv_image *image,
> > > +                                   VkImageAspectFlagBits aspect)
> > >  {
> > >     struct anv_address addr =
> > > -      anv_image_get_clear_color_addr(device, image, aspect, level);
> > > +      anv_image_get_clear_color_addr(device, image, aspect);
> > >     addr.offset += device->isl_dev.ss.clear_value_size;
> > >     return addr;
> > >  }
> > >
> > > +static inline struct anv_address
> > > +anv_image_get_compression_state_addr(const struct anv_device *device,
> > > +                                     const struct anv_image *image,
> > > +                                     VkImageAspectFlagBits aspect,
> > > +                                     uint32_t level, uint32_t
> > array_layer)
> > > +{
> > > +   assert(level < anv_image_aux_levels(image, aspect));
> > > +   assert(array_layer < anv_image_aux_layers(image, aspect, level));
> > > +   UNUSED uint32_t plane = anv_image_aspect_to_plane(image->aspects,
> > aspect);
> > > +   assert(image->planes[plane].aux_usage == ISL_AUX_USAGE_CCS_E);
> > > +
> > > +   struct anv_address addr =
> > > +      anv_image_get_fast_clear_type_addr(device, image, aspect);
> > > +   addr.offset += 4; /* Go past the fast clear type */
> > > +   addr.offset += level * image->array_size;
> > > +   addr.offset += array_layer;
> > > +   return addr;
> > > +}
> > > +
> > >  /* Returns true if a HiZ-enabled depth buffer can be sampled from. */
> > >  static inline bool
> > >  anv_can_sample_with_hiz(const struct gen_device_info * const devinfo,
> > > diff --git a/src/intel/vulkan/genX_cmd_buffer.c
> > b/src/intel/vulkan/genX_cmd_buffer.c
> > > index 15e805f..4c83a5c 100644
> > > --- a/src/intel/vulkan/genX_cmd_buffer.c
> > > +++ b/src/intel/vulkan/genX_cmd_buffer.c
> > > @@ -407,27 +407,45 @@ transition_depth_buffer(struct anv_cmd_buffer
> > *cmd_buffer,
> > >  #define MI_PREDICATE_SRC0  0x2400
> > >  #define MI_PREDICATE_SRC1  0x2408
> > >
> > > -/* Manages the state of an color image subresource to ensure resolves
> > are
> > > - * performed properly.
> > > - */
> > >  static void
> > > -genX(set_image_needs_resolve)(struct anv_cmd_buffer *cmd_buffer,
> > > -                        const struct anv_image *image,
> > > -                        VkImageAspectFlagBits aspect,
> > > -                        unsigned level, bool needs_resolve)
> > > +set_image_fast_clear_state(struct anv_cmd_buffer *cmd_buffer,
> > > +                           const struct anv_image *image,
> > > +                           VkImageAspectFlagBits aspect,
> > > +                           enum anv_fast_clear_type fast_clear)
> > >  {
> > > -   assert(cmd_buffer && image);
> > > -   assert(image->aspects & VK_IMAGE_ASPECT_ANY_COLOR_BIT_ANV);
> > > -   assert(level < anv_image_aux_levels(image, aspect));
> > > -
> > > -   /* The HW docs say that there is no way to guarantee the completion
> > of
> > > -    * the following command. We use it nevertheless because it shows no
> > > -    * issues in testing is currently being used in the GL driver.
> > > -    */
> > >     anv_batch_emit(&cmd_buffer->batch, GENX(MI_STORE_DATA_IMM), sdi) {
> > > > -      sdi.Address = anv_image_get_needs_resolve_addr(cmd_buffer->device,
> > > > -                                                     image, aspect, level);
> > > > -      sdi.ImmediateData = needs_resolve;
> > > > +      sdi.Address = anv_image_get_fast_clear_type_addr(cmd_buffer->device,
> > > > +                                                       image, aspect);
> > > +      sdi.ImmediateData = fast_clear;
> > > +   }
> > > +}
> > > +
> > > +static void
> > > +set_image_compressed_bit(struct anv_cmd_buffer *cmd_buffer,
> > > +                         const struct anv_image *image,
> > > +                         VkImageAspectFlagBits aspect,
> > > +                         uint32_t level,
> > > +                         uint32_t base_layer, uint32_t layer_count,
> > > +                         bool compressed)
> > > +{
> > > +   /* We only have CCS_E on gen9+ */
> > > +   if (GEN_GEN < 9)
> > > +      return;
> > > +
> >
> > Is this if only here for an optimization? Perhaps update the comment
> > accordingly? Otherwise it'd seem more appropriate to assert after the
> > other return in this function.
> >
> 
> Yes.  It makes this function compile to nothing on gen8 and earlier.  It's
> not a big function so I'm happy to drop it.
> 
> 
> > > +   uint32_t plane = anv_image_aspect_to_plane(image->aspects, aspect);
> > > +
> > > +   /* We only have compression tracking for CCS_E */
> > > +   if (image->planes[plane].aux_usage != ISL_AUX_USAGE_CCS_E)
> > > +      return;
> > > +
> > > +   for (uint32_t a = 0; a < layer_count; a++) {
> > > +      uint32_t layer = base_layer + a;
> > > +      anv_batch_emit(&cmd_buffer->batch, GENX(MI_STORE_DATA_IMM), sdi)
> > {
> > > > +         sdi.Address = anv_image_get_compression_state_addr(cmd_buffer->device,
> > > > +                                                            image, aspect,
> > > > +                                                            level, layer);
> > > +         sdi.ImmediateData = compressed ? UINT32_MAX : 0;
> > > +      }
> > >     }
> > >  }
> > >
> > > @@ -451,32 +469,172 @@ mi_alu(uint32_t opcode, uint32_t operand1,
> > uint32_t operand2)
> > >  #define CS_GPR(n) (0x2600 + (n) * 8)
> > >
> > >  static void
> > > -genX(load_needs_resolve_predicate)(struct anv_cmd_buffer *cmd_buffer,
> > > -                                   const struct anv_image *image,
> > > -                                   VkImageAspectFlagBits aspect,
> > > -                                   unsigned level)
> > > +anv_cmd_predicated_ccs_resolve(struct anv_cmd_buffer *cmd_buffer,
> > > +                               const struct anv_image *image,
> > > +                               VkImageAspectFlagBits aspect,
> > > +                               uint32_t level, uint32_t array_layer,
> > > +                               enum isl_aux_op resolve_op,
> > > +                               enum anv_fast_clear_type
> > fast_clear_supported)
> > >  {
> > > -   assert(cmd_buffer && image);
> > > -   assert(image->aspects & VK_IMAGE_ASPECT_ANY_COLOR_BIT_ANV);
> > > -   assert(level < anv_image_aux_levels(image, aspect));
> > > +   struct anv_address fast_clear_type_addr =
> > > +      anv_image_get_fast_clear_type_addr(cmd_buffer->device, image,
> > aspect);
> > > +
> > > +#if GEN_GEN >= 9
> > > +   const uint32_t plane = anv_image_aspect_to_plane(image->aspects,
> > aspect);
> > > +   const bool decompress =
> > > +      resolve_op == ISL_AUX_OP_FULL_RESOLVE &&
> > > +      image->planes[plane].aux_usage == ISL_AUX_USAGE_CCS_E;
> > > +
> > > +   /* This function shouldn't get called if it isn't going to do
> > anything */
> > > +   assert(decompress || fast_clear_supported < ANV_FAST_CLEAR_ANY);
> > >
> > > -   const struct anv_address resolve_flag_addr =
> > > -      anv_image_get_needs_resolve_addr(cmd_buffer->device,
> > > -                                       image, aspect, level);
> > > +   if (level == 0 && array_layer == 0) {
> > > +      /* This is the complex case because we have to worry about
> > dealing with
> > > +       * the fast clear color.  Unfortunately, it's also the common
> > case.
> > > +       */
> > > +
> > > +      /* Poor-man's register allocation */
> > > +      int next_reg = MI_ALU_REG0;
> > > +      int pred_reg = -1;
> > > +
> > > +      /* Needed for ALU operations */
> > > +      uint32_t *dw;
> > > +
> > > +      const int image_fc = next_reg++;
> > > +      anv_batch_emit(&cmd_buffer->batch, GENX(MI_LOAD_REGISTER_MEM),
> > lrm) {
> > > +         lrm.RegisterAddress  = CS_GPR(image_fc);
> > > +         lrm.MemoryAddress    = fast_clear_type_addr;
> > > +      }
> > > +      emit_lri(&cmd_buffer->batch, CS_GPR(image_fc) + 4, 0);
> > > +
> > > +      if (fast_clear_supported < ANV_FAST_CLEAR_ANY) {
> > > +         /* We need to compute (fast_clear_supported <
> > image->fast_clear).
> > > +          * We do this by subtracting and storing the carry bit.
> > > +          */
> > > +         const int fc_imm = next_reg++;
> > > +         emit_lri(&cmd_buffer->batch, CS_GPR(fc_imm),
> > fast_clear_supported);
> > > +         emit_lri(&cmd_buffer->batch, CS_GPR(fc_imm) + 4, 0);
> > > +
> > > +         assert(pred_reg == -1);
> > > +         pred_reg = next_reg++;
> > > +
> > > +         dw = anv_batch_emitn(&cmd_buffer->batch, 5, GENX(MI_MATH));
> > > +         dw[1] = mi_alu(MI_ALU_LOAD, MI_ALU_SRCA, fc_imm);
> > > +         dw[2] = mi_alu(MI_ALU_LOAD, MI_ALU_SRCB, image_fc);
> > > +         dw[3] = mi_alu(MI_ALU_SUB, 0, 0);
> > > +         dw[4] = mi_alu(MI_ALU_STORE, pred_reg, MI_ALU_CF);
> >
> > Nice comparison algorithm.
> >
> > > +      }
> > > +
> > > +      if (decompress) {
> > > +         /* If we're doing a full resolve so we need the compression
> > state */
> >                ^                 OR          ^
> > One of these is an extra word?
> >
> 
> Yup.  Dropped the "so" and added a comma
> 
> 
> > > +         struct anv_address compression_state_addr =
> > > +            anv_image_get_compression_state_addr(cmd_buffer->device,
> > image,
> > > +                                                 aspect, level,
> > array_layer);
> > > +         if (pred_reg == -1) {
> > > +            pred_reg = next_reg++;
> > > +            anv_batch_emit(&cmd_buffer->batch,
> > GENX(MI_LOAD_REGISTER_MEM), lrm) {
> > > +               lrm.RegisterAddress  = CS_GPR(pred_reg);
> > > +               lrm.MemoryAddress    = compression_state_addr;
> > > +            }
> > > +         } else {
> > > +            /* OR the compression state into the predicate.  The
> > compression
> > > +             * state is already in 0/~0 form.
> > > +             */
> > > +            const int image_comp = next_reg++;
> > > +            anv_batch_emit(&cmd_buffer->batch,
> > GENX(MI_LOAD_REGISTER_MEM), lrm) {
> > > +               lrm.RegisterAddress  = CS_GPR(image_comp);
> > > +               lrm.MemoryAddress    = compression_state_addr;
> > > +            }
> > >
> > > -   /* Make the pending predicated resolve a no-op if one is not needed.
> > > -    * predicate = do_resolve = resolve_flag != 0;
> > > +            dw = anv_batch_emitn(&cmd_buffer->batch, 5, GENX(MI_MATH));
> > > +            dw[1] = mi_alu(MI_ALU_LOAD, MI_ALU_SRCA, pred_reg);
> > > +            dw[2] = mi_alu(MI_ALU_LOAD, MI_ALU_SRCB, image_comp);
> > > +            dw[3] = mi_alu(MI_ALU_OR, 0, 0);
> > > +            dw[4] = mi_alu(MI_ALU_STORE, pred_reg, MI_ALU_ACCU);
> > > +         }
> > > +
> > > +         anv_batch_emit(&cmd_buffer->batch, GENX(MI_STORE_DATA_IMM),
> > sdi) {
> > > +            sdi.Address = compression_state_addr;
> > > +            sdi.ImmediateData = 0;
> > > +         }
> >
> > Could we replace these batch emissions with set_image_compressed_bit()
> > to help in the readability of this function?
> >
> 
> I did it this way so that they would all be the same in this function.  We
> could use set_image_compressed_bit here.
> 
> 

Gotcha.
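
For anyone following along, the predicate construction in the first-slice
branch quoted above can be modelled on the CPU roughly as below. This is
only a sketch, not driver code: the model_* names are hypothetical, and it
assumes the stored carry flag ends up in the same 0/~0 form that the patch
comment describes for the compression state. When fast_clear_supported is
the largest type, the compare can never fire, which matches the patch
skipping it in that case.

```c
#include <stdint.h>

/* Hypothetical model of the MI_MATH "SUB, then store the carry flag" pair.
 * Returns ~0 when the 64-bit subtraction srca - srcb borrows (srca < srcb)
 * and 0 otherwise.  The 0/~0 encoding of the stored flag is an assumption.
 */
static uint32_t
model_sub_carry(uint64_t srca, uint64_t srcb)
{
   uint64_t diff = srca - srcb;
   /* Standard borrow-out expression, evaluated at the top bit. */
   uint64_t borrow = (~srca & srcb) | ((~srca | srcb) & diff);
   return (borrow >> 63) ? UINT32_MAX : 0;
}

/* Hypothetical model of the predicate for level 0, layer 0.
 * *compression_state is the per-slice dword (0 or ~0).  It is reset to 0
 * once it has been read, mirroring the MI_STORE_DATA_IMM in the patch.
 */
static uint32_t
model_build_predicate(uint32_t fast_clear_supported,
                      uint32_t image_fast_clear,
                      int decompress,
                      uint32_t *compression_state)
{
   /* predicate = fast_clear_supported < image_fast_clear */
   uint32_t pred = model_sub_carry(fast_clear_supported, image_fast_clear);

   if (decompress) {
      /* OR the compression state into the predicate, then clear it. */
      pred |= *compression_state;
      *compression_state = 0;
   }
   return pred;
}
```

Computing the borrow from the bit pattern rather than with a plain `<`
keeps the model close to what the ALU does with the two 64-bit GPRs.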

> > > +      }
> > > +
> > > +      /* Store the predicate */
> > > +      assert(pred_reg != -1);
> > > +      emit_lrr(&cmd_buffer->batch, MI_PREDICATE_SRC0, CS_GPR(pred_reg));
> > > +
> > > +      /* If the predicate is true, we want to write 0 to the fast clear
> > type
> > > +       * and, if it's false, leave it alone.  We can do this by writing
> > > +       *
> > > +       * clear_type = clear_type & ~predicate;
> > > +       */
> > > +      dw = anv_batch_emitn(&cmd_buffer->batch, 5, GENX(MI_MATH));
> > > +      dw[1] = mi_alu(MI_ALU_LOAD, MI_ALU_SRCA, image_fc);
> > > +      dw[2] = mi_alu(MI_ALU_LOADINV, MI_ALU_SRCB, pred_reg);
> > > +      dw[3] = mi_alu(MI_ALU_AND, 0, 0);
> > > +      dw[4] = mi_alu(MI_ALU_STORE, image_fc, MI_ALU_ACCU);
> > > +
> > > +      anv_batch_emit(&cmd_buffer->batch, GENX(MI_STORE_REGISTER_MEM),
> > srm) {
> > > +         srm.RegisterAddress  = CS_GPR(image_fc);
> > > +         srm.MemoryAddress    = fast_clear_type_addr;
> > > +      }
> >
> > Why do this preprocessing before writing to fast_clear_type_addr? Why
> > not just:
> >         set_image_fast_clear_state(..., ANV_FAST_CLEAR_NONE);
> >
> 
> See the comment above.  We only want to reset the fast clear state to NONE
> if we actually did the resolve.  Consider, for instance, the following
> sequence:
> 
>  * BeginRenderPass with LOAD_OP_CLEAR and a zero clear color
>  * EndRenderPass
>  * Transition from VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL to
> VK_IMAGE_LAYOUT_PRESENT_SRC_KHR with CCS_E
> 
> In that case, COLOR_ATTACHMENT_OPTIMAL will say it supports any clear color
> but PRESENT_SRC_KHR will say DEFAULT.  This will cause us to schedule a
> full resolve.  However, because we only cleared to zero, the actual fast
> clear state in the buffer will be DEFAULT so the comparison above will
> determine that we don't need to resolve.  In this case, we don't want to reset
> the fast clear state to NONE.  This is why we AND with ~predicate above.
> 
> 

Ah okay, makes sense.
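
To spell out the masked write for the archive, here is a small CPU-side
sketch of "clear_type = clear_type & ~predicate". The enum is a
hypothetical stand-in for anv_fast_clear_type: the numeric values are
assumed, apart from NONE being 0, which follows from the patch comment
about writing 0. The predicate is assumed to be in 0/~0 form.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the anv_fast_clear_type values.  The numeric
 * values are assumptions; only NONE == 0 and the ordering matter here.
 */
enum model_fast_clear_type {
   MODEL_FAST_CLEAR_NONE = 0,
   MODEL_FAST_CLEAR_DEFAULT_VALUE = 1,
   MODEL_FAST_CLEAR_ANY = 2,
};

/* If the predicate is ~0 (the resolve will run), the result is NONE.
 * If the predicate is 0 (the resolve is skipped), the type is unchanged.
 */
static uint32_t
model_update_fast_clear_type(uint32_t clear_type, uint32_t predicate)
{
   return clear_type & ~predicate;
}
```

In the sequence described above, the stored type is DEFAULT_VALUE and the
comparison yields 0, so the stored type survives. An unconditional write of
NONE would lose it even though no resolve ran.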

> > > +   } else if (decompress) {
> > > +      /* We're trying to get rid of compression but we don't care about
> > fast
> > > +       * clears so all we need is the compression predicate.
> > > +       */
> > > +      assert(resolve_op == ISL_AUX_OP_FULL_RESOLVE);
> > > +      struct anv_address compression_state_addr =
> > > +         anv_image_get_compression_state_addr(cmd_buffer->device,
> > image,
> > > +                                              aspect, level,
> > array_layer);
> > > +      anv_batch_emit(&cmd_buffer->batch, GENX(MI_LOAD_REGISTER_MEM),
> > lrm) {
> > > +         lrm.RegisterAddress  = MI_PREDICATE_SRC0;
> > > +         lrm.MemoryAddress    = compression_state_addr;
> > > +      }
> > > +      anv_batch_emit(&cmd_buffer->batch, GENX(MI_STORE_DATA_IMM), sdi)
> > {
> > > +         sdi.Address = compression_state_addr;
> > > +         sdi.ImmediateData = 0;
> > > +      }
> > > +   } else {
> > > +      /* In this case, we're trying to do a partial resolve on a slice
> > that
> > > +       * doesn't have clear color.  There's nothing to do.
> > > +       */
> > > +      return;
> > > +   }
> > > +
> > > +#else /* GEN_GEN <= 8 */
> > > +   assert(resolve_op == ISL_AUX_OP_FULL_RESOLVE);
> > > +   assert(fast_clear_supported != ANV_FAST_CLEAR_ANY);
> > > +
> > > +   /* On gen8, we don't have a concept of default clear colors because
> > we
> > > +    * can't sample from CCS surfaces.  It's enough to just load the
> > fast clear
> > > +    * state into the predicate register.
> > > +    */
> >
> > As mentioned in another email, gen7+ could support the TRANSFER_DST
> > layout with the default clear value. Could we update this comment to
> > reflect this possibility?
> >
> 
> Since blorp can do indirect clear colors now, we could handle TRANSFER_DST
> with any clear color even on gen7.
> 
> 

Even better.

> > > +   emit_lrm(&cmd_buffer->batch, MI_PREDICATE_SRC0,
> > > +            fast_clear_type_addr.bo, fast_clear_type_addr.offset);
> >
> > Why do we use the emit_lrm function here, but open code all other
> > LOAD_REGISTER_MEM commands in this patch?
> >
> 
> Probably because I missed this one.
> 
> 
> > > +#endif
> >
> > This ifdef case is missing a set_image_fast_clear_state. Did you mean to
> > place this in the first if statement which checked for the first slice?
> >
> 
> Oops...  Fixed locally.
> 
> 

Please ping me with the branch if you won't send a v3. The second change
is non-trivial in my opinion.
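
As a note on the MI_PREDICATE setup quoted just below, this rough model
shows why the second half of src0 and all of src1 are zeroed. It is a
sketch under the assumption that LOAD_LOADINV with COMPARE_SRCS_EQUAL
yields the inverse of a 64-bit "src0 == src1" compare, which is how the
patch comment describes it. The function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of MI_PREDICATE with LOAD_LOADINV and
 * COMPARE_SRCS_EQUAL: the predicate is !(src0 == src1), compared as
 * 64-bit values built from the low and high dwords of each register.
 */
static bool
model_mi_predicate(uint32_t src0_lo, uint32_t src0_hi,
                   uint32_t src1_lo, uint32_t src1_hi)
{
   uint64_t src0 = ((uint64_t)src0_hi << 32) | src0_lo;
   uint64_t src1 = ((uint64_t)src1_hi << 32) | src1_lo;
   return !(src0 == src1);
}
```

With the high dword of src0 and both dwords of src1 zeroed, the result
depends only on the low dword of src0. A stale high dword would make the
predicate fire even when the tracked state is 0.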

-Nanley

> > > +
> > > +   /* We use the first half of src0 for the actual predicate.  Set the
> > second
> > > +    * half of src0 and all of src1 to 0 as the predicate operation will
> > be
> > > +    * doing an implicit src0 != src1.
> > >      */
> > > +   emit_lri(&cmd_buffer->batch, MI_PREDICATE_SRC0 + 4, 0);
> > >     emit_lri(&cmd_buffer->batch, MI_PREDICATE_SRC1    , 0);
> > >     emit_lri(&cmd_buffer->batch, MI_PREDICATE_SRC1 + 4, 0);
> > > -   emit_lri(&cmd_buffer->batch, MI_PREDICATE_SRC0    , 0);
> > > -   emit_lrm(&cmd_buffer->batch, MI_PREDICATE_SRC0 + 4,
> > > -            resolve_flag_addr.bo, resolve_flag_addr.offset);
> > > +
> > >     anv_batch_emit(&cmd_buffer->batch, GENX(MI_PREDICATE), mip) {
> > >        mip.LoadOperation    = LOAD_LOADINV;
> > >        mip.CombineOperation = COMBINE_SET;
> > >        mip.CompareOperation = COMPARE_SRCS_EQUAL;
> > >     }
> > > +
> > > +   if (image->type == VK_IMAGE_TYPE_3D) {
> > > +      anv_image_ccs_op(cmd_buffer, image, aspect, level,
> > > +                       0, anv_minify(image->extent.depth, level),
> > > +                       resolve_op, true);
> > > +   } else {
> > > +      anv_image_ccs_op(cmd_buffer, image, aspect, level,
> > > +                       array_layer, 1, resolve_op, true);
> > > +   }
> > >  }
> > >
> > >  void
> > > @@ -490,17 +648,35 @@ genX(cmd_buffer_mark_image_written)(struct
> > anv_cmd_buffer *cmd_buffer,
> > >  {
> > >     /* The aspect must be exactly one of the image aspects. */
> > >     assert(_mesa_bitcount(aspect) == 1 && (aspect & image->aspects));
> > > +
> > > +   /* The only compression types with more than just fast-clears are
> > MCS,
> > > +    * CCS_E, and HiZ.  With HiZ we just trust the layout and don't
> > actually
> > > +    * track the current fast-clear and compression state.  This leaves
> > us
> > > +    * with just MCS and CCS_E.
> > > +    */
> > > +   if (aux_usage != ISL_AUX_USAGE_CCS_E &&
> > > +       aux_usage != ISL_AUX_USAGE_MCS)
> > > +      return;
> > > +
> > > +   if (image->type == VK_IMAGE_TYPE_3D) {
> > > +      base_layer = 0;
> > > +      layer_count = 1;
> > > +   }
> > > +
> > > +   set_image_compressed_bit(cmd_buffer, image, aspect,
> > > +                            level, base_layer, layer_count, true);
> > >  }
> > >
> > >  static void
> > > -init_fast_clear_state_entry(struct anv_cmd_buffer *cmd_buffer,
> > > -                            const struct anv_image *image,
> > > -                            VkImageAspectFlagBits aspect,
> > > -                            unsigned level)
> > > +init_fast_clear_color(struct anv_cmd_buffer *cmd_buffer,
> > > +                      const struct anv_image *image,
> > > +                      VkImageAspectFlagBits aspect)
> > >  {
> > >     assert(cmd_buffer && image);
> > >     assert(image->aspects & VK_IMAGE_ASPECT_ANY_COLOR_BIT_ANV);
> > > -   assert(level < anv_image_aux_levels(image, aspect));
> > > +
> > > +   set_image_fast_clear_state(cmd_buffer, image, aspect,
> > > +                              ANV_FAST_CLEAR_NONE);
> > >
> > >     uint32_t plane = anv_image_aspect_to_plane(image->aspects, aspect);
> > >     enum isl_aux_usage aux_usage = image->planes[plane].aux_usage;
> > > @@ -517,7 +693,7 @@ init_fast_clear_state_entry(struct anv_cmd_buffer *cmd_buffer,
> > >      * values in the clear value dword(s).
> > >      */
> > >     struct anv_address addr =
> > > -      anv_image_get_clear_color_addr(cmd_buffer->device, image, aspect, level);
> > > +      anv_image_get_clear_color_addr(cmd_buffer->device, image, aspect);
> > >     unsigned i = 0;
> > >     for (; i < cmd_buffer->device->isl_dev.ss.clear_value_size; i += 4) {
> > >        anv_batch_emit(&cmd_buffer->batch, GENX(MI_STORE_DATA_IMM), sdi) {
> > > @@ -558,19 +734,17 @@ genX(copy_fast_clear_dwords)(struct anv_cmd_buffer *cmd_buffer,
> > >                               struct anv_state surface_state,
> > >                               const struct anv_image *image,
> > >                               VkImageAspectFlagBits aspect,
> > > -                             unsigned level,
> > >                               bool copy_from_surface_state)
> > >  {
> > >     assert(cmd_buffer && image);
> > >     assert(image->aspects & VK_IMAGE_ASPECT_ANY_COLOR_BIT_ANV);
> > > -   assert(level < anv_image_aux_levels(image, aspect));
> > >
> > >     struct anv_bo *ss_bo =
> > >        &cmd_buffer->device->surface_state_pool.block_pool.bo;
> > >     uint32_t ss_clear_offset = surface_state.offset +
> > >        cmd_buffer->device->isl_dev.ss.clear_value_offset;
> > >     const struct anv_address entry_addr =
> > > -      anv_image_get_clear_color_addr(cmd_buffer->device, image, aspect, level);
> > > +      anv_image_get_clear_color_addr(cmd_buffer->device, image, aspect);
> > >     unsigned copy_size = cmd_buffer->device->isl_dev.ss.clear_value_size;
> > >
> > >     if (copy_from_surface_state) {
> > > @@ -657,20 +831,11 @@ transition_color_buffer(struct anv_cmd_buffer *cmd_buffer,
> > >                                 base_layer, layer_count);
> > >     }
> > >
> > > -   if (base_layer >= anv_image_aux_layers(image, aspect, base_level))
> > > -      return;
> > > -
> > > -   /* A transition of a 3D subresource works on all slices at a time. */
> > > -   if (image->type == VK_IMAGE_TYPE_3D) {
> > > +   if (image->type == VK_IMAGE_TYPE_3D)
> > >        base_layer = 0;
> > > -      layer_count = anv_minify(image->extent.depth, base_level);
> > > -   }
> > >
> > > -   /* We're interested in the subresource range subset that has aux data. */
> > > -   level_count = MIN2(level_count, anv_image_aux_levels(image, aspect) - base_level);
> > > -   layer_count = MIN2(layer_count,
> > > -                      anv_image_aux_layers(image, aspect, base_level) - base_layer);
> > > -   last_level_num = base_level + level_count;
> > > +   if (base_layer >= anv_image_aux_layers(image, aspect, base_level))
> > > +      return;
> > >
> > >     assert(image->tiling == VK_IMAGE_TILING_OPTIMAL);
> > >
> > > @@ -684,8 +849,8 @@ transition_color_buffer(struct anv_cmd_buffer *cmd_buffer,
> > >         *
> > >         * Initialize the relevant clear buffer entries.
> > >         */
> > > -      for (unsigned level = base_level; level < last_level_num; level++)
> > > -         init_fast_clear_state_entry(cmd_buffer, image, aspect, level);
> > > +      if (base_level == 0 && base_layer == 0)
> > > +         init_fast_clear_color(cmd_buffer, image, aspect);
> > >
> > >        /* Initialize the aux buffers to enable correct rendering.  In order to
> > >         * ensure that things such as storage images work correctly, aux buffers
> > > @@ -701,16 +866,26 @@ transition_color_buffer(struct anv_cmd_buffer *cmd_buffer,
> > >         * We don't have any data to show that this is a problem, but we want to
> > >         * avoid causing difficult-to-debug problems.
> > >         */
> > > +
> >
> > Intentional new line?
> >
> 
> Nope.  Dropped.
> 
> 
> > >        if (image->samples == 1) {
> > >           for (uint32_t l = 0; l < level_count; l++) {
> > >              const uint32_t level = base_level + l;
> > > -            const uint32_t level_layer_count =
> > > +            uint32_t level_layer_count =
> > >                 MIN2(layer_count, anv_image_aux_layers(image, aspect, level));
> > > +
> > > +            /* A transition of a 3D subresource works on all slices. */
> > > +            if (image->type == VK_IMAGE_TYPE_3D)
> > > +               level_layer_count = anv_minify(image->extent.depth, level);
> > > +
> > >              anv_image_ccs_op(cmd_buffer, image, aspect, level,
> > >                               base_layer, level_layer_count,
> > >                               ISL_AUX_OP_AMBIGUATE, false);
> > > -            genX(set_image_needs_resolve)(cmd_buffer, image,
> > > -                                          aspect, level, false);
> > > +
> > > +            if (image->planes[plane].aux_usage == ISL_AUX_USAGE_CCS_E) {
> > > +               set_image_compressed_bit(cmd_buffer, image, aspect,
> > > +                                        level, base_layer, level_layer_count,
> > > +                                        false);
> > > +            }
> > >           }
> > >        } else {
> > >           if (image->samples == 4 || image->samples == 16) {
> > > @@ -723,10 +898,6 @@ transition_color_buffer(struct anv_cmd_buffer *cmd_buffer,
> > >           anv_image_mcs_op(cmd_buffer, image, aspect,
> > >                            base_layer, layer_count,
> > >                            ISL_AUX_OP_FAST_CLEAR, false);
> > > -         for (unsigned level = base_level; level < last_level_num; level++) {
> > > -            genX(set_image_needs_resolve)(cmd_buffer, image,
> > > -                                          aspect, level, true);
> > > -         }
> > >        }
> > >        return;
> > >     }
> > > @@ -793,19 +964,14 @@ transition_color_buffer(struct anv_cmd_buffer *cmd_buffer,
> > >     cmd_buffer->state.pending_pipe_bits |=
> > >        ANV_PIPE_RENDER_TARGET_CACHE_FLUSH_BIT | ANV_PIPE_CS_STALL_BIT;
> > >
> > > -   for (uint32_t level = base_level; level < last_level_num; level++) {
> > > -
> > > -      /* The number of layers changes at each 3D miplevel. */
> > > -      if (image->type == VK_IMAGE_TYPE_3D) {
> > > -         layer_count = MIN2(layer_count, anv_image_aux_layers(image, aspect, level));
> > > +   for (uint32_t l = 0; l < level_count; l++) {
> > > +      uint32_t level = base_level + l;
> > > +      for (uint32_t a = 0; a < layer_count; a++) {
> > > +         uint32_t array_layer = base_layer + a;
> > > +         anv_cmd_predicated_ccs_resolve(cmd_buffer, image, aspect,
> > > +                                        level, array_layer, resolve_op,
> > > +                                        final_fast_clear);
> >
> > The above function does more than predicate a CCS resolve - it also
> > updates the fast_clear_type and compression dwords. I think it would be
> > helpful to do those operations outside of this function. We'd have to
> > add something like:
> >
> 
> See my comments above explaining why we can't just whack the clear state to
> a constant value.
> 
> --Jason
> 
> 
> >          if (level == 0 && array_layer == 0 && init_fast_clear <
> >             final_fast_clear) {
> >             set_image_fast_clear_state(..., ANV_FAST_CLEAR_NONE);
> >          }
> >          if (resolve_op == ISL_AUX_OP_FULL_RESOLVE)
> >             set_image_compressed_bit(...);
> >
> >
> > -Nanley
> >
> > >        }
> > > -
> > > -      genX(load_needs_resolve_predicate)(cmd_buffer, image, aspect, level);
> > > -
> > > -      anv_image_ccs_op(cmd_buffer, image, aspect, level,
> > > -                       base_layer, layer_count, resolve_op, true);
> > > -
> > > -      genX(set_image_needs_resolve)(cmd_buffer, image, aspect, level, false);
> > >     }
> > >
> > >     cmd_buffer->state.pending_pipe_bits |=
> > > @@ -3132,28 +3298,26 @@ cmd_buffer_subpass_sync_fast_clear_values(struct anv_cmd_buffer *cmd_buffer)
> > >           genX(copy_fast_clear_dwords)(cmd_buffer, att_state->color.state,
> > >                                        iview->image,
> > >                                        VK_IMAGE_ASPECT_COLOR_BIT,
> > > -                                      iview->planes[0].isl.base_level,
> > >                                        true /* copy from ss */);
> > >
> > >           /* Fast-clears impact whether or not a resolve will be necessary. */
> > > -         if (iview->image->planes[0].aux_usage == ISL_AUX_USAGE_CCS_E &&
> > > -             att_state->clear_color_is_zero) {
> > > +         if (att_state->clear_color_is_zero) {
> > >              /* This image always has the auxiliary buffer enabled. We can mark
> > >               * the subresource as not needing a resolve because the clear color
> > >               * will match what's in every RENDER_SURFACE_STATE object when it's
> > >               * being used for sampling.
> > >               */
> > > -            genX(set_image_needs_resolve)(cmd_buffer, iview->image,
> > > -                                          VK_IMAGE_ASPECT_COLOR_BIT,
> > > -                                          iview->planes[0].isl.base_level,
> > > -                                          false);
> > > +            set_image_fast_clear_state(cmd_buffer, iview->image,
> > > +                                       VK_IMAGE_ASPECT_COLOR_BIT,
> > > +                                       ANV_FAST_CLEAR_ZERO_ONLY);
> > >           } else {
> > > -            genX(set_image_needs_resolve)(cmd_buffer, iview->image,
> > > -                                          VK_IMAGE_ASPECT_COLOR_BIT,
> > > -                                          iview->planes[0].isl.base_level,
> > > -                                          true);
> > > +            set_image_fast_clear_state(cmd_buffer, iview->image,
> > > +                                       VK_IMAGE_ASPECT_COLOR_BIT,
> > > +                                       ANV_FAST_CLEAR_ANY);
> > >           }
> > > -      } else if (rp_att->load_op == VK_ATTACHMENT_LOAD_OP_LOAD) {
> > > +      } else if (rp_att->load_op == VK_ATTACHMENT_LOAD_OP_LOAD &&
> > > +                 iview->planes[0].isl.base_level == 0 &&
> > > +                 iview->planes[0].isl.base_array_layer == 0) {
> > >           /* The attachment may have been fast-cleared in a previous render
> > >            * pass and the value is needed now. Update the surface state(s).
> > >            *
> > > @@ -3162,7 +3326,6 @@ cmd_buffer_subpass_sync_fast_clear_values(struct anv_cmd_buffer *cmd_buffer)
> > >           genX(copy_fast_clear_dwords)(cmd_buffer, att_state->color.state,
> > >                                        iview->image,
> > >                                        VK_IMAGE_ASPECT_COLOR_BIT,
> > > -                                      iview->planes[0].isl.base_level,
> > >                                        false /* copy to ss */);
> > >
> > >           if (need_input_attachment_state(rp_att) &&
> > > @@ -3170,7 +3333,6 @@ cmd_buffer_subpass_sync_fast_clear_values(struct anv_cmd_buffer *cmd_buffer)
> > >              genX(copy_fast_clear_dwords)(cmd_buffer, att_state->input.state,
> > >                                           iview->image,
> > >                                           VK_IMAGE_ASPECT_COLOR_BIT,
> > > -                                         iview->planes[0].isl.base_level,
> > >                                           false /* copy to ss */);
> > >           }
> > >        }
> > > --
> > > 2.5.0.400.gff86faf
> > >
> >
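
For anyone skimming the last hunk, the decision logic in
cmd_buffer_subpass_sync_fast_clear_values can be summarized with the small
sketch below. The sketch_* names are hypothetical, and the numeric enum values
are an assumption (ordered NONE < ZERO_ONLY < ANY); only the branching is
taken from the quoted diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed ordering; the names mirror ANV_FAST_CLEAR_* from the patch. */
enum sketch_fast_clear_type {
   SKETCH_FAST_CLEAR_NONE = 0,
   SKETCH_FAST_CLEAR_ZERO_ONLY = 1,
   SKETCH_FAST_CLEAR_ANY = 2,
};

/* After a fast clear, record whether the clear color was zero or not. */
static enum sketch_fast_clear_type
sketch_type_after_fast_clear(bool clear_color_is_zero)
{
   return clear_color_is_zero ? SKETCH_FAST_CLEAR_ZERO_ONLY
                              : SKETCH_FAST_CLEAR_ANY;
}

/* With one clear color per image, the LOAD path in the hunk only copies
 * the stored color when the view starts at level 0, layer 0. */
static bool
sketch_should_load_clear_color(bool load_op_is_load,
                               uint32_t base_level,
                               uint32_t base_array_layer)
{
   return load_op_is_load && base_level == 0 && base_array_layer == 0;
}
```

As I read the hunk, the base_level/base_array_layer check follows from point
(1) of the commit message: there is now a single tracked clear color rather
than one per level.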