[Mesa-dev] [PATCH 00/25] anv: A major rework of color attachment surface states

Thu Nov 10 23:58:43 UTC 2016

On Thu, Nov 10, 2016 at 02:25:37PM -0800, Jason Ekstrand wrote:
> On Thu, Nov 10, 2016 at 1:56 PM, Nanley Chery <nanleychery at gmail.com> wrote:
> 
> > On Thu, Nov 10, 2016 at 12:52:44PM -0800, Jason Ekstrand wrote:
> > > On Nov 10, 2016 11:44, "Nanley Chery" <nanleychery at gmail.com> wrote:
> > > >
> > > > On Wed, Nov 09, 2016 at 06:43:23PM -0800, Jason Ekstrand wrote:
> > > > > On Wed, Nov 9, 2016 at 5:09 PM, Nanley Chery <nanleychery at gmail.com>
> > > wrote:
> > > > >
> > > > > > On Sat, Oct 22, 2016 at 10:50:31AM -0700, Jason Ekstrand wrote:
> > > > > > > This series does some fairly major surgery on color attachment
> > > surface
> > > > > > > state allocation and fill-out in the Intel Vulkan driver.  This
> > is
> > > in
> > > > > > > preparation for doing color compression, fast-clears, and
> > > HiZ-capable
> > > > > > input
> > > > > > > attachments.  Naturally, as with everything else I've done in the
> > > last 2
> > > > > > > months, it also involves some non-trivial blorp work.
> > > > > > >
> > > > > > > Let's start off at the beginning...  For a variety of reasons, we
> > > can't
> > > > > > > really know 100% of the details of an attachment's surface state
> > at
> > > any
> > > > > > > other places than vkCmdBeginRenderPass and vkCmdNextSubpss.  The
> > > same
> > > > > > > applies for depth buffers if you consider 3DSTATE_DEPTH_BUFFER
> > and
> > > > > > friends
> > > > > > > to be the depth and stencil buffer's "surface state".  That's a
> > > fairly
> > > > > > > strong statement, but there are a couple of reasons for this:
> > > > > > >
> > > > > > >  1) In order for fast-clears to work, the surface state has to
> > > contain
> > > > > > the
> > > > > > >     clear color.  (This is it's own packet for HiZ but not for
> > > color.)
> > > > > > We
> > > > > > >     don't know the clear value until BeginRenderPass.  This
> > means we
> > > > > > can't
> > > > > > >     fully fill out the surface state in vkCmdCreateImageView.
> > > > > > >
> > > > > > >  2) The Vulkan spec requires that you be able to call
> > > > > > vkBeginCommandBuffer
> > > > > > >     on a secondary command buffer with
> > > USAGE_RENDER_PASS_CONTINUE_BIT set
> > > > > > >     but with a null framebuffer.  In this case, the secondary is
> > > supposed
> > > > > > >     to inherit the framebuffer from the primary.  (This is not
> > > something
> > > > > > we
> > > > > > >     have properly implemented until now.)  This means that
> > anything
> > > that
> > > > > > is
> > > > > > >     callable from a render-pass-continuing secondary command
> > buffer
> > > has
> > > > > > to
> > > > > > >     be able to operate without knowing any surface details that
> > > aren't
> > > > > > part
> > > > > > >     of the VkRenderPass object.  Basically, all you know is the
> > > Vulkan
> > > > > > >     format (not the isl format) and the sample count.
> > > > > > >
> > > > > > > Between the two of those, about the only two entrypoints left at
> > > which we
> > > > > > > actually know surface details are vkCmdBeginRenderPass and
> > > > > > vkCmdNextSubpass
> > > > > > > so we have to figure out how to do everything there.  As it turns
> > > out,
> > > > > > this
> > > > > > > works out surprisingly well.  The format and the sample count
> > turn
> > > out to
> > > > > > > be exactly the data we actually need in order to do all of our
> > > pipeline
> > > > > > > programming.  The only hard part is refactoring things so that it
> > > pulls
> > > > > > the
> > > > > > > data from the render pass instead of the framebuffer.  There are
> > a
> > > number
> > > > > > > of places where we were grabbing the image view for an attachment
> > > because
> > > > > > > we either wanted to shove something into blorp or because we
> > wanted
> > > the
> > > > > > > format and we were lazy.
> > > > > > >
> > > > > > > The approach taken in this patch series is the following:
> > > > > > >
> > > > > > >  1) Instead of allocating render target surface states in
> > > > > > vkCreateImageView,
> > > > > > >     we allocate them as part of render pass setup in
> > > > > > vkCmdBeginRenderPass.
> > > > > > >     All of the surface states we will ever need (including a null
> > > surface
> > > > > > >     state) are allocated up-front out of a single contiguous
> > block.
> > > > > > >
> > > > > > >  2) For secondary command buffers with
> > > USAGE_RENDER_PASS_CONTINUE_BIT
> > > > > > set,
> > > > > > >     we allocate storage for all of the surface states but don't
> > > actually
> > > > > > >     fill them out.  In the secondary command buffer, all binding
> > > tables
> > > > > > >     refer to these surface states rather than the ones in the
> > > primary.
> > > > > > >
> > > > > > >  3) A blorp entrypoint is added that performs a clear operation
> > > without
> > > > > > >     touching the depth/stencil buffer state and with a color
> > > attachment
> > > > > > >     binding table explicitly provided by the caller.  This means
> > > that
> > > > > > even
> > > > > > >     our blorp clears are using the surface states allocated in
> > > > > > >     vkCmdBeginRenderPass.  Unfortunately, this turned out to be
> > > more work
> > > > > > >     than expected because I had to add vertex shader support to
> > > blorp
> > > > > > along
> > > > > > >     the way.
> > > > > > >
> > > > > > >  4) Here's the tricky bit.  When vkCmdExecuteCommands is called
> > > during a
> > > > > > >     render pass, we use transform feedback (yeah, crazy) to copy
> > the
> > > > > > block
> > > > > > >     of surface states from the primary into the secondary right
> > > before
> > > > > > >     executing the secondary.
> > > > > >
> > > > > > Could we perform a CPU memcpy at this stage?
> > > > > >
> > > > >
> > > > > We can but only if the secondary is created with the
> > > SIMULTANEOUS_USE_BIT
> > > > > unset.  If SIMULTANEOUS_USE_BIT is set, then it may be used in
> > multiple
> > > > > primaries at the same time.  In this case, since we don't know the
> > order
> > > > > they will be submitted to the GPU, we have to do the copy at the last
> > > > > minute.  It may be a bit more performant to do the CPU memcpy in that
> > > case.
> > > > >
> > > > > I didn't do that yet because I wanted to let it bake for a while with
> > > just
> > > > > the GPU memcpy.  The CPU memcpy is guaranteed to be reliable but the
> > GPU
> > > > > memcpy isn't.  By leaving it as just GPU memcpy, we'll get better
> > tests
> > > > > coverage of the crazy path.  If we find that it's causing problems,
> > > doing a
> > > > > CPU copy is a nice little optimization to have in our back pockets.
> > > > >
> > > >
> > > > Got it. For this series, I think it would be helpful to print some sort
> > > > of warning that surface states may be invalid in aub dumps.
> > >
> > > We could, I suppose.  It would print every time someone uses a secondary
> > > but that's probably OK as long as it's only in debug builds.
> > >
> > > > We don't need to do the following now, but if we wanted to, I think we
> > > > could always do a CPU memcpy. One way to accomplish this is to make
> > > > secondaries with both SIMULTANEOUS_USE_BIT and RENDER_PASS_CONTINUE_BIT
> > > > set use a contiguous batch. We could then emit this batch into the
> > > > primary and replace the surface state block at vkCmdExecuteCommands().
> > >
> > > Replacing surface state blocks is much harder than it sound.  You not
> > only
> > > have to allocate and copy but you also have to go in and patch up every
> > > STATE_BASE_ADDRESS and probably patch up all the binding tables as well
> > > because their address relative to the center of the surface state pool
> > and
> > > with it their address relative to the surface states had changed.
> > >
> >
> > By replace the surface state block, I mean copy primary->state.render_pass_
> > states
> > into secondary->state.render_pass_states. That is what we're doing with
> > the GPU memcpy isn't it? (I'm intentionally omitting bo and offset
> > details here).
> >
> 
> Yes but we can only do a CPU memcpy of SIMULTANEOUS_USE_BIT isn't set.  If
> it is set, then the only way to do a CPU copy is if we allocate a new copy
> of render_pass_states for each primary it's used in.  If we make a copy of
> render_pass_states, then we have to update all of the binding tables to
> point to it.  This means we have to allocate new binding tables which means
> we have to patch STATE_BASE_ADDRESS and on down the rabbit hole you go.  In
> other words, we can't just patch up the one part of it from the CPU.  As
> soon as we make a new copy of one part, we're making a new copy of almost
> everything.
> 
> 
> > > So, yes, we *can* do it in the sense that is allowed by the api but it
> > > will add substantial complexity to ExecuteCommands.
> > >

Oh okay, I didn't realize/forgot that the surface state block is part of
a device-local pool, and is not local to the command buffer. That's the
reason you'd have to allocate a new one. I'm familiar with the chain
of updates you'd have to make after allocating a new block and agree
that it's overly complicated. I'm not so familiar with how the surface
state pool is set up and what acceptable changes would be required to
make the above algorithm work, so I don't have anymore comments about
this for now. Thanks for the feedback.

- Nanley

> > > > This matches the following note from the spec:
> > > >
> > > >   On some implementations, not using the
> > > >   VK_COMMAND_BUFFER_USAGE_SIMULTANEOUS_USE_BIT bit enables command
> > > >   buffers to be patched in-place if needed, rather than creating a copy
> > > >   of the command buffer.