[Mesa-dev] [PATCH 00/25] anv: A major rework of color attachment surface states

Thu Nov 10 22:25:37 UTC 2016

On Thu, Nov 10, 2016 at 1:56 PM, Nanley Chery <nanleychery at gmail.com> wrote:

> On Thu, Nov 10, 2016 at 12:52:44PM -0800, Jason Ekstrand wrote:
> > On Nov 10, 2016 11:44, "Nanley Chery" <nanleychery at gmail.com> wrote:
> > >
> > > On Wed, Nov 09, 2016 at 06:43:23PM -0800, Jason Ekstrand wrote:
> > > > On Wed, Nov 9, 2016 at 5:09 PM, Nanley Chery <nanleychery at gmail.com>
> > wrote:
> > > >
> > > > > On Sat, Oct 22, 2016 at 10:50:31AM -0700, Jason Ekstrand wrote:
> > > > > > This series does some fairly major surgery on color attachment
> > surface
> > > > > > state allocation and fill-out in the Intel Vulkan driver.  This
> is
> > in
> > > > > > preparation for doing color compression, fast-clears, and
> > HiZ-capable
> > > > > input
> > > > > > attachments.  Naturally, as with everything else I've done in the
> > last 2
> > > > > > months, it also involves some non-trivial blorp work.
> > > > > >
> > > > > > Let's start off at the beginning...  For a variety of reasons, we
> > can't
> > > > > > really know 100% of the details of an attachment's surface state
> at
> > any
> > > > > > other places than vkCmdBeginRenderPass and vkCmdNextSubpss.  The
> > same
> > > > > > applies for depth buffers if you consider 3DSTATE_DEPTH_BUFFER
> and
> > > > > friends
> > > > > > to be the depth and stencil buffer's "surface state".  That's a
> > fairly
> > > > > > strong statement, but there are a couple of reasons for this:
> > > > > >
> > > > > >  1) In order for fast-clears to work, the surface state has to
> > contain
> > > > > the
> > > > > >     clear color.  (This is it's own packet for HiZ but not for
> > color.)
> > > > > We
> > > > > >     don't know the clear value until BeginRenderPass.  This
> means we
> > > > > can't
> > > > > >     fully fill out the surface state in vkCmdCreateImageView.
> > > > > >
> > > > > >  2) The Vulkan spec requires that you be able to call
> > > > > vkBeginCommandBuffer
> > > > > >     on a secondary command buffer with
> > USAGE_RENDER_PASS_CONTINUE_BIT set
> > > > > >     but with a null framebuffer.  In this case, the secondary is
> > supposed
> > > > > >     to inherit the framebuffer from the primary.  (This is not
> > something
> > > > > we
> > > > > >     have properly implemented until now.)  This means that
> anything
> > that
> > > > > is
> > > > > >     callable from a render-pass-continuing secondary command
> buffer
> > has
> > > > > to
> > > > > >     be able to operate without knowing any surface details that
> > aren't
> > > > > part
> > > > > >     of the VkRenderPass object.  Basically, all you know is the
> > Vulkan
> > > > > >     format (not the isl format) and the sample count.
> > > > > >
> > > > > > Between the two of those, about the only two entrypoints left at
> > which we
> > > > > > actually know surface details are vkCmdBeginRenderPass and
> > > > > vkCmdNextSubpass
> > > > > > so we have to figure out how to do everything there.  As it turns
> > out,
> > > > > this
> > > > > > works out surprisingly well.  The format and the sample count
> turn
> > out to
> > > > > > be exactly the data we actually need in order to do all of our
> > pipeline
> > > > > > programming.  The only hard part is refactoring things so that it
> > pulls
> > > > > the
> > > > > > data from the render pass instead of the framebuffer.  There are
> a
> > number
> > > > > > of places where we were grabbing the image view for an attachment
> > because
> > > > > > we either wanted to shove something into blorp or because we
> wanted
> > the
> > > > > > format and we were lazy.
> > > > > >
> > > > > > The approach taken in this patch series is the following:
> > > > > >
> > > > > >  1) Instead of allocating render target surface states in
> > > > > vkCreateImageView,
> > > > > >     we allocate them as part of render pass setup in
> > > > > vkCmdBeginRenderPass.
> > > > > >     All of the surface states we will ever need (including a null
> > surface
> > > > > >     state) are allocated up-front out of a single contiguous
> block.
> > > > > >
> > > > > >  2) For secondary command buffers with
> > USAGE_RENDER_PASS_CONTINUE_BIT
> > > > > set,
> > > > > >     we allocate storage for all of the surface states but don't
> > actually
> > > > > >     fill them out.  In the secondary command buffer, all binding
> > tables
> > > > > >     refer to these surface states rather than the ones in the
> > primary.
> > > > > >
> > > > > >  3) A blorp entrypoint is added that performs a clear operation
> > without
> > > > > >     touching the depth/stencil buffer state and with a color
> > attachment
> > > > > >     binding table explicitly provided by the caller.  This means
> > that
> > > > > even
> > > > > >     our blorp clears are using the surface states allocated in
> > > > > >     vkCmdBeginRenderPass.  Unfortunately, this turned out to be
> > more work
> > > > > >     than expected because I had to add vertex shader support to
> > blorp
> > > > > along
> > > > > >     the way.
> > > > > >
> > > > > >  4) Here's the tricky bit.  When vkCmdExecuteCommands is called
> > during a
> > > > > >     render pass, we use transform feedback (yeah, crazy) to copy
> the
> > > > > block
> > > > > >     of surface states from the primary into the secondary right
> > before
> > > > > >     executing the secondary.
> > > > >
> > > > > Could we perform a CPU memcpy at this stage?
> > > > >
> > > >
> > > > We can but only if the secondary is created with the
> > SIMULTANEOUS_USE_BIT
> > > > unset.  If SIMULTANEOUS_USE_BIT is set, then it may be used in
> multiple
> > > > primaries at the same time.  In this case, since we don't know the
> order
> > > > they will be submitted to the GPU, we have to do the copy at the last
> > > > minute.  It may be a bit more performant to do the CPU memcpy in that
> > case.
> > > >
> > > > I didn't do that yet because I wanted to let it bake for a while with
> > just
> > > > the GPU memcpy.  The CPU memcpy is guaranteed to be reliable but the
> GPU
> > > > memcpy isn't.  By leaving it as just GPU memcpy, we'll get better
> tests
> > > > coverage of the crazy path.  If we find that it's causing problems,
> > doing a
> > > > CPU copy is a nice little optimization to have in our back pockets.
> > > >
> > >
> > > Got it. For this series, I think it would be helpful to print some sort
> > > of warning that surface states may be invalid in aub dumps.
> >
> > We could, I suppose.  It would print every time someone uses a secondary
> > but that's probably OK as long as it's only in debug builds.
> >
> > > We don't need to do the following now, but if we wanted to, I think we
> > > could always do a CPU memcpy. One way to accomplish this is to make
> > > secondaries with both SIMULTANEOUS_USE_BIT and RENDER_PASS_CONTINUE_BIT
> > > set use a contiguous batch. We could then emit this batch into the
> > > primary and replace the surface state block at vkCmdExecuteCommands().
> >
> > Replacing surface state blocks is much harder than it sound.  You not
> only
> > have to allocate and copy but you also have to go in and patch up every
> > STATE_BASE_ADDRESS and probably patch up all the binding tables as well
> > because their address relative to the center of the surface state pool
> and
> > with it their address relative to the surface states had changed.
> >
>
> By replace the surface state block, I mean copy primary->state.render_pass_
> states
> into secondary->state.render_pass_states. That is what we're doing with
> the GPU memcpy isn't it? (I'm intentionally omitting bo and offset
> details here).
>

Yes but we can only do a CPU memcpy of SIMULTANEOUS_USE_BIT isn't set.  If
it is set, then the only way to do a CPU copy is if we allocate a new copy
of render_pass_states for each primary it's used in.  If we make a copy of
render_pass_states, then we have to update all of the binding tables to
point to it.  This means we have to allocate new binding tables which means
we have to patch STATE_BASE_ADDRESS and on down the rabbit hole you go.  In
other words, we can't just patch up the one part of it from the CPU.  As
soon as we make a new copy of one part, we're making a new copy of almost
everything.

> > So, yes, we *can* do it in the sense that is allowed by the api but it
> will
> > add substantial complexity to ExecuteCommands.
> >
> > > This matches the following note from the spec:
> > >
> > >   On some implementations, not using the
> > >   VK_COMMAND_BUFFER_USAGE_SIMULTANEOUS_USE_BIT bit enables command
> > >   buffers to be patched in-place if needed, rather than creating a copy
> > >   of the command buffer.
> > >
> > > - Nanley
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20161110/190fbc02/attachment-0001.html>