Future direction of the Mesa Vulkan runtime (or "should we build a new gallium?")

Triang3l triang3l at yandex.ru
Sat Jan 20 14:28:44 UTC 2024


Hello Faith and everyfrogy!

I've been developing a new Vulkan driver for Mesa — Terakan, for AMD
TeraScale Evergreen and Northern Islands GPUs — since May of 2023. You can
find it in amd/terascale/vulkan on the Terakan branch of my fork at
Triang3l/mesa. While it currently lacks many graphical features, the
architecture of state management, meta and descriptors has already largely
been implemented in its code. I'm relatively new to Mesa overall: in the
past I contributed the fragment shader interlock implementation to RADV,
which included working with state management, but I've never written a
Gallium driver, or a Vulkan driver in the ANV copy-pasting era, so this may
be a somewhat fresh — although quite conservative — take on the subject.

Due to various hardware and kernel driver differences (bindings being
individually loaded into fixed slots as part of the command buffer state,
the lack of command buffer chaining in the kernel resulting in having to
reapply all of the state when the size of the hardware command buffer
exceeds the HW/KMD limits), I've been designing the architecture of my
Vulkan driver largely from scratch, without using the existing Mesa drivers
as a reference.

Unfortunately, it seems like we ended up going in fundamentally opposite
directions in our designs, so I'd say that I'm much more scared about this
approach than I am excited about it.

My primary concerns about this architecture can be summarized into two
categories:

• The obligation to manage pipeline and dynamic state in the common
   representation — essentially the same Vulkan function call arguments,
   but with an additional layer for processing pNext and merging pipeline
   and dynamic state — restricts drivers' ability to optimize state
   management for specific hardware. Most importantly, it hampers
   precompiling state in pipeline objects.
   In state management, this would bring Mesa Vulkan implementations closer
   not even to Gallium, but to the dreaded OpenGL.

• Certain parts of the common code are designed around assumptions that
   hold for the majority of the hardware. However, some devices have large
   architectural differences in specific areas, and trying to force such
   hardware subsystems into the common way of programming results in having
   to write suboptimal algorithms, and sometimes in artificially restricting
   the VkPhysicalDeviceLimits the device can report.
   An example from my driver is the meaning of a pipeline layout on
   fixed-slot TeraScale. Because it uses flat binding indices throughout all
   sets (sets don't exist in the hardware at all), it needs base offsets for
   each set within the stage's bindings — which are precomputed at pipeline
   layout creation (see the sketch after this list). This is fundamentally
   incompatible with MR !27024's direction to remove the concept of a
   pipeline layout — and if the common vkCmdBindDescriptorSets makes the
   VK_KHR_maintenance6 layout-object-less path the only available one, it
   would add a lot of overhead by making it necessary to recompute the
   offsets at every bind.
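
To show what I mean, here's a minimal sketch of that precomputation (the
names are hypothetical, not the actual Terakan code):

#include <stdint.h>

struct example_set_layout {
   uint32_t binding_count; /* Flat hardware slots occupied by this set. */
};

/* At pipeline layout creation: a prefix sum over the set layouts gives the
 * base offset of each set within the stage's flat binding space. */
static void
example_pipeline_layout_init_bases(uint32_t *set_base_slots,
                                   const struct example_set_layout *const *set_layouts,
                                   uint32_t set_count)
{
   uint32_t base = 0;
   for (uint32_t i = 0; i < set_count; i++) {
      set_base_slots[i] = base;
      base += set_layouts[i]->binding_count;
   }
}

/* Without a pipeline layout object (the maintenance6 layout-object-less
 * path), this prefix sum has to be recomputed on every
 * vkCmdBindDescriptorSets call instead of once at layout creation. */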


I think what we need to consider about pipeline state (in the broader
sense, including both state objects and dynamic state) is that it
inherently has very different properties from anything the common runtime
already covers. What most of the current objects in the common runtime have
in common is that they:

• Are largely hardware-independent and can work everywhere the same way.
• Either:
   • Provide a complex solution to a large-scale problem, essentially being
     a sort of advanced "middleware". Examples are WSI, synchronization, the
     pipeline cache, secondary command buffer emulation, and render pass
     emulation.
   • Or, solve a trivial task in a way that's non-intrusive towards
     algorithms employed by the drivers — such as managing object handles,
     invoking allocators, reference-counting descriptor set and pipeline
     layouts, pooling VkCommandBuffer instances.
• Rarely influence the design of "hot path" functions, such as changes to
   pipeline state and bindings.

On the other hand, pipeline state:

1. Is entirely hardware-specific.
2. Is modified very frequently — making up the majority of command buffer
    recording time.
3. Can be precompiled in pipeline objects — and that's highly desirable due
    to the previous point.

Because of 1, there's almost nothing in the pipeline state that the common
runtime can help share between drivers. Yes, it could potentially be used to
automate running some NIR passes for baking static state into shaders, but
currently the runtime looks to be going in a somewhat different direction,
and that would only need some helper functions invoked at pipeline creation
time. Aside from that, I can't see it being useful for anything other than
merging static and dynamic state into a single structure. For drivers whose
developers prefer that approach for various reasons (prototyping simplicity,
or because the near-original-Vulkan level of abstraction is sufficient for
them and their target hardware in this area), this merging and dirty-marking
functionality can be provided in the "toolbox" way, with usage being
optional, via composition rather than inheritance: callbacks for getting the
static state structure from the pipeline and the dynamic state structure
from the command buffer, a layer of vkCmdSet* entry point fallbacks, and
functions to call from vkCmdBindPipeline (or a default implementation).
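
Roughly, the shape I'm imagining is something like the following; all of
the names are hypothetical, this is not an existing Mesa API, just a sketch
of the composition-over-inheritance idea:

struct vk_command_buffer;
struct vk_dynamic_graphics_state;
struct vk_pipeline;

struct example_state_toolbox_callbacks {
   /* Where the driver keeps its merged dynamic state in the command buffer. */
   struct vk_dynamic_graphics_state *(*get_dynamic_state)(
      struct vk_command_buffer *cmd_buffer);
   /* The static state the driver chose to keep in its pipeline object. */
   const struct vk_dynamic_graphics_state *(*get_pipeline_state)(
      const struct vk_pipeline *pipeline);
};

/* A helper a driver may (or may not) call from its own vkCmdBindPipeline:
 * it would copy the pipeline's static fields over the dynamic structure and
 * set the corresponding dirty bits. Drivers that precompile their state into
 * hardware-specific forms simply never call it. */
void example_toolbox_bind_graphics_pipeline(
   struct vk_command_buffer *cmd_buffer,
   const struct vk_pipeline *pipeline,
   const struct example_state_toolbox_callbacks *callbacks);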

As for 2 and 3, I don't think the amount of code alone is a solid enough
reason for Mesa to start making it more and more uncomfortable for drivers
to take advantage of precompiling static state in graphics pipelines. We
should not forget that the whole point of pipeline objects, along with
cross-stage optimizations, is to make command buffer recording cheaper —
which is also a large part of the idea of Vulkan itself. And if drivers can
utilize parts of the API to make applications run faster… we should be
encouraging that, not demoralizing driver developers who strive to do that.

I don't believe we should be "deprecating" monolithic pipelines in our
architectural decisions. While translation layers for some source APIs have
constraints related to data availability that make ESO the more optimal
approach, native Vulkan games and apps often use monolithic pipelines — I
know World War Z uses monolithic pipelines with only the viewport, depth
bias and stencil reference dynamic, early Vulkan games like Doom do too, and
probably many more. That has always been the recommended path.

In Terakan specifically, along with performing shader linkage, I strongly
want to be able to precompile the following state whenever it's static — and
I've been implementing all state in this precompiling way since the very
beginning:

• All vertex input bindings and attributes (with unused ones skipped if the
   pre-rasterization part of the pipeline is available) into a pointer to
   the "fetch shader subroutine" (with instance index divisor ALU code
   scheduled for the VLIW5/VLIW4 ALU architecture), a bitfield of used
   bindings, and strides for them.
• Viewports (although static viewports are rare) into scales/offsets,
   implicit scissors, and registers related to depth range and clamping.
• Rasterization state:
   • Polygon mode, cull mode, front face, depth bias toggle, provoking
     vertex into a 32-bit AND-NOT (keeping dynamic fields) and a 32-bit OR
     mask for the PA_SU_SC_MODE_CNTL register.
   • Clipping space parameters into ANDNOT/OR masks for 32-bit
     PA_CL_CLIP_CNTL.
• Custom MSAA sample locations into packed 4-bit values.
• All depth/stencil state into ANDNOT/OR masks for DB_DEPTH_CONTROL,
   DB_STENCILREFMASK and DB_STENCILREFMASK_BF 32-bit registers.
• The entire blending equation for each color attachment into a 32-bit
   CB_BLEND#_CONTROL.

This list includes most of VkGraphicsPipelineCreateInfo. Almost all of the
static state in my driver goes through some preprocessing, and there's zero
Vulkan enum parsing triggered by vkCmdBindPipeline in it.
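
For concreteness, here is a simplified sketch of the ANDNOT/OR mask idea
for one register; the field bit positions below are illustrative rather
than the real Evergreen layout, and the names aren't the actual Terakan
code:

#include <stdint.h>
#include <vulkan/vulkan.h>

/* Illustrative field positions within PA_SU_SC_MODE_CNTL. */
#define EXAMPLE_CULL_FRONT (1u << 0)
#define EXAMPLE_CULL_BACK  (1u << 1)
#define EXAMPLE_FACE_CW    (1u << 2)

struct example_precompiled_register {
   uint32_t clear_mask; /* Bits owned by the static state (AND-NOT at bind). */
   uint32_t set_bits;   /* Pre-converted static field values (OR at bind). */
};

/* Pipeline creation: parse the Vulkan enums exactly once. */
static void
example_precompile_su_sc_mode_cntl(struct example_precompiled_register *reg,
                                   VkCullModeFlags cull_mode,
                                   VkFrontFace front_face)
{
   reg->clear_mask = EXAMPLE_CULL_FRONT | EXAMPLE_CULL_BACK | EXAMPLE_FACE_CW;
   reg->set_bits = 0;
   if (cull_mode & VK_CULL_MODE_FRONT_BIT)
      reg->set_bits |= EXAMPLE_CULL_FRONT;
   if (cull_mode & VK_CULL_MODE_BACK_BIT)
      reg->set_bits |= EXAMPLE_CULL_BACK;
   if (front_face == VK_FRONT_FACE_CLOCKWISE)
      reg->set_bits |= EXAMPLE_FACE_CW;
}

/* vkCmdBindPipeline: two bitwise operations, no enum parsing. */
static uint32_t
example_apply_precompiled(uint32_t current_value,
                          const struct example_precompiled_register *reg)
{
   return (current_value & ~reg->clear_mask) | reg->set_bits;
}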

With this, not only do I not need the merging logic from the new
vk_pipeline, I don't even need to store what is essentially a copy of
VkGraphicsPipelineCreateInfo inside the pipeline object.

The merging of static and dynamic state is done at a different level of
abstraction in my driver — in a representation that's very close to the
registers. I already have custom implementations of the vkCmdSet* functions
converting directly to that representation, skipping Mesa's
vk_dynamic_graphics_state (which means the vk_dynamic_graphics_state in
vk_command_buffer objects is left in an undefined state in my driver — it
could safely be removed to save space). That doesn't require much effort
from me, in part because I try to reuse conversion functions between
vkCreateGraphicsPipelines and vkCmdSet* wherever possible.
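
As a small sketch of what that looks like (hypothetical names again, and an
illustrative bit layout, not the real register fields): a vkCmdSet*
implementation writes straight into the software state and marks it
pending, reusing the same conversion helper as the static path:

#include <stdint.h>
#include <vulkan/vulkan.h>

#define EXAMPLE_SU_SC_FACE_CW (1u << 2) /* Illustrative field position. */
#define EXAMPLE_STATE_RASTERIZATION_BIT (1u << 0)

struct example_state_draw {
   uint32_t pa_su_sc_mode_cntl;
   uint32_t pending; /* One bit per state element group. */
};

/* Shared between vkCreateGraphicsPipelines (static) and vkCmdSet* (dynamic). */
static uint32_t
example_convert_front_face(VkFrontFace front_face)
{
   return front_face == VK_FRONT_FACE_CLOCKWISE ? EXAMPLE_SU_SC_FACE_CW : 0;
}

static void
example_cmd_set_front_face(struct example_state_draw *state,
                           VkFrontFace front_face)
{
   state->pa_su_sc_mode_cntl =
      (state->pa_su_sc_mode_cntl & ~EXAMPLE_SU_SC_FACE_CW) |
      example_convert_front_face(front_face);
   state->pending |= EXAMPLE_STATE_RASTERIZATION_BIT;
}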

To summarize what I feel about writing state management code:

• Am I okay with copying a bit of code (usually 5-6 lines per entry point)
   from vkCreateGraphicsPipelines and vkCmdBindPipeline to vkCmdSet*?
   Totally yes.
• Would I be okay with doing 49 dyn->dirty BITSET_TESTs for every draw
   command, many of which lead to some vk_to_nv9097? I understand why other
   people may prefer this approach, but for me personally, that's like
   asking whether I would happily disfigure my (hypothetical) child with my
   own hands.

And even when a driver does preprocess static state, if the common pipeline
state logic is forced upon drivers, the parts already handled by
preprocessing will still waste execution time going through the common logic
only to never actually be used by the driver — we end up summing the cost of
both approaches rather than one replacing the other.


For additional context, here's what my state architecture looks like:

• terakan_pipeline_graphics:
   • Either a monolithic pipeline, or a library, or a pipeline constructed
     from libraries.
   • Separated into `struct`s for GPL parts (vertex input,
     pre-rasterization, fragment shader, fragment output, plus some shared
     parts like multisampling) from day 1.
   • Examples of state elements within those structures are hardware
     registers (full or partial), pre-converted viewports, vertex fetch
     subroutine, vertex binding strides.
     • If a part of a hardware register is dynamic, it's excluded from the
       32-bit replacement mask for that register.
   • Bitset of which state elements are static, bitscanned when binding.

• terakan_state_draw ("software" state):
   • Modified by vkCmdBindPipeline or vkCmdSet*.
   • State elements are close to hardware registers, but application of them
     may include minor postprocessing, such as intersecting
     viewport-implicit and application-provided scissor rectangles, or
     reindexing (compacting) of hardware color attachments due to the
     hardware D3D11-OMSetRenderTargetsAndUnorderedAccessViews-like
     requirements for storage resource binding.
   • In some cases there may be dependencies between state elements — this
     is the "intermediate" representation with somewhat relaxed rules for
     that.
   • Bitset of "pending" state, bitscanned before the application's draws to
      invoke apply callbacks (see the sketch after this outline).
   • Only stores the application-provided state — (custom) meta draws skip
     this level, but mark touched state here as "pending" for it to be
     restored the next time the application wants to draw something.

• terakan_hw_state_draw ("hardware" state):
   • Modified by terakan_state_draw applying and by meta draws.
   • State elements are very close to hardware registers, and each of them
     is entirely atomic.
   • Due to the lack of command buffer chaining in the kernel driver, this
     is the part that handles switching to the new hardware command buffer:
     • This is the only place where hardware commands for changing the
       graphics state are emitted, so their result is not lost.
     • When starting a new hardware command buffer, all state that has ever
       been set is re-emitted with the same callbacks as normally.
   • Bitset of modified state, bitscanned before application's or meta draws
     to invoke emit callbacks.
   • Additionally, this is where the resource binding slots are managed,
     including deduplication of unchanged bindings, and arbitration of
     hardware LS and ES+VS binding slot space usage between Vulkan VS and
     TES stages.
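
To make the bitscan-and-apply pattern above concrete, a condensed sketch
with hypothetical names (the state groups and callback tables are of course
much larger in the real driver):

#include <stdint.h>

struct example_cmd_buffer;

typedef void (*example_apply_cb)(struct example_cmd_buffer *cmd);

enum example_state_group {
   EXAMPLE_STATE_VIEWPORT,
   EXAMPLE_STATE_SCISSOR,
   EXAMPLE_STATE_RASTERIZATION,
   EXAMPLE_STATE_GROUP_COUNT,
};

/* Called before an application draw: visit only the groups touched since
 * the last draw. When a new hardware command buffer is started, the same
 * callbacks are replayed for every group that has ever been set. */
static void
example_flush_draw_state(struct example_cmd_buffer *cmd, uint32_t *pending,
                         const example_apply_cb callbacks[EXAMPLE_STATE_GROUP_COUNT])
{
   while (*pending) {
      int group = __builtin_ctz(*pending); /* u_bit_scan() in Mesa terms. */
      *pending &= *pending - 1u;
      callbacks[group](cmd);
   }
}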


An additional note regarding the common meta code: I'm unable to use the
functionality in it that involves writing to images from compute shaders.
It's noted in AMD's AddrLib that on some of my target hardware (Cypress,
the earliest Evergreen chip), using a VK_IMAGE_TILING_LINEAR image as a
storage image causes a failure/hang, so for copying to images I have to use
rasterization at least in some cases. Also, some meta operations can benefit
from hardware-specific functionality, such as MSAA resolves, which can be
done by drawing a rectangle with a special output merger hardware
configuration.


Regarding implementing new features in the common code, I think where
that's possible, the toolbox approach is enough, but where it's not, making
the common runtime more intrusive won't help either way.

GPL and ESO will still need a lot of driver-specific code for the (non-)
preprocessing reasons above (and also in RADV, dynamic state in some cases
even results in less optimal register setup, setting registers to more
"generic" values compared to the same state specified as static). Minor
things like allowing null descriptors in more areas would still require
handling of those VK_NULL_HANDLE cases inside each driver.

On the other hand, if, as I mentioned earlier, the common runtime starts,
for example, normalizing vkCmdBindDescriptorSets to the VK_KHR_maintenance6
layout == VK_NULL_HANDLE form (passing a descriptor set layout array instead
of the pipeline layout even when the application provided one) rather than
treating that as a special case, then that's essentially a big NAK (not the
compiler, in this context) from my side, because Terakan would have to
calculate prefix sums on every vkCmdBindDescriptorSets call.


Overall, MR !27024 not only did not make me want to adopt any of the new
common bases it introduces, it actually had the reverse effect on me — in
particular, it's now a high-priority task for me to _get rid_ of the common
vk_pipeline_layout that I'm already using in my driver, in favor of a custom
implementation.

The issue I have with vk_pipeline_layout is that it contains a fixed-size
32-element array for the descriptor set layouts. However, in my driver,
descriptor sets are purely a CPU-side abstraction, as the hardware has fixed
binding slots, and the cost of vkCmdBindDescriptorSets scales with the
number of individual bindings involved, not the number of sets. So on my
target hardware, binding a large descriptor set is very suboptimal if you
only actually need to change one or two bindings in it. And I want to
reflect that in the device properties — report maxBoundDescriptorSets = 1054
(the total number of exposed hardware slots of all types), and try to take
advantage of this qualitative property in a fork of DXVK and maybe in Xenia.
For that, I replaced the static array with dynamic allocation at pipeline
layout creation on my branch. However, !27024 makes things more complicated
for my driver there by adding more fixed-size arrays of descriptor set
layouts (thankfully I don't need any of them… at least until I'm forced to
need them).

Of course, such a large maxBoundDescriptorSets is not something existing
software will take advantage of; the only precedent of something similar was
MoltenVK with 1 billion sets, which also used fixed slots before argument
buffers were added to Metal. But it feels like that's only the beginning,
and in the future we're going to see the common runtime and its
limit/feature/format support (the latter, for example, potentially affecting
something like separate depth and stencil images — very useful for Scaleform
GFx stencil masks, for instance) holding drivers back more and more.
Something like maxBoundDescriptorSets = 32 combined with maxPushDescriptors
= 32 is definitely NAK-worthy for me, as that would make it simply
impossible for software to take advantage of the flat binding model when
running on top of my driver. I can also handle maxPushConstantsSize closer
to 64 KB just fine — in the hardware, push constants are just yet another
UBO.

If that's the future of Mesa, I don't know at all how I'll be able to
maintain my driver in the upstream without progressively slaughtering its
functionality and performance to the core.

(My new workaround idea for the vk_pipeline_layout fixed-size issue is to
derive vk_pipeline_layout, vk_descriptor_set_layout and my custom
terakan_pipeline_layout from some common vk_refcounted_object_base, and in
vk_cmd_enqueue_CmdBindDescriptorSets, to call vk_refcounted_object_base_ref
instead of vk_pipeline_layout_ref directly. This would make it possible for
me to provide a custom pipeline layout implementation while still being able
to use Mesa's common secondary command buffer emulation: since my driver
targets hardware without virtual memory and command buffer chaining, it's
cheaper to record the Vulkan commands themselves than to merge hardware
commands between command buffers while patching relocations and inherited
objects, or to put even the smallest secondary command buffers into separate
submissions with full state resets. But again, this goes in the direction
opposite to increasing the common runtime's intrusiveness.)
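
In code, I imagine that base looking roughly like this (again just a sketch
of the idea, not an existing Mesa type, with atomics omitted for brevity):

#include <stdint.h>

struct vk_refcounted_object_base {
   uint32_t ref_cnt; /* An atomic counter in real code. */
   void (*destroy)(struct vk_refcounted_object_base *obj);
};

static inline void
vk_refcounted_object_base_ref(struct vk_refcounted_object_base *obj)
{
   obj->ref_cnt++;
}

static inline void
vk_refcounted_object_base_unref(struct vk_refcounted_object_base *obj)
{
   if (--obj->ref_cnt == 0)
      obj->destroy(obj);
}

/* Both the common vk_pipeline_layout and a driver's custom layout type
 * would embed this as their first member, so the enqueue path could take
 * and drop references without knowing the concrete type. */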


Even knowing that my driver will never be able to run VKD3D-Proton or Doom
Eternal properly is nowhere near as demotivating to me as what this may
entail. Fighting with the GPU's internals and the specification is fun.
The prospect of eternally fighting with merge request comments suggesting
that I adopt NVK's architectural patterns, and of losing performance and
limit values to them when that can trivially be avoided in the code, is a
different thing — and with my driver being something niche rather than a
popular RADV, NVK or ANV, I honestly don't have high hopes about the
attention the common runtime will pay to the distinct properties of my
target hardware in the future.

— Tri

