Future direction of the Mesa Vulkan runtime (or "should we build a new gallium?")

Triang3l triang3l at yandex.ru
Wed Jan 24 20:02:16 UTC 2024


I'll agree with Jose about Vulkan being a low-level abstraction, and to me
the "opt-in" way seems like a much more balanced approach to achieving our
goals — not only balanced between the goals themselves (code amount and
time to implement aren't our only criteria to optimize), but also across
the variety of hardware. If something goes wrong with the watertight
abstraction for a certain implementation, not only would it take more time
to find a solution, but the issues of one driver would risk wasting
everyone's time, as it'd often be necessary to make debatable changes to
interfaces used by all drivers.

I also need to further clarify my point regarding the design of what we
want to encourage drivers to use, specifically about pipeline objects and
dynamic state/ESO.

Vulkan, as I see it from all the perspectives I regularly interact with it
from — as an RHI programmer at a game studio, a translation layer
developer (the Xenia Xbox 360 emulator), and now the author of a driver
for it — has not grown much thicker than it originally was. What has
increased is its surface area — but in the place where it actually
matters: letting applications convey their intentions more precisely.

I'd say it's even thinner and more transparent from this point of view now.
We got nice things like inline uniform blocks, host image copy, push
descriptors, descriptor buffers, and of course dynamic state — and they all
pretty much directly correspond to hardware concepts that apps can utilize
to do what they want with less indirection between their actual
architecture and the hardware.

Essentially, the application and the driver (and the rest of the chain —
the specification, I'd like to retract my statement about "fighting" it, by
the way, and the hardware controlled by that driver) can work more
cooperatively now, towards their common goal of delivering what the app
developer wants to provide to the user with as high quality and speed as
realistically possible. They now have more ways of helping each other by
communicating their intentions and capabilities more completely and
accurately.

And it's important for us not to go *backwards*.


This is why I think it's just fundamentally wrong to encourage drivers to
layer pipeline objects and static state on top of dynamic state.

An application would typically use static state when it:
  • Knows the potentially needed state setups in advance (like in a game
    with its material state permutations preprocessed, or in a
    non-gaming/non-DCC app).
  • Wants to quickly apply a complete state configuration.
  • Maybe doesn't care much about the state used by previous work, like
    when drawing wildly different kinds of objects in a scene.
At the same time, it'd choose dynamic state if it:
  • Doesn't have upfront knowledge of possible states (like in an OpenGL/
    D3D9/D3D11 translation layer or a console emulator, or with a highly
    flexible art pipeline in the game).
  • Wants to quickly make small, incremental state changes.
  • Maybe wants to mix state variables updated at different frequencies.

Their use cases, and the application intentions they convey, are as
opposite as the antonymous words "static" and "dynamic" they're named
after. Treating one as a specialization of the other makes the driver just
as blind as back in 2016, when applications had no option but to reduce
everything to static state.

(Of course, with state spanning so many pipeline stages, applications
would usually not just pick one of the two extremes, and may instead want
static state for some cases/stages and dynamic for others. This is also
where the route Vulkan's development has taken over these 8 years is very
wise: instead of forcing Escobar's axiom of choice upon applications, it
lets them specify their intentions on a per-variable basis and choose the
appropriate amount of state grouping among monolithic pipelines, GPL with
libraries containing one or multiple parts of a pipeline, and ESO.)
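
To illustrate that per-variable choice with the core structures (a minimal
sketch; which states are made dynamic here is just an example):

  /* The application marks only the viewport and scissor as dynamic;
   * everything else, including the blend equation in pColorBlendState,
   * stays static, so the driver may pretranslate it once at
   * vkCreateGraphicsPipelines time rather than at every draw. */
  const VkDynamicState dynamic_states[] = {
      VK_DYNAMIC_STATE_VIEWPORT,
      VK_DYNAMIC_STATE_SCISSOR,
  };
  const VkPipelineDynamicStateCreateInfo dynamic_state_info = {
      .sType = VK_STRUCTURE_TYPE_PIPELINE_DYNAMIC_STATE_CREATE_INFO,
      .dynamicStateCount = 2,
      .pDynamicStates = dynamic_states,
  };
  /* graphics_pipeline_create_info.pDynamicState = &dynamic_state_info; */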


The primary rule of game optimization is: if you can avoid doing something
every frame, or, even worse, hundreds or thousands of times per frame, do
whatever reuse you can to avoid it. If we know that's what the game wants
to do — because it provides a pipeline object with the state it wants to
be static, or a pipeline layout object — we should be aiding it. Likewise,
if the game tells us that it can't precompile something, the graphics
stack should do the best it can in that situation — but it would be wrong
to add the overhead of running a time machine back to 2016 to its draws.
After all, the driver's draw call code and the game's draw call code are
both just draw call code with one common goal.


So, whichever solution we end up with, it must not be a "broken telephone"
degrading the cooperation between the application and the driver. And we
should not forget that the communication between them is two-way, which
includes:
  • Interface calls done by the app.
  • Limits and features exposed by the driver.

Having accurate information about the other party is important for both to
be able to make optimal decisions that account for the real strong points
and the real constraints of the two. And note that I'm talking about
interface calls and limits collectively because the application's Vulkan
usage approaches essentially represent the "limits" of the application as
well — like whether it has sufficient information to precompile pipeline
state.

When the common runtime just gets in the way here, it means that it's
basically acting *against* the goal of the two… green sus 🐸


If the NVK developers consider the near-unprocessed representation of the
state behind both the static and the dynamic interfaces sufficiently
optimal for their target hardware, that's fine. But other drivers should
not be punished for doing what the application and the specification are
enabling and even expecting them to do. Like baking immutable samplers
into shader code. Or taking advantage of static, non-update-after-bind
descriptors to assign UBOs to a fast hardware path in a more
straightforward way. Or preprocessing static state in a pipeline object.
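
As a sketch of the immutable sampler case (pImmutableSamplers and
descriptorCount are real VkDescriptorSetLayoutBinding fields; the
driver-side names are invented for illustration):

  /* At vkCreateDescriptorSetLayout time, an immutable sampler's state is
   * fully known, so instead of being fetched through a descriptor at draw
   * time it can go into the shader compilation key and be baked straight
   * into the generated code. */
  if (binding->pImmutableSamplers != NULL) {
      for (uint32_t i = 0; i < binding->descriptorCount; i++)
          key->baked_samplers[binding->binding][i] =
              hw_pack_sampler_state(binding->pImmutableSamplers[i]);
  }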

Maybe, in reality, it wouldn't have been a very big deal from the
performance perspective if, in my driver Terakan, calling vkCmdBindPipeline
with static blend equation state had resulted in 6 enum translations and
some shifts/ORs instead of just one 32-bit assignment. Or if, with my
target hardware having fixed binding slots, every vkCmdBindDescriptorSets
call had run a += loop until firstSet instead of looking up the base slots
in the pipeline layout.
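
In code, the difference would look roughly like this (a hypothetical
sketch: the register layout, bit offsets, and helper names are invented;
only the Vk* blend field names are real):

  /* With a pipeline object: the blend equation was pretranslated at
   * vkCreateGraphicsPipelines time, so binding it is one 32-bit
   * assignment. */
  cmd->regs.blend_control = pipeline->blend_control;

  /* Layered on top of dynamic state: six VkBlendFactor/VkBlendOp enum
   * translations plus shifts/ORs at every vkCmdBindPipeline. */
  cmd->regs.blend_control =
      ((uint32_t)hw_blend_factor(blend->srcColorBlendFactor) << 0) |
      ((uint32_t)hw_blend_factor(blend->dstColorBlendFactor) << 5) |
      ((uint32_t)hw_blend_op(blend->colorBlendOp) << 10) |
      ((uint32_t)hw_blend_factor(blend->srcAlphaBlendFactor) << 13) |
      ((uint32_t)hw_blend_factor(blend->dstAlphaBlendFactor) << 18) |
      ((uint32_t)hw_blend_op(blend->alphaBlendOp) << 23);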

However, the conceptual point here is that I'm not trying to make small
improvements over some "default" behavior. There's no "default" here.
I'm not supposed to implement static on top of dynamic in the first place;
as I said, they are not merely different concepts, but opposite ones.

Of course there are always exceptions on a case-by-case, driver-by-driver
basis. For instance, due to the constraints of the memory and binding
architectures of my target hardware, the kernel driver and the microcode,
it's more optimal for my driver to record secondary command buffers on the
level of Vulkan commands using Mesa's common encoder. But this kind of
cherry-picking aligns much more closely with the "opt-in" approach than the
"watertight" one.


On the topic of limits, I also think the best we can do is to actually be
honest about the specific driver and the specific hardware, and to view
limits from the perspective of enabling richer communication with the
application. Blatantly lying (at least without an environment variable
switch), like in the scary old days when, as I heard, some drivers
resorted to becoming LLVMpipe upon encountering a repeating NPOT texture,
is definitely not contributing to productive communication. But at the
same time, if AMD sees how apps can take advantage of the explicit cubemap
3D>2D transformation from VK_AMD_gcn_shader, or there are potential
scenarios for something like VK_ARM_render_pass_striped… then why can't I
just tell the app that spreading descriptors across sets more granularly
costs my driver nothing, and report maxBoundDescriptorSets = UINT32_MAX,
or at least an integer-overflow-safer
(maxSamplers + maxUB + maxSB + maxSampled + maxSI) * 6 + maxIA, giving it
one more close-to-metal tool for cases where it may be useful?
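
To spell out where that bound comes from (a sketch; the max* shorthands
stand for this driver's per-stage descriptor limits,
maxPerStageDescriptorSamplers and friends):

  /* Every usefully bound set contains at least one descriptor, and a
   * pipeline can access at most the per-stage descriptor limits times 6
   * shader stages, plus input attachments, which exist only in the
   * fragment stage; so no pipeline layout needs more sets than this. */
  uint32_t max_bound_descriptor_sets =
      (maxSamplers + maxUB + maxSB + maxSampled + maxSI) * 6 + maxIA;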


----
P.S.: So far, the list of architectural concepts I'm not willing to
sacrifice in Terakan, whose loss I'd consider a major regression,
includes:
  • Pipeline objects with pretranslated fixed-function state, as well as
    everything needed to enable that (like storing the current state in a
    close-to-hardware representation, which may require custom vkCmdSet*
    implementations).
  • Pipeline layout objects where already available, most importantly in
    vkCmdBindDescriptorSets and vkCmdPushDescriptorSetKHR.
  • maxBoundDescriptorSets, maxPushDescriptors, maxPushConstantsSize
    significantly higher than on hardware with root-signature-based binding.
  • Inside a VkCommandPool, separate pooling of entities allocated at
    different frequencies: hardware command buffer portions, command
    encoders (containing things like the current state, which is pretty
    large due to fixed binding slots, and relocation hash maps), and BOs
    with push constants and dynamic vertex fetch subroutines.

— Triang3l

