Future direction of the Mesa Vulkan runtime (or "should we build a new gallium?")

Thu Jan 25 19:09:46 UTC 2024

On 24/01/2024 18:26, Faith Ekstrand wrote:

 > So far, we've been trying to build those components in terms of the
 > Vulkan API itself with calls jumping back into the dispatch table to
 > try and get inside the driver.

To me, it looks like the "opt-in" approach would still be well suited to
the goal of cleaning up "implementing Vulkan in Vulkan", and gradual
changes diverging from the usual Vulkan specification behavior can be
implemented and maintained in existing and new drivers more efficiently
than a whole new programming model.

I think it's important that the scale of our solution be appropriate to
the scale of the problem, otherwise we risk creating large issues in
other areas. Currently there are only a few places where Mesa implements
Vulkan on top of Vulkan:
  • WSI,
  • Emulated render passes,
  • Emulated secondary command buffers,
  • Meta.

For WSI, render passes and secondary command buffers, I don't think there's
anything that needs to be done, as those already have little to no driver
backend involvement or interference with the application's calls: render
pass and secondary command buffer emulation interacts with the hardware
driver entirely within the framework of the Vulkan specification, only
storing a few fields in vk_command_buffer which are already handled fully
in common code.

Common meta, on the other hand, yes, is extremely intrusive: it overrides
the application's pipeline state and bindings, and passes shaders directly
in NIR, bypassing SPIR-V.

But with meta being such a different beast, I think we shouldn't even be
trying to tame it with the same interfaces as everything else. If we're
going to handle meta's special cases throughout our common "Gallium2"
framework, it feels like we'll simply be turning our "Vulkan on Vulkan"
issue into the problem of "implementing Gallium2 on Gallium2".

Instead, I think the cleanest solution in the common meta would be sending
commands to the driver through a separate callback interface specifically
for meta, rather than trying to make meta mimic application code. That
would allow drivers to clearly negotiate the details of applying/reverting
state changes and of shader compilation, while letting their developers
assume that everything else is written for the most part purely against
the Vulkan specification.
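
As a rough illustration of the idea above, here is a minimal sketch of a
meta-only callback table and one meta operation written against it. Every
name in it (meta_cmd_ops, toy_cmd_buffer, meta_fill_buffer, the toy_*
functions) is hypothetical and does not exist in Mesa, the "driver" is a toy
that only records what happened, and the 64 bytes per workgroup figure is an
arbitrary assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: none of these names exist in Mesa. */

/* Toy stand-in for a driver command buffer. */
struct toy_cmd_buffer {
    int bound_pipeline;   /* whatever is currently bound */
    int saved_pipeline;   /* slot used by save/restore */
    unsigned dispatches;  /* number of recorded dispatches */
    unsigned last_groups; /* workgroup count of the last dispatch */
};

/* Meta-only callbacks, kept separate from the Vulkan entry points. */
struct meta_cmd_ops {
    void (*save_state)(struct toy_cmd_buffer *cmd);
    void (*bind_compute_nir)(struct toy_cmd_buffer *cmd, const void *nir);
    void (*dispatch)(struct toy_cmd_buffer *cmd, unsigned groups);
    void (*restore_state)(struct toy_cmd_buffer *cmd);
};

#define META_PIPELINE (-1) /* marker for "a meta shader is bound" */

static void toy_save_state(struct toy_cmd_buffer *cmd)
{
    cmd->saved_pipeline = cmd->bound_pipeline;
}

static void toy_bind_compute_nir(struct toy_cmd_buffer *cmd, const void *nir)
{
    (void)nir; /* a real driver would compile or look up the shader here */
    cmd->bound_pipeline = META_PIPELINE;
}

static void toy_dispatch(struct toy_cmd_buffer *cmd, unsigned groups)
{
    cmd->dispatches++;
    cmd->last_groups = groups;
}

static void toy_restore_state(struct toy_cmd_buffer *cmd)
{
    cmd->bound_pipeline = cmd->saved_pipeline;
}

static const struct meta_cmd_ops toy_ops = {
    .save_state = toy_save_state,
    .bind_compute_nir = toy_bind_compute_nir,
    .dispatch = toy_dispatch,
    .restore_state = toy_restore_state,
};

/* Common meta code: it reaches the driver only through the callbacks and
 * never pretends to be the application. */
static void
meta_fill_buffer(const struct meta_cmd_ops *ops, struct toy_cmd_buffer *cmd,
                 const void *fill_nir, unsigned size_bytes)
{
    assert(ops != NULL && cmd != NULL);
    ops->save_state(cmd);
    ops->bind_compute_nir(cmd, fill_nir);
    /* 64 bytes per workgroup is an arbitrary choice for this sketch. */
    ops->dispatch(cmd, (size_bytes + 63) / 64);
    ops->restore_state(cmd);
}
```

Calling meta_fill_buffer(&toy_ops, &cmd, nir, size) on a command buffer
records one dispatch and leaves the application's binding as it was, with
how the state gets saved and restored left entirely up to the driver.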

It would still be okay for meta to call vkGetPhysicalDevice* and
vkCreate*/vkDestroy*, as long as that's done within the rules of the
Vulkan specification, to require certain extensions, and to do some
less intrusive, non-hot-path interaction with the driver's internals
directly, such as requiring that every VkImage is a vk_image and pulling
the needed create info fields from there. However, everything interacting
with the state/bindings, as well as things going beyond the specification
like creating image views with incompatible formats, would go through
those new callbacks.

NVK-style drivers would be able to share a common implementation of those
callbacks. Drivers that want to take advantage of more direct-to-hardware
paths would need to provide what's friendly to them (maybe even with
lighter handling of compute-based meta operations compared to graphics
ones). That probably wouldn't be a single flat list of callbacks, but
several of them: for example, a driver could use the common command buffer
callbacks but specialize some view/descriptor-related ones (it may not be
possible to make those common at all, by the way). And if a driver doesn't
need the common meta at all, none of that would bother it.
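
To make the "several lists of callbacks" part more concrete, here is a
minimal sketch of how the groups could be mixed per driver. All the names
(meta_callbacks, meta_cmd_callbacks, meta_view_callbacks, the two driver
tables) are hypothetical, nothing here is an existing Mesa interface, and
the callbacks only report which implementation ran instead of doing real
work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch: none of these names exist in Mesa. */

enum meta_impl { META_IMPL_COMMON = 1, META_IMPL_DRIVER = 2 };

/* Group 1: command buffer callbacks, which could be shared. */
struct meta_cmd_callbacks {
    enum meta_impl (*dispatch)(unsigned groups);
};

/* Group 2: view/descriptor callbacks, likely always driver-specific. */
struct meta_view_callbacks {
    enum meta_impl (*create_view)(int format);
};

/* What a driver hands to the common meta code. */
struct meta_callbacks {
    const struct meta_cmd_callbacks *cmd;
    const struct meta_view_callbacks *view;
};

/* Common implementation of the command buffer group. */
static enum meta_impl common_dispatch(unsigned groups)
{
    (void)groups;
    return META_IMPL_COMMON;
}
static const struct meta_cmd_callbacks common_cmd = {
    .dispatch = common_dispatch,
};

/* Driver-provided implementations. */
static enum meta_impl drv_dispatch(unsigned groups)
{
    (void)groups;
    return META_IMPL_DRIVER;
}
static enum meta_impl drv_create_view(int format)
{
    (void)format;
    return META_IMPL_DRIVER;
}
static const struct meta_cmd_callbacks drv_cmd = { .dispatch = drv_dispatch };
static const struct meta_view_callbacks drv_view = {
    .create_view = drv_create_view,
};

/* A driver reusing the common command buffer group. */
static const struct meta_callbacks shared_style = {
    .cmd = &common_cmd, .view = &drv_view,
};
/* A driver with its own direct-to-hardware command buffer path. */
static const struct meta_callbacks direct_style = {
    .cmd = &drv_cmd, .view = &drv_view,
};

/* A driver that doesn't use the common meta simply provides nothing. */
static bool meta_supported(const struct meta_callbacks *cb)
{
    return cb != NULL && cb->cmd != NULL && cb->view != NULL;
}

/* A meta operation; returns which dispatch implementation was used. */
static enum meta_impl meta_clear(const struct meta_callbacks *cb)
{
    assert(meta_supported(cb));
    cb->view->create_view(0);
    return cb->cmd->dispatch(1);
}
```

With these tables, meta_clear(&shared_style) goes through the common
dispatch and meta_clear(&direct_style) through the driver's own, while both
use the driver's view callback.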

The other advantages I see in this separate meta API approach are:
  • In the rest of the code, driver developers in most cases will need to
    refer to only a single authority, the massively detailed Vulkan
    specification, whereas rolling our own interface for everything
    carries risks:
    • Driver developers will have to spend more time carefully looking up
      what they need to do in two places rather than largely just one.
    • We're much more prone to leaving gaps in our interface and to writing
      incomplete documentation. I can't see this effort not being rushed,
      with us having to catch up on 10 years of XGL/Vulkan development
      while moving many drivers alongside working on other tasks, and with
      varying levels of enthusiasm of driver developers towards this.
      Unless zmike's 10-year estimate is our actual target 🤷
    • Having to deal with a new large-scale API may raise the barrier for
      new contributors and discourage them.
      Unlike with OpenGL and all its resource renaming stuff, the
      experience I got from developing applications on Vulkan was enough
      for me to start comfortably implementing it, shader compilation
      aside. When zmike showed me an R600g issue about some relation of
      vertex buffer bindings and CSOs, I just didn't have anything useful
      to say.
  • Faster iteration inside the common meta code, with the meta interface
    not having to take the demands of regular draws into account as much.
    And vice versa, of course, especially when it comes to implementing new
    extensions, many of which, with Gallium2, would still need handling in
    every driver, and on top of that in the Gallium2 interface itself.
  • Breaking changes to the meta-specific interface would only require
    adjusting meta handling in affected drivers.
    Breaking changes to something used by everyone across a vast code
    surface… Maybe you, Faith, are already well used to doing them, but
    that's still a very special kind of fun 😜

— Triang3l