[Mesa-dev] NLNet Funded development of a software/hardware MESA driver for the Libre GPGPU

Thu Jan 9 05:18:35 UTC 2020

On Wed, Jan 8, 2020 at 7:55 PM Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

>
>
> On Thursday, January 9, 2020, Jason Ekstrand <jason at jlekstrand.net> wrote:
>
>> Drive-by comment:
>>
>
> really appreciate the feedback.
>
>
>>  I don't think you actually want to base any decisions on a vec4
>> architecture. Nearly every company in the graphics industry thought that
>> was a good idea and designed vec4 processors. Over the course of the last
>> 15 years or so they have all, one by one, realized that it was a bad idea
>> and stopped doing it. Instead, they all parallelize the other way and their
>> SIMD instructions act on scalar values across 8, 16, 32, or 64 invocations
>> (vertices, pixels, etc.) of the shader program.
>>
>
> for simplicity (not outlining nearly 18 months of Vector Architecture ISA
> development) i missed out that we have designed a vector-plus-subvector
> architecture, where the subvectors may be of any length between 1 and 4,
> and there is an additional vector loop around that which may be from length
> 1 (scalar) to 64
>
> individual predicate mask bits may be applied to the vector; however, they
> may only be applied (one bit) per subvector.
>
> discussions have been ongoing for around 2 years now on the LLVM dev lists
> to support this type of concept.
>
> on the libre soc lists we have also had detailed discussions on how to do
> swizzles at the subvector level.
>
> AMDGPU, NVIDIA, MALI, they all support these capabilities.
>
> to be clear: we have *not* designed an architecture or an ISA which
> critically and exclusively depends on vec4 and vec4 alone.
>
> now, whether it is a bad idea or not to have vec2, vec3 and vec4
> capability, the way that i see it is that Vulkan supports them, as does the
> SPIRV compiler in e.g. AMDVLK, and we would be asking for trouble
> (performance penalties, compiler complexity due to having to add predicated
> autovectorisation) if we did not support them.
>

Now that I'm at a keyboard and not a phone, I can provide a more thorough
explanation.  Hopefully that will be helpful.  First of all, let's start
with what SPIR-V supporting vec2/3/4 means.  It is true that a vec3 is a
native SPIR-V type; it is also a native type in GLSL and HLSL.  Those
languages also support matrices, write-masking of results, and swizzling on
sources.  This stuff is all very useful when writing graphics shaders
because 70% of what you do in graphics is some sort of vector math.  Having
constructs built directly into the language is really nice.  When it comes
to SPIR-V specifically, you should think of it much more like a high-level
language than like an IR; it's intentionally designed to lose as little
high-level information as possible.  Most of the early shader architectures
also had vec2/3/4 as native data types and had swizzling and write-masking
as core concepts; partly because it seemed like a good idea and partly
because that's the way Microsoft's D3D9 bytecode format worked.  The reason
why NIR supports vec4 is because there is a lot of hardware out there
(including Intel from 5-6 years ago) which needs vec4 and some of that
hardware is still very actively supported in Mesa.

On all modern architectures I'm familiar with (this includes Intel, NVIDIA,
AMD, ARM, Imagination, and Qualcomm), everything is scalarized so a vec3 +
vec3 add operation is turned into three scalar add operations.  In our
NIR-based compilers, the scalarization usually happens almost immediately
once you're in NIR.  We then run up to 64 invocations (vertex, pixel, etc.)
at a time with wide hardware instructions.  When control-flow diverges
(some invocations go one way and some another), both paths are executed and
predication is used to disable the SIMD lanes for the invocations that took
the other path.  On Intel and several other architectures, this happens
fairly automatically.  On AMD, their management of predicates is much more
manual.
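
To make the paragraph above concrete, here is a minimal Python sketch of the idea, not of any real driver or ISA. It assumes a hypothetical 8-wide machine where each list holds one scalar per invocation; the function names (simd_add, vec3_add_scalarized, shader_branch) are made up for illustration. It shows a vec3 add turned into three wide scalar adds, and a divergent if/else where both sides are executed and a per-lane mask picks the result.

```python
# Hypothetical model: one Python list = one wide register,
# holding one scalar per shader invocation (lane).
WIDTH = 8


def simd_add(a, b):
    """One wide instruction: a scalar add in every lane."""
    return [x + y for x, y in zip(a, b)]


def vec3_add_scalarized(a, b):
    """vec3 + vec3 becomes three wide scalar adds, one per component.

    a and b are tuples of three registers (x, y, z), each WIDTH lanes wide.
    """
    return tuple(simd_add(a[c], b[c]) for c in range(3))


def shader_branch(x):
    """Models `if (x < 0) y = -x; else y = x * 2;` under predication.

    Both paths are computed for the whole register; the mask decides
    which lanes keep which result.
    """
    mask = [v < 0 for v in x]            # per-lane predicate
    then_result = [-v for v in x]        # "then" path, useful only where mask is set
    else_result = [v * 2 for v in x]     # "else" path, useful only where mask is clear
    return [t if m else e for m, t, e in zip(mask, then_result, else_result)]
```

Note that the cost of shader_branch is the sum of both paths whenever the lanes disagree, which is the utilization issue discussed further down.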

What about vec4?  As I've said a couple of times, basically everyone in the
industry (at least Intel, NVIDIA, AMD, ARM, Imagination, and Qualcomm) has
done it at some point in the past.  Let's take Intel as a concrete
example.  It's a good example because a) I'm familiar with it, b) it's all
publicly documented and c) they did scalar and vec4 at the same time in the
same ISA so it's really easy to look at the trade-offs.  On Intel,
everything runs 8-wide (I'm simplifying a bit but it's good enough for this
discussion).  Older Intel hardware could run in one of two modes depending
on shader stage: SIMD8 or SIMD4x2.  In SIMD8 mode, each of the 8 lanes
corresponds to a different shader invocation and each instruction acts on 8
scalars, one from each invocation.  In SIMD4x2 mode, it runs 2 invocations
with 4 lanes per invocation.  The ISA has swizzles and write-masks so those
4 lanes can operate on an entire vec4 at a time.  There were even fancy
cross-lane opcodes for things like dot products.  For a lot of simple
operations, the SIMD4x2 mode was really slick.  If, for instance, you want
to multiply a vec4 by a mat4x4, it's just 4 dot instructions or a MUL and 3
MAD depending on which way you're doing the multiply.
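
As a rough illustration of the two instruction sequences mentioned above, here is a small Python sketch. It is one reading of "which way you're doing the multiply": the same product computed either from the matrix rows (four dot products) or from the matrix columns (one MUL followed by three MADs). The helper names are hypothetical and each list stands in for a 4-lane vec4 register.

```python
def dot4(a, b):
    """Cross-lane dot product of two vec4 values."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3]


def mat_times_vec_dots(rows, v):
    """M * v with M given as four rows: four dot instructions."""
    return [dot4(r, v) for r in rows]


def mat_times_vec_mul_mad(cols, v):
    """M * v with M given as four columns: one MUL, then three MADs.

    Each step scales a whole column by one component of v, so all
    four lanes do useful work on every instruction.
    """
    acc = [c * v[0] for c in cols[0]]                        # MUL
    for i in range(1, 4):
        acc = [c * v[i] + a for c, a in zip(cols[i], acc)]   # MAD
    return acc
```

Both functions return the same vec4 when cols is the transpose of rows.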

So why did everyone leave vec4?  It all comes down to ALU utilization.
Because of the way that predication is used to deal with divergent
control-flow, you can end up executing more instructions than any single
invocation needs.  As a worst-case example, suppose your shader starts with
a big switch statement where each case is some non-trivial piece of code.
Further suppose that you have some shader thread where each invocation
takes a different case in the switch.  Thanks to the way divergent
control-flow has to be handled, you would end up effectively executing each
invocation separately and nothing would be parallelized.  Depending on how
well your compiler handles divergence and re-convergence and depending on
the workload, the worst case is that you end up spending Nx the cycles
where N is the number of invocations you run at once.  For Intel, that
means as much as 8x in SIMD8 mode and as much as 2x in SIMD4x2 mode.  It
looks like SIMD4x2 is winning, right?  Well, vec4 has its own utilization
problems.  Even though the source language natively works in vec4, in
practice shaders often end up doing a lot of scalar or vec2 calculations.
Every time you do a scalar calculation, you have 2 lanes which are doing
work and 6 which are idle (using Intel SIMD4x2 as the example here). Yes,
you can write a vectorizer to try and pack stuff together better but it
turns out that's much harder than you'd think.  It's also harder to
optimize in vec4 in general so those shaders are likely to use less
efficient math.  I don't have hard numbers here but, if I had to give you a
gut feeling from experience, I'd say that, for complex shaders (doing more
than a vec4 x mat4x4), it's hard to get an average utilization better than
50-70%.   So why did everyone leave vec4?  Because, even though it's a bit
counter-intuitive, divergence makes for less of a HW utilization problem
than vec4.
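
The arithmetic behind the two numbers above can be sketched as follows. This is a back-of-the-envelope model in Python, assuming the 8-lane Intel example from this mail; the function names and the instruction mix in the usage note are invented for illustration and are not measurements.

```python
LANES = 8  # total lanes per instruction in the Intel example


def worst_case_divergence_factor(invocations_per_thread):
    """If every invocation takes a different path, each path runs alone,
    so the thread spends up to N times the cycles of one invocation."""
    return invocations_per_thread


def simd4x2_utilization(components):
    """Fraction of lanes doing useful work in SIMD4x2 mode
    (2 invocations x 4 lanes) for an op on a 1- to 4-component value."""
    if not 1 <= components <= 4:
        raise ValueError("components must be between 1 and 4")
    return (2 * components) / LANES


def average_utilization(mix):
    """Average SIMD4x2 utilization for a mix {components: instruction count}."""
    total = sum(mix.values())
    return sum(simd4x2_utilization(c) * n for c, n in mix.items()) / total
```

For example, worst_case_divergence_factor gives 8 for SIMD8 and 2 for SIMD4x2, while simd4x2_utilization(1) is 0.25 (2 busy lanes, 6 idle). A made-up mix of 3 scalar, 2 vec2, and 5 vec4 instructions, {1: 3, 2: 2, 4: 5}, averages out to 0.675, and that cost is paid on every such instruction rather than only when control flow diverges.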

Hopefully that makes the trade-off make more sense.  One other important
factor is that, even if vec4 could, in theory, be more efficient, it's way
easier to write a compiler if you scalarize everything.

> however, there are two nice things:
>
> 1. we are at an early phase, therefore we *can* evaluate valuable
> "headsup" warnings such as the one you give, jason (so thank you)
>

Hooray!

> 2. as a flexible Vector Processor, soft-programmable, then over time if
> the industry moves to dropping vec4, so can we.
>

That's very nice.  My primary reason for sending the first e-mail was that
SwiftShader vs. Mesa is a pretty big decision that's hard to reverse after
someone has poured several months into working on a driver and the argument
you gave in favor of Mesa was that it supports vec4.  If you assume vec4
isn't an issue (which I'm arguing you should), maybe you would make a
different decision.  Personally, I'm still a fan of Mesa. :-)  I also won't
hold it against anyone if they like SwiftShader; it's a neat project too.

--Jason