[Mesa-dev] [RFC PATCH 00/16] A new IR for Mesa

Connor Abbott cwabbott0 at gmail.com
Wed Aug 20 10:29:47 PDT 2014


On Wed, Aug 20, 2014 at 12:11 PM, Francisco Jerez <currojerez at riseup.net> wrote:
> Connor Abbott <cwabbott0 at gmail.com> writes:
>
>> On Wed, Aug 20, 2014 at 7:01 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>>> Connor Abbott <cwabbott0 at gmail.com> writes:
>>>
>>>> On Tue, Aug 19, 2014 at 11:33 PM, Francisco Jerez <currojerez at riseup.net> wrote:
>>>>> Connor Abbott <cwabbott0 at gmail.com> writes:
>>>>>
>>>>>> On Tue, Aug 19, 2014 at 11:40 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>>>>>>> Tom Stellard <tom at stellard.net> writes:
>>>>>>>
>>>>>>>> On Tue, Aug 19, 2014 at 11:04:59AM -0400, Connor Abbott wrote:
>>>>>>>>> On Mon, Aug 18, 2014 at 8:52 PM, Michel Dänzer <michel at daenzer.net> wrote:
>>>>>>>>> > On 19.08.2014 01:28, Connor Abbott wrote:
>>>>>>>>> >> On Mon, Aug 18, 2014 at 4:32 AM, Michel Dänzer <michel at daenzer.net> wrote:
>>>>>>>>> >>> On 16.08.2014 09:12, Connor Abbott wrote:
>>>>>>>>> >>>> I know what you might be thinking right now. "Wait, *another* IR? Don't
>>>>>>>>> >>>> we already have like 5 of those, not counting all the driver-specific
>>>>>>>>> >>>> ones? Isn't this stuff complicated enough already?" Well, there are some
>>>>>>>>> >>>> pretty good reasons to start afresh (again...). In the years we've been
>>>>>>>>> >>>> using GLSL IR, we've come to realize that, in fact, it's not what we
>>>>>>>>> >>>> want *at all* to do optimizations on.
>>>>>>>>> >>>
>>>>>>>>> >>> Did you evaluate using LLVM IR instead of inventing yet another one?
>>>>>>>>> >>>
>>>>>>>>> >>
>>>>>>>>> >> Yes. See
>>>>>>>>> >>
>>>>>>>>> >> http://lists.freedesktop.org/archives/mesa-dev/2014-February/053502.html
>>>>>>>>> >>
>>>>>>>>> >> and
>>>>>>>>> >>
>>>>>>>>> >> http://lists.freedesktop.org/archives/mesa-dev/2014-February/053522.html
>>>>>>>>> >
>>>>>>>>> > I know Ian can't deal with LLVM for some reason. I was wondering if
>>>>>>>>> > *you* evaluated it, and if so, why you rejected it.
>>>>>>>>> >
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Well, first of all, the fact that Ian and Ken don't want to use it
>>>>>>>>> means that any plan to use LLVM for the Intel driver is dead in the
>>>>>>>>> water anyway. You can translate NIR into LLVM if you want, but for
>>>>>>>>> i965 we want to share optimizations between our two backends (FS and
>>>>>>>>> vec4) that we can't do today in GLSL IR, so this is what we want to
>>>>>>>>> use for that. And since nobody else does anything with the core GLSL
>>>>>>>>> compiler except when they have to, when we start moving things out of
>>>>>>>>> GLSL IR this will probably replace GLSL IR as the infrastructure that
>>>>>>>>> all Mesa drivers use. With that in mind, here are a few reasons why
>>>>>>>>> we wouldn't want to use LLVM:
>>>>>>>>>
>>>>>>>>> * LLVM wasn't built to understand structured CFGs, meaning that you
>>>>>>>>> need to re-structurize the CFG using a pass that's fragile and prone
>>>>>>>>> to break if some other pass "optimizes" the shader in a way that
>>>>>>>>> makes it non-structured (i.e. not expressible in terms of loops and
>>>>>>>>> if statements). This loss of information also means that passes that
>>>>>>>>> need to know things like the loop nesting depth have to do an
>>>>>>>>> analysis pass, whereas with NIR you can just walk up the control
>>>>>>>>> flow tree and count the number of loops you hit.
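>>>>>>>>>
>>>>>>>>> For instance, a block's loop nesting depth is just a walk up the
>>>>>>>>> parent pointers - here's a sketch against the API in this series
>>>>>>>>> (the cf_node/parent/type names follow the RFC patches):
>>>>>>>>>
>>>>>>>>> static unsigned
>>>>>>>>> block_loop_depth(nir_block *block)
>>>>>>>>> {
>>>>>>>>>    unsigned depth = 0;
>>>>>>>>>
>>>>>>>>>    /* walk up the control flow tree; each loop node we pass
>>>>>>>>>     * through is one level of nesting */
>>>>>>>>>    for (nir_cf_node *node = block->cf_node.parent; node != NULL;
>>>>>>>>>         node = node->parent) {
>>>>>>>>>       if (node->type == nir_cf_node_loop)
>>>>>>>>>          depth++;
>>>>>>>>>    }
>>>>>>>>>
>>>>>>>>>    return depth;
>>>>>>>>> }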
>>>>>>>>>
>>>>>>>>
>>>>>>>> LLVM has a pass to structurize the CFG.  We use it in the radeon
>>>>>>>> drivers, and it is run after all of the other LLVM optimizations which have
>>>>>>>> no concept of structured CFG.  It's not bug free, but it works really
>>>>>>>> well even with all of the complex OpenCL kernels we throw at it.
>>>>>>>>
>>>>>>>> Your point about losing information when the CFG is de-structurized is
>>>>>>>> valid, but for things like loop depth, I'm not sure why we couldn't write an
>>>>>>>> LLVM analysis pass for this (if one doesn't already exist).
>>>>>>>>
>>>>>>>
>>>>>>> I don't think this is such a big deal either.  At least the
>>>>>>> structurization pass used on newer AMD hardware isn't "fragile" in the
>>>>>>> way you seem to imply -- AFAIK (unlike the old AMDIL heuristic
>>>>>>> algorithm) it's guaranteed to give you a valid structurized output no
>>>>>>> matter what the previous optimization passes have done to the CFG,
>>>>>>> modulo bugs.  I admit that the situation is nevertheless suboptimal.
>>>>>>> Ideally this information wouldn't get lost along the way.  For the long
>>>>>>> term we may want to represent structured control flow directly in the IR
>>>>>>> as you say; I just don't see how reinventing the IR saves us any work if
>>>>>>> we could just fix the existing one.
>>>>>>
>>>>>> It seems to me that how we represent control flow is a pretty
>>>>>> fundamental part of the IR - it affects any optimization pass that
>>>>>> needs to do anything beyond adding and removing instructions. How
>>>>>> would you fix that, especially given that LLVM is primarily designed
>>>>>> for CPUs, where you don't want to be restricted to structured control
>>>>>> flow at all? It seems like our goals (preserve the structure) conflict
>>>>>> with the way LLVM has been designed.
>>>>>>
>>>>> I think we can fix this by introducing new structured variants of the
>>>>> branch instruction in a way that doesn't alter the fundamental structure
>>>>> of the IR.  E.g. an if branch could look like:
>>>>>
>>>>> ifbr i1 <cond>, label <iftrue>, label <iffalse>, label <join>
>>>>>
>>>>> Where both branches are guaranteed to converge at <join>.  Sure, this
>>>>> will require fixing many assumptions, but on the one hand it's not
>>>>> immediately required (as we can address this problem for the time being
>>>>> using the same solution AMD uses) and on the other hand it's still less
>>>>> work than starting from scratch.
>>>>
>>>> I disagree with the "less work than starting from scratch" part,
>>>> especially since it involves modifying LLVM in a pretty invasive way,
>>>> when we won't even need half of the things that it does for us. LLVM
>>>> just isn't a solution to everything - there is no one-size-fits-all
>>>> compiler.
>>>>
>>>
>>> *Shrug* That's quite a strong statement.  Honestly I haven't ruled out
>>> the possibility of coming up with a decent IR by ourselves yet, but at
>>> this point I feel like improving the LLVM framework to make it more
>>> suitable for GPUs would be a much more promising use of my time than
>>> working on NIR -- even if starting from scratch sounds like a lot more
>>> fun.
>>>
>>>>>
>>>>>>>
>>>>>>>>> * LLVM doesn't do modifiers, meaning that we can't do optimizations
>>>>>>>>> like "clamp(x, 0.0, 1.0) => mov.sat x" and "clamp(x, 0.25, 1.0) =>
>>>>>>>>> max.sat(x, .25)" in a generic fashion.
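>>>>>>>>>
>>>>>>>>> A sketch of the first fold, using the saturate flag on ALU
>>>>>>>>> destinations proposed in this series (the alu_src_is_const() and
>>>>>>>>> alu_src_as_alu() helpers are hypothetical):
>>>>>>>>>
>>>>>>>>> /* fmin(fmax(x, 0.0), 1.0) -> fmov.sat x */
>>>>>>>>> static bool
>>>>>>>>> fold_clamp_to_sat(nir_alu_instr *min)
>>>>>>>>> {
>>>>>>>>>    if (min->op != nir_op_fmin || !alu_src_is_const(min, 1, 1.0f))
>>>>>>>>>       return false;
>>>>>>>>>
>>>>>>>>>    nir_alu_instr *max = alu_src_as_alu(min, 0);
>>>>>>>>>    if (max == NULL || max->op != nir_op_fmax ||
>>>>>>>>>        !alu_src_is_const(max, 1, 0.0f))
>>>>>>>>>       return false;
>>>>>>>>>
>>>>>>>>>    /* rewrite the fmin as a saturated move of x (a real pass would
>>>>>>>>>     * also check that the fmax result has no modifiers or other
>>>>>>>>>     * uses) */
>>>>>>>>>    min->op = nir_op_fmov;
>>>>>>>>>    min->src[0] = max->src[0];
>>>>>>>>>    min->dest.saturate = true;
>>>>>>>>>    return true;
>>>>>>>>> }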
>>>>>>>>>
>>>>>>>>
>>>>>>>> The way to handle this with LLVM would be to add intrinsics to represent
>>>>>>>> the various modifiers and then fold them into instructions during
>>>>>>>> instruction selection.
>>>>>>>>
>>>>>>>
>>>>>>> IMHO this is a feature.  One of the things I don't like about NIR is
>>>>>>> that it's still vec4-centric.  Most drivers are going to want something
>>>>>>> else, and something different from each other; we cannot please all of
>>>>>>> them with one single vector addressing model built into the core
>>>>>>> instruction set, so I'd rather have modifiers, writemasks and swizzles
>>>>>>> represented as the composition of separate instructions/intrinsics with
>>>>>>> simple and well-defined semantics, which can be coalesced back into the
>>>>>>> real instruction as Tom says (easy even if you don't use LLVM's
>>>>>>> instruction selector, as long as it's in SSA form).
>>>>>>
>>>>>> While NIR is vec4-centric, nothing's stopping you from splitting up
>>>>>> instructions and doing optimizations at the scalar level for scalar
>>>>>> ISAs - in fact, that's what I expect to happen. And for backends that
>>>>>> really do need to have swizzles and writemasks, coalescing these
>>>>>> things back into the original instruction is not at all trivial.
>>>>>
>>>>> It's a simple peephole optimization AFAICT:
>>>>>
>>>>> val2 = alu-op(modifier(val1)) -> hardware-specific-extended-alu-op(val1)
>>>>> val3 = shuffle(val2, alu-op(val1)) -> hardware-specific-alu-op-with-writemask(val2, val1)
>>>>
>>>> No, it's not. Imagine something like:
>>>>
>>>> vec4 foo = ...
>>>> vec4 bar = ...
>>>> vec4 baz = vec4(foo.xy, bar.zw)
>>>> ... = foo
>>>> ... = bar
>>>> ... = baz
>>>>
>>>> where the vec4() is the shuffle instruction. In this case, you can't
>>>> eliminate the shuffle - you need to insert writemasked moves when you
>>>> come out of SSA:
>>>>
>>>> vec4 foo = ...
>>>> vec4 bar = ...
>>>> baz.xy = foo.xy
>>>> baz.zw = bar.zw
>>>>
>>>> This basically comes down to something analogous to a register
>>>> allocation problem, where in this case the scalar components that we
>>>> want to put into a single vec4 (foo, bar, and baz) can't fit - we need
>>>> to "spill" by inserting copies. Then, once we've done this, we have to
>>>> convert it into a non-SSA form with registers, writemasks, and
>>>> swizzles - something that would be easy to do in the IR -> backend
>>>> translation if it really were just a simple peephole. But in this
>>>> case it's not, so you either have to consult the result of your
>>>> analysis during the translation, or have an IR that can represent
>>>> swizzles, writemasks, and non-SSA registers for you, like NIR does.
>>>> Of course, LLVM will help with none of this because its vectorization
>>>> model is built around CPU vector processors like SSE, NEON, etc., and
>>>> so AFAIK it has no concept of per-component liveness; even if it did,
>>>> this stuff is intimately tied to the out-of-SSA process itself, so we
>>>> would basically have to write it from scratch anyways.
>>>>
>>>
>>> I think you keep mixing two unrelated problems:
>>> 1/ How we represent vector addressing, writemasks and modifiers in the
>>>    core IR.
>>> 2/ How we bring vector operations back into non-SSA form.
>>>
>>> Re 1 you propose making the vec4 model a central part of the IR rather
>>> than using composition of simpler operations.  Whatever we do, going
>>> from one representation to the other is a simple peephole, which I never
>>> meant would be a solution for 2.
>>>
>>> Re 2 I agree with you that it would ideally be taken care of by a shared
>>> transformation pass because of its complexity, but I disagree that a
>>> vec4-centric IR is required for this purpose, or even especially useful,
>>> because different hardware has wildly different vector models with
>>> different constraints and requires a different representation, so I
>>> think ideally we would have some mechanism for back-ends to provide
>>> their own representation in the form of machine-specific instructions
>>> accompanied with some machine-specific logic.
>>
>> I don't see why it's necessarily a bad idea to support the most
>> flexible vector addressing model and then have backends that don't
>> support it lower it to something they do support, or do their own
>> transformation pass instead of the standard one, which will lower to
>> the normal model (full swizzling and writemasking).
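>>
>> (As a sketch of what such a lowering could look like for a scalar
>> ISA, assuming the RFC's nir_alu_instr layout with a per-source
>> swizzle[], and where emit_scalar_alu() is a hypothetical helper that
>> emits a 1-component copy of the op reading component
>> src[j].swizzle[chan] of each source j:
>>
>> static void
>> lower_alu_to_scalar(nir_alu_instr *vec)
>> {
>>    /* one scalar instruction per channel the vec4 op writes */
>>    for (unsigned chan = 0;
>>         chan < vec->dest.dest.ssa.num_components; chan++)
>>       emit_scalar_alu(vec, chan);
>> }
>>
>> with the pass afterwards rewriting uses of the vec4 destination.)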
>
> Vec4 is by no means the most flexible model (it's just flexible enough
> to be annoying to deal with IMHO).  Just look at Intel's SIMD4x2
> register addressing modes: you can do dozens of tricks with them that
> you cannot represent in terms of vec4 (using align1 vs align16 access
> modes, differing horizontal and vertical strides, three different modes
> of indirect addressing, bit-casting across components, etc.) -- It would
> be crazy IMHO to design the core IR around that (or around some sort of
> lowest common denominator of all the vector addressing models out there)
> as you can always express the same semantics as a combination of simpler
> blocks, and the backend needs some serious pattern-matching
> infrastructure anyway to make full use of the hardware flexibility.

Yes, I am well aware of SIMD4x2 - I did work at Intel for 2 months ;).
I agree with you that we shouldn't try to model things like all the
different Intel addressing modes directly in the IR - indeed that
would be crazy. But I don't think it can really be represented by "a
combination of simpler blocks" either; that sounds rather naive to me,
and IMHO it's just silly to try to model anything like that at all in
a driver-independent IR. The most we can do is write a backend IR that
does model those things (which we sort of have now with the i965 FS
backend, although it needs a lot of work) and make the middle IR do as
much common optimization as possible, so that the backend doesn't have
to do as much work with the more unwieldy representation.

What I meant by "most flexible" is that it can represent anything GLSL
and D3D assembly can represent without losing any information that
backends might want. I certainly agree with you that by itself it's
rather annoying to work with, but with SSA, and therefore no more
writemasks, I think it's rather bearable. The swizzles and modifiers
basically become things that modify the inputs and outputs, and AFAIK
all the easy things are still easy - certainly, I know copy
propagation and DCE are still trivial, since I already wrote the code
for them (there's a sketch of the copy propagation part below). If you
want, we can even make separate abs, neg, and sat opcodes and then
make it invalid to use those modifiers in SSA, like we already do with
writemasks. You could even make a nir_swizzle_instr (although IMHO
that makes things even more complicated than they already are) to get
your "basic building blocks." But I still think we should keep those
things around in non-SSA form for backends like vec4, so that we can
do the out-of-SSA translation and inline writemasks and modifiers for
them. That way, they can continue to generate at least not-terrible
code without having to do much optimization since, as you said, it's a
rather annoying model to work with, and the complications won't leak
out to the rest of the things using the IR.
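
Here's roughly what that copy propagation amounts to - a sketch
against the nir_alu_src from this series (src, swizzle[], abs,
negate), with use-list bookkeeping omitted:

/* Make "use", which reads the destination of an SSA mov, read the
 * mov's source directly, composing the two swizzles and modifiers. */
static void
copy_propagate(nir_alu_src *use, const nir_alu_src *mov_src)
{
   uint8_t composed[4];

   /* component i of the use comes from whatever component the mov's
    * swizzle selected for use->swizzle[i] */
   for (unsigned i = 0; i < 4; i++)
      composed[i] = mov_src->swizzle[use->swizzle[i]];
   for (unsigned i = 0; i < 4; i++)
      use->swizzle[i] = composed[i];

   /* abs on the use discards any sign the mov's modifiers produced */
   if (!use->abs) {
      use->negate ^= mov_src->negate;
      use->abs = mov_src->abs;
   }

   use->src = mov_src->src;   /* now reads the mov's source value */
}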

Btw, it sounds like you have a lot of ideas about how the IR should
work - nothing's stopping you from making an LLVM-based prototype, and
if the other Intel folks like it better than NIR I'd be happy to use
that instead. I don't think they want to use LLVM now, but if you
could convince them I'd certainly go with it.

Connor

>
>> And yes, it is certainly possible for backends to add their own
>> machine-specific opcodes and intrinsics - it's a lot easier than it
>> was with GLSL IR, as there's only one place (nir_opcodes.h) they have
>> to be added to.
>

