[Mesa-dev] [RFC PATCH 00/16] A new IR for Mesa

Wed Aug 20 07:34:54 PDT 2014

On Wed, Aug 20, 2014 at 5:57 AM, Christian König
<deathsimple at vodafone.de> wrote:
> On 20.08.2014 at 14:33, Connor Abbott wrote:
>
>> On Tue, Aug 19, 2014 at 11:57 PM, Christian König
>> <deathsimple at vodafone.de> wrote:
>>>
>>> I think we can fix this by introducing new structured variants of the
>>> branch instruction in a way that doesn't alter the fundamental structure
>>> of the IR.  E.g. an if branch could look like:
>>>
>>> ifbr i1 <cond>, label <iftrue>, label <iffalse>, label <join>
>>>
>>> Where both branches are guaranteed to converge at <join>.  Sure, this
>>> will require fixing many assumptions, but on the one hand it's not
>>> immediately required (as we can address this problem for the time being
>>> using the same solution AMD uses) and on the other hand it's still less
>>> work than starting from scratch.
>>>
>>> Well, I wrote the structurizer pass in LLVM that you are talking about
>>> here, and from my experience you really don't want any structured form
>>> of control flow in the IR.
>>>
>>> Structured control flow is just a specialized form of unstructured
>>> control flow, and even if it looks rather awkward at first glance, it is
>>> indeed simpler to destructurize the compiler-generated control flow for
>>> optimization and structurize it again for instruction selection.
>>
>> That's interesting. I still think that with the right infrastructure,
>> having structured control flow really isn't that bad, and it prevents
>> optimizations from doing work like optimizing "if (foo) { break; }"
>> into a single conditional branch when clearly that's not very
>> productive. I would suspect that LLVM just isn't very good at
>> structured control flow since it wasn't designed that way, and that's
>> why it seems hard to work with.
>
>
> Well, maybe I should note that a lot of closed source drivers are using LLVM
> for their internal IR representation, and as far as I know more or less all
> of them have a rather structured form of control flow.
>
> The problem with LLVM really isn't its IR, because it's not designed to be
> CPU-centric like you obviously think, but rather that LLVM doesn't have a
> stable interface and is a rather fast-moving project.
>
> Actually, for R600 for example, you do want to optimize a pattern like "if
> (foo) { break; }" into a conditional branch, because if you look at the ISA
> you see that the LOOP_BREAK pattern is able to take an additional condition
> to apply to the current execution mask.
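
A toy model of why folding "if (foo) { break; }" into a single masked
operation can be attractive on SIMD hardware. This is a generic
execution-mask sketch in Python, not the R600 ISA; loop_break_if and
count_iterations are hypothetical names used only for illustration.

```python
def loop_break_if(exec_mask, cond):
    """Deactivate every active lane whose break condition is true."""
    return [active and not c for active, c in zip(exec_mask, cond)]


def count_iterations(bounds):
    """Each lane counts up until it reaches its own bound, then breaks."""
    exec_mask = [True] * len(bounds)
    counts = [0] * len(bounds)
    while any(exec_mask):
        # "if (count >= bound) { break; }" expressed as one masked operation:
        # no branch is taken, lanes simply drop out of the execution mask.
        cond = [c >= b for c, b in zip(counts, bounds)]
        exec_mask = loop_break_if(exec_mask, cond)
        # Loop body: only lanes that are still active do any work.
        counts = [c + 1 if a else c for c, a in zip(counts, exec_mask)]
    return counts
```

The loop runs until the last lane has broken out; lanes that left earlier
keep their final value.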
>
> When you design a hardware-independent IR, looking at the backend hardware
> level like you do right now is actually the completely wrong approach. What
> you need to do is make the IR as simple as possible and then allow
> specialized operations on it to translate it into the desired machine code.

I'm not looking at the backend hardware level here, but at other
languages (in this case D3D bytecode) that support the same thing, and
therefore it's something that the HW probably has/can do efficiently
and something that app developers (especially those translating D3D
bytecode into GLSL, of which there are quite a lot) expect. NIR
obviously doesn't support every HW's strange restrictions on swizzling
and modifiers; backends can do the lowering for that themselves.

>
> In other words the logic necessary for code generation shouldn't be inside
> the IR, because then the IR is specialized to this specific problem. Instead
> the logic needs to be in the tools that surround the IR.
>
> Regards,
> Christian.

These are all good points, and frankly I don't think it would be too
bad if we switched to LLVM. Unfortunately, though, I think that the
Intel driver won't be using LLVM in the near future, if nothing else
for various non-technical reasons I'm not at liberty to discuss. But
making the switch to a flat SSA-based IR, in addition to being an
improvement over the current state of things, will certainly help us
move closer to LLVM and see if it's something we would want to pursue.

Connor

>
>
>>
>>> The only reason I've annotated the LLVM IR with specialized intrinsics
>>> for the SI backend was laziness and I wouldn't do that again given the
>>> chance.
>>>
>>> And it's very likely that these backends, which probably aren't using
>>> SSA due to the aforementioned difficulties, will also benefit from
>>> having modifiers already folded for them - this is something that's
>>> already a problem for i965 vec4 backend and that NIR will help a lot.
>>>
>>> Well, I have the impression that much of the reason why the i965 vec4
>>> backend has lagged behind so much in comparison with the fs backend is
>>> precisely because it's so annoying to optimize vec4 code.  It seems
>>> painful to me that you have this built into the core instruction set so
>>> generic optimization passes will have to be explicitly aware of it.  I
>>> wouldn't be surprised if the i965 vec4 benefited at least as much from
>>> scalarizing the code, performing optimizations there, and re-vectorizing
>>> afterwards.
>>
>> We thought about doing something like that, but I don't think it's
>> really that much of a burden when it comes to the rest of the IR. Most
>> of the difficulty of working with a vec4 representation comes from the
>> fact that instructions can partially update their outputs, and once we
>> convert to SSA that problem goes away since there are no partial
>> updates in SSA. Coming out of SSA is where the difficulty lies, but I
>> still think that's a solvable problem, just a difficult one. Plus,
>> there's the problem of how to do the vectorization - you could do it
>> in SSA, but then you still have the hard bit of coming out of SSA and
>> so you're back to square one, or you could do it once you're out of
>> SSA but then it's a lot harder to reason about since you're back to
>> having partial updates.
>>
>>>
>>> Completely agree.
>>>
>>> Being able to do vectorization in an IR is important, but you shouldn't
>>> try to handle backend-specific swizzle operations and vectorizing
>>> restrictions in the IR. Just look at the swizzle restrictions of R600,
>>> for example: I really can't imagine that you want to represent this in
>>> a common IR between all the different drivers.
>>>
>>> Regards,
>>> Christian.
>>>
>>> On 20.08.2014 at 08:33, Francisco Jerez wrote:
>>>
>>> Connor Abbott <cwabbott0 at gmail.com> writes:
>>>
>>> On Tue, Aug 19, 2014 at 11:40 AM, Francisco Jerez <currojerez at riseup.net>
>>> wrote:
>>>
>>> Tom Stellard <tom at stellard.net> writes:
>>>
>>> On Tue, Aug 19, 2014 at 11:04:59AM -0400, Connor Abbott wrote:
>>>
>>> On Mon, Aug 18, 2014 at 8:52 PM, Michel Dänzer <michel at daenzer.net>
>>> wrote:
>>>
>>> On 19.08.2014 01:28, Connor Abbott wrote:
>>>
>>> On Mon, Aug 18, 2014 at 4:32 AM, Michel Dänzer <michel at daenzer.net>
>>> wrote:
>>>
>>> On 16.08.2014 09:12, Connor Abbott wrote:
>>>
>>> I know what you might be thinking right now. "Wait, *another* IR? Don't
>>> we already have like 5 of those, not counting all the driver-specific
>>> ones? Isn't this stuff complicated enough already?" Well, there are some
>>> pretty good reasons to start afresh (again...). In the years we've been
>>> using GLSL IR, we've come to realize that, in fact, it's not what we
>>> want *at all* to do optimizations on.
>>>
>>> Did you evaluate using LLVM IR instead of inventing yet another one?
>>>
>>>
>>> --
>>> Earthling Michel Dänzer            |                  http://www.amd.com
>>> Libre software enthusiast          |                Mesa and X developer
>>>
>>> Yes. See
>>>
>>> http://lists.freedesktop.org/archives/mesa-dev/2014-February/053502.html
>>>
>>> and
>>>
>>> http://lists.freedesktop.org/archives/mesa-dev/2014-February/053522.html
>>>
>>> I know Ian can't deal with LLVM for some reason. I was wondering if
>>> *you* evaluated it, and if so, why you rejected it.
>>>
>>>
>>> Well, first of all, the fact that Ian and Ken don't want to use it
>>> means that any plan to use LLVM for the Intel driver is dead in the
>>> water anyways - you can translate NIR into LLVM if you want, but for
>>> i965 we want to share optimizations between our 2 backends (FS and
>>> vec4) that we can't do today in GLSL IR so this is what we want to use
>>> for that, and since nobody else does anything with the core GLSL
>>> compiler except when they have to, when we start moving things out of
>>> GLSL IR this will probably replace GLSL IR as the infrastructure that
>>> all Mesa drivers use. But with that in mind, here are a few reasons
>>> why we wouldn't want to use LLVM:
>>>
>>> * LLVM wasn't built to understand structured CFGs, meaning that you
>>> need to re-structurize it using a pass that's fragile and prone to
>>> break if some other pass "optimizes" the shader in a way that makes it
>>> non-structured (i.e. not expressible in terms of loops and if
>>> statements). This loss of information also means that passes that need
>>> to know things like, for example, the loop nesting depth need to do an
>>> analysis pass whereas with NIR you can just walk up the control flow
>>> tree and count the number of loops we hit.
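
The loop-depth point above can be sketched in a few lines of Python. This
assumes a hypothetical CFNode type with a parent pointer; it is only an
illustration of "walk up the control flow tree and count the loops", not
NIR's actual data structure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CFNode:
    """Hypothetical control flow tree node; not NIR's real type."""
    kind: str                          # "function", "loop", "if" or "block"
    parent: Optional["CFNode"] = None


def loop_depth(node):
    """Count the loops that enclose `node` by walking its parent chain."""
    depth = 0
    cur = node.parent
    while cur is not None:
        if cur.kind == "loop":
            depth += 1
        cur = cur.parent
    return depth


# A block nested as: function { loop { if { loop { block } } } }
func = CFNode("function")
outer = CFNode("loop", func)
branch = CFNode("if", outer)
inner = CFNode("loop", branch)
body = CFNode("block", inner)
```

With an unstructured CFG the same number has to be recovered by a separate
analysis, which is the trade-off being argued about here.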
>>>
>>> LLVM has a pass to structurize the CFG.  We use it in the radeon
>>> drivers, and it is run after all of the other LLVM optimizations which
>>> have no concept of structured CFG.  It's not bug free, but it works
>>> really well even with all of the complex OpenCL kernels we throw at it.
>>>
>>> Your point about losing information when the CFG is de-structurized is
>>> valid, but for things like loop depth, I'm not sure why we couldn't
>>> write an LLVM analysis pass for this (if one doesn't already exist).
>>>
>>> I don't think this is such a big deal either.  At least the
>>> structurization pass used on newer AMD hardware isn't "fragile" in the
>>> way you seem to imply -- AFAIK (unlike the old AMDIL heuristic
>>> algorithm) it's guaranteed to give you a valid structurized output no
>>> matter what the previous optimization passes have done to the CFG,
>>> modulo bugs.  I admit that the situation is nevertheless suboptimal.
>>> Ideally this information wouldn't get lost along the way.  For the long
>>> term we may want to represent structured control flow directly in the IR
>>> as you say, I just don't see how reinventing the IR saves us any work if
>>> we could just fix the existing one.
>>>
>>> It seems to me that something like how we represent control flow is a
>>> pretty fundamental part of the IR - it affects any optimization pass
>>> that needs to do anything beyond adding and removing instructions. How
>>> would you fix that, especially given that LLVM is primarily designed
>>> for CPUs where you don't want to be restricted to structured control
>>> flow at all? It seems like our goals (preserve the structure) conflict
>>> with the way LLVM has been designed.
>>>
>>> I think we can fix this by introducing new structured variants of the
>>> branch instruction in a way that doesn't alter the fundamental structure
>>> of the IR.  E.g. an if branch could look like:
>>>
>>> ifbr i1 <cond>, label <iftrue>, label <iffalse>, label <join>
>>>
>>> Where both branches are guaranteed to converge at <join>.  Sure, this
>>> will require fixing many assumptions, but on the one hand it's not
>>> immediately required (as we can address this problem for the time being
>>> using the same solution AMD uses) and on the other hand it's still less
>>> work than starting from scratch.
>>>
>>> * LLVM doesn't do modifiers, meaning that we can't do optimizations
>>> like "clamp(x, 0.0, 1.0) => mov.sat x" and "clamp(x, 0.25, 1.0) =>
>>> max.sat(x, .25)" in a generic fashion.
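
The two rewrites in the bullet above can be checked with a small Python
sketch. The .sat modifier is modelled here as a clamp to [0, 1]; fold_clamp
and the tuple encoding of instructions are hypothetical, and NaN handling is
ignored.

```python
def clamp(x, lo, hi):
    return min(max(x, lo), hi)


def sat(x):
    """Model of the .sat modifier: clamp the result to [0, 1]."""
    return clamp(x, 0.0, 1.0)


def mov_sat(x):
    return sat(x)


def max_sat(x, y):
    return sat(max(x, y))


def fold_clamp(x, lo, hi):
    """Return the folded form of clamp(x, lo, hi) as a tuple, or None."""
    if hi == 1.0 and lo == 0.0:
        return ("mov.sat", x)
    if hi == 1.0 and 0.0 < lo <= 1.0:
        # sat(max(x, lo)) == min(max(x, lo), 1) whenever lo is in (0, 1].
        return ("max.sat", x, lo)
    return None   # no modifier-based form applies
```

The second rule is valid because max(x, lo) is already at least lo > 0, so
the lower half of the saturate never triggers.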
>>>
>>> The way to handle this with LLVM would be to add intrinsics to represent
>>> the various modifiers and then fold them into instructions during
>>> instruction selection.
>>>
>>> IMHO this is a feature.  One of the things I don't like about NIR is
>>> that it's still vec4-centric.  Most drivers are going to want something
>>> else, and something different from each other.  We cannot please all of
>>> them with one single vector addressing model built into the core
>>> instruction set, so I'd rather have modifiers, writemasks and swizzles
>>> represented as the composition of separate instructions/intrinsics with
>>> simple and well-defined semantics, which can be coalesced back into the
>>> real instruction as Tom says (easy even if you don't use LLVM's
>>> instruction selector, as long as it's in SSA form).
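
A rough Python sketch of the "separate modifier instructions, coalesced back
later" approach described above. The instruction encoding (dest, op, srcs),
the MODIFIERS table and fold_modifiers are all hypothetical; this is neither
LLVM's instruction selector nor NIR code.

```python
# Modifier ops and how they are spelled once folded into a source operand.
MODIFIERS = {"neg": "-{}", "abs": "|{}|"}


def fold_modifiers(instrs, live_out):
    """Fold standalone modifier instructions into the ALU ops that use them.

    instrs   -- list of (dest, op, srcs) tuples in SSA form
    live_out -- set of SSA names that are still needed after this block
    """
    defs = {dest: (op, srcs) for dest, op, srcs in instrs}
    folded = []
    for dest, op, srcs in instrs:
        new_srcs = []
        for s in srcs:
            d = defs.get(s)
            if op not in MODIFIERS and d is not None and d[0] in MODIFIERS:
                # Source is produced by a modifier: use a source modifier.
                new_srcs.append(MODIFIERS[d[0]].format(d[1][0]))
            else:
                new_srcs.append(s)
        folded.append((dest, op, new_srcs))
    # Drop modifier instructions that no longer have any use.
    used = {s for _, _, srcs in folded for s in srcs}
    return [i for i in folded
            if not (i[1] in MODIFIERS and i[0] not in used
                    and i[0] not in live_out)]


before = [("t", "neg", ["a"]), ("r", "add", ["t", "b"])]
after = fold_modifiers(before, live_out={"r"})
```

SSA makes this easy because every name has exactly one definition, so the
lookup in `defs` is unambiguous.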
>>>
>>> While NIR is vec4-centric, nothing's stopping you from splitting up
>>> instructions and doing optimizations at the scalar level for scalar
>>> ISA's - in fact, that's what I expect to happen. And for backends that
>>> really do need to have swizzles and writemasks, coalescing these
>>> things back into the original instruction is not at all trivial
>>>
>>> It's a simple peephole optimization AFAICT:
>>>
>>> val2 = alu-op(modifier(val1)) -> hardware-specific-extended-alu-op(val1)
>>> val2 = shuffle(val2, alu-op(val1)) ->
>>> hardware-specific-alu-op-with-writemask(val2, val1)
>>>
>>> - in fact, going into and out of SSA without introducing extra copies
>>> even in situations like:
>>>
>>> foo.xyz = ...
>>> ... = foo
>>> foo.x = ...
>>>
>>> is a problem that hasn't been solved yet publicly (it seems doable,
>>> but difficult).
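
A small Python sketch of the foo.xyz / foo.x case above, showing why partial
updates disappear in SSA: every masked write yields a new complete vector
value. write_masked is a hypothetical helper, and vectors are modelled as
plain 4-tuples.

```python
def write_masked(old, new, mask):
    """Return a new vec4: channels named in `mask` come from `new`,
    all other channels are carried over from `old`."""
    return tuple(n if c in mask else o
                 for c, o, n in zip("xyzw", old, new))


foo0 = (0.0, 0.0, 0.0, 0.0)

# foo.xyz = ...   becomes a new SSA value foo1
foo1 = write_masked(foo0, (1.0, 2.0, 3.0, 9.0), "xyz")

# ... = foo       reads foo1, which is a complete value
use = foo1

# foo.x = ...     becomes yet another SSA value foo2
foo2 = write_masked(foo1, (7.0, 8.0, 8.0, 8.0), "x")
```

Translating this out of SSA naively copies the whole vector for each write;
the hard part is assigning foo1 and foo2 to the same register so that only
the written channels are touched.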
>>>
>>> This problem is orthogonal to the mechanism you use to represent
>>> swizzles and writemasks AFAICT.  How could having these modifiers built
>>> into the core ISA help you with transforming vector ops in and out of
>>> SSA?
>>>
>>> So while we might not need swizzles and writemasks for most backends,
>>> for the few that do need it (like, for example, the i965 vec4 backend)
>>> it will be very nice to have one common lowering pass that solves this
>>> hard problem, which would be impossible to do without having swizzles
>>> and writemasks in the IR.
>>>
>>> I disagree.  It would be possible if the IR is extensible enough for
>>> back-ends to be able to represent their exotic vector addressing modes
>>> as driver-defined machine instructions in a way that generic
>>> optimization passes can still deal with them.
>>>
>>> And it's very likely that these backends, which probably aren't using
>>> SSA due to the aforementioned difficulties, will also benefit from
>>> having modifiers already folded for them - this is something that's
>>> already a problem for i965 vec4 backend and that NIR will help a lot.
>>>
>>> Well, I have the impression that much of the reason why the i965 vec4
>>> backend has lagged behind so much in comparison with the fs backend is
>>> precisely because it's so annoying to optimize vec4 code.  It seems
>>> painful to me that you have this built into the core instruction set so
>>> generic optimization passes will have to be explicitly aware of it.  I
>>> wouldn't be surprised if the i965 vec4 benefited at least as much from
>>> scalarizing the code, performing optimizations there, and re-vectorizing
>>> afterwards.
>>>
>>> [...]
>>>