[Mesa-dev] [RFC PATCH 00/16] A new IR for Mesa

Fri Aug 22 13:56:30 PDT 2014

On 08/20/2014 09:11 AM, Francisco Jerez wrote:
> Connor Abbott <cwabbott0 at gmail.com> writes:
> 
>> On Wed, Aug 20, 2014 at 7:01 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>>> Connor Abbott <cwabbott0 at gmail.com> writes:
>>>
>>>> On Tue, Aug 19, 2014 at 11:33 PM, Francisco Jerez <currojerez at riseup.net> wrote:
>>>>> Connor Abbott <cwabbott0 at gmail.com> writes:
>>>>>
>>>>>> On Tue, Aug 19, 2014 at 11:40 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>>>>>>> Tom Stellard <tom at stellard.net> writes:
>>>>>>>
>>>>>>>> On Tue, Aug 19, 2014 at 11:04:59AM -0400, Connor Abbott wrote:
>>>>>>>>> On Mon, Aug 18, 2014 at 8:52 PM, Michel Dänzer <michel at daenzer.net> wrote:
>>>>>>>>>> On 19.08.2014 01:28, Connor Abbott wrote:
>>>>>>>>>>> On Mon, Aug 18, 2014 at 4:32 AM, Michel Dänzer <michel at daenzer.net> wrote:
>>>>>>>>>>>> On 16.08.2014 09:12, Connor Abbott wrote:
>>>>>>>>>>>>> I know what you might be thinking right now. "Wait, *another* IR? Don't
>>>>>>>>>>>>> we already have like 5 of those, not counting all the driver-specific
>>>>>>>>>>>>> ones? Isn't this stuff complicated enough already?" Well, there are some
>>>>>>>>>>>>> pretty good reasons to start afresh (again...). In the years we've been
>>>>>>>>>>>>> using GLSL IR, we've come to realize that, in fact, it's not what we
>>>>>>>>>>>>> want *at all* to do optimizations on.
>>>>>>>>>>>>
>>>>>>>>>>>> Did you evaluate using LLVM IR instead of inventing yet another one?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Earthling Michel Dänzer            |                  http://www.amd.com
>>>>>>>>>>>> Libre software enthusiast          |                Mesa and X developer
>>>>>>>>>>>
>>>>>>>>>>> Yes. See
>>>>>>>>>>>
>>>>>>>>>>> http://lists.freedesktop.org/archives/mesa-dev/2014-February/053502.html
>>>>>>>>>>>
>>>>>>>>>>> and
>>>>>>>>>>>
>>>>>>>>>>> http://lists.freedesktop.org/archives/mesa-dev/2014-February/053522.html
>>>>>>>>>>
>>>>>>>>>> I know Ian can't deal with LLVM for some reason. I was wondering if
>>>>>>>>>> *you* evaluated it, and if so, why you rejected it.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Earthling Michel Dänzer            |                  http://www.amd.com
>>>>>>>>>> Libre software enthusiast          |                Mesa and X developer
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Well, first of all, the fact that Ian and Ken don't want to use it
>>>>>>>>> means that any plan to use LLVM for the Intel driver is dead in the
>>>>>>>>> water anyways - you can translate NIR into LLVM if you want, but for
>>>>>>>>> i965 we want to share optimizations between our 2 backends (FS and
>>>>>>>>> vec4) that we can't do today in GLSL IR so this is what we want to use
>>>>>>>>> for that, and since nobody else does anything with the core GLSL
>>>>>>>>> compiler except when they have to, when we start moving things out of
>>>>>>>>> GLSL IR this will probably replace GLSL IR as the infrastructure that
>>>>>>>>> all Mesa drivers use. But with that in mind, here are a few reasons
>>>>>>>>> why we wouldn't want to use LLVM:
>>>>>>>>>
>>>>>>>>> * LLVM wasn't built to understand structured CFG's, meaning that you
>>>>>>>>> need to re-structurize it using a pass that's fragile and prone to
>>>>>>>>> break if some other pass "optimizes" the shader in a way that makes it
>>>>>>>>> non-structured (i.e. not expressible in terms of loops and if
>>>>>>>>> statements). This loss of information also means that passes that need
>>>>>>>>> to know things like, for example, the loop nesting depth need to do an
>>>>>>>>> analysis pass whereas with NIR you can just walk up the control flow
>>>>>>>>> tree and count the number of loops we hit.
>>>>>>>>>
>>>>>>>>
>>>>>>>> LLVM has a pass to structurize the CFG.  We use it in the radeon
>>>>>>>> drivers, and it is run after all of the other LLVM optimizations which have
>>>>>>>> no concept of structured CFG.  It's not bug free, but it works really
>>>>>>>> well even with all of the complex OpenCL kernels we throw at it.
>>>>>>>>
>>>>>>>> Your point about losing information when the CFG is de-structurized is
>>>>>>>> valid, but for things like loop depth, I'm not sure why we couldn't write an
>>>>>>>> LLVM analysis pass for this (if one doesn't already exist).
>>>>>>>>
>>>>>>>
>>>>>>> I don't think this is such a big deal either.  At least the
>>>>>>> structurization pass used on newer AMD hardware isn't "fragile" in the
>>>>>>> way you seem to imply -- AFAIK (unlike the old AMDIL heuristic
>>>>>>> algorithm) it's guaranteed to give you a valid structurized output no
>>>>>>> matter what the previous optimization passes have done to the CFG,
>>>>>>> modulo bugs.  I admit that the situation is nevertheless suboptimal.
>>>>>>> Ideally this information wouldn't get lost along the way.  For the long
>>>>>>> term we may want to represent structured control flow directly in the IR
>>>>>>> as you say, I just don't see how reinventing the IR saves us any work if
>>>>>>> we could just fix the existing one.
>>>>>>
>>>>>> It seems to me that something like how we represent control flow is a
>>>>>> pretty fundamental part of the IR - it affects any optimization pass
>>>>>> that needs to do anything beyond adding and removing instructions. How
>>>>>> would you fix that, especially given that LLVM is primarily designed
>>>>>> for CPU's where you don't want to be restricted to structured control
>>>>>> flow at all? It seems like our goals (preserve the structure) conflict
>>>>>> with the way LLVM has been designed.
>>>>>>
>>>>> I think we can fix this by introducing new structured variants of the
>>>>> branch instruction in a way that doesn't alter the fundamental structure
>>>>> of the IR.  E.g. an if branch could look like:
>>>>>
>>>>> ifbr i1 <cond>, label <iftrue>, label <iffalse>, label <join>
>>>>>
>>>>> Where both branches are guaranteed to converge at <join>.  Sure, this
>>>>> will require fixing many assumptions, but on the one hand it's not
>>>>> immediately required (as we can address this problem for the time being
>>>>> using the same solution AMD uses) and on the other hand it's still less
>>>>> work than starting from scratch.
>>>>
>>>> I disagree with the "less work than starting from scratch" part,
>>>> especially since it involves modifying it in a pretty invasive way,
>>>> when we won't even need half of the things that it does for us. LLVM
>>>> just isn't a solution to everything - there is no one-size-fits-all
>>>> compiler.
>>>>
>>>
>>> *Shrug* That's quite a strong statement.  Honestly I haven't ruled out
>>> the possibility of coming up with a decent IR by ourselves yet, but at
>>> this point I feel like improving the LLVM framework to make it more
>>> suitable for GPUs would be a much more promising use of my time than
>>> working on NIR -- Even if starting from scratch sounds like a lot more
>>> fun.
>>>
>>>>>
>>>>>>>
>>>>>>>>> * LLVM doesn't do modifiers, meaning that we can't do optimizations
>>>>>>>>> like "clamp(x, 0.0, 1.0) => mov.sat x" and "clamp(x, 0.25, 1.0) =>
>>>>>>>>> max.sat(x, .25)" in a generic fashion.
>>>>>>>>>
>>>>>>>>
>>>>>>>> The way to handle this with LLVM would be to add intrinsics to represent
>>>>>>>> the various modifiers and then fold them into instructions during
>>>>>>>> instruction selection.
>>>>>>>>
>>>>>>>
>>>>>>> IMHO this is a feature.  One of the things I don't like about NIR is
>>>>>>> that it's still vec4-centric.  Most drivers are going to want something
>>>>>>> else and different to each other, we cannot please all of them with one
>>>>>>> single vector addressing model built into the core instruction set, so
>>>>>>> I'd rather have modifiers, writemasks and swizzles represented as the
>>>>>>> composition of separate instructions/intrinsics with simple and
>>>>>>> well-defined semantics, which can be coalesced back into the real
>>>>>>> instruction as Tom says (easy even if you don't use LLVM's instruction
>>>>>>> selector as long as it's SSA form).
>>>>>>
>>>>>> While NIR is vec4-centric, nothing's stopping you from splitting up
>>>>>> instructions and doing optimizations at the scalar level for scalar
>>>>>> ISA's - in fact, that's what I expect to happen. And for backends that
>>>>>> really do need to have swizzles and writemasks, coalescing these
>>>>>> things back into the original instruction is not at all trivial
>>>>>
>>>>> It's a simple peephole optimization AFAICT:
>>>>>
>>>>> val2 = alu-op(modifier(val1)) -> hardware-specific-extended-alu-op(val)
>>>>> val2 = shuffle(val2, alu-op(val1)) -> hardware-specific-alu-op-with-writemask(val2, val1)
>>>>
>>>> No, it's not. Imagine something like:
>>>>
>>>> vec4 foo = ...
>>>> vec4 bar = ...
>>>> vec4 baz = vec4(foo.xy, bar.zw)
>>>> ... = foo
>>>> ... = bar
>>>> ... = baz
>>>>
>>>> where the vec4() is the shuffle instruction. In this case, you can't
>>>> eliminate the shuffle - you need to insert writemasked moves when you
>>>> come out of SSA:
>>>>
>>>> vec4 foo = ...
>>>> vec4 bar = ...
>>>> baz.xy = foo.xy
>>>> baz.zw = bar.zw
>>>>
>>>> This basically comes down to something analogous to a register
>>>> allocation problem, where in this case the scalar components that we
>>>> want to put into a single vec4 (foo, bar, and baz) can't fit - we need
>>>> to "spill" by inserting copies. Then, once we've done this, we have to
>>>> convert it into a non-SSA form with registers, writemasks, and
>>>> swizzles - something that would be easy to do in the IR -> backend
>>>> translation, if it really were just a simple peephole, but in this
>>>> case it's not and so you either have to consult the result of your
>>>> analysis during the translation or have an IR that can represent
>>>> swizzles, writemasks, and non-SSA registers for you like NIR does. Of
>>>> course, LLVM will help with none of this because it's vectorization
>>>> model is built around CPU vector processors like SSE, NEON, etc. and
>>>> so AFAIK it has no concept of per-component liveness, and even if it
>>>> did, this stuff is intimately tied to the out-of-SSA process itself so
>>>> we would basically have to write it from scratch anyways.
>>>>
>>>
>>> I think you keep mixing two unrelated problems:
>>> 1/ How we represent vector addressing, writemasks and modifiers in the
>>>    core IR.
>>> 2/ How we bring vector operations back into non-SSA form.
>>>
>>> Re 1 you propose making the vec4 model a central part of the IR rather
>>> than using composition of simpler operations.  Whatever we do, going
>>> From one representation to the other is a simple peephole, which I never
>>> meant would be a solution for 2.
>>>
>>> Re 2 I agree with you that it would ideally be taken care of by a shared
>>> transformation pass because of its complexity, but I disagree that a
>>> vec4-centric IR is required for this purpose, or even especially useful,
>>> because different hardware has wildly different vector models with
>>> different constraints and requiring a different representation, so I
>>> think ideally we would have some mechanism for back-ends to provide
>>> their own representation in the form of machine-specific instructions
>>> accompanied with some machine-specific logic.
>>
>> I don't see why it's necessarily a bad idea to support the most
>> flexible vector addressing model and then have backends that don't
>> support it lower it to something they do support, or do their own
>> transformation pass instead of the standard one which will lower to
>> the normal model (full swizzling and writemasking).
> 
> Vec4 is by no means the most flexible model (it's just flexible enough
> to be annoying to deal with IMHO).  Just look at intel's SIMD4x2
> register addressing modes, you can do dozens of tricks with them that
> you cannot represent in terms of vec4 (using align1 vs align16 access
> modes, differing horizontal and vertical strides, three different modes
> of indirect addressing, bit-casting across components, etc.) -- It would
> be crazy IMHO to design the core IR around that (or around some sort of
> lowest common denominator of all the vector addressing models out there)
> as you can always express the same semantics as a combination of simpler
> blocks, and the backend needs some serious pattern-matching
> infrastructure anyway to make full use of the hardware flexibility.

I think this underscores the expectation that each backend will have its
own low-level IR and the expectation that each backend will perform
additional optimizations on that low-level IR.

>> And yes, it is certainly possible for backends to add their own
>> machine-specific opcodes and intrinsics - it's a lot easier than it
>> was with GLSL IR, as there's only one spot (nir_opcodes.h) that it has
>> to be added to.
> 
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/mesa-dev