[Mesa-dev] [RFC PATCH 00/16] A new IR for Mesa

Wed Aug 20 07:17:22 PDT 2014

And don't forget that explicit vec4 becomes immensely amusing once you
add fp64/double to the problem.

  OG.

On Wed, Aug 20, 2014 at 4:01 PM, Francisco Jerez <currojerez at riseup.net> wrote:
> Connor Abbott <cwabbott0 at gmail.com> writes:
>
>> On Tue, Aug 19, 2014 at 11:33 PM, Francisco Jerez <currojerez at riseup.net> wrote:
>>> Connor Abbott <cwabbott0 at gmail.com> writes:
>>>
>>>> On Tue, Aug 19, 2014 at 11:40 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>>>>> Tom Stellard <tom at stellard.net> writes:
>>>>>
>>>>>> On Tue, Aug 19, 2014 at 11:04:59AM -0400, Connor Abbott wrote:
>>>>>>> On Mon, Aug 18, 2014 at 8:52 PM, Michel Dänzer <michel at daenzer.net> wrote:
>>>>>>> > On 19.08.2014 01:28, Connor Abbott wrote:
>>>>>>> >> On Mon, Aug 18, 2014 at 4:32 AM, Michel Dänzer <michel at daenzer.net> wrote:
>>>>>>> >>> On 16.08.2014 09:12, Connor Abbott wrote:
>>>>>>> >>>> I know what you might be thinking right now. "Wait, *another* IR? Don't
>>>>>>> >>>> we already have like 5 of those, not counting all the driver-specific
>>>>>>> >>>> ones? Isn't this stuff complicated enough already?" Well, there are some
>>>>>>> >>>> pretty good reasons to start afresh (again...). In the years we've been
>>>>>>> >>>> using GLSL IR, we've come to realize that, in fact, it's not what we
>>>>>>> >>>> want *at all* to do optimizations on.
>>>>>>> >>>
>>>>>>> >>> Did you evaluate using LLVM IR instead of inventing yet another one?
>>>>>>> >>>
>>>>>>> >>>
>>>>>>> >>> --
>>>>>>> >>> Earthling Michel Dänzer            |                  http://www.amd.com
>>>>>>> >>> Libre software enthusiast          |                Mesa and X developer
>>>>>>> >>
>>>>>>> >> Yes. See
>>>>>>> >>
>>>>>>> >> http://lists.freedesktop.org/archives/mesa-dev/2014-February/053502.html
>>>>>>> >>
>>>>>>> >> and
>>>>>>> >>
>>>>>>> >> http://lists.freedesktop.org/archives/mesa-dev/2014-February/053522.html
>>>>>>> >
>>>>>>> > I know Ian can't deal with LLVM for some reason. I was wondering if
>>>>>>> > *you* evaluated it, and if so, why you rejected it.
>>>>>>> >
>>>>>>>
>>>>>>>
>>>>>>> Well, first of all, the fact that Ian and Ken don't want to use it
>>>>>>> means that any plan to use LLVM for the Intel driver is dead in the
>>>>>>> water anyways - you can translate NIR into LLVM if you want, but for
>>>>>>> i965 we want to share optimizations between our 2 backends (FS and
>>>>>>> vec4) that we can't do today in GLSL IR so this is what we want to use
>>>>>>> for that, and since nobody else does anything with the core GLSL
>>>>>>> compiler except when they have to, when we start moving things out of
>>>>>>> GLSL IR this will probably replace GLSL IR as the infrastructure that
>>>>>>> all Mesa drivers use. But with that in mind, here are a few reasons
>>>>>>> why we wouldn't want to use LLVM:
>>>>>>>
>>>>>>> * LLVM wasn't built to understand structured CFGs, meaning that you
>>>>>>> need to re-structurize it using a pass that's fragile and prone to
>>>>>>> break if some other pass "optimizes" the shader in a way that makes it
>>>>>>> non-structured (i.e. not expressible in terms of loops and if
>>>>>>> statements). This loss of information also means that passes that need
>>>>>>> to know things like, for example, the loop nesting depth need to do an
>>>>>>> analysis pass whereas with NIR you can just walk up the control flow
>>>>>>> tree and count the number of loops we hit.
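The "walk up the control flow tree" idea in the bullet above can be sketched in C as follows. This is only an illustration under stated assumptions: `cf_node`, `cf_node_type`, and `loop_nesting_depth` are hypothetical names invented for this sketch and are not taken from the NIR patches.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node kinds for a structured control flow tree. */
enum cf_node_type { CF_FUNCTION, CF_BLOCK, CF_IF, CF_LOOP };

/* Hypothetical tree node: each node only needs to know its parent. */
struct cf_node {
   enum cf_node_type type;
   struct cf_node *parent;   /* NULL for the function root */
};

/* Count the loops that enclose `node` by following parent pointers
 * up to the root. The node itself is not counted. */
static unsigned
loop_nesting_depth(const struct cf_node *node)
{
   unsigned depth = 0;

   assert(node != NULL);
   for (const struct cf_node *n = node->parent; n != NULL; n = n->parent) {
      if (n->type == CF_LOOP)
         depth++;
   }
   return depth;
}
```

For a basic block nested as function > loop > if > loop > block, the walk visits two loop nodes and returns 2, with no separate analysis pass over a flat CFG.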
>>>>>>>
>>>>>>
>>>>>> LLVM has a pass to structurize the CFG.  We use it in the radeon
>>>>>> drivers, and it is run after all of the other LLVM optimizations which have
>>>>>> no concept of structured CFG.  It's not bug free, but it works really
>>>>>> well even with all of the complex OpenCL kernels we throw at it.
>>>>>>
>>>>>> Your point about losing information when the CFG is de-structurized is
>>>>>> valid, but for things like loop depth, I'm not sure why we couldn't write an
>>>>>> LLVM analysis pass for this (if one doesn't already exist).
>>>>>>
>>>>>
>>>>> I don't think this is such a big deal either.  At least the
>>>>> structurization pass used on newer AMD hardware isn't "fragile" in the
>>>>> way you seem to imply -- AFAIK (unlike the old AMDIL heuristic
>>>>> algorithm) it's guaranteed to give you a valid structurized output no
>>>>> matter what the previous optimization passes have done to the CFG,
>>>>> modulo bugs.  I admit that the situation is nevertheless suboptimal.
>>>>> Ideally this information wouldn't get lost along the way.  For the long
>>>>> term we may want to represent structured control flow directly in the IR
>>>>> as you say, I just don't see how reinventing the IR saves us any work if
>>>>> we could just fix the existing one.
>>>>
>>>> It seems to me that something like how we represent control flow is a
>>>> pretty fundamental part of the IR - it affects any optimization pass
>>>> that needs to do anything beyond adding and removing instructions. How
>>>> would you fix that, especially given that LLVM is primarily designed
>>>> for CPUs where you don't want to be restricted to structured control
>>>> flow at all? It seems like our goals (preserve the structure) conflict
>>>> with the way LLVM has been designed.
>>>>
>>> I think we can fix this by introducing new structured variants of the
>>> branch instruction in a way that doesn't alter the fundamental structure
>>> of the IR.  E.g. an if branch could look like:
>>>
>>> ifbr i1 <cond>, label <iftrue>, label <iffalse>, label <join>
>>>
>>> Where both branches are guaranteed to converge at <join>.  Sure, this
>>> will require fixing many assumptions, but on the one hand it's not
>>> immediately required (as we can address this problem for the time being
>>> using the same solution AMD uses) and on the other hand it's still less
>>> work than starting from scratch.
>>
>> I disagree with the "less work than starting from scratch" part,
>> especially since it involves modifying it in a pretty invasive way,
>> when we won't even need half of the things that it does for us. LLVM
>> just isn't a solution to everything - there is no one-size-fits-all
>> compiler.
>>
>
> *Shrug* That's quite a strong statement.  Honestly I haven't ruled out
> the possibility of coming up with a decent IR by ourselves yet, but at
> this point I feel like improving the LLVM framework to make it more
> suitable for GPUs would be a much more promising use of my time than
> working on NIR -- Even if starting from scratch sounds like a lot more
> fun.
>
>>>
>>>>>
>>>>>>> * LLVM doesn't do modifiers, meaning that we can't do optimizations
>>>>>>> like "clamp(x, 0.0, 1.0) => mov.sat x" and "clamp(x, 0.25, 1.0) =>
>>>>>>> max.sat(x, .25)" in a generic fashion.
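The two clamp rewrites quoted above could look roughly like the following C sketch when the IR carries a saturate modifier on the instruction itself. The types and names (`alu_instr`, `alu_op`, `fold_clamp_to_sat`) are hypothetical and exist only for illustration. The sketch also assumes constant bounds and ignores NaN handling.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical opcodes; a real IR would have many more. */
enum alu_op { OP_MOV, OP_MAX, OP_CLAMP };

/* Hypothetical instruction with constant bounds and a saturate modifier. */
struct alu_instr {
   enum alu_op op;
   float lo, hi;    /* clamp bounds; lo doubles as the constant operand of OP_MAX */
   bool saturate;   /* destination modifier: clamp the result to [0, 1] */
};

/* Rewrite clamp(x, lo, 1.0) in terms of the saturate modifier.
 * Only valid for 0 <= lo <= 1; otherwise the clamp could produce
 * values outside [0, 1] that .sat would wrongly discard.
 * Returns true if the instruction was rewritten. */
static bool
fold_clamp_to_sat(struct alu_instr *instr)
{
   if (instr->op != OP_CLAMP || instr->hi != 1.0f)
      return false;

   if (instr->lo == 0.0f) {
      instr->op = OP_MOV;   /* clamp(x, 0.0, 1.0) => mov.sat x */
   } else if (instr->lo > 0.0f && instr->lo <= 1.0f) {
      instr->op = OP_MAX;   /* clamp(x, lo, 1.0) => max.sat(x, lo) */
   } else {
      return false;
   }

   instr->saturate = true;
   return true;
}
```

The second rewrite holds because max(x, lo) is never below lo, so for a positive lo the lower bound of the saturate is a no-op and only its upper bound of 1.0 takes effect.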
>>>>>>>
>>>>>>
>>>>>> The way to handle this with LLVM would be to add intrinsics to represent
>>>>>> the various modifiers and then fold them into instructions during
>>>>>> instruction selection.
>>>>>>
>>>>>
>>>>> IMHO this is a feature.  One of the things I don't like about NIR is
>>>>> that it's still vec4-centric.  Most drivers are going to want something
>>>>> else and different to each other, we cannot please all of them with one
>>>>> single vector addressing model built into the core instruction set, so
>>>>> I'd rather have modifiers, writemasks and swizzles represented as the
>>>>> composition of separate instructions/intrinsics with simple and
>>>>> well-defined semantics, which can be coalesced back into the real
>>>>> instruction as Tom says (easy even if you don't use LLVM's instruction
>>>>> selector as long as it's SSA form).
>>>>
>>>> While NIR is vec4-centric, nothing's stopping you from splitting up
>>>> instructions and doing optimizations at the scalar level for scalar
>>>> ISAs - in fact, that's what I expect to happen. And for backends that
>>>> really do need to have swizzles and writemasks, coalescing these
>>>> things back into the original instruction is not at all trivial
>>>
>>> It's a simple peephole optimization AFAICT:
>>>
>>> val2 = alu-op(modifier(val1)) -> hardware-specific-extended-alu-op(val1)
>>> val2 = shuffle(val2, alu-op(val1)) -> hardware-specific-alu-op-with-writemask(val2, val1)
>>
>> No, it's not. Imagine something like:
>>
>> vec4 foo = ...
>> vec4 bar = ...
>> vec4 baz = vec4(foo.xy, bar.zw)
>> ... = foo
>> ... = bar
>> ... = baz
>>
>> where the vec4() is the shuffle instruction. In this case, you can't
>> eliminate the shuffle - you need to insert writemasked moves when you
>> come out of SSA:
>>
>> vec4 foo = ...
>> vec4 bar = ...
>> baz.xy = foo.xy
>> baz.zw = bar.zw
>>
>> This basically comes down to something analogous to a register
>> allocation problem, where in this case the scalar components that we
>> want to put into a single vec4 (foo, bar, and baz) can't fit - we need
>> to "spill" by inserting copies. Then, once we've done this, we have to
>> convert it into a non-SSA form with registers, writemasks, and
>> swizzles - something that would be easy to do in the IR -> backend
>> translation, if it really were just a simple peephole, but in this
>> case it's not and so you either have to consult the result of your
>> analysis during the translation or have an IR that can represent
>> swizzles, writemasks, and non-SSA registers for you like NIR does. Of
>> course, LLVM will help with none of this because its vectorization
>> model is built around CPU vector processors like SSE, NEON, etc. and
>> so AFAIK it has no concept of per-component liveness, and even if it
>> did, this stuff is intimately tied to the out-of-SSA process itself so
>> we would basically have to write it from scratch anyways.
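The lowering of the foo/bar/baz example above into writemasked moves can be simulated in C as a small sketch. The names here (`vec4_reg`, `masked_mov`, `lower_shuffle`, the `WRITEMASK_*` constants) are hypothetical and chosen only for this illustration. The sketch models what the two moves compute, not how a compiler decides that the copies are needed.

```c
#include <assert.h>

/* Hypothetical writemask bits, one per vec4 channel. */
enum {
   WRITEMASK_X = 1 << 0,
   WRITEMASK_Y = 1 << 1,
   WRITEMASK_Z = 1 << 2,
   WRITEMASK_W = 1 << 3,
};

/* Hypothetical non-SSA vec4 register. */
struct vec4_reg {
   float chan[4];
};

/* Copy only the channels selected by `writemask` from src to dst;
 * unselected channels of dst keep their previous values. */
static void
masked_mov(struct vec4_reg *dst, const struct vec4_reg *src, unsigned writemask)
{
   for (int i = 0; i < 4; i++) {
      if (writemask & (1u << i))
         dst->chan[i] = src->chan[i];
   }
}

/* baz = vec4(foo.xy, bar.zw), expressed as two writemasked moves.
 * foo and bar stay intact because both are still live afterwards. */
static void
lower_shuffle(struct vec4_reg *baz,
              const struct vec4_reg *foo,
              const struct vec4_reg *bar)
{
   masked_mov(baz, foo, WRITEMASK_X | WRITEMASK_Y);   /* baz.xy = foo.xy */
   masked_mov(baz, bar, WRITEMASK_Z | WRITEMASK_W);   /* baz.zw = bar.zw */
}
```

Because foo, bar, and baz are all read later, baz needs its own register and the copies cannot be folded away, which is the "spill" described above.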
>>
>
> I think you keep mixing two unrelated problems:
> 1/ How we represent vector addressing, writemasks and modifiers in the
>    core IR.
> 2/ How we bring vector operations back into non-SSA form.
>
> Re 1 you propose making the vec4 model a central part of the IR rather
> than using composition of simpler operations.  Whatever we do, going
> from one representation to the other is a simple peephole, which I never
> meant would be a solution for 2.
>
> Re 2 I agree with you that it would ideally be taken care of by a shared
> transformation pass because of its complexity, but I disagree that a
> vec4-centric IR is required for this purpose, or even especially useful,
> because different hardware has wildly different vector models with
> different constraints and requiring a different representation, so I
> think ideally we would have some mechanism for back-ends to provide
> their own representation in the form of machine-specific instructions
> accompanied with some machine-specific logic.