[Mesa-dev] [RFC PATCH 00/16] A new IR for Mesa

Wed Aug 20 07:01:50 PDT 2014

Connor Abbott <cwabbott0 at gmail.com> writes:

> On Tue, Aug 19, 2014 at 11:33 PM, Francisco Jerez <currojerez at riseup.net> wrote:
>> Connor Abbott <cwabbott0 at gmail.com> writes:
>>
>>> On Tue, Aug 19, 2014 at 11:40 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>>>> Tom Stellard <tom at stellard.net> writes:
>>>>
>>>>> On Tue, Aug 19, 2014 at 11:04:59AM -0400, Connor Abbott wrote:
>>>>>> On Mon, Aug 18, 2014 at 8:52 PM, Michel Dänzer <michel at daenzer.net> wrote:
>>>>>> > On 19.08.2014 01:28, Connor Abbott wrote:
>>>>>> >> On Mon, Aug 18, 2014 at 4:32 AM, Michel Dänzer <michel at daenzer.net> wrote:
>>>>>> >>> On 16.08.2014 09:12, Connor Abbott wrote:
>>>>>> >>>> I know what you might be thinking right now. "Wait, *another* IR? Don't
>>>>>> >>>> we already have like 5 of those, not counting all the driver-specific
>>>>>> >>>> ones? Isn't this stuff complicated enough already?" Well, there are some
>>>>>> >>>> pretty good reasons to start afresh (again...). In the years we've been
>>>>>> >>>> using GLSL IR, we've come to realize that, in fact, it's not what we
>>>>>> >>>> want *at all* to do optimizations on.
>>>>>> >>>
>>>>>> >>> Did you evaluate using LLVM IR instead of inventing yet another one?
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> --
>>>>>> >>> Earthling Michel Dänzer            |                  http://www.amd.com
>>>>>> >>> Libre software enthusiast          |                Mesa and X developer
>>>>>> >>
>>>>>> >> Yes. See
>>>>>> >>
>>>>>> >> http://lists.freedesktop.org/archives/mesa-dev/2014-February/053502.html
>>>>>> >>
>>>>>> >> and
>>>>>> >>
>>>>>> >> http://lists.freedesktop.org/archives/mesa-dev/2014-February/053522.html
>>>>>> >
>>>>>> > I know Ian can't deal with LLVM for some reason. I was wondering if
>>>>>> > *you* evaluated it, and if so, why you rejected it.
>>>>>> >
>>>>>> >
>>>>>> > --
>>>>>> > Earthling Michel Dänzer            |                  http://www.amd.com
>>>>>> > Libre software enthusiast          |                Mesa and X developer
>>>>>>
>>>>>>
>>>>>> Well, first of all, the fact that Ian and Ken don't want to use it
>>>>>> means that any plan to use LLVM for the Intel driver is dead in the
>>>>>> water anyways - you can translate NIR into LLVM if you want, but for
>>>>>> i965 we want to share optimizations between our 2 backends (FS and
>>>>>> vec4) that we can't do today in GLSL IR so this is what we want to use
>>>>>> for that, and since nobody else does anything with the core GLSL
>>>>>> compiler except when they have to, when we start moving things out of
>>>>>> GLSL IR this will probably replace GLSL IR as the infrastructure that
>>>>>> all Mesa drivers use. But with that in mind, here are a few reasons
>>>>>> why we wouldn't want to use LLVM:
>>>>>>
>>>>>> * LLVM wasn't built to understand structured CFG's, meaning that you
>>>>>> need to re-structurize it using a pass that's fragile and prone to
>>>>>> break if some other pass "optimizes" the shader in a way that makes it
>>>>>> non-structured (i.e. not expressible in terms of loops and if
>>>>>> statements). This loss of information also means that passes that need
>>>>>> to know things like, for example, the loop nesting depth need to do an
>>>>>> analysis pass whereas with NIR you can just walk up the control flow
>>>>>> tree and count the number of loops we hit.
>>>>>>
>>>>>
>>>>> LLVM has a pass to structurize the CFG.  We use it in the radeon
>>>>> drivers, and it is run after all of the other LLVM optimizations which have
>>>>> no concept of structured CFG.  It's not bug free, but it works really
>>>>> well even with all of the complex OpenCL kernels we throw at it.
>>>>>
>>>>> Your point about losing information when the CFG is de-structurized is
>>>>> valid, but for things like loop depth, I'm not sure why we couldn't write an
>>>>> LLVM analysis pass for this (if one doesn't already exist).
>>>>>
>>>>
>>>> I don't think this is such a big deal either.  At least the
>>>> structurization pass used on newer AMD hardware isn't "fragile" in the
>>>> way you seem to imply -- AFAIK (unlike the old AMDIL heuristic
>>>> algorithm) it's guaranteed to give you a valid structurized output no
>>>> matter what the previous optimization passes have done to the CFG,
>>>> modulo bugs.  I admit that the situation is nevertheless suboptimal.
>>>> Ideally this information wouldn't get lost along the way.  For the long
>>>> term we may want to represent structured control flow directly in the IR
>>>> as you say, I just don't see how reinventing the IR saves us any work if
>>>> we could just fix the existing one.
>>>
>>> It seems to me that something like how we represent control flow is a
>>> pretty fundamental part of the IR - it affects any optimization pass
>>> that needs to do anything beyond adding and removing instructions. How
>>> would you fix that, especially given that LLVM is primarily designed
>>> for CPU's where you don't want to be restricted to structured control
>>> flow at all? It seems like our goals (preserve the structure) conflict
>>> with the way LLVM has been designed.
>>>
>> I think we can fix this by introducing new structured variants of the
>> branch instruction in a way that doesn't alter the fundamental structure
>> of the IR.  E.g. an if branch could look like:
>>
>> ifbr i1 <cond>, label <iftrue>, label <iffalse>, label <join>
>>
>> Where both branches are guaranteed to converge at <join>.  Sure, this
>> will require fixing many assumptions, but on the one hand it's not
>> immediately required (as we can address this problem for the time being
>> using the same solution AMD uses) and on the other hand it's still less
>> work than starting from scratch.
>
> I disagree with the "less work than starting from scratch" part,
> especially since it involves modifying it in a pretty invasive way,
> when we won't even need half of the things that it does for us. LLVM
> just isn't a solution to everything - there is no one-size-fits-all
> compiler.
>

*Shrug* That's quite a strong statement.  Honestly I haven't ruled out
the possibility of coming up with a decent IR by ourselves yet, but at
this point I feel like improving the LLVM framework to make it more
suitable for GPUs would be a much more promising use of my time than
working on NIR, even if starting from scratch sounds like a lot more
fun.

>>
>>>>
>>>>>> * LLVM doesn't do modifiers, meaning that we can't do optimizations
>>>>>> like "clamp(x, 0.0, 1.0) => mov.sat x" and "clamp(x, 0.25, 1.0) =>
>>>>>> max.sat(x, .25)" in a generic fashion.
>>>>>>
>>>>>
>>>>> The way to handle this with LLVM would be to add intrinsics to represent
>>>>> the various modifiers and then fold them into instructions during
>>>>> instruction selection.
>>>>>
>>>>
>>>> IMHO this is a feature.  One of the things I don't like about NIR is
>>>> that it's still vec4-centric.  Most drivers are going to want something
>>>> else and different to each other, we cannot please all of them with one
>>>> single vector addressing model built into the core instruction set, so
>>>> I'd rather have modifiers, writemasks and swizzles represented as the
>>>> composition of separate instructions/intrinsics with simple and
>>>> well-defined semantics, which can be coalesced back into the real
>>>> instruction as Tom says (easy even if you don't use LLVM's instruction
>>>> selector as long as it's SSA form).
>>>
>>> While NIR is vec4-centric, nothing's stopping you from splitting up
>>> instructions and doing optimizations at the scalar level for scalar
>>> ISA's - in fact, that's what I expect to happen. And for backends that
>>> really do need to have swizzles and writemasks, coalescing these
>>> things back into the original instruction is not at all trivial
>>
>> It's a simple peephole optimization AFAICT:
>>
>> val2 = alu-op(modifier(val1)) -> hardware-specific-extended-alu-op(val1)
>> val2 = shuffle(val2, alu-op(val1)) -> hardware-specific-alu-op-with-writemask(val2, val1)
>
> No, it's not. Imagine something like:
>
> vec4 foo = ...
> vec4 bar = ...
> vec4 baz = vec4(foo.xy, bar.zw)
> ... = foo
> ... = bar
> ... = baz
>
> where the vec4() is the shuffle instruction. In this case, you can't
> eliminate the shuffle - you need to insert writemasked moves when you
> come out of SSA:
>
> vec4 foo = ...
> vec4 bar = ...
> baz.xy = foo.xy
> baz.zw = bar.zw
>
> This basically comes down to something analogous to a register
> allocation problem, where in this case the scalar components that we
> want to put into a single vec4 (foo, bar, and baz) can't fit - we need
> to "spill" by inserting copies. Then, once we've done this, we have to
> convert it into a non-SSA form with registers, writemasks, and
> swizzles - something that would be easy to do in the IR -> backend
> translation, if it really were just a simple peephole, but in this
> case it's not and so you either have to consult the result of your
> analysis during the translation or have an IR that can represent
> swizzles, writemasks, and non-SSA registers for you like NIR does. Of
> course, LLVM will help with none of this because its vectorization
> model is built around CPU vector processors like SSE, NEON, etc. and
> so AFAIK it has no concept of per-component liveness, and even if it
> did, this stuff is intimately tied to the out-of-SSA process itself so
> we would basically have to write it from scratch anyways.
>

I think you keep mixing two unrelated problems:
1/ How we represent vector addressing, writemasks and modifiers in the
   core IR.
2/ How we bring vector operations back into non-SSA form.

Re 1 you propose making the vec4 model a central part of the IR rather
than using composition of simpler operations.  Whatever we do, going
from one representation to the other is a simple peephole, which I never
meant as a solution for 2.
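
To make the "simple peephole" in point 1 concrete, here is a rough
sketch of folding a separate modifier instruction back into the ALU
instruction that consumes it.  Everything in it is hypothetical (the
Instr class, the MODIFIERS set, the src_mods field); none of these names
exist in Mesa or LLVM.  It assumes SSA form and that program outputs are
consumed by some explicit instruction, so an unused modifier is really
dead.

```python
from dataclasses import dataclass, field

# Hypothetical op names; a real back-end would list its own modifiers.
MODIFIERS = {"neg", "abs"}


@dataclass
class Instr:
    dest: str    # SSA value defined by this instruction
    op: str      # e.g. "load", "add", "neg", "abs"
    srcs: list   # names of the SSA values this instruction reads
    src_mods: dict = field(default_factory=dict)  # source index -> modifier


def fold_source_modifiers(prog):
    """Rewrite val2 = alu(mod(val1)) as one ALU op with a source modifier."""
    defs = {i.dest: i for i in prog}  # SSA: exactly one definition per value
    for instr in prog:
        if instr.op in MODIFIERS:
            continue
        for n, src in enumerate(instr.srcs):
            d = defs.get(src)
            # Fold only one modifier per source; chains are left alone.
            if d is not None and d.op in MODIFIERS and n not in instr.src_mods:
                instr.srcs[n] = d.srcs[0]
                instr.src_mods[n] = d.op
    # Drop modifier instructions that no longer have any users.
    used = {s for i in prog for s in i.srcs}
    return [i for i in prog if not (i.op in MODIFIERS and i.dest not in used)]


prog = [
    Instr("a", "load", []),
    Instr("b", "neg", ["a"]),
    Instr("c", "add", ["b", "a"]),
]
folded = fold_source_modifiers(prog)
```

After the pass, the "neg" instruction is gone and the "add" reads "a"
twice, with a "neg" modifier recorded on its first source.  The same
pattern match works in the other direction, which is why I consider the
choice of representation independent from problem 2.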

Re 2 I agree with you that it would ideally be taken care of by a shared
transformation pass because of its complexity, but I disagree that a
vec4-centric IR is required for this purpose, or even especially useful.
Different hardware has wildly different vector models, each with its own
constraints and each requiring a different representation, so I think
ideally we would have some mechanism for back-ends to provide their own
representation in the form of machine-specific instructions accompanied
by some machine-specific logic.
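
For reference, the "spill" case from the example quoted above
(baz = vec4(foo.xy, bar.zw) with foo and bar still live) could be
lowered by a pass along these lines.  This is only a sketch: the
function name and the textual output format are made up, and it always
emits copies, leaving out the per-component liveness analysis that
would decide when the copies can be coalesced away.

```python
COMPONENTS = "xyzw"


def lower_shuffle(dest, picks):
    """Lower an SSA shuffle to writemasked moves.

    picks[i] is a (source name, source component) pair giving the value
    of the i-th component of dest.  Components taken from the same
    source are grouped into a single move.
    """
    moves = {}  # source name -> (dest writemask, source swizzle)
    for i, (src, comp) in enumerate(picks):
        mask, swizzle = moves.get(src, ("", ""))
        moves[src] = (mask + COMPONENTS[i], swizzle + comp)
    return [f"{dest}.{mask} = {src}.{swizzle}"
            for src, (mask, swizzle) in moves.items()]


moves = lower_shuffle(
    "baz", [("foo", "x"), ("foo", "y"), ("bar", "z"), ("bar", "w")])
```

For the quoted example this yields the two moves "baz.xy = foo.xy" and
"baz.zw = bar.zw".  Nothing here depends on the core IR being
vec4-centric; the input is just a shuffle described by its sources.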