[Lima] Update on lima dev

Connor Abbott cwabbott0 at gmail.com
Mon Jun 11 05:59:24 UTC 2018


On Sun, Jun 10, 2018 at 6:56 PM, Qiang Yu <yuq825 at gmail.com> wrote:
> Hi Erico,
>
> Add lima ML and Connor.
>
>> 1) about register spilling in ppir: apparently what needs to be done
>> is to turn ra_allocate(g) in ppir into a loop and implement
>> "choose_spill_node" and "spill_node" ppir functions in case
>> ra_allocate fails, like what is done in
>> src/broadcom/compiler/vir_register_allocate.c .
>> If I understand ppir correctly, we could have multiple
>> operations in a single pp instruction, but we are not doing that for
>> now.
> In fact I do merge two nodes into the same instr when ^vmul or ^fmul
> can be used to save a reg. And yes, we need to add a merge stage
> later for optimization.
>
>> So my first implementation would be simply to add a temporary
>> load/store before and after every use of the spilled registers. From
>> what I understood, it is not the best performance-wise, but it could
>> solve the problem.
>> Is this solution ok for now?
> I'm OK with this.
>
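
For reference, the retry loop would look something like the rough
sketch below. Only the ra_* calls are mesa's actual
util/register_allocate API; everything else (the ppir_* helpers and
fields) is made-up scaffolding for the sketch:

static bool ppir_regalloc_prog(ppir_compiler *comp)
{
   while (true) {
      /* (Re)build the interference graph for the current program. */
      struct ra_graph *g = ppir_regalloc_build_graph(comp);

      /* Give RA a spill cost per node so ra_get_best_spill_node() can
       * make a sensible choice. */
      for (int i = 0; i < comp->num_ra_nodes; i++)
         ra_set_node_spill_cost(g, i, ppir_node_spill_cost(comp, i));

      if (ra_allocate(g)) {
         /* Success: read back ra_get_node_reg() for each node. */
         ppir_regalloc_assign(comp, g);
         return true;
      }

      int spill_node = ra_get_best_spill_node(g);
      if (spill_node == -1)
         return false;   /* nothing spillable left, give up */

      /* Lower the spilled value: store to a temporary after each def
       * and load it back before each use, then loop and retry. */
      ppir_spill_node(comp, spill_node);
   }
}
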
>>
>> 2) about converting scalar to vec4. I spent less time on this one, but
>> so far it seems like a hard problem. Do you think it is even possible
>> to implement a generic solution (for example iterating the list of
>> nodes and reconstructing vectors in loads and alu operations, or
>> implementing some new algorithm), or do you already expect we will use
>> a new mesa "cap" to turn off conversion to scalar?
> I was also thinking about the two choices you mentioned. I think the
> scalar-to-vec4 method is preferred if it can be implemented, because
> NIR link time optimization needs scalar. But as for whether it is
> implementable, I don't have a good algorithm now. @Connor, do you have
> any idea about this?

There was already a patch series floating around on mesa-dev to add a
cap and disable splitting for vector architectures:

https://lists.freedesktop.org/archives/mesa-dev/2018-June/197132.html

You should apply that to your mesa fork and return scalar for vertex
shaders and vector for fragment shaders. If NIR link time
optimizations require scalar i/o, that can be fixed later. If no one
else has commit rights, I can commit it with the minor change
suggested by Marek.
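
On the lima side, wiring that up would just be a per-stage return from
the screen, something like the sketch below. I'm using
PIPE_SHADER_CAP_SCALAR_ISA as a stand-in name; whatever the cap from
that series ends up being called is what you'd actually switch on:

#include "pipe/p_defines.h"
#include "pipe/p_screen.h"

/* Sketch only: "PIPE_SHADER_CAP_SCALAR_ISA" is a stand-in for the cap
 * added by the series above.  The point is just that gp (vertex
 * shaders) wants scalar i/o while pp (fragment shaders) wants it left
 * in vectors. */
static int
lima_screen_get_shader_param(struct pipe_screen *pscreen,
                             enum pipe_shader_type shader,
                             enum pipe_shader_cap param)
{
   switch (param) {
   case PIPE_SHADER_CAP_SCALAR_ISA:
      return shader == PIPE_SHADER_VERTEX ? 1 : 0;
   default:
      return 0; /* other caps elided from the sketch */
   }
}
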

>
>>
>> 3) I noticed that gpir doesn't use ra_allocate() and the
>> register_allocate infrastructure from mesa, while pp does. Is there a
>> reason for that, or should gpir move to also use mesa's
>> register_allocate later?
>>
> It's OK for gpir reg alloc to be implemented like this. I heard from
> Connor that it's called linear scan reg alloc. It's simpler and faster
> than graph coloring based reg alloc. Besides, mesa's reg alloc is for
> HW that has vec regs, which introduces the vec4/vec3/vec2/vec1
> conflict problem. gp is just scalar HW, so there's no need to use
> mesa's ra. BTW, there are two ra passes in gpir: the first is for
> value regs, the second is for real regs. I haven't implemented
> spilling for the real reg alloc, but it's not triggered yet.

Well, it's optimal for the case where there's only one basic block,
but not for multiple basic blocks, so you still need to use mesa's ra
when gpir supports more than one basic block.
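
(For anyone reading along: "linear scan" here just means walking the
values in the order they become live, freeing registers whose live
ranges have ended, and either grabbing a free register or spilling.
A generic single-block sketch of the idea, not gpir's actual code:)

#include <stdbool.h>
#include <string.h>

/* Generic linear-scan sketch (Poletto/Sarkar style), only to show the
 * idea -- not gpir's data structures.  intervals[] must be sorted by
 * start, and num_regs must be <= 64 for this toy version. */
struct live_interval {
   int start, end;   /* first/last instruction where the value is live */
   int reg;          /* assigned register, or -1 if spilled */
};

static void
linear_scan(struct live_interval **intervals, int count, int num_regs)
{
   struct live_interval *active[64];  /* live & assigned, sorted by end */
   int num_active = 0;
   bool reg_used[64] = { false };

   for (int i = 0; i < count; i++) {
      struct live_interval *cur = intervals[i];

      /* Expire intervals that ended before cur starts, freeing regs. */
      int expired = 0;
      while (expired < num_active && active[expired]->end < cur->start)
         reg_used[active[expired++]->reg] = false;
      memmove(active, active + expired,
              (num_active - expired) * sizeof(active[0]));
      num_active -= expired;

      if (num_active == num_regs) {
         /* No free register: spill whichever live value ends last. */
         struct live_interval *last = active[num_active - 1];
         if (last->end > cur->end) {
            cur->reg = last->reg;
            last->reg = -1;
            num_active--;
         } else {
            cur->reg = -1;
            continue;
         }
      } else {
         for (int r = 0; r < num_regs; r++) {
            if (!reg_used[r]) {
               cur->reg = r;
               reg_used[r] = true;
               break;
            }
         }
      }

      /* Insert cur into active[], keeping it sorted by end point. */
      int k = num_active++;
      while (k > 0 && active[k - 1]->end > cur->end) {
         active[k] = active[k - 1];
         k--;
      }
      active[k] = cur;
   }
}
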

I should mention that I've been working on rewriting the scheduler. I
noticed (and I think yuq did too) that even if the shader uses at most
11 value registers, you still might need to spill some values to
actual registers in order to avoid an excessive number of mov's, and
only the final scheduler really has enough information to make a good
choice about what to spill. So with the new scheduler implementation,
we don't do the linear scan regalloc beforehand. Instead, we do
more-or-less the same thing inside the scheduler. We keep track of the
live physical registers and live value registers (i.e. the
ready/partially ready nodes) before the current instruction. Since we
no longer have the linear scan regalloc, we have to check whether
placing a node would push the value register pressure above 11; if it
would, we try to spill a live value register to a physical register,
and otherwise fail to place the node. Other than that, the core
algorithm is pretty similar. With
this, we only have to ensure that we never have a register pressure of
more than 11+64 in order to guarantee that scheduling succeeds,
although physical registers now count against that limit. After the
new scheduler and support for multiple basic blocks, the backend would
consist of these steps:

1. Do a pre-RA register-pressure-minimizing schedule pass, just like
now, although we'd want to count physical registers too.
2. Do global register allocation using mesa's RA. There are two
register classes: one for physical registers that contains 64
registers, and one for value registers that contains 64+11 registers,
where the first 64 conflict with the physical registers and the last
11 don't (see the sketch after this list). Register loads/stores are
treated as mov's for the purposes of register allocation; they get
assigned their actual physical register by RA, but for value
registers, we just note what register RA picked.
3. Similar to before, we insert fake dependencies, but now they're
integrated with physical register dependencies. For example, say
there's a store to register 42, followed by a node that uses a value
that RA assigned to register 42. We'd add a read-after-write
dependency as if the node reads register 42, even if the value doesn't
actually get spilled to register 42. By doing this, we guarantee that
there are no more than 64+11 values and physical registers live at the
same time during scheduling.
4. Run the post-RA scheduler, which actually decides which nodes get
spilled to registers.
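
For step 2 (the sketch mentioned in the list above), the register set
setup with mesa's RA would be roughly the following. gpir_init_ra and
the GP_* defines are made up for illustration; the ra_* calls are the
real util/register_allocate API:

#include "util/register_allocate.h"

/* Sketch of the step-2 register set: registers 0..63 are the physical
 * registers and registers 64..138 are the value registers, where the
 * first 64 value registers each conflict with the corresponding
 * physical register and the last 11 conflict with nothing. */
#define GP_NUM_PHYS_REGS   64
#define GP_NUM_VALUE_REGS  (64 + 11)

static struct ra_regs *
gpir_init_ra(void *mem_ctx)
{
   struct ra_regs *regs =
      ra_alloc_reg_set(mem_ctx, GP_NUM_PHYS_REGS + GP_NUM_VALUE_REGS,
                       false);

   unsigned phys_class = ra_alloc_reg_class(regs);
   unsigned value_class = ra_alloc_reg_class(regs);

   for (unsigned i = 0; i < GP_NUM_PHYS_REGS; i++)
      ra_class_add_reg(regs, phys_class, i);

   for (unsigned i = 0; i < GP_NUM_VALUE_REGS; i++) {
      unsigned r = GP_NUM_PHYS_REGS + i;
      ra_class_add_reg(regs, value_class, r);
      /* The first 64 value registers alias the physical registers. */
      if (i < GP_NUM_PHYS_REGS)
         ra_add_reg_conflict(regs, r, i);
   }

   ra_set_finalize(regs, NULL);
   return regs;
}
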

I've been implementing the scheduler part with the kmscube shader as
my test case, although I haven't got it working yet (it still asserts
somewhere). Hopefully the result will have an instruction count close
to the blob's.
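
To make the spill check I described above a bit more concrete, the
placement test inside the scheduler has roughly this shape (all names
here are made up for illustration, not the actual code):

/* Conceptual shape of the per-node placement check described above.
 * Every name here is made up for illustration. */
static bool
sched_try_place_node(sched_state *s, sched_node *node)
{
   /* Value registers that would be live if we place this node in the
    * current instruction. */
   int pressure = s->live_value_regs + sched_new_live_values(s, node);

   while (pressure > 11) {
      /* Over the 11 value-register limit: try to move one currently
       * live value out to a physical register. */
      sched_node *victim = sched_pick_spill_candidate(s);
      if (!victim)
         return false;   /* nothing to spill, fail to place the node */

      sched_spill_value_to_phys_reg(s, victim);
      pressure--;
   }

   sched_place_node(s, node);
   s->live_value_regs = pressure;
   return true;
}
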

Connor

>
> Regards,
> Qiang

