[Lima] Update on lima dev

Qiang Yu yuq825 at gmail.com
Mon Jun 11 08:12:05 UTC 2018


On Mon, Jun 11, 2018 at 1:59 PM Connor Abbott <cwabbott0 at gmail.com> wrote:
>
> On Sun, Jun 10, 2018 at 6:56 PM, Qiang Yu <yuq825 at gmail.com> wrote:
> > Hi Erico,
> >
> > Add lima ML and Connor.
> >
> >> 1) about register spilling in ppir: apparently what needs to be done
> >> is to turn ra_allocate(g) in ppir into a loop and implement
> >> "choose_spill_node" and "spill_node" ppir functions in case
> >> ra_allocate fails, like is done in
> >> src/broadcom/compiler/vir_register_allocate.c.
> >> If I understand ppir correctly, we could have multiple
> >> operations in a single pp instruction, but we are not doing that for
> >> now.
> > In fact I do merge two nodes into the same instr when ^vmul or ^fmul
> > can be used to save a reg. And yes, we need to add a merge stage
> > later for optimization.
> >
> >> So my first implementation would simply be to add a temporary
> >> load/store before and after every use of the spilled registers. From
> >> what I understood, it is not the best performance-wise, but it could
> >> solve the problem.
> >> Is this solution ok for now?
> > I'm OK with this.
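To make it concrete, the loop could look roughly like this, modeled on
the v3d code; this is just a sketch, and the ppir_* helpers named here
are placeholders we'd still have to write:

/* Sketch only, modeled on vir_register_allocate.c.  ra_allocate() is
 * mesa's graph-coloring entry point; the ppir_* helpers below do not
 * exist yet. */
static bool
ppir_regalloc(ppir_compiler *comp)
{
   while (true) {
      struct ra_graph *g = ppir_build_interference_graph(comp);

      if (ra_allocate(g)) {
         ppir_assign_regs(comp, g);   /* read results out of g */
         ralloc_free(g);
         return true;
      }

      ppir_node *spill = ppir_choose_spill_node(comp, g);
      ralloc_free(g);
      if (!spill)
         return false;                /* nothing left to spill */

      /* load before every use, store after every def */
      ppir_spill_node_to_temp(comp, spill);
   }
}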
> >
> >>
> >> 2) about converting scalar to vec4. I spent less time on this one, but
> >> so far it seems like a hard problem. Do you think it is even possible to
> >> implement a generic solution (for example iterating the list of nodes
> >> and reconstructing vectors in loads and alu operations, or
> >> implementing some new algorithm), or do you already expect we will use
> >> a new mesa "cap" to turn off conversion to scalar?
> > I was also thinking of the two choices you mentioned. I think the scalar
> > to vec4 method is preferred if it can be implemented, because the NIR link
> > time optimizations need scalar. But as for whether it is implementable, I
> > don't have a good algorithm now. @Connor, do you have any idea about this?
>
> There was already a patch series floating around on mesa-dev to add a
> cap and disable splitting for vector architectures:
>
> https://lists.freedesktop.org/archives/mesa-dev/2018-June/197132.html
>
> You should apply that to your mesa fork and return scalar for vertex
> shaders and vector for fragment shaders.
OK, this is what we needed.
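So in our fork we'd return something like this from the shader cap
query, assuming the cap from that series lands as a per-stage shader
cap (the exact name may differ):

/* Sketch: report gp (vertex) as a scalar ISA and pp (fragment) as
 * vec4, so NIR only gets scalarized for the vertex stage. */
static int
lima_screen_get_shader_param(struct pipe_screen *pscreen,
                             enum pipe_shader_type shader,
                             enum pipe_shader_cap param)
{
   switch (param) {
   case PIPE_SHADER_CAP_SCALAR_ISA:
      return shader == PIPE_SHADER_VERTEX;
   default:
      return 0; /* other caps elided */
   }
}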

> If NIR link time
> optimizations require scalar i/o, that can be fixed later.
Yeah, we need to reverse nir_lower_io_to_scalar_early after the link
time optimizations. That should be easier to implement.
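Roughly the ordering would be as below; lima_nir_vectorize_io is a
pass we'd have to write (nothing like it exists today), while the
other passes already exist in mesa:

/* Scalarize i/o only around the cross-stage optimizations, then merge
 * it back into vectors for the vec4 pp backend. */
nir_lower_io_to_scalar_early(vs, nir_var_shader_out);
nir_lower_io_to_scalar_early(fs, nir_var_shader_in);

nir_remove_unused_varyings(vs, fs);
nir_compact_varyings(vs, fs, true /* default_to_smooth_interp */);

/* hypothetical: rebuild vector i/o for pp */
lima_nir_vectorize_io(fs, nir_var_shader_in);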

> If no one
> else has commit rights, I can commit it with the minor change
> suggested by Marek.
Please do it.

>
> >
> >>
> >> 3) I noticed that gpir doesn't use ra_allocate() and the
> >> register_allocate infrastructure from mesa, while pp does. Is there a
> >> reason for that, or should gpir move to also use mesa's
> >> register_allocate later?
> >>
> > It's OK for gpir reg alloc to be implemented like this. It's called linear
> > scan reg alloc, as I heard from Connor. It's simpler and faster than
> > graph coloring based reg alloc. Besides, mesa's reg alloc is for HW
> > that has vec regs, which introduces the vec4/vec3/vec2/vec1 conflict
> > problem. gp is just scalar HW, so no need to use mesa's ra. BTW, there
> > are two ra passes in gpir: the first is for value regs, the second is
> > for real regs. I haven't implemented spilling for the real reg alloc,
> > but it's not triggered yet.
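To expand a bit on what linear scan means here, this is the textbook
idea in a nutshell (a sketch, not the actual gpir code):

#include <stdbool.h>
#include <string.h>

#define NUM_REGS 11   /* value registers available in gp */

typedef struct {
   int start, end;    /* live range, in instruction order */
   int reg;           /* assigned register, or -1 */
   bool active;       /* still holding its register */
} interval;

/* Walk intervals sorted by start point: one pass, no graph. */
static bool
linear_scan(interval *ivals, int n)
{
   bool in_use[NUM_REGS];
   memset(in_use, 0, sizeof(in_use));

   for (int i = 0; i < n; i++) {
      /* expire intervals that end before this one starts */
      for (int j = 0; j < i; j++) {
         if (ivals[j].active && ivals[j].end <= ivals[i].start) {
            in_use[ivals[j].reg] = false;
            ivals[j].active = false;
         }
      }
      /* take the first free register */
      ivals[i].reg = -1;
      for (int r = 0; r < NUM_REGS; r++) {
         if (!in_use[r]) {
            in_use[r] = true;
            ivals[i].reg = r;
            ivals[i].active = true;
            break;
         }
      }
      if (ivals[i].reg < 0)
         return false;  /* this is where spilling would kick in */
   }
   return true;
}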
>
> Well, it's optimal for the case where there's only one basic block,
> but not for multiple basic blocks, so you still need to use mesa's ra
> when gpir supports more than one basic block.
>
> I should mention that I've been working on rewriting the scheduler. I
> noticed (and I think yuq did too) that even if the shader uses at most
> 11 value registers, you still might need to spill some values to
> actual registers in order to avoid an excessive number of mov's, and
> only the final scheduler really has enough information to make a good
> choice for what to spill. So with the new scheduler implementation, we
> don't do the linear scan regalloc beforehand. Instead, we do
> more-or-less the same thing inside the scheduler. We keep track of the
> live physical registers and live value registers (i.e. the
> ready/partially ready nodes) before the current instruction, and since
> we don't have the linear scan regalloc, we have to check if placing a
> node would increase the value register pressure to more than 11, and
> try to spill a live value register to a physical register or fail to
> place it. Other than that, the core algorithm is pretty similar. With
> this, we only have to ensure that we never have a register pressure of
> more than 11+64 in order to guarantee that scheduling succeeds,
> although physical registers now count against that limit. After the
> new scheduler and support for multiple basic blocks, the backend would
> consist of these steps:
>
> 1. Do a pre-RA register pressure minimizing schedule pass, just like
> now. Although we'd want to count physical registers too.
> 2. Do global register allocation using mesa's RA. There are two
> register classes, one for physical registers that contains 64
> registers, and one for value registers that contains 64+11 registers,
> where the first 64 conflict with the physical registers and the last
> 11 don't. Register loads/stores are treated as mov's for the purposes
> of register allocation. Register loads/stores get assigned their
> actual physical register by RA, but for value registers, we just note
> what register RA picked.
I'm not clear on this step. Why two reg classes? Is the first class for
regs already in the code? And for the 64+11 class, is each value
associated with an actual reg num, or will that be determined by the
scheduler?
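To check my understanding, is the setup something like this with
mesa's ra API (mem_ctx and the variable names are just placeholders)?

/* 64 physical regs (0..63) plus 64+11 value regs (64..138); the
 * first 64 value regs alias the matching physical reg. */
struct ra_regs *regs = ra_alloc_reg_set(mem_ctx, 64 + 64 + 11, true);

unsigned phys_class  = ra_alloc_reg_class(regs);
unsigned value_class = ra_alloc_reg_class(regs);

for (unsigned i = 0; i < 64; i++) {
   ra_class_add_reg(regs, phys_class, i);
   ra_class_add_reg(regs, value_class, 64 + i);
   ra_add_reg_conflict(regs, i, 64 + i);  /* aliases conflict */
}
for (unsigned i = 0; i < 11; i++)
   ra_class_add_reg(regs, value_class, 128 + i);

ra_set_finalize(regs, NULL);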

> 3. Similar to before, we insert fake dependencies, but now they're
> integrated with physical register dependencies. For example, say
> there's a store to register 42, followed by a node that uses a value
> that RA assigned to register 42. We'd add a read-after-write
> dependency as if the node reads register 42, even if the value doesn't
> actually get spilled to register 42. By doing this, we guarantee that
> there are no more than 64+11 values and physical registers live at the
> same time during the scheduler.
> 4. Run the post-RA scheduler, which actually decides which nodes get
> spilled to registers.
>
> I've been implementing the scheduler part with the kmscube shader as
> my test case, although I haven't got it working yet (it still asserts
> somewhere). Hopefully the result will have an instruction count close
> to the blob's.
Looking forward to the implementation and results.
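Just to check I follow the pressure-check part, is it roughly like
this inside the scheduler? All names here are my guesses at the shape
of the new code:

/* Before committing a node to the current instruction, keep live
 * value regs within the 11 available by demoting one to a physical
 * register. */
static bool
sched_try_place(sched_ctx *ctx, gpir_node *node)
{
   int pressure = ctx->live_value_regs + node_new_value_regs(node);

   while (pressure > 11) {
      gpir_node *victim = sched_pick_spill_victim(ctx);
      if (!victim)
         return false;             /* can't place this node yet */

      /* turn the value into a physical reg store + later load */
      sched_spill_to_physreg(ctx, victim);
      pressure--;
   }

   sched_place(ctx, node);
   return true;
}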

Regards,
Qiang

