[Mesa-dev] Ideas on loop unrolling, loop examples, and my GSoC-blog

Mon Jun 1 03:15:15 PDT 2015

Hi,

On 05/29/2015 07:04 PM, Connor Abbott wrote:
> On Fri, May 29, 2015 at 6:23 AM, Eero Tamminen
> <eero.t.tamminen at intel.com> wrote:
>> On 05/28/2015 10:19 PM, Thomas Helland wrote:
>>>
>>> One more thing;
>>> Is there a limit where the loop body gets so large that we
>>> want to decide that "gah, this sucks, no point in unrolling this"?
>>> I imagine as the loops get large there will be a case of
>>> diminishing returns from the unrolling?
>>
>>
>> I think only backend can say something to that.   You e.g. give backend
>> unrolled and non-unrolled versions and backend decides which one is better
>> (after applying additional optimizations)...
>
> I don't really think it's going to be too good of an idea to do that,
> mainly because it means you'd be duplicating a lot of work for the
> normal vs. unrolled versions, and there might be some really large
> loops where generating the unrolled version is going to kill your CPU
> -- doing any amount of work that's proportional to the number of times
> the loop runs, without any limit, seems like a recipe for disaster.

Sure, it should have sanity bounds, but my point was more that the
decision depends on many factors, and even the backend doesn't
necessarily know all of them up front, because some depend on the
passes the backend itself runs.

> In GLSL IR, we've been fairly lax about figuring out when unrolling is
> helpful and unhelpful -- we just have a simple "node count" plus a
> threshold (as well as a few other heuristics). In NIR, we could
> similarly have an instruction count plus a threshold and port over the
> heuristics to whatever extent possible.  We do have some logic for
> figuring out if an array access is constant after unrolling, and it
> seems like we'd want to keep that around. The next level of
> sophistication, I guess, is to give the backend a callback to give an
> estimation of the execution cost of certain operations. For example,
> unless a negate/absolute value instruction is used by something that
> can't handle the modifier, then on i965 the cost of those instructions
> would be 0. I think that would get us most of the way there to
> something accurate, without needing to do an undue amount of work (in
> terms of CPU time and man-effort).
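
To make the "instruction count plus threshold, with a backend cost
callback" idea concrete, here is a minimal sketch. All names in it
(enum op, op_cost_fn, example_backend_cost, should_unroll) are
hypothetical and are not NIR's or i965's actual API, and the cost
numbers are made up. It also ignores the caveat above that a
negate/abs is only free when its user can take the source modifier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical opcode set for illustration; not NIR's real opcodes. */
enum op { OP_ADD, OP_MUL, OP_NEG, OP_ABS, OP_TEX };

/* Hypothetical backend callback: estimated cost of one operation. */
typedef unsigned (*op_cost_fn)(enum op op);

/* Example backend where negate/abs are assumed to fold into source
 * modifiers (cost 0). The other costs are invented placeholders. */
static unsigned
example_backend_cost(enum op op)
{
   switch (op) {
   case OP_NEG:
   case OP_ABS:
      return 0;
   case OP_TEX:
      return 8;
   default:
      return 1;
   }
}

/* Sum the callback's cost estimate over every op in the loop body. */
static unsigned
loop_body_cost(const enum op *body, size_t n, op_cost_fn cost)
{
   assert(cost != NULL);
   unsigned total = 0;
   for (size_t i = 0; i < n; i++)
      total += cost(body[i]);
   return total;
}

/* Unroll only when the trip count is known (non-zero here) and the
 * estimated cost of the fully unrolled loop stays within threshold. */
static bool
should_unroll(const enum op *body, size_t n, unsigned trip_count,
              op_cost_fn cost, unsigned threshold)
{
   if (trip_count == 0)
      return false;
   unsigned long long unrolled =
      (unsigned long long)loop_body_cost(body, n, cost) * trip_count;
   return unrolled <= threshold;
}
```

The frontend only ever sees the callback, so the work stays
proportional to the loop body size, not to the trip count.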

Some factors affecting whether to unroll or not:
- which version can turn pull constants into push constants
- which version allows using a higher SIMD mode
- which version allows better latency compensation / scheduling
   for memory accesses (e.g. texture fetches)
- instruction count
- instruction cache size
- cycles (when they differ between instructions)

How much of this information does the frontend have, or can it request
from the backend, without the backend needing to actually compile both
versions?
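
As a rough illustration of what could be answered without compiling
anything: of the factors above, instruction count and instruction
cache size are the ones a backend could plausibly report as static
numbers. The sketch below assumes a hypothetical backend_limits struct
(not an existing Mesa interface) and only models those two; push/pull,
SIMD mode and scheduling depend on later passes and are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical static facts a backend might report up front. */
struct backend_limits {
   unsigned icache_instrs;       /* instructions fitting in the icache */
   unsigned max_unrolled_instrs; /* sanity bound on unrolled loop size */
};

/* True if the fully unrolled loop fits both the instruction cache
 * and the backend's sanity bound. */
static bool
unrolled_size_ok(const struct backend_limits *lim,
                 unsigned body_instrs, unsigned trip_count)
{
   assert(lim != NULL);
   unsigned long long unrolled =
      (unsigned long long)body_instrs * trip_count;
   return unrolled <= lim->icache_instrs &&
          unrolled <= lim->max_unrolled_instrs;
}
```

Anything beyond this kind of size check probably still needs the
backend to look at the actual code.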

	- Eero

(With an offline compiler, compilation CPU usage would be less of an issue.)