[Mesa-dev] [PATCH] r300/compiler: Prevent the fragmentation of TEX blocks in the pair scheduler.

Thu Jun 3 04:51:29 PDT 2010

On Thu, Jun 3, 2010 at 7:45 AM, Tom Stellard <tstellar at gmail.com> wrote:
> On Wed, Jun 02, 2010 at 11:39:57AM +0200, Nicolai Haehnle wrote:
>> On Wed, Jun 2, 2010 at 6:53 AM, Tom Stellard <tstellar at gmail.com> wrote:
>> > On Tue, Jun 01, 2010 at 12:00:16PM +0200, Nicolai Haehnle wrote:
>> >> On Tue, Jun 1, 2010 at 7:41 AM, tstellar <tstellar at gmail.com> wrote:
>> >> > From: Tom Stellard <tstellar at gmail.com>
>> >> >
>> >> > This fixes bug:
>> >> >
>> >> > https://bugs.freedesktop.org/show_bug.cgi?id=25109
>> >>
>> >> Is this really correct? If I understand your patch correctly, what it
>> >> does is that TEX instructions that depend on earlier TEX instructions
>> >> will be emitted in the same TEX group on R300. So if you have
>> >> dependent texture reads like this:
>> >>
>> >>  MOV r0, something;
>> >>  TEX r1, r0, ...;
>> >>  TEX r2, r1, ...;
>> >>
>> >> You will now emit both TEX in the same TEX indirection block, which I
>> >> thought was wrong, because the second TEX instruction will actually
>> >> use the contents of r1 *before* the first TEX instruction as
>> >> coordinates. At least that's how I thought the TEX hardware works:
>> >>
>> >> 1) Fetch all coordinates for all TEX instructions in an indirection block
>> >> 2) Execute all TEX instructions in parallel
>> >> 3) Store all results in the respective destination registers
>> >>
>> >> If my understanding is correct, then I believe your change miscompiles
>> >> the above shader fragment. Can you clarify this?
>> >>
>> >
>> > It looks like I am the one who misunderstood how TEX instructions work.
>> > You are right, the patch does miscompile your example.  The shader
>> > I was having problems with looked like this:
>> >
>> > 10: TEX temp[13].xyz, temp[12].xy__, 2D[0];
>> > 11: TEX temp[12].xyz, temp[11].xy__, 2D[0];
>> > 12: TEX temp[11].xyz, temp[10].xy__, 2D[0];
>> > 13: TEX temp[10].xyz, temp[9].xy__, 2D[0];
>> > 14: TEX temp[9].xyz, temp[8].xy__, 2D[0];
>> > 15: TEX temp[8].xyz, temp[7].xy__, 2D[0];
>> > 16: TEX temp[7].xyz, input[4].xy__, 2D[0];
>> > 17: TEX temp[6].xyz, temp[6].xy__, 2D[0];
>> > 18: TEX temp[5].xyz, temp[5].xy__, 2D[0];
>> > 19: TEX temp[4].xyz, temp[4].xy__, 2D[0];
>> > 20: TEX temp[3].xyz, temp[3].xy__, 2D[0];
>> >
>> > I think the bug is actually in the pair scheduler's dataflow analysis.
>> > Currently, the pair scheduler marks #10 as a dependency of #11, which
>> > would be OK if these were ALU instructions, but it shouldn't be a
>> > dependency for TEX instructions.
>>
>> Yes, I think what the dataflow analysis sees is that #10 reads
>> temp[12] while #11 overwrites temp[12], so its belief is that #10 must
>> be executed before #11. That's indeed a curious problem.
>>
>> So is this done after register allocation? Because it seems to me like
>> the register dealiasing should rename one of the occurrences of
>> temp[12] in this example...
>>
>
> Currently, the pair scheduler runs before the register allocation pass.  I have
> attached the compiler output for this shader to the original bug if you
> want to take a look at it:
> https://bugs.freedesktop.org/show_bug.cgi?id=25109

Thanks. I think I simply misremembered what's happening there. At some
point, the compiler had a pass that would dealias register names, so
that one of the temp[12]s would have been changed to something else,
but clearly that's no longer the case.
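
For illustration, here is a minimal sketch of what such a dealiasing
pass could look like on straight-line code. This is not the actual
r300 compiler code: the (dst, src) tuple representation, the
dealias() helper and the "t0", "t1", ... names are all made up for
the example. Every write gets a fresh name and later reads are
redirected to the newest name, so the write-after-read conflict on
temp[12] between #10 and #11 goes away while real dependencies stay.

```python
def dealias(instrs):
    """Give every write a fresh register name (straight-line code only).

    instrs is a list of (dst, src) register-name pairs. This is a
    hypothetical representation, not the r300 compiler's own IR.
    """
    current = {}   # original name -> fresh name holding its latest value
    counter = 0
    out = []
    for dst, src in instrs:
        # Read whatever name currently holds the value of src.
        src = current.get(src, src)
        # Assumed naming scheme; a real pass must avoid name clashes.
        fresh = f"t{counter}"
        counter += 1
        current[dst] = fresh
        out.append((fresh, src))
    return out


# Instructions #10 to #12 from the shader in the bug report.
shader = [
    ("temp[13]", "temp[12]"),  # 10
    ("temp[12]", "temp[11]"),  # 11
    ("temp[11]", "temp[10]"),  # 12
]
renamed = dealias(shader)
for dst, src in renamed:
    print(f"TEX {dst}, {src}")
```

After renaming, no instruction overwrites a register that an earlier
one reads, so the scheduler would have no reason to order #10 before
#11. A truly dependent pair such as TEX r1, r0 followed by
TEX r2, r1 still keeps its dependency, because the second read is
redirected to the fresh name of r1.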

I'm unsure what the best solution is here. Reintroducing the
dealiasing would certainly work. It would be interesting to know the
exact semantics of the hardware unit. Is it really: First read all
registers, then tex lookup, then write all results? In other words, is
there a guarantee that all coordinates will be read before the first
result is written? In that case, one could separate the "commit" phase
of texture units: first commit all the coordinate reads in one batch,
allowing new TEX instructions to become ready. Then finalize the block
and commit all the writes.
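
To make the question concrete, here is a small model of the two
possible semantics. It is only a sketch under the assumption stated
above (all coordinates read before any result is written), which is
unconfirmed for the real hardware. The register names, the
run_serial()/run_parallel() helpers and the stand-in tex() function
are invented for the example.

```python
def run_serial(regs, block, tex):
    """Execute TEX instructions one after another."""
    regs = dict(regs)
    for dst, src in block:
        regs[dst] = tex(regs[src])
    return regs


def run_parallel(regs, block, tex):
    """Assumed block semantics: read all coordinates, then write all results."""
    regs = dict(regs)
    coords = [regs[src] for _, src in block]      # fetch every coordinate first
    for (dst, _), coord in zip(block, coords):    # then store every result
        regs[dst] = tex(coord)
    return regs


def tex(coord):
    # Stand-in for a texture lookup, purely illustrative.
    return coord * 10


regs = {"r0": 1, "r1": 2, "r2": 3, "r11": 4, "r12": 5, "r13": 6}

# Like #10/#11: the second TEX overwrites a register the first one reads.
war_block = [("r13", "r12"), ("r12", "r11")]
# Dependent read: the second TEX uses the result of the first.
raw_block = [("r1", "r0"), ("r2", "r1")]

print(run_serial(regs, war_block, tex) == run_parallel(regs, war_block, tex))
print(run_serial(regs, raw_block, tex)["r2"], run_parallel(regs, raw_block, tex)["r2"])
```

In this model the #10/#11 pattern gives the same result either way,
so both instructions could share a block, while the dependent pair
gives different results and has to be split across blocks.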

I've never done experiments to figure out what the exact semantics of
the TEX blocks are, and I haven't seen any documentation that
clarifies it. Back when I worked on the code, I always took the safe
route, i.e. assuming that both parallel and serial execution of TEX
instructions are possible if the GPU feels like it, which reduced errors
but unfortunately means that I'm none the wiser to this day.

cu,
Nicolai