[Mesa-dev] [PATCH] r300/compiler: Prevent the fragmentation of TEX blocks in the pair scheduler.

Thu Jun 3 06:41:58 PDT 2010

On 03.06.2010 13:51, Nicolai Haehnle wrote:
> On Thu, Jun 3, 2010 at 7:45 AM, Tom Stellard <tstellar at gmail.com> wrote:
>> On Wed, Jun 02, 2010 at 11:39:57AM +0200, Nicolai Haehnle wrote:
>>> On Wed, Jun 2, 2010 at 6:53 AM, Tom Stellard <tstellar at gmail.com> wrote:
>>>> On Tue, Jun 01, 2010 at 12:00:16PM +0200, Nicolai Haehnle wrote:
>>>>> On Tue, Jun 1, 2010 at 7:41 AM, tstellar <tstellar at gmail.com> wrote:
>>>>>> From: Tom Stellard <tstellar at gmail.com>
>>>>>>
>>>>>> This fixes bug:
>>>>>>
>>>>>> https://bugs.freedesktop.org/show_bug.cgi?id=25109
>>>>> Is this really correct? If I understand your patch correctly, what it
>>>>> does is that TEX instructions that depend on earlier TEX instructions
>>>>> will be emitted in the same TEX group on R300. So if you have
>>>>> dependent texture reads like this:
>>>>>
>>>>>  MOV r0, something;
>>>>>  TEX r1, r0, ...;
>>>>>  TEX r2, r1, ...;
>>>>>
>>>>> You will now emit both TEX in the same TEX indirection block, which I
>>>>> thought was wrong, because the second TEX instruction will actually
>>>>> use the contents of r1 *before* the first TEX instruction as
>>>>> coordinates. At least that's how I thought the TEX hardware works:
>>>>>
>>>>> 1) Fetch all coordinates for all TEX instructions in an indirection block
>>>>> 2) Execute all TEX instructions in parallel
>>>>> 3) Store all results in the respective destination registers
>>>>>
>>>>> If my understanding is correct, then I believe your change miscompiles
>>>>> the above shader fragment. Can you clarify this?
>>>>>
>>>> It looks like I am the one who misunderstood how TEX instructions work.
>>>> You are right, the patch does miscompile your example.  The shader
>>>> I was having problems with looked like this:
>>>>
>>>> 10: TEX temp[13].xyz, temp[12].xy__, 2D[0];
>>>> 11: TEX temp[12].xyz, temp[11].xy__, 2D[0];
>>>> 12: TEX temp[11].xyz, temp[10].xy__, 2D[0];
>>>> 13: TEX temp[10].xyz, temp[9].xy__, 2D[0];
>>>> 14: TEX temp[9].xyz, temp[8].xy__, 2D[0];
>>>> 15: TEX temp[8].xyz, temp[7].xy__, 2D[0];
>>>> 16: TEX temp[7].xyz, input[4].xy__, 2D[0];
>>>> 17: TEX temp[6].xyz, temp[6].xy__, 2D[0];
>>>> 18: TEX temp[5].xyz, temp[5].xy__, 2D[0];
>>>> 19: TEX temp[4].xyz, temp[4].xy__, 2D[0];
>>>> 20: TEX temp[3].xyz, temp[3].xy__, 2D[0];
>>>>
>>>> I think the bug is actually in the pair scheduler's dataflow analysis.
>>>> Currently, the pair scheduler marks #10 as a dependency of #11, which
>>>> would be OK if these were ALU instructions, but it shouldn't be a
>>>> dependency for TEX instructions.
>>> Yes, I think what the dataflow analysis sees is that #10 reads
>>> temp[12] while #11 overwrites temp[12], so its belief is that #10 must
>>> be executed before #11. That's indeed a curious problem.
>>>
>>> So is this done after register allocation? Because it seems to me like
>>> the register dealiasing should rename one of the occurences of
>>> temp[12] in this example...
>>>
>> Currently, the pair scheduler runs before the register allocation pass.  I have
>> attached the compiler output for this shader to the original bug if you
>> want to take a look at it:
>> https://bugs.freedesktop.org/show_bug.cgi?id=25109
> 
> Thanks. I think I simply misremembered what's happening there. At some
> point, the compiler had a pass that would dealias register names, so
> that one of the temp[12]s would have been changed to something else,
> but clearly that's no longer the case.
> 
> I'm unsure what the best solution is here. Reintroducing the
> dealiasing would certainly work. It would be interesting to know the
> exact semantics of the hardware unit. Is it really: First read all
> registers, then tex lookup, then write all results?
No you can't rely on that. I know because I made that error for the r300
planar xv support - worked though but was fixed by Alex :-).
(commit 9091b3f5f13dbea83ffd89679dac600e9f280bb2).

> In other words, is
> there a guarantee that all coordinates will be read before the first
> result is written? In that case, one could separate the "commit" phase
> of texture units: first commit all the coordinate reads in one batch,
> allowing new TEX instructions to become ready. Then finalize the block
> and commit all the writes.
> 
> I've never done experiments to figure out what the exact semantics of
> the TEX blocks are, and I haven't seen any documentation that
> clarifies it. Back when I worked on the code I've always gone the safe
> route, i.e. assuming that both parallel and serial execution of TEX
> instruction is possible if the GPU feels like it, which reduced errors
> but unfortunately means that I'm none the wiser to this day.

I guess it could very well be some mix of parallel and serial execution
in hardware. Maybe (due to hardware pipelining) it is even possible that
using an overwritten reg can still be used as source for, say, the next
3 tex instructions, but that is highly speculative, might be asic
dependent and not even be true :-). (Interestingly, in the case it won't
work, this would imply you could have more than 4 dependent tex lookups,
as long as the tex result is consumed directly by another tex
instruction without any arithmetic in-between, by bundling them (far
apart?) in the same phase...). In any case, without some more
documentation, playing safe seems to be the only choice.

Roland