[Mesa-dev] [PATCH] r300/compiler: Prevent the fragmentation of TEX blocks in the pair scheduler.

Thu Jun 3 07:13:46 PDT 2010

On Thu, Jun 3, 2010 at 9:41 AM, Roland Scheidegger <sroland at vmware.com> wrote:
> On 03.06.2010 13:51, Nicolai Haehnle wrote:
>> On Thu, Jun 3, 2010 at 7:45 AM, Tom Stellard <tstellar at gmail.com> wrote:
>>> On Wed, Jun 02, 2010 at 11:39:57AM +0200, Nicolai Haehnle wrote:
>>>> On Wed, Jun 2, 2010 at 6:53 AM, Tom Stellard <tstellar at gmail.com> wrote:
>>>>> On Tue, Jun 01, 2010 at 12:00:16PM +0200, Nicolai Haehnle wrote:
>>>>>> On Tue, Jun 1, 2010 at 7:41 AM, tstellar <tstellar at gmail.com> wrote:
>>>>>>> From: Tom Stellard <tstellar at gmail.com>
>>>>>>>
>>>>>>> This fixes bug:
>>>>>>>
>>>>>>> https://bugs.freedesktop.org/show_bug.cgi?id=25109
>>>>>> Is this really correct? If I understand your patch correctly, what it
>>>>>> does is that TEX instructions that depend on earlier TEX instructions
>>>>>> will be emitted in the same TEX group on R300. So if you have
>>>>>> dependent texture reads like this:
>>>>>>
>>>>>>  MOV r0, something;
>>>>>>  TEX r1, r0, ...;
>>>>>>  TEX r2, r1, ...;
>>>>>>
>>>>>> You will now emit both TEX in the same TEX indirection block, which I
>>>>>> thought was wrong, because the second TEX instruction will actually
>>>>>> use the contents of r1 *before* the first TEX instruction as
>>>>>> coordinates. At least that's how I thought the TEX hardware works:
>>>>>>
>>>>>> 1) Fetch all coordinates for all TEX instructions in an indirection block
>>>>>> 2) Execute all TEX instructions in parallel
>>>>>> 3) Store all results in the respective destination registers
>>>>>>
>>>>>> If my understanding is correct, then I believe your change miscompiles
>>>>>> the above shader fragment. Can you clarify this?
>>>>>>
>>>>> It looks like I am the one who misunderstood how TEX instructions work.
>>>>> You are right, the patch does miscompile your example.  The shader
>>>>> I was having problems with looked like this:
>>>>>
>>>>> 10: TEX temp[13].xyz, temp[12].xy__, 2D[0];
>>>>> 11: TEX temp[12].xyz, temp[11].xy__, 2D[0];
>>>>> 12: TEX temp[11].xyz, temp[10].xy__, 2D[0];
>>>>> 13: TEX temp[10].xyz, temp[9].xy__, 2D[0];
>>>>> 14: TEX temp[9].xyz, temp[8].xy__, 2D[0];
>>>>> 15: TEX temp[8].xyz, temp[7].xy__, 2D[0];
>>>>> 16: TEX temp[7].xyz, input[4].xy__, 2D[0];
>>>>> 17: TEX temp[6].xyz, temp[6].xy__, 2D[0];
>>>>> 18: TEX temp[5].xyz, temp[5].xy__, 2D[0];
>>>>> 19: TEX temp[4].xyz, temp[4].xy__, 2D[0];
>>>>> 20: TEX temp[3].xyz, temp[3].xy__, 2D[0];
>>>>>
>>>>> I think the bug is actually in the pair scheduler's dataflow analysis.
>>>>> Currently, the pair scheduler marks #10 as a dependency of #11, which
>>>>> would be OK if these were ALU instructions, but it shouldn't be a
>>>>> dependency for TEX instructions.
>>>> Yes, I think what the dataflow analysis sees is that #10 reads
>>>> temp[12] while #11 overwrites temp[12], so its belief is that #10 must
>>>> be executed before #11. That's indeed a curious problem.
>>>>
>>>> So is this done after register allocation? Because it seems to me like
>>>> the register dealiasing should rename one of the occurences of
>>>> temp[12] in this example...
>>>>
>>> Currently, the pair scheduler runs before the register allocation pass.  I have
>>> attached the compiler output for this shader to the original bug if you
>>> want to take a look at it:
>>> https://bugs.freedesktop.org/show_bug.cgi?id=25109
>>
>> Thanks. I think I simply misremembered what's happening there. At some
>> point, the compiler had a pass that would dealias register names, so
>> that one of the temp[12]s would have been changed to something else,
>> but clearly that's no longer the case.
>>
>> I'm unsure what the best solution is here. Reintroducing the
>> dealiasing would certainly work. It would be interesting to know the
>> exact semantics of the hardware unit. Is it really: First read all
>> registers, then tex lookup, then write all results?
> No you can't rely on that. I know because I made that error for the r300
> planar xv support - worked though but was fixed by Alex :-).
> (commit 9091b3f5f13dbea83ffd89679dac600e9f280bb2).
>
>> In other words, is
>> there a guarantee that all coordinates will be read before the first
>> result is written? In that case, one could separate the "commit" phase
>> of texture units: first commit all the coordinate reads in one batch,
>> allowing new TEX instructions to become ready. Then finalize the block
>> and commit all the writes.
>>
>> I've never done experiments to figure out what the exact semantics of
>> the TEX blocks are, and I haven't seen any documentation that
>> clarifies it. Back when I worked on the code I've always gone the safe
>> route, i.e. assuming that both parallel and serial execution of TEX
>> instruction is possible if the GPU feels like it, which reduced errors
>> but unfortunately means that I'm none the wiser to this day.
>
> I guess it could very well be some mix of parallel and serial execution
> in hardware. Maybe (due to hardware pipelining) it is even possible that
> using an overwritten reg can still be used as source for, say, the next
> 3 tex instructions, but that is highly speculative, might be asic
> dependent and not even be true :-). (Interestingly, in the case it won't
> work, this would imply you could have more than 4 dependent tex lookups,
> as long as the tex result is consumed directly by another tex
> instruction without any arithmetic in-between, by bundling them (far
> apart?) in the same phase...). In any case, without some more
> documentation, playing safe seems to be the only choice.

I'll ask the hw folks today.

Alex