[Mesa-dev] [PATCH] r300/compiler: Prevent the fragmentation of TEX blocks in the pair scheduler.

Thu Jun 3 12:01:41 PDT 2010

On Thu, Jun 3, 2010 at 10:13 AM, Alex Deucher <alexdeucher at gmail.com> wrote:
> On Thu, Jun 3, 2010 at 9:41 AM, Roland Scheidegger <sroland at vmware.com> wrote:
>> On 03.06.2010 13:51, Nicolai Haehnle wrote:
>>> On Thu, Jun 3, 2010 at 7:45 AM, Tom Stellard <tstellar at gmail.com> wrote:
>>>> On Wed, Jun 02, 2010 at 11:39:57AM +0200, Nicolai Haehnle wrote:
>>>>> On Wed, Jun 2, 2010 at 6:53 AM, Tom Stellard <tstellar at gmail.com> wrote:
>>>>>> On Tue, Jun 01, 2010 at 12:00:16PM +0200, Nicolai Haehnle wrote:
>>>>>>> On Tue, Jun 1, 2010 at 7:41 AM, tstellar <tstellar at gmail.com> wrote:
>>>>>>>> From: Tom Stellard <tstellar at gmail.com>
>>>>>>>>
>>>>>>>> This fixes bug:
>>>>>>>>
>>>>>>>> https://bugs.freedesktop.org/show_bug.cgi?id=25109
>>>>>>> Is this really correct? If I understand your patch correctly, what it
>>>>>>> does is that TEX instructions that depend on earlier TEX instructions
>>>>>>> will be emitted in the same TEX group on R300. So if you have
>>>>>>> dependent texture reads like this:
>>>>>>>
>>>>>>>  MOV r0, something;
>>>>>>>  TEX r1, r0, ...;
>>>>>>>  TEX r2, r1, ...;
>>>>>>>
>>>>>>> You will now emit both TEX in the same TEX indirection block, which I
>>>>>>> thought was wrong, because the second TEX instruction will actually
>>>>>>> use the contents of r1 *before* the first TEX instruction as
>>>>>>> coordinates. At least that's how I thought the TEX hardware works:
>>>>>>>
>>>>>>> 1) Fetch all coordinates for all TEX instructions in an indirection block
>>>>>>> 2) Execute all TEX instructions in parallel
>>>>>>> 3) Store all results in the respective destination registers
>>>>>>>
>>>>>>> If my understanding is correct, then I believe your change miscompiles
>>>>>>> the above shader fragment. Can you clarify this?
>>>>>>>
>>>>>> It looks like I am the one who misunderstood how TEX instructions work.
>>>>>> You are right, the patch does miscompile your example.  The shader
>>>>>> I was having problems with looked like this:
>>>>>>
>>>>>> 10: TEX temp[13].xyz, temp[12].xy__, 2D[0];
>>>>>> 11: TEX temp[12].xyz, temp[11].xy__, 2D[0];
>>>>>> 12: TEX temp[11].xyz, temp[10].xy__, 2D[0];
>>>>>> 13: TEX temp[10].xyz, temp[9].xy__, 2D[0];
>>>>>> 14: TEX temp[9].xyz, temp[8].xy__, 2D[0];
>>>>>> 15: TEX temp[8].xyz, temp[7].xy__, 2D[0];
>>>>>> 16: TEX temp[7].xyz, input[4].xy__, 2D[0];
>>>>>> 17: TEX temp[6].xyz, temp[6].xy__, 2D[0];
>>>>>> 18: TEX temp[5].xyz, temp[5].xy__, 2D[0];
>>>>>> 19: TEX temp[4].xyz, temp[4].xy__, 2D[0];
>>>>>> 20: TEX temp[3].xyz, temp[3].xy__, 2D[0];
>>>>>>
>>>>>> I think the bug is actually in the pair scheduler's dataflow analysis.
>>>>>> Currently, the pair scheduler marks #10 as a dependency of #11, which
>>>>>> would be OK if these were ALU instructions, but it shouldn't be a
>>>>>> dependency for TEX instructions.
>>>>> Yes, I think what the dataflow analysis sees is that #10 reads
>>>>> temp[12] while #11 overwrites temp[12], so its belief is that #10 must
>>>>> be executed before #11. That's indeed a curious problem.
>>>>>
>>>>> So is this done after register allocation? Because it seems to me like
>>>>> the register dealiasing should rename one of the occurences of
>>>>> temp[12] in this example...
>>>>>
>>>> Currently, the pair scheduler runs before the register allocation pass.  I have
>>>> attached the compiler output for this shader to the original bug if you
>>>> want to take a look at it:
>>>> https://bugs.freedesktop.org/show_bug.cgi?id=25109
>>>
>>> Thanks. I think I simply misremembered what's happening there. At some
>>> point, the compiler had a pass that would dealias register names, so
>>> that one of the temp[12]s would have been changed to something else,
>>> but clearly that's no longer the case.
>>>
>>> I'm unsure what the best solution is here. Reintroducing the
>>> dealiasing would certainly work. It would be interesting to know the
>>> exact semantics of the hardware unit. Is it really: First read all
>>> registers, then tex lookup, then write all results?
>> No you can't rely on that. I know because I made that error for the r300
>> planar xv support - worked though but was fixed by Alex :-).
>> (commit 9091b3f5f13dbea83ffd89679dac600e9f280bb2).
>>
>>> In other words, is
>>> there a guarantee that all coordinates will be read before the first
>>> result is written? In that case, one could separate the "commit" phase
>>> of texture units: first commit all the coordinate reads in one batch,
>>> allowing new TEX instructions to become ready. Then finalize the block
>>> and commit all the writes.
>>>
>>> I've never done experiments to figure out what the exact semantics of
>>> the TEX blocks are, and I haven't seen any documentation that
>>> clarifies it. Back when I worked on the code I've always gone the safe
>>> route, i.e. assuming that both parallel and serial execution of TEX
>>> instruction is possible if the GPU feels like it, which reduced errors
>>> but unfortunately means that I'm none the wiser to this day.
>>
>> I guess it could very well be some mix of parallel and serial execution
>> in hardware. Maybe (due to hardware pipelining) it is even possible that
>> using an overwritten reg can still be used as source for, say, the next
>> 3 tex instructions, but that is highly speculative, might be asic
>> dependent and not even be true :-). (Interestingly, in the case it won't
>> work, this would imply you could have more than 4 dependent tex lookups,
>> as long as the tex result is consumed directly by another tex
>> instruction without any arithmetic in-between, by bundling them (far
>> apart?) in the same phase...). In any case, without some more
>> documentation, playing safe seems to be the only choice.
>
> I'll ask the hw folks today.

The word is:
Registers should be fully read before they are written so you can use
the same register for the result of the read, but you cannot do
dependant textures in one clause.

Alex