[Mesa-dev] [PATCH] radeonsi: Enable VGPR spilling for all shader types v3

Tue Jan 20 05:39:10 PST 2015

The problem with CPDMA (DMA_DATA and WRITE_DATA) is that the flushes
must be emitted in the correct order. First, partial flushes must be
done, so that the shaders are idle. Then you can use CPDMA to update
the binary. After that, the ICACHE should be invalidated. This is
similar to how si_clear_buffer and si_copy_buffer work.
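
To make the ordering concrete, here is a minimal sketch in plain C. It
is not radeonsi code: toy_cs, the EV_* events and every function name
are hypothetical stand-ins, and the "command stream" is just an array.
It only shows the flush -> CPDMA write -> ICACHE invalidate order, plus
a check that there is enough space before anything is emitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event kinds; these are NOT real radeonsi packet names. */
enum cs_event {
    EV_PARTIAL_FLUSH,  /* wait until in-flight shaders are idle */
    EV_CP_DMA_WRITE,   /* CP DMA updates the shader binary */
    EV_INV_ICACHE      /* invalidate the instruction cache */
};

#define CS_MAX_EVENTS 16

/* Toy command stream: a fixed-size list of recorded events. */
struct toy_cs {
    enum cs_event events[CS_MAX_EVENTS];
    size_t count;
};

/* Append one event; returns 0 on success, -1 if the stream is full. */
static int toy_cs_emit(struct toy_cs *cs, enum cs_event ev)
{
    if (cs->count >= CS_MAX_EVENTS)
        return -1;
    cs->events[cs->count++] = ev;
    return 0;
}

/* Emit the three steps in order, after checking that all three fit. */
static int toy_update_shader_binary(struct toy_cs *cs)
{
    assert(cs != NULL);
    if (CS_MAX_EVENTS - cs->count < 3)
        return -1;
    (void)toy_cs_emit(cs, EV_PARTIAL_FLUSH);
    (void)toy_cs_emit(cs, EV_CP_DMA_WRITE);
    (void)toy_cs_emit(cs, EV_INV_ICACHE);
    return 0;
}

/* Returns 1 if every write has a partial flush before it (since the
 * last invalidation) and an ICACHE invalidation after it, else 0. */
static int toy_cs_order_ok(const struct toy_cs *cs)
{
    int idle = 0;   /* shaders known to be idle */
    int stale = 0;  /* ICACHE may still hold the old code */

    for (size_t i = 0; i < cs->count; i++) {
        switch (cs->events[i]) {
        case EV_PARTIAL_FLUSH:
            idle = 1;
            break;
        case EV_CP_DMA_WRITE:
            if (!idle)
                return 0;
            stale = 1;
            break;
        case EV_INV_ICACHE:
            stale = 0;
            idle = 0;  /* shaders may start running again */
            break;
        }
    }
    return !stale;
}
```

A stream that contains only EV_CP_DMA_WRITE fails toy_cs_order_ok,
which is the case where shaders could still be running the old code.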

However, the biggest issue is that CPDMA doesn't map very well to our
state management mechanisms, so it must be handled separately (just
like si_clear_buffer). Doing the update on the CPU just seems a lot
easier.

The problem with mapping VRAM can be avoided by keeping a CPU copy of
the binary from the beginning. We would only need a CPU copy of those
shaders that use the scratch buffer. Then, you wouldn't have to read
VRAM at all, which would make it even simpler.
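
Roughly, the CPU-copy idea could look like the sketch below. This is
plain C with hypothetical names (toy_shader, toy_reloc,
toy_shader_update_scratch), not the actual radeonsi structures. A
malloc'ed array stands in for the shader bo, and each reloc is
simplified to a single dword that receives the scratch address.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reloc: index of one dword that holds the scratch
 * buffer address. Real relocs are more involved than this. */
struct toy_reloc {
    size_t dword_offset;
};

/* Hypothetical shader state; "bo" is a malloc'ed stand-in for VRAM. */
struct toy_shader {
    uint32_t *cpu_copy;             /* kept since shader creation */
    size_t num_dwords;
    const struct toy_reloc *relocs;
    size_t num_relocs;
    uint32_t *bo;                   /* current "GPU" buffer, or NULL */
    int dirty;                      /* shader state must be re-emitted */
};

/* Patch the relocs in the CPU copy, upload it to a new buffer, and
 * swap buffers. VRAM is never read. Returns 0 on success, -1 on a
 * bad reloc or allocation failure. */
static int toy_shader_update_scratch(struct toy_shader *s,
                                     uint32_t scratch_addr)
{
    size_t size = s->num_dwords * sizeof(uint32_t);
    uint32_t *new_bo;

    assert(s->cpu_copy != NULL);

    /* Validate all relocs before touching anything. */
    for (size_t i = 0; i < s->num_relocs; i++) {
        if (s->relocs[i].dword_offset >= s->num_dwords)
            return -1;
    }

    new_bo = malloc(size);
    if (!new_bo)
        return -1;

    /* Update the relocs on the CPU, in the CPU copy. */
    for (size_t i = 0; i < s->num_relocs; i++)
        s->cpu_copy[s->relocs[i].dword_offset] = scratch_addr;

    /* "Upload": write-only copy into the new buffer. */
    memcpy(new_bo, s->cpu_copy, size);

    /* free() stands in for dropping a reference. A real driver has
     * to keep the old buffer alive until the GPU is done with it. */
    free(s->bo);
    s->bo = new_bo;
    s->dirty = 1;
    return 0;
}
```

Since the old buffer is never modified, shaders that are still running
keep seeing consistent code, and no flush ordering is involved.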

Marek

On Tue, Jan 20, 2015 at 4:14 AM, Michel Dänzer <michel at daenzer.net> wrote:
> On 20.01.2015 00:07, Marek Olšák wrote:
>>
>> 3) Since the CP runs in parallel with the graphics engine, shaders can
>> be busy while WRITE_DATA is changing their code. (no good can come out
>> of that) Therefore, the partial flushes (= partial "WAIT_UNTIL") must
>> be emitted before the WRITE_DATA packet. The other thing is that the
>> code doesn't check whether there is enough space in the CS for the
>> WRITE_DATA packets. In order to simplify everything, I recommend this
>> solution:
>>
>> - Forget about partial flushes, forget about WRITE_DATA.
>> - If the shader relocs need updating, copy the current shader bo to a
>> new shader bo using the CPU (map the current shader bo with
>> read+unsynchronized flags).
>> - Update the relocs using the CPU in the new shader bo.
>> - Then replace the current shader bo with the new one and mark the
>> whole shader state as dirty.
>
> Doing this with the CPU has some drawbacks though: Accessing a BO in
> VRAM with the CPU forces it into the first 256MB of VRAM, which could
> stall the graphics pipeline if the BO wasn't there before.
>
> I think basically the same scheme of allocating new BOs for reloc
> patching should be doable with CPDMA, which should remove the need to
> worry about flushing things in particular orders.
>
> It's probably fine to do it with the CPU first and with CPDMA (also for
> the initial shader code write) as an optimization later.
>
>
> --
> Earthling Michel Dänzer               |               http://www.amd.com
> Libre software enthusiast             |             Mesa and X developer