[Mesa-dev] [PATCH 3/3] radeon/winsys: add async dma infrastructure

Mon Jan 7 08:22:10 PST 2013

On 07.01.2013 17:03, Marek Olšák wrote:
> On Mon, Jan 7, 2013 at 3:45 PM, Christian König <deathsimple at vodafone.de> wrote:
>> On 07.01.2013 01:24, Marek Olšák wrote:
>>> On Sun, Jan 6, 2013 at 11:58 PM, Jerome Glisse <j.glisse at gmail.com> wrote:
>>>> On Sun, Jan 6, 2013 at 4:00 PM, Marek Olšák <maraeo at gmail.com> wrote:
>>>>> I agree with Christian. You can use a separate instance of
>>>>> radeon_winsys_cs for the DMA CS. The winsys exposes all the functions
>>>>> you need (except one) for you to coordinate work between 2 command
>>>>> streams in the pipe driver. You may only need to expose one additional
>>>>> winsys function to the driver for synchronization, it's called
>>>>> "radeon_drm_cs_sync_flush". I'm confident that this can be implemented
>>>>> and layered on top of the winsys, presumably with fewer lines of code
>>>>> and cleaner.
>>>> The relocation add function need to access both the dma ring and the
>>>> cs ring no matter on which ring the relocation is added. Doing the
>>>> sync in the pipe driver would increase the code, each call site of
>>>> add_reloc would need to check if the bo is referenced by the other
>>>> ring and flush the other ring if so. Which also means that there is a
>>>> higher likelyhood that someone adding an add reloc forget about the
>>>> flushing.
>>> Well, in that case, you can define a new set of functions in the pipe
>>> driver, which are layered on top of radeon_winsys_cs and the existing
>>> interface radeon_winsys::cs_*.
>>>
>>> If you want to be super clean, you can add a new module that defines
>>> this command stream pair:
>>>
>>> struct r600_cs_with_dma {
>>>      struct radeon_winsys_cs *cs_main, *cs_dma;
>>> };
>>>
>>> And define a set of functions which work with that, reimplementing all
>>> the cs_* functions by calling the existing functions of radeon_winsys.
>>> The pipe driver would then use the new CS functions everywhere instead
>>> of radeon_winsys.
>>>
>>> To me, the best design decision here is not to try to *hack* the
>>> existing winsys code to make it do what you want without giving it
>>> another thought. Adding another layer is preferable, because it keeps
>>> both parts simple and separated.
>>
>> Well thinking about it more and more I don't think add_reloc is the right
>> place to do the sync anyway.
>>
>> Imagine a loop that wants to handle a bunch of buffers, first they are zero
>> cleared and then rendered to. Those buffers are unique, so we can zero clear
>> them all at once. In an ideal world they should all end up in the same DMA
>> command stream.
>>
>> Now comes a buffer that is first rendered to and then copied around (for
>> example), in this moment the DMA command stream needs to be flushed, cause
>> now a new DMA command stream starts that actually needs to run after the
>> rendering command stream.
>>
>> So instead of flushing when we see that a buffer gets added to a command
>> stream we need to remember in which oder the command stream needs to get
>> submitted and only flush when this order is going to change.
> I agree with all your points. add_reloc is a bad place for
> synchronization for yet another reason: you don't really know anything
> about what the driver is trying to do and what commands and
> relocations are likely to come next, as opposed to e.g. a write
> transfer where you are 100% sure that:
> - the source resource isn't referenced by a CS nor is it busy
> - the destination resource will likely be used pretty soon as a source
> for rendering
>
> In addition to that, I believe that using the async DMA is useless for
> anything but write-only transfers with a staging resource. Every other
> case is always synchronous with rendering and therefore defeats the
> purpose of the *async* DMA. I therefore propose:
>
> 1) Let's not use the async DMA in resource_copy_region *at all*.
>
> 2) Let's replace all the resource_copy_region and copy_buffer
> occurencies in transfer_unmap with async DMA copies. The function
> implementing the copy should do all the necessary synchronization
> between command streams by itself.

Yeah, that pretty much matches with what I had in mind.

Additional to that we should consider removing the duplicate check from 
add_reloc. It only makes sense for the GFX command stream, only limited 
sense for compute and UVD command streams, and really hurts us for DMA 
command streams.

Feel free to submit code for this. I would volunteer to code it myself, 
but unfortunately don't have time for it at the moment.

Christian.

>
> Marek
>