[Mesa-dev] [PATCH 3/3] radeon/winsys: add async dma infrastructure

Mon Jan 7 08:22:12 PST 2013

On Mon, Jan 7, 2013 at 11:03 AM, Marek Olšák <maraeo at gmail.com> wrote:
> On Mon, Jan 7, 2013 at 3:45 PM, Christian König <deathsimple at vodafone.de> wrote:
>> On 07.01.2013 01:24, Marek Olšák wrote:
>>>
>>> On Sun, Jan 6, 2013 at 11:58 PM, Jerome Glisse <j.glisse at gmail.com> wrote:
>>>>
>>>> On Sun, Jan 6, 2013 at 4:00 PM, Marek Olšák <maraeo at gmail.com> wrote:
>>>>>
>>>>> I agree with Christian. You can use a separate instance of
>>>>> radeon_winsys_cs for the DMA CS. The winsys exposes all the functions
>>>>> you need (except one) for you to coordinate work between 2 command
>>>>> streams in the pipe driver. You may only need to expose one additional
>>>>> winsys function to the driver for synchronization, it's called
>>>>> "radeon_drm_cs_sync_flush". I'm confident that this can be implemented
>>>>> and layered on top of the winsys, presumably with fewer lines of code
>>>>> and cleaner.
>>>>
>>>> The relocation add function need to access both the dma ring and the
>>>> cs ring no matter on which ring the relocation is added. Doing the
>>>> sync in the pipe driver would increase the code, each call site of
>>>> add_reloc would need to check if the bo is referenced by the other
>>>> ring and flush the other ring if so. Which also means that there is a
>>>> higher likelyhood that someone adding an add reloc forget about the
>>>> flushing.
>>>
>>> Well, in that case, you can define a new set of functions in the pipe
>>> driver, which are layered on top of radeon_winsys_cs and the existing
>>> interface radeon_winsys::cs_*.
>>>
>>> If you want to be super clean, you can add a new module that defines
>>> this command stream pair:
>>>
>>> struct r600_cs_with_dma {
>>>     struct radeon_winsys_cs *cs_main, *cs_dma;
>>> };
>>>
>>> And define a set of functions which work with that, reimplementing all
>>> the cs_* functions by calling the existing functions of radeon_winsys.
>>> The pipe driver would then use the new CS functions everywhere instead
>>> of radeon_winsys.
>>>
>>> To me, the best design decision here is not to try to *hack* the
>>> existing winsys code to make it do what you want without giving it
>>> another thought. Adding another layer is preferable, because it keeps
>>> both parts simple and separated.
>>
>>
>> Well thinking about it more and more I don't think add_reloc is the right
>> place to do the sync anyway.
>>
>> Imagine a loop that wants to handle a bunch of buffers, first they are zero
>> cleared and then rendered to. Those buffers are unique, so we can zero clear
>> them all at once. In an ideal world they should all end up in the same DMA
>> command stream.
>>
>> Now comes a buffer that is first rendered to and then copied around (for
>> example), in this moment the DMA command stream needs to be flushed, cause
>> now a new DMA command stream starts that actually needs to run after the
>> rendering command stream.
>>
>> So instead of flushing when we see that a buffer gets added to a command
>> stream we need to remember in which oder the command stream needs to get
>> submitted and only flush when this order is going to change.
>
> I agree with all your points. add_reloc is a bad place for
> synchronization for yet another reason: you don't really know anything
> about what the driver is trying to do and what commands and
> relocations are likely to come next, as opposed to e.g. a write
> transfer where you are 100% sure that:
> - the source resource isn't referenced by a CS nor is it busy
> - the destination resource will likely be used pretty soon as a source
> for rendering
>
> In addition to that, I believe that using the async DMA is useless for
> anything but write-only transfers with a staging resource. Every other
> case is always synchronous with rendering and therefore defeats the
> purpose of the *async* DMA. I therefore propose:
>
> 1) Let's not use the async DMA in resource_copy_region *at all*.
>
> 2) Let's replace all the resource_copy_region and copy_buffer
> occurencies in transfer_unmap with async DMA copies. The function
> implementing the copy should do all the necessary synchronization
> between command streams by itself.
>
> Marek

I have v3 coming that should please you.

Cheers,
Jerome