[PATCH 1/2] drm/ttm: add a way to bo_wait for either the last read or last write

Fri Oct 7 07:14:55 PDT 2011

On Fri, Oct 7, 2011 at 10:05 AM, Thomas Hellstrom <thomas at shipmail.org> wrote:
> On 10/07/2011 03:24 PM, Alex Deucher wrote:
>>
>> On Fri, Oct 7, 2011 at 4:00 AM, Thomas Hellstrom<thomas at shipmail.org>
>>  wrote:
>>
>>>
>>> On 10/07/2011 12:42 AM, Marek Olšák wrote:
>>>
>>>>
>>>> On Wed, Oct 5, 2011 at 7:54 AM, Thomas Hellstrom<thomas at shipmail.org>
>>>>  wrote:
>>>>
>>>>
>>>>>
>>>>> In any case, I'm not saying fences is the best way to flush but since
>>>>> the
>>>>> bo
>>>>> code assumes that signaling a sync object means "make the buffer
>>>>> contents
>>>>> available for CPU read / write", it's usually a good way to do it;
>>>>> there's
>>>>> even a sync_obj_flush() method that gets called when a potential flush
>>>>> needs
>>>>> to happen.
>>>>>
>>>>>
>>>>
>>>> I don't think we use it like that. To my knowledge, the purpose of the
>>>> sync obj (to Radeon Gallium drivers at least) is to be able to wait
>>>> for the last use of a buffer. Whether the contents can or cannot be
>>>> available to the CPU is totally irrelevant.
>>>>
>>>> Currently (and it's a very important performance optimization),
>>>> buffers stay mapped and available for CPU read/write during their
>>>> first map_buffer call. Unmap_buffer is a no-op. The unmapping happens
>>>> on buffer destruction. We only call bo_wait when we want to wait for
>>>> the GPU until it's done with the buffer (we don't always want that,
>>>> sometimes we want to use the unsychronized flag). Otherwise the
>>>> contents of buffers are available at *any time*.
>>>>
>>>> We could probably implement bo_wait privately in the kernel driver and
>>>> not use ttm_bo_wait. I preferred code sharing though.
>>>>
>>>> Textures (especially the tiled ones) are never mapped directly and a
>>>> temporary staging resource is used instead, so we don't actually
>>>> pollute address space that much. (in case you would have such a
>>>> remark) We will use staging resources for buffers too, but it's really
>>>> the last resort to avoid waiting when direct access can't be used.
>>>>
>>>>
>>>>
>>>>
>>>>>>>
>>>>>>> 2) Can't we say that a write_sync_obj is simply a sync_obj? What's
>>>>>>> the
>>>>>>> difference between those two? I think we should remove the
>>>>>>> write_sync_obj
>>>>>>> bo
>>>>>>> member.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Okay, but I think we should remove sync_obj instead, and keep write
>>>>>> and read sync objs. In the case of READWRITE usage, read_sync_obj
>>>>>> would be equal to write_sync_obj.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> Sure, I'm fine with that.
>>>>>
>>>>> One other thing, though, that makes me a little puzzled:
>>>>>
>>>>> Let's assume you don't allow readers and writers at the same time,
>>>>> which
>>>>> is
>>>>> my perception of how read- and write fences should work; you either
>>>>> have
>>>>> a
>>>>> list of read fences or a single write fence (in the same way a
>>>>> read-write
>>>>> lock works).
>>>>>
>>>>> Now, if you only allow a single read fence, like in this patch. That
>>>>> implies
>>>>> that you can only have either a single read fence or a single write
>>>>> fence
>>>>> at
>>>>> any one time. We'd only need a single fence pointer on the bo, and
>>>>> sync_obj_arg would tell us whether to signal the fence for read or for
>>>>> write
>>>>> (assuming that sync_obj_arg was set up to indicate read / write at
>>>>> validation time), then the patch really isn't necessary at all, as it
>>>>> only
>>>>> allows a single read fence?
>>>>>
>>>>> Or is it that you want to allow read- and write fences co-existing? In
>>>>> that
>>>>> case, what's the use case?
>>>>>
>>>>>
>>>>
>>>> There are lots of read-write use cases which don't need any barriers
>>>> or flushing. The obvious ones are color blending and depth-stencil
>>>> buffering. The OpenGL application is also allowed to use a subrange of
>>>> a buffer as a vertex buffer (read-only) and another disjoint subrange
>>>> of the same buffer for transform feedback (write-only), which kinda
>>>> makes me think about whether we should track subranges instead of
>>>> treating a whole buffer as "busy". It gets even more funky with
>>>> ARB_shader_image_load_store, which supports atomic read-modify-write
>>>> operations on textures, not to mention atomic memory operations in
>>>> compute shaders (wait, isn't that also exposed in GL as
>>>> GL_ARB_shader_atomic_counters?).
>>>>
>>>> I was thinking whether the two sync objs should be "read" and
>>>> "readwrite", or "read" and "write". I chose the latter, because it's
>>>> more fine-grained and we have to keep at least two of them around
>>>> anyway.
>>>>
>>>> So now that you know what we use sync objs for, what are your ideas on
>>>> re-implementing that patch in a way that is okay with you? Besides
>>>> removing the third sync objs of course.
>>>>
>>>> Marek
>>>>
>>>>
>>>
>>> OK. First I think we need to make a distinction: bo sync objects vs
>>> driver
>>> fences. The bo sync obj api is there to strictly provide functionality
>>> that
>>> the ttm bo subsystem is using, and that follows a simple set of rules:
>>>
>>> 1) the bo subsystem does never assume sync objects are ordered. That
>>> means
>>> the bo subsystem needs to wait on a sync object before removing it from a
>>> buffer. Any other assumption is buggy and must be fixed. BUT, if that
>>> assumption takes place in the driver unknowingly from the ttm bo
>>> subsystem
>>> (which is usually the case), it's OK.
>>>
>>> 2) When the sync object(s) attached to the bo are signaled the ttm bo
>>> subsystem is free to copy the bo contents and to unbind the bo.
>>>
>>> 3) The ttm bo system allows sync objects to be signaled in different ways
>>> opaque to the subsystem using sync_obj_arg. The driver is responsible for
>>> setting up that argument.
>>>
>>> 4) Driver fences may be used for or expose other functionality or
>>> adaptions
>>> to APIs as long as the sync obj api exported to the bo sybsystem follows
>>> the
>>> above rules.
>>>
>>> This means the following w r t the patch.
>>>
>>> A) it violates 1). This is a bug that must be fixed. Assumptions that if
>>> one
>>> sync object is singnaled, another sync object is also signaled must be
>>> done
>>> in the driver and not in the bo subsystem. Hence we need to explicitly
>>> wait
>>> for a fence to remove it from the bo.
>>>
>>> B) the sync_obj_arg carries *per-sync-obj* information on how it should
>>> be
>>> signaled. If we need to attach multiple sync objects to a buffer object,
>>> we
>>> also need multiple sync_obj_args. This is a bug and needs to be fixed.
>>>
>>> C) There is really only one reason that the ttm bo subsystem should care
>>> about multiple sync objects, and that is because the driver can't order
>>> them
>>> efficiently. A such example would be hardware with multiple pipes reading
>>> simultaneously from the same texture buffer. Currently we don't support
>>> this
>>> so only the *last* sync object needs to be know by the bo subsystem.
>>> Keeping
>>> track of multiple fences generates a lot of completely unnecessary code
>>> in
>>> the ttm_bo_util file, the ttm_bo_vm file, and will be a nightmare if /
>>> when
>>> we truly support pipelined moves.
>>>
>>
>> Just an FYI, cayman GPUs have multiple rings and we will be working to
>> support them in the coming months.
>>
>> Alex
>>
>>
>
> Right. Nvidia has had it for a long time, but I think they use barriers
> ("Semaphores") to order
> sync objects when needed, so that only the last sync object is attached to
> the buffer.
>
> If you need to / want to support simultaneous ring access to a buffer
> without such barriers,
> we should change the single buffer object fence pointer to a list of
> pointers to fence objects.

We were planning to use semaphores as well.

Alex

>
> /Thomas
>
>
>
>
>