[PATCH 1/2] drm/ttm: add a way to bo_wait for either the last read or last write

Fri Oct 7 06:42:10 PDT 2011

On Fri, Oct 7, 2011 at 9:38 AM, Jerome Glisse <j.glisse at gmail.com> wrote:
> On Fri, Oct 7, 2011 at 4:00 AM, Thomas Hellstrom <thomas at shipmail.org> wrote:
>> On 10/07/2011 12:42 AM, Marek Olšák wrote:
>>>
>>> On Wed, Oct 5, 2011 at 7:54 AM, Thomas Hellstrom<thomas at shipmail.org>
>>>  wrote:
>>>
>>>>
>>>> In any case, I'm not saying fences is the best way to flush but since the
>>>> bo
>>>> code assumes that signaling a sync object means "make the buffer contents
>>>> available for CPU read / write", it's usually a good way to do it;
>>>> there's
>>>> even a sync_obj_flush() method that gets called when a potential flush
>>>> needs
>>>> to happen.
>>>>
>>>
>>> I don't think we use it like that. To my knowledge, the purpose of the
>>> sync obj (to Radeon Gallium drivers at least) is to be able to wait
>>> for the last use of a buffer. Whether the contents can or cannot be
>>> available to the CPU is totally irrelevant.
>>>
>>> Currently (and it's a very important performance optimization),
>>> buffers stay mapped and available for CPU read/write during their
>>> first map_buffer call. Unmap_buffer is a no-op. The unmapping happens
>>> on buffer destruction. We only call bo_wait when we want to wait for
>>> the GPU until it's done with the buffer (we don't always want that,
>>> sometimes we want to use the unsychronized flag). Otherwise the
>>> contents of buffers are available at *any time*.
>>>
>>> We could probably implement bo_wait privately in the kernel driver and
>>> not use ttm_bo_wait. I preferred code sharing though.
>>>
>>> Textures (especially the tiled ones) are never mapped directly and a
>>> temporary staging resource is used instead, so we don't actually
>>> pollute address space that much. (in case you would have such a
>>> remark) We will use staging resources for buffers too, but it's really
>>> the last resort to avoid waiting when direct access can't be used.
>>>
>>>
>>>
>>>>>>
>>>>>> 2) Can't we say that a write_sync_obj is simply a sync_obj? What's the
>>>>>> difference between those two? I think we should remove the
>>>>>> write_sync_obj
>>>>>> bo
>>>>>> member.
>>>>>>
>>>>>>
>>>>>
>>>>> Okay, but I think we should remove sync_obj instead, and keep write
>>>>> and read sync objs. In the case of READWRITE usage, read_sync_obj
>>>>> would be equal to write_sync_obj.
>>>>>
>>>>>
>>>>>
>>>>
>>>> Sure, I'm fine with that.
>>>>
>>>> One other thing, though, that makes me a little puzzled:
>>>>
>>>> Let's assume you don't allow readers and writers at the same time, which
>>>> is
>>>> my perception of how read- and write fences should work; you either have
>>>> a
>>>> list of read fences or a single write fence (in the same way a read-write
>>>> lock works).
>>>>
>>>> Now, if you only allow a single read fence, like in this patch. That
>>>> implies
>>>> that you can only have either a single read fence or a single write fence
>>>> at
>>>> any one time. We'd only need a single fence pointer on the bo, and
>>>> sync_obj_arg would tell us whether to signal the fence for read or for
>>>> write
>>>> (assuming that sync_obj_arg was set up to indicate read / write at
>>>> validation time), then the patch really isn't necessary at all, as it
>>>> only
>>>> allows a single read fence?
>>>>
>>>> Or is it that you want to allow read- and write fences co-existing? In
>>>> that
>>>> case, what's the use case?
>>>>
>>>
>>> There are lots of read-write use cases which don't need any barriers
>>> or flushing. The obvious ones are color blending and depth-stencil
>>> buffering. The OpenGL application is also allowed to use a subrange of
>>> a buffer as a vertex buffer (read-only) and another disjoint subrange
>>> of the same buffer for transform feedback (write-only), which kinda
>>> makes me think about whether we should track subranges instead of
>>> treating a whole buffer as "busy". It gets even more funky with
>>> ARB_shader_image_load_store, which supports atomic read-modify-write
>>> operations on textures, not to mention atomic memory operations in
>>> compute shaders (wait, isn't that also exposed in GL as
>>> GL_ARB_shader_atomic_counters?).
>>>
>>> I was thinking whether the two sync objs should be "read" and
>>> "readwrite", or "read" and "write". I chose the latter, because it's
>>> more fine-grained and we have to keep at least two of them around
>>> anyway.
>>>
>>> So now that you know what we use sync objs for, what are your ideas on
>>> re-implementing that patch in a way that is okay with you? Besides
>>> removing the third sync objs of course.
>>>
>>> Marek
>>>
>>
>> OK. First I think we need to make a distinction: bo sync objects vs driver
>> fences. The bo sync obj api is there to strictly provide functionality that
>> the ttm bo subsystem is using, and that follows a simple set of rules:
>>
>> 1) the bo subsystem does never assume sync objects are ordered. That means
>> the bo subsystem needs to wait on a sync object before removing it from a
>> buffer. Any other assumption is buggy and must be fixed. BUT, if that
>> assumption takes place in the driver unknowingly from the ttm bo subsystem
>> (which is usually the case), it's OK.
>>
>> 2) When the sync object(s) attached to the bo are signaled the ttm bo
>> subsystem is free to copy the bo contents and to unbind the bo.
>>
>> 3) The ttm bo system allows sync objects to be signaled in different ways
>> opaque to the subsystem using sync_obj_arg. The driver is responsible for
>> setting up that argument.
>>
>> 4) Driver fences may be used for or expose other functionality or adaptions
>> to APIs as long as the sync obj api exported to the bo sybsystem follows the
>> above rules.
>>
>> This means the following w r t the patch.
>>
>> A) it violates 1). This is a bug that must be fixed. Assumptions that if one
>> sync object is singnaled, another sync object is also signaled must be done
>> in the driver and not in the bo subsystem. Hence we need to explicitly wait
>> for a fence to remove it from the bo.
>>
>> B) the sync_obj_arg carries *per-sync-obj* information on how it should be
>> signaled. If we need to attach multiple sync objects to a buffer object, we
>> also need multiple sync_obj_args. This is a bug and needs to be fixed.
>>
>> C) There is really only one reason that the ttm bo subsystem should care
>> about multiple sync objects, and that is because the driver can't order them
>> efficiently. A such example would be hardware with multiple pipes reading
>> simultaneously from the same texture buffer. Currently we don't support this
>> so only the *last* sync object needs to be know by the bo subsystem. Keeping
>> track of multiple fences generates a lot of completely unnecessary code in
>> the ttm_bo_util file, the ttm_bo_vm file, and will be a nightmare if / when
>> we truly support pipelined moves.
>>
>> As I understand it from your patches, you want to keep multiple fences
>> around only to track rendering history. If we want to do that generically, i
>> suggest doing it in the execbuf util code in the following way:
>>
>> struct ttm_eu_rendering_history {
>>    void *last_read_sync_obj;
>>    void *last_read_sync_obj_arg;
>>    void *last_write_sync_obj;
>>    void *last_write_sync_obj_arg;
>> }
>>
>> Embed this structure in the radeon_bo, and build a small api around it,
>> including *optionally* passing it to the existing execbuf utilities, and you
>> should be done. The bo_util code and bo_vm code doesn't care about the
>> rendering history. Only that the bo is completely idle.
>>
>> Note also that when an accelerated bo move is scheduled, the driver needs to
>> update this struct
>>
>> /Thomas
>
> I should have look at the patch long ago ... anyway i think a better
> approach would be to expose fence id as 64bits unsigned to each
> userspace client. I was thinking of mapping a page readonly (same page
> as the one gpu write back) at somespecific offset in drm file (bit
> like sarea but readonly so no lock). Each time userspace submit a
> command stream it would get the fence id associated with the command
> stream. It would then be up to userspace to track btw read or write
> and do appropriate things. The kernel code would be simple (biggest
> issue is finding an offset we can use for that), it would be fast as
> no round trip to kernel to know the last fence.
>
> Each fence seq would be valid only for a specific ring (only future
> gpu impacted here, maybe cayman).
>
> So no change to ttm, just change to radeon to return fence seq and to
> move to an unsigned 64. Issue would be when gpu write back is
> disabled, then we would either need userspace to call somethings like
> bo wait or to other ioctl to get the kernel to update the copy, copy
> would be updated in the irq handler too so at least it get updated
> anytime something enable irq.
>
> Cheers,
> Jerome
>

Forget to mention that we would need a new wait_fence ioctl to wait
for a specific fence seq on a specific ring

Cheers,
Jerome