[PATCH 5/5] [RFC] drm: add syncobj timeline support v3
zhoucm1
zhoucm1 at amd.com
Tue Sep 4 07:53:58 UTC 2018
On 2018-09-04 15:00, Christian König wrote:
> On 04.09.2018 at 06:04, zhoucm1 wrote:
>>
>>
>>> On 2018-09-03 19:19, Christian König wrote:
>>>> On 03.09.2018 at 12:07, Chunming Zhou wrote:
>>>>
>>>>
>>>> On 2018/9/3 16:50, Christian König wrote:
>>>>> On 03.09.2018 at 06:13, Chunming Zhou wrote:
>>>>>>
>>>>>>
>>>>>> On 2018/8/30 19:32, Christian König wrote:
>>>>>>> [SNIP]
>>>>>>>>>>>
>>>>>>>>>>>> +
>>>>>>>>>>>> +struct drm_syncobj_wait_pt {
>>>>>>>>>>>> + struct drm_syncobj_stub_fence base;
>>>>>>>>>>>> + u64 value;
>>>>>>>>>>>> + struct rb_node node;
>>>>>>>>>>>> +};
>>>>>>>>>>>> +struct drm_syncobj_signal_pt {
>>>>>>>>>>>> + struct drm_syncobj_stub_fence base;
>>>>>>>>>>>> + struct dma_fence *signal_fence;
>>>>>>>>>>>> + struct dma_fence *pre_pt_base;
>>>>>>>>>>>> + struct dma_fence_cb signal_cb;
>>>>>>>>>>>> + struct dma_fence_cb pre_pt_cb;
>>>>>>>>>>>> + struct drm_syncobj *syncobj;
>>>>>>>>>>>> + u64 value;
>>>>>>>>>>>> + struct list_head list;
>>>>>>>>>>>> +};
>>>>>>>>>>>
>>>>>>>>>>> What are those two structures good for
>>>>>>>>>> They are used to record wait op points and signal op points.
>>>>>>>>>> For timelines, they are connected by the timeline value and work like this:
>>>>>>>>>> a. A signal_pt increases the timeline value to its signal_pt value
>>>>>>>>>> when the signal_pt is signaled. A signal_pt depends on the
>>>>>>>>>> previous pt fence and its own signal fence from the signal op.
>>>>>>>>>> b. The wait_pt tree checks whether the timeline value is over its own value.
>>>>>>>>>>
>>>>>>>>>> For the normal (binary) case, it works like this:
>>>>>>>>>> a. A signal_pt increases syncobj_timeline->signal_point by 1
>>>>>>>>>> every time a signal op is performed.
>>>>>>>>>> b. When a wait op comes, the wait_pt fetches the last
>>>>>>>>>> signal_pt value above as its wait point. The wait_pt will only be
>>>>>>>>>> signaled by the signal_pt with the equal point value.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> and why is the stub fence their base?
>>>>>>>>>> Good question. I tried to kzalloc them separately as well when I
>>>>>>>>>> debugged this, and I ran into a problem:
>>>>>>>>>> I would look up/find a wait_pt or signal_pt successfully, but when
>>>>>>>>>> I tried to use it, it was sometimes already freed, resulting in a
>>>>>>>>>> NULL pointer dereference.
>>>>>>>>>> Also, when we look them up, we usually need their stub fence as
>>>>>>>>>> well, and their lifecycles are the same.
>>>>>>>>>> For the above reasons, I made the stub fence their base.
>>>>>>>>>
>>>>>>>>> That sounds like you only did this because you messed up the
>>>>>>>>> lifecycle.
>>>>>>>>>
>>>>>>>>> Additional to that I don't think you correctly considered the
>>>>>>>>> lifecycle of the waits and the sync object itself. E.g.
>>>>>>>>> blocking in drm_syncobj_timeline_fini() until all waits are
>>>>>>>>> done is not a good idea.
>>>>>>>>>
>>>>>>>>> What you should do instead is to create a fence_array object
>>>>>>>>> with all the fence we need to wait for when a wait point is
>>>>>>>>> requested.
>>>>>>>> Yeah, that was the result of our initial discussion, but when I
>>>>>>>> tried to do that, I found that it cannot meet the advance-signal
>>>>>>>> requirement:
>>>>>>>> a. We need to consider that the wait and signal pt values are not
>>>>>>>> a one-to-one match; it's difficult to find the dependent point, or
>>>>>>>> at least there is some overhead.
>>>>>>>
>>>>>>> As far as I can see that is independent of using a fence array
>>>>>>> here. You can either use a ring buffer or an rb-tree, but
>>>>>>> when you want to wait for a specific point we need to condense
>>>>>>> the not-yet-signaled fences into an array.
>>>>>> Again, we need to find the range the specific point falls in; we
>>>>>> should stay close to timeline semantics. I also referred to the
>>>>>> sw_sync.c timeline, where a wait_pt is usually signaled by the
>>>>>> timeline point. And I agree we can implement it with several
>>>>>> methods, but I don't think there are fundamental differences.
>>>>>
>>>>> The problem is that with your current approach you need the
>>>>> sync_obj alive for the synchronization to work. That is most
>>>>> likely not a good idea.
>>>> Indeed, I will fix that. How about only creating a fence array for
>>>> every wait_pt when the syncobj is released? When the syncobj is
>>>> released, the wait_pt must already have waited for the signal
>>>> operation, so we can easily condense fences for every wait_pt. And
>>>> meanwhile, we keep the advantage of timeline-based wait_pts.
>>>
>>> That could work, but I don't see how you want to replace the already
>>> issued fence with a fence_array when the sync object is destroyed.
>>>
>>> Additional to that I would rather prefer a consistent handling, e.g.
>>> without extra rarely used code paths.
>> Ah, I found an easy way: we just need to make the syncobj_timeline
>> structure reference counted. This way the syncobj itself can be
>> released first; wait_pt/signal_pt don't need the syncobj at all.
>> Every wait_pt/signal_pt keeps a reference to the syncobj_timeline.
>
> I've thought about that as well, but came to the conclusion that you
> run into problems because of circle dependencies.
>
> E.g. sync_obj references sync_point and sync_point references sync_obj.
The sync_obj can be freed first; only the sync points reference the
syncobj_timeline structure, and the syncobj_timeline doesn't reference
the sync_pt, so there is no circular dependency.
>
> Additional to that it is quite a bit larger memory footprint because
> you need to keep the sync_obj around as well.
All signaled sync_pts are freed immediately, except the syncobj_timeline
structure, so where does the extra memory footprint come from?
>
>>
>>>
>>>>
>>>>>
>>>>> Additional to that you enable signaling without a need from the
>>>>> waiting side. That is rather bad for implementations which need
>>>>> that optimization.
>>>> Do you mean increasing the timeline based on the signal fence is not
>>>> better, and we should only update the timeline value when requested
>>>> by a wait_pt?
>>>
>>> Yes, exactly.
>>>
>>>> This way, we will not update the timeline value immediately and
>>>> cannot free the signal_pt immediately, and we also need to consider
>>>> CPU query and wait.
>>>
>>> That is actually the better coding style. We usually try to avoid
>>> doing things in interrupt handlers as much as possible.
>> OK, I see your concern. How about delaying the handling to a workqueue?
>> That way, we only increase the timeline value and wake up the workqueue
>> in the fence callback. Is that acceptable?
>
> A little bit, but not much better. The nice thing with garbage
> collection is that the CPU time is accounted to the process causing
> the garbage.
>
>>>
>>> How about this idea:
>>> 1. Each signaling point is a fence implementation with an rb node.
>>> 2. Each node keeps a reference to the last previously inserted node.
>>> 3. Each node is referenced by the sync object itself.
>>> 4. Before each signal/wait operation we do a garbage collection and
>>> remove the first node from the tree as long as it is signaled.
>>>
>>> 5. When enable_signaling is requested for a node we cascade that to
>>> the left using rb_prev.
>>> This ensures that signaling is enabled for the current fence as
>>> well as all previous fences.
>>>
>>> 6. A wait just looks into the tree for the signal point lower or
>>> equal of the requested sequence number.
After rethinking your idea, I think it doesn't work, since there is no
timeline value acting as a line:
signal_pt values need not be contiguous; they can jump with each signal
operation, e.g. 1, 4, 8, 15, 19. With those five signal_pts,
signal_pt1->signal_pt4->signal_pt8->signal_pt15->signal_pt19, if a
wait_pt is 7, do you mean this wait only needs signal_pt1 and signal_pt4?
That's certainly not right; we need to make sure the timeline value is
bigger than the wait_pt value, which means signal_pt8 is needed for wait_pt7.
>>>
>>> 7. When the sync object is released we use
>>> rbtree_postorder_for_each_entry_safe() and drop the extra reference
>>> to each node, but never call rb_erase!
>>> This way the rb_tree stays in memory, but without a root (e.g.
>>> the sync object). It only destructs itself when the looked up
>>> references to the nodes are dropped.
And here, who will destroy the rb nodes, since no one does
enable_signaling and there is no callback for them to free themselves?
Regards,
David Zhou
>>>
>>> Well that is quite a bunch of logic, but I think that should work fine.
>> Yeah, it could work; a simple timeline reference can also solve the
>> 'free' problem.
>
> I think this approach is still quite a bit better,
> e.g. you don't run into circular dependency problems, it needs less
> memory, and each node always has the same size, which means we can use
> a kmem_cache for it.
>
> Regards,
> Christian.
>
>>
>> Thanks,
>> David Zhou
>
More information about the amd-gfx mailing list