[PATCH 5/5] [RFC]drm: add syncobj timeline support v3

Tue Sep 4 07:53:58 UTC 2018


On 2018年09月04日 15:00, Christian König wrote:
> Am 04.09.2018 um 06:04 schrieb zhoucm1:
>>
>>
>> On 2018年09月03日 19:19, Christian König wrote:
>>> Am 03.09.2018 um 12:07 schrieb Chunming Zhou:
>>>>
>>>>
>>>> 在 2018/9/3 16:50, Christian König 写道:
>>>>> Am 03.09.2018 um 06:13 schrieb Chunming Zhou:
>>>>>>
>>>>>>
>>>>>> 在 2018/8/30 19:32, Christian König 写道:
>>>>>>> [SNIP]
>>>>>>>>>>>
>>>>>>>>>>>> +
>>>>>>>>>>>> +struct drm_syncobj_wait_pt {
>>>>>>>>>>>> +    struct drm_syncobj_stub_fence base;
>>>>>>>>>>>> +    u64    value;
>>>>>>>>>>>> +    struct rb_node   node;
>>>>>>>>>>>> +};
>>>>>>>>>>>> +struct drm_syncobj_signal_pt {
>>>>>>>>>>>> +    struct drm_syncobj_stub_fence base;
>>>>>>>>>>>> +    struct dma_fence *signal_fence;
>>>>>>>>>>>> +    struct dma_fence *pre_pt_base;
>>>>>>>>>>>> +    struct dma_fence_cb signal_cb;
>>>>>>>>>>>> +    struct dma_fence_cb pre_pt_cb;
>>>>>>>>>>>> +    struct drm_syncobj *syncobj;
>>>>>>>>>>>> +    u64    value;
>>>>>>>>>>>> +    struct list_head list;
>>>>>>>>>>>> +};
>>>>>>>>>>>
>>>>>>>>>>> What are those two structures good for
>>>>>>>>>> They are used to record wait ops points and signal ops points.
>>>>>>>>>> For timeline, they are connected by timeline value, works like:
>>>>>>>>>>     a. signal pt increase timeline value to signal_pt value 
>>>>>>>>>> when signal_pt is signaled. signal_pt is depending on 
>>>>>>>>>> previous pt fence and itself signal fence from signal ops.
>>>>>>>>>>     b. wait pt tree checks if timeline value over itself value.
>>>>>>>>>>
>>>>>>>>>> For normal, works like:
>>>>>>>>>>     a. signal pt increase 1 for 
>>>>>>>>>> syncobj_timeline->signal_point every time when signal ops is 
>>>>>>>>>> performed.
>>>>>>>>>>     b. when wait ops is comming, wait pt will fetch above 
>>>>>>>>>> last signal pt value as its wait point. wait pt will be only 
>>>>>>>>>> signaled by equal point value signal_pt.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> and why is the stub fence their base?
>>>>>>>>>> Good question, I tried to kzalloc for them as well when I 
>>>>>>>>>> debug them, I encountered a problem:
>>>>>>>>>> I lookup/find wait_pt or signal_pt successfully, but when I 
>>>>>>>>>> tried to use them, they are freed sometimes, and results in 
>>>>>>>>>> NULL point.
>>>>>>>>>> and generally, when lookup them, we often need their stub 
>>>>>>>>>> fence as well, and in the meantime, their lifecycle are same.
>>>>>>>>>> Above reason, I make stub fence as their base.
>>>>>>>>>
>>>>>>>>> That sounds like you only did this because you messed up the 
>>>>>>>>> lifecycle.
>>>>>>>>>
>>>>>>>>> Additional to that I don't think you correctly considered the 
>>>>>>>>> lifecycle of the waits and the sync object itself. E.g. 
>>>>>>>>> blocking in drm_syncobj_timeline_fini() until all waits are 
>>>>>>>>> done is not a good idea.
>>>>>>>>>
>>>>>>>>> What you should do instead is to create a fence_array object 
>>>>>>>>> with all the fence we need to wait for when a wait point is 
>>>>>>>>> requested.
>>>>>>>> Yeah, this is our initial discussion result, but when I tried 
>>>>>>>> to do that, I found that cannot meet the advance signal 
>>>>>>>> requirement:
>>>>>>>>     a. We need to consider the wait and signal pt value are not 
>>>>>>>> one-to-one match, it's difficult to find dependent point, at 
>>>>>>>> least, there is some overhead.
>>>>>>>
>>>>>>> As far as I can see that is independent of using a fence array 
>>>>>>> here. See you can either use a ring buffer or an rb-tree, but 
>>>>>>> when you want to wait for a specific point we need to condense 
>>>>>>> the not yet signaled fences into an array.
>>>>>> again, need to find the range of where the specific point in, we 
>>>>>> should close to timeline semantics, I also refered the sw_sync.c 
>>>>>> timeline, usally wait_pt is signalled by timeline point. And I 
>>>>>> agree we can implement it with several methods, but I don't think 
>>>>>> there are basical differences.
>>>>>
>>>>> The problem is that with your current approach you need the 
>>>>> sync_obj alive for the synchronization to work. That is most 
>>>>> likely not a good idea.
>>>> Indeed, I will fix that. How abount only creating fence array for 
>>>> every wait pt when syncobj release? when syncobj release, wait pt 
>>>> must have waited the signal opertation, then we can easily condense 
>>>> fences for every wait pt. And meantime, we can take timeline based 
>>>> wait pt advantage.
>>>
>>> That could work, but I don't see how you want to replace the already 
>>> issued fence with a fence_array when the sync object is destroyed.
>>>
>>> Additional to that I would rather prefer a consistent handling, e.g. 
>>> without extra rarely used code paths.
>> Ah, I find a easy way, we just need to make syncobj_timeline 
>> structure as a reference. This way syncobj itself can be released 
>> first, wait_pt/signal_pt don't need syncobj at all.
>> every wait_pt/signal_pt keep a reference of syncobj_timeline.
>
> I've thought about that as well, but came to the conclusion that you 
> run into problems because of circle dependencies.
>
> E.g. sync_obj references sync_point and sync_point references sync_obj.
sync_obj can be freed first, only sync point references syncobj_timeline 
structure, syncobj_timeline doesn't reference sync_pt, no circle dep.

>
> Additional to that it is quite a bit larger memory footprint because 
> you need to keep the sync_obj around as well.
all signaled sync_pt are freed immediately except syncobj_timeline 
sturcture, where does extra memory foootprint take?

>
>>
>>>
>>>>
>>>>>
>>>>> Additional to that you enable signaling without a need from the 
>>>>> waiting side. That is rather bad for implementations which need 
>>>>> that optimization.
>>>> Do you mean increasing timeline based on signal fence is not 
>>>> better? only update timeline value when requested by a wait pt?
>>>
>>> Yes, exactly.
>>>
>>>> This way, we will not update timeline value immidiately and cannot 
>>>> free signal pt immidiately, and we also need to consider it to CPU 
>>>> query and wait.
>>>
>>> That is actually the better coding style. We usually try to avoid 
>>> doing things in interrupt handlers as much as possible.
>> OK, I see your concern, how about to delay handling to a workqueue? 
>> this way, we only increase timeline value and wake up workqueue in 
>> fence cb, is that acceptable?
>
> A little bit, but not much better. The nice thing with garbage 
> collection is that the CPU time is accounted to the process causing 
> the garbage.
>
>>>
>>> How about this idea:
>>> 1. Each signaling point is a fence implementation with an rb node.
>>> 2. Each node keeps a reference to the last previously inserted node.
>>> 3. Each node is referenced by the sync object itself.
>>> 4. Before each signal/wait operation we do a garbage collection and 
>>> remove the first node from the tree as long as it is signaled.
>>>
>>> 5. When enable_signaling is requested for a node we cascade that to 
>>> the left using rb_prev.
>>>     This ensures that signaling is enabled for the current fence as 
>>> well as all previous fences.
>>>
>>> 6. A wait just looks into the tree for the signal point lower or 
>>> equal of the requested sequence number.
After re-thought your idea, I think it doesn't work since there is no 
timeline value as a line:
signal pt value doesn't must be continues, which can be jump by signal 
operation, like 1, 4, 8, 15, 19, e.g. there are five singal_pt, 
signal_pt1->signal_pt4->signal_pt8->signal_pt15->signal_pt19, if a wait 
pt is 7, do you mean this wait only needs signal_pt1 and signal_pt4???  
That's certainly not right, we need to make sure the timeline value is 
bigger than wait pt value, that means signal_pt8 is need for wait_pt7.


>>>
>>> 7. When the sync object is released we use 
>>> rbtree_postorder_for_each_entry_safe() and drop the extra reference 
>>> to each node, but never call rb_erase!
>>>     This way the rb_tree stays in memory, but without a root (e.g. 
>>> the sync object). It only destructs itself when the looked up 
>>> references to the nodes are dropped.
And here, who will destroy rb node since no one do enable_signaling, and 
there is no callback to free themselves.

Regards,
David Zhou
>>>
>>> Well that is quite a bunch of logic, but I think that should work fine.
>> Yeah, it could work, simple timeline reference also can solve 'free' 
>> problem.
>
> I think this approach is still quite a bit better, 

> e.g. you don't run into circle dependency problems, it needs less 
> memory and each node has always the same size which means we can use a 
> kmem_cache for it.
>
> Regards,
> Christian.
>
>>
>> Thanks,
>> David Zhou
>