[PATCH] drm/amdgpu: fix amdgpu_amdkfd_remove_eviction_fence

Christian König christian.koenig at amd.com
Thu Aug 16 18:49:58 UTC 2018


Am 16.08.2018 um 20:43 schrieb Felix Kuehling:
>
> On 2018-08-16 02:26 PM, Christian König wrote:
>> Am 16.08.2018 um 20:23 schrieb Felix Kuehling:
>>> On 2018-08-16 02:18 PM, Christian König wrote:
>>>> Am 16.08.2018 um 18:50 schrieb Felix Kuehling:
>>>>> On 2018-08-16 02:43 AM, Christian König wrote:
>>>>> [SNIP]
>>>>>> I mean it could be that in the worst case we race and stop a KFD
>>>>>> process for no good reason.
>>>>> Right. For a more practical example, a KFD BO can get evicted just
>>>>> before the application decides to unmap it. The preemption happens
>>>>> asynchronously, handled by an SDMA job in the GPU scheduler. That job
>>>>> will have an amdgpu_sync object with the eviction fence in it.
>>>>>
>>>>> While that SDMA job is pending or in progress, the application decides
>>>>> to unmap the BO. That removes the eviction fence from that BO's
>>>>> reservation. But it can't remove the fence from all the sync objects
>>>>> that were previously created and are still in flight. So the
>>>>> preemption
>>>>> will be triggered, and the fence will eventually signal when the KFD
>>>>> preemption is complete.
>>>>>
>>>>> I don't think that's something we can prevent. The worst case is
>>>>> that a
>>>>> preemption happens unnecessarily if an eviction gets triggered just
>>>>> before removing the fence. But removing the fence will prevent future
>>>>> evictions of the BO from triggering a KFD process preemption.
>>>>> That's the
>>>>> best we can do.
>>>> It's true that you can't drop the SDMA job which wants to evict the
>>>> BO, but at this time the fence signaling is already underway and not
>>>> stoppable anymore.
>>>>
>>>> Replacing the fence with a new one would just be much more cleaner and
>>>> fix quite a bunch of corner cases where the KFD process would be
>>>> preempted without good reason.
>>> Replacing the fence cleanly probably also involves a preemption, so you
>>> don't gain anything.
>> Mhm, why that?
>>
>> My idea would be to create a new fence, replace the old one with the
>> new one and then manually signal the old one.
>>
>> So why should there be a preemption triggered here?
> Right maybe you can do this without triggering a preemption, but it
> would require a bunch of new code. Currently the only thing that
> replaces an old eviction fence with a new one is a preemption+restore.
>
> You'll need to replace the fence in all BOs belonging to the process.
> Since removing a fence is something you don't want to do, that really
> means adding a new fence, and leaving the old fence in place until it
> signals. In the mean time you have two eviction fences active for this
> process, and if either one gets triggered, you'll get a preemption. Only
> after all BOs have the new eviction fence, you can disarm the old
> eviction fence and signal it (without triggering a preemption).
>
> The way I see it, this adds a bunch of CPU overhead (depending on the
> number of BOs in the process), and increases the likelihood of
> unnecessary preemptions, because it takes that much longer for the
> operation to complete.

Yeah, the CPU overhead is indeed a really good point.

> Talking about overhead, imagine a process cleaning up at process
> termination, which unmaps and frees all BOs. Each BO that gets freed
> replaces the eviction fence on all remaining BOs. If you have N BOs,
> you'll end up creating N new eviction fences in the process. The
> overhead scales with O(N²) of the number of BOs in the process.

Well as I said that is the only case where I think replacing fences is 
unproblematic.

E.g. when a BO is freed we already have a mechanism to copy all fences 
to a local reservation object (to prevent that any new fences can be 
added to the BO if it shares it's reservation object).

At this time it should be really easy to filter out BOs you don't want 
to wait for when the BO is destructed.

But anyway, when you can live with the possibility that a stale fence 
triggers a process preemption than there isn't much point in cleaning 
this up.

Regards,
Christian.

>
> Regards,
>    Felix
>
>> Christian.
>>
>>> Regards,
>>>     Felix
>>>
>>>> It's probably quite a bit of more CPU overhead of doing so, but I
>>>> think that this would still be the more fail prove option.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>
>>>>> Regards,
>>>>>      Felix
>>>>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx



More information about the amd-gfx mailing list