[PATCH] drm/amdgpu: fix amdgpu_amdkfd_remove_eviction_fence

Felix Kuehling felix.kuehling at amd.com
Thu Aug 16 18:43:06 UTC 2018



On 2018-08-16 02:26 PM, Christian König wrote:
> Am 16.08.2018 um 20:23 schrieb Felix Kuehling:
>> On 2018-08-16 02:18 PM, Christian König wrote:
>>> Am 16.08.2018 um 18:50 schrieb Felix Kuehling:
>>>> On 2018-08-16 02:43 AM, Christian König wrote:
>>>> [SNIP]
>>>>> I mean it could be that in the worst case we race and stop a KFD
>>>>> process for no good reason.
>>>> Right. For a more practical example, a KFD BO can get evicted just
>>>> before the application decides to unmap it. The preemption happens
>>>> asynchronously, handled by an SDMA job in the GPU scheduler. That job
>>>> will have an amdgpu_sync object with the eviction fence in it.
>>>>
>>>> While that SDMA job is pending or in progress, the application decides
>>>> to unmap the BO. That removes the eviction fence from that BO's
>>>> reservation. But it can't remove the fence from all the sync objects
>>>> that were previously created and are still in flight. So the
>>>> preemption
>>>> will be triggered, and the fence will eventually signal when the KFD
>>>> preemption is complete.
>>>>
>>>> I don't think that's something we can prevent. The worst case is
>>>> that a
>>>> preemption happens unnecessarily if an eviction gets triggered just
>>>> before removing the fence. But removing the fence will prevent future
>>>> evictions of the BO from triggering a KFD process preemption.
>>>> That's the
>>>> best we can do.
>>> It's true that you can't drop the SDMA job which wants to evict the
>>> BO, but at this time the fence signaling is already underway and not
>>> stoppable anymore.
>>>
>>> Replacing the fence with a new one would just be much more cleaner and
>>> fix quite a bunch of corner cases where the KFD process would be
>>> preempted without good reason.
>> Replacing the fence cleanly probably also involves a preemption, so you
>> don't gain anything.
>
> Mhm, why that?
>
> My idea would be to create a new fence, replace the old one with the
> new one and then manually signal the old one.
>
> So why should there be a preemption triggered here?

Right maybe you can do this without triggering a preemption, but it
would require a bunch of new code. Currently the only thing that
replaces an old eviction fence with a new one is a preemption+restore.

You'll need to replace the fence in all BOs belonging to the process.
Since removing a fence is something you don't want to do, that really
means adding a new fence, and leaving the old fence in place until it
signals. In the mean time you have two eviction fences active for this
process, and if either one gets triggered, you'll get a preemption. Only
after all BOs have the new eviction fence, you can disarm the old
eviction fence and signal it (without triggering a preemption).

The way I see it, this adds a bunch of CPU overhead (depending on the
number of BOs in the process), and increases the likelihood of
unnecessary preemptions, because it takes that much longer for the
operation to complete.

Talking about overhead, imagine a process cleaning up at process
termination, which unmaps and frees all BOs. Each BO that gets freed
replaces the eviction fence on all remaining BOs. If you have N BOs,
you'll end up creating N new eviction fences in the process. The
overhead scales with O(N²) of the number of BOs in the process.

Regards,
  Felix

>
> Christian.
>
>>
>> Regards,
>>    Felix
>>
>>> It's probably quite a bit of more CPU overhead of doing so, but I
>>> think that this would still be the more fail prove option.
>>>
>>> Regards,
>>> Christian.
>>>
>>>
>>>> Regards,
>>>>     Felix
>>>>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx



More information about the amd-gfx mailing list