[PATCH] drm/amdgpu: fix amdgpu_amdkfd_remove_eviction_fence
Christian König
christian.koenig at amd.com
Thu Aug 16 18:26:13 UTC 2018
On 16.08.2018 at 20:23, Felix Kuehling wrote:
> On 2018-08-16 02:18 PM, Christian König wrote:
>> On 16.08.2018 at 18:50, Felix Kuehling wrote:
>>> On 2018-08-16 02:43 AM, Christian König wrote:
>>> [SNIP]
>>>> I mean it could be that in the worst case we race and stop a KFD
>>>> process for no good reason.
>>> Right. For a more practical example, a KFD BO can get evicted just
>>> before the application decides to unmap it. The preemption happens
>>> asynchronously, handled by an SDMA job in the GPU scheduler. That job
>>> will have an amdgpu_sync object with the eviction fence in it.
>>>
>>> While that SDMA job is pending or in progress, the application decides
>>> to unmap the BO. That removes the eviction fence from that BO's
>>> reservation. But it can't remove the fence from all the sync objects
>>> that were previously created and are still in flight. So the preemption
>>> will be triggered, and the fence will eventually signal when the KFD
>>> preemption is complete.
>>>
>>> I don't think that's something we can prevent. The worst case is that a
>>> preemption happens unnecessarily if an eviction gets triggered just
>>> before removing the fence. But removing the fence will prevent future
>>> evictions of the BO from triggering a KFD process preemption. That's the
>>> best we can do.
>> It's true that you can't drop the SDMA job which wants to evict the
>> BO, but at this time the fence signaling is already underway and not
>> stoppable anymore.
>>
>> Replacing the fence with a new one would be much cleaner and would
>> fix quite a few corner cases where the KFD process would be
>> preempted without good reason.
> Replacing the fence cleanly probably also involves a preemption, so you
> don't gain anything.
Hmm, why is that?
My idea would be to create a new fence, replace the old one with the new
one and then manually signal the old one.
So why should there be a preemption triggered here?
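
A minimal sketch of that idea as a toy userspace model in Python, not kernel code. All names here (Process, EvictionFence, remove_fence, replace_fence, preemptions_after_unmap) are hypothetical, and the model assumes that waiting on an unsignaled eviction fence is what triggers the preemption. Where the new fence gets attached is also an assumption; the sketch simply swaps it in place of the old one. It only covers the window where a sync object already holds the old fence but nobody has started waiting on it yet.

```python
class Process:
    """Toy KFD process that only counts how often it gets preempted."""

    def __init__(self):
        self.preemptions = 0

    def preempt(self):
        self.preemptions += 1


class EvictionFence:
    """Toy eviction fence: waiting on it while unsignaled preempts the process."""

    def __init__(self, process):
        self.process = process
        self.signaled = False

    def signal(self):
        self.signaled = True

    def enable_signaling(self):
        # Called when a waiter (e.g. a pending job) starts waiting.
        if self.signaled:
            return  # already signaled: nothing to preempt
        self.process.preempt()
        self.signal()


def remove_fence(resv, old):
    """Plain removal: drop the fence from the reservation list only."""
    resv.remove(old)


def replace_fence(resv, old, new):
    """Proposed variant: swap in a new fence, then signal the old one by hand."""
    resv[resv.index(old)] = new
    old.signal()


def preemptions_after_unmap(use_replace):
    """Return how many preemptions an in-flight waiter causes after the unmap."""
    process = Process()
    old = EvictionFence(process)
    resv = [old]                 # fences attached to the BO's reservation
    in_flight_sync = list(resv)  # sync object created before the unmap

    if use_replace:
        replace_fence(resv, old, EvictionFence(process))
    else:
        remove_fence(resv, old)

    # The in-flight job only now starts waiting on the fences it collected.
    for fence in in_flight_sync:
        fence.enable_signaling()
    return process.preemptions
```

In this model, plain removal still leaves the unsignaled old fence in the in-flight sync object, so the waiter preempts the process once, while the replace-and-signal variant does not. Whether signaling the old fence by hand is actually safe without a real preemption is the open question in this thread, and the sketch does not settle it.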
Christian.
>
> Regards,
> Felix
>
>> Doing so probably costs quite a bit more CPU overhead, but I
>> think it would still be the more fail-safe option.
>>
>> Regards,
>> Christian.
>>
>>
>>> Regards,
>>> Felix
>>>