[PATCH] drm/amdgpu: workaround for TLB seq race

Fri Nov 4 07:10:10 UTC 2022

On 03.11.22 at 22:18, Philip Yang wrote:
>
> On 2022-11-02 10:58, Christian König wrote:
>> It can happen that we query the sequence value before the callback
>> had a chance to run.
>>
>> Work around that by grabbing the fence lock and releasing it again.
>> Should be replaced by hw handling soon.
>
> kfd_flush_tlb is always called after waiting for the map/unmap to GPU
> fence to signal, which means the callback has already executed

And that is exactly what's incorrect.

Waiting for the fence to signal only means that the callback has started
executing, not that it has finished.

One CPU can then race with the callback handler and read the TLB seq
before the callback has increased it, so you see a stale value.
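
To illustrate, here is a minimal userspace sketch of why taking and
releasing the lock helps. It is only a model, not driver code: a pthread
mutex stands in for the fence lock, and all model_* names are
hypothetical. It assumes the signaler holds the lock from marking the
fence as signaled until the callback has finished, which is what the
workaround relies on.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for a fence plus the VM's TLB sequence. */
struct model_fence {
    pthread_mutex_t lock;      /* plays the role of the fence lock */
    atomic_int signaled;       /* plays the role of the signaled flag */
    atomic_uint_fast64_t seq;  /* plays the role of vm->tlb_seq */
};

/* Signaler: mark the fence signaled, then run the "callback" under the lock. */
static void *model_signal(void *arg)
{
    struct model_fence *f = arg;
    int i;

    pthread_mutex_lock(&f->lock);
    atomic_store(&f->signaled, 1);   /* waiters can observe "signaled" now */
    for (i = 0; i < 1000; i++)       /* widen the window before the callback */
        sched_yield();
    atomic_fetch_add(&f->seq, 1);    /* the callback increases the sequence */
    pthread_mutex_unlock(&f->lock);
    return NULL;
}

/* Reader: lock and unlock as a barrier, then read the sequence. */
static uint64_t model_tlb_seq(struct model_fence *f)
{
    pthread_mutex_lock(&f->lock);    /* blocks while the callback still runs */
    pthread_mutex_unlock(&f->lock);
    return atomic_load(&f->seq);
}

/* Wait for "signaled" without the lock, then query the sequence. */
static uint64_t model_wait_then_read(void)
{
    struct model_fence f;
    pthread_t signaler;
    uint64_t seq;
    int rc;

    pthread_mutex_init(&f.lock, NULL);
    atomic_init(&f.signaled, 0);
    atomic_init(&f.seq, 0);

    rc = pthread_create(&signaler, NULL, model_signal, &f);
    assert(rc == 0);
    (void)rc;

    while (!atomic_load(&f.signaled))  /* "fence signaled" seen early */
        sched_yield();
    seq = model_tlb_seq(&f);           /* the barrier makes this see the bump */

    pthread_join(signaler, NULL);
    pthread_mutex_destroy(&f.lock);
    return seq;
}
```

In this model the reader can only see the signaled flag while the signaler
holds the lock or after it has dropped it, so the lock/unlock pair always
orders the read after the callback. Without the pair in model_tlb_seq()
the read could return the old value.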

Regards,
Christian.

> and the sequence is increased if a TLB flush is needed, so there is no
> such race from KFD.
>
> I am not sure, but it seems the race does exist when amdgpu grabs the
> VM and schedules a job.
>
> Acked-by: Philip Yang <Philip.Yang at amd.com>
>
>> Signed-off-by: Christian König <christian.koenig at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 15 +++++++++++++++
>>   1 file changed, 15 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> index 9ecb7f663e19..e51a46c9582b 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> @@ -485,6 +485,21 @@ void amdgpu_debugfs_vm_bo_info(struct amdgpu_vm *vm, struct seq_file *m);
>>    */
>>   static inline uint64_t amdgpu_vm_tlb_seq(struct amdgpu_vm *vm)
>>   {
>> +    unsigned long flags;
>> +    spinlock_t *lock;
>> +
>> +    /*
>> +     * Work around to stop racing between the fence signaling and handling
>> +     * the cb. The lock is static after initially setting it up, just make
>> +     * sure that the dma_fence structure isn't freed up.
>> +     */
>> +    rcu_read_lock();
>> +    lock = vm->last_tlb_flush->lock;
>> +    rcu_read_unlock();
>> +
>> +    spin_lock_irqsave(lock, flags);
>> +    spin_unlock_irqrestore(lock, flags);
>> +
>>       return atomic64_read(&vm->tlb_seq);
>>   }