[PATCH 4/4] drm/amdgpu: prevent gpu access during reset recovery

Christian König christian.koenig at amd.com
Thu May 23 14:53:07 UTC 2024


Am 23.05.24 um 13:36 schrieb Li, Yunxiang (Teddy):
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>>> +void amdgpu_lock_hw_access(struct amdgpu_device *adev); void
>>> +amdgpu_unlock_hw_access(struct amdgpu_device *adev); int
>>> +amdgpu_begin_hw_access(struct amdgpu_device *adev); void
>>> +amdgpu_end_hw_access(struct amdgpu_device *adev);
>> Don't add anything to amdgpu.h. We are slowly decomposing that file.
> Where would be a better place? I just wanted to have them next to amdgpu_in_reset

amdgpu_reset.h if you have time feel free to move amdgpu_in_reset() over 
there as well.

>
>>> @@ -5816,6 +5816,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>              goto skip_hw_reset;
>>>      }
>>>
>>> +   amdgpu_lock_hw_access(adev);
>> That should already be locked here. So this will most likely deadlock.
>>
>>>    retry:    /* Rest of adevs pre asic reset from XGMI hive. */
>>>      list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
>>>              r = amdgpu_device_pre_asic_reset(tmp_adev, reset_context); @@
>>> -5852,6 +5853,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>               */
>>>              amdgpu_device_stop_pending_resets(tmp_adev);
>>>      }
>>> +   amdgpu_unlock_hw_access(adev);
>>>
>>>    skip_hw_reset:
>> Don't add helpers for that, especially not with that name.
>>
>> We don't block HW access, but just prevent concurrent resets.
> Here is taking a different lock than the reset_domain->sem. It is a seperate reset_domain->gpu_sem that is only locked when we will actuall do reset, it is not taken in the skip_hw_reset path.

Exactly that is what you should *not* do. Please don't add any new lock 
to the code. This is already complicated enough.

If you think that adding wrappers for reset lock makes sense then we can 
probably do that, bot don't add any lock for hw access.


>
>>>      uint32_t seq;
>>>
>>> -   if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
>>> -       !down_read_trylock(&adev->reset_domain->sem)) {
>>> +   /*
>>> +   * A GPU reset should flush all TLBs anyway, so no need to do
>>> +   * this while one is ongoing.
>>> +   */
>>> +   if (!amdgpu_begin_hw_access(adev))
>>> +           return 0;
>>>
>>> +   if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
>>> +       amdgpu_in_reset(adev)) {
>> That check doesn't makes sense now any more.
> same here, the two checks are for different scope, although I wasn't sure if the original check is correct or not, is there a possibility that !adev->gmc.flush_pasid_uses_kiq and !ring->sched.ready are false but amdgpu_in_reset(adev) is true? and to we want to take this branch when that happens?

amdgpu_in_reset() in used incorrect in quite a lot of places. It should 
only be used inside the HW backend code to distinct between initial load 
and reset.

All other use cases especially the ones in the IOCTL front end functions 
as well as here in the midlayer which isn't used by GPU reset are incorrect.

>
>>> @@ -684,12 +684,18 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
>>>      struct amdgpu_ring *ring = &adev->gfx.kiq[inst].ring;
>>>      struct amdgpu_kiq *kiq = &adev->gfx.kiq[inst];
>>>      unsigned int ndw;
>>> -   signed long r;
>>> +   signed long r = 0;
>> Please don't initialize local variables if it isn't necessary.
>>
>>>              if (adev->gmc.flush_tlb_needs_extra_type_2)
>>>                      adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
>>>                                                               2, all_hub,
>>> @@ -703,46 +709,42 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
>>>              adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
>>>                                                       flush_type, all_hub,
>>>                                                       inst);
>>> -           return 0;
>>> -   }
>>> +   } else {
>> That the locking is missing here should be a separate bug fix independent of other changes.
> I will split this off into a seperate patch, initializing r is needed because I consolidated the return paths to drop the read lock.

In that case better set r when it's not initialized in some path.

Regards,
Christian.



More information about the amd-gfx mailing list