[PATCH 4/4] drm/amdgpu: prevent gpu access during reset recovery
Li, Yunxiang (Teddy)
Yunxiang.Li at amd.com
Thu May 23 11:36:20 UTC 2024
> > +void amdgpu_lock_hw_access(struct amdgpu_device *adev);
> > +void amdgpu_unlock_hw_access(struct amdgpu_device *adev);
> > +int amdgpu_begin_hw_access(struct amdgpu_device *adev);
> > +void amdgpu_end_hw_access(struct amdgpu_device *adev);
>
> Don't add anything to amdgpu.h. We are slowly decomposing that file.
Where would be a better place? I just wanted to have them next to amdgpu_in_reset().
> > @@ -5816,6 +5816,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> > goto skip_hw_reset;
> > }
> >
> > + amdgpu_lock_hw_access(adev);
>
> That should already be locked here. So this will most likely deadlock.
>
> > retry:	/* Rest of adevs pre asic reset from XGMI hive. */
> > 	list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
> > 		r = amdgpu_device_pre_asic_reset(tmp_adev, reset_context);
> > @@ -5852,6 +5853,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> > */
> > amdgpu_device_stop_pending_resets(tmp_adev);
> > }
> > + amdgpu_unlock_hw_access(adev);
> >
> > skip_hw_reset:
>
> Don't add helpers for that, especially not with that name.
>
> We don't block HW access, but just prevent concurrent resets.
Here it takes a different lock than reset_domain->sem. It is a separate reset_domain->gpu_sem that is only locked when we will actually do a reset; it is not taken in the skip_hw_reset path.
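To make the intent concrete, here is a minimal sketch of what I picture the helpers doing, assuming reset_domain grows a gpu_sem rw_semaphore next to the existing sem (illustrative only, not the exact patch):

	/* Writer side: taken only on the path that actually resets the
	 * HW, i.e. not on the skip_hw_reset path.
	 */
	void amdgpu_lock_hw_access(struct amdgpu_device *adev)
	{
		down_write(&adev->reset_domain->gpu_sem);
	}

	void amdgpu_unlock_hw_access(struct amdgpu_device *adev)
	{
		up_write(&adev->reset_domain->gpu_sem);
	}

	/* Reader side: returns 0 if a reset currently owns the HW, in
	 * which case the caller should skip touching registers.
	 */
	int amdgpu_begin_hw_access(struct amdgpu_device *adev)
	{
		return down_read_trylock(&adev->reset_domain->gpu_sem);
	}

	void amdgpu_end_hw_access(struct amdgpu_device *adev)
	{
		up_read(&adev->reset_domain->gpu_sem);
	}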
> > uint32_t seq;
> >
> > - if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
> > - !down_read_trylock(&adev->reset_domain->sem)) {
> > + /*
> > + * A GPU reset should flush all TLBs anyway, so no need to do
> > + * this while one is ongoing.
> > + */
> > + if (!amdgpu_begin_hw_access(adev))
> > + return 0;
> >
> > + if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
> > + amdgpu_in_reset(adev)) {
>
> That check doesn't make sense anymore.
Same here, the two checks are for different scopes. Although I wasn't sure whether the original check is correct: is there a possibility that !adev->gmc.flush_pasid_uses_kiq and !ring->sched.ready are both false but amdgpu_in_reset(adev) is true? And do we want to take this branch when that happens?
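To illustrate the two scopes as I read them (a sketch, branch bodies elided):

	/* Outer scope: a reset owns the HW, skip the flush entirely;
	 * the reset flushes all TLBs anyway.
	 */
	if (!amdgpu_begin_hw_access(adev))
		return 0;

	/* Inner scope: the HW is usable, but pick the MMIO path over
	 * the KIQ path when the KIQ ring cannot be used.
	 */
	if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
	    amdgpu_in_reset(adev)) {
		/* flush via MMIO */
	} else {
		/* flush via KIQ */
	}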
> >
> > @@ -684,12 +684,18 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
> > struct amdgpu_ring *ring = &adev->gfx.kiq[inst].ring;
> > struct amdgpu_kiq *kiq = &adev->gfx.kiq[inst];
> > unsigned int ndw;
> > - signed long r;
> > + signed long r = 0;
>
> Please don't initialize local variables if it isn't necessary.
>
> > if (adev->gmc.flush_tlb_needs_extra_type_2)
> > 	adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
> > 						 2, all_hub,
> > 						 inst);
> > @@ -703,46 +709,42 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
> > adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
> > flush_type, all_hub,
> > inst);
> > - return 0;
> > - }
> > + } else {
>
> That the locking is missing here should be a separate bug fix independent of other changes.
I will split this off into a separate patch. Initializing r is needed because I consolidated the return paths to drop the read lock.
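Roughly, the consolidated shape looks like this (a sketch with the flush condition abridged and error handling elided; the MMIO branch never assigns r, hence the initializer):

	signed long r = 0;	/* MMIO path leaves this untouched */

	if (!amdgpu_begin_hw_access(adev))
		return 0;

	if (/* MMIO path condition, as quoted above */) {
		/* direct register flush, nothing to wait on */
	} else {
		/* KIQ flush; r takes the fence polling result */
	}

	amdgpu_end_hw_access(adev);	/* single place to drop the read lock */
	return r;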