[PATCH 4/4] drm/amdgpu: prevent gpu access during reset recovery
Li, Yunxiang (Teddy)
Yunxiang.Li at amd.com
Thu May 23 11:36:20 UTC 2024
> > +void amdgpu_lock_hw_access(struct amdgpu_device *adev);
> > +void amdgpu_unlock_hw_access(struct amdgpu_device *adev);
> > +int amdgpu_begin_hw_access(struct amdgpu_device *adev);
> > +void amdgpu_end_hw_access(struct amdgpu_device *adev);
>
> Don't add anything to amdgpu.h. We are slowly decomposing that file.
Where would be a better place? I just wanted to have them next to amdgpu_in_reset().
> > @@ -5816,6 +5816,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> > goto skip_hw_reset;
> > }
> >
> > + amdgpu_lock_hw_access(adev);
>
> That should already be locked here. So this will most likely deadlock.
>
> > retry:	/* Rest of adevs pre asic reset from XGMI hive. */
> > 	list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
> > 		r = amdgpu_device_pre_asic_reset(tmp_adev, reset_context);
> > @@ -5852,6 +5853,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> > */
> > amdgpu_device_stop_pending_resets(tmp_adev);
> > }
> > + amdgpu_unlock_hw_access(adev);
> >
> > skip_hw_reset:
>
> Don't add helpers for that, especially not with that name.
>
> We don't block HW access, but just prevent concurrent resets.
Here it takes a different lock than reset_domain->sem. It is a separate reset_domain->gpu_sem that is only locked when we will actually do a reset; it is not taken in the skip_hw_reset path.
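To make the intent concrete, here is a minimal sketch of what I picture the helpers doing, assuming reset_domain grows a gpu_sem rw_semaphore next to the existing sem (illustrative only, not the exact patch):

	/* Writer side: taken only on the path that actually resets the
	 * HW, i.e. not on the skip_hw_reset path.
	 */
	void amdgpu_lock_hw_access(struct amdgpu_device *adev)
	{
		down_write(&adev->reset_domain->gpu_sem);
	}

	void amdgpu_unlock_hw_access(struct amdgpu_device *adev)
	{
		up_write(&adev->reset_domain->gpu_sem);
	}

	/* Reader side: returns 0 if a reset currently owns the HW, in
	 * which case the caller should skip touching registers.
	 */
	int amdgpu_begin_hw_access(struct amdgpu_device *adev)
	{
		return down_read_trylock(&adev->reset_domain->gpu_sem);
	}

	void amdgpu_end_hw_access(struct amdgpu_device *adev)
	{
		up_read(&adev->reset_domain->gpu_sem);
	}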
> > uint32_t seq;
> >
> > - if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
> > - !down_read_trylock(&adev->reset_domain->sem)) {
> > + /*
> > + * A GPU reset should flush all TLBs anyway, so no need to do
> > + * this while one is ongoing.
> > + */
> > + if (!amdgpu_begin_hw_access(adev))
> > + return 0;
> >
> > + if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
> > + amdgpu_in_reset(adev)) {
>
> That check doesn't make sense anymore.
Same here, the two checks are for different scopes. Although I wasn't sure whether the original check is correct: is there a possibility that !adev->gmc.flush_pasid_uses_kiq and !ring->sched.ready are both false but amdgpu_in_reset(adev) is true? And do we want to take this branch when that happens?
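To illustrate the two scopes as I read them (a sketch, branch bodies elided):

	/* Outer scope: a reset owns the HW, skip the flush entirely;
	 * the reset flushes all TLBs anyway.
	 */
	if (!amdgpu_begin_hw_access(adev))
		return 0;

	/* Inner scope: the HW is usable, but pick the MMIO path over
	 * the KIQ path when the KIQ ring cannot be used.
	 */
	if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
	    amdgpu_in_reset(adev)) {
		/* flush via MMIO */
	} else {
		/* flush via KIQ */
	}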
> >
> > @@ -684,12 +684,18 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
> > struct amdgpu_ring *ring = &adev->gfx.kiq[inst].ring;
> > struct amdgpu_kiq *kiq = &adev->gfx.kiq[inst];
> > unsigned int ndw;
> > - signed long r;
> > + signed long r = 0;
>
> Please don't initialize local variables if it isn't necessary.
>
> > if (adev->gmc.flush_tlb_needs_extra_type_2)
> > 	adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
> > 						 2, all_hub,
> > 						 inst);
> > @@ -703,46 +709,42 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
> > adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
> > flush_type, all_hub,
> > inst);
> > - return 0;
> > - }
> > + } else {
>
> That the locking is missing here should be a separate bug fix independent of other changes.
I will split this off into a separate patch. Initializing r is needed because I consolidated the return paths to drop the read lock.
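Roughly, the consolidated shape looks like this (a sketch with the flush condition abridged and error handling elided; the MMIO branch never assigns r, hence the initializer):

	signed long r = 0;	/* MMIO path leaves this untouched */

	if (!amdgpu_begin_hw_access(adev))
		return 0;

	if (/* MMIO path condition, as quoted above */) {
		/* direct register flush, nothing to wait on */
	} else {
		/* KIQ flush; r takes the fence polling result */
	}

	amdgpu_end_hw_access(adev);	/* single place to drop the read lock */
	return r;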