[Patch V2] drm/amdgpu: Increase tlb flush timeout for sriov

Wed Aug 10 17:14:39 UTC 2022

On Wed, Aug 10, 2022 at 12:52 PM Christian König
<ckoenig.leichtzumerken at gmail.com> wrote:
>
>
>
> On 10.08.22 at 10:50, Dusica Milinkovic wrote:
> > [Why]
> > While running a benchmark (Luxmark) on multiple VFs, a KIQ timeout error was observed.
> > It happens because all of the VFs do the TLB invalidation at the same time.
> > Although each VF has the invalidate register set, on the hardware side
> > the invalidate requests are queued for execution.
> >
> > [How]
> > With 12 VFs, increase the timeout to 12 * 100 ms.
> >
> > Signed-off-by: Dusica Milinkovic <Dusica.Milinkovic at amd.com>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 6 +++++-
> >   drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 6 +++++-
> >   2 files changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> > index 9ae8cdaa033e..5743975efea5 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> > @@ -419,6 +419,7 @@ static int gmc_v10_0_flush_gpu_tlb_pasid(struct amdgpu_device *adev,
> >       uint32_t seq;
> >       uint16_t queried_pasid;
> >       bool ret;
> > +     uint32_t sriov_usec_timeout = 1200000;  /* wait for 12 * 100ms for SRIOV */
>
> Please put that as a define into some header, and never ever write
> comments on the same line after a define.
>
>
>
> >       struct amdgpu_ring *ring = &adev->gfx.kiq.ring;
> >       struct amdgpu_kiq *kiq = &adev->gfx.kiq;
> >
> > @@ -437,7 +438,10 @@ static int gmc_v10_0_flush_gpu_tlb_pasid(struct amdgpu_device *adev,
> >
> >               amdgpu_ring_commit(ring);
> >               spin_unlock(&adev->gfx.kiq.ring_lock);
> > -             r = amdgpu_fence_wait_polling(ring, seq, adev->usec_timeout);
> > +             if (amdgpu_sriov_vf(adev))
> > +                     r = amdgpu_fence_wait_polling(ring, seq, sriov_usec_timeout);
> > +             else
> > +                     r = amdgpu_fence_wait_polling(ring, seq, adev->usec_timeout);
>
> Don't duplicate the whole call, just change the parameter.

Per this, see my comment in the previous version of this patch.
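
For illustration only, here is a rough sketch of how the two review
comments above (a define in a header, and a single call with a varying
parameter) might be combined. The macro name SRIOV_KIQ_USEC_TIMEOUT, the
struct fake_adev, and the helper kiq_flush_usec_timeout() are hypothetical
stand-ins so the snippet compiles on its own; they are not identifiers
from the amdgpu driver.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical names throughout: this is a standalone sketch, not
 * driver code. The macro would live in a header in a real patch.
 */

/* 12 VFs * 100 ms each, expressed in microseconds. */
#define SRIOV_KIQ_USEC_TIMEOUT (12U * 100U * 1000U)

/* Minimal stand-in for the two pieces of device state used here. */
struct fake_adev {
	/* Stands in for the result of amdgpu_sriov_vf(adev). */
	bool is_sriov_vf;
	/* Stands in for adev->usec_timeout. */
	uint32_t usec_timeout;
};

/* Pick the timeout once so the wait call itself is not duplicated. */
static uint32_t kiq_flush_usec_timeout(const struct fake_adev *adev)
{
	return adev->is_sriov_vf ? SRIOV_KIQ_USEC_TIMEOUT : adev->usec_timeout;
}
```

With something along these lines, the call site would keep a single
amdgpu_fence_wait_polling(ring, seq, usec_timeout) call and only the
last argument would differ between the SR-IOV and bare-metal cases.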

Alex

>
> Regards,
> Christian.
>
> >               if (r < 1) {
> >                       dev_err(adev->dev, "wait for kiq fence error: %ld.\n", r);
> >                       return -ETIME;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> > index ab89d91975ab..bab26982b3f9 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> > @@ -896,6 +896,7 @@ static int gmc_v9_0_flush_gpu_tlb_pasid(struct amdgpu_device *adev,
> >       uint32_t seq;
> >       uint16_t queried_pasid;
> >       bool ret;
> > +     uint32_t sriov_usec_timeout = 1200000;  /* wait for 12 * 100ms for SRIOV */
> >       struct amdgpu_ring *ring = &adev->gfx.kiq.ring;
> >       struct amdgpu_kiq *kiq = &adev->gfx.kiq;
> >
> > @@ -935,7 +936,10 @@ static int gmc_v9_0_flush_gpu_tlb_pasid(struct amdgpu_device *adev,
> >
> >               amdgpu_ring_commit(ring);
> >               spin_unlock(&adev->gfx.kiq.ring_lock);
> > -             r = amdgpu_fence_wait_polling(ring, seq, adev->usec_timeout);
> > +             if (amdgpu_sriov_vf(adev))
> > +                     r = amdgpu_fence_wait_polling(ring, seq, sriov_usec_timeout);
> > +             else
> > +                     r = amdgpu_fence_wait_polling(ring, seq, adev->usec_timeout);
> >               if (r < 1) {
> >                       dev_err(adev->dev, "wait for kiq fence error: %ld.\n", r);
> >                       up_read(&adev->reset_domain->sem);
>