[PATCH 11/11] drm/amdgpu: stop removing BOs from the LRU during CS

Christian König ckoenig.leichtzumerken at gmail.com
Wed May 15 07:04:52 UTC 2019


Hi Prike,

No, retrying like that can lead to massive problems in a real OOM situation
and is not something we can do here.
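
To make the concern concrete, here is a minimal user-space sketch, not driver code. It assumes a hypothetical `fake_validate()` standing in for `ttm_bo_validate()` and a hypothetical `max_spins` cap that exists only so the demo terminates; the proposed `goto retry` has no such cap. When the failure is transient the loop ends after a few attempts, but when memory never becomes available it would spin indefinitely.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical stand-in for ttm_bo_validate().
 * failures_left > 0: fail that many times, then succeed (transient shortage).
 * failures_left < 0: always fail (models a real OOM situation).
 */
static int fake_validate(int *failures_left)
{
	if (*failures_left < 0)
		return -ENOMEM;
	if (*failures_left > 0) {
		(*failures_left)--;
		return -ENOMEM;
	}
	return 0;
}

/*
 * Same shape as the proposed "goto retry" loop, plus an artificial cap
 * (max_spins, an assumption of this sketch) so the demo can terminate.
 * Returns the number of attempts on success, or -1 if the cap was hit.
 */
static long retry_on_enomem(int failures, long max_spins)
{
	long attempts = 0;
	int r;

	assert(max_spins > 0);
retry:
	r = fake_validate(&failures);
	attempts++;
	if (r == -ENOMEM) {
		if (attempts >= max_spins)
			return -1; /* an uncapped loop would still be spinning */
		goto retry;
	}
	return attempts;
}
```

With three transient failures, `retry_on_enomem(3, 1000)` succeeds on the fourth attempt. With `failures = -1` the call only returns because of the cap; without it the caller would never get an error back.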

Christian.

On 15.05.19 at 04:00, Liang, Prike wrote:
>
> Hi Christian,
>
> I wonder whether, when we hit an ENOMEM error while pinning amdgpu BOs,
> we can retry the validation as shown below.
>
> With the following simple patch, the Abaqus pinning issue is no longer
> observed.
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> index 11cbf63..72a32f5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> @@ -902,11 +902,15 @@ int amdgpu_bo_pin_restricted(struct amdgpu_bo *bo, u32 domain,
>                 bo->placements[i].lpfn = lpfn;
>                 bo->placements[i].flags |= TTM_PL_FLAG_NO_EVICT;
>         }
> -
> +retry:
>         r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
>         if (unlikely(r)) {
> -               dev_err(adev->dev, "%p pin failed\n", bo);
> -               goto error;
> +               if (r == -ENOMEM){
> +                       goto retry;
> +               } else {
> +                       dev_err(adev->dev, "%p pin failed\n", bo);
> +                       goto error;
> +               }
>         }
>         bo->pin_count = 1;
>
> Thanks,
>
> Prike
>
> *From:* Marek Olšák <maraeo at gmail.com>
> *Sent:* Wednesday, May 15, 2019 3:33 AM
> *To:* Christian König <ckoenig.leichtzumerken at gmail.com>
> *Cc:* Zhou, David(ChunMing) <David1.Zhou at amd.com>; Liang, Prike 
> <Prike.Liang at amd.com>; dri-devel <dri-devel at lists.freedesktop.org>; 
> amd-gfx mailing list <amd-gfx at lists.freedesktop.org>
> *Subject:* Re: [PATCH 11/11] drm/amdgpu: stop removing BOs from the 
> LRU during CS
>
> This series fixes the OOM errors. However, if I torture the kernel 
> driver more, I can get it to deadlock and end up with unkillable 
> processes. I can also get an OOM error. I just ran the test 5 times:
>
> AMD_DEBUG=testgdsmm glxgears & AMD_DEBUG=testgdsmm glxgears & 
> AMD_DEBUG=testgdsmm glxgears & AMD_DEBUG=testgdsmm glxgears & 
> AMD_DEBUG=testgdsmm glxgears
>
> Marek
>
> On Tue, May 14, 2019 at 8:31 AM Christian König
> <ckoenig.leichtzumerken at gmail.com> wrote:
>
>     This avoids OOM situations when we have lots of threads
>     submitting at the same time.
>
>     Signed-off-by: Christian König <christian.koenig at amd.com>
>     ---
>      drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +-
>      1 file changed, 1 insertion(+), 1 deletion(-)
>
>     diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>     index fff558cf385b..f9240a94217b 100644
>     --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>     +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>     @@ -648,7 +648,7 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
>             }
>
>             r = ttm_eu_reserve_buffers(&p->ticket, &p->validated, true,
>     -                                  &duplicates, true);
>     +                                  &duplicates, false);
>             if (unlikely(r != 0)) {
>                     if (r != -ERESTARTSYS)
>                             DRM_ERROR("ttm_eu_reserve_buffers failed.\n");
>     -- 
>     2.17.1
