<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:DengXian;
        panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
        {font-family:Consolas;
        panose-1:2 11 6 9 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        margin-bottom:.0001pt;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;
        color:black;}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:purple;
        text-decoration:underline;}
pre
        {mso-style-priority:99;
        mso-style-link:"HTML Preformatted Char";
        margin:0in;
        margin-bottom:.0001pt;
        font-size:10.0pt;
        font-family:"Courier New";
        color:black;}
p.msonormal0, li.msonormal0, div.msonormal0
        {mso-style-name:msonormal;
        mso-margin-top-alt:auto;
        margin-right:0in;
        mso-margin-bottom-alt:auto;
        margin-left:0in;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;
        color:black;}
p.emailquote, li.emailquote, div.emailquote
        {mso-style-name:emailquote;
        mso-margin-top-alt:auto;
        margin-right:0in;
        mso-margin-bottom-alt:auto;
        margin-left:1.0pt;
        border:none;
        padding:0in;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;
        color:black;}
span.HTMLPreformattedChar
        {mso-style-name:"HTML Preformatted Char";
        mso-style-priority:99;
        mso-style-link:"HTML Preformatted";
        font-family:Consolas;
        color:black;}
span.EmailStyle21
        {mso-style-type:personal-reply;
        font-family:"Calibri",sans-serif;
        color:windowtext;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-size:10.0pt;}
@page WordSection1
        {size:8.5in 11.0in;
        margin:1.0in 1.25in 1.0in 1.25in;}
div.WordSection1
        {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body bgcolor="white" lang="EN-US" link="blue" vlink="purple">
<div class="WordSection1">
<p class="MsoNormal"><span style="color:windowtext">Thanks to Christian for the proposal and to David for drafting the implementation.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">The pinned-BO failures from prepare_fb are no longer observed, but the Abaqus job still could not finish over the whole night.
<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">Regarding the EBUSY error with a NULL first BO: it occurs while amdgpu_cs_bo_validate is running, as the call stack below shows. The NULL-first-BO debug message is now printed endlessly while Abaqus
 is running, so it seems that amdgpu_cs_validate is stuck in an endless loop invoking amdgpu_cs_bo_validate.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">lxj ttm_mem_evict_first first_bo=          (null),request_resv=ffff929d47b33218,request_resv->lock.ctx=ffff929b8d6bfbd8<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">[ 2703.091731] CPU: 3 PID: 10739 Comm: standard Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.el7.x86_64 #1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">[ 2703.103046] Hardware name: MSI MS-7984/Z170 KRAIT GAMING (MS-7984), BIOS B.80 05/11/2016<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">[ 2703.111181] Call Trace:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">[ 2703.113745]  [<ffffffff81961dc1>] dump_stack+0x19/0x1b<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">[ 2703.118979]  [<ffffffffc055cd19>] ttm_mem_evict_first+0x3a9/0x400 [amdttm]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">[ 2703.125974]  [<ffffffffc055d05b>] amdttm_bo_mem_space+0x2eb/0x4a0 [amdttm]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">[ 2703.132967]  [<ffffffffc055d6e4>] amdttm_bo_validate+0xc4/0x140 [amdttm]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">[ 2703.139827]  [<ffffffffc059fed5>] amdgpu_cs_bo_validate+0xa5/0x220 [amdgpu]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">[ 2703.146879]  [<ffffffffc05a0097>] amdgpu_cs_validate+0x47/0x2e0 [amdgpu]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">[ 2703.153776]  [<ffffffffc05b41a2>] ? amdgpu_vm_del_from_lru_notify+0x12/0x80 [amdgpu]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">[ 2703.161707]  [<ffffffffc05a0050>] ? amdgpu_cs_bo_validate+0x220/0x220 [amdgpu]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">[ 2703.169018]  [<ffffffffc05b4452>] amdgpu_vm_validate_pt_bos+0x92/0x140 [amdgpu]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">[ 2703.176512]  [<ffffffffc05a23e5>] amdgpu_cs_ioctl+0x18a5/0x1d40 [amdgpu]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">[ 2703.183372]  [<ffffffffc05a0b40>] ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">[ 2703.190815]  [<ffffffffc042df2c>] drm_ioctl_kernel+0x6c/0xb0 [drm]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">[ 2703.197109]  [<ffffffffc042e647>] drm_ioctl+0x1e7/0x420 [drm]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">[ 2703.202995]  [<ffffffffc05a0b40>] ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">[ 2703.210471]  [<ffffffffc058004b>] amdgpu_drm_ioctl+0x4b/0x80 [amdgpu]<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">[ 2703.217019]  [<ffffffff81456210>] do_vfs_ioctl+0x3a0/0x5a0<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">[ 2703.222596]  [<ffffffff8196744a>] ? __schedule+0x13a/0x890<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">[ 2703.228172]  [<ffffffff814564b1>] SyS_ioctl+0xa1/0xc0<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">[ 2703.233308]  [<ffffffff81974ddb>] system_call_fastpath+0x22/0x27<o:p></o:p></span></p>
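<p class="MsoNormal"><span style="color:windowtext">To make Christian's explanation (quoted below) concrete: if the BOs reserved by the running CS are taken off the LRU, the eviction scan never meets a BO whose trylock fails, so first_bo stays NULL and the function falls through to -EBUSY. The following is a minimal userspace model under that assumption, not TTM code; model_bo, find_evictable() and the array-based LRU are hypothetical, eviction_valuable() is ignored, and a plain flag stands in for reservation_object_trylock().<o:p></o:p></span></p>
<pre>
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for a TTM buffer object. */
struct model_bo {
	int id;
	bool on_lru;   /* false while a CS has pulled the BO off the LRU */
	bool reserved; /* true while another context holds its reservation */
};

/*
 * Rough model of the LRU scan: return the first BO whose reservation is
 * free and record the first BO whose "trylock" failed in *first_busy.
 * A NULL return means the caller would report -EBUSY.
 */
static struct model_bo *find_evictable(struct model_bo *lru, size_t n,
				       struct model_bo **first_busy)
{
	assert(first_busy != NULL);
	*first_busy = NULL;
	for (size_t i = 0; i < n; ++i) {
		struct model_bo *bo = &lru[i];

		if (!bo->on_lru)
			continue; /* removed from the LRU: the scan never sees it */
		if (bo->reserved) {
			if (!*first_busy)
				*first_busy = bo; /* trylock failed: busy candidate */
			continue;
		}
		return bo; /* trylock succeeded: this BO would be evicted */
	}
	return NULL;
}
```
</pre>
<p class="MsoNormal"><span style="color:windowtext">With every reserved BO off the LRU, the scan returns NULL and leaves first_busy NULL, which matches the first_bo=(null) debug line above; once a reserved BO is back on the LRU, it becomes the busy candidate.<o:p></o:p></span></p>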
<p class="MsoNormal"><span style="color:windowtext"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">Thanks,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:windowtext">Prike<o:p></o:p></span></p>
<div>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b><span style="color:windowtext">From:</span></b><span style="color:windowtext"> Christian König <ckoenig.leichtzumerken@gmail.com>
<br>
<b>Sent:</b> Thursday, May 09, 2019 10:59 PM<br>
<b>To:</b> Zhou, David(ChunMing) <David1.Zhou@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Liang, Prike <Prike.Liang@amd.com>; dri-devel@lists.freedesktop.org<br>
<b>Subject:</b> Re: [PATCH 1/2] drm/ttm: fix busy memory to fail other user v7<o:p></o:p></span></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<p class="MsoNormal">Oh, I know where this is coming from.<br>
<br>
The problem is that we remove the BOs from the LRU during CS and so we can't wait for the CS to finish up.<br>
<br>
Already working on this problem for Marek's similar issue,<br>
Christian.<br>
<br>
On 09.05.19 at 16:46, Zhou, David(ChunMing) wrote:<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<div>
<p class="MsoNormal" style="margin-bottom:12.0pt">I knew that already; it only issues a warning when the debug option is enabled. Removing that is OK with me.<br>
I only helped Prike draft your idea, and Prike is testing this patch on his side. The latest feedback he gave me is that first_bo is always NULL and the code never runs into the busy path, which is very confusing to me; he said he is debugging that.<br>
<br>
-David<br>
<br>
<br>
-------- Original Message --------<br>
Subject: Re: [PATCH 1/2] drm/ttm: fix busy memory to fail other user v7<br>
From: "Koenig, Christian" <br>
To: "Zhou, David(ChunMing)" ,"Liang, Prike" ,<a href="mailto:dri-devel@lists.freedesktop.org">dri-devel@lists.freedesktop.org</a><br>
CC: <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal" style="margin-bottom:12.0pt">I've found one more problem with this.<br>
<br>
With lockdep enabled I get a warning because ttm_eu_reserve_buffers() <br>
has called ww_acquire_done() on the ticket (which essentially means we <br>
are done, no more locking with that ticket).<br>
<br>
The simplest solution is probably to just remove the call to <br>
ww_acquire_done() from ttm_eu_reserve_buffers().<br>
<br>
Christian.<br>
<br>
On 07.05.19 at 13:45, Chunming Zhou wrote:<br>
> A heavy GPU job can occupy memory for a long time, which causes other users to fail to get memory.<br>
><br>
> This basically picks up Christian's idea:<br>
><br>
> 1. Reserve the BO in DC using a ww_mutex ticket (trivial).<br>
> 2. If we then run into this EBUSY condition in TTM check if the BO we need memory for (or rather the ww_mutex of its reservation object) has a ticket assigned.<br>
> 3. If we have a ticket we grab a reference to the first BO on the LRU, drop the LRU lock and try to grab the reservation lock with the ticket.<br>
> 4. If getting the reservation lock with the ticket succeeded we check if the BO is still the first one on the LRU in question (the BO could have moved).<br>
> 5. If the BO is still the first one on the LRU in question we try to evict it as we would evict any other BO.<br>
> 6. If any of the "If's" above fail we just back off and return -EBUSY.<br>
><br>
> v2: fix some minor checks<br>
> v3: address Christian's v2 comments.<br>
> v4: fix some missing<br>
> v5: handle first_bo unlock and bo_get/put<br>
> v6: abstract a unified iterate function, and handle all possible use cases, not only pinned BOs.<br>
> v7: pass the requesting bo->resv to ttm_mem_evict_first<br>
><br>
> Change-Id: I21423fb922f885465f13833c41df1e134364a8e7<br>
> Signed-off-by: Chunming Zhou <a href="mailto:david1.zhou@amd.com"><david1.zhou@amd.com></a><br>
> ---<br>
>   drivers/gpu/drm/ttm/ttm_bo.c | 111 +++++++++++++++++++++++++++++------<br>
>   1 file changed, 94 insertions(+), 17 deletions(-)<br>
><br>
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c<br>
> index 8502b3ed2d88..f5e6328e4a57 100644<br>
> --- a/drivers/gpu/drm/ttm/ttm_bo.c<br>
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c<br>
> @@ -766,11 +766,13 @@ EXPORT_SYMBOL(ttm_bo_eviction_valuable);<br>
>    * b. Otherwise, trylock it.<br>
>    */<br>
>   static bool ttm_bo_evict_swapout_allowable(struct ttm_buffer_object *bo,<br>
> -                     struct ttm_operation_ctx *ctx, bool *locked)<br>
> +                     struct ttm_operation_ctx *ctx, bool *locked, bool *busy)<br>
>   {<br>
>        bool ret = false;<br>
>   <br>
>        *locked = false;<br>
> +     if (busy)<br>
> +             *busy = false;<br>
>        if (bo->resv == ctx->resv) {<br>
>                reservation_object_assert_held(bo->resv);<br>
>                if (ctx->flags & TTM_OPT_FLAG_ALLOW_RES_EVICT<br>
> @@ -779,35 +781,46 @@ static bool ttm_bo_evict_swapout_allowable(struct ttm_buffer_object *bo,<br>
>        } else {<br>
>                *locked = reservation_object_trylock(bo->resv);<br>
>                ret = *locked;<br>
> +             if (!ret && busy)<br>
> +                     *busy = true;<br>
>        }<br>
>   <br>
>        return ret;<br>
>   }<br>
>   <br>
> -static int ttm_mem_evict_first(struct ttm_bo_device *bdev,<br>
> -                            uint32_t mem_type,<br>
> -                            const struct ttm_place *place,<br>
> -                            struct ttm_operation_ctx *ctx)<br>
> +static struct ttm_buffer_object*<br>
> +ttm_mem_find_evitable_bo(struct ttm_bo_device *bdev,<br>
> +                      struct ttm_mem_type_manager *man,<br>
> +                      const struct ttm_place *place,<br>
> +                      struct ttm_operation_ctx *ctx,<br>
> +                      struct ttm_buffer_object **first_bo,<br>
> +                      bool *locked)<br>
>   {<br>
> -     struct ttm_bo_global *glob = bdev->glob;<br>
> -     struct ttm_mem_type_manager *man = &bdev->man[mem_type];<br>
>        struct ttm_buffer_object *bo = NULL;<br>
> -     bool locked = false;<br>
> -     unsigned i;<br>
> -     int ret;<br>
> +     int i;<br>
>   <br>
> -     spin_lock(&glob->lru_lock);<br>
> +     if (first_bo)<br>
> +             *first_bo = NULL;<br>
>        for (i = 0; i < TTM_MAX_BO_PRIORITY; ++i) {<br>
>                list_for_each_entry(bo, &man->lru[i], lru) {<br>
> -                     if (!ttm_bo_evict_swapout_allowable(bo, ctx, &locked))<br>
> +                     bool busy = false;<br>
> +<br>
> +                     if (!ttm_bo_evict_swapout_allowable(bo, ctx, locked,<br>
> +                                                         &busy)) {<br>
> +                             if (first_bo && !(*first_bo) && busy) {<br>
> +                                     ttm_bo_get(bo);<br>
> +                                     *first_bo = bo;<br>
> +                             }<br>
>                                continue;<br>
> +                     }<br>
>   <br>
>                        if (place && !bdev->driver->eviction_valuable(bo,<br>
>                                                                      place)) {<br>
> -                             if (locked)<br>
> +                             if (*locked)<br>
>                                        reservation_object_unlock(bo->resv);<br>
>                                continue;<br>
>                        }<br>
> +<br>
>                        break;<br>
>                }<br>
>   <br>
> @@ -818,9 +831,67 @@ static int ttm_mem_evict_first(struct ttm_bo_device *bdev,<br>
>                bo = NULL;<br>
>        }<br>
>   <br>
> +     return bo;<br>
> +}<br>
> +<br>
> +static int ttm_mem_evict_first(struct ttm_bo_device *bdev,<br>
> +                            uint32_t mem_type,<br>
> +                            const struct ttm_place *place,<br>
> +                            struct ttm_operation_ctx *ctx,<br>
> +                            struct reservation_object *request_resv)<br>
> +{<br>
> +     struct ttm_bo_global *glob = bdev->glob;<br>
> +     struct ttm_mem_type_manager *man = &bdev->man[mem_type];<br>
> +     struct ttm_buffer_object *bo = NULL, *first_bo = NULL;<br>
> +     bool locked = false;<br>
> +     int ret;<br>
> +<br>
> +     spin_lock(&glob->lru_lock);<br>
> +     bo = ttm_mem_find_evitable_bo(bdev, man, place, ctx, &first_bo,<br>
> +                                   &locked);<br>
>        if (!bo) {<br>
> +             struct ttm_operation_ctx busy_ctx;<br>
> +<br>
>                spin_unlock(&glob->lru_lock);<br>
> -             return -EBUSY;<br>
> +             /* check if other user occupy memory too long time */<br>
> +             if (!first_bo || !request_resv || !request_resv->lock.ctx) {<br>
> +                     if (first_bo)<br>
> +                             ttm_bo_put(first_bo);<br>
> +                     return -EBUSY;<br>
> +             }<br>
> +             if (first_bo->resv == request_resv) {<br>
> +                     ttm_bo_put(first_bo);<br>
> +                     return -EBUSY;<br>
> +             }<br>
> +             if (ctx->interruptible)<br>
> +                     ret = ww_mutex_lock_interruptible(&first_bo->resv->lock,<br>
> +                                                       request_resv->lock.ctx);<br>
> +             else<br>
> +                     ret = ww_mutex_lock(&first_bo->resv->lock, request_resv->lock.ctx);<br>
> +             if (ret) {<br>
> +                     ttm_bo_put(first_bo);<br>
> +                     return ret;<br>
> +             }<br>
> +             spin_lock(&glob->lru_lock);<br>
> +             /* previous busy resv lock is held by above, idle now,<br>
> +              * so let them evictable.<br>
> +              */<br>
> +             busy_ctx.interruptible = ctx->interruptible;<br>
> +             busy_ctx.no_wait_gpu   = ctx->no_wait_gpu;<br>
> +             busy_ctx.resv          = first_bo->resv;<br>
> +             busy_ctx.flags         = TTM_OPT_FLAG_ALLOW_RES_EVICT;<br>
> +<br>
> +             bo = ttm_mem_find_evitable_bo(bdev, man, place, &busy_ctx, NULL,<br>
> +                                           &locked);<br>
> +             if (bo && (bo->resv == first_bo->resv))<br>
> +                     locked = true;<br>
> +             else if (bo)<br>
> +                     ww_mutex_unlock(&first_bo->resv->lock);<br>
> +             if (!bo) {<br>
> +                     spin_unlock(&glob->lru_lock);<br>
> +                     ttm_bo_put(first_bo);<br>
> +                     return -EBUSY;<br>
> +             }<br>
>        }<br>
>   <br>
>        kref_get(&bo->list_kref);<br>
> @@ -829,11 +900,15 @@ static int ttm_mem_evict_first(struct ttm_bo_device *bdev,<br>
>                ret = ttm_bo_cleanup_refs(bo, ctx->interruptible,<br>
>                                          ctx->no_wait_gpu, locked);<br>
>                kref_put(&bo->list_kref, ttm_bo_release_list);<br>
> +             if (first_bo)<br>
> +                     ttm_bo_put(first_bo);<br>
>                return ret;<br>
>        }<br>
>   <br>
>        ttm_bo_del_from_lru(bo);<br>
>        spin_unlock(&glob->lru_lock);<br>
> +     if (first_bo)<br>
> +             ttm_bo_put(first_bo);<br>
>   <br>
>        ret = ttm_bo_evict(bo, ctx);<br>
>        if (locked) {<br>
> @@ -907,7 +982,7 @@ static int ttm_bo_mem_force_space(struct ttm_buffer_object *bo,<br>
>                        return ret;<br>
>                if (mem->mm_node)<br>
>                        break;<br>
> -             ret = ttm_mem_evict_first(bdev, mem_type, place, ctx);<br>
> +             ret = ttm_mem_evict_first(bdev, mem_type, place, ctx, bo->resv);<br>
>                if (unlikely(ret != 0))<br>
>                        return ret;<br>
>        } while (1);<br>
> @@ -1413,7 +1488,8 @@ static int ttm_bo_force_list_clean(struct ttm_bo_device *bdev,<br>
>        for (i = 0; i < TTM_MAX_BO_PRIORITY; ++i) {<br>
>                while (!list_empty(&man->lru[i])) {<br>
>                        spin_unlock(&glob->lru_lock);<br>
> -                     ret = ttm_mem_evict_first(bdev, mem_type, NULL, &ctx);<br>
> +                     ret = ttm_mem_evict_first(bdev, mem_type, NULL, &ctx,<br>
> +                                               NULL);<br>
>                        if (ret)<br>
>                                return ret;<br>
>                        spin_lock(&glob->lru_lock);<br>
> @@ -1784,7 +1860,8 @@ int ttm_bo_swapout(struct ttm_bo_global *glob, struct ttm_operation_ctx *ctx)<br>
>        spin_lock(&glob->lru_lock);<br>
>        for (i = 0; i < TTM_MAX_BO_PRIORITY; ++i) {<br>
>                list_for_each_entry(bo, &glob->swap_lru[i], swap) {<br>
> -                     if (ttm_bo_evict_swapout_allowable(bo, ctx, &locked)) {<br>
> +                     if (ttm_bo_evict_swapout_allowable(bo, ctx, &locked,<br>
> +                                                        NULL)) {<br>
>                                ret = 0;<br>
>                                break;<br>
>                        }<o:p></o:p></p>
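<p class="MsoNormal">Steps 2 to 6 of the commit message can be sketched as a single-threaded userspace model. This is a hedged illustration, not the patch: model_ticket, model_resv, model_bo, model_lru and evict_busy_first() are hypothetical stand-ins for ww_acquire_ctx, reservation_object, ttm_buffer_object and the LRU, and where the kernel sleeps in ww_mutex_lock() until the busy user releases the lock, the model simply fails.</p>
<pre>
```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins: none of these are real TTM or ww_mutex types. */
struct model_ticket { int stamp; };          /* plays the ww_acquire_ctx */
struct model_resv {                          /* plays the reservation_object */
	const struct model_ticket *holder;   /* NULL when unlocked */
};
struct model_bo {
	struct model_resv *resv;
	int refs;
	bool evicted;
};
struct model_lru {
	struct model_bo *head;               /* first BO on the LRU, or NULL */
};

/* Steps 2-6 for a BO that the LRU scan reported as busy. */
static int evict_busy_first(struct model_lru *lru, struct model_bo *first,
			    struct model_resv *request_resv,
			    const struct model_ticket *ticket)
{
	int ret = -EBUSY;

	assert(lru != NULL);
	/* Step 2: no busy BO or no ticket means nothing safe to wait on. */
	if (!first || !ticket)
		return -EBUSY;
	/* Same reservation as the requester: the patch backs off here too. */
	if (first->resv == request_resv)
		return -EBUSY;

	first->refs++;                       /* step 3: keep the BO alive */
	if (first->resv->holder) {           /* step 3: lock with the ticket */
		first->refs--;
		return -EBUSY;
	}
	first->resv->holder = ticket;

	/* Steps 4-5: evict only if it is still first on the LRU. */
	if (lru->head == first) {
		first->evicted = true;
		lru->head = NULL;
		ret = 0;
	}

	first->resv->holder = NULL;          /* unlock */
	first->refs--;                       /* drop our reference */
	return ret;                          /* step 6: -EBUSY on any failed check */
}
```
</pre>
<p class="MsoNormal">A caller would pass the first busy BO found by the scan (first_bo in the patch) and the requesting BO's reservation. In the kernel the lock step can also fail with an error such as -EDEADLK or -EINTR, which the patch propagates instead of -EBUSY.</p>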
</div>
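<p class="MsoNormal">Christian's lockdep point can be illustrated with a small model: once ttm_eu_reserve_buffers() has called ww_acquire_done() on the ticket, a later lock taken with that same ticket (the ww_mutex_lock() on first_bo->resv in ttm_mem_evict_first()) is what a debug kernel warns about. The types and functions below are hypothetical stand-ins, not the kernel's ww_mutex API; only the debug check is modelled.</p>
<pre>
```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of a ww_acquire_ctx as seen by a debug kernel. */
struct model_acquire_ctx {
	bool done_acquire; /* set by the ww_acquire_done() stand-in */
	int locks_held;
	int warnings;      /* what a debug build would warn about */
};

static void model_acquire_done(struct model_acquire_ctx *ctx)
{
	ctx->done_acquire = true;
}

/* Stand-in for ww_mutex_lock(lock, ctx) that only tracks the debug check. */
static void model_lock(struct model_acquire_ctx *ctx)
{
	assert(ctx != NULL);
	if (ctx->done_acquire) {
		ctx->warnings++;
		fprintf(stderr, "warning: lock taken after acquire_done\n");
	}
	ctx->locks_held++;
}

/* Reservation phase in the style of ttm_eu_reserve_buffers(). */
static void model_reserve(struct model_acquire_ctx *ctx, int nr_bos,
			  bool call_done)
{
	for (int i = 0; i < nr_bos; ++i)
		model_lock(ctx);
	if (call_done)
		model_acquire_done(ctx); /* the call Christian suggests removing */
}
```
</pre>
<p class="MsoNormal">Reserving with call_done set and then taking one more lock produces one warning; without it, the extra lock is silent. Dropping the ww_acquire_done() call removes the warning at the cost of losing that debug check for callers that really are finished locking.</p>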
<p class="MsoNormal"><br>
<br>
<o:p></o:p></p>
<pre>_______________________________________________<o:p></o:p></pre>
<pre>dri-devel mailing list<o:p></o:p></pre>
<pre><a href="mailto:dri-devel@lists.freedesktop.org">dri-devel@lists.freedesktop.org</a><o:p></o:p></pre>
<pre><a href="https://lists.freedesktop.org/mailman/listinfo/dri-devel">https://lists.freedesktop.org/mailman/listinfo/dri-devel</a><o:p></o:p></pre>
</blockquote>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
</div>
</body>
</html>