[PATCH] drm/amdgpu: add VM update fences back to the root PD

Thu Feb 20 02:55:29 UTC 2020

I was able to bisect it to this commit:

$git bisect good
6643ba1ff05d252e451bada9443759edb95eab3b is the first bad commit
commit 6643ba1ff05d252e451bada9443759edb95eab3b
Author: Luben Tuikov <luben.tuikov at amd.com>
Date:   Mon Feb 10 18:16:45 2020 -0500

    drm/amdgpu: Move to a per-IB secure flag (TMZ)

    Move from a per-CS secure flag (TMZ) to a per-IB
    secure flag.

    Signed-off-by: Luben Tuikov <luben.tuikov at amd.com>
    Reviewed-by: Huang Rui <ray.huang at amd.com>

 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c   |  2 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c   | 23 ++++++++++++++++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.h  |  3 ---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  9 ++++-----
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c   | 23 +++++++----------------
 drivers/gpu/drm/amd/amdgpu/gfx_v6_0.c    |  3 +--
 drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c    |  3 +--
 drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c    |  3 +--
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c    | 20 ++++++--------------
 include/uapi/drm/amdgpu_drm.h            |  7 ++++---
 10 files changed, 44 insertions(+), 52 deletions(-)

It's a bit baffling. Perhaps the new flag clashes with an existing one,
or libdrm also needs to be updated. I will look at it more tomorrow.

My bisect log can be found below.

Regards,
Luben
------------
git bisect start
# good: [31866a9d7d40245316ad7c17b87961f68321cab8] drm/amd/display: Move drm_dp_mst_atomic_check() to the front of dc_validate_global_state()
git bisect good 31866a9d7d40245316ad7c17b87961f68321cab8
# bad: [7fd3b632e17e55c5ffd008f9f025754e7daa1b66] drm/amdgpu: fix colliding of preemption
git bisect bad 7fd3b632e17e55c5ffd008f9f025754e7daa1b66
# good: [41d073f29e59abdfb0d415033772c01c321086c9] drm/amdgpu/vcn2.5: fix warning
git bisect good 41d073f29e59abdfb0d415033772c01c321086c9
# good: [71da21488b65ade2b789416088b9f2493ad3e056] drm/amd/display: fix dtm unloading
git bisect good 71da21488b65ade2b789416088b9f2493ad3e056
# bad: [e3ca25cd2e75824e4dd9e6bb16013ab5f3ec63a6] drm/ttm: individualize resv objects before calling release_notify
git bisect bad e3ca25cd2e75824e4dd9e6bb16013ab5f3ec63a6
# good: [7e3452a6536ee7136a4d79f2369f15d5ce96583c] drm/amdgpu: return -EFAULT if copy_to_user() fails
git bisect good 7e3452a6536ee7136a4d79f2369f15d5ce96583c
# bad: [9b7ac0fb3bbfd6dd001423da497aafec3e8a5131] drm/amdgpu: log on non-zero error conter per IP before GPU reset
git bisect bad 9b7ac0fb3bbfd6dd001423da497aafec3e8a5131
# bad: [6643ba1ff05d252e451bada9443759edb95eab3b] drm/amdgpu: Move to a per-IB secure flag (TMZ)
git bisect bad 6643ba1ff05d252e451bada9443759edb95eab3b
# good: [3387f56e37b2fa8b0fbb3a538bc08daae923bb5f] drm/amd/powerplay: correct the way for checking SMU_FEATURE_BACO_BIT support
git bisect good 3387f56e37b2fa8b0fbb3a538bc08daae923bb5f
# first bad commit: [6643ba1ff05d252e451bada9443759edb95eab3b] drm/amdgpu: Move to a per-IB secure flag (TMZ)
------------

On 2020-02-19 8:02 p.m., Luben Tuikov wrote:
> New developments:
> 
> Running "amdgpu_test -s 1 -t 4" causes timeouts and kernel oopses.
> Attached is the system log; tested on Navi 10:
> 
> [  144.484547] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
> [  149.604641] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=1459, emitted seq=1462
> [  149.604779] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process amdgpu_test pid 2696 thread amdgpu_test pid 2696
> [  149.604788] amdgpu 0000:0b:00.0: GPU reset begin!
> ...
> 
> The kernel is at 7fd3b632e17e55c5ffd008f9f025754e7daa1b66 plus
> the patch of the original post of this thread (thus the "-dirty").
> 
> Running the same test on the previous version of the kernel I was running,
> at 31866a9d7d40245316ad7c17b87961f68321cab8, succeeds as follows:
> 
> Suite: Basic Tests
>   Test: Command submission Test (GFX) ...passed
> 
> Run Summary:    Type  Total    Ran Passed Failed Inactive
>               suites     11      0    n/a      0        0
>                tests     63      1      1      0        0
>              asserts 526725 526725 526725      0      n/a
> 
> Elapsed time =    0.027 seconds
> 
> Regards,
> Luben
> 
> On 2020-02-19 4:40 p.m., Luben Tuikov wrote:
>> On 2020-02-19 9:44 a.m., Christian König wrote:
>>> Well it should apply on top of amd-staging-drm-next. But I haven't 
>>> fetched that today yet.
>>>
>>> Give me a minute to rebase.
>>
>> This patch seems to have fixed the regression we saw yesterday.
>> It applies to amd-staging-drm-next with a small line offset:
>>
>> $patch -p1 < /tmp/\[PATCH\]\ drm_amdgpu\:\ add\ VM\ update\ fences\ back\ to\ the\ root\ PD.eml 
>> patching file amdgpu_vm.c
>> Hunk #2 succeeded at 1599 (offset -20 lines).
>>
>> I've been running 'glxgears' on the root window and 'pinion'
>> with no problems and a clean log.
>>
>> Tested-by: Luben Tuikov <luben.tuikov at amd.com>
>>
>> Regards,
>> Luben
>>
>>>
>>> Christian.
>>>
>>> Am 19.02.20 um 15:27 schrieb Tom St Denis:
>>>> This doesn't apply on top of 7fd3b632e17e55c5ffd008f9f025754e7daa1b66 
>>>> which is the tip of drm-next
>>>>
>>>>
>>>> Tom
>>>>
>>>> On 2020-02-19 9:20 a.m., Christian König wrote:
>>>>> Add update fences to the root PD while mapping BOs.
>>>>>
>>>>> Otherwise PDs freed during the mapping won't wait for
>>>>> updates to finish and can cause corruptions.
>>>>>
>>>>> Signed-off-by: Christian König <christian.koenig at amd.com>
>>>>> ---
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 14 ++++++++++++--
>>>>>   1 file changed, 12 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> index e7ab0c1e2793..dd63ccdbad2a 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> @@ -585,8 +585,8 @@ void amdgpu_vm_get_pd_bo(struct amdgpu_vm *vm,
>>>>>   {
>>>>>       entry->priority = 0;
>>>>>       entry->tv.bo = &vm->root.base.bo->tbo;
>>>>> -    /* One for TTM and one for the CS job */
>>>>> -    entry->tv.num_shared = 2;
>>>>> +    /* Two for VM updates, one for TTM and one for the CS job */
>>>>> +    entry->tv.num_shared = 4;
>>>>>       entry->user_pages = NULL;
>>>>>       list_add(&entry->tv.head, validated);
>>>>>   }
>>>>> @@ -1619,6 +1619,16 @@ static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev,
>>>>>           goto error_unlock;
>>>>>       }
>>>>> +    if (flags & AMDGPU_PTE_VALID) {
>>>>> +        struct amdgpu_bo *root = vm->root.base.bo;
>>>>> +
>>>>> +        if (!dma_fence_is_signaled(vm->last_direct))
>>>>> +            amdgpu_bo_fence(root, vm->last_direct, true);
>>>>> +
>>>>> +        if (!dma_fence_is_signaled(vm->last_delayed))
>>>>> +            amdgpu_bo_fence(root, vm->last_delayed, true);
>>>>> +    }
>>>>> +
>>>>>       r = vm->update_funcs->prepare(&params, resv, sync_mode);
>>>>>       if (r)
>>>>>           goto error_unlock;
>>>
>>
> 
> 
>
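
For anyone following the num_shared change in the quoted patch, here is
a minimal userspace sketch of the slot accounting it implies. It is only
a toy model: every toy_* name is hypothetical and none of this is the
kernel's dma_resv or dma_fence API. It assumes that slots are reserved
up front, that an already signaled fence consumes no slot, and that the
object counts as idle only once every attached fence has signaled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a fence; NOT the kernel's struct dma_fence. */
struct toy_fence {
    bool signaled;
};

#define TOY_MAX_SLOTS 8

/* Toy stand-in for a reservation object with shared fence slots. */
struct toy_resv {
    size_t reserved;                              /* slots reserved up front (cf. num_shared) */
    size_t used;                                  /* slots currently occupied */
    const struct toy_fence *slots[TOY_MAX_SLOTS];
};

/* Reserve room for up to n shared fences; fails if n exceeds the toy limit. */
static bool toy_resv_reserve(struct toy_resv *resv, size_t n)
{
    if (n > TOY_MAX_SLOTS)
        return false;
    resv->reserved = n;
    resv->used = 0;
    return true;
}

/*
 * Attach a fence unless it has already signaled, mirroring the
 * "if (!dma_fence_is_signaled(...))" guard in the quoted patch.
 * Returns false when no reserved slot is left.
 */
static bool toy_resv_add_unsignaled(struct toy_resv *resv,
                                    const struct toy_fence *fence)
{
    if (fence->signaled)
        return true;            /* nothing to wait for, no slot consumed */
    if (resv->used >= resv->reserved)
        return false;           /* reservation was too small */
    resv->slots[resv->used++] = fence;
    return true;
}

/* The object is idle only when every attached fence has signaled. */
static bool toy_resv_idle(const struct toy_resv *resv)
{
    for (size_t i = 0; i < resv->used; i++)
        if (!resv->slots[i]->signaled)
            return false;
    return true;
}
```

In this model, reserving 2 slots and then attaching two pending VM update
fences leaves no room for the TTM and CS fences, while reserving 4 lets
all of them attach. That matches the updated comment "Two for VM updates,
one for TTM and one for the CS job".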