回复: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference
Liu, Monk
Monk.Liu at amd.com
Thu Feb 20 13:35:03 UTC 2020
Sorry, my previous idea still leave RQ null, please check if below method works:
29 static struct drm_sched_rq *
130 drm_sched_entity_get_free_sched(struct drm_sched_entity *entity)
131 {
132 struct drm_sched_rq *rq = NULL;
133 unsigned int min_jobs = UINT_MAX, num_jobs;
134 int i;
135
While (!mutex_trylock(....))
Sleep()
136 for (i = 0; i < entity->num_rq_list; ++i) {
137 struct drm_gpu_scheduler *sched = entity->rq_list[i]->sched;
138
139 if (!entity->rq_list[i]->sched->ready) { //we take the gpu reset mutex lock, so now sched->ready won't be set to "not ready"
140 DRM_WARN("sched%s is not ready, skipping", sched->name);
141 continue;
142 }
143
144 num_jobs = atomic_read(&sched->num_jobs);
145 if (num_jobs < min_jobs) {
146 min_jobs = num_jobs;
147 rq = entity->rq_list[i];
148 }
149 }
Mutex_unlock(...)
150
151 return rq;
152 }
-----邮件原件-----
发件人: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> 代表 Liu, Monk
发送时间: 2020年2月20日 21:28
收件人: Li, Dennis <Dennis.Li at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>; amd-gfx at lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>; Chen, Guchun <Guchun.Chen at amd.com>
主题: 回复: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference
so the scenario is:
KMD doing page update -> ECC -> baco reset (mute SDMA scheduler) -> drm_sched_entity_get_free_sched() return NULL -> you get your NULL pointer
And your approach is to introduce a bail out in the amdgpu_vm_sdma_commit() so the original amdgpu vm update jobs will be dropped,
That really can fix your NULL pointer issue, but actually bring another issue: you didn't finish the vm update work !
my initial idea is you do something like:
static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
...
struct amdgpu_device *adev = p->adev;
while (!mutex_trylock(&adev->lock_reset)) //means GPU is under recovering
msleep(xxx);
....
mutex_unlock(&adev->lock_reset); //this put at the end of this routine }
that do make our implement more complicated, but at least it doesn't drop the necessary vm update, otherwise you will run into other problems ...
@Koenig, Christian do you have another solution ?
Thanks
-----邮件原件-----
发件人: Li, Dennis <Dennis.Li at amd.com>
发送时间: 2020年2月20日 16:41
收件人: Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>; amd-gfx at lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>; Chen, Guchun <Guchun.Chen at amd.com>
主题: RE: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference
[AMD Official Use Only - Internal Distribution Only]
Hi, Christian and Monk,
When doing SDMA copy, a RAS uncorrectable error happens, which will cause this issue. The RAS uncorrectable error event will trigger driver to do BACO reset which will set the status of SDMA scheduler to no ready. And then drm_sched_entity_get_free_sched will return NULL in drm_sched_entity_select_rq, which cause entity->rq to NULL.
Best Regards
Dennis Li
-----Original Message-----
From: Liu, Monk <Monk.Liu at amd.com>
Sent: Wednesday, February 19, 2020 7:30 PM
To: Koenig, Christian <Christian.Koenig at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>; Li, Dennis <Dennis.Li at amd.com>; amd-gfx at lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>; Chen, Guchun <Guchun.Chen at amd.com>
Subject: 回复: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference
> + if (!entity->rq)
> + return 0;
> +
Yes, supposedly we shouldn't get 'entity->rq == NULL' case , that looks the true bug
-----邮件原件-----
发件人: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> 代表 Christian K?nig
发送时间: 2020年2月19日 18:50
收件人: Zhang, Hawking <Hawking.Zhang at amd.com>; Li, Dennis <Dennis.Li at amd.com>; amd-gfx at lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>; Chen, Guchun <Guchun.Chen at amd.com>
主题: Re: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference
Well of hand this patch looks like a clear NAK to me.
Returning without raising an error is certainly the wrong thing to do here because we just drop the necessary page table updates.
How does the entity->rq ends up as NULL in the first place?
Regards,
Christian.
Am 19.02.20 um 07:26 schrieb Zhang, Hawking:
> [AMD Official Use Only - Internal Distribution Only]
>
> Reviewed-by: Hawking Zhang <Hawking.Zhang at amd.com>
>
> Regards,
> Hawking
> -----Original Message-----
> From: Dennis Li <Dennis.Li at amd.com>
> Sent: Wednesday, February 19, 2020 12:05
> To: amd-gfx at lists.freedesktop.org; Deucher, Alexander
> <Alexander.Deucher at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>; Zhang,
> Hawking <Hawking.Zhang at amd.com>; Chen, Guchun <Guchun.Chen at amd.com>
> Cc: Li, Dennis <Dennis.Li at amd.com>
> Subject: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference
>
> check whether the queue of entity is null to avoid null pointer dereference.
>
> Change-Id: I08d56774012cf229ba2fe7a011c1359e8d1e2781
> Signed-off-by: Dennis Li <Dennis.Li at amd.com>
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> index 4cc7881f438c..67cca463ddcc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> @@ -95,6 +95,9 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
> int r;
>
> entity = p->direct ? &p->vm->direct : &p->vm->delayed;
> + if (!entity->rq)
> + return 0;
> +
> ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
>
> WARN_ON(ib->length_dw == 0);
> --
> 2.17.1
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flist
> s.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=02%7C01%7Cmo
> nk.liu%40amd.com%7C28e7260af3a24eec758f08d7b52975e3%7C3dd8961fe4884e60
> 8e11a82d994e183d%7C0%7C0%7C637177062003213431&sdata=vMXmhwTlN8lAav
> uqhYhpmKLM6V%2F%2B2%2FubFBbsk%2BGY%2Bjw%3D&reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx at lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=02%7C01%7Cmonk.liu%40amd.com%7C7acd21e839804ca7b70e08d7b608b9af%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637178021336998643&sdata=cxadPi08Q1dqMZYtodebNWm55GruAr5BOmcsZR%2BFR3M%3D&reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx at lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=02%7C01%7Cmonk.liu%40amd.com%7C7acd21e839804ca7b70e08d7b608b9af%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637178021336998643&sdata=cxadPi08Q1dqMZYtodebNWm55GruAr5BOmcsZR%2BFR3M%3D&reserved=0
More information about the amd-gfx
mailing list