[Intel-xe] [PATCH 1/2] drm/xe: Handle -EDEADLK case in preempt worker
Niranjana Vishwanathapura
niranjana.vishwanathapura at intel.com
Tue May 9 07:01:34 UTC 2023
On Tue, May 09, 2023 at 08:57:40AM +0200, Thomas Hellström wrote:
>
>On 5/9/23 08:33, Niranjana Vishwanathapura wrote:
>>On Mon, May 08, 2023 at 10:46:48PM -0700, Hellstrom, Thomas wrote:
>>>On Mon, 2023-05-08 at 22:24 -0700, Niranjana Vishwanathapura wrote:
>>>>With multiple active VMs, under memory pressure, it is possible that
>>>>ttm_bo_validate() run into -EDEADLK in ttm_mem_evict_wait_busy() and
>>>>return -ENOMEM.
>>>>
>>>>Until ttm properly handles locking in such scenarios, best thing the
>>>>driver can do is unwind the lock and retry.
>>>>
>>>>Update preempt worker to retry validating BOs with a timeout upon
>>>>-ENOMEM and while at it, apply the timeout for retries upon -EAGAIN
>>>>as well.
>>>>
>>>>Signed-off-by: Niranjana Vishwanathapura
>>>><niranjana.vishwanathapura at intel.com>
>>>>---
>>>> drivers/gpu/drm/xe/xe_vm.c | 25 ++++++++++++++++++++++---
>>>> 1 file changed, 22 insertions(+), 3 deletions(-)
>>>>
>>>>diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
>>>>index 91576cec000d..d92c9f659ab3 100644
>>>>--- a/drivers/gpu/drm/xe/xe_vm.c
>>>>+++ b/drivers/gpu/drm/xe/xe_vm.c
>>>>@@ -10,6 +10,7 @@
>>>> #include <drm/ttm/ttm_execbuf_util.h>
>>>> #include <drm/ttm/ttm_tt.h>
>>>> #include <drm/xe_drm.h>
>>>>+#include <linux/delay.h>
>>>> #include <linux/kthread.h>
>>>> #include <linux/mm.h>
>>>> #include <linux/swap.h>
>>>>@@ -508,6 +509,8 @@ void xe_vm_unlock_dma_resv(struct xe_vm *vm,
>>>> kvfree(tv);
>>>> }
>>>>
>>>>+#define XE_VM_REBIND_RETRY_TIMEOUT_MS 1000
>>>>+
>>>> static void preempt_rebind_work_func(struct work_struct *w)
>>>> {
>>>> struct xe_vm *vm = container_of(w, struct xe_vm,
>>>>preempt.rebind_work);
>>>>@@ -519,6 +522,7 @@ static void preempt_rebind_work_func(struct
>>>>work_struct *w)
>>>> struct dma_fence *rebind_fence;
>>>> unsigned int fence_count = 0;
>>>> LIST_HEAD(preempt_fences);
>>>>+ ktime_t end = 0;
>>>> int err;
>>>> long wait;
>>>> int __maybe_unused tries = 0;
>>>>@@ -633,9 +637,24 @@ static void preempt_rebind_work_func(struct
>>>>work_struct *w)
>>>> out_unlock:
>>>> xe_vm_unlock_dma_resv(vm, tv_onstack, tv, &ww, &objs);
>>>> out_unlock_outer:
>>>>- if (err == -EAGAIN) {
>>>>- trace_xe_vm_rebind_worker_retry(vm);
>>>>- goto retry;
>>>>+ /*
>>>>+ * With multiple active VMs, under memory pressure, it is
>>>>possible that
>>>>+ * ttm_bo_validate() run into -EDEADLK and in such case
>>>>returns -ENOMEM.
>>>>+ * Until ttm properly handles locking in such scenarios, best
>>>>thing the
>>>>+ * driver can do is retry with a timeout. Apply timeout for
>>>>retries upon
>>>>+ * -EAGAIN as well instead of indefinitely retrying. Killing
>>>>the VM or
>>>>+ * putting it in error state after timeout or other error
>>>>scenarios is
>>>>+ * still TBD.
>>>>+ */
>>>>+ if ((err == -EAGAIN) || (err == -ENOMEM)) {
>>>
>>>We should not timeout for userptr -EAGAIN, see previous mail.
>>>Otherwise LGTM.
>>
>>Ok, will leave -EAGAIN as is.
>>
>>Niranjana
>
>
>Thanks. If we later find out somehow it might be needed, it's a much
>bigger change involving multiple retry points so we should tackle that
>separately anyway.
>
Ok, makes sense. Thanks for the review.
posted revised series with your r-b.
Will push once CI passes.
Niranjana
>Thomas
>
>
>>
>>>
>>>>+ ktime_t cur = ktime_get();
>>>>+
>>>>+ end = end ? : ktime_add_ms(cur,
>>>>XE_VM_REBIND_RETRY_TIMEOUT_MS);
>>>>+ if (ktime_before(cur, end)) {
>>>>+ msleep(20);
>>>>+ trace_xe_vm_rebind_worker_retry(vm);
>>>>+ goto retry;
>>>>+ }
>>>> }
>>>> up_write(&vm->lock);
>>>>
>>>
More information about the Intel-xe
mailing list