From mcanal at igalia.com Sat May 3 20:59:51 2025
From: mcanal at igalia.com (=?utf-8?q?Ma=C3=ADra_Canal?=)
Date: Sat, 03 May 2025 17:59:51 -0300
Subject: [PATCH 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_RUNNING
Message-ID: <20250503-sched-skip-reset-v1-0-ed0d6701a3fe@igalia.com>

When the DRM scheduler times out, it's possible that the GPU isn't hung;
instead, a job may still be running, and there may be no valid reason to
reset the hardware. This can occur in two situations:

  1. The GPU exposes some mechanism that ensures the GPU is still making
     progress. By checking this mechanism, we can safely skip the reset,
     rearm the timeout, and allow the job to continue running until
     completion. This is the case for v3d and Etnaviv.

  2. TDR has fired before the IRQ that signals the fence. Consequently,
     the job actually finishes, but it triggers a timeout before signaling
     the completion fence.

These two scenarios are problematic because we remove the job from the
`sched->pending_list` before calling `sched->ops->timedout_job()`. This
means that when the job finally signals completion (e.g. in the IRQ
handler), the scheduler won't call `sched->ops->free_job()`. As a result,
the job and its resources won't be freed, leading to a memory leak.

For v3d specifically, we have observed that these memory leaks can be
significant in certain scenarios, as reported by users in [1][2]. To
address this situation, I submitted a patch similar to commit 704d3d60fec4
("drm/etnaviv: don't block scheduler when GPU is still active") for v3d [3].
This patch has already landed in drm-misc-fixes and successfully resolved
the users' issues.

However, as I mentioned in [3], exposing the scheduler's internals within
the drivers isn't ideal and I believe this specific situation can be
addressed within the DRM scheduler framework.

This series aims to resolve this issue by adding a new DRM sched status
that allows a driver to skip the reset.
This new status will indicate that the job should be reinserted into the
pending list, and the driver will still signal its completion.

The series can be divided into three parts:

 * Patch 1: Implementation of the new status in the DRM scheduler.
 * Patches 2-4: Some fixes to the DRM scheduler KUnit tests and the
   addition of a test for the new status.
 * Patches 5-8: Use of the new status in four different drivers.

[1] https://gitlab.freedesktop.org/mesa/mesa/-/issues/12227
[2] https://github.com/raspberrypi/linux/issues/6817
[3] https://lore.kernel.org/dri-devel/20250430210643.57924-1-mcanal@igalia.com/T/

Best Regards,
- Maíra

---
Maíra Canal (8):
      drm/sched: Allow drivers to skip the reset and keep on running
      drm/sched: Always free the job after the timeout
      drm/sched: Reduce scheduler's timeout for timeout tests
      drm/sched: Add new test for DRM_GPU_SCHED_STAT_RUNNING
      drm/v3d: Use DRM_GPU_SCHED_STAT_RUNNING to skip the reset
      drm/etnaviv: Use DRM_GPU_SCHED_STAT_RUNNING to skip the reset
      drm/xe: Use DRM_GPU_SCHED_STAT_RUNNING to skip the reset
      drm/panfrost: Use DRM_GPU_SCHED_STAT_RUNNING to skip the reset

 drivers/gpu/drm/etnaviv/etnaviv_sched.c          | 12 ++---
 drivers/gpu/drm/panfrost/panfrost_job.c          |  8 ++--
 drivers/gpu/drm/scheduler/sched_main.c           | 14 ++++++
 drivers/gpu/drm/scheduler/tests/mock_scheduler.c | 13 ++++++
 drivers/gpu/drm/scheduler/tests/tests_basic.c    | 57 ++++++++++++++++++++++--
 drivers/gpu/drm/v3d/v3d_sched.c                  |  4 +-
 drivers/gpu/drm/xe/xe_guc_submit.c               |  8 +---
 include/drm/gpu_scheduler.h                      |  2 +
 8 files changed, 94 insertions(+), 24 deletions(-)
---
base-commit: 760e296124ef3b6e14cd1d940f2a01c5ed7c0dac
change-id: 20250502-sched-skip-reset-bf7c163233da

From mcanal at igalia.com Sat May 3 20:59:52 2025
From: mcanal at igalia.com (=?utf-8?q?Ma=C3=ADra_Canal?=)
Date: Sat, 03 May 2025 17:59:52 -0300
Subject: [PATCH 1/8] drm/sched: Allow drivers to skip the reset and keep on running
In-Reply-To: <20250503-sched-skip-reset-v1-0-ed0d6701a3fe@igalia.com>
References:
 <20250503-sched-skip-reset-v1-0-ed0d6701a3fe@igalia.com>
Message-ID: <20250503-sched-skip-reset-v1-1-ed0d6701a3fe@igalia.com>

When the DRM scheduler times out, it's possible that the GPU isn't hung;
instead, a job may still be running, and there may be no valid reason to
reset the hardware. This can occur in two situations:

  1. The GPU exposes some mechanism that ensures the GPU is still making
     progress. By checking this mechanism, we can safely skip the reset,
     rearm the timeout, and allow the job to continue running until
     completion. This is the case for v3d and Etnaviv.

  2. TDR has fired before the IRQ that signals the fence. Consequently,
     the job actually finishes, but it triggers a timeout before signaling
     the completion fence.

These two scenarios are problematic because we remove the job from the
`sched->pending_list` before calling `sched->ops->timedout_job()`. This
means that when the job finally signals completion (e.g. in the IRQ
handler), the scheduler won't call `sched->ops->free_job()`. As a result,
the job and its resources won't be freed, leading to a memory leak.

To resolve this issue, we create a new `drm_gpu_sched_stat` that allows a
driver to skip the reset. This new status will indicate that the job
should be reinserted into the pending list, and the driver will still
signal its completion.
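As a rough illustration of the behaviour described above, the following
userspace sketch models the timeout path. It is not the kernel code:
`model_sched`, `model_job`, `model_job_timedout()` and the two callbacks are
hypothetical stand-ins, the pending list is a plain singly linked list, and
locking and timer rearming are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are the kernel's types. */
enum model_stat {
	MODEL_STAT_NOMINAL,	/* driver handled the timeout (e.g. did a reset) */
	MODEL_STAT_RUNNING,	/* job is still running: skip the reset */
};

struct model_job {
	struct model_job *next;
};

struct model_sched {
	struct model_job *pending;	/* head of the pending list */
	enum model_stat (*timedout_job)(struct model_job *job);
};

/* Model of the timeout path: unlink the first pending job, then ask the driver. */
static enum model_stat model_job_timedout(struct model_sched *sched)
{
	struct model_job *job = sched->pending;
	enum model_stat status;

	assert(job != NULL);
	sched->pending = job->next;	/* the job leaves the pending list */
	job->next = NULL;

	status = sched->timedout_job(job);

	/* New behaviour: put the job back so it can still be freed later. */
	if (status == MODEL_STAT_RUNNING) {
		job->next = sched->pending;
		sched->pending = job;
	}

	return status;
}

/* Example driver callbacks used to exercise the model. */
static enum model_stat model_cb_still_running(struct model_job *job)
{
	(void)job;
	return MODEL_STAT_RUNNING;
}

static enum model_stat model_cb_did_reset(struct model_job *job)
{
	(void)job;
	return MODEL_STAT_NOMINAL;
}
```

With a callback that returns MODEL_STAT_RUNNING, the job is back at the head
of the pending list after the call. With MODEL_STAT_NOMINAL it stays
unlinked, and it is up to the driver's recovery path to deal with it.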
Signed-off-by: Maíra Canal
---
 drivers/gpu/drm/scheduler/sched_main.c | 14 ++++++++++++++
 include/drm/gpu_scheduler.h            |  2 ++
 2 files changed, 16 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 829579c41c6b5d8b2abce5ad373c7017469b7680..68ca827d77e32187a034309f881135dbc639a9b4 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -568,6 +568,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
 			job->sched->ops->free_job(job);
 			sched->free_guilty = false;
 		}
+
+		/*
+		 * If the driver indicated that the GPU is still running and wants to skip
+		 * the reset, reinsert the job back into the pending list and rearm the
+		 * timeout.
+		 */
+		if (status == DRM_GPU_SCHED_STAT_RUNNING) {
+			spin_lock(&sched->job_list_lock);
+			list_add(&job->list, &sched->pending_list);
+			spin_unlock(&sched->job_list_lock);
+		}
 	} else {
 		spin_unlock(&sched->job_list_lock);
 	}
@@ -590,6 +601,9 @@ static void drm_sched_job_timedout(struct work_struct *work)
  * This function is typically used for reset recovery (see the docu of
  * drm_sched_backend_ops.timedout_job() for details). Do not call it for
  * scheduler teardown, i.e., before calling drm_sched_fini().
+ *
+ * As it's used for reset recovery, drm_sched_stop() shouldn't be called
+ * if the scheduler skipped the timeout (DRM_GPU_SCHED_STAT_RUNNING).
  */
 void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
 {
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 1a7e377d4cbb4fc12ed93c548b236970217945e8..fe9043b6d43141bee831b5fc16b927202a507d51 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -389,11 +389,13 @@ struct drm_sched_job {
  * @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
  * @DRM_GPU_SCHED_STAT_NOMINAL: Operation succeeded.
  * @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available anymore.
+ * @DRM_GPU_SCHED_STAT_RUNNING: GPU is still running, so skip the reset.
  */
 enum drm_gpu_sched_stat {
 	DRM_GPU_SCHED_STAT_NONE,
 	DRM_GPU_SCHED_STAT_NOMINAL,
 	DRM_GPU_SCHED_STAT_ENODEV,
+	DRM_GPU_SCHED_STAT_RUNNING,
 };
 
 /**
-- 
2.49.0

From mcanal at igalia.com Sat May 3 20:59:53 2025
From: mcanal at igalia.com (=?utf-8?q?Ma=C3=ADra_Canal?=)
Date: Sat, 03 May 2025 17:59:53 -0300
Subject: [PATCH 2/8] drm/sched: Always free the job after the timeout
In-Reply-To: <20250503-sched-skip-reset-v1-0-ed0d6701a3fe@igalia.com>
References: <20250503-sched-skip-reset-v1-0-ed0d6701a3fe@igalia.com>
Message-ID: <20250503-sched-skip-reset-v1-2-ed0d6701a3fe@igalia.com>

Currently, if we add the assertions presented in this commit to the mock
scheduler, we will see the following output:

[15:47:08] ============== [PASSED] drm_sched_basic_tests ==============
[15:47:08] ======== drm_sched_basic_timeout_tests (1 subtest) =========
[15:47:08] # drm_sched_basic_timeout: ASSERTION FAILED at drivers/gpu/drm/scheduler/tests/tests_basic.c:246
[15:47:08] Expected list_empty(&sched->job_list) to be true, but is false
[15:47:08] [FAILED] drm_sched_basic_timeout
[15:47:08] # module: drm_sched_tests

This occurs because `mock_sched_timedout_job()` doesn't properly handle
the hang. From the DRM sched documentation, `drm_sched_stop()` and
`drm_sched_start()` are typically used for reset recovery. If these
functions are not used, the offending job won't be freed and should be
freed by the caller.

Currently, the mock scheduler doesn't use the functions provided by the
API, nor does it handle the freeing of the job. As a result, the job isn't
removed from the job list.

This commit mocks a GPU reset by stopping the scheduler affected by the
reset, waiting a few hundred microseconds to mimic a hardware reset, and
then restarting the affected scheduler.
Signed-off-by: Maíra Canal
---
 drivers/gpu/drm/scheduler/tests/mock_scheduler.c | 10 ++++++++++
 drivers/gpu/drm/scheduler/tests/tests_basic.c    |  3 +++
 2 files changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
index f999c8859cf7adb8f06fc8a37969656dd3249fa7..e9af202d84bd55ea5cc048215e39f5407bc84458 100644
--- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
+++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
@@ -1,6 +1,8 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright (c) 2025 Valve Corporation */
 
+#include <linux/delay.h>
+
 #include "sched_tests.h"
 
 /*
@@ -203,10 +205,18 @@ static struct dma_fence *mock_sched_run_job(struct drm_sched_job *sched_job)
 static enum drm_gpu_sched_stat
 mock_sched_timedout_job(struct drm_sched_job *sched_job)
 {
+	struct drm_mock_scheduler *sched =
+		drm_sched_to_mock_sched(sched_job->sched);
 	struct drm_mock_sched_job *job = drm_sched_job_to_mock_job(sched_job);
 
 	job->flags |= DRM_MOCK_SCHED_JOB_TIMEDOUT;
 
+	drm_sched_stop(&sched->base, &job->base);
+
+	usleep_range(200, 500);
+
+	drm_sched_start(&sched->base, 0);
+
 	return DRM_GPU_SCHED_STAT_NOMINAL;
 }
 
diff --git a/drivers/gpu/drm/scheduler/tests/tests_basic.c b/drivers/gpu/drm/scheduler/tests/tests_basic.c
index 7230057e0594c6246f02608f07fcb1f8d738ac75..8f960f0fd31d0af7873f410ceba2d636f58a5474 100644
--- a/drivers/gpu/drm/scheduler/tests/tests_basic.c
+++ b/drivers/gpu/drm/scheduler/tests/tests_basic.c
@@ -241,6 +241,9 @@ static void drm_sched_basic_timeout(struct kunit *test)
 			job->flags & DRM_MOCK_SCHED_JOB_TIMEDOUT,
 			DRM_MOCK_SCHED_JOB_TIMEDOUT);
 
+	KUNIT_ASSERT_TRUE(test, list_empty(&sched->job_list));
+	KUNIT_ASSERT_TRUE(test, list_empty(&sched->done_list));
+
 	drm_mock_sched_entity_free(entity);
 }
 
-- 
2.49.0

From mcanal at igalia.com Sat May 3 20:59:54 2025
From: mcanal at igalia.com (=?utf-8?q?Ma=C3=ADra_Canal?=)
Date: Sat, 03 May 2025 17:59:54 -0300
Subject: [PATCH 3/8] drm/sched: Reduce scheduler's timeout
 for timeout tests
In-Reply-To: <20250503-sched-skip-reset-v1-0-ed0d6701a3fe@igalia.com>
References: <20250503-sched-skip-reset-v1-0-ed0d6701a3fe@igalia.com>
Message-ID: <20250503-sched-skip-reset-v1-3-ed0d6701a3fe@igalia.com>

As more KUnit tests are introduced to evaluate the basic capabilities of
the `timedout_job()` hook, the test suite will continue to increase in
duration. To reduce the overall running time of the test suite, decrease
the scheduler's timeout for the timeout tests.

Before this commit:

[15:42:26] Elapsed time: 15.637s total, 0.002s configuring, 10.387s building, 5.229s running

After this commit:

[15:45:26] Elapsed time: 9.263s total, 0.002s configuring, 5.168s building, 4.037s running

Signed-off-by: Maíra Canal
---
 drivers/gpu/drm/scheduler/tests/tests_basic.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/tests/tests_basic.c b/drivers/gpu/drm/scheduler/tests/tests_basic.c
index 8f960f0fd31d0af7873f410ceba2d636f58a5474..00c691cb3c306f609684f554f17fcb54ba74cb95 100644
--- a/drivers/gpu/drm/scheduler/tests/tests_basic.c
+++ b/drivers/gpu/drm/scheduler/tests/tests_basic.c
@@ -5,6 +5,8 @@
 
 #include "sched_tests.h"
 
+#define MOCK_TIMEOUT (HZ / 5)
+
 /*
  * DRM scheduler basic tests should check the basic functional correctness of
  * the scheduler, including some very light smoke testing.
More targeted tests,
@@ -28,7 +30,7 @@ static void drm_sched_basic_exit(struct kunit *test)
 
 static int drm_sched_timeout_init(struct kunit *test)
 {
-	test->priv = drm_mock_sched_new(test, HZ);
+	test->priv = drm_mock_sched_new(test, MOCK_TIMEOUT);
 
 	return 0;
 }
@@ -224,17 +226,17 @@ static void drm_sched_basic_timeout(struct kunit *test)
 
 	drm_mock_sched_job_submit(job);
 
-	done = drm_mock_sched_job_wait_scheduled(job, HZ);
+	done = drm_mock_sched_job_wait_scheduled(job, MOCK_TIMEOUT);
 	KUNIT_ASSERT_TRUE(test, done);
 
-	done = drm_mock_sched_job_wait_finished(job, HZ / 2);
+	done = drm_mock_sched_job_wait_finished(job, MOCK_TIMEOUT / 2);
 	KUNIT_ASSERT_FALSE(test, done);
 
 	KUNIT_ASSERT_EQ(test,
 			job->flags & DRM_MOCK_SCHED_JOB_TIMEDOUT,
 			0);
 
-	done = drm_mock_sched_job_wait_finished(job, HZ);
+	done = drm_mock_sched_job_wait_finished(job, MOCK_TIMEOUT);
 	KUNIT_ASSERT_FALSE(test, done);
 
 	KUNIT_ASSERT_EQ(test,
-- 
2.49.0

From mcanal at igalia.com Sat May 3 20:59:55 2025
From: mcanal at igalia.com (=?utf-8?q?Ma=C3=ADra_Canal?=)
Date: Sat, 03 May 2025 17:59:55 -0300
Subject: [PATCH 4/8] drm/sched: Add new test for DRM_GPU_SCHED_STAT_RUNNING
In-Reply-To: <20250503-sched-skip-reset-v1-0-ed0d6701a3fe@igalia.com>
References: <20250503-sched-skip-reset-v1-0-ed0d6701a3fe@igalia.com>
Message-ID: <20250503-sched-skip-reset-v1-4-ed0d6701a3fe@igalia.com>

Add a test to submit a single job against a scheduler with the timeout
configured and verify that if the job is still running, the timeout
handler will skip the reset and allow the job to complete.
Signed-off-by: Maíra Canal
---
 drivers/gpu/drm/scheduler/tests/mock_scheduler.c |  3 ++
 drivers/gpu/drm/scheduler/tests/tests_basic.c    | 44 ++++++++++++++++++++++++
 2 files changed, 47 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
index e9af202d84bd55ea5cc048215e39f5407bc84458..9d594cb5bf567be25e018ddbcd28b70a7e994260 100644
--- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
+++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
@@ -211,6 +211,9 @@ mock_sched_timedout_job(struct drm_sched_job *sched_job)
 
 	job->flags |= DRM_MOCK_SCHED_JOB_TIMEDOUT;
 
+	if (job->finish_at && ktime_before(ktime_get(), job->finish_at))
+		return DRM_GPU_SCHED_STAT_RUNNING;
+
 	drm_sched_stop(&sched->base, &job->base);
 
 	usleep_range(200, 500);
diff --git a/drivers/gpu/drm/scheduler/tests/tests_basic.c b/drivers/gpu/drm/scheduler/tests/tests_basic.c
index 00c691cb3c306f609684f554f17fcb54ba74cb95..669a211b216ee298544ac237abb866077d856586 100644
--- a/drivers/gpu/drm/scheduler/tests/tests_basic.c
+++ b/drivers/gpu/drm/scheduler/tests/tests_basic.c
@@ -249,8 +249,52 @@ static void drm_sched_basic_timeout(struct kunit *test)
 	drm_mock_sched_entity_free(entity);
 }
 
+static void drm_sched_skip_reset(struct kunit *test)
+{
+	struct drm_mock_scheduler *sched = test->priv;
+	struct drm_mock_sched_entity *entity;
+	struct drm_mock_sched_job *job;
+	bool done;
+
+	/*
+	 * Submit a single job against a scheduler with the timeout configured
+	 * and verify that if the job is still running, the timeout handler
+	 * will skip the reset and allow the job to complete.
+	 */
+
+	entity = drm_mock_sched_entity_new(test,
+					   DRM_SCHED_PRIORITY_NORMAL,
+					   sched);
+	job = drm_mock_sched_job_new(test, entity);
+
+	drm_mock_sched_job_set_duration_us(job, jiffies_to_usecs(2 * MOCK_TIMEOUT));
+	drm_mock_sched_job_submit(job);
+
+	done = drm_mock_sched_job_wait_finished(job, MOCK_TIMEOUT);
+	KUNIT_ASSERT_FALSE(test, done);
+
+	KUNIT_ASSERT_EQ(test,
+			job->flags & DRM_MOCK_SCHED_JOB_TIMEDOUT,
+			DRM_MOCK_SCHED_JOB_TIMEDOUT);
+
+	KUNIT_ASSERT_FALSE(test, list_empty(&sched->job_list));
+
+	done = drm_mock_sched_job_wait_finished(job, MOCK_TIMEOUT);
+	KUNIT_ASSERT_TRUE(test, done);
+
+	KUNIT_ASSERT_EQ(test,
+			job->flags & DRM_MOCK_SCHED_JOB_TIMEDOUT,
+			DRM_MOCK_SCHED_JOB_TIMEDOUT);
+
+	KUNIT_ASSERT_TRUE(test, list_empty(&sched->job_list));
+	KUNIT_ASSERT_TRUE(test, list_empty(&sched->done_list));
+
+	drm_mock_sched_entity_free(entity);
+}
+
 static struct kunit_case drm_sched_timeout_tests[] = {
 	KUNIT_CASE(drm_sched_basic_timeout),
+	KUNIT_CASE(drm_sched_skip_reset),
 	{}
 };
 
-- 
2.49.0

From mcanal at igalia.com Sat May 3 20:59:56 2025
From: mcanal at igalia.com (=?utf-8?q?Ma=C3=ADra_Canal?=)
Date: Sat, 03 May 2025 17:59:56 -0300
Subject: [PATCH 5/8] drm/v3d: Use DRM_GPU_SCHED_STAT_RUNNING to skip the reset
In-Reply-To: <20250503-sched-skip-reset-v1-0-ed0d6701a3fe@igalia.com>
References: <20250503-sched-skip-reset-v1-0-ed0d6701a3fe@igalia.com>
Message-ID: <20250503-sched-skip-reset-v1-5-ed0d6701a3fe@igalia.com>

When a CL/CSD job times out, we check if the GPU has made any progress
since the last timeout. If so, instead of resetting the hardware, we skip
the reset and allow the timer to be rearmed. This gives long-running jobs
a chance to complete.

Use the DRM_GPU_SCHED_STAT_RUNNING status to skip the reset and rearm the
timer.
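The progress check described above can be sketched in plain userspace C. This
is only an illustration under stated assumptions: `sketch_progress`,
`sketch_cl_timedout()` and the `resets` counter are hypothetical, and the two
counter values stand in for whatever the hardware reports at each timeout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum sketch_stat {
	SKETCH_STAT_NOMINAL,	/* no progress seen: a reset was performed */
	SKETCH_STAT_RUNNING,	/* progress seen: skip the reset */
};

/* Hypothetical snapshot of two progress counters taken at the last timeout. */
struct sketch_progress {
	uint32_t last_ctca;
	uint32_t last_ctra;
	unsigned int resets;	/* counts simulated resets */
};

/*
 * Compare the counters read at this timeout with the saved snapshot. If
 * either one moved, remember the new values and report that the job is
 * still running; otherwise simulate a reset.
 */
static enum sketch_stat sketch_cl_timedout(struct sketch_progress *p,
					   uint32_t ctca, uint32_t ctra)
{
	assert(p != NULL);

	if (p->last_ctca != ctca || p->last_ctra != ctra) {
		p->last_ctca = ctca;
		p->last_ctra = ctra;
		return SKETCH_STAT_RUNNING;
	}

	p->resets++;
	return SKETCH_STAT_NOMINAL;
}
```

Calling the function twice with the same counter values skips the reset on
the first call, because the snapshot changes, and simulates one on the
second, because nothing moved between the two timeouts.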
Signed-off-by: Maíra Canal
---
 drivers/gpu/drm/v3d/v3d_sched.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
index b3be08b0ca9188564f9fb6aa32694940a5fadc9d..51770b6686d0befffa3e87c290bbdc1a12a19ad5 100644
--- a/drivers/gpu/drm/v3d/v3d_sched.c
+++ b/drivers/gpu/drm/v3d/v3d_sched.c
@@ -751,7 +751,7 @@ v3d_cl_job_timedout(struct drm_sched_job *sched_job, enum v3d_queue q,
 	if (*timedout_ctca != ctca || *timedout_ctra != ctra) {
 		*timedout_ctca = ctca;
 		*timedout_ctra = ctra;
-		return DRM_GPU_SCHED_STAT_NOMINAL;
+		return DRM_GPU_SCHED_STAT_RUNNING;
 	}
 
 	return v3d_gpu_reset_for_timeout(v3d, sched_job);
@@ -795,7 +795,7 @@ v3d_csd_job_timedout(struct drm_sched_job *sched_job)
 	 */
 	if (job->timedout_batches != batches) {
 		job->timedout_batches = batches;
-		return DRM_GPU_SCHED_STAT_NOMINAL;
+		return DRM_GPU_SCHED_STAT_RUNNING;
 	}
 
 	return v3d_gpu_reset_for_timeout(v3d, sched_job);
-- 
2.49.0

From mcanal at igalia.com Sat May 3 20:59:57 2025
From: mcanal at igalia.com (=?utf-8?q?Ma=C3=ADra_Canal?=)
Date: Sat, 03 May 2025 17:59:57 -0300
Subject: [PATCH 6/8] drm/etnaviv: Use DRM_GPU_SCHED_STAT_RUNNING to skip the reset
In-Reply-To: <20250503-sched-skip-reset-v1-0-ed0d6701a3fe@igalia.com>
References: <20250503-sched-skip-reset-v1-0-ed0d6701a3fe@igalia.com>
Message-ID: <20250503-sched-skip-reset-v1-6-ed0d6701a3fe@igalia.com>

Etnaviv can skip a hardware reset in two situations:

  1. TDR has fired before the IRQ and the timeout is spurious.
  2. The GPU is still making progress on the front-end and we can give
     the job a chance to complete.

Instead of relying on the scheduler internals, use the
DRM_GPU_SCHED_STAT_RUNNING status to skip the reset and rearm the timer.
Signed-off-by: Maíra Canal
---
 drivers/gpu/drm/etnaviv/etnaviv_sched.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
index 76a3a3e517d8d9f654fb6b9e98e72910795cfc7a..b87ffdb4136aebade736d78b3677de2f21d52ebc 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
@@ -40,11 +40,11 @@ static enum drm_gpu_sched_stat etnaviv_sched_timedout_job(struct drm_sched_job
 	int change;
 
 	/*
-	 * If the GPU managed to complete this jobs fence, the timout is
-	 * spurious. Bail out.
+	 * If the GPU managed to complete this jobs fence, TDR has fired before
+	 * IRQ and the timeout is spurious. Bail out.
 	 */
 	if (dma_fence_is_signaled(submit->out_fence))
-		goto out_no_timeout;
+		return DRM_GPU_SCHED_STAT_RUNNING;
 
 	/*
 	 * If the GPU is still making forward progress on the front-end (which
@@ -70,7 +70,7 @@ static enum drm_gpu_sched_stat etnaviv_sched_timedout_job(struct drm_sched_job
 		gpu->hangcheck_dma_addr = dma_addr;
 		gpu->hangcheck_primid = primid;
 		gpu->hangcheck_fence = gpu->completed_fence;
-		goto out_no_timeout;
+		return DRM_GPU_SCHED_STAT_RUNNING;
 	}
 
 	/* block scheduler */
@@ -87,10 +87,6 @@ static enum drm_gpu_sched_stat etnaviv_sched_timedout_job(struct drm_sched_job
 
 	drm_sched_start(&gpu->sched, 0);
 	return DRM_GPU_SCHED_STAT_NOMINAL;
-
-out_no_timeout:
-	list_add(&sched_job->list, &sched_job->sched->pending_list);
-	return DRM_GPU_SCHED_STAT_NOMINAL;
 }
 
 static void etnaviv_sched_free_job(struct drm_sched_job *sched_job)
-- 
2.49.0

From mcanal at igalia.com Sat May 3 20:59:58 2025
From: mcanal at igalia.com (=?utf-8?q?Ma=C3=ADra_Canal?=)
Date: Sat, 03 May 2025 17:59:58 -0300
Subject: [PATCH 7/8] drm/xe: Use DRM_GPU_SCHED_STAT_RUNNING to skip the reset
In-Reply-To: <20250503-sched-skip-reset-v1-0-ed0d6701a3fe@igalia.com>
References: <20250503-sched-skip-reset-v1-0-ed0d6701a3fe@igalia.com>
Message-ID:
 <20250503-sched-skip-reset-v1-7-ed0d6701a3fe@igalia.com>

Xe can skip the reset if TDR has fired before the free job worker. Instead
of using the scheduler internals to add the job to the pending list, use
the DRM_GPU_SCHED_STAT_RUNNING status to skip the reset and rearm the
timer.

Note that there is no need to restart submission if it hasn't been
stopped.

Signed-off-by: Maíra Canal
---
 drivers/gpu/drm/xe/xe_guc_submit.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 31bc2022bfc2d80f0ef54726dfeb8d7f8e6b32c8..4c40d3921d4a5e190d3413736a68c6e7295223dd 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -1058,12 +1058,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
 	 * list so job can be freed and kick scheduler ensuring free job is not
 	 * lost.
 	 */
-	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags)) {
-		xe_sched_add_pending_job(sched, job);
-		xe_sched_submission_start(sched);
-
-		return DRM_GPU_SCHED_STAT_NOMINAL;
-	}
+	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &job->fence->flags))
+		return DRM_GPU_SCHED_STAT_RUNNING;
 
 	/* Kill the run_job entry point */
 	xe_sched_submission_stop(sched);
-- 
2.49.0

From mcanal at igalia.com Sat May 3 20:59:59 2025
From: mcanal at igalia.com (=?utf-8?q?Ma=C3=ADra_Canal?=)
Date: Sat, 03 May 2025 17:59:59 -0300
Subject: [PATCH 8/8] drm/panfrost: Use DRM_GPU_SCHED_STAT_RUNNING to skip the reset
In-Reply-To: <20250503-sched-skip-reset-v1-0-ed0d6701a3fe@igalia.com>
References: <20250503-sched-skip-reset-v1-0-ed0d6701a3fe@igalia.com>
Message-ID: <20250503-sched-skip-reset-v1-8-ed0d6701a3fe@igalia.com>

Panfrost can skip the reset if TDR has fired before the IRQ handler.
Currently, since Panfrost doesn't take any action in these scenarios, the
job is leaked, because `free_job()` won't be called.
To avoid such leaks, use the DRM_GPU_SCHED_STAT_RUNNING status to skip the
reset and rearm the timer.

Signed-off-by: Maíra Canal
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 5657106c2f7d0a0ca6162850767f58f3200cce13..2948d5c02115544a0e0babffd850f1506152849d 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -751,11 +751,11 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
 	int js = panfrost_job_get_slot(job);
 
 	/*
-	 * If the GPU managed to complete this jobs fence, the timeout is
-	 * spurious. Bail out.
+	 * If the GPU managed to complete this jobs fence, TDR has fired before
+	 * IRQ and the timeout is spurious. Bail out.
 	 */
 	if (dma_fence_is_signaled(job->done_fence))
-		return DRM_GPU_SCHED_STAT_NOMINAL;
+		return DRM_GPU_SCHED_STAT_RUNNING;
 
 	/*
 	 * Panfrost IRQ handler may take a long time to process an interrupt
@@ -770,7 +770,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
 
 	if (dma_fence_is_signaled(job->done_fence)) {
 		dev_warn(pfdev->dev, "unexpectedly high interrupt latency\n");
-		return DRM_GPU_SCHED_STAT_NOMINAL;
+		return DRM_GPU_SCHED_STAT_RUNNING;
 	}
 
 	dev_err(pfdev->dev, "gpu sched timeout, js=%d, config=0x%x, status=0x%x, head=0x%x, tail=0x%x, sched_job=%p",
-- 
2.49.0
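
The spurious-timeout case handled in this patch (the fence is already
signaled by the time the timeout handler runs) can be modelled in a few lines
of userspace C. Everything here is a hypothetical stand-in: `sketch_fence`,
`sketch_job` and `sketch_job_timedout()` are not kernel types or functions,
and the reset is reduced to setting a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sketch_stat {
	SKETCH_STAT_NOMINAL,	/* timeout was real: a reset was performed */
	SKETCH_STAT_RUNNING,	/* timeout was spurious: skip the reset */
};

/* Hypothetical fence: only records whether completion was signaled. */
struct sketch_fence {
	bool signaled;
};

struct sketch_job {
	struct sketch_fence done_fence;
	bool reset_done;	/* set when the simulated reset runs */
};

static enum sketch_stat sketch_job_timedout(struct sketch_job *job)
{
	assert(job != NULL);

	/* Fence already signaled: the timer beat the IRQ, nothing to reset. */
	if (job->done_fence.signaled)
		return SKETCH_STAT_RUNNING;

	job->reset_done = true;	/* stand-in for the real recovery path */
	return SKETCH_STAT_NOMINAL;
}
```

Returning the "running" status in the signaled case is what lets a
scheduler put the job back on its pending list, so the normal free path
still runs for it.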