[PATCH v2] drm/xe: Faster devcoredump
Matthew Brost
matthew.brost at intel.com
Tue Jul 30 22:42:08 UTC 2024
On Mon, Jul 29, 2024 at 10:47:59AM +0200, Maarten Lankhorst wrote:
> Hey,
>
> I like speed, so great to have it fixed!
>
Yep. The bug uncovered here exposed a pretty poor implementation...
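
To spell out the problem for the archives: userspace reads the devcoredump
data file in chunks at increasing offsets, and drm_coredump_printer
re-formats the entire dump on every read only to discard everything before
the current offset, so a full read ends up O(N*N). With the snapshot
pre-printed into ss->read_data by the snapshot worker (see the commit
message quoted below), the read callback collapses to a bounded memcpy,
roughly like this (a sketch of the idea, not the literal v2 code):

static ssize_t xe_devcoredump_read(char *buffer, loff_t offset,
				   size_t count, void *data, size_t datalen)
{
	struct xe_devcoredump *coredump = data;
	struct xe_devcoredump_snapshot *ss;
	ssize_t byte_copied;

	if (!coredump)
		return -ENODEV;

	ss = &coredump->snapshot;

	/* Make sure the delayed capture has run and the dump is formatted */
	flush_work(&ss->work);

	if (!ss->read_data)
		return -ENOMEM;

	if (offset >= ss->read_data_size)
		return 0;

	byte_copied = min_t(size_t, count, ss->read_data_size - offset);
	memcpy(buffer, ss->read_data + offset, byte_copied);

	return byte_copied;
}
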
> Den 2024-07-27 kl. 00:01, skrev Zanoni, Paulo R:
> > On Thu, 2024-07-25 at 22:21 -0700, Matthew Brost wrote:
> >> The current algorithm to read out devcoredump is O(N*N) where N is the
> >> size of coredump due to usage of the drm_coredump_printer in
> >> xe_devcoredump_read. Switch to a O(N) algorithm which prints the
> >> devcoredump into a readable format in snapshot work and update
> >> xe_devcoredump_read to memcpy from the readable format directly.
> >
> > I just tested this:
> >
> > root at martianriver:~# time cp /sys/class/drm/card0/device/devcoredump/data gpu-hang.data
> >
> > real 0m0.313s
> > user 0m0.008s
> > sys 0m0.298s
> > root at martianriver:~# ls -lh gpu-hang.data
> > -rw------- 1 root root 221M Jul 26 14:47 gpu-hang.data
> >
> > Going from an estimated 221 minutes to 0.3 seconds, I'd say it's an improvement.
> >
> >>
> >> v2:
> >> - Fix double free on devcoredump removal (Testing)
> >> - Set read_data_size after snap work flush
> >> - Adjust remaining in iterator upon realloc (Testing)
> >> - Set read_data upon realloc (Testing)
> >>
> >> Reported-by: Paulo Zanoni <paulo.r.zanoni at intel.com>
> >> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2408
> >> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> >> Cc: Maarten Lankhorst <maarten.lankhorst at linux.intel.com>
> >> Signed-off-by: Matthew Brost <matthew.brost at intel.com>
> >> ---
> >> drivers/gpu/drm/xe/xe_devcoredump.c | 140 +++++++++++++++++-----
> >> drivers/gpu/drm/xe/xe_devcoredump.h | 13 ++
> >> drivers/gpu/drm/xe/xe_devcoredump_types.h | 4 +
> >> drivers/gpu/drm/xe/xe_vm.c | 9 +-
> >> drivers/gpu/drm/xe/xe_vm.h | 4 +-
> >> 5 files changed, 136 insertions(+), 34 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
> >> index d8d8ca2c19d3..6af161250a9e 100644
> >> --- a/drivers/gpu/drm/xe/xe_devcoredump.c
> >> +++ b/drivers/gpu/drm/xe/xe_devcoredump.c
> >> @@ -66,22 +66,9 @@ static struct xe_guc *exec_queue_to_guc(struct xe_exec_queue *q)
> >> return &q->gt->uc.guc;
> >> }
> >>
> >> -static void xe_devcoredump_deferred_snap_work(struct work_struct *work)
> >> +static void __xe_devcoredump_read(char *buffer, size_t count,
> >> + struct xe_devcoredump *coredump)
> >> {
> >> - struct xe_devcoredump_snapshot *ss = container_of(work, typeof(*ss), work);
> >> -
> >> - /* keep going if fw fails as we still want to save the memory and SW data */
> >> - if (xe_force_wake_get(gt_to_fw(ss->gt), XE_FORCEWAKE_ALL))
> >> - xe_gt_info(ss->gt, "failed to get forcewake for coredump capture\n");
> >> - xe_vm_snapshot_capture_delayed(ss->vm);
> >> - xe_guc_exec_queue_snapshot_capture_delayed(ss->ge);
> >> - xe_force_wake_put(gt_to_fw(ss->gt), XE_FORCEWAKE_ALL);
> Should this put be conditional?
>
> >> -}
> >> -
> >> -static ssize_t xe_devcoredump_read(char *buffer, loff_t offset,
> >> - size_t count, void *data, size_t datalen)
> >> -{
> >> - struct xe_devcoredump *coredump = data;
> >> struct xe_device *xe;
> >> struct xe_devcoredump_snapshot *ss;
> >> struct drm_printer p;
> >> @@ -89,18 +76,12 @@ static ssize_t xe_devcoredump_read(char *buffer, loff_t offset,
> >> struct timespec64 ts;
> >> int i;
> >>
> >> - if (!coredump)
> >> - return -ENODEV;
> >> -
> >> xe = coredump_to_xe(coredump);
> >> ss = &coredump->snapshot;
> >>
> >> - /* Ensure delayed work is captured before continuing */
> >> - flush_work(&ss->work);
> >> -
> >> iter.data = buffer;
> >> iter.offset = 0;
> >> - iter.start = offset;
> >> + iter.start = 0;
> >> iter.remain = count;
> >>
> >> p = drm_coredump_printer(&iter);
> >> @@ -129,15 +110,86 @@ static ssize_t xe_devcoredump_read(char *buffer, loff_t offset,
> >> xe_hw_engine_snapshot_print(coredump->snapshot.hwe[i],
> >> &p);
> >> drm_printf(&p, "\n**** VM state ****\n");
> >> - xe_vm_snapshot_print(coredump->snapshot.vm, &p);
> >> + xe_vm_snapshot_print(ss, coredump->snapshot.vm, &p);
> >>
> >> - return count - iter.remain;
> >> + ss->read_data_size = iter.offset;
> >> +}
> >> +
> >> +static void xe_devcoredump_snapshot_free(struct xe_devcoredump_snapshot *ss)
> >> +{
> >> + int i;
> >> +
> >> + xe_guc_ct_snapshot_free(ss->ct);
> >> + ss->ct = NULL;
> >> +
> >> + xe_guc_exec_queue_snapshot_free(ss->ge);
> >> + ss->ge = NULL;
> >> +
> >> + xe_sched_job_snapshot_free(ss->job);
> >> + ss->job = NULL;
> >> +
> >> + for (i = 0; i < XE_NUM_HW_ENGINES; i++)
> >> + if (ss->hwe[i]) {
> >> + xe_hw_engine_snapshot_free(ss->hwe[i]);
> >> + ss->hwe[i] = NULL;
> >> + }
> >> +
> >> + xe_vm_snapshot_free(ss->vm);
> >> + ss->vm = NULL;
> >> +}
> >> +
> >> +static void xe_devcoredump_deferred_snap_work(struct work_struct *work)
> >> +{
> >> + struct xe_devcoredump_snapshot *ss = container_of(work, typeof(*ss), work);
> >> + struct xe_devcoredump *coredump = container_of(ss, typeof(*coredump), snapshot);
> >> +
> >> + /* keep going if fw fails as we still want to save the memory and SW data */
> >> + if (xe_force_wake_get(gt_to_fw(ss->gt), XE_FORCEWAKE_ALL))
> >> + xe_gt_info(ss->gt, "failed to get forcewake for coredump capture\n");
> >> + xe_vm_snapshot_capture_delayed(ss->vm);
> >> + xe_guc_exec_queue_snapshot_capture_delayed(ss->ge);
> >> + xe_force_wake_put(gt_to_fw(ss->gt), XE_FORCEWAKE_ALL);
> >> +
> >> + ss->read_data = kvmalloc(SZ_16M, GFP_USER);
> >> + if (!ss->read_data)
> >> + return;
> >> +
> >> + ss->read_data_size = SZ_16M;
> Shouldn't it be easy to actually make a reasonable approximation of the size, instead of reallocating all the time?
> Or run twice, returning size on first attempt, and data on second.
>
I tried this initially and it is actually a pretty large refactor, hence I
went with the realloc approach for simplicity. Paulo's test case does 4
reallocs and seems to perform just fine.
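
For reference, the grow step itself is nothing fancy. When the iterator is
about to run out of space, the capture path does roughly the following
(illustrative name and a plain kvmalloc/kvfree copy-and-swap rather than
the exact v2 helper):

static void xe_devcoredump_grow(struct xe_devcoredump_snapshot *ss,
				struct drm_print_iterator *iter)
{
	size_t new_size = ss->read_data_size * 2;
	void *bigger = kvmalloc(new_size, GFP_USER);

	if (!bigger)
		return;	/* printer stops copying once remain hits 0; no overrun */

	/* iter->start is 0, so iter->offset bytes of the old buffer are in use */
	memcpy(bigger, ss->read_data, iter->offset);
	kvfree(ss->read_data);

	ss->read_data = bigger;
	iter->data = bigger;
	iter->remain += new_size - ss->read_data_size;
	ss->read_data_size = new_size;
}

Starting at 16M and doubling, Paulo's ~221M dump works out to exactly those
4 grows.
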
> In any case,
>
> ss->read_data_size = some const + VM_DUMP_SIZE * some other const
>
> This approach is likely too fragile, but we should be able to change the code to dump twice to get the accurate number.
>
It is fragile in the sense that everything printed before
xe_vm_snapshot_print has to fit in 16M or less. That part is all static in
size though, so how about adding an assert which pops if we ever hit this
limit? The drm_coredump_printer protects us against buffer overruns and
memory corruption; we would just lose coredump data.
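
Concretely, something along these lines right before the VM section in
__xe_devcoredump_read would make that assumption loud (sketch only; if we
want the check in production builds too, a drm_warn would do, since
xe_assert only fires on debug builds):

	drm_printf(&p, "\n**** VM state ****\n");
	/*
	 * Everything printed before the VM section is static in size and has
	 * to fit in the initial SZ_16M allocation; exceeding it only
	 * truncates the dump, but let's hear about it.
	 */
	xe_assert(xe, iter.offset < SZ_16M);
	xe_vm_snapshot_print(ss, coredump->snapshot.vm, &p);
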
Matt
> Cheers,
> ~Maarten