[PATCH v2] drm/xe: Faster devcoredump

Matthew Brost matthew.brost at intel.com
Tue Jul 30 22:42:08 UTC 2024


On Mon, Jul 29, 2024 at 10:47:59AM +0200, Maarten Lankhorst wrote:
> Hey,
> 
> I like speed, so great to have it fixed!
> 

Yep. Uncovering this bug exposed a pretty poor implementation...
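
For reference, the quadratic behaviour comes from drm_coredump_printer
throwing away everything before the requested offset: each read() call
regenerates the whole dump and only keeps the chunk past 'start'. A
simplified sketch of the old read path (print_whole_snapshot is just a
stand-in for the snapshot print calls, not a real function):

	static ssize_t old_style_read(char *buffer, loff_t offset, size_t count,
				      void *data, size_t datalen)
	{
		struct drm_print_iterator iter = {
			.data = buffer,
			.start = offset,	/* bytes before this are discarded */
			.remain = count,
		};
		struct drm_printer p = drm_coredump_printer(&iter);

		/* re-prints the entire dump on every read() call */
		print_whole_snapshot(&p);

		return count - iter.remain;
	}

So reading an N byte dump in C byte chunks costs roughly N/C full
printing passes, i.e. the O(N*N) behaviour called out in the commit
message below.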

> Den 2024-07-27 kl. 00:01, skrev Zanoni, Paulo R:
> > On Thu, 2024-07-25 at 22:21 -0700, Matthew Brost wrote:
> >> The current algorithm to read out devcoredump is O(N*N) where N is the
> >> size of coredump due to usage of the drm_coredump_printer in
> >> xe_devcoredump_read. Switch to a O(N) algorithm which prints the
> >> devcoredump into a readable format in snapshot work and update
> >> xe_devcoredump_read to memcpy from the readable format directly.
> > 
> > I just tested this:
> > 
> > root at martianriver:~# time cp /sys/class/drm/card0/device/devcoredump/data gpu-hang.data
> > 
> > real	0m0.313s
> > user	0m0.008s
> > sys	0m0.298s
> > root at martianriver:~# ls -lh gpu-hang.data 
> > -rw------- 1 root root 221M Jul 26 14:47 gpu-hang.data
> > 
> > Going from an estimated 221 minutes to 0.3 seconds, I'd say it's an improvement.
> > 
> >>
> >> v2:
> >>  - Fix double free on devcoredump removal (Testing)
> >>  - Set read_data_size after snap work flush
> >>  - Adjust remaining in iterator upon realloc (Testing)
> >>  - Set read_data upon realloc (Testing)
> >>
> >> Reported-by: Paulo Zanoni <paulo.r.zanoni at intel.com>
> >> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2408
> >> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> >> Cc: Maarten Lankhorst <maarten.lankhorst at linux.intel.com>
> >> Signed-off-by: Matthew Brost <matthew.brost at intel.com>
> >> ---
> >>  drivers/gpu/drm/xe/xe_devcoredump.c       | 140 +++++++++++++++++-----
> >>  drivers/gpu/drm/xe/xe_devcoredump.h       |  13 ++
> >>  drivers/gpu/drm/xe/xe_devcoredump_types.h |   4 +
> >>  drivers/gpu/drm/xe/xe_vm.c                |   9 +-
> >>  drivers/gpu/drm/xe/xe_vm.h                |   4 +-
> >>  5 files changed, 136 insertions(+), 34 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
> >> index d8d8ca2c19d3..6af161250a9e 100644
> >> --- a/drivers/gpu/drm/xe/xe_devcoredump.c
> >> +++ b/drivers/gpu/drm/xe/xe_devcoredump.c
> >> @@ -66,22 +66,9 @@ static struct xe_guc *exec_queue_to_guc(struct xe_exec_queue *q)
> >>  	return &q->gt->uc.guc;
> >>  }
> >>  
> >> -static void xe_devcoredump_deferred_snap_work(struct work_struct *work)
> >> +static void __xe_devcoredump_read(char *buffer, size_t count,
> >> +				  struct xe_devcoredump *coredump)
> >>  {
> >> -	struct xe_devcoredump_snapshot *ss = container_of(work, typeof(*ss), work);
> >> -
> >> -	/* keep going if fw fails as we still want to save the memory and SW data */
> >> -	if (xe_force_wake_get(gt_to_fw(ss->gt), XE_FORCEWAKE_ALL))
> >> -		xe_gt_info(ss->gt, "failed to get forcewake for coredump capture\n");
> >> -	xe_vm_snapshot_capture_delayed(ss->vm);
> >> -	xe_guc_exec_queue_snapshot_capture_delayed(ss->ge);
> >> -	xe_force_wake_put(gt_to_fw(ss->gt), XE_FORCEWAKE_ALL);
> Should this put be conditional?
> 
> >> -}
> >> -
> >> -static ssize_t xe_devcoredump_read(char *buffer, loff_t offset,
> >> -				   size_t count, void *data, size_t datalen)
> >> -{
> >> -	struct xe_devcoredump *coredump = data;
> >>  	struct xe_device *xe;
> >>  	struct xe_devcoredump_snapshot *ss;
> >>  	struct drm_printer p;
> >> @@ -89,18 +76,12 @@ static ssize_t xe_devcoredump_read(char *buffer, loff_t offset,
> >>  	struct timespec64 ts;
> >>  	int i;
> >>  
> >> -	if (!coredump)
> >> -		return -ENODEV;
> >> -
> >>  	xe = coredump_to_xe(coredump);
> >>  	ss = &coredump->snapshot;
> >>  
> >> -	/* Ensure delayed work is captured before continuing */
> >> -	flush_work(&ss->work);
> >> -
> >>  	iter.data = buffer;
> >>  	iter.offset = 0;
> >> -	iter.start = offset;
> >> +	iter.start = 0;
> >>  	iter.remain = count;
> >>  
> >>  	p = drm_coredump_printer(&iter);
> >> @@ -129,15 +110,86 @@ static ssize_t xe_devcoredump_read(char *buffer, loff_t offset,
> >>  			xe_hw_engine_snapshot_print(coredump->snapshot.hwe[i],
> >>  						    &p);
> >>  	drm_printf(&p, "\n**** VM state ****\n");
> >> -	xe_vm_snapshot_print(coredump->snapshot.vm, &p);
> >> +	xe_vm_snapshot_print(ss, coredump->snapshot.vm, &p);
> >>  
> >> -	return count - iter.remain;
> >> +	ss->read_data_size = iter.offset;
> >> +}
> >> +
> >> +static void xe_devcoredump_snapshot_free(struct xe_devcoredump_snapshot *ss)
> >> +{
> >> +	int i;
> >> +
> >> +	xe_guc_ct_snapshot_free(ss->ct);
> >> +	ss->ct = NULL;
> >> +
> >> +	xe_guc_exec_queue_snapshot_free(ss->ge);
> >> +	ss->ge = NULL;
> >> +
> >> +	xe_sched_job_snapshot_free(ss->job);
> >> +	ss->job = NULL;
> >> +
> >> +	for (i = 0; i < XE_NUM_HW_ENGINES; i++)
> >> +		if (ss->hwe[i]) {
> >> +			xe_hw_engine_snapshot_free(ss->hwe[i]);
> >> +			ss->hwe[i] = NULL;
> >> +		}
> >> +
> >> +	xe_vm_snapshot_free(ss->vm);
> >> +	ss->vm = NULL;
> >> +}
> >> +
> >> +static void xe_devcoredump_deferred_snap_work(struct work_struct *work)
> >> +{
> >> +	struct xe_devcoredump_snapshot *ss = container_of(work, typeof(*ss), work);
> >> +	struct xe_devcoredump *coredump = container_of(ss, typeof(*coredump), snapshot);
> >> +
> >> +	/* keep going if fw fails as we still want to save the memory and SW data */
> >> +	if (xe_force_wake_get(gt_to_fw(ss->gt), XE_FORCEWAKE_ALL))
> >> +		xe_gt_info(ss->gt, "failed to get forcewake for coredump capture\n");
> >> +	xe_vm_snapshot_capture_delayed(ss->vm);
> >> +	xe_guc_exec_queue_snapshot_capture_delayed(ss->ge);
> >> +	xe_force_wake_put(gt_to_fw(ss->gt), XE_FORCEWAKE_ALL);
> >> +
> >> +	ss->read_data = kvmalloc(SZ_16M, GFP_USER);
> >> +	if (!ss->read_data)
> >> +		return;
> >> +
> >> +	ss->read_data_size = SZ_16M;
> Shouldn't it be easy to actually make a reasonable approximation of the size, instead of reallocating all the time?
> Or run twice, returning size on first attempt, and data on second.
> 

I tried this initially and it is actually a pretty large refactor, hence
I went with the realloc approach for simplicity. Paulo's test case does
4 reallocs and seems to perform just fine.
 
> In any case,
> 
> ss->read_data_size = some const + VM_DUMP_SIZE * some other const
> 
> This approach is likely too fragile, but we should be able to change the code to dump twice to get the accurate number.
> 

It is fragile in the sense that anything printed before
xe_vm_snapshot_print has to fit in 16M or less. That part is all static
in size though, so how about adding an assert which pops if we ever hit
this limit? The drm_coredump_printer protects us against buffer overruns
and memory corruption; worst case we just lose some coredump data.
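
Something like this at the end of __xe_devcoredump_read (macro choice
and exact placement are illustrative):

	/* Everything printed above here is static in size */
	xe_assert(xe, iter.offset < SZ_16M);

	drm_printf(&p, "\n**** VM state ****\n");
	xe_vm_snapshot_print(ss, coredump->snapshot.vm, &p);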

Matt
 
> Cheers,
> ~Maarten

