[PATCH v2] drm/xe: Faster devcoredump

Matthew Brost matthew.brost at intel.com
Wed Jul 31 21:22:52 UTC 2024


On Tue, Jul 30, 2024 at 10:42:08PM +0000, Matthew Brost wrote:
> On Mon, Jul 29, 2024 at 10:47:59AM +0200, Maarten Lankhorst wrote:
> > Hey,
> > 
> > I like speed, so great to have it fixed!
> > 
> 
> Yep. The bug uncovered here exposed a pretty poor implementation...
> 
> > Den 2024-07-27 kl. 00:01, skrev Zanoni, Paulo R:
> > > On Thu, 2024-07-25 at 22:21 -0700, Matthew Brost wrote:
> > >> The current algorithm to read out devcoredump is O(N*N) where N is the
> > >> size of coredump due to usage of the drm_coredump_printer in
> > >> xe_devcoredump_read. Switch to a O(N) algorithm which prints the
> > >> devcoredump into a readable format in snapshot work and update
> > >> xe_devcoredump_read to memcpy from the readable format directly.
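For anyone following along without the full patch handy: after this change the
read callback collapses to a bounded memcpy out of the buffer the snapshot
worker pre-prints. Rough sketch only, not necessarily the exact v2 hunk:

static ssize_t xe_devcoredump_read(char *buffer, loff_t offset,
				   size_t count, void *data, size_t datalen)
{
	struct xe_devcoredump *coredump = data;
	struct xe_devcoredump_snapshot *ss;
	ssize_t byte_copy;

	if (!coredump)
		return -ENODEV;

	ss = &coredump->snapshot;

	/* Ensure the delayed snapshot work has printed into read_data */
	flush_work(&ss->work);

	if (!ss->read_data)
		return -ENOMEM;

	if (offset >= ss->read_data_size)
		return 0;

	byte_copy = min_t(ssize_t, count, ss->read_data_size - offset);
	memcpy(buffer, ss->read_data + offset, byte_copy);

	return byte_copy;
}

i.e. each read(2) call is now O(bytes copied) instead of re-printing the whole
dump from the start for every chunk.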
> > > 
> > > I just tested this:
> > > 
> > > root at martianriver:~# time cp /sys/class/drm/card0/device/devcoredump/data gpu-hang.data
> > > 
> > > real	0m0.313s
> > > user	0m0.008s
> > > sys	0m0.298s
> > > root at martianriver:~# ls -lh gpu-hang.data 
> > > -rw------- 1 root root 221M Jul 26 14:47 gpu-hang.data
> > > 
> > > Going from an estimated 221 minutes to 0.3 seconds, I'd say it's an improvement.
> > > 
> > >>
> > >> v2:
> > >>  - Fix double free on devcoredump removal (Testing)
> > >>  - Set read_data_size after snap work flush
> > >>  - Adjust remaining in iterator upon realloc (Testing)
> > >>  - Set read_data upon realloc (Testing)
> > >>
> > >> Reported-by: Paulo Zanoni <paulo.r.zanoni at intel.com>
> > >> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2408
> > >> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > >> Cc: Maarten Lankhorst <maarten.lankhorst at linux.intel.com>
> > >> Signed-off-by: Matthew Brost <matthew.brost at intel.com>
> > >> ---
> > >>  drivers/gpu/drm/xe/xe_devcoredump.c       | 140 +++++++++++++++++-----
> > >>  drivers/gpu/drm/xe/xe_devcoredump.h       |  13 ++
> > >>  drivers/gpu/drm/xe/xe_devcoredump_types.h |   4 +
> > >>  drivers/gpu/drm/xe/xe_vm.c                |   9 +-
> > >>  drivers/gpu/drm/xe/xe_vm.h                |   4 +-
> > >>  5 files changed, 136 insertions(+), 34 deletions(-)
> > >>
> > >> diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
> > >> index d8d8ca2c19d3..6af161250a9e 100644
> > >> --- a/drivers/gpu/drm/xe/xe_devcoredump.c
> > >> +++ b/drivers/gpu/drm/xe/xe_devcoredump.c
> > >> @@ -66,22 +66,9 @@ static struct xe_guc *exec_queue_to_guc(struct xe_exec_queue *q)
> > >>  	return &q->gt->uc.guc;
> > >>  }
> > >>  
> > >> -static void xe_devcoredump_deferred_snap_work(struct work_struct *work)
> > >> +static void __xe_devcoredump_read(char *buffer, size_t count,
> > >> +				  struct xe_devcoredump *coredump)
> > >>  {
> > >> -	struct xe_devcoredump_snapshot *ss = container_of(work, typeof(*ss), work);
> > >> -
> > >> -	/* keep going if fw fails as we still want to save the memory and SW data */
> > >> -	if (xe_force_wake_get(gt_to_fw(ss->gt), XE_FORCEWAKE_ALL))
> > >> -		xe_gt_info(ss->gt, "failed to get forcewake for coredump capture\n");
> > >> -	xe_vm_snapshot_capture_delayed(ss->vm);
> > >> -	xe_guc_exec_queue_snapshot_capture_delayed(ss->ge);
> > >> -	xe_force_wake_put(gt_to_fw(ss->gt), XE_FORCEWAKE_ALL);
> > Should this put be conditional?
> > 
> > >> -}
> > >> -
> > >> -static ssize_t xe_devcoredump_read(char *buffer, loff_t offset,
> > >> -				   size_t count, void *data, size_t datalen)
> > >> -{
> > >> -	struct xe_devcoredump *coredump = data;
> > >>  	struct xe_device *xe;
> > >>  	struct xe_devcoredump_snapshot *ss;
> > >>  	struct drm_printer p;
> > >> @@ -89,18 +76,12 @@ static ssize_t xe_devcoredump_read(char *buffer, loff_t offset,
> > >>  	struct timespec64 ts;
> > >>  	int i;
> > >>  
> > >> -	if (!coredump)
> > >> -		return -ENODEV;
> > >> -
> > >>  	xe = coredump_to_xe(coredump);
> > >>  	ss = &coredump->snapshot;
> > >>  
> > >> -	/* Ensure delayed work is captured before continuing */
> > >> -	flush_work(&ss->work);
> > >> -
> > >>  	iter.data = buffer;
> > >>  	iter.offset = 0;
> > >> -	iter.start = offset;
> > >> +	iter.start = 0;
> > >>  	iter.remain = count;
> > >>  
> > >>  	p = drm_coredump_printer(&iter);
> > >> @@ -129,15 +110,86 @@ static ssize_t xe_devcoredump_read(char *buffer, loff_t offset,
> > >>  			xe_hw_engine_snapshot_print(coredump->snapshot.hwe[i],
> > >>  						    &p);
> > >>  	drm_printf(&p, "\n**** VM state ****\n");
> > >> -	xe_vm_snapshot_print(coredump->snapshot.vm, &p);
> > >> +	xe_vm_snapshot_print(ss, coredump->snapshot.vm, &p);
> > >>  
> > >> -	return count - iter.remain;
> > >> +	ss->read_data_size = iter.offset;
> > >> +}
> > >> +
> > >> +static void xe_devcoredump_snapshot_free(struct xe_devcoredump_snapshot *ss)
> > >> +{
> > >> +	int i;
> > >> +
> > >> +	xe_guc_ct_snapshot_free(ss->ct);
> > >> +	ss->ct = NULL;
> > >> +
> > >> +	xe_guc_exec_queue_snapshot_free(ss->ge);
> > >> +	ss->ge = NULL;
> > >> +
> > >> +	xe_sched_job_snapshot_free(ss->job);
> > >> +	ss->job = NULL;
> > >> +
> > >> +	for (i = 0; i < XE_NUM_HW_ENGINES; i++)
> > >> +		if (ss->hwe[i]) {
> > >> +			xe_hw_engine_snapshot_free(ss->hwe[i]);
> > >> +			ss->hwe[i] = NULL;
> > >> +		}
> > >> +
> > >> +	xe_vm_snapshot_free(ss->vm);
> > >> +	ss->vm = NULL;
> > >> +}
> > >> +
> > >> +static void xe_devcoredump_deferred_snap_work(struct work_struct *work)
> > >> +{
> > >> +	struct xe_devcoredump_snapshot *ss = container_of(work, typeof(*ss), work);
> > >> +	struct xe_devcoredump *coredump = container_of(ss, typeof(*coredump), snapshot);
> > >> +
> > >> +	/* keep going if fw fails as we still want to save the memory and SW data */
> > >> +	if (xe_force_wake_get(gt_to_fw(ss->gt), XE_FORCEWAKE_ALL))
> > >> +		xe_gt_info(ss->gt, "failed to get forcewake for coredump capture\n");
> > >> +	xe_vm_snapshot_capture_delayed(ss->vm);
> > >> +	xe_guc_exec_queue_snapshot_capture_delayed(ss->ge);
> > >> +	xe_force_wake_put(gt_to_fw(ss->gt), XE_FORCEWAKE_ALL);
> > >> +
> > >> +	ss->read_data = kvmalloc(SZ_16M, GFP_USER);
> > >> +	if (!ss->read_data)
> > >> +		return;
> > >> +
> > >> +	ss->read_data_size = SZ_16M;
> > Shouldn't it be easy to make a reasonable approximation of the size, instead of reallocating all the time?
> > Or run twice, returning the size on the first pass and the data on the second.
> > 
> 
> I tried this initially and it is actually a pretty large refactor, hence I
> went with the realloc approach for simplicity. Paulo's test case does 4
> reallocs and seems to perform just fine.
>  

Disregard the above, you're right. It's fairly easy to do 2 passes, and I agree
that is better. Be on the lookout for my next post with this fixed.
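
Roughly what I have in mind, sketch only and details may change. It assumes the
coredump printer can skip the memcpy when iter.data is NULL (so the first pass
only counts bytes) and that __xe_devcoredump_read is changed to return the
number of bytes printed:

static ssize_t __xe_devcoredump_read(char *buffer, size_t count,
				     struct xe_devcoredump *coredump)
{
	struct drm_print_iterator iter = {
		.data = buffer,	/* NULL on the sizing pass */
		.offset = 0,
		.start = 0,
		.remain = count,
	};
	struct drm_printer p = drm_coredump_printer(&iter);

	/* ... print GuC CT, exec queue, job, engine and VM sections via @p ... */

	return count - iter.remain;
}

static void xe_devcoredump_deferred_snap_work(struct work_struct *work)
{
	struct xe_devcoredump_snapshot *ss = container_of(work, typeof(*ss), work);
	struct xe_devcoredump *coredump = container_of(ss, typeof(*coredump), snapshot);
	ssize_t size;

	/* ... capture delayed VM / exec queue snapshots under forcewake ... */

	/* Pass 1: NULL buffer, just measure how much would be printed */
	size = __xe_devcoredump_read(NULL, LONG_MAX, coredump);

	/* Pass 2: print again into an exactly sized buffer */
	ss->read_data = kvmalloc(size, GFP_USER);
	if (!ss->read_data)
		return;

	ss->read_data_size = __xe_devcoredump_read(ss->read_data, size, coredump);
}

No preallocation guess and no reallocs, at the cost of formatting the dump
twice.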

Matt

> > In any case,
> > 
> > ss->read_data_size = some const + VM_DUMP_SIZE * some other const
> > 
> > This approach is likely too fragile, but we should be able to change the code to dump twice to get the accurate number.
> > 
> 
> It is fragile in the sense that anything printed before xe_vm_snapshot_print
> has to fit in 16M or less. That part is all static, though; how about adding
> an assert which pops if we ever hit this limit? The drm_coredump_printer
> protects us against buffer overruns and memory corruption, we would just lose
> coredump data.
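
(For the record, the assert I meant would have been something along these
lines in __xe_devcoredump_read, right before the VM section; moot once the
two-pass version drops the 16M guess:)

	/*
	 * v2 assumption: everything printed before the VM section fits in
	 * the initial SZ_16M allocation. That is all static, so pop loudly
	 * if it is ever exceeded instead of silently truncating the dump.
	 */
	xe_assert(xe, iter.offset <= SZ_16M);

	drm_printf(&p, "\n**** VM state ****\n");
	xe_vm_snapshot_print(ss, coredump->snapshot.vm, &p);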
> 
> Matt
>  
> > Cheers,
> > ~Maarten

