[PATCH v2] drm/xe: Faster devcoredump

Maarten Lankhorst maarten.lankhorst at linux.intel.com
Mon Jul 29 08:47:59 UTC 2024


Hey,

I like speed, so great to have it fixed!

Den 2024-07-27 kl. 00:01, skrev Zanoni, Paulo R:
> On Thu, 2024-07-25 at 22:21 -0700, Matthew Brost wrote:
>> The current algorithm to read out devcoredump is O(N*N) where N is the
>> size of coredump due to usage of the drm_coredump_printer in
>> xe_devcoredump_read. Switch to a O(N) algorithm which prints the
>> devcoredump into a readable format in snapshot work and update
>> xe_devcoredump_read to memcpy from the readable format directly.
> 
> I just tested this:
> 
> root at martianriver:~# time cp /sys/class/drm/card0/device/devcoredump/data gpu-hang.data
> 
> real	0m0.313s
> user	0m0.008s
> sys	0m0.298s
> root at martianriver:~# ls -lh gpu-hang.data 
> -rw------- 1 root root 221M Jul 26 14:47 gpu-hang.data
> 
> Going from an estimated 221 minutes to 0.3 seconds, I'd say it's an improvement.
> 
>>
>> v2:
>>  - Fix double free on devcoredump removal (Testing)
>>  - Set read_data_size after snap work flush
>>  - Adjust remaining in iterator upon realloc (Testing)
>>  - Set read_data upon realloc (Testing)
>>
>> Reported-by: Paulo Zanoni <paulo.r.zanoni at intel.com>
>> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2408
>> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
>> Cc: Maarten Lankhorst <maarten.lankhorst at linux.intel.com>
>> Signed-off-by: Matthew Brost <matthew.brost at intel.com>
>> ---
>>  drivers/gpu/drm/xe/xe_devcoredump.c       | 140 +++++++++++++++++-----
>>  drivers/gpu/drm/xe/xe_devcoredump.h       |  13 ++
>>  drivers/gpu/drm/xe/xe_devcoredump_types.h |   4 +
>>  drivers/gpu/drm/xe/xe_vm.c                |   9 +-
>>  drivers/gpu/drm/xe/xe_vm.h                |   4 +-
>>  5 files changed, 136 insertions(+), 34 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
>> index d8d8ca2c19d3..6af161250a9e 100644
>> --- a/drivers/gpu/drm/xe/xe_devcoredump.c
>> +++ b/drivers/gpu/drm/xe/xe_devcoredump.c
>> @@ -66,22 +66,9 @@ static struct xe_guc *exec_queue_to_guc(struct xe_exec_queue *q)
>>  	return &q->gt->uc.guc;
>>  }
>>  
>> -static void xe_devcoredump_deferred_snap_work(struct work_struct *work)
>> +static void __xe_devcoredump_read(char *buffer, size_t count,
>> +				  struct xe_devcoredump *coredump)
>>  {
>> -	struct xe_devcoredump_snapshot *ss = container_of(work, typeof(*ss), work);
>> -
>> -	/* keep going if fw fails as we still want to save the memory and SW data */
>> -	if (xe_force_wake_get(gt_to_fw(ss->gt), XE_FORCEWAKE_ALL))
>> -		xe_gt_info(ss->gt, "failed to get forcewake for coredump capture\n");
>> -	xe_vm_snapshot_capture_delayed(ss->vm);
>> -	xe_guc_exec_queue_snapshot_capture_delayed(ss->ge);
>> -	xe_force_wake_put(gt_to_fw(ss->gt), XE_FORCEWAKE_ALL);
Should this xe_force_wake_put() be conditional on the xe_force_wake_get() above having succeeded?

>> -}
>> -
>> -static ssize_t xe_devcoredump_read(char *buffer, loff_t offset,
>> -				   size_t count, void *data, size_t datalen)
>> -{
>> -	struct xe_devcoredump *coredump = data;
>>  	struct xe_device *xe;
>>  	struct xe_devcoredump_snapshot *ss;
>>  	struct drm_printer p;
>> @@ -89,18 +76,12 @@ static ssize_t xe_devcoredump_read(char *buffer, loff_t offset,
>>  	struct timespec64 ts;
>>  	int i;
>>  
>> -	if (!coredump)
>> -		return -ENODEV;
>> -
>>  	xe = coredump_to_xe(coredump);
>>  	ss = &coredump->snapshot;
>>  
>> -	/* Ensure delayed work is captured before continuing */
>> -	flush_work(&ss->work);
>> -
>>  	iter.data = buffer;
>>  	iter.offset = 0;
>> -	iter.start = offset;
>> +	iter.start = 0;
>>  	iter.remain = count;
>>  
>>  	p = drm_coredump_printer(&iter);
>> @@ -129,15 +110,86 @@ static ssize_t xe_devcoredump_read(char *buffer, loff_t offset,
>>  			xe_hw_engine_snapshot_print(coredump->snapshot.hwe[i],
>>  						    &p);
>>  	drm_printf(&p, "\n**** VM state ****\n");
>> -	xe_vm_snapshot_print(coredump->snapshot.vm, &p);
>> +	xe_vm_snapshot_print(ss, coredump->snapshot.vm, &p);
>>  
>> -	return count - iter.remain;
>> +	ss->read_data_size = iter.offset;
>> +}
>> +
>> +static void xe_devcoredump_snapshot_free(struct xe_devcoredump_snapshot *ss)
>> +{
>> +	int i;
>> +
>> +	xe_guc_ct_snapshot_free(ss->ct);
>> +	ss->ct = NULL;
>> +
>> +	xe_guc_exec_queue_snapshot_free(ss->ge);
>> +	ss->ge = NULL;
>> +
>> +	xe_sched_job_snapshot_free(ss->job);
>> +	ss->job = NULL;
>> +
>> +	for (i = 0; i < XE_NUM_HW_ENGINES; i++)
>> +		if (ss->hwe[i]) {
>> +			xe_hw_engine_snapshot_free(ss->hwe[i]);
>> +			ss->hwe[i] = NULL;
>> +		}
>> +
>> +	xe_vm_snapshot_free(ss->vm);
>> +	ss->vm = NULL;
>> +}
>> +
>> +static void xe_devcoredump_deferred_snap_work(struct work_struct *work)
>> +{
>> +	struct xe_devcoredump_snapshot *ss = container_of(work, typeof(*ss), work);
>> +	struct xe_devcoredump *coredump = container_of(ss, typeof(*coredump), snapshot);
>> +
>> +	/* keep going if fw fails as we still want to save the memory and SW data */
>> +	if (xe_force_wake_get(gt_to_fw(ss->gt), XE_FORCEWAKE_ALL))
>> +		xe_gt_info(ss->gt, "failed to get forcewake for coredump capture\n");
>> +	xe_vm_snapshot_capture_delayed(ss->vm);
>> +	xe_guc_exec_queue_snapshot_capture_delayed(ss->ge);
>> +	xe_force_wake_put(gt_to_fw(ss->gt), XE_FORCEWAKE_ALL);
>> +
>> +	ss->read_data = kvmalloc(SZ_16M, GFP_USER);
>> +	if (!ss->read_data)
>> +		return;
>> +
>> +	ss->read_data_size = SZ_16M;
Shouldn't it be easy to make a reasonable approximation of the required size up front, instead of reallocating all the time?
Alternatively, run the print pass twice: the first pass only measures the size, the second pass fills in the data.

In any case, an up-front estimate along the lines of

ss->read_data_size = some_const + VM_DUMP_SIZE * some_other_const

is likely too fragile, but we should be able to change the code to dump twice to get the accurate size.

Cheers,
~Maarten


More information about the Intel-xe mailing list