[PATCH v2] drm/xe: Faster devcoredump
Maarten Lankhorst
maarten.lankhorst at linux.intel.com
Mon Jul 29 08:47:59 UTC 2024
Hey,
I like speed, so great to have it fixed!
On 2024-07-27 at 00:01, Zanoni, Paulo R wrote:
> On Thu, 2024-07-25 at 22:21 -0700, Matthew Brost wrote:
>> The current algorithm to read out a devcoredump is O(N*N), where N is the
>> size of the coredump, due to the use of drm_coredump_printer in
>> xe_devcoredump_read. Switch to an O(N) algorithm which prints the
>> devcoredump into a readable format in the snapshot work and update
>> xe_devcoredump_read to memcpy from the readable format directly.
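For context: with drm_coredump_printer, every read() at offset N re-renders the entire dump and discards everything before the offset, which is where the O(N*N) comes from. Once the dump is pre-rendered into a buffer, each read() becomes a bounded memcpy, roughly like the sketch below (illustrative names and struct, not the patch's actual code):

	struct example_snapshot {
		char *read_data;	/* rendered once by the deferred snapshot work */
		size_t read_data_size;	/* number of valid bytes in read_data */
	};

	/* Illustrative sketch only -- not the patch's actual xe_devcoredump_read(). */
	static ssize_t example_coredump_read(char *buffer, loff_t offset,
					     size_t count, void *data, size_t datalen)
	{
		struct example_snapshot *ss = data;

		if (offset >= ss->read_data_size)
			return 0;

		count = min_t(size_t, count, ss->read_data_size - offset);
		memcpy(buffer, ss->read_data + offset, count);

		return count;	/* O(count) per read instead of O(dump size) */
	}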
>
> I just tested this:
>
> root at martianriver:~# time cp /sys/class/drm/card0/device/devcoredump/data gpu-hang.data
>
> real 0m0.313s
> user 0m0.008s
> sys 0m0.298s
> root at martianriver:~# ls -lh gpu-hang.data
> -rw------- 1 root root 221M Jul 26 14:47 gpu-hang.data
>
> Going from an estimated 221 minutes to 0.3 seconds, I'd say it's an improvement.
>
>>
>> v2:
>> - Fix double free on devcoredump removal (Testing)
>> - Set read_data_size after snap work flush
>> - Adjust remaining in iterator upon realloc (Testing)
>> - Set read_data upon realloc (Testing)
>>
>> Reported-by: Paulo Zanoni <paulo.r.zanoni at intel.com>
>> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2408
>> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
>> Cc: Maarten Lankhorst <maarten.lankhorst at linux.intel.com>
>> Signed-off-by: Matthew Brost <matthew.brost at intel.com>
>> ---
>> drivers/gpu/drm/xe/xe_devcoredump.c | 140 +++++++++++++++++-----
>> drivers/gpu/drm/xe/xe_devcoredump.h | 13 ++
>> drivers/gpu/drm/xe/xe_devcoredump_types.h | 4 +
>> drivers/gpu/drm/xe/xe_vm.c | 9 +-
>> drivers/gpu/drm/xe/xe_vm.h | 4 +-
>> 5 files changed, 136 insertions(+), 34 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
>> index d8d8ca2c19d3..6af161250a9e 100644
>> --- a/drivers/gpu/drm/xe/xe_devcoredump.c
>> +++ b/drivers/gpu/drm/xe/xe_devcoredump.c
>> @@ -66,22 +66,9 @@ static struct xe_guc *exec_queue_to_guc(struct xe_exec_queue *q)
>> return &q->gt->uc.guc;
>> }
>>
>> -static void xe_devcoredump_deferred_snap_work(struct work_struct *work)
>> +static void __xe_devcoredump_read(char *buffer, size_t count,
>> + struct xe_devcoredump *coredump)
>> {
>> - struct xe_devcoredump_snapshot *ss = container_of(work, typeof(*ss), work);
>> -
>> - /* keep going if fw fails as we still want to save the memory and SW data */
>> - if (xe_force_wake_get(gt_to_fw(ss->gt), XE_FORCEWAKE_ALL))
>> - xe_gt_info(ss->gt, "failed to get forcewake for coredump capture\n");
>> - xe_vm_snapshot_capture_delayed(ss->vm);
>> - xe_guc_exec_queue_snapshot_capture_delayed(ss->ge);
>> - xe_force_wake_put(gt_to_fw(ss->gt), XE_FORCEWAKE_ALL);
Should this put be made conditional on the forcewake get having succeeded?
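Applied to the hunk above, an untested sketch of what I mean (fw_err is a new local; everything else is already in the patch):

	int fw_err;

	/* keep going if fw fails as we still want to save the memory and SW data */
	fw_err = xe_force_wake_get(gt_to_fw(ss->gt), XE_FORCEWAKE_ALL);
	if (fw_err)
		xe_gt_info(ss->gt, "failed to get forcewake for coredump capture\n");

	xe_vm_snapshot_capture_delayed(ss->vm);
	xe_guc_exec_queue_snapshot_capture_delayed(ss->ge);

	/* only drop the reference if we actually took it */
	if (!fw_err)
		xe_force_wake_put(gt_to_fw(ss->gt), XE_FORCEWAKE_ALL);

That keeps the error path from dropping a reference it never took.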
>> -}
>> -
>> -static ssize_t xe_devcoredump_read(char *buffer, loff_t offset,
>> - size_t count, void *data, size_t datalen)
>> -{
>> - struct xe_devcoredump *coredump = data;
>> struct xe_device *xe;
>> struct xe_devcoredump_snapshot *ss;
>> struct drm_printer p;
>> @@ -89,18 +76,12 @@ static ssize_t xe_devcoredump_read(char *buffer, loff_t offset,
>> struct timespec64 ts;
>> int i;
>>
>> - if (!coredump)
>> - return -ENODEV;
>> -
>> xe = coredump_to_xe(coredump);
>> ss = &coredump->snapshot;
>>
>> - /* Ensure delayed work is captured before continuing */
>> - flush_work(&ss->work);
>> -
>> iter.data = buffer;
>> iter.offset = 0;
>> - iter.start = offset;
>> + iter.start = 0;
>> iter.remain = count;
>>
>> p = drm_coredump_printer(&iter);
>> @@ -129,15 +110,86 @@ static ssize_t xe_devcoredump_read(char *buffer, loff_t offset,
>> xe_hw_engine_snapshot_print(coredump->snapshot.hwe[i],
>> &p);
>> drm_printf(&p, "\n**** VM state ****\n");
>> - xe_vm_snapshot_print(coredump->snapshot.vm, &p);
>> + xe_vm_snapshot_print(ss, coredump->snapshot.vm, &p);
>>
>> - return count - iter.remain;
>> + ss->read_data_size = iter.offset;
>> +}
>> +
>> +static void xe_devcoredump_snapshot_free(struct xe_devcoredump_snapshot *ss)
>> +{
>> + int i;
>> +
>> + xe_guc_ct_snapshot_free(ss->ct);
>> + ss->ct = NULL;
>> +
>> + xe_guc_exec_queue_snapshot_free(ss->ge);
>> + ss->ge = NULL;
>> +
>> + xe_sched_job_snapshot_free(ss->job);
>> + ss->job = NULL;
>> +
>> + for (i = 0; i < XE_NUM_HW_ENGINES; i++)
>> + if (ss->hwe[i]) {
>> + xe_hw_engine_snapshot_free(ss->hwe[i]);
>> + ss->hwe[i] = NULL;
>> + }
>> +
>> + xe_vm_snapshot_free(ss->vm);
>> + ss->vm = NULL;
>> +}
>> +
>> +static void xe_devcoredump_deferred_snap_work(struct work_struct *work)
>> +{
>> + struct xe_devcoredump_snapshot *ss = container_of(work, typeof(*ss), work);
>> + struct xe_devcoredump *coredump = container_of(ss, typeof(*coredump), snapshot);
>> +
>> + /* keep going if fw fails as we still want to save the memory and SW data */
>> + if (xe_force_wake_get(gt_to_fw(ss->gt), XE_FORCEWAKE_ALL))
>> + xe_gt_info(ss->gt, "failed to get forcewake for coredump capture\n");
>> + xe_vm_snapshot_capture_delayed(ss->vm);
>> + xe_guc_exec_queue_snapshot_capture_delayed(ss->ge);
>> + xe_force_wake_put(gt_to_fw(ss->gt), XE_FORCEWAKE_ALL);
>> +
>> + ss->read_data = kvmalloc(SZ_16M, GFP_USER);
>> + if (!ss->read_data)
>> + return;
>> +
>> + ss->read_data_size = SZ_16M;
Shouldn't it be possible to make a reasonable approximation of the required size up front, instead of reallocating all the time?
Or run the dump twice: the first pass only computes the size, the second fills the buffer.
In any case, an estimate along the lines of

	ss->read_data_size = some_const + VM_DUMP_SIZE * some_other_const;

is likely too fragile, but we should be able to change the code to dump twice to get the accurate number. A rough sketch of what I mean follows.
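Something like this, with a hand-rolled counting printer for the first pass and the existing drm_coredump_printer for the second. __xe_devcoredump_print() and xe_devcoredump_render() are hypothetical names for illustration, not code from this patch (assumes <drm/drm_print.h> and <linux/mm.h>):

	/*
	 * Sketch only, not part of the patch.  Assumes a helper like
	 * __xe_devcoredump_print(struct drm_printer *p, struct xe_devcoredump *coredump)
	 * that dumps the whole snapshot through a drm_printer (roughly what
	 * __xe_devcoredump_read() already does internally).
	 */
	static void count_printfn(struct drm_printer *p, struct va_format *vaf)
	{
		size_t *size = p->arg;

		/* vsnprintf(NULL, 0, ...) returns the length without writing anything */
		*size += vsnprintf(NULL, 0, vaf->fmt, *vaf->va);
	}

	static void xe_devcoredump_render(struct xe_devcoredump *coredump)
	{
		struct xe_devcoredump_snapshot *ss = &coredump->snapshot;
		struct drm_print_iterator iter;
		struct drm_printer p;
		size_t size = 0;
		struct drm_printer cp = { .printfn = count_printfn, .arg = &size };

		/* Pass 1: count only, to learn the exact size of the readable dump. */
		__xe_devcoredump_print(&cp, coredump);		/* hypothetical helper */

		ss->read_data = kvmalloc(size, GFP_USER);
		if (!ss->read_data)
			return;
		ss->read_data_size = size;

		/* Pass 2: render again, this time into the exactly sized buffer. */
		iter.data = ss->read_data;
		iter.start = 0;
		iter.offset = 0;
		iter.remain = size;
		p = drm_coredump_printer(&iter);
		__xe_devcoredump_print(&p, coredump);		/* hypothetical helper */
	}

The only subtlety is that both passes must produce byte-identical output, so anything volatile (the capture timestamp, for example) has to be sampled once and reused in both passes.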
Cheers,
~Maarten