[PATCH v3 2/2] drm/xe/devcoredump: Remove IS_ERR_OR_NULL check for kzalloc

Mon Feb 24 22:14:39 UTC 2025

On 2/24/2025 13:38, Lucas De Marchi wrote:
> On Thu, Feb 20, 2025 at 05:36:19PM -0800, John Harrison wrote:
>> On 2/20/2025 15:54, Lucas De Marchi wrote:
>>> On Thu, Feb 20, 2025 at 05:29:56PM +0100, Michal Wajdeczko wrote:
>>>> On 20.02.2025 01:17, Shuicheng Lin wrote:
>>>>> kzalloc returns a valid pointer or NULL if the allocation fails.
>>>>> It never returns an error pointer. It is better to check for NULL 
>>>>> directly.
>>>>>
>>>>> Signed-off-by: Shuicheng Lin <shuicheng.lin at intel.com>
>>>>> Cc: John Harrison <John.C.Harrison at Intel.com>
>>>>> Cc: Lucas De Marchi <lucas.demarchi at intel.com>
>>>>> ---
>>>>>  drivers/gpu/drm/xe/xe_devcoredump.c | 4 ++--
>>>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
>>>>> index 60d15e455017..81b9d9bb3f57 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_devcoredump.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_devcoredump.c
>>>>> @@ -426,8 +426,8 @@ void xe_print_blob_ascii85(struct drm_printer *p, const char *prefix, char suffi
>>>>>          drm_printf(p, "Offset not word aligned: %zu", offset);
>>>>>
>>>>>      line_buff = kzalloc(DMESG_MAX_LINE_LEN, GFP_KERNEL);
>>>>> -    if (IS_ERR_OR_NULL(line_buff)) {
>>>>> -        drm_printf(p, "Failed to allocate line buffer: %pe", line_buff);
>>>>> +    if (!line_buff) {
>>>>> +        drm_printf(p, "Failed to allocate line buffer\n");
>>>>
>>>> btw, since this line will be included in the output, where one could
>>>> expect ascii85 data, shouldn't we print that diagnostic message with
>>>> some special prefix to make it clear there is nothing to parse? like
>>>>
>>>>     "# Failed to allocate internal data\n"
>>>>
>>>> also since caller may have already provided a prefix, shouldn't we also
>>>> include it in this diagnostic message?
>>>>
>>>>     "%s%s# Failed to allocate internal data\n",
>>>>     prefix ?: "",
>>>>     prefix ? ": " : ""
>>>
>>> or stop printing and return an error. we are missing the `.error: ...`
>>> already that is used in other places.
>>>
>>> $ git grep '\.error: ' -- drivers/gpu/drm/xe
>>> drivers/gpu/drm/xe/xe_vm.c:             drm_printf(p, "[0].error: %li\n", PTR_ERR(snap));
>>> drivers/gpu/drm/xe/xe_vm.c:                     drm_printf(p, "[%llx].error: %li\n", snap->snap[i].ofs,
>> This is the place that should be printing an error. The whole point 
>> of this helper is that it wraps up all the blob output. However, do we 
>
> note that this is not printing an error in the log. This is adding the
> error message in the place that is supposed to have the *data* for that
> key. That's why there was supposed to be a .error key to accompany this
> behavior.  Right now if you look only at the devcoredump you have no
> clue the data is actually an error message, not real data.
Argh! Yes, getting myself confused. The '.data' is part of the prefix.
We should trim the prefix down to just the bit in square brackets and
have the helper print the size, the data and/or the error keys as
appropriate. Although I'm not sure how that would work with the GuC log
being split across multiple BOs. It might be worth pushing support for
split BOs into the helper as well. Let it take care of everything.

>
>
>> need to distinguish between a non-capture-process error (e.g. bad VM 
>> object) versus an error in the capture itself (e.g. out of memory 
>> converting the binary data to a text string)?
>>
>> Not sure what error routes there are in the VM capture? Are they 
>> things that are important to include in the devcoredump because they 
>> have significant meaning about what caused the hang? Or are the only 
>> possible errors related to the capture process itself - failing to 
>> allocate memory to store the capture or such?
>>
>> If the only errors are capture related then yes, just change this 
>> line to print "[%prefix].error: %errno\n". But if there is use to 
>> distinguish between bad VM objects and failed captures, then maybe 
>> this one should be "[%prefix].capture_error: %errno\n" or something?
>
> -ENOMEM vs something else would already be a very good indicator.
But for a VM, can there be an ENOMEM error because the VM itself failed 
to allocate (and thus caused the app to dereference a null pointer on 
the GT and thus hit the crash)? Or can the ENOMEM only come from the 
devcoredump code trying to cache and/or convert the object into a 
dumpable entity?

Maybe there is no value in differentiating where the error came from.
Maybe a bad VM or other object just won't exist and won't be included in
the devcoredump in the first place. But I'm not familiar with that code,
so I just want to make sure we have thought about the possibility.

John.

>
> This discussion can continue. For now applying these patches that are
> orthogonal.
>
> Applied both to drm-xe-next.
>
> thanks,
> Lucas De Marchi
>
>>
>> John.
>>
>>
>>>
>>> Lucas De Marchi
>>>
>>>
>>>
>>>
>>>>
>>>>>          return;
>>>>>      }
>>>>>
>>>>
>>