[PATCH v2 1/2] drm/sched: memset() 'job' in drm_sched_job_init()

Tvrtko Ursulin tursulin at ursulin.net
Fri Sep 13 13:30:26 UTC 2024


On 13/09/2024 13:30, Philipp Stanner wrote:
> On Fri, 2024-09-13 at 12:56 +0100, Tvrtko Ursulin wrote:
>>
>> Hi,
>>
>> On 28/08/2024 10:41, Philipp Stanner wrote:
>>> drm_sched_job_init() has no control over how users allocate struct
>>> drm_sched_job. Unfortunately, the function can also not set some
>>> struct members such as job->sched.
>>
>> job->sched usage from within looks like a bug. But not related to the
>> memset you add.
>>
>> For this one something like this looks easiest for a start:
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index ab53ab486fe6..877113b01af2 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -788,7 +788,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
>>                    * or worse--a blank screen--leave a trail in the
>>                    * logs, so this can be debugged easier.
>>                    */
>> -               drm_err(job->sched, "%s: entity has no rq!\n", __func__);
>> +               pr_err("%s: entity has no rq!\n", __func__);
>>                   return -ENOENT;
>>           }
>>
>> Fixes: 56e449603f0a ("drm/sched: Convert the GPU scheduler to variable number of run-queues")
>> Cc: <stable at vger.kernel.org> # v6.7+
> 
> Danilo and I already solved that:
> 
> https://lore.kernel.org/all/20240827074521.12828-2-pstanner@redhat.com/

Ah.. I saw the link to this in your maintainership thread and
superficially assumed it was among the pending stuff. All good.

>>
>>> This could theoretically lead to UB by users dereferencing the
>>> struct's pointer members too early.
>>
>> Hmm if drm_sched_job_init returned an error callers should not
>> dereference anything. What was actually the issue you were debugging?
> 
> I was learning about the scheduler, wrote a dummy driver and had
> awkward behavior. Turned out it was this pointer not being initialized.
> I would have seen it immediately if it were NULL.
> 
> The actual issue was, and still is IMO, that a function called
> drm_sched_job_init() suggests it initializes the job. But it doesn't, it
> only partially initializes it. Only after drm_sched_job_arm() has run
> are you actually ready to go.

In my experience one good approach when developing stuff is to have the
various kernel debugging aids enabled: lockdep, SLAB debugging, memory
poisoning, KFENCE.. Then, if you were allocating your job without
GFP_ZERO _and_ dereferencing something too early out of a
misunderstanding of the API, you would get something obvious in the oops
rather than a random pointer.

The same applies to various CI systems, such as Intel's, which already
runs a debug kernel, so a lot of these mistakes are caught instantly.
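
FWIW my mental model of the intended two-step flow is roughly the sketch
below (driver-side names like my_job are made up and error handling is
trimmed, so take it as illustration only):

struct my_job {
	struct drm_sched_job base;
	/* driver-specific state ... */
};

static struct dma_fence *my_job_submit(struct my_job *job,
				       struct drm_sched_entity *entity,
				       void *owner)
{
	struct dma_fence *done;
	int ret;

	/* Step 1: partial init; the scheduler does not know about the job yet. */
	ret = drm_sched_job_init(&job->base, entity, 1, owner);
	if (ret)
		return ERR_PTR(ret);

	/* Dependencies, if any, would be added here via drm_sched_job_add_dependency(). */

	/* Step 2: arm initializes the scheduler fences; only now is the job usable. */
	drm_sched_job_arm(&job->base);
	done = dma_fence_get(&job->base.s_fence->finished);

	/* Once armed the job must be pushed, not freed. */
	drm_sched_entity_push_job(&job->base);

	return done;
}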

>> Adding a memset is I think not the best solution since it is very likely
>> redundant to someone doing a kzalloc in the first place.
> 
> It is redundant in most cases, but it is effectively for free. I
> measured the runtime with 1e6 jobs with and without memset and there
> was no difference.

I guess if kzalloc() and drm_sched_job_init() are close enough in time
that the cachelines stay put, and depending on how you measure, it may be
hard to see, but the cost is still there.
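
The common pattern I have in mind is simply something like this
(illustrative only, reusing the made-up my_job from above):

static int my_job_create(struct drm_sched_entity *entity, void *owner,
			 struct my_job **out)
{
	struct my_job *job;
	int ret;

	/* kzalloc() already zeroes the whole struct here... */
	job = kzalloc(sizeof(*job), GFP_KERNEL);
	if (!job)
		return -ENOMEM;

	/* ...so a memset() inside drm_sched_job_init() zeroes it a second time. */
	ret = drm_sched_job_init(&job->base, entity, 1, owner);
	if (ret) {
		kfree(job);
		return ret;
	}

	*out = job;
	return 0;
}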

For instance, in
https://lore.kernel.org/amd-gfx/20240813140310.82706-1-tursulin@igalia.com/
I can see with perf that both memsets are hotspots, even when just
testing with glxgears and vsync off.

But I don't feel too strongly about this, and there definitely is sense
in initializing everything. Perhaps, instead of a memset, we should even
use the correct initializer per field? After all, in there we have an
spsc_node, atomic_t, ktime_t, dma_fence_cb (even in an annoying union),
drm_sched_priority.. In an ideal world all of those would have their own
initializers. But some don't, so meh.
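
Something along these lines is what I mean, with field names from my
possibly stale reading of struct drm_sched_job, so purely a sketch:

static void drm_sched_job_fields_init(struct drm_sched_job *job)
{
	INIT_LIST_HEAD(&job->list);
	atomic_set(&job->karma, 0);
	job->sched = NULL;
	job->s_fence = NULL;
	job->entity = NULL;
	/* ...plus whatever the spsc_node, the dma_fence_cb in its union,
	 * the ktime_t and so on would need. */
}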

Regards,

Tvrtko

>>> It is easier to debug such issues if these pointers are initialized to
>>> NULL, so dereferencing them causes a NULL pointer exception.
>>> Accordingly, drm_sched_entity_init() does precisely that and
>>> initializes its struct with memset().
>>>
>>> Initialize parameter "job" to 0 in drm_sched_job_init().
>>>
>>> Signed-off-by: Philipp Stanner <pstanner at redhat.com>
>>> ---
>>> No changes in v2.
>>> ---
>>>    drivers/gpu/drm/scheduler/sched_main.c | 8 ++++++++
>>>    1 file changed, 8 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 356c30fa24a8..b0c8ad10b419 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -806,6 +806,14 @@ int drm_sched_job_init(struct drm_sched_job *job,
>>>    		return -EINVAL;
>>>    	}
>>>    
>>> +	/*
>>> +	 * We don't know for sure how the user has allocated. Thus, zero the
>>> +	 * struct so that unallowed (i.e., too early) usage of pointers that
>>> +	 * this function does not set is guaranteed to lead to a NULL pointer
>>> +	 * exception instead of UB.
>>> +	 */
>>> +	memset(job, 0, sizeof(*job));
>>> +
>>>    	job->entity = entity;
>>>    	job->credits = credits;
>>>    	job->s_fence = drm_sched_fence_alloc(entity, owner);
>>
> 

