[PATCH 12/13] drm/scheduler: rework entity flush, kill and fini

Thu Nov 17 13:11:00 UTC 2022

Am 17.11.22 um 14:00 schrieb Dmitry Osipenko:
> On 11/17/22 15:59, Dmitry Osipenko wrote:
>> On 11/17/22 15:55, Christian König wrote:
>>> Am 17.11.22 um 13:47 schrieb Dmitry Osipenko:
>>>> On 11/17/22 12:53, Christian König wrote:
>>>>> Am 17.11.22 um 03:36 schrieb Dmitry Osipenko:
>>>>>> Hi,
>>>>>>
>>>>>> On 10/14/22 11:46, Christian König wrote:
>>>>>>> +/* Remove the entity from the scheduler and kill all pending jobs */
>>>>>>> +static void drm_sched_entity_kill(struct drm_sched_entity *entity)
>>>>>>> +{
>>>>>>> +    struct drm_sched_job *job;
>>>>>>> +    struct dma_fence *prev;
>>>>>>> +
>>>>>>> +    if (!entity->rq)
>>>>>>> +        return;
>>>>>>> +
>>>>>>> +    spin_lock(&entity->rq_lock);
>>>>>>> +    entity->stopped = true;
>>>>>>> +    drm_sched_rq_remove_entity(entity->rq, entity);
>>>>>>> +    spin_unlock(&entity->rq_lock);
>>>>>>> +
>>>>>>> +    /* Make sure this entity is not used by the scheduler at the
>>>>>>> moment */
>>>>>>> +    wait_for_completion(&entity->entity_idle);
>>>>>> I'm always hitting lockup here using Panfrost driver on terminating
>>>>>> Xorg. Revering this patch helps. Any ideas how to fix it?
>>>>>>
>>>>> Well is the entity idle or are there some unsubmitted jobs left?
>>>> Do you mean unsubmitted to h/w? IIUC, there are unsubmitted jobs left.
>>>>
>>>> I see that there are 5-6 incomplete (in-flight) jobs when
>>>> panfrost_job_close() is invoked.
>>>>
>>>> There are 1-2 jobs that are constantly scheduled and finished once in a
>>>> few seconds after the lockup happens.
>>> Well what drm_sched_entity_kill() is supposed to do is to prevent
>>> pushing queued up stuff to the hw when the process which queued it is
>>> killed. Is the process really killed or is that just some incorrect
>>> handling?
>> It's actually 5-6 incomplete jobs of Xorg that are hanging when Xorg
>> process is closed.
>>
>> The two re-scheduled jobs are from sddm, so it's only the Xorg context
>> that hangs.
>>
>>> In other words I see two possibilities here, either we have a bug in the
>>> scheduler or panfrost isn't using it correctly.
>>>
>>> Does panfrost calls drm_sched_entity_flush() before it calls
>>> drm_sched_entity_fini()? (I don't have the driver source at hand at the
>>> moment).
>> Panfrost doesn't use drm_sched_entity_flush(), nor drm_sched_entity_flush().
> *nor drm_sched_entity_fini()

Well that would mean that this is *really* buggy! How do you then end up 
in drm_sched_entity_kill()? From drm_sched_entity_destroy()?

drm_sched_entity_flush() should be called from the flush callback from 
the file_operations structure of panfrost. See amdgpu_flush() and 
amdgpu_ctx_mgr_entity_flush(). This makes sure that we wait for all 
entities of the process/file descriptor to be flushed out.

drm_sched_entity_fini() must be called before you free the memory the 
entity structure or otherwise we would run into an use after free.

Regards,
Christian.