[PATCH 12/13] drm/scheduler: rework entity flush, kill and fini
Christian König
christian.koenig at amd.com
Thu Nov 17 13:11:00 UTC 2022
On 11/17/22 14:00, Dmitry Osipenko wrote:
> On 11/17/22 15:59, Dmitry Osipenko wrote:
>> On 11/17/22 15:55, Christian König wrote:
>>> On 11/17/22 13:47, Dmitry Osipenko wrote:
>>>> On 11/17/22 12:53, Christian König wrote:
>>>>> On 11/17/22 03:36, Dmitry Osipenko wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 10/14/22 11:46, Christian König wrote:
>>>>>>> +/* Remove the entity from the scheduler and kill all pending jobs */
>>>>>>> +static void drm_sched_entity_kill(struct drm_sched_entity *entity)
>>>>>>> +{
>>>>>>> + struct drm_sched_job *job;
>>>>>>> + struct dma_fence *prev;
>>>>>>> +
>>>>>>> + if (!entity->rq)
>>>>>>> + return;
>>>>>>> +
>>>>>>> + spin_lock(&entity->rq_lock);
>>>>>>> + entity->stopped = true;
>>>>>>> + drm_sched_rq_remove_entity(entity->rq, entity);
>>>>>>> + spin_unlock(&entity->rq_lock);
>>>>>>> +
>>>>>>> + /* Make sure this entity is not used by the scheduler at the moment */
>>>>>>> + wait_for_completion(&entity->entity_idle);
>>>>>> I'm always hitting a lockup here with the Panfrost driver when terminating
>>>>>> Xorg. Reverting this patch helps. Any ideas how to fix it?
>>>>>>
>>>>> Well is the entity idle or are there some unsubmitted jobs left?
>>>> Do you mean unsubmitted to h/w? IIUC, there are unsubmitted jobs left.
>>>>
>>>> I see that there are 5-6 incomplete (in-flight) jobs when
>>>> panfrost_job_close() is invoked.
>>>>
>>>> After the lockup happens, there are 1-2 jobs that keep getting scheduled
>>>> and finished every few seconds.
>>> Well what drm_sched_entity_kill() is supposed to do is to prevent
>>> pushing queued up stuff to the hw when the process which queued it is
>>> killed. Is the process really killed or is that just some incorrect
>>> handling?
>> It's actually 5-6 incomplete Xorg jobs that are hanging when the Xorg
>> process is closed.
>>
>> The two re-scheduled jobs are from sddm, so it's only the Xorg context
>> that hangs.
>>
>>> In other words I see two possibilities here, either we have a bug in the
>>> scheduler or panfrost isn't using it correctly.
>>>
>>> Does panfrost call drm_sched_entity_flush() before it calls
>>> drm_sched_entity_fini()? (I don't have the driver source at hand at the
>>> moment.)
>> Panfrost doesn't use drm_sched_entity_flush(), nor drm_sched_entity_flush().
> *nor drm_sched_entity_fini()
Well, that would mean that this is *really* buggy! How do you then end up
in drm_sched_entity_kill()? From drm_sched_entity_destroy()?
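A minimal userspace sketch of the path in question, assuming drm_sched_entity_destroy() is a flush followed by a fini and that, with this patch, fini ends up in drm_sched_entity_kill(). All `model_*` names and the one-job-per-tick budget are hypothetical; this is not kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical userspace model; these are NOT the real drm_sched types. */
struct model_entity {
	int pending_jobs;  /* jobs queued but not yet pushed to the hardware */
	bool on_rq;        /* still linked into a scheduler run queue */
	bool stopped;      /* no further jobs may be pushed */
	char trace[64];    /* records the order in which the helpers ran */
};

static void trace_call(struct model_entity *e, const char *name)
{
	/* Append without overflowing: leave room for the terminating NUL. */
	strncat(e->trace, name, sizeof(e->trace) - strlen(e->trace) - 1);
}

/* Stand-in for drm_sched_entity_flush(): push queued jobs while budget lasts. */
static long model_entity_flush(struct model_entity *e, long timeout)
{
	trace_call(e, "flush,");
	while (e->pending_jobs > 0 && timeout > 0) {
		e->pending_jobs--;  /* assumption: one job reaches the hardware per tick */
		timeout--;
	}
	return timeout;
}

/* Stand-in for drm_sched_entity_kill(): detach and drop whatever is left. */
static void model_entity_kill(struct model_entity *e)
{
	trace_call(e, "kill,");
	if (!e->on_rq)
		return;
	e->stopped = true;
	e->on_rq = false;
	e->pending_jobs = 0;
}

/* Stand-in for drm_sched_entity_fini(); assumed to end up in kill. */
static void model_entity_fini(struct model_entity *e)
{
	trace_call(e, "fini,");
	model_entity_kill(e);
	assert(!e->on_rq);  /* the scheduler must not see the entity any more */
}

/* Stand-in for drm_sched_entity_destroy(): a flush followed by a fini. */
static void model_entity_destroy(struct model_entity *e, long timeout)
{
	trace_call(e, "destroy,");
	(void)model_entity_flush(e, timeout);
	model_entity_fini(e);
}
```

With 5 queued jobs and a budget of 3, the flush pushes three and the model's kill simply drops the remaining two. In the quoted patch, the corresponding step in kill is the wait_for_completion(&entity->entity_idle) call, which is where the reported lockup occurs.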
drm_sched_entity_flush() should be called from the flush callback in
panfrost's file_operations structure. See amdgpu_flush() and
amdgpu_ctx_mgr_entity_flush(). This makes sure that we wait for all
entities of the process/file descriptor to be flushed out.
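A minimal sketch of the per-file flush, loosely modeled on the idea behind amdgpu_ctx_mgr_entity_flush(): every entity owned by the file descriptor is flushed, and the remaining timeout is assumed to be handed from one entity to the next. The `fd_*` names and the job/tick accounting are hypothetical and only illustrate the budget handling:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical entity model: only the number of queued jobs matters here. */
struct fd_entity {
	int pending_jobs;
};

/* Stand-in for drm_sched_entity_flush(): returns the unused part of the budget. */
static long fd_entity_flush(struct fd_entity *e, long timeout)
{
	while (e->pending_jobs > 0 && timeout > 0) {
		e->pending_jobs--;  /* assumption: one job drains per tick */
		timeout--;
	}
	return timeout;
}

/*
 * Stand-in for what a driver's file_operations .flush callback would do:
 * flush every entity of this file descriptor, passing the remaining budget
 * along so that the whole flush is bounded by a single timeout.
 */
static long fd_flush_all(struct fd_entity *ents, size_t n, long timeout)
{
	for (size_t i = 0; i < n; i++)
		timeout = fd_entity_flush(&ents[i], timeout);
	return timeout;
}
```

Sharing one budget bounds the total time spent when the file is closed; the trade-off is that a slow entity early in the list can use up the time left for the others.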
drm_sched_entity_fini() must be called before you free the memory of the
entity structure, otherwise we would run into a use after free.
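Why the order matters, as a minimal userspace model: the scheduler's run queue keeps pointers to entities it does not own, so the entity has to be unlinked (the fini step) before its memory is released. The `model_rq`, `rq_*` and `model_entity_fini` names are hypothetical stand-ins, not the drm_sched API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_RQ_MAX 8

struct model_entity;

/* Hypothetical run queue: holds pointers, does not own the entities. */
struct model_rq {
	struct model_entity *entities[MODEL_RQ_MAX];
	size_t count;
};

struct model_entity {
	struct model_rq *rq;  /* NULL once the entity is detached */
};

static void rq_add(struct model_rq *rq, struct model_entity *e)
{
	assert(rq->count < MODEL_RQ_MAX);
	rq->entities[rq->count++] = e;
	e->rq = rq;
}

static bool rq_contains(const struct model_rq *rq, const struct model_entity *e)
{
	for (size_t i = 0; i < rq->count; i++)
		if (rq->entities[i] == e)
			return true;
	return false;
}

/* Stand-in for the unlink step of drm_sched_entity_fini(). */
static void model_entity_fini(struct model_entity *e)
{
	struct model_rq *rq = e->rq;

	if (!rq)
		return;  /* already detached, like the !entity->rq check in kill */
	for (size_t i = 0; i < rq->count; i++) {
		if (rq->entities[i] == e) {
			/* Swap-remove: move the last pointer into this slot. */
			rq->count--;
			rq->entities[i] = rq->entities[rq->count];
			break;
		}
	}
	e->rq = NULL;
}
```

Calling free() on an entity before model_entity_fini() would leave its pointer in the run queue, and a scheduler that walks the queue and touches each entity would then read freed memory.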
Regards,
Christian.