[PATCH 12/13] drm/scheduler: rework entity flush, kill and fini

Thu Nov 17 14:41:25 UTC 2022

On 11/17/22 16:11, Christian König wrote:
> Am 17.11.22 um 14:00 schrieb Dmitry Osipenko:
>> On 11/17/22 15:59, Dmitry Osipenko wrote:
>>> On 11/17/22 15:55, Christian König wrote:
>>>> Am 17.11.22 um 13:47 schrieb Dmitry Osipenko:
>>>>> On 11/17/22 12:53, Christian König wrote:
>>>>>> Am 17.11.22 um 03:36 schrieb Dmitry Osipenko:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 10/14/22 11:46, Christian König wrote:
>>>>>>>> +/* Remove the entity from the scheduler and kill all pending jobs */
>>>>>>>> +static void drm_sched_entity_kill(struct drm_sched_entity *entity)
>>>>>>>> +{
>>>>>>>> +    struct drm_sched_job *job;
>>>>>>>> +    struct dma_fence *prev;
>>>>>>>> +
>>>>>>>> +    if (!entity->rq)
>>>>>>>> +        return;
>>>>>>>> +
>>>>>>>> +    spin_lock(&entity->rq_lock);
>>>>>>>> +    entity->stopped = true;
>>>>>>>> +    drm_sched_rq_remove_entity(entity->rq, entity);
>>>>>>>> +    spin_unlock(&entity->rq_lock);
>>>>>>>> +
>>>>>>>> +    /* Make sure this entity is not used by the scheduler at the moment */
>>>>>>>> +    wait_for_completion(&entity->entity_idle);
>>>>>>> I'm always hitting a lockup here with the Panfrost driver when
>>>>>>> terminating Xorg. Reverting this patch helps. Any ideas how to fix it?
>>>>>>>
>>>>>> Well is the entity idle or are there some unsubmitted jobs left?
>>>>> Do you mean unsubmitted to h/w? IIUC, there are unsubmitted jobs left.
>>>>>
>>>>> I see that there are 5-6 incomplete (in-flight) jobs when
>>>>> panfrost_job_close() is invoked.
>>>>>
>>>>> There are 1-2 jobs that are constantly scheduled and finished once
>>>>> every few seconds after the lockup happens.
>>>> Well what drm_sched_entity_kill() is supposed to do is to prevent
>>>> pushing queued up stuff to the hw when the process which queued it is
>>>> killed. Is the process really killed or is that just some incorrect
>>>> handling?
>>> It's actually 5-6 incomplete jobs of Xorg that are hanging when the
>>> Xorg process is closed.
>>>
>>> The two re-scheduled jobs are from sddm, so it's only the Xorg context
>>> that hangs.
>>>
>>>> In other words I see two possibilities here, either we have a bug in
>>>> the scheduler or panfrost isn't using it correctly.
>>>>
>>>> Does panfrost call drm_sched_entity_flush() before it calls
>>>> drm_sched_entity_fini()? (I don't have the driver source at hand at the
>>>> moment).
>>> Panfrost doesn't use drm_sched_entity_flush(), nor
>>> drm_sched_entity_flush().
>> *nor drm_sched_entity_fini()
> 
> Well that would mean that this is *really* buggy! How do you then end up
> in drm_sched_entity_kill()? From drm_sched_entity_destroy()?

Yes, from drm_sched_entity_destroy().

> drm_sched_entity_flush() should be called from the flush callback from
> the file_operations structure of panfrost. See amdgpu_flush() and
> amdgpu_ctx_mgr_entity_flush(). This makes sure that we wait for all
> entities of the process/file descriptor to be flushed out.
> 
> drm_sched_entity_fini() must be called before you free the memory of the
> entity structure, otherwise we would run into a use-after-free.

Right, drm_sched_entity_destroy() invokes these two functions and
Panfrost uses drm_sched_entity_destroy().
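
To spell out the call order, here is a small user-space sketch. It only
models the sequence "flush first, then fini"; the mock_* names, the struct
layout and the timeout value are stand-ins made up for illustration, not
the kernel implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct drm_sched_entity. */
struct mock_entity {
	int seq;        /* running counter of calls made on this entity */
	int flush_seq;  /* position at which flush ran, 0 if never */
	int fini_seq;   /* position at which fini ran, 0 if never */
	bool stopped;   /* set once the entity has been torn down */
};

/* Models drm_sched_entity_flush(): only records when it was called. */
static long mock_entity_flush(struct mock_entity *e, long timeout)
{
	e->flush_seq = ++e->seq;
	return timeout;  /* pretend nothing of the timeout was consumed */
}

/* Models drm_sched_entity_fini(): records the call, marks the entity stopped. */
static void mock_entity_fini(struct mock_entity *e)
{
	e->fini_seq = ++e->seq;
	e->stopped = true;
}

/* Models drm_sched_entity_destroy(): flush, then fini, in that order. */
static void mock_entity_destroy(struct mock_entity *e)
{
	mock_entity_flush(e, 1000 /* arbitrary placeholder timeout */);
	mock_entity_fini(e);
}
```

So a driver that only calls the destroy helper still goes through both
steps, just from the same call site.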

-- 
Best regards,
Dmitry