[PATCH 2/2] drm/scheduler: stop setting rq to NULL
Andrey Grodzovsky
Andrey.Grodzovsky at amd.com
Thu Aug 9 17:39:56 UTC 2018
On 08/06/2018 04:14 AM, Christian König wrote:
> On 03.08.2018 at 20:41, Andrey Grodzovsky wrote:
>>
>>
>>
>> On 08/06/2018 08:44 AM, Christian König wrote:
>>> On 03.08.2018 at 16:54, Andrey Grodzovsky wrote:
>>>>
>>>> [SNIP]
>>>>
>>>>>>
>>>>>>>
>>>>>>> > Second of all, even after we removed the entity from rq in
>>>>>>> > drm_sched_entity_flush to terminate any subsequent submissions
>>>>>>> > to the queue, the other thread doing push job can just acquire
>>>>>>> > a run queue again from drm_sched_entity_get_free_sched and
>>>>>>> > continue submission.
>>>>>>>
>>>>>>> Hi Christian
>>>>>>>
>>>>>>> That is actually desired.
>>>>>>>
>>>>>>> When another process is now using the entity to submit jobs,
>>>>>>> adding it back to the rq is actually the right thing to do
>>>>>>> because the entity is still in use.
>>>>>>>
>>>>>>
>>>>>> Yes, no problem if it's another process. But what about another
>>>>>> thread from the same process? Is it a possible use case that two
>>>>>> threads from the same process submit to the same entity
>>>>>> concurrently? If so, and we specifically kill one, the other will
>>>>>> not stop even if we want it to, because the current code just makes
>>>>>> it acquire an rq for itself.
>>>>>
>>>>> Well, you can't kill a single thread of a process (you can only
>>>>> interrupt it); a SIGKILL will always kill the whole process.
>>>>
>>>> Is the following scenario possible and acceptable?
>>>> Two threads from the same process work on the same queue. Thread A is
>>>> currently in drm_sched_entity_flush->wait_event_timeout (the process
>>>> is being shut down because of a SIGKILL sent by the user), while
>>>> thread B is still inside drm_sched_entity_push_job, before 'if
>>>> (reschedule)'. A stops waiting because the queue became empty and
>>>> then removes the entity's queue from the scheduler's run queue, while
>>>> B goes inside 'reschedule' because it evaluates to true ('first' is
>>>> true, as are all the rest of the conditions), acquires a new rq,
>>>> later adds the entity back to a scheduler (maybe a different one) and
>>>> keeps submitting jobs as much as it likes, and can then be stuck for
>>>> up to 'timeout' time in its own drm_sched_entity_flush waiting for
>>>> them.
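
Roughly, the interleaving I had in mind, as a simplified sketch (from
memory, not verbatim kernel code; locking and most details omitted):

    /* Thread A, in drm_sched_entity_flush() during SIGKILL teardown */
    wait_event_timeout(sched->job_scheduled,
                       drm_sched_entity_is_idle(entity),
                       timeout);                       /* (1) queue drains  */
    drm_sched_rq_remove_entity(entity->rq, entity);    /* (3) entity off rq */

    /* Thread B, in drm_sched_entity_push_job(), racing with A */
    first = spsc_queue_count(&entity->job_queue) == 0;
    if (reschedule) {                                  /* (2) 'first' true  */
            rq = drm_sched_entity_get_free_sched(entity);
            entity->rq = rq;                           /* (4) new rq taken, */
    }                                                  /* undoing A's (3)   */
    /* ... spsc_queue_push() + drm_sched_rq_add_entity() then keep the
     * entity schedulable even though the process is exiting */
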
>>>
>>> I'm not 100% sure but I don't think that can happen.
>>>
>>> See, flushing the fd is done while dropping the fd, which happens
>>> only after all threads of the process in question are killed.
>>
>> Yeah, this FD handling is indeed a lot of gray area for me, but as far
>> as I remember flushing is done per thread when it exits (possibly due
>> to a signal).
>> Now signal interception and processing (as a result of which .flush
>> will get called if SIGKILL was received) is done at certain points,
>> among which is when returning from an IOCTL.
>> So if the first thread was at the very end of the CS ioctl when SIGKILL
>> was received while the other one was at the beginning, then I think we
>> might see something like the scenario above.
>
> SIGKILL isn't processed as long as any thread of the application is
> still inside the kernel. That's why we have wait_event_killable().
Can you tell me where this happens? What I see in the code is that
do_group_exit calls zap_other_threads, which just adds SIGKILL to the
signal sets of the other threads in the group and sends a wake up.
Then do_exit will close all FDs for the current thread and so .flush will
be called; when the last thread drops its refcount for the FD, .release
will be called.
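
In call-chain form, the path I'm reading looks roughly like this (a
simplified sketch from my reading of kernel/signal.c, kernel/exit.c and
fs/file.c; the exact helpers in between may be off):

    do_group_exit(SIGKILL)
        zap_other_threads(current)        /* sigaddset(SIGKILL) plus
                                           * signal_wake_up() for each
                                           * sibling thread; it does not
                                           * wait for them */

    do_exit()                             /* runs in each thread separately */
        exit_files(tsk)
            put_files_struct()
                filp_close(file, ...)
                    file->f_op->flush()   /* -> drm_sched_entity_flush()
                                           * via the driver's .flush
                                           * callback */
                    fput(file)
                        file->f_op->release()  /* on the last reference only */
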
Andrey
>
> So I don't think that the scenario above is possible, but I'm really
> not 100% sure either.
>
> Christian.
>
>>
>> Andrey
>>
>>>
>>> Otherwise the flushing wouldn't make much sense. In other words,
>>> imagine an application that is killed while one of its threads is
>>> doing a write() on an fd.
>>>
>>> The idea of the flush is to preserve the data, and that won't work if
>>> things aren't correctly ordered.
>>>
>>>> My understanding was that the introduction of entity->last is to
>>>> force immediate termination of job submissions by any thread from
>>>> the terminating process.
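
For reference, the mechanism I meant looks, if I remember correctly,
roughly like this (the field is entity->last_user in the code; exact
names from memory, so they may be slightly off):

    /* drm_sched_entity_push_job(): remember the last submitter */
    WRITE_ONCE(entity->last_user, current->group_leader);

    /* drm_sched_entity_flush(): if the last submitter is the process
     * being killed, take the entity off its run queue so nothing more
     * gets scheduled from it */
    last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
    if ((!last_user || last_user == current->group_leader) &&
        (current->flags & PF_EXITING) && (current->exit_code == SIGKILL))
            drm_sched_rq_remove_entity(entity->rq, entity);
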
>>>
>>> We could consider reordering that once more. Going to play out all
>>> scenarios in my head over the weekend :)
>>>
>>> Christian.
>>>
>>>>
>>>> Andrey
>>>
>>
>>
>>
>