Intermittent errors when using amdgpu_job_submit_direct

Wed Jul 10 13:32:33 UTC 2019

在 2019/7/10 3:26, Kuehling, Felix 写道:
> On 2019-07-09 8:58 a.m., Zhou, David(ChunMing) wrote:
>> I've raised it up when Christian make page fault, at that patch,
>> amdgpu_job_submit_direct uses exclusive page fault ring for that.
>>
>> But if you use amdgpu_job_submit_direct for gerneral rings ocuppied by
>> scheduler, I guess varias bugs will happen.
> The problem is, even the paging ring is used by the scheduler. There are
> several places where buffer operations are submitted to the paging ring
> through the scheduler. That makes any use of the paging ring through
> direct submission problematic.
>
> Even ignoring the scheduler, if it's possible that multiple threads
> submit to the paging ring, we'll need locking to ensure that the
> contents of the ring remain consistent. IIRC, the rings used to have
> locking before we had a GPU scheduler. For comparison, see
> radeon_ring.c, which still has locking. With the GPU scheduler, the
> rings became single-producer queues that no longer needed locking. But
> with direct submission that is no longer true. I think a good place to
> do that locking now would be in amdgpu_ib_schedule.

Yes, That is exact reason why we remove ring lock at that moment.

You can add back it when using submit_direct co-existing with scheduler.

-David

>
> Regards,
>     Felix
>
>
>> -David
>>
>> 在 2019/7/9 12:53, Kuehling, Felix 写道:
>>> I'm seeing some weird intermittent bugs (vm faults, hangs, etc) when
>>> trying to use amdgpu_job_submit_direct. I'm wondering if there is a
>>> possibility of a race condition, when a submit_direct and a GPU
>>> scheduler thread try to submit to the same ring at the same time. I
>>> didn't see any locking to allow multiple threads safely submitting to
>>> the same ring.
>>>
>>> Am I missing something?
>>>
>>> Thanks,
>>>       Felix
>>>