[Intel-gfx] [PATCH 6/9] drm/i915: driver based PASID handling

Thu Oct 8 04:29:14 PDT 2015

On 07/10/2015 17:14, Daniel Vetter wrote:
> On Wed, Oct 07, 2015 at 08:16:42AM -0700, Jesse Barnes wrote:
>> On 10/07/2015 06:00 AM, David Woodhouse wrote:
>>> On Fri, 2015-09-04 at 09:59 -0700, Jesse Barnes wrote:
>>>> +
>>>> +       ret = handle_mm_fault(mm, vma, address,
>>>> +                             desc.wr_req ? FAULT_FLAG_WRITE : 0);
>>>> +       if (ret & VM_FAULT_ERROR) {
>>>> +               gpu_mm_segv(tsk, address, SEGV_ACCERR); /* ? */
>>>> +               goto out_unlock;
>>>> +       }
>>>> +
>>>
>>> Hm, do you need to force the SEGV there, in what ought to be generic
>>> IOMMU code?
>>>
>>> Can you instead just let the fault handler return an appropriate
>>> failure code to the IOMMU request queue and then deal with the
>>> resulting error on the i915 device side?
>>
>> I'm not sure if we get enough info on the i915 side to handle it
>> reasonably, we'll have to test that out.
>
> We do know precisely which context blew up, but without the TDR work we
> can't yet just kill the offender selective without affecting the other
> active gpu contexts.

Could someone clarify what this means from the TDR point of view, 
please? When you say "context blew up" I'm guessing that you mean that 
come context caused the fault handler to get involved somehow?

Does this imply that the offending context will hang and the driver will 
have to detect this hang? If so, then yes - if we have the per-engine 
hang recovery mode as part of the upcoming TDR work in place then we 
could handle it by stepping over the offending batch buffer and moving 
on with a minimum of side-effects on the rest of the driver/GPU.

Or does this imply that we have some new integration path to deal with? 
(something that I should be aware of when upstreaming TDR?)

Sorry if I missed something blatantly obvious in the patch ;).

Thanks,
Tomas

>
> But besides that I really don't see a reason why we need to kill the
> process if the gpu faults. After all if a thread sigfaults then signal
> goes to that thread and not some random one (or the one thread that forked
> the thread that blew up). And we do have interfaces to tell userspace that
> something bad happened with the gpu work it submitted.
>
> Chris made a similar patch for userptr and I didn't like that one either.
> Worst case userspace has a special SEGV handler and then things really go
> down badly when that handler gets triggered at an unexpected place.
> -Daniel
>
>
>>> That way, you should hopefully get to gracefully cope with reporting
>>> errors for a specific *context*, rather than killing the whole process.
>>
>> It would be best to get per-context error info, but killing the process
>> may be unavoidable (just as if a single thread clobbers memory in your
>> process).
>>
>> Jesse
>> _______________________________________________
>> Intel-gfx mailing list
>> Intel-gfx at lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
>