[Intel-gfx] [PATCH v2 10/12] drm/i915/ttm: handle blitter failure on DG2

Mon Jun 27 10:48:45 UTC 2022

On Thu, 23 Jun 2022 at 16:31, Matthew Auld <matthew.auld at intel.com> wrote:
>
> On 23/06/2022 15:52, Christian König wrote:
> > Am 23.06.22 um 16:13 schrieb Matthew Auld:
> >> [SNIP]
> >>>> TTM_BO_VM_NUM_PREFAULT);
> >>>> +               /*
> >>>> +                * Ensure we check for any fatal errors if we had to
> >>>> move/clear
> >>>> +                * the object. The device should already be wedged if
> >>>> we hit
> >>>> +                * such an error.
> >>>> +                */
> >>>> +               if (i915_gem_object_wait_moving_fence(obj, true))
> >>>> +                       ret = VM_FAULT_SIGBUS;
> >>>
> >>> We should check with Christian here whether it's ok to export
> >>> ttm_bo_vm_fault_idle() as a helper, so that we release the proper locks
> >>> while waiting. The above is not a bug, but causes us to wait for the
> >>> moving fence under the mmap_lock, which is considered bad.
> >>
> >> Christian, any chance we can export ttm_bo_vm_fault_idle() for use
> >> here? Or is that NACK?
> >
> > Well question is why you want to do this? E.g. what's the background?
>
> Right, so basically we need to prevent userspace from being able to
> access the pages for the object, if the ttm blit/move hits an error
> (some kind of GPU error). Normally we can just fall back to
> memcpy/memset to ensure we never leak anything (i915 is never allowed to
> hand userspace non-zeroed memory even for VRAM), but with small-BAR
> systems this might not be possible. Anyway, if we do hit an error during
> the ttm move we might now mark the object as being in an "unknown state"
> before signalling the fence. Later when binding the GPU page-tables we
> check for the "unknown state" and skip the bind (it will end up just
> pointing to some scratch pages instead). And then here on the CPU side,
> we need to sync against all the kernel fences, before then checking for
> the potential "unknown state", which is then handled by returning SIBUS.
> The i915_gem_object_wait_moving_fence() is basically doing exactly that,
> but it looks dumb compared to what ttm_bo_vm_fault_idle() is doing. And
> then while all this going on we then also "wedge" the device to
> basically signal that it's busted, which should prevent further work
> being submitted to the gpu.

Gentle ping?

>
> >
> > Regards,
> > Christian.