Re: ✗ Fi.CI.BAT: failure for series starting with [1/6] dma-buf: add dynamic DMA-buf handling v13

Daniel Vetter daniel at ffwll.ch
Thu Aug 8 07:59:19 UTC 2019


On Thu, Aug 8, 2019 at 9:09 AM Koenig, Christian
<Christian.Koenig at amd.com> wrote:
>
> On 07.08.19 at 23:19, Daniel Vetter wrote:
> > On Wed, Jul 31, 2019 at 10:55:02AM +0200, Daniel Vetter wrote:
> >> On Thu, Jun 27, 2019 at 09:28:11AM +0200, Christian König wrote:
> >>> Hi Daniel,
> >>>
> >>> those failures look random to me and not related to my patch
> >>> set. Correct?
> >> First one I looked at has the reservation_obj all over:
> >>
> >> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-cml-u/igt@gem_exec_fence@basic-busy-default.html
> >>
> >> So my 5-second guess is ... probably real?
> >>
> >> Note that with the entire lmem stuff going on right now there are massive
> >> discussions about how we're doing resv_obj vs obj->mm.lock the wrong way
> >> round in i915, so I'm not surprised at all that you managed to trip this.
> >>
> >> The way I see it right now is that obj->mm.lock needs to be limited to
> >> dealing with the i915 shrinker interactions only, and only for i915 native
> >> objects. And for dma-bufs we need to make sure it's not anywhere in the
> >> callchain. Unfortunately that's a massive refactor I guess ...
> > Thought about this some more: aside from just breaking i915 or waiting
> > until it's refactored (both not awesome), I think the only option is to get
> > back to the original caching, and to figure out whether we really need to
> > take the direction into account for that, or whether upgrading to
> > bidirectional unconditionally would be ok. I think there are only really two
> > cases where this matters:
> >
> > - Display drivers using the cma/dma_alloc helpers. Everything is allocated
> >    fully coherent, CPU-side WC, no flushing.
> >
> > - Everyone else (on platforms where there's actually some flushing going
> >    on) is rendering GPUs, and those always map bidirectional and want
> >    the mapping cached for as long as possible.
> >
> > With that we could go back to creating the cached mapping at attach time
> > and avoid inflicting the reservation object lock on places that would keel
> > over.
> >
> > Thoughts?
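
To make that concrete, something like this (sketch only,
exporter_map_pages() and the attach->priv usage are made up for
illustration, not actual patch code):

/* Map once at attach time, always bidirectional, and just hand out the
 * cached sg_table from map_dma_buf so nothing in that path needs the
 * reservation_object lock.
 */
static int sketch_attach(struct dma_buf *dmabuf,
                         struct dma_buf_attachment *attach)
{
        struct sg_table *sgt;

        sgt = exporter_map_pages(dmabuf->priv, attach->dev,
                                 DMA_BIDIRECTIONAL);
        if (IS_ERR(sgt))
                return PTR_ERR(sgt);

        attach->priv = sgt;     /* cached for the attachment's lifetime */
        return 0;
}

static struct sg_table *sketch_map_dma_buf(struct dma_buf_attachment *attach,
                                           enum dma_data_direction dir)
{
        /* direction is ignored, the cached mapping is bidirectional */
        return attach->priv;
}
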
>
> Actually we had a not-so-nice internal mail thread with our hardware
> guys, and it looks like we have tons of hardware bugs/exceptions where
> sometimes PCIe BARs are only readable or only writable. So it turns out
> that always caching with bidirectional won't work for us either.
>
> In addition to that, I'm not sure how i915 actually triggered the issue,
> because with the current code that shouldn't be possible.

Forgot to explain this: i915 has its own locks for managing buffer
state, originally struct_mutex, now also i915_gem_obj->mm.lock. When
importing we take both of these before calling into the exporter; when
exporting we need them when we get called by the importer. If the
importer uses the reservation_object lock you get a classic ABBA
deadlock.
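
Sketched out (heavily simplified, names approximate, not the actual
callchains):

/* import path: i915 takes its own lock, then the exporter takes the
 * reservation_object lock inside dma_buf_map_attachment()
 */
static void import_side(struct drm_i915_gem_object *obj,
                        struct dma_buf_attachment *attach)
{
        mutex_lock(&obj->mm.lock);                         /* lock A */
        dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL); /* takes lock B */
        mutex_unlock(&obj->mm.lock);
}

/* export path: a dynamic importer holds the reservation_object lock
 * when calling into i915, which then needs its own lock to pin pages
 */
static void export_side(struct drm_i915_gem_object *obj,
                        struct dma_buf_attachment *attach)
{
        ww_mutex_lock(&obj->base.resv->lock, NULL);        /* lock B */
        dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL); /* needs lock A */
        ww_mutex_unlock(&obj->base.resv->lock);
}

Same two locks, taken in opposite order on the two paths.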

I thought the plan was to push struct_mutex down and obj->mm.lock up
until they meet in the middle, so that we can start using the resv_obj
ww_mutex for everything. But looking at some of the latest in-flight
patches (I cc'ed you on them) that doesn't really seem to be the plan,
which is bad :-/
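
For reference, the ww_mutex dance I mean, as a rough sketch with the
error paths trimmed: if both sides only ever take buffer locks as
reservation_object ww_mutexes under one acquire context, the inversion
above turns into -EDEADLK plus backoff instead of a hard deadlock.

static int lock_both(struct reservation_object *a,
                     struct reservation_object *b,
                     struct ww_acquire_ctx *ctx)
{
        int ret;

        ww_acquire_init(ctx, &reservation_ww_class);

        ret = ww_mutex_lock(&a->lock, ctx);
        if (ret)
                return ret;

        ret = ww_mutex_lock(&b->lock, ctx);
        if (ret == -EDEADLK) {
                /* wounded: drop what we hold, sleep on the contended
                 * lock, then retake the rest in the new order
                 */
                ww_mutex_unlock(&a->lock);
                ww_mutex_lock_slow(&b->lock, ctx);
                ret = ww_mutex_lock(&a->lock, ctx);
                /* another -EDEADLK means repeating the dance */
        }

        if (!ret)
                ww_acquire_done(ctx);
        return ret;
}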
-Daniel

> But independent of that I came to the conclusion that we first need to
> get to a common view of what the fences in the reservation object mean,
> or otherwise the whole thing here isn't going to work smoothly either.
>
> So I'm working on that for now, and when that's finished I will come
> back to this problem here again.
>
> Regards,
> Christian.
>
>
> > -Daniel
> >
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

