[PATCH 1/5] drm/vmwgfx: unbind in vmw_ttm_unpopulate
Dave Airlie
airlied at gmail.com
Mon Jul 26 00:03:22 UTC 2021
On Fri, 23 Jul 2021 at 19:40, Christian König
<ckoenig.leichtzumerken at gmail.com> wrote:
>
> Am 23.07.21 um 11:21 schrieb Daniel Vetter:
> > On Fri, Jul 23, 2021 at 11:13 AM Christian König
> > <ckoenig.leichtzumerken at gmail.com> wrote:
> >> Am 23.07.21 um 10:47 schrieb Daniel Vetter:
> >>> On Thu, Jul 22, 2021 at 02:41:23PM +0200, Christian König wrote:
> >>>> Doing this in vmw_ttm_destroy() is to late.
> >>>>
> >>>> It turned out that this is not a good idea at all because it leaves pointers
> >>>> to freed up system memory pages in the GART tables of the drivers.
> >>> So I wanted to review this series, and I can't reconcile your claim here
> >>> with the demidlayering Dave has done. The driver patches here don't
> >>> ouright undo what Dave has done, but that means the bug has been
> >>> preexisting since forever (or is due to some other change?), and your
> >>> commit message is a bit confusing here.
> >>>
> >>> The final patch just undoes the demidlayering from Dave, and I really
> >>> don't see where there's even a functional change there.
> >>>
> >>> And even these patches here don't really change a hole lot with the
> >>> calling sequence for at least final teardown: ttm_tt_destroy_common calls
> >>> ttm_tt_unpopulate as the first thing, so at least there there's no change.
> >>>
> >>> Can you pls elaborate more clearly what exactly you're fixing and what
> >>> exactly needs to be reordered and where this bug is from (commit sha1)? As
> >>> is I'm playing detective and the evidence presented is extremely since and
> >>> I can't reconcile it at all.
> >>>
> >>> I mean I know you don't like typing commit message and documentation, but
> >>> it does get occasionally rather frustrating on the reviewer side if I have
> >>> to interpolate between some very sparse hints for this stuff :-/
> >> Yeah, when have seen the history it's rather obvious what's wrong here
> >> and I expected Dave to review it himself.
> >>
> >> Previously we had three states in TTM for a tt object: Allocated ->
> >> Populated -> Bound which on destruction where done in the order unbind
> >> -> unpopulate -> free.
> >>
> >> Dave moved handling of the bound state into the drivers since it is
> >> basically a driver decision and not a TTM decision what should be bound
> >> and what not (that part perfectly makes sense).
> > I haven't reviewed all the patches from Dave, only the one you pointed
> > at (in the last patch). And that one I still can't match up with your
> > description. If there's other commits relevant, can you pls dig them
> > out?
> >
> > Like it all makes sense what you're saying and matches the code, I
> > just can't match it up with the commit you're referencing.
>
> That is the patch directly following the one I've mentioned:
>
> commit 37bff6542c4e140a11657406c1bab50a40329cc1
> Author: Dave Airlie <airlied at redhat.com>
> Date: Thu Sep 17 13:24:50 2020 +1000
>
> drm/ttm: move unbind into the tt destroy.
>
> This moves unbind into the driver side on destroy paths.
>
> I will add a Fixes tag to make that clear.
>
> But this patch also just moves the undbind from the TTM destroy path to
> the driver destroy path.
>
> To be honest I'm not 100% sure either when the when the unbind moved
> from the unpopulate path into the destroy path, but I think that this
> wasn't always the case. Let me try to dig that up.
>
> >> The problem is that he also moved doing the unbind into the free
> >> callback instead of the unpopulate callback. This result in stale page
> >> pointers in the GART if that unpopulate operation isn't immediately
> >> followed by a free.
> >>
> >> Thinking more about it if we do populated->unpopulated->populated then
> >> we would also have stale pointers to the old pages which is even worse.
> >>
> >> This is also not de-midlayering since we already have a proper
> >> ttm_tt_init()/ttm_tt_fini() functions which should work nicely for the
> >> tt object.
> > Well you're last patch moves the ttm_tt_destroy_common stuff back into
> > ttm, which kinda is de-demidlayering. So I'm confused.
>
> Ah, yes that is correct. I've also considered to move this in
> ttm_tt_fini instead of there.
>
> But that would be a larger change and I wanted to fix the problem at
> hand first, potentially even adding a CC stable tag.
>
> > Other bit: I think it'd be good to document this properly in the
> > callbacks, and maybe ideally go about and kerneldoc-ify the entire
> > ttm_tt.h header. Otherwise when we eventually (never?) get around to
> > that, everyone has forgotten these semantic details and issues again.
>
> Already working towards including more of the TTM headers and code files
> in kerneldoc. But not quite there yet.
>
> But you know, normal human: Only equipped with one head and two hands
> and not cloneable.
I'm the same, but I'm not seeing where this problem happens at all, do
we have any backtraces or demos for this?
I split bind/unbind into the driver, but the driver should still
always be moving things to unbound states before an unpopulate is
called, is there a driver doing something unexpected here?
at worst I'd like to see a WARN_ON put in first and a test in igt that
triggers it, because right now I'm not see that path through the
drivers/ttm that leads to unpopulated pages ending up happening while
bound.
>From 5.14-rc3 unpopulate is called from ttm_bo_pipeline_gutting in
non-ghost path and there is no unbind,
pipeline gutting is called from evict/validate, when there is no
placement suggested for the object, is this case getting hit somewhere
without the driver having previously unbound things?
ttm_tt_destroy_common: calls unpopulate, everyone calls this directly
after unbinding
ttm_tt_swapout: calls unpopulate, we don't swapout objects from TT
directly, we should always unbind first, this used to have an assert
against that,
ttm_tt_populate: call unpopulate in failure path
So any place I can see unpopulate getting called with a bound TT
should be a bug, and fixed, we could protect against it better but I'm
not seeing the need for this series to outright revert things back as
helping.
Dave.
More information about the dri-devel
mailing list