[PATCH] drm/ttm: fix two regressions since move_notify changes

Ben Skeggs skeggsb at gmail.com
Wed Jan 25 01:41:05 PST 2012


On Wed, 2012-01-25 at 09:39 +0100, Thomas Hellstrom wrote:
> On 01/25/2012 09:05 AM, Ben Skeggs wrote:
> > On Wed, 2012-01-25 at 08:43 +0100, Thomas Hellstrom wrote:
> >> On 01/25/2012 06:34 AM, Ben Skeggs wrote:
> >>> From: Ben Skeggs<bskeggs at redhat.com>
> >>>
> >>> Both changes in dc97b3409a790d2a21aac6e5cdb99558b5944119 cause serious
> >>> regressions in the nouveau driver.
> >>>
> >>> move_notify() was originally able to presume that bo->mem is the old node,
> >>> and new_mem is the new node.  The above commit moves the call to
> >>> move_notify() to after move() has been done, which means that now, sometimes,
> >>> new_mem isn't the new node at all, bo->mem is, and new_mem points at a
> >>> stale, possibly-just-been-killed-by-move node.
> >>>
> >>> This is clearly not a good situation.  This patch reverts this change, and
> >>> replaces it with a cleanup in the move() failure path instead.
> >> While I have nothing against having move_notify() happening before the
> >> move, but
> >> I still very much dislike  the failure path thing, since it puts a
> >> requirement on TTM to support
> >> driver moves through move_notify(), which is done by Radeon.
> >> I've kindly asked Jerome to change that, stating a number of reasons,
> >> but he refuses, and I'm
> >> not prepared to let support for that mode of operation sneak in like this.
> > What requirement is there?  All it's asking the driver is to do a
> > move_notify with old/new nodes swapped, which would effectively undo the
> > previous move_notify().
> >
> > move_notify() *cannot* happen after the move for the reasons I mentioned
> > already, so the choice is apparently either to completely ignore
> > cleaning up if the subsequent move() fails (previous behaviour prior to
> > the patch which caused these regressions), or to make the change I
> > did...
> 
> I agree. What I'm proposing is that we should go through with the first 
> part of
> your patch and ignore the cleanup.
> 
> >
> >>
> >> The second issue is that the call to move_notify() from cleanup_memtype_use()
> >> causes the TTM ghost objects to get passed into the driver.  This is clearly
> >> bad as the driver knows nothing about these "fake" TTM BOs, and ends up
> >> accessing uninitialised memory.
> >>
> >> I worked around this in nouveau's move_notify() hook by ensuring the BO
> >> destructor was nouveau's.  I don't particularly like this solution, and
> >> would rather TTM never pass the driver these objects.  However, I don't
> >> clearly understand the reason why we're calling move_notify() here anyway
> >> and am happy to work around the problem in nouveau instead of breaking the
> >> behaviour expected by other drivers.
> >>
> >>
> >> move_notify() here gives the driver a chance to unbind any GPU maps
> >> remaining on the BO
> >> before it changes placement or is destroyed.
> >> We've previously required the driver to support any type of BO in the
> >> driver hooks, I agree
> >> with you that it would be desirable to make that go away.
> > Okay, that's fine I guess.  As the commit says, I'm happy to make
> > nouveau just ignore operations on BOs it doesn't "own".
> >
> > I know *you* think of move_notify() as only being around to deal with
> > unmapping, but as I mentioned previously, it'd have never taken the new
> > node as a parameter if this is were the case.
> That's not exactly true. After you pointed that out, I went back and 
> check the
> old vmwgfx use, and the new node was used to determine whether
> an unbind was actually necessary, or whether it could leave the gpu map
> untouched.
Ok, but even so, prior to these changes there was no good reason nor
mention in any of the headers/code that drivers weren't supposed to
assume "new_mem" was anything useful..

> 
> >   I have zero issue with
> > nouveau using it to set up the GPU mappings currently.  I know you've
> > suggested alternatives previously, which may be possible down the track
> > if it's *really* necessary, but it seems like a bad idea to pursue this
> > for 3.3.
> 
> My main concern is that we blindly and unnecessarily set up GPU bindings and
> end up with unnecessary code in TTM, and furthermore that we communicate
> that bad practice to future driver writers.
This "unnecessary code" is like 5 lines of cleanup if something fails,
hardly anything to be jumping up and down about :)

> 
> Thomas, what do you suggest to move forward with this? Both of these 
> bugs are serious regressions that make nouveau unusable with the current 
> 3.3-rc series. Ben.
> 
> My number one choice would of course be to have the drivers set up their 
> private GPU mappings after a
> successful validate call, as I originally suggested and you agreed to.
> 
> If that's not possible (I realize it's late in the release series), I'll 
> ack this patch if you and Jerome agree not to block
> attempts to move in that direction for future kernel releases.
I can't say I'm entirely happy with the plan honestly.  To me, it still
seems more efficient to handle this when a move happens (comparatively
rare) and "map new backing storage into every vm that has a reference"
than to (on every single buffer of every single "exec" call) go "is this
buffer mapped into this channel's vm? yes, ok; no, lets go map it".

I'm not even sure how exactly I plan on storing this mapping efficiently
yet.. Scanning the BO's linked list of VMs it's mapped into for "if
(this_vma == chan->vma)" doesn't exactly sound performant.

Thanks,
Ben.

> 
> Thanks,
> Thomas
> 
> 
> 
> >> Signed-off-by: Ben Skeggs<bskeggs at redhat.com>
> >> Cc: Jerome Glisse<j.glisse at gmail.com>
> >> ---
> >>    drivers/gpu/drm/nouveau/nouveau_bo.c |    4 ++++
> >>    drivers/gpu/drm/ttm/ttm_bo.c         |   17 +++++++++++++----
> >>    2 files changed, 17 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
> >> index 724b41a..ec54364 100644
> >> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> >> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> >> @@ -812,6 +812,10 @@ nouveau_bo_move_ntfy(struct ttm_buffer_object *bo, struct ttm_mem_reg *new_mem)
> >>    	struct nouveau_bo *nvbo = nouveau_bo(bo);
> >>    	struct nouveau_vma *vma;
> >>
> >> +	/* ttm can now (stupidly) pass the driver bos it didn't create... */
> >> +	if (bo->destroy != nouveau_bo_del_ttm)
> >> +		return;
> >> +
> >>    	list_for_each_entry(vma,&nvbo->vma_list, head) {
> >>    		if (new_mem&&   new_mem->mem_type == TTM_PL_VRAM) {
> >>    			nouveau_vm_map(vma, new_mem->mm_node);
> >> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> >> index 2f0eab6..7c3a57d 100644
> >> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> >> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> >> @@ -404,6 +404,9 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
> >>    		}
> >>    	}
> >>
> >> +	if (bdev->driver->move_notify)
> >> +		bdev->driver->move_notify(bo, mem);
> >> +
> >>    	if (!(old_man->flags&   TTM_MEMTYPE_FLAG_FIXED)&&
> >>    	!(new_man->flags&   TTM_MEMTYPE_FLAG_FIXED))
> >>    		ret = ttm_bo_move_ttm(bo, evict, no_wait_reserve, no_wait_gpu, mem);
> >> @@ -413,11 +416,17 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
> >>    	else
> >>    		ret = ttm_bo_move_memcpy(bo, evict, no_wait_reserve, no_wait_gpu, mem);
> >>
> >> -	if (ret)
> >> -		goto out_err;
> >> +	if (ret) {
> >> +		if (bdev->driver->move_notify) {
> >> +			struct ttm_mem_reg tmp_mem = *mem;
> >> +			*mem = bo->mem;
> >> +			bo->mem = tmp_mem;
> >> +			bdev->driver->move_notify(bo, mem);
> >> +			bo->mem = *mem;
> >> +		}
> >>
> >> -	if (bdev->driver->move_notify)
> >> -		bdev->driver->move_notify(bo, mem);
> >> +		goto out_err;
> >> +	}
> >>
> >>    moved:
> >>    	if (bo->evicted) {
> >>
> >>
> >>
> >>
> >
> 
> 
> 




More information about the dri-devel mailing list