[Intel-gfx] Losing write domains and breaking flushing list

Eric Anholt eric at anholt.net
Tue Oct 28 19:07:52 CET 2008


On Fri, 2008-10-24 at 22:25 -0700, Keith Packard wrote:
> Ok, first the obvious bugs.
> 
> i915_gem_object_set_domain computes new read and write domains for each
> object. It does all of the necessary operations to transfer from/to CPU
> domains, but all intra-GPU domain transfers are handled by pending flush
> and invalidate data in the device structure. This allows a single flush
> operation to handle all of the necessary activity for a single exec.
> 
> However. If the exec is aborted, then the domain information in the
> objects will have been made invalid -- no flushes or invalidates will be
> pushed into the ring. The old domains will now be lost, so future
> operations will do the wrong thing.
> 
> i915_gem_object_set_domain is also used from the set_domain ioctl. In
> this case, the domains will be mashed and a flush operation pushed into
> the ring, but the object will not be placed on the active list when that
> flush is posted. This means that no-one knows to wait for the operation
> to complete before accessing the object.
> 
>         Now for the more subtle bug.
> 
> When you call i915_gem_object_set_domain with write_domain == 0, then
> sometimes (depending on the previous domain set), it will set
> obj->write_domain to zero. If this object happened to be sitting on the
> flushing list, it will now never get pulled off of that list -- the only
> way off that list is to have some flush command which affects your
> write_domain, and with a write_domain of zero, that's not going to
> happen.
> 
> This will cause a BUG_ON when you try to close the driver as the
> flushing list will not empty, even if all of the pending operations are
> complete and every domain flushed.
> 
>         A proposed solution (this patch is untested, although it does
>         compile).
> 
> I've split out the i915_gem_object_set_domain function into two pieces:
> i915_gem_object_set_pending_domain and
> i915_gem_object_set_current_domain. set_pending_domain does precisely
> what set_domain used to do, but instead of modifying obj->write_domain
> and obj->read_domains, it leaves the new values for those fields in
> obj->pending_write_domain and obj->pending_read_domains. Yeah, this
> conflates those fields between two different operations, but it's fine
> in practice as the two users are obviously temporally separate.
> 
> I've also regularized the sequence of operations that one does to do
> domain migration:
> 
>      1. i915_gem_dev_prepare_set_domain. This sets up the device to
>         queue flushing and invalidation domains
>      2. i915_gem_object_set_pending_domain. This marks an object for
>         domain transition and pushes pending flush and invalidate bits
>         into the device.
>      3. i915_gem_dev_set_domain. This dumps a flush command into the
>         ring to flush and invalidate any pending device bits. This
>         returns the domains flushed.
>      4. <any other ring commands inserted here>. After this point, there
>         are no more error returns possible as the operation is
>         functionally complete.
>      5. i915_gem_object_set_current_domain. Do this for each object
>         which was given to i915_gem_object_set_pending_domain above.
>      6. If any domains are being flushed, or a batch buffer is queued to
>         the ring:
>              1. i915_add_request. If any domains are flushed, or any
>                 batch buffer is added to the ring, this queues a marker
>                 to let users wait for that operation to complete. Mark
>                 any buffers involved with the new sequence number.
>              2. i915_gem_object_move_to_active. Each obj which has
>                 flushing or batchbuffer involvement should end up on the
>                 active list.
> 
> Ok, so the actual modification of obj->read_domains and
> obj->write_domain is now delayed until after the flush commands (and
> possibly the batch buffer) have been put into the ring, at which point
> the operation is complete and we're just cleaning up and marking what
> happened. That happens without sleeping and without any error returns.
> 
> Any obj with a write_domain must either end up on the flushing list (if
> no flush or batch buffer stuff was dumped into the ring), or it must end
> up on the active list.
> 
> Similarly, any obj without a write_domain must not end up on the
> flushing list. obj with some involvement in the ring must end on the
> active list.
> 
> Those invariants seem easy to check for at least.
> 
> Here's a patch which purports to do all of this. Review appreciated.


> -	(void)i915_add_request(dev, flush_domains);
> -

NAK for this reversion of a bugfix.  Suppose you have a batchbuffer that
sees the following state:

obj->domains starts at (RENDER, RENDER)
reloc1 = (0, TEXTURE)		flush_domains = RENDER
				obj->domains = (0, TEXTURE|RENDER)
(batchbuffer has an MI_FLUSH here)
reloc2 = (RENDER, RENDER)	obj->domains = (RENDER, RENDER)

obj->domains should be (RENDER, RENDER) at the end of this batchbuffer's
execution.  However, since you removed this add_request, the
flush_domains = RENDER that now gets add_requested after batchbuffer
execution will cause the state at the end to be (0, RENDER).
 
>  	/* Exec the batchbuffer */
>  	ret = i915_dispatch_gem_execbuffer(dev, args, exec_offset);
>  	if (ret) {
> @@ -1906,7 +1943,7 @@ i915_gem_execbuffer(struct drm_device *dev, void *data,
>  	 * Ensure that the commands in the batch buffer are
>  	 * finished before the interrupt fires
>  	 */
> -	flush_domains = i915_retire_commands(dev);
> +	flush_domains |= i915_retire_commands(dev);
>  
>  	i915_verify_inactive(dev, __FILE__, __LINE__);
>  
> @@ -1924,6 +1961,7 @@ i915_gem_execbuffer(struct drm_device *dev, void *data,
>  		struct drm_gem_object *obj = object_list[i];
>  		struct drm_i915_gem_object *obj_priv = obj->driver_private;
>  
> +		i915_gem_object_set_current_domain(obj);
>  		i915_gem_object_move_to_active(obj);
>  		obj_priv->last_rendering_seqno = seqno;
>  #if WATCH_LRU
> @@ -2171,16 +2209,28 @@ i915_gem_set_domain(struct drm_gem_object *obj,
>  	struct drm_device *dev = obj->dev;
>  	int ret;
>  	uint32_t flush_domains;
> +	uint32_t seqno;
>  
>  	BUG_ON(!mutex_is_locked(&dev->struct_mutex));
>  
> -	ret = i915_gem_object_set_domain(obj, read_domains, write_domain);
> +	i915_gem_dev_prepare_set_domain(dev);
> +
> +	ret = i915_gem_object_set_pending_domain(obj, read_domains,
> +						 write_domain);
>  	if (ret)
>  		return ret;
> +
>  	flush_domains = i915_gem_dev_set_domain(obj->dev);
>  
> -	if (flush_domains & ~(I915_GEM_DOMAIN_CPU|I915_GEM_DOMAIN_GTT))
> -		(void) i915_add_request(dev, flush_domains);
> +	i915_gem_object_set_current_domain(obj);
> +
> +	if (flush_domains & ~(I915_GEM_DOMAIN_CPU|I915_GEM_DOMAIN_GTT)) {
> +		struct drm_i915_gem_object *obj_priv = obj->driver_private;
> +		
> +		seqno = i915_add_request(dev, flush_domains);
> +		i915_gem_object_move_to_active(obj);
> +		obj_priv->last_rendering_seqno = seqno;
> +	}

For flush_domains & ~(I915_GEM_DOMAIN_CPU | I915_GEM_DOMAIN_GTT) we
already did bo_wait_rendering inside i915_gem_object_set_pending_domain,
right?  It seems like this new block at the end either adds an
irrelevant request to the command stream, or it returns too early since
the job of getting the domains flushed and ready for the CPU isn't done
yet.

-- 
Eric Anholt
eric at anholt.net                         eric.anholt at intel.com


-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 197 bytes
Desc: This is a digitally signed message part
URL: <http://lists.freedesktop.org/archives/intel-gfx/attachments/20081028/ee98e769/attachment.sig>


More information about the Intel-gfx mailing list