[Intel-gfx] [PATCH 14/15] drm/i915: Async GPU relocation processing

Fri Mar 17 11:15:33 UTC 2017

On to, 2017-03-16 at 13:20 +0000, Chris Wilson wrote:
> If the user requires patching of their batch or auxiliary buffers, we
> currently make the alterations on the cpu. If they are active on the GPU
> at the time, we wait under the struct_mutex for them to finish executing
> before we rewrite the contents. This happens if shared relocation trees
> are used between different contexts with separate address space (and the
> buffers then have different addresses in each), the 3D state will need
> to be adjusted between execution on each context. However, we don't need
> to use the CPU to do the relocation patching, as we could queue commands
> to the GPU to perform it and use fences to serialise the operation with
> the current activity and future - so the operation on the GPU appears
> just as atomic as performing it immediately. Performing the relocation
> rewrites on the GPU is not free, in terms of pure throughput, the number
> of relocations/s is about halved - but more importantly so is the time
> under the struct_mutex.
> 
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>

<SNIP>

>  	/* Must be a variable in the struct to allow GCC to unroll. */
> +	cache->gen = INTEL_GEN(i915);
>  	cache->has_llc = HAS_LLC(i915);
> -	cache->has_fence = INTEL_GEN(i915) < 4;
> -	cache->use_64bit_reloc = HAS_64BIT_RELOC(i915);
> +	cache->has_fence = cache->gen < 4;
> +	cache->use_64bit_reloc = cache->gen >= 8;

Keep using HAS_64BIT_RELOC(i915)...

> +static u32 *reloc_gpu(struct i915_execbuffer *eb,
> +		      struct i915_vma *vma,
> +		      unsigned int len)
> +{
> +	struct reloc_cache *cache = &eb->reloc_cache;
> +	u32 *cmd;
> +
> +	if (cache->rq_size > PAGE_SIZE/sizeof(u32) - (len + 1))

There's no overflow chance here, so I'd rq_size + len + 1. 

> +		reloc_gpu_flush(cache);
> +
> +	if (!cache->rq) {
> +		struct drm_i915_gem_object *obj;
> +		struct drm_i915_gem_request *rq;
> +		struct i915_vma *batch;
> +		int err;
> +
> +		GEM_BUG_ON(vma->obj->base.write_domain & I915_GEM_DOMAIN_CPU);

Use ==.

I just close my eyes for the reminder of this function and expect v2 to
have a proper teardown :P

Also vma / obj pair naming varies from what they usually are, so I'd
consider renaming one of them to lessen confusion.

> @@ -1012,6 +1148,67 @@ relocate_entry(struct i915_vma *vma,
>  	bool wide = eb->reloc_cache.use_64bit_reloc;
>  	void *vaddr;
>  
> +	if (!eb->reloc_cache.vaddr &&
> +	    (DBG_FORCE_RELOC == FORCE_GPU_RELOC ||
> +	     !reservation_object_test_signaled_rcu(obj->resv, true))) {
> +		const unsigned int gen = eb->reloc_cache.gen;
> +		unsigned int len;
> +		u32 *batch;
> +		u64 addr;
> +
> +		if (wide)
> +			len = offset & 7 ? 8 : 5;
> +		else if (gen >= 4)
> +			len = 4;
> +		else if (gen >= 3)
> +			len = 3;
> +		else /* On gen2 MI_STORE_DWORD_IMM uses a physical address */
> +			goto repeat;
> +
> +		batch = reloc_gpu(eb, vma, len);
> +		if (IS_ERR(batch))
> +			goto repeat;
> +
> +		addr = gen8_canonical_addr(vma->node.start + offset);
> +		if (wide) {
> +			if (offset & 7) {
> +				*batch++ = MI_STORE_DWORD_IMM_GEN4;

This indent level calls for a new function. __relocate_entry_gpu

Couldn't we share some of this STORE IMM code we have around? I don't
want to crawl the specs every time :(

Regards, Joonas
-- 
Joonas Lahtinen
Open Source Technology Center
Intel Corporation