[Intel-gfx] [RFC] drm/i915 : Reduce the shmem page allocation time by using blitter engines for clearing pages.

Wed May 7 12:03:37 CEST 2014

On Tue, 2014-05-06 at 17:56 +0000, Eric Anholt wrote:
> sourab.gupta at intel.com writes:
> 
> > From: Sourab Gupta <sourab.gupta at intel.com>
> >
> > This patch is in continuation of and is dependent on earlier patch
> > series to 'reduce the time for which device mutex is kept locked'.
> > (http://lists.freedesktop.org/archives/intel-gfx/2014-May/044596.html)
> 
> One of userspace's assumptions is that when you allocate a new BO, you
> can map it and start writing data into it without needing to wait on the
> GPU.  I expect this patch to mostly hurt performance on apps (and I note
> that the patch doesn't come with any actual performance data) that get
> more stalls as a result.
> 
Hi Eric,
Yes, it may hurt performance on apps, in the case of small buffers and
if the blitter engine is busy, as there is a synchronous wait for
rendering in the gem_fault handler. If that is the case, we can drop this
from the gem_fault routine and employ it only in the do_execbuffer
routine. It's useful there because no synchronous wait is required in
software, due to cross-ring synchronization.
We'll gather the numbers to quantify the performance benefit we have
while using blitter engines in this way for different buffer sizes.

> More importantly, though, it breaks existing userspace that relies on
> buffers being idle on allocation, for the unsynchronized maps used in
> intel_bufferobj_subdata() and
> intel_bufferobj_map_range(GL_INVALIDATE_BUFFER_BIT |
> GL_UNSYNCHRONIZED_BIT)

Sorry, I may be missing your point here. This may not break that
assumption, because we employ this method only in the preallocate
routine, which is called on the first page fault of the object
(gem_fault handler), resulting in a fresh allocation of pages.

So, in the case of unsynchronized maps, there may be a wait involved in
the first page fault. That wait may also be shorter than the time
required for a CPU memset (resulting in no performance hit).
There won't be any subsequent waits for that buffer object.

However, we'll have a performance hit when the blitter engine is
already busy and cannot immediately start the memset of freshly
allocated mmapped buffers.
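
To make the trade-off above concrete, here is a rough back-of-the-envelope
model, not a measurement. It only compares an estimated CPU memset time
against an estimated blitter clear time that includes submission overhead
and any delay from the engine already being busy. All function names and
all throughput/overhead figures are hypothetical placeholders; real
numbers still have to be gathered on hardware.

```python
# Hypothetical cost model: every name and number here is an assumption
# made for illustration, not data from the patch or from real hardware.

def cpu_clear_time_us(size_bytes, cpu_bytes_per_us):
    """Estimated time for the CPU to memset a buffer of size_bytes."""
    return size_bytes / cpu_bytes_per_us


def blt_clear_time_us(size_bytes, blt_bytes_per_us,
                      submit_overhead_us, queue_delay_us=0.0):
    """Estimated time until a blitter clear completes.

    queue_delay_us models the blitter being busy with earlier work;
    submit_overhead_us models the fixed cost of submitting and waiting.
    """
    return queue_delay_us + submit_overhead_us + size_bytes / blt_bytes_per_us


def blt_is_faster(size_bytes, cpu_bytes_per_us, blt_bytes_per_us,
                  submit_overhead_us, queue_delay_us=0.0):
    """True if the modelled blitter clear beats the modelled CPU memset."""
    blt = blt_clear_time_us(size_bytes, blt_bytes_per_us,
                            submit_overhead_us, queue_delay_us)
    return blt < cpu_clear_time_us(size_bytes, cpu_bytes_per_us)


# Placeholder figures, chosen only to show the shape of the trade-off.
PLACEHOLDER = dict(cpu_bytes_per_us=4000.0,   # assumed CPU clear rate
                   blt_bytes_per_us=8000.0,   # assumed blitter clear rate
                   submit_overhead_us=50.0)   # assumed fixed submit cost

if __name__ == "__main__":
    for size in (4096, 1 << 20, 16 << 20):
        idle = blt_is_faster(size, **PLACEHOLDER)
        busy = blt_is_faster(size, queue_delay_us=5000.0, **PLACEHOLDER)
        print(size, "blt wins when idle:", idle, "when busy:", busy)
```

Under these placeholder figures the model reproduces the two cases
discussed above: a small buffer loses to the fixed overhead, and a large
buffer loses once the assumed queue delay exceeds the CPU memset time.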

Am I missing something here? Does the userspace requirement for
unsynchronized mapped objects involve complete idleness of the object on
the GPU even when the object page faults for the first time?

Regards,
Sourab