[Intel-gfx] [RFC] drm/i915 : Reduce the shmem page allocation time by using blitter engines for clearing pages.

Wed May 7 21:44:31 CEST 2014

"Gupta, Sourab" <sourab.gupta at intel.com> writes:

> On Tue, 2014-05-06 at 17:56 +0000, Eric Anholt wrote:
>> sourab.gupta at intel.com writes:
>> 
>> > From: Sourab Gupta <sourab.gupta at intel.com>
>> >
>> > This patch is in continuation of and is dependent on earlier patch
>> > series to 'reduce the time for which device mutex is kept locked'.
>> > (http://lists.freedesktop.org/archives/intel-gfx/2014-May/044596.html)
>> 
>> One of userspace's assumptions is that when you allocate a new BO, you
>> can map it and start writing data into it without needing to wait on the
>> GPU.  I expect this patch to mostly hurt performance, since apps will
>> get more stalls as a result (and I note that the patch doesn't come
>> with any actual performance data).
>> 
> Hi Eric,
> Yes, it may hurt performance on apps in the case of small buffers and
> if the blitter engine is busy, since there is a synchronous wait for
> rendering in the gem_fault handler. If that is the case, we can drop
> this from the gem_fault routine and employ it only in the do_execbuffer
> routine. It's useful there because no synchronous wait is required in
> software, due to cross-ring synchronization.
> We'll gather numbers to quantify the performance benefit of using the
> blitter engines in this way for different buffer sizes.
>
>> More importantly, though, it breaks existing userspace that relies on
>> buffers being idle on allocation, for the unsynchronized maps used in
>> intel_bufferobj_subdata() and
>> intel_bufferobj_map_range(GL_INVALIDATE_BUFFER_BIT |
>> GL_UNSYNCHRONIZED_BIT)
>
> Sorry, I'm missing your point here. It may not break this assumption,
> because we employ this method only in the preallocate routine, which is
> called on the first page fault of the object (gem_fault handler) and
> results in a fresh allocation of pages.
>
> So, in the case of unsynchronized maps, there may be a wait involved in
> the first page fault. That wait may also be shorter than the time
> required for a CPU memset (resulting in no performance hit).
> There won't be any subsequent waits for that buffer object.
>
> However, we'll take a performance hit when the blitter engine is
> already busy and may not be able to immediately start the memset of
> freshly allocated mmapped buffers.
>
> Am I missing something here? Does the userspace requirement for
> unsynchronized mapped objects involve complete idleness of the object
> on the GPU even when the object page faults for the first time?

Oh, I missed how this works.  So at pagefault time, you're firing off the
blit, then immediately stalling on it?  This sounds even less like a
possible performance win than I was initially thinking.
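
The trade-off can be put as a rough cost model. This is only a sketch,
not code from the patch: the struct, its fields and the function names
are all hypothetical, and the times would have to come from real
measurements. It assumes the fault handler submits the blit and then
waits for it, so the faulting thread pays for submission, for whatever
is already queued on the blitter ring, and for the blit itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cost model, not taken from the patch. Times in microseconds. */
struct clear_cost {
    uint64_t cpu_memset_us;  /* time for the CPU to zero the pages */
    uint64_t blit_us;        /* time for the blitter to zero the same pages */
    uint64_t blit_queue_us;  /* work already queued ahead on the blitter ring */
    uint64_t submit_us;      /* fixed cost of building and submitting the blit */
};

/* Latency seen by the faulting thread when the handler waits on the blit. */
static uint64_t fault_latency_blit_us(const struct clear_cost *c)
{
    return c->submit_us + c->blit_queue_us + c->blit_us;
}

/* Latency seen by the faulting thread when the CPU clears the pages itself. */
static uint64_t fault_latency_cpu_us(const struct clear_cost *c)
{
    return c->cpu_memset_us;
}

/* The blit path wins only if its whole synchronous cost undercuts memset. */
static bool blit_wins_at_fault(const struct clear_cost *c)
{
    return fault_latency_blit_us(c) < fault_latency_cpu_us(c);
}
```

With an idle blitter and a large buffer the comparison can go either
way; with small buffers the fixed submit cost tends to dominate, and
with a busy ring the queue term does.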
