EXA

Wed Aug 8 05:36:17 PDT 2007

On Wed, Aug 08, 2007 at 11:19:19AM +0200, Michel Dänzer wrote:
> > > P.S. I still doubt this is the bottleneck of your virtual desktop
> > > switching, as the numbers you're getting translate to filling the screen
> > > in just tens of milliseconds.
> > 
> > So, I did profiling of virtual desktop switching.
> 
> Could you share the profiles?

I prefer sysprof because it shows the call tree. The profile is available here:
http://www.fi.muni.cz/~xhejtman/desktopswitch-exa.bz2
You need to decompress the file (bunzip2); it can then be opened with sysprof
(available as a Debian package).

> > 1) exaCopyDirtyToSys calls exaMemcpyBox which uses plain memcpy instead of
> >    pixman_blt_mmx. Which is result of initial call to GetImage.
> 
> As Søren pointed out, pixman_blt_mmx would be unlikely to make a
> difference here. The bottleneck is probably reading from uncacheable
> memory.
> 
> Generally, if memcpy doesn't do its job as quickly as possible, that
> should be fixed.

People around mplayer believe that a memcpy using only plain x86 instructions
is much slower than one using MMX (or SSE), especially on the latest Intel CPUs.

On the other hand, it depends on what your patch did regarding exaShmPutImage:
whether it only changed memcpy to pixman_blt_mmx, or whether the fallback
version of miShmPutImage really uploaded the image twice. From your previous
replies, I'm not sure which one was the issue.

When reading from uncacheable memory, we should take care to read it in the
largest possible chunks to avoid re-reading, which is basically what memcpy_mmx
does.
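
To illustrate what I mean by reading in large chunks, here is a minimal,
portable sketch. The function name copy_wide_reads is made up for this example
and is not part of EXA, pixman or mplayer. It uses 8-byte loads, the same width
as an MMX register; a real MMX/SSE version would use wider registers and
possibly non-temporal stores. The point is only that every source location is
read exactly once, with as few read transactions as possible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not an EXA or pixman function): copy n bytes so that
 * each source byte is read exactly once, 8 bytes at a time where possible. */
static void copy_wide_reads(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);

    /* Bulk: one 64-bit load and one 64-bit store per iteration. The
     * fixed-size memcpy calls are the portable way to express an unaligned
     * load/store; compilers normally reduce each to a single instruction. */
    while (n >= sizeof(uint64_t)) {
        uint64_t chunk;
        memcpy(&chunk, s, sizeof chunk);
        memcpy(d, &chunk, sizeof chunk);
        s += sizeof chunk;
        d += sizeof chunk;
        n -= sizeof chunk;
    }

    /* Tail: fewer than 8 bytes remain, copy them one by one. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
}
```

Whether this actually beats the libc memcpy on uncacheable memory would have to
be measured; I have not done that here.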

> >    Btw, is it possible to expose offscreen pixmap to the application so that
> >    PutImage and GetImage can be safely ignored?
> 
> Not sure that's what you mean, but you can try commenting out the
> exaDoMigration call in exaGetImage to see if that makes any difference.

Forget that question. What I meant is that, to me, the migration looks like
wasted time in the case of shared video memory, because all the memory is
system memory and easily reachable by the CPU.

But the point here is that EXA seems to spend a huge amount of time in Put and
Get images. XAA does only PutImage and it seems to be faster. I understand
that with EXA this is the cost of acceleration: we need to keep the pixmap in
the offscreen area to accelerate it, and then migrate it to system memory to
use the CPU for some raster operations. Am I right? With XAA, on the other
hand, most operations are done by the CPU and only the result is transferred
to the offscreen area.

> >    I think that initial approach to optimization could be to call only
> >    GetTimeInMillis each 1000th iteration or something like that.
> 
> Then the cycles would just be burned in I830WaitLpRing instead of
> GetTimeInMillis, wouldn't they?

Yes, that's true. A nanosleep would be more useful, but it may cause a much
longer delay than desired. The only real solution is to use the IRQ.
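
For reference, the "every 1000th iteration" idea would look roughly like the
sketch below. All names here (wait_for_space, the callback types, the
fake_ring stand-in) are invented for the example and are not the driver's
actual functions; the callbacks only play the role of the ring-space read and
of GetTimeInMillis. As you say, this only removes the clock calls from the
loop; the CPU still spins until space becomes available.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical callbacks standing in for the ring-space read and the clock. */
typedef int (*space_fn)(void *ctx);
typedef uint32_t (*clock_fn)(void *ctx);

#define CLOCK_CHECK_INTERVAL 1000u

/* Busy-wait until at least `needed` bytes are free. The clock is consulted
 * only on every CLOCK_CHECK_INTERVAL-th failed poll.
 * Returns 0 on success, -1 on timeout. */
static int wait_for_space(space_fn space, clock_fn now_ms, void *ctx,
                          int needed, uint32_t timeout_ms)
{
    uint32_t start = 0;
    unsigned iter = 0;
    int started = 0;

    assert(space && now_ms);

    for (;;) {
        if (space(ctx) >= needed)
            return 0;

        if (++iter % CLOCK_CHECK_INTERVAL == 0) {
            uint32_t t = now_ms(ctx);
            if (!started) {
                start = t;          /* first clock read starts the timer */
                started = 1;
            } else if (t - start > timeout_ms) {
                return -1;          /* the ring appears to be stuck */
            }
        }
    }
}

/* Simulated ring for trying the loop without hardware. */
struct fake_ring {
    int free_bytes;        /* space currently reported as free */
    int grow_per_poll;     /* bytes "freed" by each poll */
    uint32_t ms;           /* fake clock, advances 1 ms per read */
    unsigned clock_calls;  /* how many times the clock was read */
};

static int fake_space(void *ctx)
{
    struct fake_ring *r = ctx;
    r->free_bytes += r->grow_per_poll;
    return r->free_bytes;
}

static uint32_t fake_clock(void *ctx)
{
    struct fake_ring *r = ctx;
    r->clock_calls++;
    return r->ms++;
}
```

With the simulated ring freeing one byte per poll, waiting for 2500 bytes takes
2500 polls but reads the clock only twice.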

> As has been pointed out, it could be done using the IRQ. This will
> probably happen via fences when using TTM, in the meantime you could try
> using DRM_IOCTL_I915_IRQ_EMIT and DRM_IOCTL_I915_IRQ_WAIT.

With this approach, the IRQ would be generated after the accelerator has
finished a request?

On the other hand, I still think that such an approach is not optimal, as the
accelerator should be kept as busy as possible and the ring buffer should be
kept pipelined all the time. But I understand that this is not possible right
now.

-- 
Lukáš Hejtmánek