input thread [was Re: [PATCH v2] input: constify valuators passed in by input drivers.]

Adam Jackson ajax at nwnk.net
Fri Aug 13 08:09:24 PDT 2010


On Fri, 2010-08-13 at 08:51 -0300, Fernando Carrijo wrote:
> Adam Jackson <ajax at nwnk.net> wrote:
> > The design docs were shipped in the R6 source but fell away once it
> > became clear that MTX was a dead end.  I've got PDF versions of them up
> > at:
> > 
> > http://people.freedesktop.org/~ajax/mtx/
> 
> Thanks a lot, Ajax. What an invaluable favor!

Just so no one thinks this was some herculean effort, all the historical
release sources are available:

http://www.x.org/pub/

The extent of the labor I put into this was remembering which release
they were in and converting them to PDF.

> This works - surprisingly to me, if perhaps not to others - but for
> reasons beyond my comprehension, the cursor damage code shows some
> artifacts.

I'm assuming you're referring to the software cursor code here.  That
makes some sense: any rendering done from the cursor thread will race
with rendering done by the main thread.

I kind of feel like any software-rendered cursor is inherently too slow
to benefit from threading, but MPX means lots of software cursors.
Interesting question.
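
In case it helps to see the hazard concretely, a toy sketch with
hypothetical names (nothing like the real code): both threads touch the
same pixels, so every access has to be serialized or you get exactly
this kind of artifact.

    #include <pthread.h>
    #include <string.h>

    static char framebuffer[1024 * 768];
    static char under_cursor[64 * 64];
    static pthread_mutex_t fb_lock = PTHREAD_MUTEX_INITIALIZER;

    /* main thread: client rendering */
    static void render_client(size_t off, const char *src, size_t len)
    {
        pthread_mutex_lock(&fb_lock);    /* without this, torn pixels */
        memcpy(framebuffer + off, src, len);
        pthread_mutex_unlock(&fb_lock);
    }

    /* cursor thread: restore the old spot, save and draw the new one */
    static void move_cursor(size_t old_off, size_t new_off)
    {
        pthread_mutex_lock(&fb_lock);
        memcpy(framebuffer + old_off, under_cursor, sizeof under_cursor);
        memcpy(under_cursor, framebuffer + new_off, sizeof under_cursor);
        pthread_mutex_unlock(&fb_lock);
    }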

> I understand the case for GPU-level parallelism, but I know almost
> nothing about how best to approach the problem from that angle. I
> really would like to gain deeper knowledge of it, though. Maybe I
> should start by getting acquainted with GPU specs?

That's not a necessary level of understanding, I don't think.  Most of
the work in threading at that level is purely software.  For example,
the rendering dispatch code assumes callers are allowed to modify the
request data, so Xinerama has to malloc() a copy of the original and
then pass it afresh down to every screen.  So you'd either need to
rewrite the lower layers to _not_ modify their arguments in place, or
malloc enough space to pass a copy of the request to every screen at
once.  (I'm strongly in favor of the former, if it's not obvious.)
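
Roughly the shape of the problem, with hypothetical names (the actual
Xinerama dispatch is hairier than this):

    #include <stdlib.h>
    #include <string.h>

    /* hypothetical: the per-screen layer, which may modify 'req' in place */
    extern void screen_dispatch(int screen, char *req, size_t len);

    int xinerama_dispatch(const char *req, size_t len, int nscreens)
    {
        for (int i = 0; i < nscreens; i++) {
            /* can't hand the same buffer to every screen, because the
             * layer below feels free to scribble on it */
            char *copy = malloc(len);
            if (!copy)
                return -1;
            memcpy(copy, req, len);
            screen_dispatch(i, copy, len);
            free(copy);
        }
        return 0;
    }

If the lower layers took a const request instead, the loop could just
pass 'req' down nscreens times and the malloc/memcpy would disappear.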

> So may I suppose that doing this for the sole purpose of guaranteeing
> fairness among clients isn't worth it either? I ask because we all
> hear people complaining about selfish clients eating their machine
> cycles, but a reasonable solution seldom pops up. At least not one
> that I know of.

X is a server.  It has no choice but to do what clients ask of it.  If
you have one client and it wants to do CreateWindow/DestroyWindow in a
loop...  So the mere complaint of "eating all your cycles" isn't
necessarily something we can do anything about.  If, however, client A
is monopolizing the server's time at the expense of client B, then
(assuming a reasonably fair scheduler) all we can do is make the thing
A is doing faster.

Now we might have a bad scheduler.  I don't know that anyone's looked
into recording the decisions it makes and deciding whether they were
sane enough (at least, not since it was written).  So that's certainly
worth double-checking.

Moving to a more direct-rendered model doesn't really change the
problem, it just moves it to the kernel.  That may be beneficial, since
the kernel has more information about the state of the world, but it's
still a fairness problem that's basically unsolved in the open kernels.
IRIX used to be really good at this...

> And to be honest, it seems to me that the precautions taken by the
> smart scheduler to penalize hungry clients aren't enough in certain
> circumstances. Firefox's abusive use of XPutImage comes to mind,
> even though I'm completely unaware of the reasons that make this
> operation so expensive.

PutImage sucks because it's at least two copies through the unix socket,
one from the client down to the kernel and one back up from the kernel
to the server.  Once you've done that you still have to get the data
into the driver's storage, which is either a memcpy into host memory
somewhere or DMA up to the video card.  However, the way it's
implemented now, that's all synchronous; even if you DMA it up to VRAM,
you have to block the server until the DMA completes.
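
Schematically, with hypothetical names, the tail end of that path
looks like:

    #include <string.h>

    struct dest { int in_vram; void *sys_ptr; };

    /* hypothetical driver hooks */
    extern void dma_to_vram(struct dest *dst, const void *bits, size_t len);
    extern void dma_wait(void);

    /* by the time we get here the bits have already crossed the socket
     * twice (client -> kernel -> server); this is the third copy, and
     * today the server stalls until it lands */
    static void put_image_tail(struct dest *dst, const void *bits, size_t len)
    {
        if (dst->in_vram) {
            dma_to_vram(dst, bits, len);
            dma_wait();                  /* synchronous completion */
        } else {
            memcpy(dst->sys_ptr, bits, len);
        }
    }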

ShmPutImage is a bit better in that it elides the socket copies, but
that last memcpy or DMA still has to fire, and it still completes
synchronously; the server won't advance to the next request until it's
done.  And of course {Shm,}GetImage have all the same problems.
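
For reference, the client side of that tradeoff looks like this
(minimal MIT-SHM sketch; error handling, the XShmQueryExtension()
check, and shmctl() cleanup all elided):

    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <X11/Xlib.h>
    #include <X11/extensions/XShm.h>

    XImage *make_shm_image(Display *dpy, int scr, XShmSegmentInfo *si,
                           unsigned int w, unsigned int h)
    {
        XImage *img = XShmCreateImage(dpy, DefaultVisual(dpy, scr),
                                      DefaultDepth(dpy, scr), ZPixmap,
                                      NULL, si, w, h);
        si->shmid = shmget(IPC_PRIVATE, img->bytes_per_line * img->height,
                           IPC_CREAT | 0600);
        si->shmaddr = img->data = shmat(si->shmid, NULL, 0);
        si->readOnly = True;
        XShmAttach(dpy, si);
        return img;
    }

    /* fill img->data, then:
     *   XShmPutImage(dpy, win, gc, img, 0, 0, 0, 0, w, h, False);
     * the pixels never cross the socket, but the final memcpy/DMA on
     * the server side still completes synchronously. */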

Those are really the only requests that move any significant amount of
data, so it seems like there are a few obvious performance hacks you could
try.  If you assume DMA-capable hardware, changing the image hooks to
allow async completion would let the server advance to other clients
while the DMA is in flight; and actually a dedicated memcpy thread would
be a pretty good approximation of this for non-DMA hardware, at least if
you have enough CPUs to go around.
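
The memcpy-thread version might look something like this (hypothetical
sketch; the missing piece is fencing, so that later rendering which
touches the destination waits for the copy to land):

    #include <pthread.h>
    #include <string.h>

    struct copy_job { void *dst; const void *src; size_t len; int pending; };

    static struct copy_job job;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

    static void *copy_thread(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (!job.pending)
                pthread_cond_wait(&cond, &lock);
            struct copy_job j = job;
            pthread_mutex_unlock(&lock);

            memcpy(j.dst, j.src, j.len);  /* off the dispatch thread */

            pthread_mutex_lock(&lock);
            job.pending = 0;
            pthread_cond_broadcast(&cond);
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    /* dispatch side: queue the copy, go service other clients */
    static void async_put_image(void *dst, const void *src, size_t len)
    {
        pthread_mutex_lock(&lock);
        while (job.pending)               /* one-deep queue, for brevity */
            pthread_cond_wait(&cond, &lock);
        job = (struct copy_job){ dst, src, len, 1 };
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
    }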

You could also modify the PutImage dispatch path to make it possible to
readv() directly into driver storage.  This would be pretty complicated,
and would only make PutImage faster, so it might not be worth it.
However, the storage requirements on ShmPutImage are kind of onerous (you
have to know ahead of time that you're going to want to unpack that jpeg
into a shm segment, which is a little convoluted), which is why cairo
basically always ends up going through PutImage instead of ShmPutImage,
so, maybe.  But it's also only a win for non-DMA hardware, which is
already pretty lame, so, maybe not worth it at all.
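
For what it's worth, the readv() idea schematically (hypothetical
layout; assume the dispatch loop has already peeked the request length,
since X puts it in the first four bytes of the header):

    #include <sys/types.h>
    #include <sys/uio.h>

    /* scatter the rest of the header into scratch space and the pixel
     * payload straight into driver storage, skipping the usual
     * socket buffer -> request buffer -> driver copy */
    ssize_t read_put_image(int fd, void *hdr_rest, size_t hdr_len,
                           void *driver_storage, size_t payload_len)
    {
        struct iovec iov[2] = {
            { .iov_base = hdr_rest,       .iov_len = hdr_len },
            { .iov_base = driver_storage, .iov_len = payload_len },
        };
        /* a real version has to loop on short reads; the payload can
         * trickle in across many readv() calls, which is a big part of
         * why this would be pretty complicated */
        return readv(fd, iov, 2);
    }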

- ajax