[Libdlo] [PATCH] udlfb: high-throughput urb pool

Bernie Thompson bernie at plugable.com
Mon Dec 20 10:18:37 PST 2010


Hi Andrew,

On Mon, Dec 20, 2010 at 8:22 AM, Andrew Kephart <akephart at akephart.org> wrote:
> A smoothed contraction mechanism wouldn't be too bad, either, if we find
> that the expansion is too hungry for general use.
>
> What do you think?

Thanks again for continuing to poke at ways to improve the defio path
performance.

A few somewhat random thoughts from over the weekend:

* The end goal (of this line of concern) is to keep the USB pipe full.
* The datapoints on the effect of using delayed_work to release the
semaphore were dramatic.  The schedule operation seems to be killing
us (it would be interesting to look at the damage path and
double-check that there's no similar delay with the semaphore release
alone -- see the sketch just after this list).
* I'm fairly confident we could keep the USB pipe nearly full with 4
(or even 2) 64K buffers, if upstream were delivering data more
reliably (as I believe it is on the damage path today, which most
everyone is using).

So dynamically growing the pool is fine, but it has downsides.  I
worry about the effect on performance in a setup with 10 or so USB
terminals, given the additional memory consumption.  A goal for udlfb
is to minimize impact on the rest of the system -- mem, MMU, CPU --
and anything complex tends to run afoul of that.  Pre-mapping so many
buffers for DMA might be problematic.  And with on-demand growth, I
don't think we can avoid needing dynamic shrink, as any full-screen
change will tend to grow the pool to store (a little less than) a
full frame at the current resolution (approaching 5MB for 2048x1152
16bpp).  If we do make any changes, how do we make them for defio
without also causing downsides for the damage path?

Some thoughts on other ways we might attack the problem:

1) Figure out whether there's a way of meeting defio's context needs
without a long scheduling delay when a buffer is freed up -- there
must be a wicked long (and unpredictable) delay here, based on the
test results you're seeing.  A different mechanism?  A faster way of
transferring control?  It seems like a small change to a better
method should be possible (but I don't know what it is yet).

2) On the defio path, waiting doesn't block the application from
continuing to render more; we just fall farther behind (the damage
path does not have this problem, as long as the application keeps the
damage notification in the context of its main rendering path).  So
the best strategy for defio would be completely different from
damage -- no waiting.  I'm imagining something like:

If we could read the contents of framebuffer pages without triggering
any defio faults, then we have a better alternative: when we get a
defio callback informing us that a set of pages has been dirtied by
writes, we add those pages to a list (or page bitmap) we track.  Then
we render as many dirty pixels as available URB buffers allow,
removing dirty pages from our list as we complete them.  When we have
no more URB buffers free, rather than wait in that context for more
to come free, we just return from the defio callback.  Defio thinks
we're done, but we've just queued up the work we couldn't complete
immediately.  Then, in the URB completion context, we work on the
next dirty page in our list until we've filled another buffer, and so
on, until we've finally processed all the dirties.

I haven't thought through the locking issues yet, as we'd have at
least 3 locations and 2 contexts needing to queue/dequeue dirty pages
from this new list/bitmap.  We'd need to keep it to a non-scheduled
(spin or lock-free) locking mechanism.
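
Here's a rough sketch of how that could be structured, with a
spin-locked page bitmap shared between the defio callback and URB
completion.  All of the helper names (dlfb_try_get_urb,
dlfb_render_page, etc.) are hypothetical stand-ins for whatever the
real driver would provide:

#include <linux/bitmap.h>
#include <linux/bitops.h>
#include <linux/fb.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/spinlock.h>

#define MAX_FB_PAGES 4096	/* ~16MB of framebuffer at 4K pages */

/* One bit per framebuffer page; hypothetical names throughout. */
static DECLARE_BITMAP(dirty_pages, MAX_FB_PAGES);
static DEFINE_SPINLOCK(dirty_lock);

/* Hypothetical helpers assumed to exist elsewhere in the driver: */
bool dlfb_try_get_urb(struct fb_info *info);	/* non-blocking grab */
void dlfb_put_urb(struct fb_info *info);	/* return unused URB */
void dlfb_render_page(struct fb_info *info, int idx);
static void dlfb_render_dirty_pages(struct fb_info *info);

/* Defio callback: mark pages dirty, render what we can, never wait. */
static void dlfb_dpy_deferred_io(struct fb_info *info,
				 struct list_head *pagelist)
{
	struct page *page;
	unsigned long flags;

	spin_lock_irqsave(&dirty_lock, flags);
	list_for_each_entry(page, pagelist, lru)
		__set_bit(page->index, dirty_pages);
	spin_unlock_irqrestore(&dirty_lock, flags);

	dlfb_render_dirty_pages(info);
}

/*
 * Drain loop, called from both the defio callback and URB completion.
 * Renders dirty pages until the free-URB pool runs out, then returns;
 * the next URB completion re-enters it and picks up where we left off.
 */
static void dlfb_render_dirty_pages(struct fb_info *info)
{
	unsigned long flags;
	int idx;

	for (;;) {
		if (!dlfb_try_get_urb(info))
			return;		/* completion context resumes us */

		spin_lock_irqsave(&dirty_lock, flags);
		idx = find_first_bit(dirty_pages, MAX_FB_PAGES);
		if (idx >= MAX_FB_PAGES) {
			spin_unlock_irqrestore(&dirty_lock, flags);
			dlfb_put_urb(info);	/* nothing left to do */
			return;
		}
		__clear_bit(idx, dirty_pages);
		spin_unlock_irqrestore(&dirty_lock, flags);

		dlfb_render_page(info, idx);	/* fills and submits URB */
	}
}

The key property is that neither context ever sleeps on the URB pool;
the semaphore wait in the current code is replaced by the try/return
pattern.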

And there could be some tearing, etc. that we don't see currently on
defio, as we read from pages that may be getting writes at the same
time (I'm not sure defio guarantees that today anyway).  But I'm
thinking that tearing will never persist - we should always get
another dirty page notification in the future (since we told defio the
previous writes to that page were already handled).

Stepping back -- all this complexity is partly why defio is off by
default in udlfb right now.  The damage path is so much simpler
because there is simple, direct communication between client and
framebuffer about which parts of the framebuffer have been dirtied.

The other opportunity to spend our effort improving udlfb is to get a
simple damage notification API added to fbdev, and add support for
that interface to the various fbdev clients (of which xf86-video-fbdev
is 90% of the importance, and directfb+mplayer+fbi+etc. is probably
the other 10%).  This simple damage notification API has come up
several times over the past few years on the lists.  Several types of
devices need it (anyone who's using defio, for example), and there
seemed to be mostly support, not opposition, for the idea -- but
getting an API/patch accepted is always difficult when proposed from
the periphery, and it hasn't happened yet.

On that thought, would you have cycles to help push for a standard
fbdev damage notification ioctl, possibly as simple as:

struct fb_rect {
	uint32_t x;
	uint32_t y;
	uint32_t width;
	uint32_t height;
};
#define FBIOPUT_DAMAGE _IOW('F', 0x1A, struct fb_rect)

(I'm not sure if 'F' 0x1A is still available -- it may need a bump;
we'd also want to be explicit that the ioctl can take a long time to
complete, but that intentionally keeps apps from flooding the
downstream rendering, etc.)
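
To make that concrete, a hypothetical userspace client (assuming the
interface above lands as proposed) would do something like:

#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Local copies until the header exists in the kernel uapi. */
struct fb_rect {
	uint32_t x;
	uint32_t y;
	uint32_t width;
	uint32_t height;
};
#define FBIOPUT_DAMAGE _IOW('F', 0x1A, struct fb_rect)

int main(void)
{
	int fd = open("/dev/fb0", O_RDWR);
	if (fd < 0)
		return 1;

	/* ... render into the mmap'd framebuffer, then report damage: */
	struct fb_rect rect = { .x = 0, .y = 0, .width = 640, .height = 480 };
	ioctl(fd, FBIOPUT_DAMAGE, &rect);	/* may block; that's the throttle */

	close(fd);
	return 0;
}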

Once the interface is settled, add it to xf86-video-fbdev (prototyped
here: http://git.plugable.com/gitphp/index.php?p=xf86-video-fbdev&a=commitdiff&h=dbfe3f839ee00135a7cccf13db5926fb1019abde
) and the other major fbdev clients like mplayer.

As part of moving udlfb from staging into the mainline kernel,
hopefully in 2.6.38, I've recently sent Plugable devices to Paul
Mundt (the fbdev maintainer), and I'll be working on the changes he
has asked for.  Perhaps that will also help build enough awareness
and attention to get other udlfb issues -- like this standard damage
interface -- done.  This work is where I'll have to focus the time I
have; it'll have the most end-user benefit.

What are your thoughts on the most important priority: defio vs.
damage?  And what are the best approaches for progress on each -- can
you see a way to help?

Best wishes,
Bernie
