[Libdlo] [PATCH] udlfb: high-throughput urb pool

akephart at akephart.org
Sat Dec 18 13:40:55 PST 2010


Bernie, 
 
   Thanks for the quick response...
 
   More inline below, but I should say up front that any claims below
should be interpreted in terms of a fairly narrow testing scope (1
machine with a couple of different kernels and 1 application driving it
all), so take everything with the requisite grain of salt.
 
   Onwards.....
 
On December 18, 2010 at 12:35 PM Bernie Thompson
<bernie.thompson at gmail.com> wrote:

> Hi Andrew,
>
> >> > On Fri, Dec 17, 2010 at 02:17:34PM -0600, Andrew Kephart wrote:
> >> > > udlfb: add parameters for URB pool size and for nonblocking URB
> >> > > pool
>
> Thanks for submitting this patch!  And I'm very glad the defio path is
> getting this attention.
>
> [some deleted]
> I'm emailing all this background to put the value of the defio path in
> perspective (since it's currently an option that's default off).  It
> would be great to have this new attention lead to it being turned
> on, without the need for an option, if possible.
 
[AK] That's my hope -- in spite of the fact that the previous patch has
parameters and debug bits out the proverbial wazoo, I'd like to end up
with something that has few, if any, options -- and just works.
 
>
> So the first point that's important is: these high-level issues of
> "can we enable defio by default?" are more important than any
> relatively minor differences in performance, like the extra scheduling
> latency of releasing buffers via workitem in the defio path.  I'm very
 
[AK] Definitely, although the scheduling latency (on my test system)
turned out not to be a minor difference (more on this below).
 
> excited to have you looking at defio, in hopes this will lead to
> clarity on the high-level issues, since I think we're close.
>
> Then on to this particular patch, the nonblocking URB pool is
> fundamentally problematic, unfortunately.  There's just so much data
> that gets generated here with graphics, that we can flood any pool
> scheme of any size, with some pretty standard user scenarios (full
> screen video playback at high resolutions).  So we must deal with the
 
[AK] For reference, my test set up is focused on this model --
full-screen video (my tests were all DVD-quality; 720x400 or 720x480).
 I'm actually avoiding the extended desktop model for now, and just
dealing with the DL device as a video monitor.
 
> pool running out.  In the choice of dropping pixels (without telling
> the client) vs. blocking the client until the next buffer comes free,
> there are some big simplicity advantages of blocking.  We can't drop
> pixels without scheduling a full-screen repaint for some point in the
> future - and it just gets tricky to work that out.  Right now in
> udlfb, dropped pixels are rare enough (basically only errors), that we
> haven't had to walk into any of that complexity.  But I'd be happy to
> have more back-and-forth on these options.
>
 
[AK] So here's where my test results diverge from the expected -- with
the blocking model, we lost pixels constantly, regardless of pool size.
 That is, we'd time out without acquiring the semaphore -- often.  In
almost every case of throughput vs. pool size, the nonblocking mode lost
fewer pixels than the blocking model.  As far as I can tell, this was
because of the latency introduced by the scheduled delay in returning
URBs to the pool. Unfortunately, I don't have my raw test numbers in
front of me, but the difference approached an order of magnitude for a
given throughput requirement.  It's certainly possible that I've got
something funky or unique in my configuration that makes this slower
than normal, but I doubt that it is too far off.
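 
To make the failure mode concrete, here's roughly the shape of the
blocking acquire as I understand it (a paraphrased sketch with
illustrative names, not verbatim driver code):
 
/*
 * Paraphrased sketch of the blocking acquire path (illustrative names,
 * not verbatim udlfb code).  The semaphore holds one count per free
 * URB in the pool.
 */
static struct urb *get_urb_blocking(struct dlfb_data *dev)
{
	/* Block until a completion returns a URB, or give up. */
	if (down_timeout(&dev->urbs.limit_sem, GET_URB_TIMEOUT)) {
		/*
		 * Timed out: this update's pixels are dropped, and
		 * nothing schedules a repaint to recover them later.
		 */
		atomic_set(&dev->lost_pixels, 1);
		return NULL;
	}
	/* Pop a free URB off the pool list under its spinlock. */
	return pop_free_urb(dev);
}
 
With the defio path, the up() that feeds that semaphore only happens
after a trip through a workqueue, so under sustained load the
down_timeout() above expires far more often than the pool size alone
would suggest.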
 
> In terms of pool size, I'd agree that we should not torture the user
> with module options for this.
>
> What I'd recommend is this:  4 64K buffers work great for the damage
> path in all the scenarios I've observed ... we only rarely make it to
> the 3rd or 4th URB before the earlier ones in flight come back.
 
[AK] <keanu>Whoa</keanu>.  I get to the 4th URB almost instantly every
time with the 480p video.
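 
(For reference, the pool in question is just a handful of pre-allocated
bulk URBs, each with its own 64K transfer buffer.  Roughly like this,
with illustrative names rather than the exact driver code:)
 
/* Illustrative sketch of pre-allocating the URB pool (not verbatim
 * udlfb code): count URBs, each with a size-byte bulk buffer. */
static int alloc_urb_pool(struct dlfb_data *dev, int count, size_t size)
{
	int i;

	sema_init(&dev->urbs.limit_sem, count);	/* one count per free URB */

	for (i = 0; i < count; i++) {
		struct urb *urb = usb_alloc_urb(0, GFP_KERNEL);
		void *buf;

		if (!urb)
			return -ENOMEM;
		buf = usb_alloc_coherent(dev->udev, size, GFP_KERNEL,
					 &urb->transfer_dma);
		if (!buf) {
			usb_free_urb(urb);
			return -ENOMEM;
		}
		usb_fill_bulk_urb(urb, dev->udev,
				  usb_sndbulkpipe(dev->udev, 1),
				  buf, size, urb_completion, dev);
		urb->transfer_flags |= URB_NO_TRANSFER_DMA_MAP;
		/* ... add the urb to the driver's free list here ... */
	}
	return 0;
}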
 
>
> The defio path should be the same, except for that scheduling latency
> to free the buffer (and transfer control to the current waiter) in a
> normal context via delayed work (defio introduces difficult context
> considerations - we crash if we don't get it right).  First off,
> perhaps there's a solution to defio's context limitations that doesn't
> involve an extra context switch.
 
[AK] Agreed on the context issues -- if we could avoid the context
switches, the in-completion up() calls might eliminate the huge
performance impact that I'm seeing.
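 
To illustrate what I mean by the context switch, here's a sketch of the
two release paths in the completion handler (illustrative names, not
verbatim driver code); today the defio option takes the deferred branch
to stay clear of defio's context issues:
 
/*
 * Sketch of the URB completion handler's two release paths
 * (illustrative names, not verbatim udlfb code).
 */
static void urb_completion(struct urb *urb)
{
	struct dlfb_data *dev = urb->context;
	unsigned long flags;

	spin_lock_irqsave(&dev->urbs.lock, flags);
	/* ... put the urb back on the free list ... */
	spin_unlock_irqrestore(&dev->urbs.lock, flags);

	if (fb_defio) {	/* defio option enabled */
		/*
		 * Defio-friendly path: hand the up() off to a workqueue.
		 * Any writer blocked in down_timeout() now waits for a
		 * scheduler round-trip before it can run again.
		 */
		schedule_delayed_work(&dev->release_urb_work, 0);
	} else {
		/*
		 * up() is callable from interrupt context, so the waiter
		 * could be released right here, with no extra context
		 * switch -- if the defio context issues allow it.
		 */
		up(&dev->urbs.limit_sem);
	}
}
 
If there's a safe way to take the else branch even with defio enabled,
I suspect most of the gap I measured goes away.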
 
>
> Barring that, there's a point where that scheduling latency shouldn't
> matter (where we'd be waiting anyway).  I would bump up the number of
> buffers from 4 until we find that point of diminishing returns.  I
> suspect it's not too many more.  Can you run a few tests to see if an
> adjustment of this parameter in the header can get us what we need?
> Also, if we're not filling the 64K buffers (if defio is sending us
> just a few pages at a time), then also try reducing the 64K buffer
> size, and increasing the number of buffers accordingly.
 
[AK] Certainly -- that kind of testing is exactly why the tuning
parameters are in there in the first place.
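For reference, the knobs are just ordinary module parameters, roughly of
this shape (names here are illustrative, not necessarily what would end
up in a final patch):
 
/* Illustrative sketch of the tuning knobs (names not final). */
static unsigned int urb_pool_size = 4;
module_param(urb_pool_size, uint, 0444);
MODULE_PARM_DESC(urb_pool_size, "Number of pre-allocated transfer URBs");

static bool urb_pool_nonblock;	/* default: block for a free URB */
module_param(urb_pool_nonblock, bool, 0444);
MODULE_PARM_DESC(urb_pool_nonblock,
		 "Drop updates instead of blocking when the pool is empty");
 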
So, here's my test scenario:
(1) Moderately high-motion video at 720x480
(2) Average compression of 17-20% (using the existing RLX command)
(3) Average USB on-the-wire b/w required: 165-180 Mb/s
(4) AMD core at 3.0GHz (nominal)
(5) "success" defined as 2 minutes of video with no visible loss or
stuttering (subjective, I know).
(6) mplayer -vo fbdev (without damage notification -- all defio)
 
Tests were run with both blocking and nonblocking models, starting with
a pool size of 4.
 
For the blocking case, no runs were successful.  Measured output for a
4-entry pool was less than 10 Mb/s, increasing linearly to about 20 Mb/s
for a pool size of 64, where the output performance curve leveled off.
Even ridiculously large pools (2048 or 4096 64K buffers) failed to
yield over 30 Mb/s sustained on the wire (they of course did very well
for the first few tens of seconds...).
 
For the non-blocking case, the measured output scaled linearly until the
USB pipe itself became the limiting factor.
A non-blocking pool of 4 gave about 50 Mb/s, with each additional 4 URBs
adding approximately another 50 Mb/s of output bandwidth.  A non-blocking
pool size of 13 was "successful", while a non-blocking pool size of 14
was required for zero lost pixels.
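 
As a rough consistency check (my arithmetic, assuming the roughly
50 Mb/s per 4 URBs scaling really is linear):
 
\[
T(N) \approx \frac{N}{4} \times 50\ \mathrm{Mb/s}
\quad\Rightarrow\quad
T(13) \approx 162\ \mathrm{Mb/s},\qquad T(14) \approx 175\ \mathrm{Mb/s}
\]
 
which brackets the 165-180 Mb/s the test stream needs on the wire, so
the 13/14 threshold is at least self-consistent.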
 
In all cases (both blocking and non-blocking), the URBs were universally
full (over 60K per URB average utilization).
 
>
> So, again, thanks so much for this patch!  While it's not one that can
> get applied as-is, the focus it puts on defio is great (what do we
> need to finish to enable defio by default?), and hopefully the
> discussion it triggers will help us find a good way to optimize perf
> in the defio path that's also simple.
 
[AK] True -- I'm wishing now that I had used a different subject line
and format in the original post; the patch wasn't so much intended to be
applied as-is but rather to spark the discussion.  Would've saved us all
some thrashing, I think.
 
>
> Thanks!
> Bernie

