<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Fri, Jun 15, 2018 at 4:44 PM, Eric Anholt <span dir="ltr"><<a href="mailto:eric@anholt.net" target="_blank">eric@anholt.net</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="gmail-">Michel Dänzer <<a href="mailto:michel@daenzer.net">michel@daenzer.net</a>> writes:<br> <br> > On 2018-06-15 05:25 PM, Jason Ekstrand wrote:<br> >> On June 15, 2018 01:14:24 Michel Dänzer <<a href="mailto:michel@daenzer.net">michel@daenzer.net</a>> wrote:<br> >>> On 2018-06-15 07:31 AM, Jason Ekstrand wrote:<br> >>>><br> >>>> I did some testing and x11perf -copywinwin500 is... exactly the same<br> >>>> with<br> >>>> or without my patches. If anything they might improve it by just a<br> >>>> hair.<br> >>><br> >>> Possible explanations I can think of:<br> >>><br> >>> 1. Your glamor still has its own FBO cache. Which version of xserver are<br> >>> you testing with?<br> >>><br> >> 1.19 I think<br> ><br> > Okay, that doesn't have the glamor FBO cache anymore.<br> ><br> ><br> >>> 2. The i965 driver cache isn't hit even before these changes.<br> >> <br> >> It's definitely getting hit in both cases, it just may require a<br> >> slightly larger cache of we aren't recycling BOs until they're idle.<br> ><br> > It might be more than just slightly, -copywinwin500 can queue many<br> > overlapping copies between flushes. Can you compare the maximum total<br> > cache size with and without this series?<br> <br> </span>I suspect it'll be only about a factor of<br> how-many-batchbuffers-before-<wbr>throttling difference -- while the<br> batchbuffer still references the BO, the bufmgr wouldn't see the buffer<br> to reuse it anyway. 
I suspect we hit the aperture limit and flush in<br> the copywinwin500 case.<br> </blockquote></div></div><div class="gmail_extra"><br></div><div class="gmail_extra">At Ken's suggestion, I gathered some cache hit/miss statistics. I did three runs each with master and with my branch:</div><div class="gmail_extra"><br>Master:<br><br>hits = 455868,<br>misses = 388,<br>max_bucket_size = 160<br><br>hits = 404358,<br>misses = 113,<br>max_bucket_size = 34<br><br>hits = 497731,<br>misses = 363,<br>max_bucket_size = 148<br></div><div class="gmail_extra"><br></div><div class="gmail_extra">With patches:<br><br>hits = 493634,<br>misses = 253,<br>max_bucket_size = 85<br><br>hits = 495667,<br>misses = 237,<br>max_bucket_size = 83<br><br>hits = 454738,<br>misses = 358,<br>max_bucket_size = 132<br></div><div class="gmail_extra"><br></div><div class="gmail_extra">Some of the numbers, as you can see, are rather noisy, but the end result is about the same: in every run we get at least 1000x as many cache hits as misses when running that test. I don't think the choice to recycle busy BOs is really gaining us anything whatsoever. It is worth noting that I did all of those runs with debug builds because I had to use gdb to get the data back out of the driver (prints inside the GL driver used by glamor don't work too well). That probably affected things a bit, but I doubt the end result would have been that much different.<br></div><div class="gmail_extra"><br></div><div class="gmail_extra">Which raises the question: why does Michel see such a big difference on radeon? Is there something else that's causing the slowdown? Is recomputing surface layouts expensive? Is there more VMA shuffling that's causing problems?</div></div>
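<div dir="ltr">As a sanity check on the "at least 1000x" figure, here is a minimal sketch of the arithmetic. It uses only the hit/miss counts listed above; the variable and function names are made up for illustration and are not anything in the driver.</div>

```python
# Hit/miss counts copied from the six runs listed above, as (hits, misses).
# The names below are hypothetical and exist only for this sketch.
MASTER_RUNS = [(455868, 388), (404358, 113), (497731, 363)]
PATCHED_RUNS = [(493634, 253), (495667, 237), (454738, 358)]


def hit_miss_ratios(runs):
    """Return hits divided by misses for each run."""
    return [hits / misses for hits, misses in runs]


def worst_hit_rate(runs):
    """Return the lowest fraction of lookups that hit the cache."""
    return min(hits / (hits + misses) for hits, misses in runs)


if __name__ == "__main__":
    for label, runs in (("master", MASTER_RUNS), ("patched", PATCHED_RUNS)):
        ratios = ", ".join(f"{r:.0f}x" for r in hit_miss_ratios(runs))
        print(f"{label}: hit/miss ratios = {ratios}; "
              f"worst hit rate = {worst_hit_rate(runs):.4%}")
```

<div dir="ltr">The spread between runs on the same branch is larger than the difference between branches, which is why I would not read anything into which side looks better.</div>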