<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Wed, Jun 14, 2017 at 2:00 AM, Chris Wilson <span dir="ltr"><<a href="mailto:chris@chris-wilson.co.uk" target="_blank">chris@chris-wilson.co.uk</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Quoting Jason Ekstrand (2017-06-13 22:53:20)<br> <div><div class="gmail-h5">> As I've been working on converting more things in the GL driver over to<br> > blorp, I've been highly annoyed by all of the hangs on Haswell. About one<br> > in 3-5 Jenkins runs would hang somewhere. After looking at about a<br> > half-dozen error states, I noticed that all of the hangs seemed to be on<br> > fast-clear operations (clear or resolve) that happen at the start of a<br> > batch, right after STATE_BASE_ADDRESS.<br> ><br> > Haswell seems to be a bit more picky than other hardware about having<br> > fast-clear operations in flight at the same time as regular rendering and<br> > hangs if the two ever overlap. (Other hardware can get rendering<br> > corruption but not usually hangs.) Also, Haswell doesn't fully stall if<br> > you just do a RT flush and a CS stall. The hardware docs refer to<br> > something they call an "end of pipe sync" which is a CS stall with a write<br> > to the workaround BO. On Haswell, you also need to read from that same<br> > address to create a memory dependency and make sure the system is fully<br> > stalled.<br> ><br> > When you call brw_blorp_resolve_color it calls brw_emit_pipe_control_flush<br> > and does the correct flushes and then calls into core blorp to do the<br> > actual resolve operation. If the batch doesn't have enough space left in<br> > it for the fast-clear operation, the batch will get split and the<br> > fast-clear will happen in the next batch. 
I believe what is happening is<br> > that while we're building the second batch that actually contains the<br> > fast-clear, some other process completes a batch and inserts it between our<br> > PIPE_CONTROL to do the stall and the actual fast-clear. We then end up<br> > with more stuff in flight than we can handle and the GPU explodes.<br> ><br> > I'm not 100% convinced of this explanation because it seems a bit fishy<br> > that a context switch wouldn't be enough to fully flush out the GPU.<br> > However, what I do know is that, without these patches I get a hang in one<br> > out of three to five Jenkins runs on my wip/i965-blorp-ds branch. With the<br> > patches (or an older variant that did the same thing), I have done almost 20<br> > Jenkins runs and have yet to see a hang. I'd call that success.<br></div></div></blockquote><div><br></div><div>For the record, I *think* this also improves Sky Lake. I believe I saw hangs (less often, maybe 1 in 10) without this and have seen none with it.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div class="gmail-h5"> </div></div>Note that a context switch is itself just a batch that restores the registers<br> and GPU state.<br> <br> The kernel does<br> <br> PIPE_CONTROLs for invalidate-caches<br> MI_SET_CONTEXT<br> MI_BB_START<br> PIPE_CONTROLs for flush-caches<br> MI_STORE_DWORD (seqno)<br> MI_USER_INTERRUPT<br> <br> What I believe you are seeing is that MI_SET_CONTEXT is leaving the GPU<br> in an active state requiring a pipeline barrier before adjusting. It<br> will be the equivalent of switching between GL and blorp in the middle of<br> a batch.<br></blockquote><div><br></div><div>That's also a reasonable theory (or maybe even better). 
However, the work-around is the same.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> The question I have is whether we apply the fix in the kernel, i.e. do a<br> full end of pipe sync after every MI_SET_CONTEXT. Userspace has the<br> advantage of knowing if/when such a hammer is required, but equally we<br> have to learn where by trial-and-error and if a second context user ever<br> manifests, they will have to be taught the same lessons.<span class="gmail-HOEnZb"><font color="#888888"><br></font></span></blockquote><div><br></div><div>Right.<br><br>Here are the arguments for doing it in the kernel:<br><br></div><div> 1) It's the "right" place to do it because it appears to be a cross-context issue.<br></div><div> 2) The kernel knows whether or not you're getting an actual context switch and can insert the end-of-pipe sync only when one happens rather than on every batch.<br><br></div><div>Here are the arguments for doing it in userspace:<br><br></div><div> 1) Userspace knows whether or not we're doing an actual fast-clear operation and can flush only for fast-clears at the beginning of the batch.<br></div><div> 2) The kernel isn't flushing now, so we'll get hangs until people update kernels unless we do it in userspace.<br><br></div><div>My gut says userspace, but that's because I tend to have a mild distrust of the kernel. There are some things that are the kernel's job (dealing with context switches, for instance), but I'm a big fan of putting anything in userspace that can reasonably go there.<br></div><div><br></div><div>Here's some more data. 
Knowing this was a big giant hammer, I ran a full suite of benchmarks overnight on my Haswell GT3 and this is what I found:<br><br>
<table>
<tr><th>Test</th><th>0-master</th><th>1-i965-end-of-pipe</th><th>diff</th></tr>
<tr><td>bench_manhattan</td><td>4442.510</td><td>4430.870</td><td>-11.640</td></tr>
<tr><td>bench_manhattanoff</td><td>4683.300</td><td>4663.000</td><td>-20.300</td></tr>
<tr><td>bench_OglBatch0</td><td>773.523</td><td>771.027</td><td>-2.496</td></tr>
<tr><td>bench_OglBatch1</td><td>775.858</td><td>771.802</td><td>-4.056</td></tr>
<tr><td>bench_OglBatch4</td><td>747.629</td><td>745.522</td><td>-2.107</td></tr>
<tr><td>bench_OglPSBump2</td><td>513.528</td><td>514.944</td><td>1.416</td></tr>
</table>
<br></div><div>So the only statistically significant differences were in manhattan and some of the batch tests, all by around 0.5% or less, which could easily be noise (though ministat seems to think it's significant). So, if the big hammer is hurting us, it's not hurting us badly.<br></div></div></div></div>
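<div>The "around 0.5% or less" figure can be rechecked from the benchmark numbers above. Here is a minimal sketch in Python that uses only the scores quoted in the table. The names <code>RESULTS</code>, <code>relative_diff_percent</code> and <code>summarize</code> are illustrative and do not come from any benchmark tooling, and the sketch does not reproduce the ministat significance test:</div>

```python
# Scores copied from the benchmark table: (0-master, 1-i965-end-of-pipe).
# The dictionary and function names here are illustrative assumptions.
RESULTS = {
    "bench_manhattan": (4442.510, 4430.870),
    "bench_manhattanoff": (4683.300, 4663.000),
    "bench_OglBatch0": (773.523, 771.027),
    "bench_OglBatch1": (775.858, 771.802),
    "bench_OglBatch4": (747.629, 745.522),
    "bench_OglPSBump2": (513.528, 514.944),
}


def relative_diff_percent(baseline, patched):
    """Return the change from baseline to patched as a percentage of baseline."""
    return (patched - baseline) / baseline * 100.0


def summarize(results):
    """Map each test name to its relative difference in percent."""
    return {
        name: relative_diff_percent(baseline, patched)
        for name, (baseline, patched) in results.items()
    }


if __name__ == "__main__":
    for name, pct in summarize(RESULTS).items():
        print(f"{name:20s} {pct:+.3f}%")
```

<div>Every test moves by roughly half a percent or less in either direction. The largest shift is bench_OglBatch1, at slightly over 0.5%.</div>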