[Intel-gfx] [PATCH 1/1] drm/i915/guc: Fix GuC error capture sizing estimation and reporting

Teres Alexis, Alan Previn alan.previn.teres.alexis at intel.com
Mon Oct 3 18:28:29 UTC 2022



On Fri, 2022-09-30 at 15:35 -0700, Harrison, John C wrote:
> On 9/30/2022 14:08, Teres Alexis, Alan Previn wrote:
> > I disagree because its unlikely that all engines can reset all at once (we probably have bigger problems at the at
> > point) and if they all occurred within the same G2H handler scheduled worker run, our current gpu_coredump framework
> > would just discard the ones after the first one and so it wouldnt even matter if we did catch it.
> So min_size is not actually the minimal size for a meaningful capture? 
> So what is? And remember that for compute class engines, there is 
> dependent engine reset. So a reset of CCS2 also means a reset of RCS, 
> CCS0, CCS1 and CCS3. So having at least four engines per capture is not 
> unreasonable.
> 
Alan: min_size is a meaningful size for the worst case scenario of collecting the guc-report. However due to gpu-core-
dump, its not meaningful in terms of reporting that information out to the user. Thats not a limitation of this
subsystem.

> It seems pointless to go through a lot of effort to calculate the 
> minimum and recommend sizes only to basically ignore them by just 
> whispering very, very quietly that there might be a problem.
> 
Alan: I already responded that i will re-rev as per your recommendation and switch back to drm_notice.

> It also 
> seems pointless to complain about a minimum size that actually isn't the 
> minimum size. That's sort of worse - now you are telling the user there 
> is a problem when really there isn't.
> 
Alan: the min size is accurate - but whether we report to the use about them possibly facing a problem is where it gets
a bit unclear because of the limitation in the gpu-core-dump framework. We could drop the message completely if you like
- but then we'd have to remember to re-add it if we fix gpu-core-dump in future. For now, as i mentioned in the last
reply, i have already changed it back to "notice" as per your last comment. Perhaps you should have looked at rev2 which
was posted before your responses above. As mentioned in last reply, i disagreed but i am committing to your request
which was fixed in rev2 as per your request.

> IMHO, the min_size check should be meaningful and should be visible to 
> the user if it fails.
> 
> Also, are we still hitting the minimum size failure message? Now that 
> the calculation has been fixed, what sizes does it come up with for min 
> and spare? Are they within the allocation now or not?
> 
Yes - we are within the allocation with this patch (the fix of removing the redundant register-struct-size
multiplication brought the number down significantly).

> John.

So how would you like to proceed? Could we reply on rev2 btw?


More information about the Intel-gfx mailing list