[PATCH v4 2/5] drm/xe/eustall: Introduce API for EU stall sampling

Fri Nov 29 04:31:59 UTC 2024

Hi @Chegondi, Harish,

From the L0 UMD side, we would prefer to calculate the supported sampling rates and pass them on to KMD. For this we need a query that can expose:

1. Sampling clock granularity in nanoseconds, e.g. 251 * gpuClockPeriodNs
2. Min and max sampling input rates supported - [1, 7]
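
To make the request concrete, here is a rough sketch of how a UMD could turn such a query result into sampling intervals in nanoseconds. The struct, its field names, and the function name are hypothetical and not part of the posted uAPI; the GPU clock is assumed to be known in Hz.

```c
#include <stdint.h>

/* Hypothetical query result; not part of the posted uAPI. */
struct eu_stall_rate_info {
	uint64_t gpu_clock_hz;    /* GPU clock in Hz (assumed to be known) */
	uint32_t cycles_per_unit; /* sampling granularity in cycles, e.g. 251 */
	uint32_t min_rate;        /* smallest accepted multiplier, e.g. 1 */
	uint32_t max_rate;        /* largest accepted multiplier, e.g. 7 */
};

/*
 * Sampling interval in nanoseconds for a given multiplier, rounded to
 * the nearest nanosecond. Returns 0 if the multiplier is out of range
 * or the clock is unknown.
 */
static uint64_t eu_stall_interval_ns(const struct eu_stall_rate_info *info,
				     uint32_t rate)
{
	uint64_t cycles;

	if (rate < info->min_rate || rate > info->max_rate ||
	    info->gpu_clock_hz == 0)
		return 0;

	cycles = (uint64_t)rate * info->cycles_per_unit;
	return (cycles * 1000000000ULL + info->gpu_clock_hz / 2) /
	       info->gpu_clock_hz;
}
```

With an illustrative 1 GHz clock, multipliers 1 through 7 would give intervals from 251 ns to 1757 ns.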

Thanks,
Shubham

-----Original Message-----
From: Chegondi, Harish <harish.chegondi at intel.com> 
Sent: Thursday, November 28, 2024 12:18 AM
To: Dixit, Ashutosh <ashutosh.dixit at intel.com>
Cc: Nerlige Ramappa, Umesh <umesh.nerlige.ramappa at intel.com>; intel-xe at lists.freedesktop.org; Ausmus, James <james.ausmus at intel.com>; Degrood, Felix J <felix.j.degrood at intel.com>; Souza, Jose <jose.souza at intel.com>; Cabral, Matias A <matias.a.cabral at intel.com>; Ranjan, Joshua Santhosh <joshua.santosh.ranjan at intel.com>; Kumar, Shubham <shubham.kumar at intel.com>
Subject: Re: [PATCH v4 2/5] drm/xe/eustall: Introduce API for EU stall sampling

On Fri, Nov 22, 2024 at 10:18:14AM -0800, Dixit, Ashutosh wrote:
> On Wed, 20 Nov 2024 19:18:24 -0800, Dixit, Ashutosh wrote:
> >
> > On Wed, 20 Nov 2024 17:05:27 -0800, Umesh Nerlige Ramappa wrote:
> > >
> > > On Wed, Nov 20, 2024 at 11:04:28AM -0800, Dixit, Ashutosh wrote:
> > > > On Tue, 19 Nov 2024 15:59:12 -0800, Harish Chegondi wrote:
> > > >>
> > > >> > > +/**
> > > >> > > + * enum drm_xe_eu_stall_property_id - EU stall sampling input property ids.
> > > >> > > + *
> > > >> > > + * These properties are passed to the driver as a chain of
> > > > >> > > + * @drm_xe_ext_set_property structures with @property set to these
> > > > >> > > + * properties' enums and @value set to the corresponding values of these
> > > > >> > > + * properties. @drm_xe_user_extension base.name should be set to
> > > >> > > + * @DRM_XE_EU_STALL_EXTENSION_SET_PROPERTY.
> > > >> > > + */
> > > >> > > +enum drm_xe_eu_stall_property_id {
> > > >> > > +#define DRM_XE_EU_STALL_EXTENSION_SET_PROPERTY		0
> > > >> > > +	/**
> > > >> > > +	 * @DRM_XE_EU_STALL_PROP_SAMPLE_RATE: Sampling rate
> > > >> > > +	 * in multiples of 251 cycles. Valid values are 1 to 7.
> > > >> > > +	 * If the value is 1, sampling interval is 251 cycles.
> > > >> > > +	 * If the value is 7, sampling interval is 7 x 251 cycles.
> > > >> > > +	 */
> > > >> > > +	DRM_XE_EU_STALL_PROP_SAMPLE_RATE = 1,
> > > >> >
> > > >> > What is the rate of 251 cycles? If that can be clearly 
> > > >> > defined, then at first glance, I would think it's better to 
> > > >> > define this in terms of frequency. The implementation can 
> > > >> > decide how to translate that to HW configuration.
> > > >> >
> > > >> Since the duration of a cycle depends on the GPU clock, it can 
> > > > >> vary from GPU to GPU. So, if there is any translation in the 
> > > >> driver, it will have to be different for each GPU. I think 
> > > >> keeping this input as a multiplier of cycles may be more future 
> > > >> proof for the uAPI. I am trying to get more information and feedback from the user space regarding your suggestion.
> > > >> If it is feasible, I will implement in v6.
> > > >
> > > > Umesh has a point but I sort of agree with Harish because this 
> > > > value is directly fed into a register. But we do need some changes:
> > > >
> > > > 1. This 251 value showing up here doesn't make any sense and needs to go.
> > > > 2. According to Bspec 64036, HW supports "127 * N" sampling rates
> > > >   (in terms of cycles), so we should support those too.
> > > > 3. Even higher sampling rates (say 10x) are being proposed for the
> > > >   future. So these should also be supported.
> > > >
> > > > So my proposal is simple, but let's see if it can be made to 
> > > > work. The uapi will directly input the sampling rate in number 
> > > > of cycles (so the value coming in is what the GPU freq is 
> > > > divided by). So e.g. if "3 * 251" is required "3 * 251" will 
> > > > come in through the uapi. If UMD wants "7 * 127", they will send 
> > > > in "7 * 127". The driver will internally map this value into the "closest" sampling rate supported by HW.
> > > >
> > > > I am assuming that UMD's already know what sampling rates are 
> > > > supported by a particular HW platform so they can send in the 
> > > > exact value they need. Otherwise the driver can always map the 
> > > > value sent by userspace. Say UMD sends a value 10, this will be 
> > > > mapped into "1 * 127" which is the closest sampling rate supported to 10.
> > > >
> > > > So this way all sampling rates can be supported. UMD just says I 
> > > > want a sampling rate of "GPU_freq divided by 10" and they 
> > > > automatically get whatever is the closest available. They 
> > > > probably do need to have an idea of what rates are supported on 
> > > > a particular HW platform, I am assuming they have this 
> > > > information from Bspec, so they can send in exact values if they know and driver will be able to set the exact value UMD has specified.
> 
> This is an important point here. Can we assume UMD's already know the 
> sampling rates available or not? KMD can handle whatever value UMD 
> sends in but UMD still has to have an idea of what sampling rates are 
> supported by KMD, to be able to set an acceptable sampling rate.
> 
> Otherwise, KMD will need to expose what sampling rates are supported 
> to UMD. There seem to be a couple of ways of doing this:
> 
> 1. KMD can expose the available sampling rates in 
> drm_xe_query_eu_stall
> 
> Or,
> 
> 2. The UMD can set their sampling rate and the actual sampling rate set by
>    KMD for that stream can be exposed through DRM_XE_OBSERVATION_IOCTL_INFO
>    ioctl. If UMD doesn't like the sampling rate set by KMD they can close
>    the stream and reopen a new stream. Basically they would need to iterate
>    to arrive at an acceptable sampling rate.
> 
> Or,
> 
> 3. KMD does nothing, assumes UMD's know the sampling rate "out of band" via
>    things such as Bspec.
> 
> I'm thinking maybe we should expose the min and max sampling rates through 1.
> and UMD's can select something in between and KMD can map that to one 
> of the sampling rates available in HW and maybe also expose the final 
> sampling rate through 2.
> 
> So e.g. for now min and max exposed would be 1*251 through 7*251 
> (ignoring the 127 stuff for now). Min/max might be more future proof 
> if HW changes to a different scheme, rather than the 7 or 14 fixed 
> sampling rates they have now.

As of now, the driver sets a medium sampling rate as the default if the user doesn't pass any sampling rate through the properties. In this patch series, 4 is the default sampling rate, which is the mid-point between 1 (the fastest sampling rate) and 7 (the slowest sampling rate). It would be good to get feedback from the UMDs on which sampling rates are typically used. Currently, Mesa sets the sampling rate to the fastest, 1.
If the driver exposes the min and max sampling rates and sets the mid-point as the default sampling rate, the user can override it with the fastest or the slowest sampling rate or any other rate in between.
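
For illustration only, the mapping discussed above (a requested interval in cycles mapped to the closest supported rate, with the mid-point as the default) could be sketched roughly as below. The macro and function names are made up for this example, the 127-cycle scheme is ignored, and treating 0 as "no rate passed" is an assumption, not something the patch defines.

```c
#include <stdint.h>

#define EU_STALL_CYCLES_PER_UNIT 251u /* granularity used in this series */
#define EU_STALL_RATE_MIN        1u   /* fastest sampling rate */
#define EU_STALL_RATE_MAX        7u   /* slowest sampling rate */

/* Mid-point default: 4 for the range [1, 7]. */
static uint32_t eu_stall_default_rate(void)
{
	return (EU_STALL_RATE_MIN + EU_STALL_RATE_MAX) / 2;
}

/*
 * Map a requested sampling interval (in cycles) to the nearest supported
 * multiplier of 251 cycles, clamped to [min, max]. A request of 0 is
 * treated as "not set" (assumption) and yields the default.
 */
static uint32_t eu_stall_closest_rate(uint64_t requested_cycles)
{
	uint64_t rate;

	if (requested_cycles == 0)
		return eu_stall_default_rate();

	/* Clamp early so the rounding below cannot overflow. */
	if (requested_cycles >=
	    (uint64_t)EU_STALL_RATE_MAX * EU_STALL_CYCLES_PER_UNIT)
		return EU_STALL_RATE_MAX;

	/* Round to the nearest multiple of the granularity. */
	rate = (requested_cycles + EU_STALL_CYCLES_PER_UNIT / 2) /
	       EU_STALL_CYCLES_PER_UNIT;
	if (rate < EU_STALL_RATE_MIN)
		rate = EU_STALL_RATE_MIN;

	return (uint32_t)rate;
}
```

With this sketch, a request of 10 cycles maps to rate 1 and a request of "3 * 251" maps to rate 3, while anything above "7 * 251" is clamped to 7.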

Thanks
Harish.
> 
> > >
> > > I am okay with whatever makes sense from a UMD perspective and 
> > > whatever can be extended easily in future. Just the cycles as you 
> > > are suggesting should be good as well.
> >
> > My suggestion is just the inverse of sampling freq, in a sense, i.e.
> >
> >	eu stall sampling freq = gpu freq / N
> >
> > So we get N through the uapi.
> >
> > > > Just curious, since the gpu frequency may vary, any idea how 
> > > the UMDs map this data to their timeline?
> >
> > Note that we are dealing with the sampling freq here, the freq at 
> > which eu stall data is sampled, say 1 / (n * 251) or 1 / (n * 127) of 
> > gpu freq. But yes sampling freq can vary with gpu freq.
> >
> > But otherwise UMD's don't deal with time very much, only the IP 
> > (instruction pointer). So e.g. eu stall data contains: basically IP 
> > x was stalled for y cycles in this snapshot/sample. Since IP is 
> > unique, it can be mapped to a line of code for an eu kernel e.g.
> >
> > So I guess there is also an approximate notion of time, so when this 
> > eu kernel was sampled (or ran) the first time, IP x was stalled for 
> > y cycles, when it was sampled the next time, IP w was stalled for z 
> > cycles. So you would have a kind of moving window for the stall 
> > times for a particular IP. So it's based on IP, rather than time directly.
> >
> > At least that's how I think it works :/
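
As a rough illustration of the IP-based view described above, a UMD could accumulate stall cycles per instruction pointer along the following lines. The sample layout here is hypothetical (the real HW record format is different), and the fixed-size table with a linear search is only meant to keep the sketch short.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded sample: "IP x was stalled for y cycles". */
struct stall_sample {
	uint64_t ip;
	uint32_t stall_cycles;
};

/* Running total of stall cycles for one IP. */
struct ip_total {
	uint64_t ip;
	uint64_t total_cycles;
};

/*
 * Accumulate stall cycles per IP into totals[]. Returns the number of
 * distinct IPs recorded. Samples for new IPs are dropped once the table
 * holds cap entries.
 */
static size_t accumulate_by_ip(const struct stall_sample *samples, size_t n,
			       struct ip_total *totals, size_t cap)
{
	size_t used = 0;

	for (size_t i = 0; i < n; i++) {
		size_t j;

		/* Look for an existing entry for this IP. */
		for (j = 0; j < used; j++)
			if (totals[j].ip == samples[i].ip)
				break;

		if (j == used) {
			if (used == cap)
				continue; /* table full, drop the sample */
			totals[used].ip = samples[i].ip;
			totals[used].total_cycles = 0;
			used++;
		}
		totals[j].total_cycles += samples[i].stall_cycles;
	}
	return used;
}
```

Each IP in the result can then be mapped back to a line of code in the eu kernel.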