[PATCH v8 4/7] drm/xe/eustall: Return -EIO error from read() if HW drops data

Thu Jan 30 17:05:10 UTC 2025

On Wed, 29 Jan 2025 20:45:59 -0800, Dixit, Ashutosh wrote:
>
> On Wed, 15 Jan 2025 12:02:10 -0800, Harish Chegondi wrote:
> >
> > If the user space doesn't read the EU stall data fast enough,
> > it is possible that the EU stall data buffer can get filled,
> > and if the hardware wants to write more data, it simply drops
> > data due to unavailable buffer space. In that case, hardware
> > sets a bit in a register. If the driver detects data drop,
> > the driver read() returns -EIO error to let the user space
> > know that HW has dropped data. The -EIO error is returned
> > even if there is EU stall data in the buffer. A subsequent
> > read by the user space returns the remaining EU stall data.
>
> As I mentioned earlier, entire dropped packet handling should be in this
> patch, so we can see the entire logic around this. So data_drop struct
> should be defined in this patch.
>
> >
> > Signed-off-by: Harish Chegondi <harish.chegondi at intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_eu_stall.c | 12 ++++++++++++
> >  drivers/gpu/drm/xe/xe_eu_stall.h |  1 +
> >  2 files changed, 13 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_eu_stall.c b/drivers/gpu/drm/xe/xe_eu_stall.c
> > index c388d733b857..437782f8433c 100644
> > --- a/drivers/gpu/drm/xe/xe_eu_stall.c
> > +++ b/drivers/gpu/drm/xe/xe_eu_stall.c
> > @@ -472,6 +472,7 @@ xe_eu_stall_stream_read_locked(struct xe_eu_stall_data_stream *stream,
> >   * before calling read().
> >   *
> >   * Returns: The number of bytes copied or a negative error code on failure.
> > + *	    -EIO if HW drops any EU stall data when the buffer is full.
> >   */
> >  static ssize_t xe_eu_stall_stream_read(struct file *file, char __user *buf,
> >				       size_t count, loff_t *ppos)
> > @@ -485,6 +486,16 @@ static ssize_t xe_eu_stall_stream_read(struct file *file, char __user *buf,
> >		return -EINVAL;
> >	}
> >
> > +	if (bitmap_weight(stream->data_drop.mask, XE_MAX_DSS_FUSE_BITS)) {
>
> Since data_drop.mask is being touched elsewhere under xecore_buf->lock,
> here also it should be accessed under the same lock. So this returning -EIO
> should probably be moved into xe_eu_stall_stream_read_locked?
>
> > +		if (!stream->data_drop.reported_to_user) {
> > +			stream->data_drop.reported_to_user = true;
> > +			xe_gt_dbg(gt, "EU stall data dropped in XeCores: %*pb\n",
> > +				  XE_MAX_DSS_FUSE_BITS, stream->data_drop.mask);
> > +			return -EIO;
> > +		}
> > +		stream->data_drop.reported_to_user = false;
>
> I don't think this logic is correct. We should set this to false only after
> we have cleared all set bits (e.g. only after bitmap_weight) otherwise we
> might keep returning -EIO multiple times?
>
> If HW continues to drop data and keep setting the line, while we are
> resetting the bit, it is possible bitmap_weight might never become 0. I
> think that is ok, we have returned -EIO at least once to indicate to
> userspace that it is not reading data fast enough and HW is dropping data.
>
> Or we may return -EIO multiple times as is happening here, where
> reported_to_user is set to 0 before all bits might have been cleared. So
> what is happening here might be ok too.
>
> To see this clearly and evaluate it is why I am saying move all of this
> data drop handling and -EIO return into this one patch. So we can decide
> which approach to take: return -EIO just once or return multiple times.
>
> We can also maybe defer this patch and merge the other stuff first if it's
> a separate patch.
>
> So maybe this is ok, maybe not, anyway something to think about.

Because we don't want to complicate this, here's another simple idea:

In eu_stall_data_buf_check(), when we see the drop bit set for a particular
dss, move the read pointer (effectively discarding any data from that dss)
and also clear_dropped_eviction_line_bit(), but the dss with dropped data
is saved off in data_drop.mask. So basically we are clearing and saving the
error condition. Discarding data from that dss might be ok, since HW has
already dropped data.

Now in the next read_locked, if any data_drop.mask bit is set, return -EIO
and clear data_drop.mask. This way we don't need
data_drop.reported_to_user. So this way the entire situation is cleared and
handled once we have returned -EIO.

If more data gets dropped before the next read, same process repeats and
-EIO will be returned again.

>
> > +	}
> > +
> >	if (!(file->f_flags & O_NONBLOCK)) {
> >		do {
> >			if (!stream->pollin) {
> > @@ -680,6 +691,7 @@ static int xe_eu_stall_stream_init(struct xe_eu_stall_data_stream *stream,
> >	if (!stream->xecore_buf)
> >		return -ENOMEM;
> >
> > +	stream->data_drop.reported_to_user = false;
> >	bitmap_zero(stream->data_drop.mask, XE_MAX_DSS_FUSE_BITS);
>
> Stream is kzalloc'd, why do you need to init these?
>
> >
> >	xe_pm_runtime_get(gt_to_xe(gt));
> > diff --git a/drivers/gpu/drm/xe/xe_eu_stall.h b/drivers/gpu/drm/xe/xe_eu_stall.h
> > index f97c8bf8e852..8bc44e9e98af 100644
> > --- a/drivers/gpu/drm/xe/xe_eu_stall.h
> > +++ b/drivers/gpu/drm/xe/xe_eu_stall.h
> > @@ -31,6 +31,7 @@ struct xe_eu_stall_data_stream {
> >	struct xe_bo *bo;
> >	struct per_xecore_buf *xecore_buf;
> >	struct {
> > +		bool reported_to_user;
> >		xe_dss_mask_t mask;
> >	} data_drop;
> >	struct hrtimer poll_check_timer;
> > --
> > 2.47.1
> >