[PATCH v8 4/7] drm/xe/eustall: Return -EIO error from read() if HW drops data

Fri Jan 31 19:30:03 UTC 2025

On Wed, Jan 29, 2025 at 08:45:59PM -0800, Dixit, Ashutosh wrote:
> On Wed, 15 Jan 2025 12:02:10 -0800, Harish Chegondi wrote:
> >
> > If the user space doesn't read the EU stall data fast enough,
> > it is possible that the EU stall data buffer can get filled,
> > and if the hardware wants to write more data, it simply drops
> > data due to unavailable buffer space. In that case, hardware
> > sets a bit in a register. If the driver detects data drop,
> > the driver read() returns -EIO error to let the user space
> > know that HW has dropped data. The -EIO error is returned
> > even if there is EU stall data in the buffer. A subsequent
> > read by the user space returns the remaining EU stall data.
> 
> As I mentioned earlier, entire dropped packet handling should be in this
> patch, so we can see the entire logic around this. So data_drop struct
> should be defined in this patch.
I worded the commit message that this commit is about read() returning
-EIO when data is dropped. So, I didn't put all the data drop code in
this patch. Sure, I can reword the commit message and move the code
into this patch.
> 
> >
> > Signed-off-by: Harish Chegondi <harish.chegondi at intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_eu_stall.c | 12 ++++++++++++
> >  drivers/gpu/drm/xe/xe_eu_stall.h |  1 +
> >  2 files changed, 13 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_eu_stall.c b/drivers/gpu/drm/xe/xe_eu_stall.c
> > index c388d733b857..437782f8433c 100644
> > --- a/drivers/gpu/drm/xe/xe_eu_stall.c
> > +++ b/drivers/gpu/drm/xe/xe_eu_stall.c
> > @@ -472,6 +472,7 @@ xe_eu_stall_stream_read_locked(struct xe_eu_stall_data_stream *stream,
> >   * before calling read().
> >   *
> >   * Returns: The number of bytes copied or a negative error code on failure.
> > + *	    -EIO if HW drops any EU stall data when the buffer is full.
> >   */
> >  static ssize_t xe_eu_stall_stream_read(struct file *file, char __user *buf,
> >				       size_t count, loff_t *ppos)
> > @@ -485,6 +486,16 @@ static ssize_t xe_eu_stall_stream_read(struct file *file, char __user *buf,
> >		return -EINVAL;
> >	}
> >
> > +	if (bitmap_weight(stream->data_drop.mask, XE_MAX_DSS_FUSE_BITS)) {
> 
> Since data_drop.mask is being touched elsewhere under xecore_buf->lock,
> here also it should be accessed under the same lock. So this returning -EIO
> should probably be moved into xe_eu_stall_stream_read_locked?
data_drop.mask is being accessed via set_bit(), clear_bit(), test_bit()
and bitmap_weight(). set_bit() and clear_bit() are atomic operations,
but test_bit() and bitmap_weight() are not atomic. So, not all the code
accessing the mask need to be under lock. The code that is under lock is
under the buffer lock, whereas xe_eu_stall_stream_read_locked() is under
gt->eu_stall->lock. So, moving this code into xe_eu_stall_stream_read_locked
would not make any difference. I think this code can exist outside of a
lock. If one read() just misses a data drop, the subsequent read would
report the data drop.
> 
> > +		if (!stream->data_drop.reported_to_user) {
> > +			stream->data_drop.reported_to_user = true;
> > +			xe_gt_dbg(gt, "EU stall data dropped in XeCores: %*pb\n",
> > +				  XE_MAX_DSS_FUSE_BITS, stream->data_drop.mask);
> > +			return -EIO;
> > +		}
> > +		stream->data_drop.reported_to_user = false;
> 
> I don't think this logic is correct. We should set this to false only after
> we have cleared all set bits (e.g. only after bitmap_weight) otherwise we
> might keep returning -EIO multiple times?
If the subsequent read() reads all the data from all the subslices, it
would clear all the bits. But if the user buffer is small and it doesn't
read all the data from all the subslices, some bits can continue to be
set and can cause multiple alternate -EIO returns. Ideally, the user
buffer should be big enough to accomodate all the data from the kernel
buffer.
> 
> If HW continues to drop data and keep setting the line, while we are
> resetting the bit, it is possible bitmap_weight might never become 0. I
> think that is ok, we have returned -EIO at least once to indicate to
> userspace that it is not reading data fast enough and HW is dropping data.
> 
> Or we may return -EIO multiple times as is happening here, where
> reported_to_user is set to 0 before all bits might have been cleared. So
> what is happening here might be ok too.
> 
> To see this clearly and evaluate it is why I am saying move all of this
> data drop handling and -EIO return into this one patch. So we can decide
> which approach to take: return -EIO just once or return multiple times.
> 
> We can also maybe defer this patch and merge the other stuff first if it's
> a separate patch.
> 
> So maybe this is ok, maybe not, anyway something to think about.
> 
> > +	}
> > +
> >	if (!(file->f_flags & O_NONBLOCK)) {
> >		do {
> >			if (!stream->pollin) {
> > @@ -680,6 +691,7 @@ static int xe_eu_stall_stream_init(struct xe_eu_stall_data_stream *stream,
> >	if (!stream->xecore_buf)
> >		return -ENOMEM;
> >
> > +	stream->data_drop.reported_to_user = false;
> >	bitmap_zero(stream->data_drop.mask, XE_MAX_DSS_FUSE_BITS);
> 
> Stream is kzalloc'd, why do you need to init these?
> 
> >
> >	xe_pm_runtime_get(gt_to_xe(gt));
> > diff --git a/drivers/gpu/drm/xe/xe_eu_stall.h b/drivers/gpu/drm/xe/xe_eu_stall.h
> > index f97c8bf8e852..8bc44e9e98af 100644
> > --- a/drivers/gpu/drm/xe/xe_eu_stall.h
> > +++ b/drivers/gpu/drm/xe/xe_eu_stall.h
> > @@ -31,6 +31,7 @@ struct xe_eu_stall_data_stream {
> >	struct xe_bo *bo;
> >	struct per_xecore_buf *xecore_buf;
> >	struct {
> > +		bool reported_to_user;
> >		xe_dss_mask_t mask;
> >	} data_drop;
> >	struct hrtimer poll_check_timer;
> > --
> > 2.47.1
> >