[Intel-gfx] [PATCH 4/4] drm/i915/perf: Map OA buffer to user space for gen12 performance query

Fri Jul 24 18:55:06 UTC 2020

Quoting Umesh Nerlige Ramappa (2020-07-24 19:47:37)
> On Fri, Jul 24, 2020 at 05:34:11PM +0100, Chris Wilson wrote:
> >Quoting Umesh Nerlige Ramappa (2020-07-24 17:29:56)
> >> On Fri, Jul 24, 2020 at 01:42:33PM +0100, Chris Wilson wrote:
> >> >Quoting Umesh Nerlige Ramappa (2020-07-24 01:19:01)
> >> >> From: Piotr Maciejewski <piotr.maciejewski at intel.com>
> >> >>
> > >> >> i915 has supported a time-based sampling mode, which is good for overall
> > >> >> system monitoring but not enough for the query mode used to measure a
> > >> >> single draw call or dispatch. Gen9-Gen11 use the current i915 perf
> > >> >> implementation for queries, but Gen12+ requires a new approach based on
> > >> >> reports triggered into the OA buffer.
> >> >>
> > >> >> Reports are triggered into the OA buffer by writing to a trigger
> > >> >> register. Optionally, an unused counter/register is set to a marker
> > >> >> value so that a triggered report can be identified in the OA buffer.
> > >> >> Reports are usually triggered at the start and end of the work being
> > >> >> measured.
> >> >>
> > >> >> Since the OA buffer is large and queries can be frequent, an efficient
> > >> >> way to find triggered reports is required. Knowing the current head and
> > >> >> tail offsets into the OA buffer makes it easier to locate the reports of
> > >> >> interest.
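A minimal user-space sketch of the search that head/tail make cheap. It assumes a fixed report size (`OA_REPORT_SIZE`, a hypothetical 256 bytes), that head marks the oldest unread report and tail the next write position, both as byte offsets into a circular buffer whose size is a multiple of the report size, and that the marker sits in a hypothetical dword `marker_dw` of each report. The real report layout and marker location are not described in this thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical report size; the real size depends on the OA report format. */
#define OA_REPORT_SIZE 256u

/*
 * Scan reports from head (inclusive) to tail (exclusive) in a circular
 * buffer and return the byte offset of the first report whose dword at
 * index marker_dw equals marker, or -1 if there is none (or input is bad).
 */
static long find_marked_report(const uint8_t *buf, uint32_t buf_size,
                               uint32_t head, uint32_t tail,
                               size_t marker_dw, uint32_t marker)
{
    /* Reject layouts that would make the walk skip past tail forever. */
    if (buf_size == 0 || buf_size % OA_REPORT_SIZE ||
        head % OA_REPORT_SIZE || tail % OA_REPORT_SIZE ||
        head >= buf_size || tail >= buf_size ||
        (marker_dw + 1) * sizeof(uint32_t) > OA_REPORT_SIZE)
        return -1;

    uint32_t off = head;

    while (off != tail) {
        uint32_t value;

        /* memcpy avoids alignment/aliasing issues when reading the dword. */
        memcpy(&value, buf + off + marker_dw * sizeof(uint32_t), sizeof(value));
        if (value == marker)
            return (long)off;

        /* Advance one report, wrapping at the end of the buffer. */
        off = (off + OA_REPORT_SIZE) % buf_size;
    }
    return -1;
}
```

Only the reports between head and tail are visited, so the cost is bounded by the amount of new data rather than by the size of the whole OA buffer.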
> >> >>
> > >> >> The current perf OA interface does not expose head/tail information to
> > >> >> the user, and it filters out invalid reports before sending data to the
> > >> >> user. Also, given the limited size of the user buffer used during a
> > >> >> query, creating a 1:1 copy of the OA buffer in user space added
> > >> >> undesired complexity.
> >> >>
> > >> >> The solution was to map the OA buffer to user space, provided that:
> > >> >>
> > >> >> (1) it is accessed by a privileged user, and
> > >> >> (2) OA report filtering is not used.
> >> >>
> > >> >> These two conditions satisfy the safety criteria that the current
> > >> >> perf interface addresses.
> >> >>
> >> >> To enable the query:
> >> >> - Add an ioctl to expose head and tail to the user
> >> >> - Add an ioctl to return size and offset of the OA buffer
> >> >> - Map the OA buffer to the user space
> >> >>
> >> >> v2:
> >> >> - Improve commit message (Chris)
> >> >> - Do not mmap based on gem object filp. Instead, use perf_fd and support
> >> >>   mmap syscall (Chris)
> >> >> - Pass non-zero offset in mmap to enforce the right object is
> >> >>   mapped (Chris)
> >> >> - Do not expose gpu_address (Chris)
> >> >> - Verify start and length of vma for page alignment (Lionel)
> >> >> - Move SQNTL config out (Lionel)
> >> >>
> >> >> v3: (Chris)
> >> >> - Omit redundant checks
> > >> >> - Return VM_FAULT_SIGBUS if old stream is closed
> >> >> - Maintain reference counts to stream in vm_open and vm_close
> >> >> - Use switch to identify object to be mapped
> >> >>
> >> >> v4: Call kref_put on closing perf fd (Chris)
> >> >> v5:
> >> >> - Strip access to OA buffer from unprivileged child of a privileged
> >> >>   parent. Use VM_DONTCOPY
> >> >> - Enforce MAP_PRIVATE by checking for VM_MAYSHARE
> >> >>
> >> >> Signed-off-by: Piotr Maciejewski <piotr.maciejewski at intel.com>
> >> >> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa at intel.com>
> >> >> ---
> >> >> @@ -3314,12 +3427,113 @@ static int i915_perf_release(struct inode *inode, struct file *file)
> >> >>         i915_perf_destroy_locked(stream);
> >> >>         mutex_unlock(&perf->lock);
> >> >>
> >> >> +       unmap_mapping_range(file->f_mapping, 0, OA_BUFFER_SIZE, 1);
> >> >
> > >> >You can just use unmap_mapping_range(file->f_mapping, 0, -1, 1);
> > >> >It scales with the number of VMAs present, so no worries: be conservative.
> >> >(Otherwise, you need s/0/OA_BUFFER_OFFSET/.)
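To make the offset point concrete: `unmap_mapping_range()` takes its range in file offsets, and the OA buffer is mapped at a non-zero offset, so a range starting at 0 and `OA_BUFFER_SIZE` bytes long need not cover the mapping. A small model follows; the values of the two constants are hypothetical (the patch's actual values are not shown here), and an end of `UINT64_MAX` stands in for the suggested whole-file `0, -1` range.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical values, for illustration only. */
#define MODEL_OA_BUFFER_SIZE   (16ull << 20)
#define MODEL_OA_BUFFER_OFFSET (16ull << 20)

/*
 * Does the range [hole_begin, hole_end) fully cover a mapping of file
 * offsets [map_off, map_off + map_len)? Pages outside the range would
 * keep their PTEs after the zap.
 */
static bool hole_covers(uint64_t hole_begin, uint64_t hole_end,
                        uint64_t map_off, uint64_t map_len)
{
    return hole_begin <= map_off && map_off + map_len <= hole_end;
}
```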
> >> >
> >> >> +
> >> >>         /* Release the reference the perf stream kept on the driver. */
> >> >>         drm_dev_put(&perf->i915->drm);
> >> >>
> >> >>         return 0;
> >> >>  }
> >> >>
> >> >> +static void vm_open_oa(struct vm_area_struct *vma)
> >> >> +{
> >> >> +       struct i915_perf_stream *stream = vma->vm_private_data;
> >> >> +
> >> >> +       GEM_BUG_ON(!stream);
> >> >> +       perf_stream_get(stream);
> >> >> +}
> >> >> +
> >> >> +static void vm_close_oa(struct vm_area_struct *vma)
> >> >> +{
> >> >> +       struct i915_perf_stream *stream = vma->vm_private_data;
> >> >> +
> >> >> +       GEM_BUG_ON(!stream);
> >> >> +       perf_stream_put(stream);
> >> >> +}
> >> >> +
> >> >> +static vm_fault_t vm_fault_oa(struct vm_fault *vmf)
> >> >> +{
> >> >> +       struct vm_area_struct *vma = vmf->vma;
> >> >> +       struct i915_perf_stream *stream = vma->vm_private_data;
> >> >> +       struct i915_perf *perf = stream->perf;
> >> >> +       struct drm_i915_gem_object *obj = stream->oa_buffer.vma->obj;
> >> >> +       int err;
> >> >> +       bool closed;
> >> >
> >> >So vm_area_struct has a reference to the stream, that looks good now.
> >> >But there's no reference held to the vma itself.
> >>
> > >> How do I get a reference to the vma?
> >
> >That would be i915_vma_get(), but you don't need to if we control the
> >order correctly, as then neither the PTE nor the ongoing faulthandler
> >last longer than the i915_perf_stream
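The stream reference counting in vm_open_oa()/vm_close_oa() and the v4 kref_put on close can be modelled in user space to check that it balances: the perf fd holds the initial reference, each VMA holds one, and the stream is released only when the last reference is dropped. The `model_*` names are hypothetical. That the mmap handler itself takes the first VMA reference is an assumption (the handler is not quoted here); it is needed because `vm_close` runs for every VMA, including the original, while `vm_open` runs only when a VMA is duplicated or split.

```c
#include <stdatomic.h>

/* Minimal kref-like model of the stream lifetime. */
struct model_stream {
    atomic_int refcount;
    int released;  /* counts release events, for checking */
};

static void model_stream_init(struct model_stream *s)
{
    /* The perf fd holds the initial reference. */
    atomic_init(&s->refcount, 1);
    s->released = 0;
}

static void model_stream_get(struct model_stream *s)
{
    atomic_fetch_add_explicit(&s->refcount, 1, memory_order_relaxed);
}

static void model_stream_put(struct model_stream *s)
{
    /* fetch_sub returns the old value: 1 means this was the last ref. */
    if (atomic_fetch_sub_explicit(&s->refcount, 1, memory_order_acq_rel) == 1)
        s->released++;  /* stands in for freeing the stream */
}

/* vm_open runs on VMA split/copy; vm_close runs for every VMA. */
static void model_vm_open(struct model_stream *s)  { model_stream_get(s); }
static void model_vm_close(struct model_stream *s) { model_stream_put(s); }
```

For example: mmap (get), a partial munmap that splits the VMA (vm_open), then close(perf_fd) (put) leaves two references held by the mappings; the stream is released only after both VMAs are closed.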
> 
> I see that do_mmap()->mmap_region() takes a reference to the file:
> 
> vma->vm_file = get_file(file);

By vma, I meant stream->oa_buffer.vma

> In our case this is perf_fd. do_munmap does a corresponding fput.
> 
> So unmap_mapping_range() is never called unless the user calls both
> munmap() and close(perf_fd) (or the process terminates).

The unmap_mapping_range() is called by destroy, so that should be
independent of userspace munmap.

> Is that enough to take care of this ordering?
> 
> This also explains why I cannot get a VM_FAULT_SIGBUS with the IGTs.

To hit the stream->closed check, you'll need to race mmap/munmap against
perf destroy. You may like to challenge yourself to hit it :)

> >> >> +       mutex_lock(&perf->lock);
> >> >> +       closed = READ_ONCE(stream->closed);
> >> >> +       mutex_unlock(&perf->lock);
> >> >
> >> >We do WRITE_ONCE(stream->closed, true) then invalidate all the mappings,
> >> >so that part looks good. The invalidate is serialised with the
> >> >vm_fault_oa, so we can just use a plain READ_ONCE(stream->closed) here
> >> >and not worry about the perf->lock.
> >>
> >> will do
> >> >
> > >> >However... I think it should close and invalidate before releasing
> > >> >stream->oa_buffer.
> >>
> >> will do
> >> >
> >> >And the read here of stream->oa_buffer should be after checking
> >> >stream->closed.
> >>
> >> I don't understand. I am checking for closed before remap_io_sg.
> >
> >It's the
> >
> >struct drm_i915_gem_object *obj = stream->oa_buffer.vma->obj;
> >
> >that's before the stream->closed check. That's dereferencing vma, but vma
> >will be set to NULL in i915_perf_destroy.
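The ordering described above (set `closed`, invalidate the mappings, then release `oa_buffer`; in the fault path, test `closed` before touching `oa_buffer`) can be shown with a single-threaded user-space model. The `model_*` names are hypothetical and the zap is only a placeholder. In the driver, per the review, it is the serialisation of the invalidate against vm_fault_oa() that makes the unlocked check sufficient; this model does not reproduce that concurrency.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_FAULT_OK     0
#define MODEL_FAULT_SIGBUS 1

struct model_oa_stream {
    atomic_bool closed;
    const uint32_t *oa_buffer;  /* NULL once released */
    size_t n_dwords;
};

/* Fault path: check closed first, and only then dereference oa_buffer. */
static int model_fault(struct model_oa_stream *s, size_t idx, uint32_t *out)
{
    /* Relaxed load, in the spirit of READ_ONCE(stream->closed). */
    if (atomic_load_explicit(&s->closed, memory_order_relaxed))
        return MODEL_FAULT_SIGBUS;
    if (idx >= s->n_dwords)
        return MODEL_FAULT_SIGBUS;
    *out = s->oa_buffer[idx];
    return MODEL_FAULT_OK;
}

/* Placeholder for unmap_mapping_range() on the perf file's mapping. */
static void model_zap_mappings(struct model_oa_stream *s) { (void)s; }

/* Destroy path: mark closed, invalidate mappings, then release the buffer. */
static void model_destroy(struct model_oa_stream *s)
{
    atomic_store_explicit(&s->closed, true, memory_order_relaxed);
    model_zap_mappings(s);
    s->oa_buffer = NULL;
    s->n_dwords = 0;
}
```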
> 
> I will not use stream->oa_buffer.vma->obj in vm_fault_oa based on your 
> earlier comments, so this should be taken care of.

Aye.
-Chris