[Intel-gfx] [PATCH v2 1/1] i915: additional GEM documentation

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Tue Mar 27 10:51:02 UTC 2018


On 02/03/2018 14:09, kevin.rogovin at intel.com wrote:
> From: Kevin Rogovin <kevin.rogovin at intel.com>
> 
> This patch provides additional overview documentation to the
> i915 kernel driver GEM. In addition, it presents already written
> documentation to i915.rst as well.
> 
> Signed-off-by: Kevin Rogovin <kevin.rogovin at intel.com>
> ---
>   Documentation/gpu/i915.rst                 | 194 +++++++++++++++++++++++------
>   drivers/gpu/drm/i915/i915_gem_execbuffer.c |   3 +-
>   drivers/gpu/drm/i915/i915_vma.h            |  11 +-
>   drivers/gpu/drm/i915/intel_lrc.c           |   3 +-
>   drivers/gpu/drm/i915/intel_ringbuffer.h    |  64 ++++++++++
>   5 files changed, 235 insertions(+), 40 deletions(-)
> 
> diff --git a/Documentation/gpu/i915.rst b/Documentation/gpu/i915.rst
> index 41dc881b00dc..cd23da2793ec 100644
> --- a/Documentation/gpu/i915.rst
> +++ b/Documentation/gpu/i915.rst
> @@ -13,6 +13,18 @@ Core Driver Infrastructure
>   This section covers core driver infrastructure used by both the display
>   and the GEM parts of the driver.
>   
> +Initialization
> +--------------
> +
> +The real action of initialization for the i915 driver is handled by
> +:c:func:`i915_driver_load`; from this function one can see the key
> +data (in paritcular :c:struct:'drm_driver' for GEM) of the entry points

particular

> +to the driver from user space.
> +
> +.. kernel-doc:: drivers/gpu/drm/i915/i915_drv.c
> +   :functions: i915_driver_load
> +
> +
>   Runtime Power Management
>   ------------------------
>   
> @@ -243,32 +255,148 @@ Display PLLs
>   .. kernel-doc:: drivers/gpu/drm/i915/intel_dpll_mgr.h
>      :internal:
>   
> -Memory Management and Command Submission
> -========================================
> +GEM: Memory Management and Command Submission
> +=============================================
>   
>   This sections covers all things related to the GEM implementation in the
>   i915 driver.
>   
> -Batchbuffer Parsing
> --------------------
> +Intel GPU Basics
> +----------------
>   
> -.. kernel-doc:: drivers/gpu/drm/i915/i915_cmd_parser.c
> -   :doc: batch buffer command parser
> +An Intel GPU has multiple engines. There are several engine types.
> +The user-space value `I915_EXEC_DEFAULT` is an alias to the user
> +space value `I915_EXEC_RENDER`.
> +
> +- RCS engine is for rendering 3D and performing compute, this is named `I915_EXEC_RENDER` in user space.
> +- BCS is a blitting (copy) engine, this is named `I915_EXEC_BLT` in user space.
> +- VCS is a video encode and decode engine, this is named `I915_EXEC_BSD` in user space
> +- VECS is video enhancement engine, this is named `I915_EXEC_VEBOX` in user space.
> +
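
Might also be worth a two-line user space illustration of how the engine is
picked, e.g. (just a sketch; execbuf here is assumed to be a
struct drm_i915_gem_execbuffer2 from include/uapi/drm/i915_drm.h):

    execbuf.flags &= ~I915_EXEC_RING_MASK;
    execbuf.flags |= I915_EXEC_BLT;  /* BCS; or I915_EXEC_RENDER, I915_EXEC_BSD,
                                        I915_EXEC_VEBOX */
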
> +The Intel GPU family is a familiy of integrated GPU's using Unified Memory

family

> +Access. For having the GPU "do work", user space will feed the GPU batch buffers
> +via one of the ioctls `DRM_IOCTL_I915_GEM_EXECBUFFER2` or
> +`DRM_IOCTL_I915_GEM_EXECBUFFER2_WR` (the ioctl `DRM_IOCTL_I915_GEM_EXECBUFFER`

It is actually the same ioctl from the i915 point of view. It used to be 
read-only (again from the i915 point of view) and when fences were added 
(AFAIR) it needed to be made writable as well.

> +is deprecated). Most such batchbuffers will instruct the GPU to perform work
> +(for example rendering) and that work needs memory from which to read and memory
> +to which to write. All memory is encapsulated within GEM buffer objects (usually
> +created with the ioctl `DRM_IOCTL_I915_GEM_CREATE`). An ioctl providing a batchbuffer
> +for the GPU to create will also list all GEM buffer objects that the batchbuffer

I think you meant "to execute", not "to create"?

> +reads and/or writes. For implementation details of memory management see
> +`GEM BO Management Implementation Details`_.
> +
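
A short user space sketch might help the flow here - creating a BO and
listing it in the execbuf call (untested and simplified; fd, batch_handle and
batch_size are placeholders, drmIoctl is the libdrm helper):

    struct drm_i915_gem_create create = { .size = 4096 };
    struct drm_i915_gem_exec_object2 obj[2];
    struct drm_i915_gem_execbuffer2 execbuf;

    /* Create a GEM buffer object; the kernel hands back a handle. */
    drmIoctl(fd, DRM_IOCTL_I915_GEM_CREATE, &create);

    memset(obj, 0, sizeof(obj));
    obj[0].handle = create.handle;  /* a buffer the batch reads/writes     */
    obj[1].handle = batch_handle;   /* the batchbuffer itself, listed last */

    memset(&execbuf, 0, sizeof(execbuf));
    execbuf.buffers_ptr  = (uintptr_t)obj;
    execbuf.buffer_count = 2;
    execbuf.batch_len    = batch_size;
    drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);
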
> +A GPU pipeline (most strongly so for the RCS engine) has a great deal of state
> +which is to be programmed by user space via the contents of a batchbuffer. Starting
> +in Gen6 (SandyBridge), hardware contexts are supported. A hardware context
> +encapsulates GPU pipeline state and other portions of GPU state and it is much more
> +efficient for the GPU to load a hardware context instead of re-submitting commands
> +in a batchbuffer to the GPU to restore state. In addition, using hardware contexts
> +provides much better isolation between user space clients. The ioctl
> +`DRM_IOCTL_I915_GEM_CONTEXT_CREATE` is used by user space to create a hardware context
> +which is identified by a 32-bit integer. The non-deprecated ioctls to submit batchbuffer
> +work can pass that ID (in the lower bits of drm_i915_gem_execbuffer2::rsvd1) to
> +identify what HW context to use with the command. When the kernel submits the
> +batchbuffer to be executed by the GPU it will also instruct the GPU to load the HW
> +context prior to executing the contents of a batchbuffer.
> +
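
Perhaps worth showing the two ioctls together here (sketch only, reusing the
execbuf from above):

    struct drm_i915_gem_context_create ctx_create = { 0 };

    drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE, &ctx_create);
    /* ctx_create.ctx_id now names the new HW context... */

    /* ...and is passed in the lower bits of rsvd1 on submission: */
    i915_execbuffer2_set_context_id(execbuf, ctx_create.ctx_id);
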
> +The GPU has its own memory management and address space. The kernel driver
> +maintains the memory translation table for the GPU. For older GPUs (i.e. those
> +before Gen8), there is a single global such translation table, a global
> +Graphics Translation Table (GTT). For newer generation GPUs each hardware
> +context has its own translation table, called Per-Process Graphics Translation
> +Table (PPGTT). Of important note is that although PPGTT is named per-process it
> +is actually per hardware context. When user space submits a batchbuffer, the kernel
> +walks the list of GEM buffer objects used by the batchbuffer and guarantees
> +that not only is the memory of each such GEM buffer object resident but it is
> +also present in the (PP)GTT. If the GEM buffer object is not yet placed in
> +the (PP)GTT, then it is given an address. Two consequences of this are:

Maybe expand on this - an object can be moved even if it is already 
present in the (PP)GTT under certain circumstances.

> +the kernel needs to edit the batchbuffer submitted to write the correct
> +value of the GPU address when a GEM BO is assigned a GPU address and
> +the kernel might evict a different GEM BO from the (PP)GTT to make address
> +room for a GEM BO.
> +
> +Consequently, the ioctls submitting a batchbuffer for execution also include
> +a list of all locations within buffers that refer to GPU-addresses so that the
> +kernel can edit the buffer correctly. This process is dubbed relocation. The
> +ioctls allow user space to provide what the GPU address could be. If the kernel
> +sees that the address provided by user space is correct, then it skips performing
> +relocation for that GEM buffer object. In addition, the kernel provides to what
> +addresses the kernel relocates each GEM buffer object.

Maybe clarify what you mean by "kernel provides" - kernel copies back to 
userspace the graphics virtual addresses of each buffer object?

> +
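
Since this is the trickiest part of the uapi, a tiny example of a relocation
entry could be useful (sketch; obj, target_bo_handle and last_known_addr are
made-up names):

    struct drm_i915_gem_relocation_entry reloc;

    memset(&reloc, 0, sizeof(reloc));
    reloc.target_handle   = target_bo_handle; /* BO whose GPU address is needed */
    reloc.offset          = 128;              /* where in the batch to patch it */
    reloc.delta           = 0;                /* added to the BO's GPU address  */
    reloc.presumed_offset = last_known_addr;  /* if this guess is still correct,
                                                 the kernel skips the rewrite   */

    obj[1].relocs_ptr       = (uintptr_t)&reloc; /* attached to the batch object */
    obj[1].relocation_count = 1;
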
> +There is also an interface for user space to directly specify the address location
> +of GEM BO's, the feature soft-pinning and made active within an execbuffer2 ioctl

the feature "is called" soft-pinning?

> +with `EXEC_OBJECT_PINNED` bit up. If user-space also specifies `I915_EXEC_NO_RELOC`,

Suggest to split paragraphs for soft-pin and NO_RELOC.

"bit up" = "bit set" ? Or "flag set" even better I think.

> +then the kernel is to not execute any relocation and user-space manages the address
> +space for its PPGTT itself. The advantage of user space handling address space is
> +that then the kernel does far less work and user space can safely assume that
> +GEM buffer object's location in GPU address space do not change.
> +
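
Could be accompanied by something like (sketch, the address is arbitrary):

    /* Soft-pinning: user space chooses the GPU virtual address itself. */
    obj[0].offset = 0x100000;
    obj[0].flags |= EXEC_OBJECT_PINNED;

    /* With NO_RELOC on top, the kernel performs no relocation at all. */
    execbuf.flags |= I915_EXEC_NO_RELOC;
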
> +GEM BO Management Implementation Details
> +----------------------------------------
>   
> -.. kernel-doc:: drivers/gpu/drm/i915/i915_cmd_parser.c
> +.. kernel-doc:: drivers/gpu/drm/i915/i915_vma.h
> +   :doc: Virtual Memory Address
> +
> +Buffer Object Eviction
> +~~~~~~~~~~~~~~~~~~~~~~
> +
> +This section documents the interface functions for evicting buffer
> +objects to make space available in the virtual gpu address spaces. Note

Suggest to use upper case GPU for consistency with the text so far.

> +that this is mostly orthogonal to shrinking buffer objects caches, which
> +has the goal to make main memory (shared with the gpu through the
> +unified memory architecture) available.

I think a more customary way of saying "to make main memory available" 
would be "to free up system memory when under memory pressure"?

A native English speaker could say whether "goal of making" would be 
better than "goal to make" in this context.

> +
> +.. kernel-doc:: drivers/gpu/drm/i915/i915_gem_evict.c
>      :internal:
>   
> -Batchbuffer Pools
> ------------------
> +Buffer Object Memory Shrinking
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>   
> -.. kernel-doc:: drivers/gpu/drm/i915/i915_gem_batch_pool.c
> -   :doc: batch pool
> +This section documents the interface function for shrinking memory usage

functions plural I think.

> +of buffer object caches. Shrinking is used to make main memory
> +available. Note that this is mostly orthogonal to evicting buffer
> +objects, which has the goal to make space in gpu virtual address spaces.
>   
> -.. kernel-doc:: drivers/gpu/drm/i915/i915_gem_batch_pool.c
> +.. kernel-doc:: drivers/gpu/drm/i915/i915_gem_shrinker.c
>      :internal:
>   
> +
> +Batchbuffer Submission
> +----------------------
> +
> +Depending on GPU generation, the i915 kernel driver will submit batchbuffers
> +in one of several ways. However, the top-level logic is shared for all
> +methods, see `Common: At the bottom`_ and `Common: Processing requests`_
> +for details. In addition, the kernel may filter the contents of user space
> +provided batchbuffers. To that end the i915 driver has a
> +`Command Buffer Parser`_ and a pool from which to allocate buffers to place
> +filtered user space batchbuffers, see section `Batchbuffer Pools`_.
> +
> +Common: At the bottom
> +~~~~~~~~~~~~~~~~~~~~~
> +
> +.. kernel-doc:: drivers/gpu/drm/i915/intel_ringbuffer.h
> +   :doc: Ringbuffers to submit batchbuffers
> +
> +Common: Processing requests
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +.. kernel-doc:: drivers/gpu/drm/i915/i915_gem_execbuffer.c
> +   :doc: User command execution
> +
> +Batchbuffer Submission Varieties
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +.. kernel-doc:: drivers/gpu/drm/i915/intel_ringbuffer.h
> +   :doc: Batchbuffer Submission Backend
> +
> +The two varieties for submitting batchbuffer to the GPU are the following.
> +
> +1. Batchbuffers are submitted directly to a ring buffer; this is the most basic way to submit batchbuffers to the GPU and is for generations strictly before Gen8.
> +2. Batchbuffer are submitting via execlists are a features supported by Gen8 and new devices; the macro :c:macro:'HAS_EXECLISTS'

singular vs plural - hm actually the whole sentence needs rework - 
"Batchbuffer submission via execlists is a feature supported by Gen8 and 
newer devices" ?

> is used to determine if a GPU supports submitting via execlists, see
> `Logical Rings, Logical Ring Contexts and Execlists`_.
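
For the HAS_EXECLISTS part, maybe a one-liner showing the shape of the check
would help (the helper names below are invented, the real backend selection is
spread across the engine setup code):

    if (HAS_EXECLISTS(dev_priv))
            setup_execlists_submission(engine);    /* Gen8 and newer    */
    else
            setup_legacy_ring_submission(engine);  /* ringbuffer, <Gen8 */
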
> +
>   Logical Rings, Logical Ring Contexts and Execlists
> ---------------------------------------------------
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>   
>   .. kernel-doc:: drivers/gpu/drm/i915/intel_lrc.c
>      :doc: Logical Rings, Logical Ring Contexts and Execlists
> @@ -276,6 +404,24 @@ Logical Rings, Logical Ring Contexts and Execlists
>   .. kernel-doc:: drivers/gpu/drm/i915/intel_lrc.c
>      :internal:
>   
> +Command Buffer Parser
> +---------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/i915/i915_cmd_parser.c
> +   :doc: batch buffer command parser
> +
> +.. kernel-doc:: drivers/gpu/drm/i915/i915_cmd_parser.c
> +   :internal:
> +
> +Batchbuffer Pools
> +-----------------
> +
> +.. kernel-doc:: drivers/gpu/drm/i915/i915_gem_batch_pool.c
> +   :doc: batch pool
> +
> +.. kernel-doc:: drivers/gpu/drm/i915/i915_gem_batch_pool.c
> +   :internal:
> +
>   Global GTT views
>   ----------------
>   
> @@ -312,28 +458,6 @@ Object Tiling IOCTLs
>   .. kernel-doc:: drivers/gpu/drm/i915/i915_gem_tiling.c
>      :doc: buffer object tiling
>   
> -Buffer Object Eviction
> -----------------------
> -
> -This section documents the interface functions for evicting buffer
> -objects to make space available in the virtual gpu address spaces. Note
> -that this is mostly orthogonal to shrinking buffer objects caches, which
> -has the goal to make main memory (shared with the gpu through the

Ah you are only moving this text.. feel free to keep it as is then.

> -unified memory architecture) available.
> -
> -.. kernel-doc:: drivers/gpu/drm/i915/i915_gem_evict.c
> -   :internal:
> -
> -Buffer Object Memory Shrinking
> -------------------------------
> -
> -This section documents the interface function for shrinking memory usage
> -of buffer object caches. Shrinking is used to make main memory
> -available. Note that this is mostly orthogonal to evicting buffer
> -objects, which has the goal to make space in gpu virtual address spaces.
> -
> -.. kernel-doc:: drivers/gpu/drm/i915/i915_gem_shrinker.c
> -   :internal:
>   
>   GuC
>   ===
> diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> index 8c170db8495d..6c8b8e2041f1 100644
> --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> @@ -81,7 +81,8 @@ enum {
>    * but this remains just a hint as the kernel may choose a new location for
>    * any object in the future.
>    *
> - * Processing an execbuf ioctl is conceptually split up into a few phases.
> + * Processing an execbuf ioctl is handled by i915_gem_do_execbuffer() which
> + * conceptually splits up processing of an execbuf ioctl into a few phases.
>    *
>    * 1. Validation - Ensure all the pointers, handles and flags are valid.
>    * 2. Reservation - Assign GPU address space for every object
> diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h
> index 8c5022095418..d0feb4f9e326 100644
> --- a/drivers/gpu/drm/i915/i915_vma.h
> +++ b/drivers/gpu/drm/i915/i915_vma.h
> @@ -38,13 +38,18 @@
>   enum i915_cache_level;
>   
>   /**
> - * A VMA represents a GEM BO that is bound into an address space. Therefore, a
> - * VMA's presence cannot be guaranteed before binding, or after unbinding the
> - * object into/from the address space.
> + * DOC: Virtual Memory Address
> + *
> + * An `i915_vma` struct represents a GEM BO that is bound into an address
> + * space. Therefore, a VMA's presence cannot be guaranteed before binding, or
> + * after unbinding the object into/from the address space. The struct includes
> + * the bookkeeping details needed for tracking it in all the lists with which
> + * it interacts.
>    *
>    * To make things as simple as possible (ie. no refcounting), a VMA's lifetime
>    * will always be <= an objects lifetime. So object refcounting should cover us.
>    */
> +
>   struct i915_vma {
>   	struct drm_mm_node node;
>   	struct drm_i915_gem_object *obj;
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 14288743909f..bc4943333090 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -34,7 +34,8 @@
>    * Motivation:
>    * GEN8 brings an expansion of the HW contexts: "Logical Ring Contexts".
>    * These expanded contexts enable a number of new abilities, especially
> - * "Execlists" (also implemented in this file).
> + * "Execlists" (also implemented in this file,
> + * drivers/gpu/drm/i915/intel_lrc.c).

Why is self-reference to filename required?

>    *
>    * One of the main differences with the legacy HW contexts is that logical
>    * ring contexts incorporate many more things to the context's state, like
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> index bbacf4d0f4cb..390f63479565 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> @@ -300,6 +300,70 @@ struct intel_engine_execlists {
>   
>   #define INTEL_ENGINE_CS_MAX_NAME 8
>   
> +/**
> + * DOC: Ringbuffers to submit batchbuffers
> + *
> + * At the lowest level, submitting work to a GPU engine is to add commands to
> + * a ringbuffer. A ringbuffer in the kernel driver is essentially a location
> + * from which the GPU reads its next command. To avoid copying the contents
> + * of a batchbuffer in order to submit it, the GPU has native hardware support
> + * to perform commands specified in another buffer; the command to do so is
> + * a batchbuffer start and the i915 kernel driver uses this to avoid copying
> + * batchbuffers to the ringbuffer. At the very bottom of the stack, the i915
> + * adds the following to a ringbuffer to submit a batchbuffer to the GPU.
> + *
> + * 1. Add a batchbuffer start command to the ringbuffer.
> + *      The start command is essentially a token together with the GPU
> + *      address of the batchbuffer to be executed
> + *
> + * 2. Add a pipeline flush to the ring buffer.
> + *      This is accomplished by the function pointer

Full stops to end the above two.

> + *
> + * 3. Add a register write command to the ring buffer.

This is a memory address write.

> + *      This register write writes the request ID,

Maybe request sequence number? We don't use the term request ID 
elsewhere I think.

> + *      ``i915_request::global_seqno``; the i915 kernel driver uses
> + *      the value in the register to know what requests are completed.
> + *
> + * 4. Add a user interrupt command to the ringbuffer.
> + *      This command instructs the GPU to issue an interrupt
> + *      when the command (and pipeline flush) are completed.
> + */
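
The four steps might be easier to digest next to a compressed sketch of what
ends up in the ring for one request (heavily simplified - flags, engine
differences and the exact flush command are glossed over, and the addresses
are placeholders):

    u32 *cs = intel_ring_begin(rq, 8);  /* reserve space in the ring           */

    *cs++ = MI_BATCH_BUFFER_START;      /* 1. jump into the batchbuffer        */
    *cs++ = batch_gpu_address;
    *cs++ = MI_FLUSH_DW;                /* 2. pipeline flush (engine specific) */
    *cs++ = MI_STORE_DWORD_IMM_GEN4;    /* 3. write the seqno ...              */
    *cs++ = seqno_gpu_address;          /*    ... to a known memory location   */
    *cs++ = rq->global_seqno;
    *cs++ = MI_USER_INTERRUPT;          /* 4. raise an interrupt when done     */
    *cs++ = MI_NOOP;                    /*    pad to an even number of dwords  */

    intel_ring_advance(rq, cs);
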
> +
> +/**
> + * DOC: Batchbuffer Submission Backend
> + *
> + * The core logic of submitting a batchbuffer for the GPU to execute
> + * is shared across all engines for all GPU generations. Through the use
> + * of function pointers, we can customize submission to different GPU
> + * capabilities. The struct ``intel_engine_cs`` has the following member
> + * function pointers for the following purposes in the scope of batchbuffer
> + * submission.
> + *
> + * - context_pin
> + *     pins the context and also returns to  what ``intel_ringbuffer``
> + *     to write to submit a batchbuffer.

Uppercase letter to start since there's a full stop and to be consistent 
with the previous topic.

> + *
> + * - request_alloc
> + *     is used to reserve space in an ``intel_ringbuffer``
> + *     for submitting a batchbuffer to the GPU.

More correct would be to say:

Backend specific portion of request allocation - for instance waiting on 
available space in an ``intel_ringbuffer``...

> + *
> + * - emit_flush
> + *     writes a pipeline flush command to the ring buffer.
> + *
> + * - emit_bb_start
> + *     writes the batchbuffer start command to the ring buffer.
> + *
> + * - emit_breadcrumb
> + *     writes to the ring buffer both the regiser write of the

register

> + *     request ID (`i915_request::global_seqno`) and the command to
> + *     issue an interrupt.
> + *
> + * - submit_request
> + *     See the comment on this member in ``intel_engine_cs``, declared
> + *     in intel_ringbuffer.h.
> + *
> + */
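
A rough calling order might tie these together nicely (very much simplified,
locking and error handling omitted, ctx/batch_addr/batch_len/dispatch_flags
are placeholders; the real flow goes through i915_gem_do_execbuffer()):

    rq = i915_request_alloc(engine, ctx);     /* pins the context (->context_pin)
                                                 and reserves ring space
                                                 (->request_alloc)               */
    engine->emit_flush(rq, EMIT_INVALIDATE);  /* pipeline flush                  */
    engine->emit_bb_start(rq, batch_addr, batch_len, dispatch_flags);
    i915_request_add(rq);                     /* ->emit_breadcrumb, then
                                                 ->submit_request queues it      */
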
> +
>   struct intel_engine_cs {
>   	struct drm_i915_private *i915;
>   	char name[INTEL_ENGINE_CS_MAX_NAME];
> 

Overall I think this is useful - a good balance of not going so deep as 
to make maintenance of it too difficult.

Regards,

Tvrtko

