[Freedreno] MSM-DRM: Help in understanding the role of relocs in command submission.
Aravind Ganesan
aravindg at codeaurora.org
Thu Feb 27 18:53:32 PST 2014
Now that I've had a chance to digest the information, I take it that the
main reason for relocs is to give user-space applications the ability to
operate without knowledge of gpu addresses (operating only with bo
handles). Thanks again for the detailed explanation.
Aravind
-----Original Message-----
From: Rob Clark [mailto:robdclark at gmail.com]
Sent: Thursday, February 27, 2014 12:29 PM
To: Aravind Ganesan
Cc: freedreno at lists.freedesktop.org
Subject: Re: [Freedreno] MSM-DRM: Help in understanding the role of relocs in
command submission.
On Thu, Feb 27, 2014 at 1:14 PM, Aravind Ganesan <aravindg at codeaurora.org>
wrote:
> Hi Guys,
>
> I'm trying to understand why we need relocs while
> submitting commands and what the shift and offset represent. I
> couldn't find any explanation for this other than the comment in
> msm_drm.h and some intel specific comments in
> http://lwn.net/Articles/283798/. Can anyone clarify this or point me to
> some better resources?
It might be useful to compare to the kgsl backend in libdrm, since that is
doing the equivalent thing with the kgsl kernel interface, which you may
already be familiar with:
-------
static void kgsl_ringbuffer_emit_reloc(struct fd_ringbuffer *ring,
		const struct fd_reloc *r)
{
	struct kgsl_bo *kgsl_bo = to_kgsl_bo(r->bo);

	/* with kgsl, userspace knows the bo's gpu address directly: */
	uint32_t addr = kgsl_bo_gpuaddr(kgsl_bo, r->offset);
	assert(addr);

	/* negative shift means shift right, positive means shift left: */
	if (r->shift < 0)
		addr >>= -r->shift;
	else
		addr <<= r->shift;

	/* OR in any low-bit flags and write the value into the cmdstream: */
	(*ring->cur++) = addr | r->or;

	kgsl_pipe_add_submit(to_kgsl_pipe(ring->pipe), kgsl_bo);
}
-------
Basically, for msm drm, that address calculation moves to the kernel.
Userspace puts what it *assumes* is the correct address, but that is just a
sort of optimization to avoid cmdstream patching in the kernel in the common
case, so you can ignore that.
So, to answer one part of your question, the value that ends up in the
cmdstream that the gpu sees is:
((bo->gpuaddr + offset) >> shift) | or
That lets us accommodate the various ways that a gpu addr ends up in the
cmdstream. Ie. there are a handful of places where it is left or right
shifted by a few bits, or has some other flags OR'd in the low bits (which
would otherwise always be zero), etc.
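For concreteness, here is roughly what that patching looks like on the
kernel side, in terms of the reloc entries defined in msm_drm.h. This is
just a minimal sketch (the real submit path also resolves the bo's iova
from the submit's bo table and bounds-checks everything):
-------
/* minimal sketch; struct drm_msm_gem_submit_reloc is from msm_drm.h,
 * and 'gpuaddr' is the iova the kernel resolved for bos[r->reloc_idx]:
 */
static uint32_t patch_reloc(uint64_t gpuaddr,
		const struct drm_msm_gem_submit_reloc *r)
{
	uint64_t addr = gpuaddr + r->reloc_offset;

	if (r->shift < 0)
		addr >>= -r->shift;
	else
		addr <<= r->shift;

	/* ie. ((bo->gpuaddr + offset) >> shift) | or */
	return (uint32_t)addr | r->or;
}
-------
The kernel writes that value into the cmdstream at r->submit_offset, and
only needs to bother when it differs from what userspace presumed.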
But I'm guessing the other part of the question is "why relocs?". The short
version is that it gives the kernel more information for memory management
and gives it more room to play some nice tricks:
1) kernel knows *all* bo's referenced in cmdstream.. kernel is then able to
hold an extra reference to buffers referenced by in-flight submits (see the
submit sketch after this list). Userspace can always immediately free a
buffer without waiting (a *very* common pattern for x11 pixmaps,
vertex/texture upload buffers, etc), without any free_at_timestamp type
ioctl. And cleanup for a crashed process does not cause any GPU fault.
Also, since the kernel knows when a bo is referenced (for read and/or write
access by gpu) it can implement fence stuff properly. Yes, you can do the
fencing other ways.. but this approach doesn't have to worry about userspace
forgetting to tell the kernel about some buffer or another.
2) kernel can defer mapping (or possibly even allocating pages) for a buffer
until needed.. mapping to IOMMU is relatively quick[1], and not every
buffer allocated needs to be mapped to every piece of hw (ie. if buffer is
only used for scanout, or (hypothetically) only used w/ 2d core, etc). There
certainly are places in the graphics/UI stack where buffers/textures/etc get
allocated because they *might* be used.
[1] the slow thing with mapping/unmapping appears to be TLB flush..
with some improvement to the linux iommu interface to add an explicit flush
operation, and iommu_{map,unmap}_unflushed(), we could batch up mappings for
buffers, and map all the unmapped buffers at the time of submit ioctl
(rather than for each allocate ioctl), as sketched below.
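A rough sketch of what that batching could look like. To be clear, this is
purely hypothetical: iommu_map_unflushed() and the explicit flush are the
proposed API additions, not something that exists, and the driver-side
names are made up for illustration:
-------
/* hypothetical: batch up iommu mappings and pay the TLB flush once
 * per submit instead of once per allocation.  iommu_map_unflushed()
 * and iommu_flush() are the proposed API; the msm_gpu fields and
 * pending-mappings list are made up for illustration:
 */
static void map_pending_bos(struct msm_gpu *gpu)
{
	struct msm_gem_object *obj;

	list_for_each_entry(obj, &gpu->pending_mappings, mm_list)
		iommu_map_unflushed(gpu->domain, obj->iova, obj->paddr,
				obj->size, IOMMU_READ | IOMMU_WRITE);

	/* one TLB flush for the whole batch, at submit ioctl time: */
	iommu_flush(gpu->domain);
}
-------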
3) I do have one device without a working IOMMU, so I use a physically
contiguous VRAM carveout. But due to CMA vs highmem lolz (at least in the
3.4 kernel) I end up needing the entire VRAM carveout in lowmem,
which limits it to ~384MiB. This is not enough to, for example, run
gnome-shell and xonotic at the same time. If you have a swap file, and
userspace is managing buffers via handle rather than gpu addr, the kernel
could in theory swap out unused buffers, and later swap them back in at a
different address, without confusing userspace. Yeah, swapping is going to
suck for performance. But it would be mostly swapping gnome-shell's buffers
and other window pixmaps, which are not needed when the game is running
fullscreen.
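Here is the submit sketch promised in point 1, showing how userspace
describes every referenced buffer by handle, using the structs from
msm_drm.h (the two buffers and the helper are made up for illustration):
-------
/* sketch: every bo the cmdstream touches gets an entry, with read
 * and/or write flags; 'presumed' is userspace's guess at the gpu
 * address (the optimization mentioned above).  Structs/flags are
 * from msm_drm.h, everything else is illustrative:
 */
static void fill_submit(uint32_t vertex_bo, uint32_t render_target,
		struct drm_msm_gem_submit_bo bos[2],
		struct drm_msm_gem_submit *req)
{
	bos[0].handle   = vertex_bo;          /* GEM handle, not gpu addr */
	bos[0].flags    = MSM_SUBMIT_BO_READ; /* gpu only reads this one */
	bos[0].presumed = 0;

	bos[1].handle   = render_target;      /* gpu reads and writes it */
	bos[1].flags    = MSM_SUBMIT_BO_READ | MSM_SUBMIT_BO_WRITE;
	bos[1].presumed = 0;

	req->pipe   = MSM_PIPE_3D0;
	req->nr_bos = 2;
	req->bos    = (uint64_t)(uintptr_t)bos;
	/* req->cmds/nr_cmds point at the cmdstream(s) + reloc tables */
}
-------
Because that list is complete, the kernel can hold a reference on each bo
for as long as the submit is in flight, and knows whether to attach read or
write fences.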
----
Managing gpu buffers by handle also enables some things that might be useful
some day. For example, older snapdragon stuff (I don't think I've seen this
since a2xx days) potentially had fast stacked memory.
With the kernel managing buffer addresses we could hypothetically do some
things like move frequently used buffers into fast memory, transparently to
userspace. This case is a bit similar to VRAM in a desktop GPU. Maybe this
sort of arrangement will not come back. No idea if qcom has plans in this
dept... I do know other SoC makers have at least kicked around the idea
of non-uniform memory (it makes a lot of sense.. GPU and CPUs need
different performance characteristics out of memory).
BR,
-R
> Thanks,
>
> Aravind
>
>
> _______________________________________________
> Freedreno mailing list
> Freedreno at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/freedreno
>