[Intel-gfx] [PATCH 21/21] drm/i915: Introduce vmap (mapping of user pages into video memory) ioctl

Daniel Vetter daniel at ffwll.ch
Mon Apr 18 16:58:17 CEST 2011


Hi Chris,

First things first, vmap scares me. Its scariness is threefold:
- We haven't really used snooped memory seriously anywhere. And every time
  we newly use hw features, stuff tends to blow up in hilarious ways.
- Using snooped memory is rather different than using uc gtt mappings and
  has the potential for tons of funny api issues.
- With vmap gem is suddenly no longer in full control of its backing
  storage - imo a rather fundamental change.

With that admission out of the way follows the hopefully more rational
part ;-) First I look a bit at the use case for vmap and the problems (as
I see them) it tries to solve. Then comes fear rationalization in the form
of issues I see concerning the above three points. At the end come some
ideas on how to proceed.

I haven't yet looked in-depth into the code, so there's a decent chance
I've got some misconceptions. Please correct me.

Usage
-----

... to tightly integrate gpu rendering with rendering on the cpu, in
whatever form. Your current code uses this for really fast X fallbacks
(optimized with damage tracking to avoid unnecessary syncing), but I think
it might also prove useful for vbo upload, especially with swtnl. In
short: transfers.

There are two ways to further differentiate this: One is whether the
driver can allocate the storage for the cpu renderer or whether that is
handed down to it as fixed (X shm). The other is whether the transfer is a
oneshot or can reasonably be expected to be reused many times (and
hence the setup cost amortized).

vmap is squarely designed for the case of externally fixed memory which
will be reused. It obviously subsumes the driver-provided memory case (for
not-oneshot stuff). In your X rework you also use it for small oneshot
stuff by amalgamating many smaller transfers into larger vmaps.

Other contenders in the transfer business are:
- uc gtt maps, which just suck (besides oneshot uploads of giant stuff)
- and pwrite/pread, which due to their abi design are unusable for 2d/tiled
  objects.
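
For reference, the current pwrite abi (as defined in i915_drm.h) only
describes a linear byte range into the bo, so there is simply no field to
express a stride or tile layout:

    /* existing ioctl argument struct, shown to illustrate the abi gap:
     * everything is a flat (offset, size) range into the object */
    struct drm_i915_gem_pwrite {
            __u32 handle;   /* bo to write into */
            __u32 pad;
            __u64 offset;   /* linear byte offset into the bo */
            __u64 size;     /* number of bytes to write */
            __u64 data_ptr; /* user pointer to the source data */
    };

A blt-based variant (more on that at the end) could carry pitch and x/y
information instead.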

Issues/questions: hw
--------------------

- It newly uses a hw feature. If that is not broken in certain special
  ways itself, it will surely turn up other bugs by changing the timing of
  all kinds of things. It would be great to have a smoke (stress) test to
  have at least a vague idea of how well it works on different machines.

- While strolling through docs, I've found the following gem (Public ilk
  docs, Vol1Part2, Table 3-4 on page 53).

    "System Memory Address. Accesses are snooped in processor cache,
    allowing shared graphics/ processor access to (locked) cacheable
    memory data."

  The address range is given as "[0, 4GB]". The device range is
  unspecified and might be DevBW-DevILK. If that's true, it sounds like a
  showstopper for vmapping random userspace address ranges, since the
  backing pages can easily sit above 4G.

  (Aside: that comment about (locked) is interesting. Could that explain
  why we sometimes miss seqno interrupts because we read a stale seqno
  from the hw page due to the lack of an asm "lock" prefix? That might be
  necessary to avoid reading stale (virtual address indexed) L1 contents
  which the gpu might not be able to snoop.)

- Another thing I've noticed: You're using MLC_LLC caching for vmaps. Is
  that required to get coherent behaviour? If so we might need to change
  the if (cache_level != NONE) return; to if (cache_level == MLC_LLC)
  return; in clflush_object (for snb).
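
  As a rough sketch of what I mean (I'm guessing at the exact identifiers
  used in your branch, so treat this as pseudocode):

      /* in i915_gem_clflush_object() or its equivalent in your branch */
      if (obj->cache_level != I915_CACHE_NONE)    /* current check */
              return;

      /* would need to become something along the lines of */
      if (obj->cache_level == I915_CACHE_MLC_LLC)
              return;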

Issues/questions: api
---------------------

- One thing we already agreed on: the kernel won't magically redirect gtt
  mappings to cpu mappings. Userspace has to adapt.

- Your userspace already tracks the damage in both the cpu and the gpu
  portion of a pixmap. So I think it makes sense to fully offload
  coherency management to userspace, including gpu synchronization.
  Shooting down ptes would likely kill all benefits, anyway ;-)

- You're using set_domain_ioctl for synchronization. It seems to be enough
  for uber-synchronous X. GL's a bit better and might benefit from
  asynchronous execution fences (hello ttm). Nothing pressing, but
  something to keep in mind.
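
  To spell out what I mean by uber-synchronous: the userspace side today
  boils down to something like this (libdrm-style, error handling omitted):

      /* before touching the (snooped) bo with the cpu: a synchronous
       * stall until the gpu is done with the object */
      struct drm_i915_gem_set_domain sd = {
              .handle = bo_handle,
              .read_domains = I915_GEM_DOMAIN_CPU,
              .write_domain = I915_GEM_DOMAIN_CPU,
      };
      drmIoctl(fd, DRM_IOCTL_I915_GEM_SET_DOMAIN, &sd);

  Fine for X, but GL would rather poll or wait on a fence asynchronously
  instead of blocking outright.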

Issues/questions: vmap specifically
-----------------------------------

I can't do fear rationalization here; it essentially boils down to me not
being an -mm dev.

The other thing I'm wondering is what use cases other than X shm accel
this has. Single-purpose api extensions at the kernel level leave me with a
somewhat uneasy feeling.

I've also thought a bit about backwards compat of the new userspace code.
But given that vmap is just a special scheme for transfers I don't see a
problem in hiding it behind a bit of abstraction.

Ideas for moving forward
------------------------

In conclusion I think the idea of vmap - to use snoopable mem for tight
cpu/gpu integration - is awesome, and your experimental branches serve as a
great landmark for where to head. But my gut feeling also says that it's
too big a leap to mainline in just one step. Hence this proposal for a
somewhat more piecemeal approach, the goal being not to tackle everything
at once and to have the opportunity for some adjustment mid-way.

First start with a new ioctl to enable snooped bos. We might prepare for a
kernel object cache and introduce a new gem_create ioctl, or just a
set_cacheable_ioctl. In short, just implement the "driver provided memory"
use case.
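
Just as a strawman for discussion, the minimal version could look roughly
like this (the name and layout are made up on the spot):

    /* purely hypothetical ioctl payload for flipping a bo into
     * snooped/cacheable mode; nothing like this exists yet */
    struct drm_i915_gem_set_cacheable {
            __u32 handle;     /* gem bo to switch over */
            __u32 cacheable;  /* 0: uncached (default), 1: snooped */
    };

A create-time flag on a new gem_create ioctl would work just as well, and
would sidestep the question of what to do with a bo that already has a gtt
binding.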

I think a smoke test should exist at merge time. Not just to have an idea
of which chipsets actually work, but also to have a clear example of the
intended use of snoopable bos wrt synchronization, i.e. a simple example
to think through the api implications.
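
The pattern I'd like such a test to exercise is roughly the following
(gem_create_snooped() being the hypothetical new ioctl from above,
gem_blt_copy() standing in for an execbuf-based blit, and the remaining
helpers i-g-t style wrappers around the existing ioctls):

    static void snoop_smoke_test(int fd, uint32_t size)
    {
            uint32_t src = gem_create_snooped(fd, size);  /* hypothetical */
            uint32_t dst = gem_create_snooped(fd, size);  /* hypothetical */
            uint8_t *map;

            /* cpu write through a plain (snooped) cpu mmap, no clflush */
            map = gem_mmap_cpu(fd, src, size);
            memset(map, 0xc5, size);

            /* make the gpu read the snooped bo */
            gem_blt_copy(fd, src, dst, size);

            /* the only sync point: stall until the blt has landed */
            gem_set_domain(fd, dst, I915_GEM_DOMAIN_CPU, I915_GEM_DOMAIN_CPU);

            map = gem_mmap_cpu(fd, dst, size);
            assert(map[0] == 0xc5 && map[size - 1] == 0xc5);
    }

Loop that over a few sizes and offsets on every generation and we at least
know where snooping actually works.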

Then convert over userspace. Judging by your snb branch, there's quite
some work to do. But there are also about 4-5 months until the 2.6.40
release, so that should be plenty of time.

Using snoopable mem in less synchronous settings like mesa will perhaps
also turn up some api holes that need stuffing.

Then, when users adopt this en masse (probably somewhere in Q3) and we've
weeded out all the api kinks and worked around hw oddities, I think it's
time to re-evaluate the "externally provided memory" use-case and take a
look at vmap (the ioctl as proposed in your patch).

I have hopes that we might be able to subsume that use-case into the
single-shot use-case by beefing up pwrite/read with a blt variant that
does the right thing for 2d/tiled buffers (and also handles stride
mismatches). That feels a bit more generally useful, which is why I lean a
bit towards it. On the other hand we might turn vmap on its head and
create an shm id out of a bo for X to use (if X shm turns out to be the
only user for such a thing).
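
On the pwrite-blt idea: purely speculative, but the argument struct for
such a variant could carry the 2d information the current abi lacks,
something along these lines (all field names invented):

    /* hypothetical blt-backed pwrite; nothing like this exists today */
    struct drm_i915_gem_pwrite_blt {
            __u32 handle;        /* destination bo */
            __u32 dst_pitch;     /* destination stride in bytes */
            __u32 dst_x, dst_y;  /* destination origin, in pixels */
            __u32 width, height; /* rectangle to transfer */
            __u32 src_pitch;     /* stride of the user buffer */
            __u32 bpp;           /* bytes per pixel, for the blt cmd */
            __u64 data_ptr;      /* user pointer to the source pixels */
    };

With the kernel issuing the blt, tiling and stride mismatches get handled
by the hw instead of by fragile cpu swizzling code.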

Comments, ideas, flames highly welcome.

Yours, Daniel
-- 
Daniel Vetter
Mail: daniel at ffwll.ch
Mobile: +41 (0)79 365 57 48


