[Intel-gfx] [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs)
Ben Widawsky
benjamin.widawsky at intel.com
Fri Aug 22 05:11:23 CEST 2014
The primary goal of these patches is to introduce what I've started
calling, "prelocations" on Broadwell. A prelocation is like a
relocation, except not. When a GPU client specifies a prelocation, it is
instructing the kernel where in the GPU address space the buffer should be
mapped. The mechanic works very similarly to a relocation except it uses
the execbuffer object to obtain the offset, and bind if needed. If a GPU
client uses only prelocations, the relocation process can be entirely
skipped. This sounds like a big win initially, but realistically with
full PPGTT and 48b address space it's unlikely to noticeably improve
anything. Doing this work leaves the address space allocation up to
libc/malloc [1] instead of drm_mm which I believe has some upside due to
the hits on creating new VMAs. Not specific to prelocations, dynamic
page table allocations by themselves can save measurable memory on systems
running multiple GPU clients. As previously mentioned, this kind of thing is
needed for OCL 2.0 SVM. One other advantage I've discussed with Ken... [2].
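To make the relocation/prelocation difference concrete, here is a toy model in C. It is only a sketch under stated assumptions: the flag TOY_EXEC_PRELOCATE, the struct toy_exec_object, and the bump allocator are hypothetical names invented for illustration and are not the i915 uAPI or the code in this series. It only shows the decision described above: take the offset from the exec object and bind if needed, or let the kernel pick the address and require relocations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag and struct for illustration; NOT the i915 uAPI. */
#define TOY_EXEC_PRELOCATE (1u << 0)

struct toy_exec_object {
	uint32_t handle;
	uint32_t flags;
	uint64_t offset; /* in:  address chosen by userspace (prelocation)
			  * out: address chosen by the kernel (otherwise) */
	bool bound;
};

/* Stand-in for a kernel-side range allocator such as drm_mm. */
static uint64_t toy_next_free = 0x100000;

/*
 * Bind one object. Returns true if the caller still has to process
 * relocations for it, false if the relocation step can be skipped.
 */
static bool toy_bind(struct toy_exec_object *obj, uint64_t size)
{
	assert(obj && size > 0);

	if (obj->flags & TOY_EXEC_PRELOCATE) {
		/* Userspace picked the address: use the offset from the
		 * exec object as-is and just make sure it is bound. */
		obj->bound = true;
		return false;
	}

	/* Classic path: the kernel picks the address, so anything in the
	 * batch that embeds it has to be patched via relocations. */
	obj->offset = toy_next_free;
	toy_next_free += size;
	obj->bound = true;
	return true;
}
```

If every object in a submission takes the first branch, no relocation processing is left to do, which is the case described above.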
The difficult part to enable this [for 64b platforms] is supporting the
48b address space. As mentioned in previous versions of this cover
letter, and my blog post [3], it's not feasible to allocate the entire 48b
address space's page tables. Dynamic page table allocation and teardown
required a lot of plumbing and rework, and to make the interfaces as
neat as possible, I also had to put a good deal of work into GEN7 PPGTT
as well. The other really difficult part is taking the malloc'd memory and
turning it into GPU usable pages. Luckily, Chris already did that for me
with userptr, so I simply reused his work.
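As a rough worked example of why the full 48b space cannot be backed up front, the sketch below computes the size of the last-level page tables alone and splits an address into per-level indices. The constants are assumptions, not taken from the patches: 4 KiB pages, 8-byte entries, and 512 entries per level (a 9/9/9/9/12 bit split). All toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12u /* 4 KiB pages (assumption) */
#define TOY_LEVEL_BITS 9u  /* 512 entries per level (assumption) */
#define TOY_ENTRY_SIZE 8u  /* 64-bit entries (assumption) */

/* Bytes needed for every last-level page table of a va_bits-wide space. */
static uint64_t toy_full_pt_bytes(unsigned va_bits)
{
	assert(va_bits > TOY_PAGE_SHIFT && va_bits <= 48);
	return (1ull << (va_bits - TOY_PAGE_SHIFT)) * TOY_ENTRY_SIZE;
}

/* Index into each of the four levels for one 48b address. */
struct toy_va_index {
	unsigned pml4e, pdpe, pde, pte;
};

static struct toy_va_index toy_split_va(uint64_t addr)
{
	const uint64_t mask = (1ull << TOY_LEVEL_BITS) - 1;
	struct toy_va_index idx = {
		.pml4e = (unsigned)((addr >> 39) & mask),
		.pdpe  = (unsigned)((addr >> 30) & mask),
		.pde   = (unsigned)((addr >> 21) & mask),
		.pte   = (unsigned)((addr >> 12) & mask),
	};
	return idx;
}
```

Under these assumptions, toy_full_pt_bytes(48) is 2^39 bytes (512 GiB) of PTEs per address space, versus 8 MiB for toy_full_pt_bytes(32), which is why the 48b case needs page tables allocated on demand.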
The kernel patches are lightly tested at best. Previous iterations of
this series were more thoroughly tested, but enough has changed since
then that I would assume the code is unstable. If miraculously it is
almost stable, there are still a lot of cosmetic things to clean up, and
a performance optimization to reduce re-mapping already mapped objects.
I started on a patch to do this but ran into too many stability problems
(See Optimize PDP loads from previous posts). It's likely memory leaks
are introduced with the dynamic page tables; plugging those would be nice.
One could also implement the reaper I refer to in the comments.
With the kernel prelocation support are the libdrm patches, an
intel-gpu-tools test, and a mesa patch. Some parts of the code are in
rough shape, and were meant for demonstration only. The userspace
components in particular were mostly meant as sample code. [4]
The series is fundamentally 5 parts with some bleeding between 2-3, and
3-4.
1. [00-18] Provide fixes to make a stable branch for test with full
PPGTT. I've previously posted this as a separate series. In the
meantime, many similar fixes have gone in, and some of these may be
dropped. So this is mostly here for completeness.
2. [19-42] Rework code to avoid as much future churn
as possible. Nothing special here. Some of this is arguably #3.
3. [43-46] Make page table allocations dynamic. I tried to keep this
generic, but since the current code supported very specific page table
depths, it's really mostly GEN7.
4. [47-67] GEN8 dynamic page table support with 64b page table support.
This was very hard to split up, and is definitely the majority of the
work.
5. [68] A basic SVM interface. I opted not to use create2 IOCTL since
there are patches for that already, and I wanted to have something
that's as reusable as possible.
X. the rest are workaround/libdrm/mesa/igt
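For readers unfamiliar with the idea behind parts 3 and 4, the following is a minimal userspace sketch of dynamic page table allocation and teardown over a single page directory. It is not the code from the series: all toy_* names are hypothetical, it tracks usage with a per-table counter rather than the bitmaps the patches introduce, and it does not unwind on allocation failure.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_PAGE_SHIFT 12u
#define TOY_PT_ENTRIES 512u
#define TOY_PD_ENTRIES 512u

struct toy_pt {
	uint8_t present[TOY_PT_ENTRIES]; /* 1 if the PTE is in use */
	unsigned used;                   /* number of PTEs in use */
};

struct toy_pd {
	struct toy_pt *pt[TOY_PD_ENTRIES]; /* NULL until first use */
	unsigned allocated;                /* page tables currently allocated */
};

/* Back [start, start + len): allocate page tables only for touched PDEs. */
static int toy_alloc_range(struct toy_pd *pd, uint64_t start, uint64_t len)
{
	assert(pd && len > 0);
	uint64_t first = start >> TOY_PAGE_SHIFT;
	uint64_t last = (start + len - 1) >> TOY_PAGE_SHIFT;
	assert(last / TOY_PT_ENTRIES < TOY_PD_ENTRIES);

	for (uint64_t page = first; page <= last; page++) {
		unsigned pde = (unsigned)(page / TOY_PT_ENTRIES);
		unsigned pte = (unsigned)(page % TOY_PT_ENTRIES);

		if (!pd->pt[pde]) {
			pd->pt[pde] = calloc(1, sizeof(*pd->pt[pde]));
			if (!pd->pt[pde])
				return -1; /* sketch: no unwinding here */
			pd->allocated++;
		}
		if (!pd->pt[pde]->present[pte]) {
			pd->pt[pde]->present[pte] = 1;
			pd->pt[pde]->used++;
		}
	}
	return 0;
}

/* Release [start, start + len): free a page table once it is empty. */
static void toy_free_range(struct toy_pd *pd, uint64_t start, uint64_t len)
{
	assert(pd && len > 0);
	uint64_t first = start >> TOY_PAGE_SHIFT;
	uint64_t last = (start + len - 1) >> TOY_PAGE_SHIFT;
	assert(last / TOY_PT_ENTRIES < TOY_PD_ENTRIES);

	for (uint64_t page = first; page <= last; page++) {
		unsigned pde = (unsigned)(page / TOY_PT_ENTRIES);
		unsigned pte = (unsigned)(page % TOY_PT_ENTRIES);
		struct toy_pt *pt = pd->pt[pde];

		if (!pt || !pt->present[pte])
			continue;
		pt->present[pte] = 0;
		if (--pt->used == 0) {
			free(pt);
			pd->pt[pde] = NULL;
			pd->allocated--;
		}
	}
}
```

A client that maps only a few buffers ends up with a handful of page tables instead of all 512, which is where the memory saving with multiple GPU clients would come from.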
Kernel:
http://cgit.freedesktop.org/~bwidawsk/drm-intel/log/?h=prelocate
libdrm:
http://cgit.freedesktop.org/~bwidawsk/drm/log/?h=prelocate
mesa:
http://cgit.freedesktop.org/~bwidawsk/mesa/log/?h=prelocate
IGT:
http://cgit.freedesktop.org/~bwidawsk/intel-gpu-tools/log/?h=prelocate
Final thoughts:
* Due to time pressure, the ability to go back and test on GEN7 was lost.
The original patches I posted back in March did work fine on GEN7, but I
cannot speak to the quality now. That said, I did the work, so I figured
I may as well provide it. For the sake of progress, someone should
test/fix GEN7, or simply drop the GEN7 support.
* Broadwell is currently hanging with this patch series when I run piglit.
I have gone through plenty of software bugs, and this current hang is
baffling. Therefore I think it makes sense to put dynamic page table
allocations behind either a module parameter or a CONFIG_ option until
that's solved.
* Again on the stability, there are a lot of extra flushes introduced as a
result of this series. I believe if we can figure out the cause of some
of these issues, we can remove some flushes.
* I haven't tested aliasing PPGTT only in a while. Someone should do that.
* I'll bet 32b is broken.
* A lot of issues I had were related to the complexities when dealing with
legacy contexts. It's possible, and I am hopeful that with execlists
these issues go away, and so do the hangs.
* The patches have been rebased SOOOOO many times that they really need to
be reviewed closely to make sure they're bisectable. They were at one
time, but I doubt it's the case now.
[1] We have to use mmap in certain situations due to a hardware
limitation. I'm not sure how libc manages these things together. I hope
it's efficient...
[2] We can potentially always set the state base to be 0, and rely on HW
contexts to save/restore this information, thus eliminating this
non-pipelined state upload. It turns out this is not possible for all
cases because of hardware limitations, but it's a neat idea that someone
can possibly turn into something useful. It's also probably a premature
optimization given how many PIPE CONTROL stalls we have.
[3] https://bwidawsk.net/blog/index.php/2014/07/future-ppgtt-part-4-dynamic-page-table-allocations-64-bit-address-space-gpu-mirroring-and-yeah-something-about-relocs-too/
[4] This was the best I could do on short notice. I won't be improving,
rebasing, or fixing these patches any longer, but someone is welcome to take
them over. Consider this my parting gift before I go on sabbatical [tomorrow].
--
Ben Widawsky (68):
drm/i915: Split up do_switch
drm/i915: Extract l3 remapping out of ctx switch
drm/i915/ppgtt: Load address space after mi_set_context
drm/i915: Fix another another use-after-free in do_switch
drm/i915/ctx: Return earlier on failure
drm/i915/error: vma error capture prettyify
drm/i915/error: Do a better job of disambiguating VMAs
drm/i915/error: Capture vmas instead of BOs
drm/i915: Add some extra guards in evict_vm
drm/i915: Make an uninterruptible evict
drm/i915: More correct (slower) ppgtt cleanup
drm/i915: Defer PPGTT cleanup
drm/i915/bdw: Enable full PPGTT
drm/i915: Get the error state over the wire (HACKish)
drm/i915/gen8: Invalidate TLBs before PDP reload
drm/i915: Remove false assertion in ppgtt_release
Revert "drm/i915/bdw: Use timeout mode for RC6 on bdw"
drm/i915/trace: Fix offsets for 64b
drm/i915: Wrap VMA binding
drm/i915: Make pin global flags explicit
drm/i915: Split out aliasing binds
drm/i915: fix gtt_total_entries()
drm/i915: Rename to GEN8_LEGACY_PDPES
drm/i915: Split out verbose PPGTT dumping
drm/i915: s/pd/pdpe, s/pt/pde
drm/i915: rename map/unmap to dma_map/unmap
drm/i915: Setup less PPGTT on failed pagedir
drm/i915: clean up PPGTT init error path
drm/i915: Un-hardcode number of page directories
drm/i915: Make gen6_write_pdes gen6_map_page_tables
drm/i915: Range clearing is PPGTT agnostic
drm/i915: Page table helpers, and define renames
drm/i915: construct page table abstractions
drm/i915: Complete page table structures
drm/i915: Create page table allocators
drm/i915: Generalize GEN6 mapping
drm/i915: Clean up pagetable DMA map & unmap
drm/i915: Always dma map page table allocations
drm/i915: Consolidate dma mappings
drm/i915: Always dma map page directory allocations
drm/i915: Track GEN6 page table usage
drm/i915: Extract context switch skip logic
drm/i915: Track page table reload need
drm/i915: Initialize all contexts
drm/i915: Finish gen6/7 dynamic page table allocation
drm/i915/bdw: Use dynamic allocation idioms on free
drm/i915/bdw: pagedirs rework allocation
drm/i915/bdw: pagetable allocation rework
drm/i915/bdw: Make the pdp switch a bit less hacky
drm/i915: num_pd_pages/num_pd_entries isn't useful
drm/i915: Extract PPGTT param from pagedir alloc
drm/i915/bdw: Split out mappings
drm/i915/bdw: begin bitmap tracking
drm/i915/bdw: Dynamic page table allocations
drm/i915/bdw: Make pdp allocation more dynamic
drm/i915/bdw: Abstract PDP usage
drm/i915/bdw: Add dynamic page trace events
drm/i915/bdw: Add ppgtt info for dynamic pages
drm/i915/bdw: implement alloc/teardown for 4lvl
drm/i915/bdw: Add 4 level switching infrastructure
drm/i915/bdw: Generalize PTE writing for GEN8 PPGTT
drm/i915: Plumb sg_iter through va allocation ->maps
drm/i915: Introduce map and unmap for VMAs
drm/i915: Depend exclusively on map and unmap_vma
drm/i915: Expand error state's address width to 64b
drm/i915/bdw: Flip the 48b switch
drm/i915: Provide a soft_pin hook
XXX: drm/i915: Unexplained workarounds
drivers/gpu/drm/i915/i915_debugfs.c | 114 +-
drivers/gpu/drm/i915/i915_drv.h | 61 +-
drivers/gpu/drm/i915/i915_gem.c | 231 +++-
drivers/gpu/drm/i915/i915_gem_context.c | 276 ++++-
drivers/gpu/drm/i915/i915_gem_evict.c | 39 +-
drivers/gpu/drm/i915/i915_gem_execbuffer.c | 27 +-
drivers/gpu/drm/i915/i915_gem_gtt.c | 1838 +++++++++++++++++++++-------
drivers/gpu/drm/i915/i915_gem_gtt.h | 379 +++++-
drivers/gpu/drm/i915/i915_gem_stolen.c | 2 +-
drivers/gpu/drm/i915/i915_gem_userptr.c | 7 +-
drivers/gpu/drm/i915/i915_gpu_error.c | 171 ++-
drivers/gpu/drm/i915/i915_reg.h | 1 +
drivers/gpu/drm/i915/i915_sysfs.c | 2 +-
drivers/gpu/drm/i915/i915_trace.h | 156 ++-
drivers/gpu/drm/i915/intel_pm.c | 16 +-
drivers/gpu/drm/i915/intel_ringbuffer.c | 2 +-
include/uapi/drm/i915_drm.h | 3 +-
17 files changed, 2588 insertions(+), 737 deletions(-)
--
2.0.4