[PATCH 0/9] Visible VRAM Management Improvements

Sat Jun 24 21:50:05 UTC 2017

On Sat, Jun 24, 2017 at 08:20:22PM +0200, Christian König wrote:
> Am 24.06.2017 um 01:16 schrieb John Brooks:
> >On Fri, Jun 23, 2017 at 05:02:58PM -0400, Felix Kuehling wrote:
> >>Hi John,
> >>
> >>I haven't read your patches. Just a question based on the cover letter.
> >>
> >>I understand that visible VRAM is the biggest pain point. But could the
> >>same reasoning make sense for invisible VRAM? That is, doing all the
> >>migrations to VRAM in a workqueue?
> >>
> >>Regards,
> >>   Felix
> >>
> >I don't see why not. In theory, all non-essential buffer moves could be done
> >this way, and it would be relatively trivial to extend it to that.
> >
> >But I wanted to limit the scope of my changes, at least for this series.
> >Testing takes a long time and I wanted to focus those testing efforts as much
> >as possible, produce something well-tested (I hope), and get feedback on this
> >limited application of the concept before expanding its reach.
> 
> Yeah, sorry I have to say that but the whole approach is utterly nonsense.
> 
> What happens is that the delayed BO can only be moved AFTER the command
> submission which wants it to be in VRAM.
> 
> So you use the BO in a CS and *then* move it to where the CS wants it to be,
> no matter if the BO is then needed there or not.
> 
> Regards,
> Christian.
> 

I'm aware of the effect it has. The BO won't be in VRAM for the current command
submission, but it'll be there for a future one. If a BO is used at a given
time, then it's likely it'll be used again soon. In which case you'll come out
ahead on latency even if the GPU has to read it from GTT a few times. In any
case, it's never going to hurt as much as full-stop waiting for a worst-case BO
move that needs a lot of evictions.

Feel free to correct my understanding; you'd certainly know any of this better
than I do. But my tests indicate that immediate forced moves during CS cause
stalls, and the average framerate with delayed moves is the almost (~2%) the
same as with immediate ones, which is about 9% higher than with no forced moves
during CS at all.

DiRT Rally average framerates:
    With the whole patch set (n=3):
        89.56
    Without it (drm-next-4.13 5ac55629d6b3fcde69f46aa772c6e83be0bdcbbf)
    (n=3):
        91.16 (+stalls)
    Patches 1 and 3 only, and with GTT set as the only busy placement for
    CPU_ACCESS_REQUIRED BOs in amdgpu_cs_bo_validate (n=3):
        82.15

John

> 
> >
> >John
> >
> >>On 17-06-23 01:39 PM, John Brooks wrote:
> >>>This patch series is intended to improve performance when limited CPU-visible
> >>>VRAM is under pressure.
> >>>
> >>>Moving BOs into visible VRAM is essentially a housekeeping task. It's faster to
> >>>access them in VRAM than GTT, but it isn't a hard requirement for them to be in
> >>>VRAM. As such, it is unnecessary to spend valuable time blocking on this in the
> >>>page fault handler or during command submission. Doing so translates directly
> >>>into a longer frame time (ergo stalls and stuttering).
> >>>
> >>>The problem worsens when attempting to move BOs into visible VRAM when it is
> >>>full. This takes much longer than a simple move because other BOs have to be
> >>>evicted, which involves finding and then moving potentially hundreds of other
> >>>BOs, which is very time consuming. In the case of limited visible VRAM, it's
> >>>important to do this sometime to keep the contents of visible VRAM fresh, but
> >>>it does not need to be a blocking operation. If visible VRAM is full, the BO
> >>>can be read from GTT in the meantime and the BO can be moved to VRAM later.
> >>>
> >>>Thus, I have made it so that neither the command submission code nor page fault
> >>>handler spends time evicting BOs from visible VRAM, and instead this is
> >>>deferred to a workqueue function that's queued when CS requests BOs flagged
> >>>AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED.
> >>>
> >>>Speaking of CPU_ACCESS_REQUIRED, I've changed the handling of that flag so that
> >>>the kernel driver can clear it later even if it was set by userspace. This is
> >>>because the userspace graphics library can't know whether the application
> >>>really needs it to be CPU_ACCESS_REQUIRED forever. The kernel driver can't know
> >>>either, but it does know when page faults occur, and if a BO doesn't appear to
> >>>have any page faults when it's moved somewhere inaccessible, the flag can be
> >>>removed and it doesn't have to take up space in CPU-visible memory anymore.
> >>>This change was based on IRC discussions with Michel.
> >>>
> >>>Patch 7 fixes a problem with BO moverate throttling that causes visible VRAM
> >>>moves to not be throttled if total VRAM isn't full enough.
> >>>
> >>>I've also added a vis_vramlimit module parameter for debugging purposes. It's
> >>>similar to the vramlimit parameter except it limits only visible VRAM.
> >>>
> >>>I have tested this patch set with the two games I know to be affected by
> >>>visible VRAM pressure: DiRT Rally and Dying Light. It practically eliminates
> >>>eviction-related stuttering in DiRT Rally as well as very low performance if
> >>>visible VRAM is limited to 64MB. It also fixes severely low framerates that
> >>>occurred in some areas of Dying Light. All my testing was done with an R9 290
> >>>with 4GB of visible VRAM with an Intel i7 4790.
> >>>
> >>>--
> >>>John Brooks (Frogging101)
> >>>
> >>>_______________________________________________
> >>>amd-gfx mailing list
> >>>amd-gfx at lists.freedesktop.org
> >>>https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> 
>