[Bug 109830] New: CPU iowait higher when run the GPU multi decoding threads

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Tue Mar 5 03:25:46 UTC 2019


https://bugs.freedesktop.org/show_bug.cgi?id=109830

            Bug ID: 109830
           Summary: CPU iowait higher when run the GPU multi decoding
                    threads
           Product: DRI
           Version: DRI git
          Hardware: x86-64 (AMD64)
                OS: Linux (All)
            Status: NEW
          Severity: blocker
          Priority: medium
         Component: DRM/Intel
          Assignee: intel-gfx-bugs at lists.freedesktop.org
          Reporter: owen.zhang at intel.com
        QA Contact: intel-gfx-bugs at lists.freedesktop.org
                CC: intel-gfx-bugs at lists.freedesktop.org

Created attachment 143528
  --> https://bugs.freedesktop.org/attachment.cgi?id=143528&action=edit
iostat_in_4.19

Hi,

our customer reported a high CPU iowait issue when running multiple GPU decoding
threads. I reproduced this issue on kernel 4.19 and kernel 5.0 RC.

The logs (the full log is in the attachment):
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.99    0.00    1.87   93.14    0.00    0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   71.00    0.00  8256.00     0.00   232.56     0.11    1.66    1.66    0.00   0.87   6.20
dm-0              0.00     0.00    6.00    0.00    60.00     0.00    20.00     0.00    0.17    0.17    0.00   0.17   0.10
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00   65.00    0.00  8196.00     0.00   252.18     0.11    1.75    1.75    0.00   0.94   6.10

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.25    0.00    1.38   91.88    0.00    1.50

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   11.00    0.00    92.00     0.00    16.73     0.00    0.18    0.18    0.00   0.09   0.10
dm-0              0.00     0.00   11.00    0.00    92.00     0.00    16.73     0.00    0.09    0.09    0.00   0.09   0.10
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
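
For reference, the %iowait figures above can also be cross-checked directly from
/proc/stat; this is only a minimal sketch, assuming the standard field order of
the "cpu" line (user, nice, system, idle, iowait, ...):

# cumulative iowait jiffies are the 5th value after "cpu" (awk field $6)
awk '/^cpu /{print "iowait jiffies:", $6}' /proc/stat
sleep 5
awk '/^cpu /{print "iowait jiffies:", $6}' /proc/stat
# %iowait over the interval ~= delta(iowait) / delta(sum of all cpu fields) * 100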

I also tested drm-tip (at the latest commit,
3dd976663b985375368c6229903a1c47971c2a71); the issue is somewhat less severe
there than on kernel 4.19/5.0, with CPU iowait reduced to roughly 80%. So I
bisected the patches on drm-tip and found that the following patch is what
improves it.

"
drm/i915: Replace global breadcrumbs with per-context interrupt tracking

    A few years ago, see commit 688e6c725816 ("drm/i915: Slaughter the
    thundering i915_wait_request herd"), the issue of handling multiple
    clients waiting in parallel was brought to our attention. The
    requirement was that every client should be woken immediately upon its
    request being signaled, without incurring any cpu overhead.

    To handle certain fragility of our hw meant that we could not do a
    simple check inside the irq handler (some generations required almost
    unbounded delays before we could be sure of seqno coherency) and so
    request completion checking required delegation.

    Before commit 688e6c725816, the solution was simple. Every client
    waiting on a request would be woken on every interrupt and each would do
    a heavyweight check to see if their request was complete. Commit
    688e6c725816 introduced an rbtree so that only the earliest waiter on
    the global timeline would be woken, and would wake the next and so on.
    (Along with various complications to handle requests being reordered
    along the global timeline, and also a requirement for kthread to provide
    a delegate for fence signaling that had no process context.)

    The global rbtree depends on knowing the execution timeline (and global
    seqno). Without knowing that order, we must instead check all contexts
    queued to the HW to see which may have advanced. We trim that list by
    only checking queued contexts that are being waited on, but still we
    keep a list of all active contexts and their active signalers that we
    inspect from inside the irq handler. By moving the waiters onto the fence
    signal list, we can combine the client wakeup with the dma_fence
    signaling (a dramatic reduction in complexity, but does require the HW
    being coherent, the seqno must be visible from the cpu before the
    interrupt is raised - we keep a timer backup just in case).

    Having previously fixed all the issues with irq-seqno serialisation (by
    inserting delays onto the GPU after each request instead of random delays
    on the CPU after each interrupt), we can rely on the seqno state to
    perform direct wakeups from the interrupt handler. This allows us to
    preserve our single context switch behaviour of the current routine,
    with the only downside that we lose the RT priority sorting of wakeups.
    In general, direct wakeup latency of multiple clients is about the same
    (about 10% better in most cases) with a reduction in total CPU time spent
    in the waiter (about 20-50% depending on gen). Average herd behaviour is
    improved, but at the cost of not delegating wakeups on task_prio.

    v2: Capture fence signaling state for error state and add comments to
    warm even the most cold of hearts.
    v3: Check if the request is still active before busywaiting
    v4: Reduce the amount of pointer misdirection with list_for_each_safe
    and using a local i915_request variable inside the loops
    v5: Add a missing pluralisation to a purely informative selftest message.

    Change-Id: Ibf5251cc874f4ee338266641df072cfa1012faae
    References: 688e6c725816 ("drm/i915: Slaughter the thundering
i915_wait_request herd")
    Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
"

The bisect log is in the attachment.
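
The bisect followed the usual git flow; the sketch below is only illustrative,
with a placeholder for the bad endpoint rather than the exact commit used here
(since the search is for a fix rather than a regression, the terms are inverted
with --term-old/--term-new):

git bisect start --term-old=broken --term-new=fixed
git bisect fixed 3dd976663b985375368c6229903a1c47971c2a71   # drm-tip head, iowait ~80%
git bisect broken <known-bad-base>                          # placeholder: base that still shows ~93% iowait
# for each step: build and boot the candidate kernel, run decode_video_avc.sh,
# check iowait, then mark the result:
git bisect fixed        # or: git bisect broken
git bisect log > bisect.log
git bisect reset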

I tried to backport this patch to kernel 4.19, but the patch is large, so I need
your help to fix it up and would appreciate any suggestions.
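
What I attempted is roughly the following (a sketch only; the patch SHA is a
placeholder, and heavy conflicts under drivers/gpu/drm/i915/ are expected
because that code has diverged a lot since 4.19):

git checkout -b iowait-backport v4.19
git cherry-pick -x <breadcrumbs-patch-sha>   # placeholder for the patch quoted above
# resolve the conflicts under drivers/gpu/drm/i915/, then:
git add drivers/gpu/drm/i915/
git cherry-pick --continue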

The reproduce steps:
1) Build this stack:
https://software.intel.com/en-us/articles/build-and-debug-open-source-media-stack
2) Run ./decode_video_avc.sh
3) Meanwhile, use top or iostat to check the CPU iowait value (see the sketch
after these steps).
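
A minimal sketch of step 3, assuming the decode script is started from the same
shell:

./decode_video_avc.sh &
# top in batch mode: the "wa" value on the %Cpu(s) line is iowait
top -b -d 1 -n 5 | grep 'Cpu'
# or iostat: the avg-cpu %iowait column, sampled every second, 10 samples
iostat -x 1 10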
