[Intel-gfx] [PATCH v1] drm/i915/gem: Don't evict unmappable VMAs when pinning with PIN_MAPPABLE

Fri Mar 18 09:43:44 UTC 2022

On 18/03/2022 07:39, Kasireddy, Vivek wrote:
> Hi Tvrtko,
> 
>>
>> On 17/03/2022 07:23, Vivek Kasireddy wrote:
>>> On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
>>> more framebuffers/scanout buffers results in only one that is mappable/
>>> fenceable. Therefore, pageflipping between these 2 FBs where only one
>>> is mappable/fenceable creates latencies large enough to miss alternate
>>> vblanks thereby producing less optimal framerate.
>>>
>>> This mainly happens because when i915_gem_object_pin_to_display_plane()
>>> is called to pin one of the FB objs, the associated vma is identified
>>> as misplaced -- because there is no space for it in the aperture --
>>> and therefore i915_vma_unbind() is called which unbinds and evicts it.
>>> This misplaced vma gets subseqently pinned only when
>>> i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This whole
>>> thing results in a latency of ~10ms and happens every other repaint cycle.
>>
>> Just out of curiosity - have you looked at where does this 10ms come
>> from? Like is it simply clearing/writing PTEs so expensive, or there is
>> more to it? Apologies if I asked this before..
> [Kasireddy, Vivek] It appears the ~10ms latency seems to come from the
> execution of gen8_ggtt_clear_range() as seen in the trace log:
>            weston-4124  [008] 161210.397563: funcgraph_entry:                   |                          __i915_vma_evict() {
>            weston-4124  [008] 161210.397564: funcgraph_entry:                   |                            ggtt_unbind_vma() {
>            weston-4124  [008] 161210.397564: funcgraph_entry:                   |                              gen8_ggtt_clear_range() {
>            weston-4124  [008] 161210.408012: funcgraph_exit:       # 10416.281 us |                              }
>            weston-4124  [008] 161210.408012: funcgraph_exit:       # 10448.740 us |                            }
>            weston-4124  [008] 161210.408012: funcgraph_entry:                   |                            __vma_put_pages() {
>            weston-4124  [008] 161210.408013: funcgraph_entry:        0.083 us   |                              clear_pages();
>            weston-4124  [008] 161210.408013: funcgraph_exit:         0.578 us   |                            }
>            weston-4124  [008] 161210.408013: funcgraph_exit:       # 10449.622 us |                          }
> 
> And, for some reason, I can't get trace-cmd to capture more symbols to
> gain more insights. However, gen8_ggtt_clear_range() seems to just do
> this:
>          for (i = 0; i < num_entries; i++)
>                  gen8_set_pte(&gtt_base[i], scratch_pte);
> where,
> vma's start = 182190080, length = 132710400, first = 44480, num_entries = 32400
> and we have
> void gen8_set_pte(void __iomem *addr, gen8_pte_t pte)
> {
>          writeq(pte, addr);
> }
> 
> Any idea why executing writeq 32400 times would result in a latency
> of ~10ms?

Uncached writes I guess, due VT-d workaround being required to avoid 
display engine faulting due overfetching.

Something like https://patchwork.freedesktop.org/series/97492/ would 
have probably resolved this issue as well. I guess time to go and read 
that series in detail.

Regards,

Tvrtko

> 
>>
>>> Therefore, to fix this issue, we just ensure that the misplaced VMA
>>> does not get evicted when we try to pin it with PIN_MAPPABLE -- by
>>> returning early if the mappable/fenceable flag is not set.
>>>
>>> Testcase:
>>> Running Weston and weston-simple-egl on an Alderlake_S (ADLS) platform
>>> with a 8K at 60 mode results in only ~40 FPS (compared to ~59 FPS with
>>> this patch). Since upstream Weston submits a frame ~7ms before the
>>> next vblank, the latencies seen between atomic commit and flip event
>>> are 7, 24 (7 + 16.66), 7, 24..... suggesting that it misses the
>>> vblank every other frame.
>>>
>>> Here is the ftrace snippet that shows the source of the ~10ms latency:
>>>                 i915_gem_object_pin_to_display_plane() {
>>> 0.102 us   |    i915_gem_object_set_cache_level();
>>>                   i915_gem_object_ggtt_pin_ww() {
>>> 0.390 us   |      i915_vma_instance();
>>> 0.178 us   |      i915_vma_misplaced();
>>>                     i915_vma_unbind() {
>>>                     __i915_active_wait() {
>>> 0.082 us   |        i915_active_acquire_if_busy();
>>> 0.475 us   |      }
>>>                     intel_runtime_pm_get() {
>>> 0.087 us   |        intel_runtime_pm_acquire();
>>> 0.259 us   |      }
>>>                     __i915_active_wait() {
>>> 0.085 us   |        i915_active_acquire_if_busy();
>>> 0.240 us   |      }
>>>                     __i915_vma_evict() {
>>>                       ggtt_unbind_vma() {
>>>                         gen8_ggtt_clear_range() {
>>> 10507.255 us |        }
>>> 10507.689 us |      }
>>> 10508.516 us |   }
>>>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin at linux.intel.com>
>>> Signed-off-by: Vivek Kasireddy <vivek.kasireddy at intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/i915_gem.c | 8 +++++++-
>>>    1 file changed, 7 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>> index 9747924cc57b..7307c5de1c58 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>> @@ -939,8 +939,14 @@ i915_gem_object_ggtt_pin_ww(struct drm_i915_gem_object
>> *obj,
>>>    			if (i915_vma_is_pinned(vma) || i915_vma_is_active(vma))
>>>    				return ERR_PTR(-ENOSPC);
>>>
>>> +			/*
>>> +			 * If this misplaced vma is too big (i.e, at-least
>>> +			 * half the size of aperture) or just unmappable,
>>> +			 * we would not be able to pin with PIN_MAPPABLE.
>>> +			 */
>>
>> I would be tempted to describe the ping-pong issue in the comment. In
>> short would do it, but just because git blame on a line of code tends to
>> fail to lead to the correct commit message after a while.
>>
>> So maybe just say along the lines of "If the misplaced vma is too big
>> ... or hasn't been pinned mappable before, we ignore the misplacement
>> when PIN_NONBLOCK is set in order to avoid ping-pong of double (or more)
>> -buffered frame buffers into the aperture and out on every vblank."
> [Kasireddy, Vivek] Ok, will include more comments in v2.
> 
>>
>>>    			if (flags & PIN_MAPPABLE &&
>>> -			    vma->fence_size > ggtt->mappable_end / 2)
>>> +			    (vma->fence_size > ggtt->mappable_end / 2 ||
>>> +			    !i915_vma_is_map_and_fenceable(vma)))
>>>    				return ERR_PTR(-ENOSPC);
>>>    		}
>>>
>>
>> With the expanded comment it looks good to me.
>>
>> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> [Kasireddy, Vivek] Thank you for the review.
> 
> Thanks,
> Vivek
>>
>> + Daniel if he wants to double check it.
>>
>> Regards,
>>
>> Tvrtko