[PATCH 0/4] Drop wbinvd_on_all_cpus usage

Thomas Hellström thomas.hellstrom at linux.intel.com
Tue Mar 22 15:07:17 UTC 2022


On 3/22/22 13:53, Tvrtko Ursulin wrote:
>
> On 22/03/2022 11:37, Thomas Hellström wrote:
>> On Tue, 2022-03-22 at 11:20 +0000, Tvrtko Ursulin wrote:
>>>
>>> On 22/03/2022 10:26, Thomas Hellström wrote:
>>>> On Tue, 2022-03-22 at 10:13 +0000, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 21/03/2022 15:15, Thomas Hellström wrote:
>>>>>> On Mon, 2022-03-21 at 14:43 +0000, Tvrtko Ursulin wrote:
>>>>>>>
>>>>>>> On 21/03/2022 13:40, Thomas Hellström wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On Mon, 2022-03-21 at 13:12 +0000, Tvrtko Ursulin wrote:
>>>>>>>>>
>>>>>>>>> On 21/03/2022 12:33, Thomas Hellström wrote:
>>>>>>>>>> On Mon, 2022-03-21 at 12:22 +0000, Tvrtko Ursulin
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 21/03/2022 11:03, Thomas Hellström wrote:
>>>>>>>>>>>> Hi, Tvrtko.
>>>>>>>>>>>>
>>>>>>>>>>>> On 3/21/22 11:27, Tvrtko Ursulin wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 19/03/2022 19:42, Michael Cheng wrote:
>>>>>>>>>>>>>> To align with the discussion in [1][2], this patch series
>>>>>>>>>>>>>> drops all usage of wbinvd_on_all_cpus within i915 by either
>>>>>>>>>>>>>> replacing the call with certain drm clflush helpers, or
>>>>>>>>>>>>>> reverting to a previous logic.
>>>>>>>>>>>>>
>>>>>>>>>>>>> AFAIU, the complaint from [1] was that it is wrong to provide
>>>>>>>>>>>>> non-x86 implementations under the wbinvd_on_all_cpus name.
>>>>>>>>>>>>> Instead, an arch-agnostic helper which achieves the same
>>>>>>>>>>>>> effect could be created. Does Arm have such a concept?
>>>>>>>>>>>>
>>>>>>>>>>>> I also understand Linus' email as saying we shouldn't leak
>>>>>>>>>>>> incoherent IO to other architectures, meaning any remaining
>>>>>>>>>>>> wbinvd()s should be x86-only.
>>>>>>>>>>>
>>>>>>>>>>> The last part is completely obvious since it is an x86
>>>>>>>>>>> instruction name.
>>>>>>>>>>
>>>>>>>>>> Yeah, I meant the function implementing wbinvd()
>>>>>>>>>> semantics.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> But I think we can't pick a solution until we know how the
>>>>>>>>>>> concept maps to Arm, and that will also include seeing how
>>>>>>>>>>> drm_clflush_sg for Arm would look. Is there a range-based
>>>>>>>>>>> solution, or just a big hammer there? If the latter, then it
>>>>>>>>>>> is no good to churn all these reverts; instead an
>>>>>>>>>>> arch-agnostic wrapper, with a generic name, would be the way
>>>>>>>>>>> to go.
>>>>>>>>>>
>>>>>>>>>> But my impression was that ARM would not need the range-based
>>>>>>>>>> interface either, because ARM is only for discrete, and with
>>>>>>>>>> discrete we're always coherent.
>>>>>>>>>
>>>>>>>>> Not sure what you mean here - what about flushing system memory
>>>>>>>>> objects on discrete? Those still need flushing on paths like
>>>>>>>>> suspend which this series touches. Am I missing something?
>>>>>>>>
>>>>>>>> System bos on discrete should always have
>>>>>>>>
>>>>>>>> I915_BO_CACHE_COHERENT_FOR_READ |
>>>>>>>> I915_BO_CACHE_COHERENT_FOR_WRITE
>>>>>>>>
>>>>>>>> set, either because the gpu is fully cache coherent or because we
>>>>>>>> map system memory write-combined. Hence there is no need for
>>>>>>>> cache clflushes or wbinvd() for incoherent IO.
>>>>>>>
>>>>>>> Hmm, so you are talking about the shmem ttm backend. It ends up
>>>>>>> depending on the result of i915_ttm_cache_level, yes? It cannot
>>>>>>> end up with I915_CACHE_NONE from that function?
>>>>>>
>>>>>> If the object is allocated with allowable placement in either LMEM
>>>>>> or SYSTEM, and it ends up in system, it gets allocated with
>>>>>> I915_CACHE_NONE, but then the shmem ttm backend isn't used - TTM's
>>>>>> wc pools are - and the object should *always* be mapped wc, even in
>>>>>> system.
>>>>>
>>>>> I am not familiar with either the TTM backend or wc pools, so this
>>>>> may be a misguided question - if obj->cache_level can be set to
>>>>> none, and obj->cache_coherent to zero, then during the object's
>>>>> lifetime, are the helpers which consult those fields (like
>>>>> i915_gem_cpu_write_needs_clflush, __start_cpu_write, etc) giving
>>>>> out incorrect answers? That is, is it irrelevant that they would
>>>>> say flushes are required, since in actuality those objects can
>>>>> never, ever, from anywhere, be mapped other than WC, so flushes
>>>>> aren't actually required?
>>>>
>>>> If we map anything other than WC somewhere in these situations, that
>>>> should be a bug needing a fix. It might be that some of the helpers
>>>> you mention still flag that a clflush is needed, and in that case
>>>> that's an oversight that also needs fixing.
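>>>>
>>>> For reference, a simplified sketch of the kind of check those
>>>> helpers boil down to (illustrative only, not the exact i915 code):
>>>>
>>>> static bool
>>>> sketch_cpu_write_needs_clflush(struct drm_i915_gem_object *obj)
>>>> {
>>>>         /* Coherent-for-write objects never need a manual flush. */
>>>>         return !(obj->cache_coherent &
>>>>                  I915_BO_CACHE_COHERENT_FOR_WRITE);
>>>> }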
>>>>
>>>>>
>>>>>>> I also found in i915_drm.h:
>>>>>>>
>>>>>>>  * As caching mode when specifying `I915_MMAP_OFFSET_FIXED`, WC or
>>>>>>>  * WB will be used, depending on the object placement on creation.
>>>>>>>  * WB will be used when the object can only exist in system
>>>>>>>  * memory, WC otherwise.
>>>>>>>
>>>>>>> If what you say is true, that on discrete it is _always_ WC, then
>>>>>>> that needs updating as well.
>>>>>>
>>>>>> If an object is allocated as system only, then it is mapped WB,
>>>>>> and we're relying on the gpu being cache coherent to avoid
>>>>>> clflushes. Same is actually currently true if the object happens
>>>>>> to be accessed by the cpu while evicted. Might need an update for
>>>>>> that.
>>>>>
>>>>> Hmm okay, I think I actually misunderstood something here. I think
>>>>> the reason for the difference between an smem+lmem object which
>>>>> happens to be in smem and an smem-only object is eluding me.
>>>>>
>>>>>>>>
>>>>>>>> That's adhering to Linus'
>>>>>>>>
>>>>>>>> "And I sincerely hope to the gods that no cache-incoherent i915
>>>>>>>> mess ever makes it out of the x86 world. Incoherent IO was
>>>>>>>> always a historical mistake and should never ever happen again,
>>>>>>>> so we should not spread that horrific pattern around."
>>>>>>>
>>>>>>> Sure, but I was not talking about IO - just the CPU side access
>>>>>>> to CPU side objects.
>>>>>>
>>>>>> OK, I was under the impression that clflushes() and wbinvd()s in
>>>>>> i915 were only ever used to make data visible to non-snooping
>>>>>> GPUs.
>>>>>>
>>>>>> Do you mean that there are other uses as well? Agreed, the wb
>>>>>> cache flush on suspend only if the gpu is
>>>>>> !I915_BO_CACHE_COHERENT_FOR_READ looks to not fit this pattern
>>>>>> completely.
>>>>>
>>>>> Don't know, I was first trying to understand the handling of
>>>>> obj->cache_coherent as discussed in the first quote block - are
>>>>> the flags consistently set, and how will the Arm low-level code
>>>>> look?
>>>>>
>>>>>> Otherwise, for architectures where memory isn't always fully
>>>>>> coherent with the cpu cache, I'd expect them to use the apis in
>>>>>> asm/cacheflush.h, like flush_cache_range() and similar, which are
>>>>>> nops on x86.
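>>>>>>
>>>>>> For reference, the generic fallback in
>>>>>> include/asm-generic/cacheflush.h is an empty stub along these
>>>>>> lines:
>>>>>>
>>>>>> #ifndef flush_cache_range
>>>>>> static inline void flush_cache_range(struct vm_area_struct *vma,
>>>>>>                                      unsigned long start,
>>>>>>                                      unsigned long end)
>>>>>> {
>>>>>> }
>>>>>> #endif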
>>>>>
>>>>> Hm, do you know why they are no-ops? Like, why wouldn't they map
>>>>> to clflush?
>>>>
>>>> I think it mostly boils down to the PIPT caches on x86. Everything
>>>> is assumed to be coherent. Whereas some architectures keep different
>>>> cache entries for different virtual addresses even if the physical
>>>> page is the same...
>>>>
>>>> clflushes and wbinvds on x86 are for odd arch-specific situations
>>>> where, for example, we change caching attributes of the linear
>>>> kernel map mappings.
>>>
>>> So in summary we have flush_cache_range, which is generic, not
>>> implemented on x86, and works with virtual addresses, so it is not
>>> directly usable even if an x86 implementation were added.
>>
>> For the intended flush_cache_range() semantics - "make this range
>> visible to all vms on all cpus" - I think the x86 implementation is
>> actually a nop, and correctly implemented.
>
> If that is so then I agree. (I did not spend much time looking for 
> desired semantics, just noticed there was no kerneldoc next to the 
> function and stopped there.)
>
>>> There is also the x86-specific clflush_cache_range, which works with
>>> virtual addresses as well, so it is no good for drm_clflush_sg.
>>>
>>> The question you implicitly raise, correct me if I got it wrong, is
>>> whether we should even be trying to extend drm_clflush_sg for Arm,
>>> given how most (all?) call sites are not needed on discrete - is
>>> that right?
>>
>> Yes exactly. No need to bother figuring this out for ARM, as we don't
>> do any incoherent IO.
>>
>>>
>>> Would that mean we could leave most of the code as is and just
>>> replace wbinvd_on_all_cpus with something like i915_flush_cpu_caches,
>>> which would then legitimately do nothing, at least on Arm if not also
>>> on discrete in general?
>>
>> Yes, with the caveat that we should, at least as a second step, make
>> i915_flush_cpu_caches() range-based if possible from a performance
>> point of view.
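>>
>> One possible shape for that first step (a hypothetical sketch, not
>> actual code; the IS_DGFX/HAS_LLC condition is an assumption about
>> where incoherent IO can occur):
>>
>> static void i915_flush_cpu_caches(struct drm_i915_private *i915)
>> {
>> #ifdef CONFIG_X86
>>         /* Only incoherent integrated x86 parts need the big hammer. */
>>         if (!IS_DGFX(i915) && !HAS_LLC(i915))
>>                 wbinvd_on_all_cpus();
>> #endif
>>         /* On Arm (discrete only) this is legitimately a no-op. */
>> }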
>
> Sounds like a plan, and I am counting on the second step really being
> a second step. Because that one will need to actually figure out and
> sufficiently elaborate all three proposed reverts, which was missing
> in this posting. So the first step unblocks Arm builds very cheaply
> and non-controversially, and the second step tries going the range
> route.
>
>>> If that would work it would make a small and easy-to-review series.
>>> I don't think it would collide with what Linus asked for, since it
>>> is not propagating undesirable things further - if there is no
>>> actual need to flush, then there is no need to make it range-based
>>> either.
>>>
>>> The exception would be the dmabuf get pages patch, which needs a
>>> proper implementation of a new drm flush helper.
>>
>> I think the dmabuf get_pages (note that that's also only for
>> integrated I915_CACHE_NONE, x86-only situations) can be done with
>> something like:
>>
>> struct iosys_map map;
>>
>> if (!dma_buf_vmap(dma_buf, &map)) {
>>         drm_clflush_virt_range(map.vaddr, length);
>>         dma_buf_vunmap(dma_buf, &map);
>> }
>
> Looks plausible to me. The downside is that it vmaps the whole object
> at once, so it may regress, at least on 32-bit (!) builds. It might
> work in theory to fall back to page-by-page flushing, but whether it
> would be worth it just for 32-bit I am not sure.

Back in the day, IIRC, there was also a kmap() api for dma-buf. But
nobody used it, and yes, vmap is not ideal, but a simple fallback to
page-based flushing (or even wbinvd on the rare occasion of a vmap
error) might be ok.
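
Something like this sketch, perhaps (hypothetical helper name; assumes
the x86-only I915_CACHE_NONE context mentioned above, so
wbinvd_on_all_cpus() is available as the rare-failure big hammer; needs
linux/dma-buf.h, linux/iosys-map.h, drm/drm_cache.h and asm/smp.h):

static void flush_dmabuf_for_cpu(struct dma_buf *dmabuf, size_t length)
{
        struct iosys_map map;

        if (!dma_buf_vmap(dmabuf, &map)) {
                drm_clflush_virt_range(map.vaddr, length);
                dma_buf_vunmap(dmabuf, &map);
                return;
        }

        /* Rare vmap failure, e.g. 32-bit vmalloc exhaustion. */
        wbinvd_on_all_cpus();
}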

/Thomas


>
> Regards,
>
> Tvrtko

