[Intel-xe] LLC configuration, mmap and bo cache management questions

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Thu Dec 14 08:10:23 UTC 2023


On 13/12/2023 20:11, Thomas Hellström wrote:
> 
> On 12/13/23 18:50, Tvrtko Ursulin wrote:
>>
>> On 13/12/2023 17:27, Thomas Hellström wrote:
>>>
>>> On 12/13/23 12:55, Tvrtko Ursulin wrote:
>>>>
>>>> On 07/12/2023 12:01, Thomas Hellström wrote:
>>>>>
>>>>> On 12/7/23 12:11, Tvrtko Ursulin wrote:
>>>>>>
>>>>>> On 06/12/2023 11:46, Hellstrom, Thomas wrote:
>>>>>>> Hi, Tvrtko.
>>>>>>>
>>>>>>> On Wed, 2023-12-06 at 10:58 +0000, Tvrtko Ursulin wrote:
>>>>>>>>
>>>>>>>> Hi Thomas,
>>>>>>>>
>>>>>>>> On 06/12/2023 08:26, Hellstrom, Thomas wrote:
>>>>>>>>> On Tue, 2023-12-05 at 14:19 +0000, Tvrtko Ursulin wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> We are working on adding xe support to ChromeOS minigbm and have
>>>>>>>>>> a couple questions.
>>>>>>>>>>
>>>>>>>>>> If I follow things correctly, with xe the mmap caching mode is
>>>>>>>>>> fixed to the object caching mode set at bo create. For framebuffers
>>>>>>>>>> it will be WC and for the rest userspace can choose WB or WC via
>>>>>>>>>> drm_xe_gem_create->cpu_caching. (Unless discrete, when WB cannot be
>>>>>>>>>> used at all.)
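
For reference, the create path in a minigbm xe backend would then look
roughly like the sketch below. The helper name is made up and the field
names are per the drm_xe_gem_create layout that is being upstreamed, so
treat it as a sketch rather than gospel:

#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <xf86drm.h>
#include "xe_drm.h"

/* Sketch: create a bo with the CPU caching mode minigbm wants. Scanout
 * buffers must be WC, everything else can ask for WB on integrated. The
 * placement mask (which memory regions to use) is left out of the picture
 * here and would come from a memory region query. */
static int xe_bo_create(int fd, uint64_t size, uint32_t placement,
                        bool scanout, uint32_t *handle)
{
        struct drm_xe_gem_create create = {
                .size = size,
                .placement = placement,
                .flags = scanout ? DRM_XE_GEM_CREATE_FLAG_SCANOUT : 0,
                .cpu_caching = scanout ? DRM_XE_GEM_CPU_CACHING_WC :
                                         DRM_XE_GEM_CPU_CACHING_WB,
        };

        if (drmIoctl(fd, DRM_IOCTL_XE_GEM_CREATE, &create))
                return -errno;

        *handle = create.handle;
        return 0;
}
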
>>>>>>>>>>
>>>>>>>>>> AFAICT minigbm basically cares about two transition points. Let's
>>>>>>>>>> call them CPU access begin and end.
>>>>>>>>>>
>>>>>>>>>> 1)
>>>>>>>>>> When a bo is mmapped it wants to invalidate the cache, which looks
>>>>>>>>>> to be about making sure all GPU writes have landed in the backing
>>>>>>>>>> store. In the i915 world that translates to the set_domain ioctl.
>>>>>>>>>>
>>>>>>>>>> What is the uapi for this with xe, or is it somehow guaranteed to
>>>>>>>>>> not be needed?
>>>>>>>>>
>>>>>>>>> Signalling a user-fence or dma-fence obtained as an out-fence from
>>>>>>>>> an exec call will guarantee GPU caches are flushed. Currently I
>>>>>>>>> don't think there is anything like gem wait in the uAPI, although
>>>>>>>>> Matt is just about to add functionality to wait on all outstanding
>>>>>>>>> work on an exec_queue.
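
For completeness, a client which does have the exec in hand would wait on
the out-sync it asked to be signalled, roughly as below (hypothetical
helper; the syncobj is assumed to have been passed to DRM_IOCTL_XE_EXEC as
a to-be-signalled out-sync):

#include <stdint.h>
#include <xf86drm.h>

/* Once this returns successfully, the GPU work is done and, per the
 * above, GPU caches have been flushed. */
static int wait_exec_done(int fd, uint32_t syncobj_handle)
{
        return drmSyncobjWait(fd, &syncobj_handle, 1, INT64_MAX,
                              DRM_SYNCOBJ_WAIT_FLAGS_WAIT_ALL, NULL);
}
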
>>>>>>>>
>>>>>>>> The problem I see is that there are no execs, and therefore no
>>>>>>>> fences, in the minigbm ABI. It's just buffers, created or imported,
>>>>>>>> CPU access and some other stuff.
>>>>>>>>
>>>>>>>> And it is quite extensively used in the OS so I assume it has to work
>>>>>>>> (I mean the invalidation/flushing was not put in there for no
>>>>>>>> reason). In other words, where the i915 backend today does
>>>>>>>> DRM_I915_GEM_SET_DOMAIN on "cpu access begin", which is buffer based,
>>>>>>>> I am not clear how to implement that with xe.
>>>>>>>>
>>>>>>>> As for the outstanding work you mention, since you say it is about
>>>>>>>> exec_queue, I assume again it will not work purely with buffers? If
>>>>>>>> so it probably won't be useful for minigbm.
>>>>>>>>
>>>>>>>> Also if I look at all the other minigbm backends, I see a mix of
>>>>>>>> behaviours:
>>>>>>>>
>>>>>>>>    * msm and vc4 appear to not concern themselves with any of this.
>>>>>>>>
>>>>>>>>    * rockchip appears to be doing full bounce buffering via memcpy on
>>>>>>>> CPU access begin/end.
>>>>>>>>
>>>>>>>>    * i915 and amdgpu use DRM_I915_GEM_SET_DOMAIN and
>>>>>>>> DRM_AMDGPU_GEM_WAIT_IDLE respectively (also buffer based, not
>>>>>>>> execution queue based). Amdgpu curiously does not do any flushing on
>>>>>>>> CPU access end.
>>>>>>>>
>>>>>>>> Digging into git history, both DRM_I915_GEM_SET_DOMAIN on CPU access
>>>>>>>> begin and clflushing on end were added to fix various CTS test
>>>>>>>> failures. So I guess we could also wait and see what happens there.
>>>>>>>> If those or some other tests start failing with xe too, we can
>>>>>>>> propose adding some new uapi. Or if manual testing starts reporting
>>>>>>>> visual corruption in UI elements or such.
>>>>>>>
>>>>>>> Indeed it sounds like we'd need a DRM_XE_GEM_WAIT_IDLE, similar to
>>>>>>> AMDGPU for this use-case. I'll bring that up for discussion.
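
For reference, the amdgpu path boils down to roughly the below (struct and
ioctl names per amdgpu_drm.h, helper name made up); a DRM_XE_GEM_WAIT_IDLE
of the same shape would slot straight into the minigbm "cpu access begin"
hook:

#include <stdint.h>
#include <string.h>
#include <xf86drm.h>
#include <amdgpu_drm.h>

/* Block until the GPU is done with the bo; a ~0 timeout effectively means
 * "wait until idle". */
static int amdgpu_bo_wait_idle(int fd, uint32_t handle)
{
        union drm_amdgpu_gem_wait_idle wait;

        memset(&wait, 0, sizeof(wait));
        wait.in.handle = handle;
        wait.in.timeout = UINT64_MAX;

        return drmIoctl(fd, DRM_IOCTL_AMDGPU_GEM_WAIT_IDLE, &wait);
}
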
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>>>>> 2)
>>>>>>>>>> When a bo is unmapped, or CPU access is finished, it wants to flush
>>>>>>>>>> the CPU caches. That is /almost/ completely a CPU operation, where
>>>>>>>>>> it just needs to either clflush or flush the WC buffer
>>>>>>>>>> respectively, except that clflush can be skipped on platforms with
>>>>>>>>>> LLC.
>>>>>>>>>>
>>>>>>>>>> I did not see an equivalent of I915_PARAM_HAS_LLC in xe? Did I miss
>>>>>>>>>> it, or what is the plan for querying this detail?
>>>>>>>>>
>>>>>>>>> XeKMD is generally coherent, except if UMD selects a GPU PAT index
>>>>>>>>> with limited coherency together with WB instead of WC memory. In
>>>>>>>>> that case, UMD is responsible for doing the needed CLFLUSH-ing,
>>>>>>>>> whereas KMD only ensures the initial clearing of the pages is
>>>>>>>>> CLFLUSHed for security reasons.
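
For reference, the CLFLUSH-ing UMD would be on the hook for in that case is
essentially a loop like the one below (a minimal sketch, assuming 64-byte
cache lines; helper name made up):

#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>

/* Flush the CPU cache lines backing [ptr, ptr + size) so a non-coherent
 * GPU access sees the data. */
static void clflush_range(const void *ptr, size_t size)
{
        const size_t line = 64;  /* assumed cache line size */
        uintptr_t p = (uintptr_t)ptr & ~(line - 1);
        uintptr_t end = (uintptr_t)ptr + size;

        _mm_mfence();            /* order prior stores against the flushes */
        for (; p < end; p += line)
                _mm_clflush((const void *)p);
        _mm_mfence();            /* make sure the flushes have completed */
}
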
>>>>>>>>>
>>>>>>>>> I'm not 100% sure if UMD can actually select WB with a limited
>>>>>>>>> coherency PAT index in the initial uAPI revision, but Matthew has
>>>>>>>>> received requests for that, so any additional input here on
>>>>>>>>> performance implications is appreciated.
>>>>>>>>>
>>>>>>>>> The thinking here is otherwise that GPU PAT indices with limited
>>>>>>>>> coherency should be used together with WC memory, in the same
>>>>>>>>> situations as VRAM/LMEM is used on DGFX.
>>>>>>>>
>>>>>>>> Hm okay, this would be the VM BIND side of things which deals with
>>>>>>>> GPU PATs. From the CPU side it is just CPU caching modes, implicitly
>>>>>>>> selected via DRM_XE_GEM_CPU_CACHING_WB/WC.
>>>>>>>>
>>>>>>>> The question about the analogue of I915_PARAM_HAS_LLC was AFAIU about
>>>>>>>> a performance optimisation where UMD is deciding whether it is okay
>>>>>>>> to skip issuing clflush for the mapped bo if
>>>>>>>> DRM_XE_GEM_CPU_CACHING_WB was used. (If DRM_XE_GEM_CPU_CACHING_WC was
>>>>>>>> used it obviously only needs to flush the write-combine buffer.)
>>>>>>>>
>>>>>>>> Looking at what Mesa is doing, it appears it is not using
>>>>>>>> I915_PARAM_HAS_LLC but has its own device info tables. So I guess the
>>>>>>>> minigbm xe backend will have to do the same if xe will not provide an
>>>>>>>> analogous query.
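
Meaning the xe backend would grow a small table of its own, something along
these lines (hypothetical struct and illustrative PCI IDs only):

#include <stdbool.h>
#include <stdint.h>

/* Per-platform info keyed off the PCI device ID, much like Mesa's device
 * info tables, since xe exposes no HAS_LLC style query. */
struct xe_device_info {
        uint16_t pci_id;
        bool is_dgfx;
        bool has_llc;
};

static const struct xe_device_info xe_device_info_table[] = {
        { 0x9a49, false, true },        /* TGL: igfx with shared LLC */
        { 0x7d55, false, false },       /* MTL: igfx without shared LLC */
        { 0x56a0, true, false },        /* DG2: dgfx */
};
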
>>>>>>>
>>>>>>> IMO clflushing in this case should never be needed (again, similar to
>>>>>>> AMD). Whatever renders from / to those buffers should make sure they
>>>>>>> are clflushed before or while accessing them. How is a rendering API
>>>>>>> made aware of these bos? Are they imported using drm prime?
>>>>>>
>>>>>> Yes, they can be imported via drm prime. It's quite a huge code base
>>>>>> with a few abstractions and quite hard to figure out all possible
>>>>>> paths. There are gbm_bo_map paths in the host and vm compositors (Exo
>>>>>> and Sommelier), the Skia library, the camera pipeline, maybe more.
>>>>>>
>>>>>> You mean that specifically on the "cpu access end" part of the story
>>>>>> flushing would never be needed? Regardless of llc and !llc, wb or wc
>>>>>> mappings, imported or native buffer? What mechanism would the next
>>>>>> stage in the pipeline (after CPU access) use to ensure the flush?
>>>>>>
>>>>> Currently we're not supporting any older !LLC hardware, and with 
>>>>> newer !LLC hardware, we're only allowing at least 1-way coherency 
>>>>> with dma-bufs, meaning the GPU will acquire the cache line when 
>>>>> accessing the bo. If the bo is not a dma-buf, and the GPU wants to 
>>>>> access it in non-coherent mode, we're currently enforcing WC 
>>>>> cpu-mappings.
>>>>>
>>>>> So:
>>>>> igfx with shared LLC: Not a problem (always coherent)
>>>>> dgfx: Not a problem (always coherent)
>>>>> newer igfx without shared LLC: Not a problem (KMD is enforcing coherent
>>>>> access)
>>>>>
>>>>> Now this may of course, as mentioned, have performance implications
>>>>> compared to previous igfx solutions, depending on how it all is written,
>>>>> but it should have performance characteristics similar to dgfx. If we
>>>>> want to change this and relax the coherency enforcement and resort to
>>>>> CLFLUSHes again, could UMD communicate the need for this in the same way
>>>>> it communicates the format of the gbm_bos?
>>>>
>>>> I think it is probably not needed. If minigbm knows that mmap is always
>>>> WC on MTL then it knows that all it needs to do is flush the
>>>> write-combine buffer. So the logic in the minigbm xe backend's "cpu
>>>> access end" would be something like:
>>>>
>>>> xe_bo_flush(...)
>>>> {
>>>>   /* mmap is WC in these cases, so drain the CPU write-combine buffer */
>>>>   if (dgfx || (igfx && !llc))
>>>>     __builtin_ia32_sfence();
>>>> }
>>>>
>>>> Does that sound right to you?
>>>>
>>>> So far on the i915 backend this explicit sfence was not even there 
>>>> in the mmap_wc case, which I guess possibly works because something 
>>>> is likely to trigger the flush implicitly inside the complexities of 
>>>> the call stack anyway.
>>>
>>> IIRC WC buffers are also flushed on uncached register writes (does that
>>> include updating the ring tail, perhaps?).
>>
>> If the next in the chain is the GPU, I guess so. It would indeed be a
>> typical use case, but for correctness it probably needs an sfence since
>> AFAIU the write-combine buffer is per CPU core.
>>
>>> It might also be that these gbm_bos are created WB and the igfx PAT 
>>> is set to 1-way coherency. What code creates those gbm_bos?
>>
>> Minigbm, on behalf of various callers. It is based on the allocation
>> flags, such as whether it will be used for scanout, protected content,
>> modifiers and such.
>>
>> When scanout is requested we will set DRM_XE_GEM_CREATE_FLAG_SCANOUT,
>> which would mean scanout bos will be WC and the rest WB 1-way coherent
>> when mmapped.
>>
>>> I figure for dgfx at least we don't want any intel platform-specific 
>>> code in the stack if possible....
>>
>> True, it would have to be platform specific.
> 
> So if we unconditionally add an sfence, I wouldn't expect the overhead to
> be significant even if it's not strictly needed? Or have gbm_bo_create()
> forward to gbm_bo_map() whether the bo was created WC?

Unconditional is most probably fine, I agree.

Otherwise the map does have the information about how the object was
created, so that part wouldn't be a problem
(https://github.com/intel/minigbm/blob/10d9a651375efa3592ab95431783984c28a30ad4/i915.c#L529).
It is just that if we wanted the LLC presence info it would be extra work.
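
So the xe backend's "cpu access end" hook would then reduce to something
like the sketch below, reusing the bo_flush() signature the i915 backend
already has (just a sketch):

#include "drv_priv.h"   /* minigbm internal: struct bo, struct mapping */

/* Unconditionally drain the CPU write-combine buffers. It is required for
 * WC mappings, cheap enough for WB ones, and since KMD keeps WB mappings
 * coherent no clflush is needed. */
static int xe_bo_flush(struct bo *bo, struct mapping *mapping)
{
        __builtin_ia32_sfence();
        return 0;
}
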

Regards,

Tvrtko


