[PATCH 14/25] drm/amdkfd: Populate DRM render device minor

Wed Feb 14 16:35:43 UTC 2018

On 2018-02-14 02:42 AM, Christian König wrote:
> Am 14.02.2018 um 00:17 schrieb Felix Kuehling:
>> On 2018-02-13 02:18 PM, Felix Kuehling wrote:
>>> On 2018-02-13 01:15 PM, Christian König wrote:
>>>> Am 13.02.2018 um 18:18 schrieb Felix Kuehling:
>>>>> On 2018-02-13 12:06 PM, Christian König wrote:
>>>>>> [SNIP]
>>>>>> Ah, yeah that is also a point I wanted to to talk about with you.
>>>>>>
>>>>>> The approach of using the same buffer object with multiple amdgpu
>>>>>> devices doesn't work in general.
>>>>>>
>>>>>> We need separate TTM object for each BO in each device or
>>>>>> otherwise we
>>>>>> break A+A laptops.
>>>>> I think it broke for VRAM BOs because we enabled P2P on systems that
>>>>> didn't support it properly. But at least system memory BOs can be
>>>>> shared
>>>>> quite well between devices and we do it all the time.
>>>> Sharing VRAM BOs is one issue, but the problem goes deeper than just
>>>> that.
>>>>
>>>> Starting with Carizzo we can scanout from system memory to avoid the
>>>> extra copy on A+A laptops. For this to work we need the BO mapped to
>>>> GART (and I mean a real VMID0 mapping, not just in the GTT domain).
>>>> And for this to work in turn we need a TTM object per device and not a
>>>> global one.
>>> I still don't understand. I think what you're talking about applies
>>> only
>>> to BOs used for scan-out. Every BO is allocated from a specific device
>>> and can only be GART-mapped on that device.
>
> Exactly that assumption is incorrect. BOs can be GART mapped into any
> device.

Fine. My point is, we're not doing that.

>
>>>   What we do is map the same
>>> BO in VMs on other devices. It has no effect on GART mappings.
>
> Correct VM mapping is unaffected here.

Great.

>
>>>>> I don't see how you can have separate TTM objects referring to the
>>>>> same memory.
>>>> Well that is trivial, we do this all the time with prime and I+A
>>>> laptops.
>>> As I understand it, you use DMABuf to export/import buffers on multiple
>>> devices. I believe all devices share a single amdgpu_bo, which contains
>>> the ttm_buffer_object.
>
> That's incorrect as well. Normally multiple devices have multiple
> ttm_buffer_object, one for each device.
> Going a bit higher that actually makes sense because the status of
> each BO is deferent for each device. E.g. one device could have the BO
> in access while it could be idle on another device.

Can you point me where this is done? I'm looking at
amdgpu_gem_prime_foreign_bo. It is used if an AMDGPU BO is imported into
a different AMDGPU device. It creates a new GEM object, with a reference
to the same amdgpu BO (gobj->bo = amdgpu_bo_ref(bo)). To me this looks
very much like the same amdgpu_bo, and cosequently the same TTM BO being
shared by two GEM objects and two devices.

>
> Same is true for placement of the BO. E.g. a VRAM placement of one
> device is actually a GTT placement for another.
>
>>>   The only way you can have a new TTM buffer object
>>> per device is by using SG tables and pinning the BOs. But I think we
>>> want to avoid pinning BOs.
>>>
>>> What we do does not involve pinning of BOs, even when they're shared
>>> between multiple devices' VMs.
>>>
>>>>>> That is also the reason we had to disable this feature again in the
>>>>>> hybrid branches.
>>>>> What you disabled on hybrid branches was P2P, which only affects
>>>>> large-BAR systems. Sharing of system memory BOs between devices still
>>>>> works fine.
>>>> No, it doesn't. It completely breaks any scanout on Carizzo, Stoney
>>>> and Raven. Additional to that we found that it breaks some aspects of
>>>> the user space interface.
>>> Let me check that with my current patch series. The patches I submitted
>>> here shouldn't include anything that breaks the use cases you describe.
>>> But I'm quite sure it will support sharing BOs between multiple
>>> devices'
>>> VMs.
>> Confirmed with the current patch series. I can allocate buffers on GPU 1
>> and map them into a VM on GPU 2.
>
> As I said VM mapping is not the problem here.

Great.

>
>> BTW, this is the normal mode of operation for system memory BOs on
>> multi-GPU systems. Logically system memory BOs belong to the CPU node in
>> the KFD topology. So the Thunk API isn't told which GPU to allocate the
>> system memory BO from. It just picks the first one. Then we can map
>> those BOs on any GPUs we want. If we want to use them on GPU2, that's
>> where they get mapped.
>
> Well keeping NUMA in mind that actually sounds like a design problem
> to me.

We're not doing NUMA with TTM because TTM is not NUMA aware. We have
some prototype NUMA code in the Thunk that uses userptr to map memory
allocated with NUMA awareness to the GPU.

>
> On NUMA system some parts of the system memory can be "closer" to a
> GPU than other parts.

Yes, I understand that.

>
>> So I just put two GPUs in a system and ran a test on GPU2. The system
>> memory buffers are allocated from the GPU1 device, but mapped into the
>> GPU2 VM. The tests work normally.
>>
>> If this is enabled by any changes that break existing buffer sharing for
>> A+A or A+I systems, please point it out to me. I'm not aware that this
>> patch series does anything to that effect.
>
> As I said it completely breaks scanout with A+I systems.

Please tell me what "it" is. What in the changes I have posted is
breaking A+I systems. I don't see it.

>
> Over all it looks like the change causes more problems than it solves.
> So I'm going to block upstreaming it until we have found a way to make
> it work for everybody.

Again, I don't know what "it" is.

Regards,
  Felix

>
> Regards,
> Christian.
>
>>
>> Regards,
>>    Felix
>>
>>> Regards,
>>>    Felix
>>>
>>>> So end result is that we probably need to revert it and find a
>>>> different solution. I'm already working on this for a couple of weeks
>>>> now and should have something ready after I'm done with the PASID
>>>> handling.
>>>>
>>>> Regards,
>>>> Christian.
>