[PATCH 21/24] drm/amdkfd: add queue remapping

Felix Kuehling felix.kuehling at amd.com
Thu Nov 23 23:01:45 UTC 2023


On 2023-11-23 17:41, Greathouse, Joseph wrote:
> [Public]
>
>> -----Original Message-----
>> From: Zhu, James <James.Zhu at amd.com>
>> Sent: Thursday, November 23, 2023 1:49 PM
>>
>> On 2023-11-23 14:02, Felix Kuehling wrote:
>>> On 2023-11-23 11:25, James Zhu wrote:
>>>> On 2023-11-22 17:35, Felix Kuehling wrote:
>>>>> On 2023-11-03 09:11, James Zhu wrote:
>>>>>> Add queue remapping to force the waves in any running
>>>>>> processes to complete a CWSR trap.
>>>>> Please add an explanation why this is needed.
>>>> [JZ] Even though the profiling-enabled bits is turned off, the CWSR
>>>> trap handlers for some kernels with this process may still in running
>>>> stage, this will
>>>>
>>>> force the waves in any running processes to complete a CWSR trap, and
>>>> make sure pc sampling is completely stopped with this process.   I
>>>> will add it later.
>>> It may be confusing to talk specifically about "CWSR trap handler".
>>> There is only one trap handler that is triggered by different events:
>>> CWSR, host trap, s_trap instructions, exceptions, etc. When a new trap
>>> triggers, it serializes with any currently running trap handler in
>>> that wavefront. So it seems that you're using CWSR as a way to ensure
>>> that any host trap has completed: CWSR will wait for previous traps to
>>> finish before trapping again for CWSR, the HWS firmware waits for CWSR
>>> completion and the driver waits for HWS to finish CWSR with a fence on
>>> a HIQ QUERY_STATUS packet. Is that correct?
>> [JZ] I think your explanation is more detail. Need Joseph to confirm.
> Felix, your summary is correct. The reason we are trying to perform a queue unmap/map cycle as part of the PC sampling stop is to prevent the following:
>
> 1. A PC sampling request arrives to Wave X, sending it to 1st-level trap handler
> 2. User thread asks KFD to stop sampling for this process, which leads to kfd_pc_sample_stop()
> 3. kfd_pc_sample_stop() decrements the sampling refcent. If this is the last process to stop sampling, it stops any further sampling traps from being generated
> 4. kfd_pc_sample_stop() sets this process's TMA flag to false so waves in the 1st-level trap handler know sampling is disabled
>      4.1. Wave X may be in 1st-level handler and not yet checked the TMA flag. If so, it will exit the 1st-level handler when it sees flag is false
>      4.2. Wave X may have already passed the 1st-level TMA flag check and entered the 2nd-level trap handler to do the PC sample
> 5. kfd_pc_sample_stop() returns, eventually causing ioctl to return, back to user-space
> 6. Because the stop ioctl has returned, user-land deallocates user-space buffer the 2nd level trap handler uses to output sample data
> 7. Wave X that was in the 2nd-level handler tries to finish its sample output and writes to the now-freed location, causing a use-after-free
>
> Note that Step 3 does not always stop further traps from arriving -- if another process still wants to do sampling, the driver or HW might still send traps to every wave on the device after Step 3.
> As such, to avoid going into the 2nd-level handler for non-sampled processes, all 1st-level handlers must check their TMA flag to see if they should allow the sample to flow to the 2nd-level handler.
>
> By removing the queue from the HW after Step 4, we can be sure that any existing waves from this process that entered the PC sampling 2nd-level handler before Step 4 are done.
> Any waves that were still in the 1st-level handler at Step 4.1 will be filtered by the TMA flag being set to false. CWSR will wait until they exit.
> Any waves that were already in the 2nd-level handler (4.2) must complete before the CWSR save will complete and allow this queue removal request to complete.
> Any waves that enter the 1st-level trap handler after Step 4 won't go into the PC sampling logic in the 2nd-level handler because the TMA flag is set to false. CWSR will wait until they exit.
>
> When we then put the queue back on the hardware, any further traps that might show up (e.g. because another process is sampling) will get filtered by the TMA flag.
>
> So once the queue removal (and thus CWSR save cycle) has completed, we can be sure that no other traps to this process will try to use its PC sample data buffer, so it's safe to return to user-space and let them potentially free that buffer.
>
> I don't know how to summarize this nicely in a comment, but hopefully y'all can figure that out. :)

My best summary: We need to ensure that any waves executing the PC 
sampling part of the trap handler are done before kfd_pc_sample_stop 
returns, and that no new waves enter that part of the trap handler 
afterwards. This avoids race conditions that could lead to 
use-after-free. Unmapping and remapping the queues either waits for the 
waves to drain, or preempts them with CWSR, which itself executes a trap 
and waits for previous traps to finish.

Regards,
   Felix


>
> Thanks,
> -Joe
>
>>> Regards,
>>>    Felix
>>>
>>>
>>>>>> Signed-off-by: James Zhu <James.Zhu at amd.com>
>>>>>> ---
>>>>>>    drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 11
>>>>>> +++++++++++
>>>>>>    drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h |  5 +++++
>>>>>>    drivers/gpu/drm/amd/amdkfd/kfd_pc_sampling.c          |  3 +++
>>>>>>    3 files changed, 19 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>>>> index c0e71543389a..a3f57be63f4f 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>>>> @@ -3155,6 +3155,17 @@ int debug_refresh_runlist(struct
>>>>>> device_queue_manager *dqm)
>>>>>>        return debug_map_and_unlock(dqm);
>>>>>>    }
>>>>>>    +void remap_queue(struct device_queue_manager *dqm,
>>>>>> +                enum kfd_unmap_queues_filter filter,
>>>>>> +                uint32_t filter_param,
>>>>>> +                uint32_t grace_period)
>>>>> Not sure if you need the filter and grace period parameters in this
>>>>> function. What's the point of exposing that to callers who just want
>>>>> to trigger a CWSR? You could also change the function name to
>>>>> reflect the purpose of the function, rather than the implementation.
>>>> [JZ] Just want to create a general function in case that used by
>>>> others. I am fine to remove passing filter_param/grace_period
>>>>> Regards,
>>>>>    Felix
>>>>>
>>>>>
>>>>>> +{
>>>>>> +    dqm_lock(dqm);
>>>>>> +    if (!dqm->dev->kfd->shared_resources.enable_mes)
>>>>>> +        execute_queues_cpsch(dqm, filter, filter_param,
>>>>>> grace_period);
>>>>>> +    dqm_unlock(dqm);
>>>>>> +}
>>>>>> +
>>>>>>    #if defined(CONFIG_DEBUG_FS)
>>>>>>      static void seq_reg_dump(struct seq_file *m,
>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
>>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
>>>>>> index cf7e182588f8..f8aae3747a36 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
>>>>>> @@ -303,6 +303,11 @@ int debug_lock_and_unmap(struct
>>>>>> device_queue_manager *dqm);
>>>>>>    int debug_map_and_unlock(struct device_queue_manager *dqm);
>>>>>>    int debug_refresh_runlist(struct device_queue_manager *dqm);
>>>>>>    +void remap_queue(struct device_queue_manager *dqm,
>>>>>> +                enum kfd_unmap_queues_filter filter,
>>>>>> +                uint32_t filter_param,
>>>>>> +                uint32_t grace_period);
>>>>>> +
>>>>>>    static inline unsigned int get_sh_mem_bases_32(struct
>>>>>> kfd_process_device *pdd)
>>>>>>    {
>>>>>>        return (pdd->lds_base >> 16) & 0xFF;
>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_pc_sampling.c
>>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_pc_sampling.c
>>>>>> index e8f0559b618e..66670cdb813a 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_pc_sampling.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_pc_sampling.c
>>>>>> @@ -24,6 +24,7 @@
>>>>>>    #include "kfd_priv.h"
>>>>>>    #include "amdgpu_amdkfd.h"
>>>>>>    #include "kfd_pc_sampling.h"
>>>>>> +#include "kfd_device_queue_manager.h"
>>>>>>      struct supported_pc_sample_info {
>>>>>>        uint32_t ip_version;
>>>>>> @@ -164,6 +165,8 @@ static int kfd_pc_sample_stop(struct
>>>>>> kfd_process_device *pdd,
>>>>>> cancel_work_sync(&pdd->dev->pcs_data.hosttrap_entry.base.pc_sampling_work);
>>>>>>
>>>>>> kfd_process_set_trap_pc_sampling_flag(&pdd->qpd,
>>>>>> pdd->dev->pcs_data.hosttrap_entry.base.pc_sample_info.method, false);
>>>>>> +        remap_queue(pdd->dev->dqm,
>>>>>> +            KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES, 0,
>>>>>> USE_DEFAULT_GRACE_PERIOD);
>>>>>>              mutex_lock(&pdd->dev->pcs_data.mutex);
>>>>>> pdd->dev->pcs_data.hosttrap_entry.base.target_simd = 0;


More information about the amd-gfx mailing list