[PATCH 21/24] drm/amdkfd: add queue remapping
James Zhu
jamesz at amd.com
Thu Nov 23 23:16:25 UTC 2023
On 2023-11-23 18:01, Felix Kuehling wrote:
> On 2023-11-23 17:41, Greathouse, Joseph wrote:
>> [Public]
>>
>>> -----Original Message-----
>>> From: Zhu, James <James.Zhu at amd.com>
>>> Sent: Thursday, November 23, 2023 1:49 PM
>>>
>>> On 2023-11-23 14:02, Felix Kuehling wrote:
>>>> On 2023-11-23 11:25, James Zhu wrote:
>>>>> On 2023-11-22 17:35, Felix Kuehling wrote:
>>>>>> On 2023-11-03 09:11, James Zhu wrote:
>>>>>>> Add queue remapping to force the waves in any running
>>>>>>> processes to complete a CWSR trap.
>>>>>> Please add an explanation why this is needed.
>>>>> [JZ] Even though the profiling-enabled bits is turned off, the CWSR
>>>>> trap handlers for some kernels with this process may still in running
>>>>> stage, this will
>>>>>
>>>>> force the waves in any running processes to complete a CWSR trap, and
>>>>> make sure pc sampling is completely stopped with this process. I
>>>>> will add it later.
>>>> It may be confusing to talk specifically about "CWSR trap handler".
>>>> There is only one trap handler that is triggered by different events:
>>>> CWSR, host trap, s_trap instructions, exceptions, etc. When a new trap
>>>> triggers, it serializes with any currently running trap handler in
>>>> that wavefront. So it seems that you're using CWSR as a way to ensure
>>>> that any host trap has completed: CWSR will wait for previous traps to
>>>> finish before trapping again for CWSR, the HWS firmware waits for CWSR
>>>> completion and the driver waits for HWS to finish CWSR with a fence on
>>>> a HIQ QUERY_STATUS packet. Is that correct?
>>> [JZ] I think your explanation is more detail. Need Joseph to confirm.
>> Felix, your summary is correct. The reason we are trying to perform a
>> queue unmap/map cycle as part of the PC sampling stop is to prevent
>> the following:
>>
>> 1. A PC sampling request arrives to Wave X, sending it to 1st-level
>> trap handler
>> 2. User thread asks KFD to stop sampling for this process, which
>> leads to kfd_pc_sample_stop()
>> 3. kfd_pc_sample_stop() decrements the sampling refcent. If this is
>> the last process to stop sampling, it stops any further sampling
>> traps from being generated
>> 4. kfd_pc_sample_stop() sets this process's TMA flag to false so
>> waves in the 1st-level trap handler know sampling is disabled
>> 4.1. Wave X may be in 1st-level handler and not yet checked the
>> TMA flag. If so, it will exit the 1st-level handler when it sees flag
>> is false
>> 4.2. Wave X may have already passed the 1st-level TMA flag check
>> and entered the 2nd-level trap handler to do the PC sample
>> 5. kfd_pc_sample_stop() returns, eventually causing ioctl to return,
>> back to user-space
>> 6. Because the stop ioctl has returned, user-land deallocates
>> user-space buffer the 2nd level trap handler uses to output sample data
>> 7. Wave X that was in the 2nd-level handler tries to finish its
>> sample output and writes to the now-freed location, causing a
>> use-after-free
>>
>> Note that Step 3 does not always stop further traps from arriving --
>> if another process still wants to do sampling, the driver or HW might
>> still send traps to every wave on the device after Step 3.
>> As such, to avoid going into the 2nd-level handler for non-sampled
>> processes, all 1st-level handlers must check their TMA flag to see if
>> they should allow the sample to flow to the 2nd-level handler.
>>
>> By removing the queue from the HW after Step 4, we can be sure that
>> any existing waves from this process that entered the PC sampling
>> 2nd-level handler before Step 4 are done.
>> Any waves that were still in the 1st-level handler at Step 4.1 will
>> be filtered by the TMA flag being set to false. CWSR will wait until
>> they exit.
>> Any waves that were already in the 2nd-level handler (4.2) must
>> complete before the CWSR save will complete and allow this queue
>> removal request to complete.
>> Any waves that enter the 1st-level trap handler after Step 4 won't go
>> into the PC sampling logic in the 2nd-level handler because the TMA
>> flag is set to false. CWSR will wait until they exit.
>>
>> When we then put the queue back on the hardware, any further traps
>> that might show up (e.g. because another process is sampling) will
>> get filtered by the TMA flag.
>>
>> So once the queue removal (and thus CWSR save cycle) has completed,
>> we can be sure that no other traps to this process will try to use
>> its PC sample data buffer, so it's safe to return to user-space and
>> let them potentially free that buffer.
>>
>> I don't know how to summarize this nicely in a comment, but hopefully
>> y'all can figure that out. :)
>
> My best summary: We need to ensure that any waves executing the PC
> sampling part of the trap handler are done before kfd_pc_sample_stop
> returns, and that no new waves enter that part of the trap handler
> afterwards. This avoids race conditions that could lead to
> use-after-free. Unmapping and remapping the queues either waits for
> the waves to drain, or preempts them with CWSR, which itself executes
> a trap and waits for previous traps to finish.
> [JZ] Thanks all!
> Regards,
> Felix
>
>
>>
>> Thanks,
>> -Joe
>>
>>>> Regards,
>>>> Felix
>>>>
>>>>
>>>>>>> Signed-off-by: James Zhu <James.Zhu at amd.com>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 11
>>>>>>> +++++++++++
>>>>>>> drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h | 5 +++++
>>>>>>> drivers/gpu/drm/amd/amdkfd/kfd_pc_sampling.c | 3 +++
>>>>>>> 3 files changed, 19 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>>>>> index c0e71543389a..a3f57be63f4f 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>>>>>> @@ -3155,6 +3155,17 @@ int debug_refresh_runlist(struct
>>>>>>> device_queue_manager *dqm)
>>>>>>> return debug_map_and_unlock(dqm);
>>>>>>> }
>>>>>>> +void remap_queue(struct device_queue_manager *dqm,
>>>>>>> + enum kfd_unmap_queues_filter filter,
>>>>>>> + uint32_t filter_param,
>>>>>>> + uint32_t grace_period)
>>>>>> Not sure if you need the filter and grace period parameters in this
>>>>>> function. What's the point of exposing that to callers who just want
>>>>>> to trigger a CWSR? You could also change the function name to
>>>>>> reflect the purpose of the function, rather than the implementation.
>>>>> [JZ] Just want to create a general function in case that used by
>>>>> others. I am fine to remove passing filter_param/grace_period
>>>>>> Regards,
>>>>>> Felix
>>>>>>
>>>>>>
>>>>>>> +{
>>>>>>> + dqm_lock(dqm);
>>>>>>> + if (!dqm->dev->kfd->shared_resources.enable_mes)
>>>>>>> + execute_queues_cpsch(dqm, filter, filter_param,
>>>>>>> grace_period);
>>>>>>> + dqm_unlock(dqm);
>>>>>>> +}
>>>>>>> +
>>>>>>> #if defined(CONFIG_DEBUG_FS)
>>>>>>> static void seq_reg_dump(struct seq_file *m,
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
>>>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
>>>>>>> index cf7e182588f8..f8aae3747a36 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
>>>>>>> @@ -303,6 +303,11 @@ int debug_lock_and_unmap(struct
>>>>>>> device_queue_manager *dqm);
>>>>>>> int debug_map_and_unlock(struct device_queue_manager *dqm);
>>>>>>> int debug_refresh_runlist(struct device_queue_manager *dqm);
>>>>>>> +void remap_queue(struct device_queue_manager *dqm,
>>>>>>> + enum kfd_unmap_queues_filter filter,
>>>>>>> + uint32_t filter_param,
>>>>>>> + uint32_t grace_period);
>>>>>>> +
>>>>>>> static inline unsigned int get_sh_mem_bases_32(struct
>>>>>>> kfd_process_device *pdd)
>>>>>>> {
>>>>>>> return (pdd->lds_base >> 16) & 0xFF;
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_pc_sampling.c
>>>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_pc_sampling.c
>>>>>>> index e8f0559b618e..66670cdb813a 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_pc_sampling.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_pc_sampling.c
>>>>>>> @@ -24,6 +24,7 @@
>>>>>>> #include "kfd_priv.h"
>>>>>>> #include "amdgpu_amdkfd.h"
>>>>>>> #include "kfd_pc_sampling.h"
>>>>>>> +#include "kfd_device_queue_manager.h"
>>>>>>> struct supported_pc_sample_info {
>>>>>>> uint32_t ip_version;
>>>>>>> @@ -164,6 +165,8 @@ static int kfd_pc_sample_stop(struct
>>>>>>> kfd_process_device *pdd,
>>>>>>> cancel_work_sync(&pdd->dev->pcs_data.hosttrap_entry.base.pc_sampling_work);
>>>>>>>
>>>>>>>
>>>>>>> kfd_process_set_trap_pc_sampling_flag(&pdd->qpd,
>>>>>>> pdd->dev->pcs_data.hosttrap_entry.base.pc_sample_info.method,
>>>>>>> false);
>>>>>>> + remap_queue(pdd->dev->dqm,
>>>>>>> + KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES, 0,
>>>>>>> USE_DEFAULT_GRACE_PERIOD);
>>>>>>> mutex_lock(&pdd->dev->pcs_data.mutex);
>>>>>>> pdd->dev->pcs_data.hosttrap_entry.base.target_simd = 0;
More information about the amd-gfx
mailing list