[PATCH] drm/amdgpu: Always emit GDS switch when GDS/GWS/OA is used
Friedrich Vock
friedrich.vock at gmx.de
Wed Aug 2 10:29:01 UTC 2023
Gentle ping. Any updates on this yet?
Thanks,
Friedrich
On 20.07.23 23:25, Friedrich Vock wrote:
> Hi,
>
> On 07.07.23 10:21, Christian König wrote:
>> Am 07.07.23 um 09:28 schrieb Friedrich Vock:
>>> Hi Christian,
>>>
>>> On 07.07.23 08:56, Christian König wrote:
>>>>
>>>>
>>>> Am 07.07.23 um 08:28 schrieb Friedrich Vock:
>>>>> During gfxoff, the per-VMID GDS registers are reset and not restored
>>>>> afterwards.
>>>>
>>>> Hui? Since when? Those registers should be part of the saved ones.
>>>>
>>>> Have you found that by observation?
>>>
>>> Yes. I tested this on my RX 6700 XT and on the Steam Deck (Vangogh).
>>> In the bug report I linked, a test program I developed that uses GWS
>>> hangs because of this.
>>>
>>> The hang occurs as soon as the kernel re-uses a VMID on which GWS was
>>> already used once. In the hung state, inspecting the per-VMID GWS
>>> registers shows that the values have been reset to 0.
>>> The hang does not occur when gfxoff is disabled.
>>>
>>> Even without causing hangs, you can confirm the behaviour by doing the
>>> following:
>>> 1. Disable gfxoff.
>>> 2. Set some GWS registers.
>>> 3. Enable gfxoff and wait a bit.
>>> 4. Disable gfxoff and read the registers again. The GWS registers have
>>> been reset.
>>>
>>> I performed this test for the GDS_BASE/SIZE registers and it seems
>>> these
>>> aren't affected, so it's only GWS that is buggy here.
>>
>> That's most likely a bug in the FW then. I'm going to ask around
>> internally.
>
> Did the talks with the FW team result in anything yet? It's not that
> high-priority, but it'd be nice to know if this is going to be fixed in
> the firmware or if I should make a v2 (or if this isn't going to be
> fixed at all).
>
> Thanks,
> Friedrich
>
>>
>>> I should probably make a v2 that combines the pre-patch behaviour for
>>> GDS and OA with the patched behaviour for GWS.
>>
>> Yeah, that sounds like a good idea to me. But let me ping the fw teams
>> first.
>>
>>>
>>> I'm not aware of any userspace using GWS yet (I had some ideas for
>>> using it in RADV, which is what I've been writing these tests for),
>>> so perhaps the Cc to stable can also be omitted.
>>
>> Depends on what the fw team says. As far as I know, GWS has never
>> been used widely on Linux.
>>
>> Could be that they say there is a hw bug and we deprecated it for
>> this generation, or that it's simply not handled by the fw and the
>> driver needs to take care of it (like this patch does), or whatever.
>>
>> Thanks for the notice,
>> Christian.
>>
>>>
>>> Thanks,
>>> Friedrich
>>>
>>>>
>>>> Thanks,
>>>> Christian.
>>>>
>>>>
>>>>> The kernel needs to emit a GDS switch to manually update the
>>>>> GWS registers in this case. Since gfxoff can happen between any two
>>>>> submissions and the kernel has no way of knowing, emit the GDS switch
>>>>> before every submission.
>>>>>
>>>>> Fixes: 56b0989e29 ("drm/amdgpu: fix GDS/GWS/OA switch handling")
>>>>> Cc: stable at vger.kernel.org
>>>>> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2530
>>>>> Signed-off-by: Friedrich Vock <friedrich.vock at gmx.de>
>>>>> ---
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c | 22 +++++++---------------
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.h | 1 -
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 10 ++++++++--
>>>>> 3 files changed, 15 insertions(+), 18 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
>>>>> index ff1ea99292fb..de73797e9279 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
>>>>> @@ -165,24 +165,17 @@ bool amdgpu_vmid_had_gpu_reset(struct amdgpu_device *adev,
>>>>> atomic_read(&adev->gpu_reset_counter);
>>>>> }
>>>>>
>>>>> -/* Check if we need to switch to another set of resources */
>>>>> -static bool amdgpu_vmid_gds_switch_needed(struct amdgpu_vmid *id,
>>>>> - struct amdgpu_job *job)
>>>>> -{
>>>>> - return id->gds_base != job->gds_base ||
>>>>> - id->gds_size != job->gds_size ||
>>>>> - id->gws_base != job->gws_base ||
>>>>> - id->gws_size != job->gws_size ||
>>>>> - id->oa_base != job->oa_base ||
>>>>> - id->oa_size != job->oa_size;
>>>>> -}
>>>>> -
>>>>> /* Check if the id is compatible with the job */
>>>>> static bool amdgpu_vmid_compatible(struct amdgpu_vmid *id,
>>>>> struct amdgpu_job *job)
>>>>> {
>>>>> return id->pd_gpu_addr == job->vm_pd_addr &&
>>>>> - !amdgpu_vmid_gds_switch_needed(id, job);
>>>>> + id->gds_base == job->gds_base &&
>>>>> + id->gds_size == job->gds_size &&
>>>>> + id->gws_base == job->gws_base &&
>>>>> + id->gws_size == job->gws_size &&
>>>>> + id->oa_base == job->oa_base &&
>>>>> + id->oa_size == job->oa_size;
>>>>> }
>>>>>
>>>>> /**
>>>>> @@ -434,7 +427,6 @@ int amdgpu_vmid_grab(struct amdgpu_vm *vm, struct amdgpu_ring *ring,
>>>>> list_move_tail(&id->list, &id_mgr->ids_lru);
>>>>> }
>>>>>
>>>>> - job->gds_switch_needed = amdgpu_vmid_gds_switch_needed(id, job);
>>>>> if (job->vm_needs_flush) {
>>>>> id->flushed_updates = amdgpu_vm_tlb_seq(vm);
>>>>> dma_fence_put(id->last_flush);
>>>>> @@ -503,7 +495,7 @@ void amdgpu_vmid_free_reserved(struct amdgpu_device *adev,
>>>>> * @vmhub: vmhub type
>>>>> * @vmid: vmid number to use
>>>>> *
>>>>> - * Reset saved GDW, GWS and OA to force switch on next flush.
>>>>> + * Reset saved GDS, GWS and OA data.
>>>>> */
>>>>> void amdgpu_vmid_reset(struct amdgpu_device *adev, unsigned vmhub,
>>>>> unsigned vmid)
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>>>> index a963a25ddd62..2898508b1ce4 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>>>> @@ -53,7 +53,6 @@ struct amdgpu_job {
>>>>> uint32_t preamble_status;
>>>>> uint32_t preemption_status;
>>>>> bool vm_needs_flush;
>>>>> - bool gds_switch_needed;
>>>>> bool spm_update_needed;
>>>>> uint64_t vm_pd_addr;
>>>>> unsigned vmid;
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> index 291977b93b1d..61856040cae2 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> @@ -557,6 +557,12 @@ void amdgpu_vm_check_compute_bug(struct amdgpu_device *adev)
>>>>> }
>>>>> }
>>>>>
>>>>> +/* Check if the job needs a GDS switch */
>>>>> +static bool amdgpu_vm_need_gds_switch(struct amdgpu_job *job)
>>>>> +{
>>>>> + return job->gds_size || job->gws_size || job->oa_size;
>>>>> +}
>>>>> +
>>>>> /**
>>>>> * amdgpu_vm_need_pipeline_sync - Check if pipe sync is needed for job.
>>>>> *
>>>>> @@ -579,7 +585,7 @@ bool amdgpu_vm_need_pipeline_sync(struct amdgpu_ring *ring,
>>>>> if (job->vm_needs_flush || ring->has_compute_vm_bug)
>>>>> return true;
>>>>>
>>>>> - if (ring->funcs->emit_gds_switch && job->gds_switch_needed)
>>>>> + if (ring->funcs->emit_gds_switch && amdgpu_vm_need_gds_switch(job))
>>>>> return true;
>>>>>
>>>>> if (amdgpu_vmid_had_gpu_reset(adev, &id_mgr->ids[job->vmid]))
>>>>> @@ -609,7 +615,7 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct amdgpu_job *job,
>>>>> struct amdgpu_vmid *id = &id_mgr->ids[job->vmid];
>>>>> bool spm_update_needed = job->spm_update_needed;
>>>>> bool gds_switch_needed = ring->funcs->emit_gds_switch &&
>>>>> - job->gds_switch_needed;
>>>>> + amdgpu_vm_need_gds_switch(job);
>>>>> bool vm_flush_needed = job->vm_needs_flush;
>>>>> struct dma_fence *fence = NULL;
>>>>> bool pasid_mapping_needed = false;
>>>>> --
>>>>> 2.41.0
>>>>>
>>>>
>>