[PATCH 08/31] drm/amdgpu: track ring state associated with a job
Christian König
christian.koenig at amd.com
Thu Jun 5 13:50:06 UTC 2025
On 6/5/25 15:21, Alex Deucher wrote:
>>> + am_fence->start_ring_wptr = 0;
>>> + am_fence->end_ring_wptr = 0;
>>
>> Why do we need the start here? I would just keep the end around and then jump from fence to fence while re-submitting them.
>
> I need to know the start and end of the ring contents associated with
> each fence. When I re-emit, I just copy over the ring contents for
> all fences that don't match the bad one. Also we submit multiple
> fences per IB depending on whether we do a vm flush. Those fences are
> internal to the IB frame so they don't really need a start and end,
> hence 0.
What I wanted to do is the following:
ptr = bad_fence->end_ring_wptr;
for (i = (bad_fence->seq + 1) & fence_drv->mask;
     i != (bad_fence->seq & fence_drv->mask);
     i = (i + 1) & fence_drv->mask) {
        fence = fence_drv->fences[i];
        if (dma_fence_is_signaled(fence))
                break;
        if (!fence->end_ring_wptr)
                continue;
        if (fence->context != bad_fence->context)
                backup(ptr, fence->end_ring_wptr);
        ptr = fence->end_ring_wptr;
}
But it could be that it's better to back up start/end explicitly.
>>> +void amdgpu_fence_driver_seq_force_completion(struct amdgpu_ring *ring, u32 seq)
>>
>> Better give the full fence structure here.
>
> You mean pass the fence directly?
Yes.
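I.e. something along these lines (just a sketch; the name and taking the
seqno from the embedded dma_fence are only illustrative):

        void amdgpu_fence_driver_force_completion_fence(struct amdgpu_ring *ring,
                                                        struct amdgpu_fence *fence)
        {
                /* take the seq from the fence instead of passing it separately */
                amdgpu_fence_write(ring, lower_32_bits(fence->base.seqno));
                amdgpu_fence_process(ring);
        }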
>
>>
>>> +{
>>> + amdgpu_fence_write(ring, seq);
>>> + amdgpu_fence_process(ring);
>>> +}
>>> +
>>> /*
>>> * Common fence implementation
>>> */
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>>> index 802743efa3b39..636941697a740 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>>> @@ -126,7 +126,9 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned int num_ibs,
>>> struct dma_fence **f)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> + u64 start_ring_wptr, end_ring_wptr;
>>> struct amdgpu_ib *ib = &ibs[0];
>>> + struct amdgpu_fence *am_fence;
>>> struct dma_fence *tmp = NULL;
>>> bool need_ctx_switch;
>>> struct amdgpu_vm *vm;
>>> @@ -138,7 +140,6 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned int num_ibs,
>>> int vmid = AMDGPU_JOB_GET_VMID(job);
>>> bool need_pipe_sync = false;
>>> unsigned int cond_exec;
>>> -
>>> unsigned int i;
>>> int r = 0;
>>>
>>> @@ -187,6 +188,7 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned int num_ibs,
>>> dev_err(adev->dev, "scheduling IB failed (%d).\n", r);
>>> return r;
>>> }
>>> + start_ring_wptr = ring->wptr;
>>>
>>> need_ctx_switch = ring->current_ctx != fence_ctx;
>>> if (ring->funcs->emit_pipeline_sync && job &&
>>> @@ -306,6 +308,15 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned int num_ibs,
>>>
>>> amdgpu_ring_ib_end(ring);
>>> amdgpu_ring_commit(ring);
>>> + /* This must be last for resets to work properly
>>> + * as we need to save the wptr associated with this
>>> + * fence.
>>> + */
>>> + end_ring_wptr = ring->wptr;
>>> + am_fence = container_of(*f, struct amdgpu_fence, base);
>>> + am_fence->start_ring_wptr = start_ring_wptr;
>>> + am_fence->end_ring_wptr = end_ring_wptr;
>>
>> The end_ring_wptr variable is superfluous, and I would move that assignment into a helper in amdgpu_fence.c.
>
> I'm not following. I need the start and end wptrs in order to know
> what ranges of the ring I need to save.
But you have
end_ring_wptr = ring->wptr;
...
am_fence->end_ring_wptr = end_ring_wptr;
start_ring_wptr is available as ring->wptr_old btw.
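E.g. a small helper along those lines (just a sketch, name is made up):

        /* in amdgpu_fence.c, called once after amdgpu_ring_commit() */
        void amdgpu_fence_save_wptr(struct amdgpu_ring *ring, struct dma_fence *f)
        {
                struct amdgpu_fence *am_fence =
                        container_of(f, struct amdgpu_fence, base);

                /* wptr_old was saved by amdgpu_ring_alloc(), so it marks the
                 * start of this frame; ring->wptr is the end after commit.
                 */
                am_fence->start_ring_wptr = ring->wptr_old;
                am_fence->end_ring_wptr = ring->wptr;
        }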
>>> +void amdgpu_ring_backup_unprocessed_jobs(struct amdgpu_ring *ring,
>>> + bool is_guilty,
>>> + struct amdgpu_fence *bad_fence)
>>> +{
>>> + struct amdgpu_fence *fence;
>>> + struct dma_fence *old, **ptr;
>>> + int i;
>>> +
>>> + ring->ring_backup_entries_to_copy = 0;
>>> + for (i = 0; i <= ring->fence_drv.num_fences_mask; i++) {
>>
>> That is the wrong order for the fences, you need to start/end at the last submitted one.
>
> I'm not sure I follow. When I back up the ring contents, I need to go
> from oldest to newest so the order is correct when I re-emit.
Yeah, but 0 is not the oldest. fence_drv->fences is a ring buffer!
You need to start with something like fence_drv->fences[bad_fence->seq & mask].
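So roughly (sketch only, assuming the bad fence's seqno is reachable through
its embedded dma_fence):

        u32 mask = ring->fence_drv.num_fences_mask;
        u32 first = bad_fence->base.seqno & mask;
        u32 i, n;

        for (n = 0, i = first; n <= mask; n++, i = (i + 1) & mask) {
                struct dma_fence *f = ring->fence_drv.fences[i];

                if (!f || dma_fence_is_signaled(f))
                        continue;
                /* back up this fence's ring contents as before */
        }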
Christian.
>
> Alex