[PATCH 08/31] drm/amdgpu: track ring state associated with a job

Christian König christian.koenig at amd.com
Thu Jun 5 13:50:06 UTC 2025


On 6/5/25 15:21, Alex Deucher wrote:
>>> +     am_fence->start_ring_wptr = 0;
>>> +     am_fence->end_ring_wptr = 0;
>>
>> Why do we need the start here? I would just keep the end around and then jump from fence to fence while re-submitting them.
> 
> I need to know the start and end of the ring contents associated with
> each fence.  When I re-emit, I just copy over the ring contents for
> all fences that don't match the bad one.  Also we submit multiple
> fences per IB depending on whether we do a vm flush.  Those fences are
> internal to the IB frame so they don't really need a start and end,
> hence 0.


What I wanted to do is the following:

ptr = bad_fence->end_ring_wptr;
for (i = (bad_fence->seq + 1) & fence_drv->mask;
     i != (bad_fence->seq & fence_drv->mask);
     i = (i + 1) & fence_drv->mask) {
	fence = fence_drv->fences[i];

	/* a signaled fence here is an old one, we walked past the
	 * last submission */
	if (dma_fence_is_signaled(fence))
		break;

	/* fences internal to an IB frame don't carry a wptr range */
	if (!fence->end_ring_wptr)
		continue;

	/* skip the ring contents of the guilty context */
	if (fence->context != bad_fence->context)
		backup(ptr, fence->end_ring_wptr);

	ptr = fence->end_ring_wptr;
}

But it could be that it is better to backup start/end explicitly.
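
The backup() in the snippet above would then be roughly the following
(just a sketch; ring_backup and ring_backup_entries_to_copy are the
fields from your patch):

static void amdgpu_ring_backup_range(struct amdgpu_ring *ring,
				     u64 start, u64 end)
{
	u64 i;

	/* copy the dwords in [start, end) into the backup buffer,
	 * handling wrap-around of the wptr */
	for (i = start; i != end; i = (i + 1) & ring->ptr_mask)
		ring->ring_backup[ring->ring_backup_entries_to_copy++] =
			ring->ring[i & ring->buf_mask];
}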

>>> +void amdgpu_fence_driver_seq_force_completion(struct amdgpu_ring *ring, u32 seq)
>>
>> Better give the full fence structure here.
> 
> You mean pass the fence directly?

Yes.
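
Something like this, e.g. (just a sketch; the name and the ring
back-pointer in amdgpu_fence are my assumptions):

void amdgpu_fence_force_completion(struct amdgpu_fence *fence)
{
	struct amdgpu_ring *ring = fence->ring;

	amdgpu_fence_write(ring, lower_32_bits(fence->base.seqno));
	amdgpu_fence_process(ring);
}

That avoids passing a raw seq number around in the first place.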

> 
>>
>>> +{
>>> +     amdgpu_fence_write(ring, seq);
>>> +     amdgpu_fence_process(ring);
>>> +}
>>> +
>>>  /*
>>>   * Common fence implementation
>>>   */
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>>> index 802743efa3b39..636941697a740 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>>> @@ -126,7 +126,9 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned int num_ibs,
>>>                      struct dma_fence **f)
>>>  {
>>>       struct amdgpu_device *adev = ring->adev;
>>> +     u64 start_ring_wptr, end_ring_wptr;
>>>       struct amdgpu_ib *ib = &ibs[0];
>>> +     struct amdgpu_fence *am_fence;
>>>       struct dma_fence *tmp = NULL;
>>>       bool need_ctx_switch;
>>>       struct amdgpu_vm *vm;
>>> @@ -138,7 +140,6 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned int num_ibs,
>>>       int vmid = AMDGPU_JOB_GET_VMID(job);
>>>       bool need_pipe_sync = false;
>>>       unsigned int cond_exec;
>>> -
>>>       unsigned int i;
>>>       int r = 0;
>>>
>>> @@ -187,6 +188,7 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned int num_ibs,
>>>               dev_err(adev->dev, "scheduling IB failed (%d).\n", r);
>>>               return r;
>>>       }
>>> +     start_ring_wptr = ring->wptr;
>>>
>>>       need_ctx_switch = ring->current_ctx != fence_ctx;
>>>       if (ring->funcs->emit_pipeline_sync && job &&
>>> @@ -306,6 +308,15 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned int num_ibs,
>>>
>>>       amdgpu_ring_ib_end(ring);
>>>       amdgpu_ring_commit(ring);
>>> +     /* This must be last for resets to work properly
>>> +      * as we need to save the wptr associated with this
>>> +      * fence.
>>> +      */
>>> +     end_ring_wptr = ring->wptr;
>>> +     am_fence = container_of(*f, struct amdgpu_fence, base);
>>> +     am_fence->start_ring_wptr = start_ring_wptr;
>>> +     am_fence->end_ring_wptr = end_ring_wptr;
>>
>> The end_ring_wptr variable is superfluous, and I would move that assignment into a helper in amdgpu_fence.c.
> 
> I'm not following. I need the start and end wptrs in order to know
> what ranges of the ring I need to save.

But you have:
	end_ring_wptr = ring->wptr;
...
	am_fence->end_ring_wptr = end_ring_wptr;

start_ring_wptr is available as ring->wptr_old btw.
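
E.g. something like this (sketch, hypothetical name):

static void amdgpu_fence_save_wptr(struct amdgpu_ring *ring,
				   struct dma_fence *fence)
{
	struct amdgpu_fence *am_fence = container_of(fence, struct amdgpu_fence,
						     base);

	/* wptr_old is the wptr from before this submission started */
	am_fence->start_ring_wptr = ring->wptr_old;
	am_fence->end_ring_wptr = ring->wptr;
}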

>>> +void amdgpu_ring_backup_unprocessed_jobs(struct amdgpu_ring *ring,
>>> +                                      bool is_guilty,
>>> +                                      struct amdgpu_fence *bad_fence)
>>> +{
>>> +     struct amdgpu_fence *fence;
>>> +     struct dma_fence *old, **ptr;
>>> +     int i;
>>> +
>>> +     ring->ring_backup_entries_to_copy = 0;
>>> +     for (i = 0; i <= ring->fence_drv.num_fences_mask; i++) {
>>
>> That is the wrong order for the fences, you need to start/end at the last submitted one.
> 
> I'm not sure I follow.  When I backup the ring contents, I need to go
> from oldest to newest so the order is correct when I re-emit.

Yeah, but 0 is not the oldest. fence_drv->fences is a ring buffer!

You need to start with something like fence_drv->fences[bad_fence->seq & mask].
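
E.g. (sketch):

	u64 seq = bad_fence->base.seqno;
	u32 mask = ring->fence_drv.num_fences_mask;
	u32 i;

	/* walk the fence ring buffer in submission order, starting
	 * right after the bad fence */
	for (i = (seq + 1) & mask; i != (seq & mask);
	     i = (i + 1) & mask) {
		struct dma_fence *fence = ring->fence_drv.fences[i];

		if (!fence || dma_fence_is_signaled(fence))
			continue;

		/* ... backup this fence's ring contents ... */
	}

That way the copy happens from oldest to newest unsignaled fence.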

Christian.

> 
> Alex

