<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<p style="font-family:Calibri;font-size:10pt;color:#0000FF;margin:5pt;font-style:normal;font-weight:normal;text-decoration:none;" align="Left">
[AMD Official Use Only - AMD Internal Distribution Only]<br>
</p>
<br>
<div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<i>> Does this patch fix it?</i></div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<i><a href="https://patchwork.freedesktop.org/patch/605437/" target="_blank" id="OWAa13c2cc3-61b4-e6df-265d-852c907b6a0a" class="OWAAutoLink" rel="noopener noreferrer" data-auth="NotApplicable" data-linkindex="0" style="margin: 0px; text-align: left;">https://patchwork.freedesktop.org/patch/605437/</a></i></div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
No, please do not check in the patch, it will make my fix not working.</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Regards,</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Jack</div>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Alex Deucher <alexdeucher@gmail.com><br>
<b>Sent:</b> Tuesday, 23 July 2024 03:52<br>
<b>To:</b> Christian König <ckoenig.leichtzumerken@gmail.com><br>
<b>Cc:</b> Xiao, Jack <Jack.Xiao@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; Deucher, Alexander <Alexander.Deucher@amd.com><br>
<b>Subject:</b> Re: [PATCH] drm/amdgpu/mes: fix mes ring buffer overflow</font>
<div> </div>
</div>
<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">
<div class="PlainText">Does this patch fix it?<br>
<a href="https://patchwork.freedesktop.org/patch/605437/">https://patchwork.freedesktop.org/patch/605437/</a><br>
<br>
Alex<br>
<br>
On Mon, Jul 22, 2024 at 7:21 AM Christian König<br>
<ckoenig.leichtzumerken@gmail.com> wrote:<br>
><br>
> What I meant is that the MES ring is now to small for the number of packets written to it.<br>
><br>
> Either reduce the num_hw_submission or increase the MES ring size.<br>
><br>
> E.g. see this code here in amdgpu_ring_init:<br>
><br>
>         if (ring->funcs->type == AMDGPU_RING_TYPE_KIQ)<br>
>                 sched_hw_submission = max(sched_hw_submission, 256);<br>
>         else if (ring == &adev->sdma.instance[0].page)<br>
>                 sched_hw_submission = 256;<br>
><br>
> We are basically just missing a case for the MES here as far as I can see.<br>
><br>
> Regards,<br>
> Christian.<br>
><br>
> Am 22.07.24 um 10:46 schrieb Xiao, Jack:<br>
><br>
> [AMD Official Use Only - AMD Internal Distribution Only]<br>
><br>
><br>
> >> Can you try to reduce num_hw_submission for the MES ring?<br>
><br>
> Smaller num_hw_submission should not help for this issue, for Mes work without drm scheduler like legacy kiq. Smaller num_hw_submission will result in smaller mes ring size and more waiting time.<br>
><br>
> Regards,<br>
> Jack<br>
> ________________________________<br>
> From: Christian König <ckoenig.leichtzumerken@gmail.com><br>
> Sent: Monday, 22 July 2024 16:20<br>
> To: Xiao, Jack <Jack.Xiao@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; Deucher, Alexander <Alexander.Deucher@amd.com><br>
> Subject: Re: [PATCH] drm/amdgpu/mes: fix mes ring buffer overflow<br>
><br>
> Thx, but in that case this patch here then won't help either it just mitigates the problem.<br>
><br>
> Can you try to reduce num_hw_submission for the MES ring?<br>
><br>
> Thanks,<br>
> Christian.<br>
><br>
> Am 22.07.24 um 05:27 schrieb Xiao, Jack:<br>
><br>
> [AMD Official Use Only - AMD Internal Distribution Only]<br>
><br>
><br>
> >> I think we rather need to increase the MES ring size instead.<br>
><br>
> Unfortunately, it doesn't work. I guess mes firmware has limitation.<br>
><br>
> Regards,<br>
> Jack<br>
><br>
> ________________________________<br>
> From: Christian König <ckoenig.leichtzumerken@gmail.com><br>
> Sent: Friday, 19 July 2024 23:44<br>
> To: Xiao, Jack <Jack.Xiao@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; Deucher, Alexander <Alexander.Deucher@amd.com><br>
> Subject: Re: [PATCH] drm/amdgpu/mes: fix mes ring buffer overflow<br>
><br>
> Am 19.07.24 um 11:16 schrieb Jack Xiao:<br>
> > wait memory room until enough before writing mes packets<br>
> > to avoid ring buffer overflow.<br>
> ><br>
> > Signed-off-by: Jack Xiao <Jack.Xiao@amd.com><br>
> > ---<br>
> >   drivers/gpu/drm/amd/amdgpu/mes_v11_0.c | 18 ++++++++++++++----<br>
> >   drivers/gpu/drm/amd/amdgpu/mes_v12_0.c | 18 ++++++++++++++----<br>
> >   2 files changed, 28 insertions(+), 8 deletions(-)<br>
> ><br>
> > diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c<br>
> > index 8ce51b9236c1..68c74adf79f1 100644<br>
> > --- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c<br>
> > +++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c<br>
> > @@ -168,7 +168,7 @@ static int mes_v11_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,<br>
> >        const char *op_str, *misc_op_str;<br>
> >        unsigned long flags;<br>
> >        u64 status_gpu_addr;<br>
> > -     u32 status_offset;<br>
> > +     u32 seq, status_offset;<br>
> >        u64 *status_ptr;<br>
> >        signed long r;<br>
> >        int ret;<br>
> > @@ -196,6 +196,13 @@ static int mes_v11_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,<br>
> >        if (r)<br>
> >                goto error_unlock_free;<br>
> ><br>
> > +     seq = ++ring->fence_drv.sync_seq;<br>
> > +     r = amdgpu_fence_wait_polling(ring,<br>
> > +                                   seq - ring->fence_drv.num_fences_mask,<br>
> > +                                   timeout);<br>
> > +     if (r < 1)<br>
> > +             goto error_undo;<br>
> > +<br>
> >        api_status = (struct MES_API_STATUS *)((char *)pkt + api_status_off);<br>
> >        api_status->api_completion_fence_addr = status_gpu_addr;<br>
> >        api_status->api_completion_fence_value = 1;<br>
> > @@ -208,8 +215,7 @@ static int mes_v11_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,<br>
> >        mes_status_pkt.header.dwsize = API_FRAME_SIZE_IN_DWORDS;<br>
> >        mes_status_pkt.api_status.api_completion_fence_addr =<br>
> >                ring->fence_drv.gpu_addr;<br>
> > -     mes_status_pkt.api_status.api_completion_fence_value =<br>
> > -             ++ring->fence_drv.sync_seq;<br>
> > +     mes_status_pkt.api_status.api_completion_fence_value = seq;<br>
> ><br>
> >        amdgpu_ring_write_multiple(ring, &mes_status_pkt,<br>
> >                                   sizeof(mes_status_pkt) / 4);<br>
> > @@ -229,7 +235,7 @@ static int mes_v11_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,<br>
> >                dev_dbg(adev->dev, "MES msg=%d was emitted\n",<br>
> >                        x_pkt->header.opcode);<br>
> ><br>
> > -     r = amdgpu_fence_wait_polling(ring, ring->fence_drv.sync_seq, timeout);<br>
> > +     r = amdgpu_fence_wait_polling(ring, seq, timeout);<br>
> >        if (r < 1 || !*status_ptr) {<br>
> ><br>
> >                if (misc_op_str)<br>
> > @@ -252,6 +258,10 @@ static int mes_v11_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,<br>
> >        amdgpu_device_wb_free(adev, status_offset);<br>
> >        return 0;<br>
> ><br>
> > +error_undo:<br>
> > +     dev_err(adev->dev, "MES ring buffer is full.\n");<br>
> > +     amdgpu_ring_undo(ring);<br>
> > +<br>
> >   error_unlock_free:<br>
> >        spin_unlock_irqrestore(&mes->ring_lock, flags);<br>
> ><br>
> > diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c b/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c<br>
> > index c9f74231ad59..48e01206bcc4 100644<br>
> > --- a/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c<br>
> > +++ b/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c<br>
> > @@ -154,7 +154,7 @@ static int mes_v12_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,<br>
> >        const char *op_str, *misc_op_str;<br>
> >        unsigned long flags;<br>
> >        u64 status_gpu_addr;<br>
> > -     u32 status_offset;<br>
> > +     u32 seq, status_offset;<br>
> >        u64 *status_ptr;<br>
> >        signed long r;<br>
> >        int ret;<br>
> > @@ -182,6 +182,13 @@ static int mes_v12_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,<br>
> >        if (r)<br>
> >                goto error_unlock_free;<br>
> ><br>
> > +     seq = ++ring->fence_drv.sync_seq;<br>
> > +     r = amdgpu_fence_wait_polling(ring,<br>
> > +                                   seq - ring->fence_drv.num_fences_mask,<br>
><br>
> That's what's amdgpu_fence_emit_polling() does anyway.<br>
><br>
> So this here just moves the polling a bit earlier.<br>
><br>
> I think we rather need to increase the MES ring size instead.<br>
><br>
> Regards,<br>
> Christian.<br>
><br>
><br>
> > +                                   timeout);<br>
> > +     if (r < 1)<br>
> > +             goto error_undo;<br>
> > +<br>
> >        api_status = (struct MES_API_STATUS *)((char *)pkt + api_status_off);<br>
> >        api_status->api_completion_fence_addr = status_gpu_addr;<br>
> >        api_status->api_completion_fence_value = 1;<br>
> > @@ -194,8 +201,7 @@ static int mes_v12_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,<br>
> >        mes_status_pkt.header.dwsize = API_FRAME_SIZE_IN_DWORDS;<br>
> >        mes_status_pkt.api_status.api_completion_fence_addr =<br>
> >                ring->fence_drv.gpu_addr;<br>
> > -     mes_status_pkt.api_status.api_completion_fence_value =<br>
> > -             ++ring->fence_drv.sync_seq;<br>
> > +     mes_status_pkt.api_status.api_completion_fence_value = seq;<br>
> ><br>
> >        amdgpu_ring_write_multiple(ring, &mes_status_pkt,<br>
> >                                   sizeof(mes_status_pkt) / 4);<br>
> > @@ -215,7 +221,7 @@ static int mes_v12_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,<br>
> >                dev_dbg(adev->dev, "MES msg=%d was emitted\n",<br>
> >                        x_pkt->header.opcode);<br>
> ><br>
> > -     r = amdgpu_fence_wait_polling(ring, ring->fence_drv.sync_seq, timeout);<br>
> > +     r = amdgpu_fence_wait_polling(ring, seq, timeout);<br>
> >        if (r < 1 || !*status_ptr) {<br>
> ><br>
> >                if (misc_op_str)<br>
> > @@ -238,6 +244,10 @@ static int mes_v12_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,<br>
> >        amdgpu_device_wb_free(adev, status_offset);<br>
> >        return 0;<br>
> ><br>
> > +error_undo:<br>
> > +     dev_err(adev->dev, "MES ring buffer is full.\n");<br>
> > +     amdgpu_ring_undo(ring);<br>
> > +<br>
> >   error_unlock_free:<br>
> >        spin_unlock_irqrestore(&mes->ring_lock, flags);<br>
> ><br>
><br>
><br>
><br>
</div>
</span></font></div>
</div>
</body>
</html>