[bug report] drm/amdgpu: add ring buffer information in devcoredump
Christian König
ckoenig.leichtzumerken at gmail.com
Mon Mar 18 10:32:51 UTC 2024
Am 15.03.24 um 20:08 schrieb Khatri, Sunil:
>
> Thanks for pointing these. I do have some doubt and i raised inline.
>
> On 3/15/2024 8:46 PM, Dan Carpenter wrote:
>> Hello Sunil Khatri,
>>
>> Commit 42742cc541bb ("drm/amdgpu: add ring buffer information in
>> devcoredump") from Mar 11, 2024 (linux-next), leads to the following
>> Smatch static checker warning:
>>
>> drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c:219 amdgpu_devcoredump_read()
>> error: we previously assumed 'coredump->adev' could be null (see line 206)
>>
>> drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> 171 static ssize_t
>> 172 amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
>> 173 void *data, size_t datalen)
>> 174 {
>> 175 struct drm_printer p;
>> 176 struct amdgpu_coredump_info *coredump = data;
>> 177 struct drm_print_iterator iter;
>> 178 int i;
>> 179
>> 180 iter.data = buffer;
>> 181 iter.offset = 0;
>> 182 iter.start = offset;
>> 183 iter.remain = count;
>> 184
>> 185 p = drm_coredump_printer(&iter);
>> 186
>> 187 drm_printf(&p, "**** AMDGPU Device Coredump ****\n");
>> 188 drm_printf(&p, "version: " AMDGPU_COREDUMP_VERSION "\n");
>> 189 drm_printf(&p, "kernel: " UTS_RELEASE "\n");
>> 190 drm_printf(&p, "module: " KBUILD_MODNAME "\n");
>> 191 drm_printf(&p, "time: %lld.%09ld\n", coredump->reset_time.tv_sec,
>> 192 coredump->reset_time.tv_nsec);
>> 193
>> 194 if (coredump->reset_task_info.pid)
>> 195 drm_printf(&p, "process_name: %s PID: %d\n",
>> 196 coredump->reset_task_info.process_name,
>> 197 coredump->reset_task_info.pid);
>> 198
>> 199 if (coredump->ring) {
>> 200 drm_printf(&p, "\nRing timed out details\n");
>> 201 drm_printf(&p, "IP Type: %d Ring Name: %s\n",
>> 202 coredump->ring->funcs->type,
>> 203 coredump->ring->name);
>> 204 }
>> 205
>> 206 if (coredump->adev) {
>> ^^^^^^^^^^^^^^
>> Check for NULL
> This is the check for NULL. Is there any issue here ?
>> 207 struct amdgpu_vm_fault_info *fault_info =
>> 208 &coredump->adev->vm_manager.fault_info;
>> 209
>> 210 drm_printf(&p, "\n[%s] Page fault observed\n",
>> 211 fault_info->vmhub ? "mmhub" : "gfxhub");
>> 212 drm_printf(&p, "Faulty page starting at address: 0x%016llx\n",
>> 213 fault_info->addr);
>> 214 drm_printf(&p, "Protection fault status register: 0x%x\n\n",
>> 215 fault_info->status);
>> 216 }
>> 217
>> 218 drm_printf(&p, "Ring buffer information\n");
>> --> 219 for (int i = 0; i < coredump->adev->num_rings; i++) {
>> ^^^^^^^^^^^^^^
>> Unchecked dereference
> Agree
>> 220 int j = 0;
>> 221 struct amdgpu_ring *ring = coredump->adev->rings[i];
>> 222
>> 223 drm_printf(&p, "ring name: %s\n", ring->name);
>> 224 drm_printf(&p, "Rptr: 0x%llx Wptr: 0x%llx RB mask: %x\n",
>> 225 amdgpu_ring_get_rptr(ring),
>> 226 amdgpu_ring_get_wptr(ring),
>> 227 ring->buf_mask);
>> 228 drm_printf(&p, "Ring size in dwords: %d\n",
>> 229 ring->ring_size / 4);
>> 230 drm_printf(&p, "Ring contents\n");
>> 231 drm_printf(&p, "Offset \t Value\n");
>> 232
>> 233 while (j < ring->ring_size) {
>> 234 drm_printf(&p, "0x%x \t 0x%x\n", j, ring->ring[j/4]);
>> 235 j += 4;
>> 236 }
>> 237 }
>> 238
>> 239 if (coredump->reset_vram_lost)
>> 240 drm_printf(&p, "VRAM is lost due to GPU reset!\n");
>> 241 if (coredump->adev->reset_info.num_regs) {
>> ^^^^^^^^^^^^^^
>> Here too
> Agree.
>> 242 drm_printf(&p, "AMDGPU register dumps:\nOffset: Value:\n");
>> 243
>> 244 for (i = 0; i < coredump->adev->reset_info.num_regs; i++)
>> 245 drm_printf(&p, "0x%08x: 0x%08x\n",
>> 246 coredump->adev->reset_info.reset_dump_reg_list[i],
>> 247 coredump->adev->reset_info.reset_dump_reg_value[i]);
>> 248 }
>> 249
>> 250 return count - iter.remain;
>> 251 }
>
>
> Although adev is a global structure and never in the code it is being
> checked for NULL as it wont be NULL until the driver is unloaded. I
> can add a check for adev in the beginning of the function
> amdgpu_devcoredump_read for completeness of the tool but still not
> very sure of it.
>
> Christian and Alex Do you agree with my understanding the adev does
> really need a validation for NULL. I dint see throughout the code adev
> to be validated for NULL. Do you recommend to add a check for NULL for
> adev in the above mentioned function/places.
>
No, that doesn't make sense. adev is mandatory to be around for the core
dump to be valid and accessible.
Without a device you don't have a core dump in the first place.
Regards,
Christian.
> Regards Sunil
>
>> regards,
>> dan carpenter
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20240318/0e282a92/attachment-0001.htm>
More information about the amd-gfx
mailing list