[Intel-xe] Code review for Xe driver - First batch of comments (24/5)

Wed May 24 17:49:41 UTC 2023

On Wed, May 24, 2023 at 05:53:36PM +0300, Oded Gabbay wrote:
>Hi,
>I finally got to start the code review of the Xe driver and I want to
>send the first batch of CR comments, but before I paste them here, I'd
>like to do some alignment.
>
>First, I'm reviewing the code in gitlab, and I tagged the HEAD of the
>branch I reviewed so when you go over the comments, please do it vs.
>that tag. The tag name is xe-rev1-2023-05-02.
>
>Secondly, I'm putting all the comments in a text file (cr.txt) I added
>to my review branch (xe-rev1) on gitlab. What I'll do is every
>Thursday, I'll copy the new comments I've added and start a new mail
>thread here and then we can discuss it here. I'm sending the first
>batch today as tomorrow is a holiday here.
>
>Third, when you are doing fixes according to the code review, please
>take the above tag as the base. Don't fix it in your latest branch
>(drm-xe-next). Before doing rev2 we will probably rebase the commits
>on some more updated upstream HEAD, but I don't want to see any new
>code (except for a select few commits that got my approval).
>Otherwise, I will lose my hands and feet.

not sure about the process here... there are things that already changed
in drm-xe-next, it got even a new base. The current process for fixing
things in an old commits is to merge a patch done by
`git commit --fixup` which is appropriately squashed on next cleanup
rebase

$ git range-diff $(git log -n1 --merges --format=%H xe-rev1-2023-05-02)..xe-rev1-2023-05-02 \
		 $(git log -n1 --merges --format=%H xe/drm-xe-next)..xe/drm-xe-next

this shows things that changed on the "xe commit range" due to the
rebase and additional cleanups that happened.

>
>That's it for alignment. Here is the first batch of comments, grouped
>by filename:
>
>General:
>- Multiple checkpatch issues for many files. Please make sure no
>warnings/errors appear.

some are really impossible to deal with. Yes, there are some that slip,
but they should be analyzed on a case by case.

>
>- Multiple BUG_ON (over 200), and I believe 99% of them have no
>justification to do it.
>  It's not justified to crash the host if you got a bad pointer from
>the calling function, or
>  if some other parameter is not between the valid values. Just do
>drm_* and return -EINVAL.
>
>- Same thing for WARN_ON. I would *strongly* request that you convert
>many of them to drm_* and
>  just return error because in the data center, WARN_ON is often
>defined to behave the same as BUG_ON.
>
>xe_macros.h:
>- XE_IOCTL_ERR is defined to call drm_info. This is not allowed as you
>enable a user to spam the
>  kernel log. Please change that macro to use drm_dbg.
>
>- Also, please remove the FILE+LINE print. The drm_dbg macro will
>print the function and that tends
>  to be enough. If you want to print FILE+LINE, use ftrace.
>
>xe_drv.h:
>- DRIVER_TIMESTAMP never used
>- DRIVER_DATE looks a bit old...

probably carryover from i915? Historically they were updated by
maintainers. Should probably be removed altogether.

>
>xe_module.c:
>- Do we still need to keep the enable_guc param ? Isn't the exec list
>path broken ?

in the last (of multiple) discussions about this, the outcome was to
remove the fallback to execlist when GuC is not found. However it was
decided not to remove it completely as it helps to layer the driver
appropriately for future backend

maybe time to revisit that and simply remove the enable_guc together
with the execlist code.

>xe_drm.h:
>- Use short format in license, no need to copy entire license text.
>
>- Engine ioctls - why is it called an engine ? You are not creating an
>engine, you are creating
>                  an LRC which is assigned to some engine class. Isn't
>a context a more
>                  appropriate name ? (I don't refer to the
>engine_class struct which is fine)

This comes from the vulkan world. You are creating an engine containing
an LRC object and you can schedule it on an *hw* engine. From the API
point of view for vulkan this all makes sense. From an Intel GPU pov,
particularly looking at the bspec, it's very confusing.

I raised this some time ago at
https://gitlab.freedesktop.org/drm/xe/kernel/-/merge_requests/206#note_1632304

(...)

>xe_mmio.h:
>- xe_mmio_wait32 reg & val arguments are not in the same order as
>intel_wait_for_register in i915.
>  We are setting ourselves up to fail when copying stuff from i915.
>(comment from Jani)

there were big refactors in this area already, not only related to the
order of the fields, but it covers that too

Lucas De Marchi

>
>xe_bo.h:
>- PPAT_CACHED_PDE is defined as 0. Is this correct or it should be BIT_ULL(0) ?
>  If it is 0, the code in gen8_pde_encode is confusing.
>
>xe_vm.c:
>- Line 3301: remove extra ;
>
>- xe_vm_lookup() - the locking pattern here looks wrong. Usually it should be:
>  1. lock
>  2. load/find object
>  3. if found -> get (increase refcnt of object)
>  4. unlock
>  When you do the get after the unlock, you are open to a race where
>that object was already
>  deleted by another thread and therefore, you are holding a garbage pointer
>
>- xe_vm_destroy_ioctl() - I think this is not protected against
>concurrency. Two threads that
>will call it to destroy the same vm object might result in a race.
>assuming we have thread A & B, both threads can reach up to the call to
>xe_vm_close_and_put (line 1935) having a valid vm pointer. Then, let's
>say Thread A
>executes that function, which will make that vm object invalid. Then
>Thread B will try
>to execute that function and that will cause a crash.
>
>- Line 1124: I can understand you are exiting D3 state, but WHY are
>you doing this here ?
>
>That's it for this week.
>Thanks,
>Oded