[PATCH v5 00/21] Host1x/TegraDRM UAPI

Tue Jan 26 02:45:55 UTC 2021

On 1/20/21 12:29 AM, Dmitry Osipenko wrote:
> 11.01.2021 15:59, Mikko Perttunen пишет:
>> Hi all,
>>
>> here's the fifth revision of the Host1x/TegraDRM UAPI proposal,
>> containing primarily small bug fixes. It has also been
>> rebased on top of recent linux-next.
>>
>> vaapi-tegra-driver has been updated to support the new UAPI
>> as well as Tegra186:
>>
>>    https://github.com/cyndis/vaapi-tegra-driver
>>
>> The `putsurface` program has been tested to work.
>>
>> The test suite for the new UAPI is available at
>> https://github.com/cyndis/uapi-test
>>
>> The series can be also found in
>> https://github.com/cyndis/linux/commits/work/host1x-uapi-v5.
>>
>> Older versions:
>> v1: https://www.spinics.net/lists/linux-tegra/msg51000.html
>> v2: https://www.spinics.net/lists/linux-tegra/msg53061.html
>> v3: https://www.spinics.net/lists/linux-tegra/msg54370.html
>> v4: https://www.spinics.net/lists/dri-devel/msg279897.html
>>
>> Thank you,
>> Mikko
> 
> The basic support for the v5 UAPI is added now to the Opentegra driver.
>   In overall UAPI works, but there are couple things that we need to
> improve, I'll focus on them here.
> 
> Problems
> ========
> 
> 1. The channel map/unmap API needs some more thought.
> 
> The main problem is a difficulty to track liveness of BOs and mappings.
>   The kernel driver refs BO for each mapping and userspace needs to track
> both BO and its mappings together, it's too easy to make mistake and
> leak BOs without noticing.
> 
> 2. Host1x sync point UAPI should not be used for tracking DRM jobs.  I
> remember we discussed this previously, but this pops up again and I
> don't remember where we ended previously.
> 
> This creates unnecessary complexity for userspace.  Userspace needs to
> go through a lot of chores just to get a sync point and then to manage
> it for the jobs.
> 
> Nothing stops two different channels to reuse a single sync point and
> use it for a job, fixing this will only add more complexity to the
> kernel driver instead of removing it.
> 
> 3. The signalling of DMA fences doesn't work properly in v5 apparently
> because of the host1x waiter bug.  I see that sync point interrupt
> happens, but waiter callback isn't invoked.
> 
> 4. The sync_file API is not very suitable for DRM purposes because of
> -EMFILE "Too many open files", which I saw while was running x11perf.
> It also adds complexity to userspace, instead of removing it.  This
> approach not suitable for DRM scheduler as well.
> 
> 5. Sync points have a dirty hardware state when allocated / requested.
> The special sync point reservation is meaningless in this case.
> 
> 6. I found that the need to chop cmdstream into gathers is a bit
> cumbersome for userspace of older SoCs which don't have h/w firewall.
> Can we support option where all commands are collected into a single
> dedicated cmdstream for a job?
> 
> Possible solutions for the above problems
> =========================================
> 
> 1. Stop to use concept of "channels". Switch to DRM context-only.
> 
> Each DRM context should get access to all engines once DRM context is
> created.  Think of it like "when DRM context is opened, it opens a
> channel for each engine".
> 
> Then each DRM context will get one instance of mapping per-engine for
> each BO.
> 
> enum tegra_engine {
> 	TEGRA_GR2D,
> 	TEGRA_GR3D,
> 	TEGRA_VIC,
> 	...
> 	NUM_ENGINES
> };
> 
> struct tegra_bo_mapping {
> 	dma_addr_t ioaddr;
> 	...
> };
> 
> struct tegra_bo {
> 	...
> 	struct tegra_bo_mapping *hw_maps[NUM_ENGINES];
> };
> 
> Instead of DRM_IOCTL_TEGRA_CHANNEL_MAP we should have
> DRM_IOCTL_TEGRA_GEM_MAP_TO_ENGINE, which will create a BO mapping for a
> specified h/w engine.
> 
> Once BO is closed, all its mappings should be closed too.  This way
> userspace doesn't need to track both BOs and mappings.
> 
> Everything that userspace needs to do is:
> 
> 	1) Open DRM context
> 	2) Create GEM
> 	3) Map GEM to required hardware engines
> 	4) Submit job that uses GEM handle
> 	5) Close GEM
> 
> If GEM wasn't mapped prior to the job's submission, then job will fail
> because kernel driver won't resolve the IO mapping of the GEM.
Perhaps we can instead change the reference counting such that if you 
close the GEM object, the mappings will be dropped as well automatically.

> 
> 2. We will probably need a dedicated drm_tegra_submit_cmd for sync point
> increments.  The job's sync point will be allocated dynamically when job
> is submitted.  We will need a fag for the sync_incr and wait_syncpt
> commands, saying "it's a job's sync point increment/wait"

Negative. Like I have explained in previous discussions, with the 
current way the usage of hardware resources is much more deterministic 
and obvious. I disagree on the point that this is much more complicated 
for the userspace. Separating syncpoint and channel allocation is one of 
the primary motivations of this series for me.

> 
> 3. We should use dma-fence API directly and waiter-shim should be
> removed.  It's great that you're already working on this.
> 
> 4. Sync file shouldn't be needed for the part of DRM API which doesn't
> interact with external non-DRM devices.  We should use DRM syncobj for
> everything related to DRM, it's a superior API over sync file, it's
> suitable for DRM scheduler.

Considering the issues with fileno limits, I suppose there is no other 
choice. Considering the recent NTSYNC proposal by Wine developers, maybe 
we should also have NTHANDLEs to get rid of restrictions of file 
descriptors. DRM syncobjs may have some advantages over sync files, but 
also disadvantages. They cannot be poll()ed, so they cannot be combined 
with waits for other resources.

I'll look into this for v6.

> 
> 5. The hardware state of sync points should be reset when sync point is
> requested, not when host1x driver is initialized.

This may be doable, but I don't think it is critical for this UAPI, so 
let's consider it after this series.

The userspace should anyway not be able to assume the initial value of 
the syncpoint upon allocation. The kernel should set it to some high 
value to catch any issues related to wraparound.

Also, this makes code more complicated since it now needs to ensure all 
waits on the syncpoint have completed before freeing the syncpoint, 
which can be nontrivial e.g. if the waiter is in a different virtual 
machine or some other device connected via PCIe (a real usecase).

> 
> 6. We will need to allocate a host1x BO for a job's cmdstream and add a
> restart command to the end of the job's stream.  CDMA will jump into the
> job's stream from push buffer.
> 
> We could add a flag for that to drm_tegra_submit_cmd_gather, saying that
> gather should be inlined into job's main cmdstream.
> 
> This will remove a need to have a large push buffer that will easily
> overflow, it's a real problem and upstream driver even has a bug where
> it locks up on overflow.
> 
> How it will look from CDMA perspective:
> 
> PUSHBUF |
> ---------
> ...     |      | JOB   |
>          |      ---------       | JOB GATHER |
> RESTART	------> CMD    |       --------------
>          |      |GATHER -------> DATA        |
> ... <---------- RESTART|       |            |
>          |      |       |
> 

Let me check if I understood you correctly:
- You would like to have the job's cmdbuf have further GATHER opcodes 
that jump into smaller gathers?

I assume this is needed because currently WAITs are placed into the 
pushbuffer, so the job will take a lot of space in the pushbuffer if 
there are a lot of waits (and GATHERs in between these waits)?

If so, perhaps as a simpler alternative we could change the firewall to 
allow SETCLASS into HOST1X for waiting specifically, then userspace 
could just submit one big cmdbuf taking only little space in the 
pushbuffer? Although that would only allow direct ID/threshold waits.

In any case, it seems that this can be added in a later patch, so we 
should omit it from this series for simplicity. If it is impossible for 
the userspace to deal with it, we could disable the firewall 
temporarily, or implement the above change in the firewall.

> 
> What's missing
> ==============
> 
> 1. Explicit and implicit fencing isn't there yet, we need to support DRM
> scheduler for that.
> 
> 2. The "wait" command probably should be taught to take a syncobj handle
> in order to populate it with a dma-fence by kernel driver once job is
> submitted.  This will give us intermediate fences of a job and allow
> utilize the syncobj features like "wait until job is submitted to h/w
> before starting to wait for timeout", which will be needed by userspace
> when DRM scheduler will be supported.
> 
> Miscellaneous
> =============
> 
> 1. Please don't forget to bump driver version.  This is important for
> userspace.

Sure. I didn't do it this time since it's backwards compatible and it's 
easy to detect if the new UAPI is available by checking for /dev/host1x. 
I can bump it in v6 if necessary.

> 
> 2. Please use a proper kernel coding style, use checkpatch.
> 
>     # git format-patch -v5 ...
>     # ./scripts/checkpatch.pl --strict v5*

Looks like I accidentally placed some spaces into firewall.c. Otherwise 
the warnings it prints do not look critical.

> 
> 3. Kernel driver needs a rich error messages for each error condition
> and it should dump submitted job when firewall fails.  It's very tedious
> to re-add it all each time when something doesn't work.

Yes, that's true. Will take a look for v6.

> 
> 4. Previously firewall was using the client's class if is_valid_class
> wasn't specified in tegra_drm_client_ops, you changed it and now
> firewall fails for GR3D because it doesn't have the is_valid_class() set
> in the driver.  See [1].
> 
> 5. The CDMA class should be restored to the class of a previous gather
> after the wait command in submit_gathers() and not to the class of the
> client.  GR2D supports multiple classes.  See [1].

Will take a look at these two for v6.

> 
> [1]
> https://github.com/grate-driver/linux/commit/024cba369c9c0e2762e9890068ff9944cb10c44f
> 

Mikko