[RFC] Host1x/TegraDRM UAPI (drm_tegra_submit_command)

Tue Jun 30 10:40:34 UTC 2020

On 6/29/20 3:00 AM, Dmitry Osipenko wrote:
> 28.06.2020 13:28, Mikko Perttunen пишет:
>> On 6/28/20 1:32 AM, Dmitry Osipenko wrote:
>>> 23.06.2020 15:09, Mikko Perttunen пишет:
>>>> /* Command is an opcode gather from a GEM handle */
>>>> #define DRM_TEGRA_SUBMIT_COMMAND_GATHER             0
>>>> /* Command is an opcode gather from a user pointer */
>>>> #define DRM_TEGRA_SUBMIT_COMMAND_GATHER_UPTR        1
>>>> /* Command is a wait for syncpt fence completion */
>>>> #define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNCPT        2
>>>> /* Command is a wait for SYNC_FILE FD completion */
>>>> #define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNC_FILE     3
>>>> /* Command is a wait for DRM syncobj completion */
>>>> #define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNCOBJ       4
>>>>
>>>> /*
>>>>    * Allow driver to skip command execution if engine
>>>>    * was not accessed by another channel between
>>>>    * submissions.
>>>>    */
>>>> #define DRM_TEGRA_SUBMIT_CONTEXT_SETUP                        (1<<0)
>>>>
>>>> struct drm_tegra_submit_command {
>>>>           __u16 type;
>>>>           __u16 flags;
>>>
>>> Shouldn't the "packed" attribute be needed if a non-32bit aligned fields
>>> are used?
>>
>> I guess we should change these to u32 for consistency.
>>
>>>
>>>>           union {
>>>>                   struct {
>>>>                       /* GEM handle */
>>>>                       __u32 handle;
>>>>
>>>>                       /*
>>>>                        * Offset into GEM object in bytes.
>>>>                        * Must be aligned by 4.
>>>>                        */
>>>>                       __u64 offset;
>>>
>>> 64bits for a gather offset is a bit too much, in most cases gathers are
>>> under 4K.
>>>
>>> u32 should be more than enough (maybe even u16 if offset is given in a
>>> dword granularity).
>>
>> I guess this can be changed to u32, though I don't think there is any
>> particularly pressing reason not to use u64.
>>
>> In any case, I think we concluded that we'll drop the GEM gather command
>> for now.
> 
> Sure, I commented this just in a case, for the future :)

Yep, OK :)

> 
>>>
>>>>                       /*
>>>>                        * Length of gather in bytes.
>>>>                        * Must be aligned by 4.
>>>>                        */
>>>>                       __u64 length;
>>>
>>> u32/16
>>>
>>>>                   } gather;
>>>
>>>>                   struct {
>>>>                           __u32 reserved[1];
>>>>
>>>>                           /*
>>>>                            * Pointer to gather data.
>>>>                            * Must be aligned by 4 bytes.
>>>>                            */
>>>>                           __u64 base;
>>>>                           /*
>>>>                            * Length of gather in bytes.
>>>>                            * Must be aligned by 4.
>>>>                            */
>>>>                           __u64 length;
>>>>                   } gather_uptr;
>>>
>>> What about to inline the UPTR gather data and relocs into the
>>> drm_tegra_submit_command[] buffer:
>>>
>>> struct drm_tegra_submit_command {
>>>      struct {
>>>          u16 num_words;
>>>          u16 num_relocs;
>>>
>>>          gather_data[];
>>>          drm_tegra_submit_relocation relocs[];
>>>      } gather_uptr;
>>> };
>>>
>>> struct drm_tegra_channel_submit {
>>>           __u32 num_syncpt_incrs;
>>>           __u32 syncpt_idx;
>>>
>>>           __u64 commands_ptr;
>>>      __u32 commands_size;
>>> ...
>>> };
>>>
>>> struct drm_tegra_submit_command example[] = {
>>>      cmd.gather_uptr{},
>>>      ...
>>>      gather_data[],
>>>      gather_relocs[],
>>>      cmd.wait_syncpt{},
>>>      ...
>>> };
>>>
>>> This way we will have only a single copy_from_user() for the whole
>>> cmdstream, which should be more efficient to do and nicer from both
>>> userspace and kernel perspectives in regards to forming and parsing the
>>> commands.
>>>
>>
>> I'm not sure I agree it being nicer with regard to forming and parsing
>> - Can you actually place multiple unsized arrays in a struct without
>> pointers?
> 
> No :) But there are no unsized arrays here. The parser will read first
> command and then move pointer to the next command based on size of the
> parsed command.
> 
>> - gather_data's length is a multiple of 4, and so since we have u64's in
>> drm_tegra_submit_relocation, alignment needs to be taken care of
>> manually, both when forming and kernel needs to validate it while
>> parsing. Depending on number of words in the gather, padding would need
>> to be inserted. We could swap the two around but it still feels more
>> brittle.
> 
> Yes, there will be unaligned reads on ARM64, but I don't think it's a
> problem. Isn't it?

It's not a big problem, but I think it would be less error-prone to have 
naturally aligned data.

> 
>> Also, if we read the gather from a separate address, userspace doesn't
>> necessarily need to know the length of the gather (and number of relocs)
>> upfront, so it's easier to have a builder pattern without needing to
>> copy things later.
> 
> Usually there are 2-3 relocs per gather, so userspace could simply
> maintain a fixed-sized static buffer for the relocs (say 32 relocs).
> Once gather is sliced by userspace, the stashed relocs will be appended
> to the comands buffer and next gather will be formed.
> 
> The relocs copying doesn't really costs anything for userspace, the
> price is incomparable with the cost of UPTR copying of each gather for
> the kernel.
> 
> Well, the UPTR copying isn't expensive, but if there is a single copy
> for the whole IOCTL, then it's even much better!
> 
>> If we allow direct page mappings for gathers later, a separate address
>> would make things also a little bit easier. For direct page mappings
>> userspace would need to keep the opcode data unchanged while the job is
>> being executed, so userspace could keep an arena where the gathers are
>> placed, and refer to segments of that, instead of having to keep the
>> drm_tegra_submit_commands alive.
> 
> Okay, what about to have a single UPTR buffer for all gathers of a job?
> 
> struct drm_tegra_channel_submit {
> 	u64 commands_ptr;
> 	u64 gathers_ptr;
> 
> 	u32 commands_size;
> 	u32 gathers_size;
> 	...
> };
> 
> struct drm_tegra_submit_command {
> 	...
>          union {
>                  struct {
>                          /*
>                           * Length of gather in bytes.
>                           * Must be aligned by 4.
>                           */
>                          __u32 length;
>                  } gather_uptr;
> 	};
> };
> 
> Now we have a single UPTR gather that could be sliced into sub-gathers
> during of job's submission and also the whole data could be copied at
> once by kernel driver for the parsing purposes.
> 
> The userspace will have to preallocate a large-enough buffer upfront for
> the gathers. If it runs out of space in the buffer, then it should
> reallocate a larger buffer. Nice and simple :)
> 

I think that would work fine for the usecases that I can think of.

Mikko