First version of host1x intro

Terje Bergström tbergstrom at nvidia.com
Wed Dec 5 01:47:29 PST 2012


Hi,

I created a base for host1x introduction text, and pasted it into
https://gitorious.org/linux-tegra-drm/pages/Host1xIntroduction. For
convenience, I also copy it below.

As I've worked with all of this for so long, I cannot know what areas
are most interesting to you, so I just tried to put in the basics and
scope it to the features we've been discussing so far. Please point out
the features that you'd like more information on so I can add details.

2D is still totally missing from here. Everything is treated as generic
host1x clients. libdrm is touched on only briefly.

The text is written in LaTeX, and converted with pandoc. I beg for
forgiveness for any formatting oddities.

Hardware introduction
=====================

HOST1X is a front-end to a list of client units which deal with graphics
and multimedia. The most important features are channels for serializing
and offloading programming of the client units, and sync points for
synchronizing client units with each other, or with CPU.

Channels
--------

A channel is a push buffer containing HOST1X opcodes. The push buffer
boundaries are defined with `HOST1X_CHANNEL_DMASTART_0` and
`HOST1X_CHANNEL_DMAEND_0`. `HOST1X_CHANNEL_DMAGET_0` indicates the next
position within the boundaries that is going to be processed, and
`HOST1X_CHANNEL_DMAPUT_0` indicates the position of the last valid
opcode. Whenever `HOST1X_CHANNEL_DMAPUT_0` and `HOST1X_CHANNEL_DMAGET_0`
differ, command DMA copies commands from the push buffer to a command
FIFO.
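The producer/consumer relationship between DMAPUT and DMAGET can be
sketched as a simple ring; the structure and names below are
illustrative, not the real programming interface:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal model of a host1x push buffer: get/put are word offsets
 * inside [DMASTART, DMAEND). Illustrative only. */
struct pushbuf {
    uint32_t words[256];   /* backing memory between DMASTART and DMAEND */
    uint32_t put;          /* next free slot (DMAPUT), owned by CPU */
    uint32_t get;          /* next slot to fetch (DMAGET), owned by HW */
};

/* CPU-side producer: write one opcode word and publish it by
 * advancing DMAPUT. Hardware copies words into its command FIFO
 * as long as DMAGET != DMAPUT. */
static void pushbuf_write(struct pushbuf *pb, uint32_t opcode)
{
    pb->words[pb->put] = opcode;
    pb->put = (pb->put + 1) % 256;  /* real HW wraps via RESTART */
}

static int pushbuf_has_work(const struct pushbuf *pb)
{
    return pb->get != pb->put;
}
```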

If command DMA sees a GATHER opcode, it will copy a memory area to the
command FIFO. The number of words is indicated in the GATHER opcode,
and the base address is read from the following word. GATHERs are not
recursive.

The HOST1X command processor goes through the FIFO and executes the
opcodes. Each channel has some stored state, such as the client unit
the channel is currently talking to. The most important opcodes are:

-   SETCL for changing the target client unit

-   IMM, INCR, NONINCR, MASK for writing values to registers of the
    client unit

-   GATHER for instructing command DMA to fetch from another memory
    area

-   RESTART for instructing command DMA to start over from the
    beginning of the push buffer
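As an illustration of how such opcode words are built, the sketch
below packs the opcode number into bits 31:28 and the method offset
into bits 27:16; treat the exact field layouts as illustrative rather
than authoritative:

```c
#include <assert.h>
#include <stdint.h>

/* Host1x opcodes are single 32-bit words with the opcode number in
 * bits 31:28. Field layouts are shown for illustration. */
static uint32_t op_setcl(unsigned class_id, unsigned offset, unsigned mask)
{
    return (0u << 28) | (offset << 16) | (class_id << 6) | mask;
}

static uint32_t op_incr(unsigned offset, unsigned count)
{
    return (1u << 28) | (offset << 16) | count;  /* count auto-incrementing writes */
}

static uint32_t op_nonincr(unsigned offset, unsigned count)
{
    return (2u << 28) | (offset << 16) | count;  /* count writes to one register */
}

static uint32_t op_imm(unsigned offset, unsigned value)
{
    return (4u << 28) | (offset << 16) | value;  /* immediate 16-bit payload */
}

static uint32_t op_gather(unsigned count)
{
    return (6u << 28) | count;  /* base address follows in the next word */
}
```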

A channel's class can also be HOST1X itself. Register writes to the
HOST1X class invoke host class methods. The most important use is
`NV_CLASS_HOST_WAIT_SYNCPT_0`, which freezes the channel until a sync
point reaches a threshold value.

Synchronization
---------------

A sync point is a 32-bit register in HOST1X. There are 32 sync points in
Tegra2 and Tegra3. HOST1X can be programmed to assert an interrupt when
a value higher than a pre-determined threshold is written to a sync
point register. Each channel can also be frozen waiting for a
threshold to be reached.

Sync points are initialized to zero at boot-up, and treated as
monotonically incrementing counters with wrapping. The CPU can increment
a sync point by writing the sync point id (0-31 in Tegra2 and Tegra3) to
the register `HOST1X_SYNC_SYNCPT_CPU_INCR_0`. Client units all have a
sync point increment method at offset 0, and command streams use it to
request that a client unit increment a sync point. The parameters of the
increment method are a condition and a sync point id. The condition can
be `OP_DONE`, telling the unit to increment the sync point when the
preceding operations are done, or `RD_DONE`, indicating that the client
unit has finished all reads from buffers.
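A sketch of the parameter word for the increment method, assuming a
layout with the condition above the sync point id; the names, values,
and exact bit positions are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Conditions for the sync point increment method; numeric values are
 * illustrative assumptions. */
enum incr_cond {
    COND_IMMEDIATE = 0,
    COND_OP_DONE   = 1,  /* increment when preceding ops complete */
    COND_RD_DONE   = 2,  /* increment when all buffer reads are done */
};

/* Parameter word for the increment method at client offset 0,
 * assumed layout: condition in bits 15:8, sync point id in bits 7:0. */
static uint32_t incr_syncpt_param(enum incr_cond cond, unsigned syncpt_id)
{
    return ((uint32_t)cond << 8) | (syncpt_id & 0xff);
}
```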

Software
========

There are three components involved in programming HOST1X and its
client units. The Linux kernel contains the drivers tegradrm and
host1x. The user space library libdrm gains added functionality to
communicate with tegradrm, which in turn communicates with the host1x
driver.

This text discusses only the pieces relevant to HOST1X and its client
units, excluding frame buffer and display controller programming.

libdrm
======

libdrm communicates with the tegradrm kernel driver to allocate
buffers, create and send command streams, and synchronize.

TODO

tegradrm
========

tegradrm contains functionality for allocating buffers and opening
channels. The only channel available at the moment is the 2D channel,
which is handled by the 2D driver inside tegradrm.

Command stream management and synchronization are passed on from the 2D
driver to the host1x driver. The 2D driver inside tegradrm processes
requests from user space and makes the relevant calls into host1x.

host1x driver
=============

At bootup, host1x initializes the hardware: it clears the sync points
and registers interrupt handlers.

Sync points
-----------

Each sync point register is treated as a range. The range minimum is a
shadow copy of the sync point register, and the maximum tracks how many
increments we expect to be done. A fence is a pair (sync point id,
threshold value) indicating completion of an event of interest to
software.

Due to wrapping, software pre-checks every sync point wait, whether done
via a HOST1X channel or the CPU, because each wait is potentially for an
already expired fence. Any wait whose threshold value lies outside the
range ]min, max] is treated as already expired and is not sent to
HOST1X hardware.
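In unsigned 32-bit arithmetic, the ]min, max] membership test reduces
to a single subtraction that handles wrapping naturally. A minimal
sketch, not the driver's exact code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A fence (id, threshold) is still pending only if threshold lies in
 * the half-open range ]min, max], evaluated modulo 2^32 so that
 * wrapping sync point values compare correctly. Illustrative sketch. */
static bool fence_expired(uint32_t min, uint32_t max, uint32_t threshold)
{
    /* threshold in ]min, max]  <=>  (threshold - min - 1) < (max - min) */
    return (uint32_t)(threshold - min - 1) >= (uint32_t)(max - min);
}
```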

The sync point CPU wait is handled by registering the threshold value as
an event to the interrupt code, and waiting for completion of that
event.

Interrupt management
--------------------

HOST1X has two kinds of interrupt: generic and sync point. Generic
interrupts are not interesting in this scope, so this text focuses on
sync point threshold interrupts.

Interrupt code manages a sorted list of events, and their sync point
threshold values. The earliest event is kept first in the list.

`nvhost_intr_add_action()` adds an action to the event list. If the
event list was empty, HOST1X is programmed to assert an interrupt when
that threshold is reached.

When an interrupt is asserted, the event list is processed. Each event
whose threshold has been passed is moved to a completed list and
removed from the event list. Submit completion is treated specially:
because processing the event is heavy, it is handled only once even
when multiple submit complete events have completed.

`action_submit_complete()` handles all clean-up for completed jobs.

`action_wakeup()` and `action_wakeup_interruptible()` wake up a thread
waiting for a particular sync point threshold.

After the list of events is processed, the value of the head of the list
is written to HOST1X as the next interrupt threshold.
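The walk over the sorted event list might be sketched as follows; the
names and structures are illustrative:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative event record: the real driver keeps a linked list. */
struct sp_event { uint32_t threshold; int completed; };

/* events[] is sorted with the earliest threshold first. Events the
 * sync point value has passed are completed; the first remaining
 * threshold becomes the next interrupt trigger. Returns whether
 * anything is still pending. */
static int process_events(struct sp_event *events, size_t n,
                          uint32_t syncpt_val, uint32_t *next_threshold)
{
    size_t i;

    for (i = 0; i < n; i++) {
        /* list is sorted: stop at the first threshold not yet passed
         * (signed difference handles 32-bit wrap) */
        if ((int32_t)(syncpt_val - events[i].threshold) < 0)
            break;
        events[i].completed = 1;   /* move event to the completed list */
    }
    if (i == n)
        return 0;                  /* nothing pending, no interrupt needed */
    *next_threshold = events[i].threshold;  /* reprogram next trigger */
    return 1;
}
```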

Job management
--------------

Each command stream sent from user space to kernel is treated as a job.
User space indicates how many sync point increments that stream
generates, and which sync point register it’s using. It also indicates
the buffers involved with the command stream, and the locations in
command stream where the buffers are referred to. Last but not least,
user space indicates locations of sync point waits and thresholds.

The first action taken is to take a reference on all buffers in the
command stream. This includes the command stream buffers themselves,
but also the target buffers. Each buffer is also mapped to the target
hardware to obtain a device virtual address.

After this, relocation information is processed: each reference to a
target buffer in the command stream is replaced with the buffer's
device virtual address. To be able to do this, the relocation
information contains references both to the target buffer and to the
command stream.
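A minimal sketch of relocation patching, with an illustrative
structure in place of the real interface:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative relocation entry: which word of the command stream to
 * patch, and the mapped device address it should receive. */
struct reloc {
    size_t   cmdbuf_word;   /* word offset in the command stream */
    uint32_t target_iova;   /* device virtual address of target buffer */
    uint32_t target_offset; /* offset within the target buffer */
};

/* Replace each buffer reference in the stream with the buffer's
 * device virtual address plus offset. */
static void apply_relocs(uint32_t *cmdbuf, const struct reloc *relocs,
                         size_t n)
{
    for (size_t i = 0; i < n; i++)
        cmdbuf[relocs[i].cmdbuf_word] =
            relocs[i].target_iova + relocs[i].target_offset;
}
```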

After relocation, each wait is checked against expiration. Any wait
whose threshold has already expired will be converted to a no-wait by
writing `0x00000000` over the word. This will essentially turn any
expired wait into a wait for sync point register 0, value 0, and thus we
keep sync point 0 reserved for this purpose and never change it from
value 0.
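Combining the expiry check with the patching step, a sketch in which
the structure and the simplified helper are illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative wait record: where in the stream the wait's parameter
 * word lives, and which fence it waits for. */
struct waitchk {
    size_t   cmdbuf_word;
    uint32_t syncpt_id;
    uint32_t threshold;
};

/* Simplified stand-in for the wrap-aware range check described
 * earlier. */
static bool has_expired(uint32_t current, uint32_t threshold)
{
    return (int32_t)(current - threshold) >= 0;
}

/* Overwrite every expired wait with 0x00000000, turning it into a
 * wait on the reserved sync point 0 for value 0, which is always
 * satisfied. */
static void patch_expired_waits(uint32_t *cmdbuf, const struct waitchk *waits,
                                size_t n, const uint32_t *syncpt_vals)
{
    for (size_t i = 0; i < n; i++)
        if (has_expired(syncpt_vals[waits[i].syncpt_id], waits[i].threshold))
            cmdbuf[waits[i].cmdbuf_word] = 0x00000000;  /* no-wait */
}
```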

In an upstream kernel without IOMMU support, we also check the contents
of the command stream for any accesses to memory that are not covered
by the relocation information.

Next, the number of sync point increments and the id of the sync point
are checked. The sync point maximum value is incremented by the number
of increments, so the kernel ends up with a fence indicating when the
job has completed. Then each command stream is added to the push
buffer. With IOMMU support, GATHER opcodes referring to the command
streams are added to the channel push buffer; without IOMMU support,
the contents of the GATHERs are copied. The fence is added to the
interrupt event list as a submit complete action, and at this point the
job is submitted.
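The fence computation at submit time can be sketched as follows, with
illustrative names:

```c
#include <assert.h>
#include <stdint.h>

/* Shadow range for one sync point: min mirrors the hardware register,
 * max tracks the increments still expected. Illustrative only. */
struct syncpt_shadow { uint32_t min; uint32_t max; };

/* At submit, advance the shadow maximum by the number of increments
 * the stream will perform; the result is the job's completion fence. */
static uint32_t submit_fence(struct syncpt_shadow *sp, uint32_t num_incrs)
{
    sp->max += num_incrs;   /* wraps naturally in 32 bits */
    return sp->max;         /* job complete when sync point reaches this */
}
```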

When the fence for a job is reached, `action_submit_complete()` calls
`nvhost_cdma_update()`, which goes through the list of jobs in the
channel and frees the resources associated with all jobs whose fence
has been reached.

At submit time, we also start a timer for each job. If the timer
expires, the job is removed from the channel, and the sync point
increments that haven't been done are performed by the host1x driver.
This prevents the channel from remaining stuck when a command stream is
formed incorrectly and cannot complete.

