Commit f01d28b6 authored by Alyssa Rosenzweig

pan/va: Add initial ISA.xml for Valhall



This handwritten file is the product of over a hundred hours of
reverse-engineering and represents the sum of what I've learned about
the Valhall architecture. It will be used in the next commits as the
backbone of a Valhall toolchain.
Signed-off-by: Alyssa Rosenzweig <alyssa@collabora.com>
parent fce0027d
<!--
Copyright (C) 2021 Collabora Ltd.
Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the "Software"),
to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice (including the next
paragraph) shall be included in all copies or substantial portions of the
Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
-->
<valhall>
<lut name="Immediates">
<desc>
These immediates are accessible in (almost) any instruction, provided the
immediate mode is kept to the default. They optimize for the most common
immediate values; any immediate listed here may be used without taking up
a uniform slot or a register. Most integer instructions can access
separate half-words and individual bytes via swizzles on the source.
</desc>
<constant desc="Zero">0x00000000</constant>
<constant desc="All ones; integer $-1$">0xFFFFFFFF</constant>
<constant desc="Maximum integer; floating-point NaN">0x7FFFFFFF</constant>
<constant desc="Integers $(-2, -3, -4, -5)$">0xFAFCFDFE</constant>
<constant desc="16-bit integer $2^8$">0x01000000</constant>
<constant desc="Multiples of 16 $(0, 32, 0, 128)$">0x80002000</constant>
<constant desc="Multiples of 16 $(48, 80, 96, 112)$">0x70605030</constant>
<constant desc="Multiples of 16 $(144, 160, 176, 192)$">0xC0B0A090</constant>
<constant desc="Integers $(0, 1, 2, 3)$">0x03020100</constant>
<constant desc="Integers $(4, 5, 6, 7)$">0x07060504</constant>
<constant desc="Integers $(8, 9, 10, 11)$">0x0B0A0908</constant>
<constant desc="Integers $(12, 13, 14, 15)$">0x0F0E0D0C</constant>
<constant desc="Integers $(16, 17, 18, 19)$">0x13121110</constant>
<constant desc="Integers $(20, 21, 22, 23)$">0x17161514</constant>
<constant desc="Integers $(24, 25, 26, 27)$">0x1B1A1918</constant>
<constant desc="Integers $(28, 29, 30, 31)$">0x1F1E1D1C</constant>
<constant desc="Float $1.0$">0x3F800000</constant>
<constant desc="Float $0.1$">0x3DCCCCCD</constant>
<constant desc="Float $1 / \pi$">0x3EA2F983</constant>
<constant desc="Float $\log(2)$">0x3F317218</constant>
<constant desc="Float $\pi$">0x40490FDB</constant>
<constant desc="Float $0.0$">0x00000000</constant>
<constant desc="Float $65535.0 = 2^{16} - 1$">0x477FFF00</constant>
<constant desc="Half-float $(255.0, 256.0) = (2^8 - 1, 2^8)$">0x5C005BF8</constant>
<constant desc="Half-float $0.1 = 1 / 10$">0x2E660000</constant>
<constant desc="Half-float $0.25 = 2^{-2}$">0x34000000</constant>
<constant desc="Half-float $0.5 = 2^{-1}$">0x38000000</constant>
<constant desc="Half-float $1.0 = 2^0$">0x3C000000</constant>
<constant desc="Half-float $2.0 = 2^1$">0x40000000</constant>
<constant desc="Half-float $4.0 = 2^2$">0x44000000</constant>
<constant desc="Half-float $8.0 = 2^3$">0x48000000</constant>
<constant desc="Half-float $\pi$">0x42480000</constant>
</lut>
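<!--
Several entries in the table above pack multiple small values into one 32-bit
word. As a minimal sketch (not part of the toolchain), the Python below decodes
a few entries the way a byte or half-word swizzle might see them. The helper
names `as_bytes`, `as_halves`, and `as_float` are hypothetical, and the sketch
assumes lane 0 sits in the least significant bits.

```python
import struct

def as_bytes(word):
    """Split a 32-bit word into four unsigned bytes, lane 0 first (assumed order)."""
    return tuple(struct.pack('<I', word))

def as_halves(word):
    """Reinterpret a 32-bit word as two half-floats, lane 0 first (assumed order)."""
    return struct.unpack('<2e', struct.pack('<I', word))

def as_float(word):
    """Reinterpret a 32-bit word as one single-precision float."""
    return struct.unpack('<f', struct.pack('<I', word))[0]

print(as_bytes(0x03020100))   # bytes of the "Integers (0, 1, 2, 3)" entry
print(as_halves(0x5C005BF8))  # halves of the "(255.0, 256.0)" entry
print(as_float(0x3F800000))   # the "Float 1.0" entry
```

Under the same assumption, the half-float entries with a zero low half (such as
0x3C000000) hold their documented value in the upper half-word.
-->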
<enum name="Action">
<desc>
Every Valhall instruction can perform an action, like wait on dependency
slots. A few special actions are available, specified in the instruction
metadata from this enum. The `wait0126` action is required to wait on
dependency slot #6 and should be set on the instruction immediately
preceding `ATEST`. The `barrier` action may be set on any instruction for
subgroup barriers, and should particularly be set with the `BARRIER`
instruction for global barriers. The `td` action only applies to fragment
shaders and is used to terminate helper invocations; it should be set as
early as possible after helper invocations are no longer needed as
determined by data flow analysis. The `return` action is used to terminate
the shader, although it may be overloaded by the `BLEND` instruction.
The `reconverge` action is required on any instruction immediately
preceding a possible change to the mask of active threads in a subgroup.
This includes all divergent branches, but it also includes the final
instruction at the end of any basic block where the immediate successor
(fallthrough) is the target of a divergent branch.
</desc>
<value name="Wait on all dependency slots">wait0126</value>
<value name="Subgroup barrier">barrier</value>
<value name="Perform branch reconverge">reconverge</value>
<reserved/>
<reserved/>
<value name="Terminate discarded threads">td</value>
<reserved/>
<value name="Return from shader">return</value>
</enum>
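<!--
The enums in this file list their values in order, with `reserved` placeholders
between them. The sketch below assumes (this is not stated in the file itself)
that a value's encoding is its position in the list, so that each `reserved`
entry consumes one encoding. The function name `enum_encodings` is hypothetical.

```python
import xml.etree.ElementTree as ET

ACTION_XML = """
<enum name="Action">
  <value name="Wait on all dependency slots">wait0126</value>
  <value name="Subgroup barrier">barrier</value>
  <value name="Perform branch reconverge">reconverge</value>
  <reserved/>
  <reserved/>
  <value name="Terminate discarded threads">td</value>
  <reserved/>
  <value name="Return from shader">return</value>
</enum>
"""

def enum_encodings(xml_text):
    """Map each named value to its position; reserved entries use up a position."""
    enum = ET.fromstring(xml_text.strip())
    # Ignore children such as <desc>, which do not occupy an encoding.
    slots = [child for child in enum if child.tag in ("value", "reserved")]
    return {child.text: index
            for index, child in enumerate(slots)
            if child.tag == "value"}

print(enum_encodings(ACTION_XML))
```

Under that assumption, `td` and `return` land at positions 5 and 7 because of
the reserved entries ahead of them.
-->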
<enum name="Immediate mode">
<desc>Selects how immediate sources are interpreted.</desc>
<value desc="No special immediates" default="true">none</value>
<value desc="Thread storage pointers">ts</value>
<reserved/>
<value desc="Thread identification">id</value>
</enum>
<enum name="Thread storage pointers">
<desc>
Situated between the immediates hard-coded in the hardware and the
uniforms defined purely in software, Valhall has some special
"constants" passing through data structures. These are encoded like the
table of immediates, as if special constant $i$ were lookup table entry
$32 + i$. These special values are selected with the `.ts` modifier.
</desc>
<reserved/>
<reserved/>
<value desc="Thread local storage base pointer (low word)">tls_ptr</value>
<value desc="Thread local storage base pointer (high word)">tls_ptr_hi</value>
<reserved/>
<reserved/>
<value desc="Workgroup local storage base pointer (low word)">wls_ptr</value>
<value desc="Workgroup local storage base pointer (high word)">wls_ptr_hi</value>
</enum>
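<!--
The description above says special constant $i$ is encoded as if it were lookup
table entry $32 + i$. A minimal sketch of that arithmetic follows, assuming $i$
is the position of the value in the enum (reserved entries included). The names
`THREAD_STORAGE` and `special_index` are hypothetical.

```python
# Positions mirror the enum above; None marks a reserved entry.
THREAD_STORAGE = [None, None, "tls_ptr", "tls_ptr_hi",
                  None, None, "wls_ptr", "wls_ptr_hi"]

def special_index(name, table=THREAD_STORAGE):
    """Return the lookup-table index 32 + i for the special constant at position i."""
    return 32 + table.index(name)

for name in ("tls_ptr", "tls_ptr_hi", "wls_ptr", "wls_ptr_hi"):
    print(name, special_index(name))
```
-->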
<enum name="Thread identification">
<desc>
Situated between the immediates hard-coded in the hardware and the
uniforms defined purely in software, Valhall has some special
"constants" passing through data structures. These are encoded like the
table of immediates, as if special constant $i$ were lookup table entry
$32 + i$. These special values are selected with the `.id` modifier.
</desc>
<reserved/>
<reserved/>
<value desc="Lane ID">lane_id</value>
<reserved/>
<reserved/>
<reserved/>
<value desc="Core ID">core_id</value>
<reserved/>
<reserved/>
<reserved/>
<reserved/>
<reserved/>
<reserved/>
<reserved/>
<reserved/>
<reserved/>
<reserved/>
<reserved/>
<reserved/>
<reserved/>
<reserved/>
<reserved/>
<reserved/>
<reserved/>
<reserved/>
<reserved/>
<reserved/>
<reserved/>
<reserved/>
<reserved/>
<value desc="Program counter">program_counter</value>
<reserved/>
</enum>
<enum name="Swizzles (8-bit)">
<value default="true">b0123</value>
<value>b3210</value>
<value>b0101</value>
<value>b2323</value>
<value>b0000</value>
<value>b1111</value>
<value>b2222</value>
<value>b3333</value>
<value>b2301</value>
<value>b1032</value>
<value>b0011</value>
<value>b2233</value>
<reserved/>
<reserved/>
<reserved/>
<reserved/>
</enum>
<enum name="Lanes (8-bit)">
<desc>Used to select the 2 bytes for shifts of 16-bit vectors</desc>
<value>b02</value>
<reserved/>
<reserved/>
<reserved/>
<value>b00</value>
<value>b11</value>
<value>b22</value>
<value>b33</value>
<reserved/>
<reserved/>
<value>b01</value>
<value>b23</value>
<reserved/>
<reserved/>
<reserved/>
<reserved/>
</enum>
<enum name="Swizzles (16-bit)">
<value>h00</value> <!-- 0,2 -->
<value>h10</value>
<value default="true">h01</value>
<value>h11</value>
<value>b00</value> <!-- 0,0 -->
<value>b20</value> <!-- 1,1 -->
<value>b02</value> <!-- 2,2 -->
<value>b22</value> <!-- 3,3 -->
<value>b11</value>
<value>b31</value>
<value>b13</value> <!-- 0,1 -->
<value>b33</value> <!-- 2,3 -->
<value>b01</value>
<value>b23</value>
<reserved/>
<reserved/>
</enum>
<enum name="Swizzles (32-bit)">
<value default="true">none</value>
<reserved/>
<value>h0</value>
<value>h1</value>
<value>b0</value>
<value>b1</value>
<value>b2</value>
<value>b3</value>
</enum>
<enum name="Swizzles (64-bit)">
<value default="true">none</value>
<reserved/>
<value>h0</value>
<value>h1</value>
<value>b0</value>
<value>b1</value>
<value>b2</value>
<value>b3</value>
<value>w0</value>
<reserved/>
<reserved/>
<reserved/>
<reserved/>
<reserved/>
<reserved/>
<reserved/>
</enum>
<enum name="Lane (8-bit)" implied="true">
<value>b0</value>
<value>b1</value>
<value>b2</value>
<value>b3</value>
</enum>
<enum name="Lane (16-bit)" implied="true">
<value>h0</value>
<value>h1</value>
</enum>
<enum name="Round mode">
<desc>Corresponds to IEEE 754 rounding modes</desc>
<value desc="Round to nearest even" default="true">rte</value>
<value desc="Round to positive infinity">rtp</value>
<value desc="Round to negative infinity">rtn</value>
<value desc="Round to zero">rtz</value>
</enum>
<enum name="Result type">
<desc>
Comparison instructions like `FCMP` return a boolean but may encode this
boolean in a variety of ways. `i1` gives an OpenGL-style `0/1` boolean.
`m1` gives a Direct3D-style `0/~0` boolean. `f1` gives a floating-point
`0.0f / 1.0f` boolean. Switching between these modes is useful to fold a
boolean type convert into a comparison. `u1` is used internally to
implement 64-bit comparisons.
</desc>
<value desc="Integer 1">i1</value>
<value desc="Float 1">f1</value>
<value desc="Minus 1">m1</value>
<value desc="Low half of 64-bit compare">u1</value>
</enum>
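<!--
To illustrate the three boolean encodings described above, here is a small
sketch that turns a comparison outcome into the 32-bit word each result type
would produce. The function name `encode_result` is hypothetical, and `u1` is
left out because the description only says it is used internally.

```python
import struct

def encode_result(passed, result_type):
    """Encode a comparison outcome as a 32-bit word for the given result type."""
    if not passed:
        return 0x00000000  # false is all-zero bits for i1, m1, and f1 (0.0f)
    if result_type == "i1":
        return 0x00000001  # OpenGL-style true
    if result_type == "m1":
        return 0xFFFFFFFF  # Direct3D-style true, ~0
    if result_type == "f1":
        # Bit pattern of the float 1.0
        return struct.unpack('<I', struct.pack('<f', 1.0))[0]
    raise ValueError("unsupported result type: " + result_type)

for result_type in ("i1", "m1", "f1"):
    print(result_type, hex(encode_result(True, result_type)))
```
-->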
<enum name="Widen">
<value default="true">none</value>
<value>h0</value>
<value>h1</value>
<reserved/>
<reserved/>
<reserved/>
<reserved/>
<reserved/>
</enum>
<enum name="Clamp">
<desc>
Clamp applied to the destination of a floating-point instruction. Note the
clamps may be decomposed as two independent bits for `clamp_0_inf` and
`clamp_m1_1`, with `clamp_0_1` arising as the composition of `clamp_0_inf`
and `clamp_m1_1` in either order.
</desc>
<value default="true" desc="Identity">none</value>
<value desc="Clamp positive">clamp_0_inf</value>
<value desc="Clamp to $[-1, 1]$">clamp_m1_1</value>
<value desc="Clamp to $[0, 1]$">clamp_0_1</value>
</enum>
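<!--
The decomposition into two independent bits can be checked numerically. The
sketch below assumes the bit assignment follows the enum order (bit 0 for
`clamp_0_inf`, bit 1 for `clamp_m1_1`), which the file does not state
explicitly. The names `CLAMP_0_INF`, `CLAMP_M1_1`, and `apply_clamp` are
hypothetical, and NaN handling is ignored.

```python
CLAMP_0_INF = 1  # assumed bit for clamp_0_inf (enum position 1)
CLAMP_M1_1 = 2   # assumed bit for clamp_m1_1 (enum position 2)

def apply_clamp(x, mode):
    """Apply the clamps selected by the two mode bits to a float."""
    if mode & CLAMP_0_INF:
        x = max(x, 0.0)
    if mode & CLAMP_M1_1:
        x = min(max(x, -1.0), 1.0)
    return x

for x in (-2.0, -0.5, 0.25, 3.0):
    both = apply_clamp(x, CLAMP_0_INF | CLAMP_M1_1)
    swapped = apply_clamp(apply_clamp(x, CLAMP_M1_1), CLAMP_0_INF)
    print(x, both, swapped)
```

With both bits set, either order gives the same result as clamping to
$[0, 1]$ directly.
-->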
<enum name="Condition">
<desc>
Condition code. Type must be inferred from the instruction. IEEE 754 total
ordering only applies to floating point compares. "Not equal" and "greater
than or less than" are distinguished by NaN behaviour conforming to
the IEEE 754 specification.
</desc>
<value desc="Equal">eq</value>
<value desc="Greater than">gt</value>
<value desc="Greater than or equal">ge</value>
<value desc="Not equal">ne</value>
<value desc="Less than">lt</value>
<value desc="Less than or equal">le</value>
<value desc="Greater than or less than">gtlt</value>
<value desc="Totally ordered">total</value>
</enum>
<enum name="Dimension">
<desc>Texture dimension.</desc>
<value desc="1D or buffer">1d</value>
<value desc="2D or 2D array">2d</value>
<value desc="3D or 3D array">3d</value>
<value desc="Cube map or cube map array">cube</value>
</enum>
<enum name="LOD mode">
<desc>Level-of-detail selection mode in a texture instruction.</desc>
<value desc="Set to zero">zero</value>
<value desc="Computed based on neighboring fragments">computed</value>
<reserved/>
<reserved/>
<value desc="Explicitly specified in a register">explicit</value>
<value desc="Computed based on neighboring fragments added with bias in a register">computed_bias</value>
<value desc="Derived from a gradient descriptor in registers">grdesc</value>
<reserved/>
</enum>
<enum name="Register format">
<desc>Format of data loaded to / stored from registers for general memory access.</desc>
<reserved/>
<reserved/>
<value desc="32-bit floats">f32</value>
<value desc="16-bit floats">f16</value>
<value desc="32-bit unsigned integers">u32</value>
<reserved/>
<reserved/>
<reserved/>
</enum>
<enum name="Staging register count" implied="true">
<value>sr0</value>
<value>sr1</value>
<value>sr2</value>
<value>sr3</value>
<value>sr4</value>
<value>sr5</value>
<value>sr6</value>
<value>sr7</value>
</enum>
<enum name="Vector size">
<desc>Number of channels loaded/stored for general memory access.</desc>
<value default="true" desc="Scalar">none</value>
<value desc="2 channels">v2</value>
<value desc="3 channels">v3</value>
<value desc="4 channels">v4</value>
</enum>
<enum name="Memory size">
<desc>Number of bits loaded/stored for general memory access.</desc>
<value desc="8-bits">i8</value>
<value desc="16-bits">i16</value>
<value desc="24-bits">i24</value>
<value desc="32-bits">i32</value>
<value desc="48-bits">i48</value>
<value desc="64-bits">i64</value>
<value desc="96-bits">i96</value>
<value desc="128-bits">i128</value>
</enum>
<enum name="Slot">
<desc>
Dependency slot set on a message-passing instruction that writes to
registers. Before reading the destination, a future instruction must wait
on the specified slot. Slot #7 is for `BARRIER` instructions only.
</desc>
<value desc="Slot #0">slot0</value>
<value desc="Slot #1">slot1</value>
<value desc="Slot #2">slot2</value>
<reserved/>
<reserved/>
<reserved/>
<reserved/>
<value desc="Slot #7">slot7</value>
</enum>
<enum name="Store segment">
<desc>Memory segment written to by a `STORE` instruction.</desc>
<value desc="Global or workgroup local memory" default="none">global</value>
<value desc="Position output (in a position shader)">pos</value>
<value desc="Varyings with LEA_ATTR computed addresses">vary</value>
<value desc="Thread local storage">tl</value>
</enum>
<enum name="Subgroup size">
<desc>
Selects the effective subgroup size for subgroup operations. The hardware
warps are sixteen threads on Valhall, but subdividing a warp may be useful
for API requirements. In particular, derivatives may be calculated with
quads (four threads).
</desc>
<value desc="Two threads">subgroup2</value>
<value desc="Four threads">subgroup4</value>
<value desc="Eight threads">subgroup8</value>
<value desc="Sixteen threads" default="true">subgroup16</value>
</enum>
<enum name="Lane operation">
<desc>
Acts as a modifier on the lane specifier for a `CLPER` instruction. The
`accumulate` mode is required for efficient subgroup reductions.
</desc>
<value name="No operation" default="true">none</value>
<value name="Exclusive-or">xor</value>
<value name="Accumulate">accumulate</value>
<value name="Shift">shift</value>
</enum>
<enum name="Inactive result">
<desc>
Access to inactive lanes (due to divergence) in a subgroup is generally
undefined in APIs. However, the results of permuting with an inactive lane
with `CLPER.i32` are well-defined in Valhall: they return one of the
following values, as specified in the `CLPER.i32` instructions. Sometimes
certain values enable small optimizations.
</desc>
<value name="0x00000000" default="true">zero</value>
<value name="0xFFFFFFFF">umax</value>
<value name="0x00000001">i1</value>
<value name="0x00010001">v2i1</value>
<value name="0x80000000">smin</value>
<value name="0x7FFFFFFF">smax</value>
<value name="0x80008000">v2smin</value>
<value name="0x7FFF7FFF">v2smax</value>
<value name="0x80808080">v4smin</value>
<value name="0x7F7F7F7F">v4smax</value>
<value name="0x3F800000">f1</value>
<value name="0x3C003C00">v2f1</value>
<value name="0xFF800000">infn</value>
<value name="0x7F800000">inf</value>
<value name="0xFC00FC00">v2infn</value>
<value name="0x7C007C00">v2inf</value>
</enum>
<ins name="NOP" title="No operation" dests="0" opcode="0x00">
<desc>
Do nothing. Useful at the start of a block for waiting on slots required
by the first actual instruction of the block, to reconcile dependencies
after a branch. Also useful as the sole instruction of an empty shader.
</desc>
</ins>
<ins name="BRANCHZ" title="Compare to zero and branch" dests="0" opcode="0x1F">
<desc>
Branches to a specified relative offset if its source is nonzero (default)
or if its source is zero (if `.eq` is set). The offset is 27-bits and
sign-extended, giving an effective range of ±26-bits. The offset is
specified in units of instructions, relative to the *next* instruction.
Positive offsets may be interpreted as "number of instructions to skip".
Since Valhall instructions are 8 bytes, this operates as:
$$PC := \begin{cases} PC + 8 \cdot (\text{offset} \; + 1) &amp; \text{if} \;
\text{src} \stackrel{?}{=} 0 \\ PC + 8 &amp; \text{otherwise} \end{cases}$$
Used with comparison instructions to implement control flow. Tie the
source to a nonzero constant to implement a jump. May introduce
divergence, so generally requires `.reconverge` flow control.
</desc>
<src>Value to compare against zero</src>
<imm name="offset" start="8" size="27" signed="true"/>
<mod name="eq" start="36" size="1"/>
</ins>
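<!--
The branch semantics above can be sketched in a few lines of Python. The bit
positions come from the `imm` (start 8, size 27, signed) and `mod` (start 36)
elements of this entry; the function names `decode_offset` and `next_pc` are
hypothetical, and the instruction is modelled as a plain 64-bit integer.

```python
INSTRUCTION_BYTES = 8  # Valhall instructions are 8 bytes

def decode_offset(instruction):
    """Extract the 27-bit offset at bit 8 and sign-extend it."""
    raw = (instruction >> 8) & ((1 << 27) - 1)
    if raw & (1 << 26):      # sign bit of the 27-bit field
        raw -= 1 << 27
    return raw

def next_pc(pc, instruction, src):
    """Return the program counter after executing BRANCHZ."""
    eq = (instruction >> 36) & 1
    # Default: branch if the source is nonzero. With .eq: branch if it is zero.
    taken = (src == 0) if eq else (src != 0)
    if taken:
        return pc + INSTRUCTION_BYTES * (decode_offset(instruction) + 1)
    return pc + INSTRUCTION_BYTES

skip_three = 3 << 8  # offset = 3, .eq clear
print(next_pc(0, skip_three, src=1))
print(next_pc(0, skip_three, src=0))
```

An offset of zero falls through to the next instruction whether or not the
branch is taken, matching the "number of instructions to skip" reading.
-->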
<ins name="DISCARD" title="Discard fragment" opcode="0x20">
<desc>
Evaluates the given condition, and if it passes, discards the current
fragment and terminates the thread. The destination should be set to R60.
Only valid in a fragment shader.
</desc>
<cmp/>
<dest>Updated coverage mask (set to R60)</dest>
<src absneg="true" swizzle="true">Left value to compare</src>
<src absneg="true" swizzle="true">Right value to compare</src>
</ins>
<ins name="BRANCHZI" title="Compare to zero and branch indirect" opcode="0x2F">
<desc>
Jump to an indirectly specified address. Used to jump to blend shaders at
the end of a fragment shader.
</desc>
<src>Value to compare against zero</src>
<src>Branch target</src>
<mod name="eq" start="36" size="1"/>
</ins>
<ins name="BARRIER" title="Execution and memory barrier" opcode="0x45">
<desc>
General-purpose barrier. Must use slot #7. Must be paired with a
`.barrier` action on the instruction.
</desc>
<slot/>
</ins>
<group name="CSEL" title="Floating-point conditional select" dests="1">
<ins name="CSEL.f32" opcode="0x154"/>
<ins name="CSEL.v2f16" opcode="0x155"/>
<desc>
Evaluates the given condition and outputs either the true source or the
false source.
</desc>
<cmp/>
<src float="true">Left value to compare</src>
<src float="true">Right value to compare</src>
<src float="true">Return value if true</src>
<src float="true">Return value if false</src>
</group>
<group name="CSEL" title="Integer conditional select" dests="1">
<ins name="CSEL.u32" opcode="0x150"/>
<ins name="CSEL.v2u16" opcode="0x151"/>
<ins name="CSEL.i32" opcode="0x158"/>
<ins name="CSEL.v2i16" opcode="0x159"/>
<desc>
Evaluates the given condition and outputs either the true source or the
false source.
Valhall lacks integer minimum/maximum instructions. `CSEL` instructions
with tied operands form the canonical implementations of these
instructions. Similarly, the integer $\text{sign}$ function is canonically
implemented with a pair of `CSEL` instructions.
</desc>
<cmp/>
<src>Left value to compare</src>
<src>Right value to compare</src>
<src>Return value if true</src>
<src>Return value if false</src>
</group>
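<!--
The description above notes that integer minimum, maximum, and sign are built
from `CSEL` with tied operands. The sketch below models `CSEL` on Python
integers and shows one possible arrangement; the exact instruction sequences a
compiler would emit are an assumption, and the helper names (`csel`, `imin`,
`imax`, `isign`) are hypothetical.

```python
import operator

CONDITIONS = {"eq": operator.eq, "gt": operator.gt, "ge": operator.ge,
              "ne": operator.ne, "lt": operator.lt, "le": operator.le}

def csel(cond, left, right, if_true, if_false):
    """Model of CSEL: compare left and right, then pick one of two values."""
    return if_true if CONDITIONS[cond](left, right) else if_false

def imin(a, b):
    """Minimum: the compared operands are tied to the selected operands."""
    return csel("lt", a, b, a, b)

def imax(a, b):
    """Maximum: same tying with the opposite comparison."""
    return csel("gt", a, b, a, b)

def isign(x):
    """Sign via a pair of CSELs (one possible arrangement)."""
    positive = csel("gt", x, 0, 1, 0)
    return csel("lt", x, 0, -1, positive)

print(imin(3, -7), imax(3, -7), isign(-7), isign(0), isign(3))
```
-->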
<ins name="LD_VAR_SPECIAL" title="Load special varying" opcode="0x56">
<sr write="true"/>
<sr_count/>
<vecsize/>
<regfmt/>
<slot/>
<src/>
<imm name="index" start="12" size="4"/> <!-- 0 for pointx, 1 for pointy, 2 for fragw, 3 for fragz -->
</ins>
<group name="LD_VAR_IMM_F32" title="Load immediate varying">
<desc>Interpolates a given varying</desc>
<ins name="LD_VAR_IMM_F32" opcode="0x5C"/>
<ins name="LD_VAR_IMM_F16" opcode="0x5D"/>
<sr write="true"/>
<vecsize/>
<sr_count/>
<regfmt/>
<slot/>
<src/>
<src/>
<imm name="index" start="20" size="4"/>
</group>
<ins name="LD_ATTR_IMM" title="Load immediate attribute" opcode="0x66">
<sr_count/>
<vecsize/>
<regfmt/>
<slot/>
<sr write="true"/>
<src>Vertex ID</src>
<src>Instance ID</src>
<imm name="index" start="20" size="4"/>
</ins>
<ins name="LD_ATTR" title="Load indirect attribute" opcode="0x67">
<desc>The index must not diverge within a warp.</desc>
<vecsize/>
<regfmt/>
<slot/>
<sr_count/>
<sr write="true"/>
<src>Vertex ID</src>
<src>Instance ID</src>
<src>Index</src>
</ins>
<ins name="LEA_ATTR" title="Load effective address" opcode="0x5E">
<desc>
Loads the effective address of the position buffer (in a position shader)
or the varying buffer (in a varying shader). That is, the base pointer
plus the vertex's linear ID (the first source) times the buffer's
per-vertex stride. `LEA_ATTR` should be executed once in a
position/varying shader, with the linear ID preloaded as `r59`. Each
position/varying store can then be constructed as `STORE` with the base
address sourced from the 64-bit destination of `LEA_ATTR` and an
appropriately computed offset. Varying stores bypass the usual conversion
hardware for attributes; this diverges from earlier Mali hardware.
</desc>
<sr write="true"/>
<sr_count/>
<slot/>
<imm name="unk" start="8" size="4"/>
<src>Linear ID</src>
</ins>
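<!--
The address computation described above (base pointer plus linear ID times the
per-vertex stride) can be sketched as follows. This is only an arithmetic model
under the assumption of 64-bit wraparound; the function names `lea_attr` and
`store_address` and the sample values are hypothetical.

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF

def lea_attr(base, linear_id, stride):
    """Model of LEA_ATTR: base pointer plus linear ID times per-vertex stride."""
    return (base + linear_id * stride) & MASK_64

def store_address(vertex_base, offset):
    """Address of one position/varying store relative to the LEA_ATTR result."""
    return (vertex_base + offset) & MASK_64

vertex_base = lea_attr(base=0x1000, linear_id=3, stride=16)
print(hex(vertex_base), hex(store_address(vertex_base, 8)))
```
-->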
<ins name="LOAD" title="Global memory load" opcode="0x60">
<desc>Loads from main memory</desc>