Commit b9f1bf9c authored by Connor Abbott

bifrost: Document another header bit

parent c45164be
......@@ -30,7 +30,7 @@ Note that in addition to the execution unit, which we've been describing, there
- Load/store unit
- Texture unit
The execution unit interacts with these fixed-function blocks through special, variable-latency instructions in the FMA and ADD units. They bypass the usual, fixed-latency mechanism for reading/writing registers, and as such instructions in the same clause can't assume that registers have been read/written after the instruction is done (TODO: verify this for registers being read, and not just written). Instead, any dependent instructions must be put into a separate clause with the appropriate dependencies in the clause header set.
The execution unit interacts with these fixed-function blocks through special, variable-latency instructions in the FMA and ADD units. They bypass the usual, fixed-latency mechanism for reading/writing registers, and as such instructions in the same clause can't assume that registers have been read/written after the instruction is done. Instead, any dependent instructions must be put into a separate clause with the appropriate dependencies in the clause header set.
= Clauses
Conceptually, each clause consists of a clause header, followed by one or more 78-bit instruction words and then zero or more 60-bit constants. (Constants are actually 64 bits, but they're loaded through the same port as uniform registers and share the same field in the instruction word. That field includes 7 bits to choose which uniform register to load, some of which would be unused for constants, so ARM decided to be clever and stick the bottom 4 bits of the constant in each instruction where the constant is loaded. The actual constants in the instruction stream are therefore only 60 bits.) But the instruction fetching hardware only works in 128-bit quadwords, so each clause has to be a multiple of 128 bits. To make the representation of the clauses as compact as possible while still keeping the decoding circuitry relatively simple, the instructions are packed so that two 128-bit quadwords can store 3 78-bit instructions, or 3 128-bit quadwords can store 4 instructions and a 60-bit constant. There were some bits left over, which seem to have been used to obviate the need to keep track of state between each word, simplifying the decoder and making it possible to decode the quadwords in parallel. Thus, the quadwords can be (almost) arbitrarily reordered while still retaining the meaning of the clause. (It's unknown whether this works in practice, but theoretically it could be done.) Each format fully describes which instruction(s) in the decoded clause the bits in the quadword represent, and whether one of those instructions is the last instruction.
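
The bit budget behind these packings can be checked with a little arithmetic. The sketch below is illustrative only: the helper names (`spare_bits`, `split_constant`, `join_constant`) are made up here, and treating the in-instruction part as exactly the low 4 bits and the in-stream part as the upper 60 bits is a plain reading of the description above, not a verified encoding.

```python
QUADWORD_BITS = 128
INSTR_BITS = 78
CONST_BITS = 60  # bits of each constant stored in the clause itself


def spare_bits(quadwords, instructions, constants=0):
    """Bits left over after packing the payload into whole quadwords."""
    used = instructions * INSTR_BITS + constants * CONST_BITS
    spare = quadwords * QUADWORD_BITS - used
    if spare < 0:
        raise ValueError("payload does not fit in the given quadwords")
    return spare


def split_constant(value):
    """Split a 64-bit constant into (upper 60 bits, low 4 bits).

    Assumption: the low 4 bits travel in the instruction word and the
    upper 60 bits are stored in the instruction stream.
    """
    if not 0 <= value < (1 << 64):
        raise ValueError("constant must fit in 64 bits")
    return value >> 4, value & 0xF


def join_constant(upper60, low4):
    """Rebuild the 64-bit constant from its two parts."""
    return (upper60 << 4) | low4


print(spare_bits(2, 3))     # 3 instructions in 2 quadwords: 22 bits spare
print(spare_bits(3, 4, 1))  # 4 instructions + 1 constant in 3 quadwords: 12 bits spare
print(spare_bits(1, 0, 2))  # 2 constants in 1 quadword: 8 bits spare
```

The spare bits in each case are what leaves room for the per-quadword format information mentioned above.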
......@@ -204,13 +204,14 @@ For any remaining constants, we simply add quadwords with two constants each. No
== Clause Header
The clause header mainly contains information about "variable-latency" instructions like SSBO loads/stores/atomics, texture sampling, etc. that use a separate functional units. There can be at most one variable-latency instruction per clause. It also indicates when execution should stop, and has some information about branching. The format of the header is as follows:
The clause header mainly contains information about "variable-latency" instructions like SSBO loads/stores/atomics, texture sampling, etc. that use separate functional units. There can be at most one variable-latency instruction per clause. It also indicates when execution should stop, and has some information about branching. The format of the header is as follows:
[options="header"]
|============================
| Field | Bits
| unknown | 18
| Register | 6
| unknown | 17
| Data Register Write Barrier | 1
| Data Register | 6
| Scoreboard dependencies | 8
| Scoreboard entry | 3
| Instruction type | 4
......@@ -221,7 +222,7 @@ The clause header mainly contains information about "variable-latency" instructi
=== Register field
A lot of variable-latency instructions have to interact with the register file in ways that would be awkward to express in the usual manner, i.e. with the per-instruction register field. For example, the STORE instruction has to read up to 4 32-bit registers, which the usual pathways for reading a register can't handle -- they're designed for reading up to three 32-bit or 64-bit registers each cycle, and it also needs to load a 64-bit address from registers. The LOAD instruction can't write to the register until the operation has finished, possibly well after the instruction executes. For cases like these, there's a "register" field in the clause header that lets the variable-latency instruction read/write one, or a sequence of, registers, in a manner different than the usual one. Since there can only be one variable-latency instruction per clause, this field isn't ambiguous about which instruction it applies to. If more than one register is being read from or written to, it must be a power of two, and the register field must be aligned to that power of two. For example, a two-register source could be R0-R1 (if the register field is 0), R2-R3 (register field is 2), R4-R5, etc. Or a four-register source could be R0-R3, R4-R7, etc.
A lot of variable-latency instructions have to interact with the register file in ways that would be awkward to express in the usual manner, i.e. with the per-instruction register field. For example, the STORE instruction has to read up to 4 32-bit registers, which the usual pathways for reading a register can't handle -- they're designed for reading up to three 32-bit or 64-bit registers each cycle, and it also needs to load a 64-bit address from registers. The LOAD instruction can't write to the register until the operation has finished, possibly well after the instruction executes. For cases like these, there's a "register" field in the clause header that lets the variable-latency instruction read/write one, or a sequence of, registers, using a completely different mechanism. Since there can only be one variable-latency instruction per clause, this field isn't ambiguous about which instruction it applies to. If more than one register is being read from or written to, the register field must be aligned to the greatest power of two less than or equal to the number of registers. For example, a two-register source could be R0-R1 (if the register field is 0), R2-R3 (register field is 2), R4-R5, etc. A three-register source could be R0-R2, R2-R4, etc. Or a four-register source could be R0-R3, R4-R7, etc.
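
The alignment rule can be written down as a small check. This is a minimal sketch; the function names are hypothetical, and the only range check applied is that the starting register fits the 6-bit field from the header table.

```python
def required_alignment(count):
    """Greatest power of two less than or equal to the register count."""
    if count < 1:
        raise ValueError("at least one register is required")
    return 1 << (count.bit_length() - 1)


def is_valid_data_register(first_reg, count):
    """Check that a run of `count` registers may start at `first_reg`."""
    if not 0 <= first_reg < 64:  # the header field is 6 bits wide
        raise ValueError("register field out of range")
    return first_reg % required_alignment(count) == 0


print(is_valid_data_register(2, 2))  # R2-R3: True
print(is_valid_data_register(2, 3))  # R2-R4: True, three registers align to 2
print(is_valid_data_register(2, 4))  # R2-R5: False, four registers align to 4
```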
=== Dependency tracking
......@@ -282,6 +283,8 @@ The first clause in a program implicitly has no dependencies. This scheme makes
In addition to the normal 6 scoreboard entries available for clauses to wait on other clauses, there are two more entries reserved for tile operations. Bit 6 is cleared when depth and stencil values have been written for earlier fragments, so that the depth and stencil tests can safely proceed. The ATEST instruction (see patent) must wait on this bit. Bit 7 is cleared when blending has been completed for earlier fragments and the results written to the tile buffer, so that blending is possible. The BLEND instruction must wait on this bit. The blob also makes BLEND wait on bit 6, but I don't think that's necessary since it also waits on ATEST which waits on bit 6. These scoreboard entries provide similar functionality to the branch-on-no-dependency instruction on Midgard.
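
As a rough illustration, the 8-bit dependency mask could be assembled as follows. The names are hypothetical, and the sketch assumes that bit i of the "Scoreboard dependencies" field corresponds to scoreboard entry i, which matches the bit numbering used above but has not been checked against real encodings.

```python
NUM_CLAUSE_SLOTS = 6    # entries 0-5: ordinary clause-to-clause dependencies
DEPTH_STENCIL_SLOT = 6  # cleared once earlier depth/stencil writes are done
BLEND_SLOT = 7          # cleared once earlier blending has reached the tile buffer


def wait_mask(clause_slots=(), wait_depth_stencil=False, wait_blend=False):
    """Build an 8-bit mask of scoreboard entries to wait on."""
    mask = 0
    for slot in clause_slots:
        if not 0 <= slot < NUM_CLAUSE_SLOTS:
            raise ValueError("ordinary scoreboard entries are 0-5")
        mask |= 1 << slot
    if wait_depth_stencil:
        mask |= 1 << DEPTH_STENCIL_SLOT
    if wait_blend:
        mask |= 1 << BLEND_SLOT
    return mask


atest_mask = wait_mask(wait_depth_stencil=True)                        # 0x40
blend_mask = wait_mask(wait_blend=True)                                # 0x80
blob_blend_mask = wait_mask(wait_depth_stencil=True, wait_blend=True)  # 0xC0
```

The last mask corresponds to what the blob emits for BLEND; the second is the minimal mask argued for above.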
There is one last field in the header for write-after-read dependencies involving the data register. When an instruction like the STORE instruction executes, it may not read the data register right away. There might be too many requests queued for the load/store unit at once, so there isn't space for our STORE data. If another instruction clobbers the data to be stored, it will be corrupted when we actually go to read it. If we need to write to the data register for some other purpose, the "Data Register Write Barrier" bit must be set on the clause before the clause which actually does the write. This bit will stall execution until all the outstanding data register reads actually happen, so the contents aren't needed anymore. It's less heavy-handed than waiting for the STORE to actually complete, reducing the time spent stalling.
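
A compiler could decide where to set this bit with a simple forward scan. The sketch below is a conservative illustration under several assumptions: the `Clause` type and its fields are invented for the example, control flow is ignored, and a pending data register read is assumed to stay outstanding until a barrier is hit, even though in hardware it will eventually complete on its own.

```python
from dataclasses import dataclass


@dataclass
class Clause:
    # Registers read through the clause header's data register (e.g. by STORE).
    data_reg_reads: frozenset = frozenset()
    # Registers written by any instruction in the clause.
    reg_writes: frozenset = frozenset()
    # The "Data Register Write Barrier" bit in the clause header.
    write_barrier: bool = False


def place_write_barriers(clauses):
    """Set the barrier bit on the clause before each potentially clobbering write."""
    pending = set()  # registers whose data register read may not have happened yet
    for i, clause in enumerate(clauses):
        if pending & clause.reg_writes:
            # `pending` is only non-empty once an earlier clause exists, so i >= 1.
            clauses[i - 1].write_barrier = True
            pending.clear()  # the barrier waits for all outstanding reads
        pending |= clause.data_reg_reads
    return clauses


program = [
    Clause(data_reg_reads=frozenset({0, 1})),  # STORE reading R0-R1
    Clause(),                                  # unrelated work
    Clause(reg_writes=frozenset({1})),         # overwrites R1
]
place_write_barriers(program)
barriers = [c.write_barrier for c in program]
```

Here the barrier lands on the middle clause, the one immediately before the clause that overwrites R1.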
=== Instruction type
The "instruction type" and "next clause instruction type" fields tell whether the clause has a variable-latency instruction, and if it does, which kind. Unsurprisingly, the "next clause instruction type" field applies to the next clause to be executed. If the clause doesn't have any variable-latency instructions, then the whole scoreboarding mechanism is skipped -- the clause is always executed immediately and it never sets or clears any scoreboard bits.
......@@ -362,6 +365,7 @@ The uniform/const port also supports loading a few "special" 64-bit constants th
[options="header"]
|============================
| Field value | Special constant
| 00 | Always zero.
| 05 | Alpha-test data (used with ATEST)
| 06 | gl_FragCoord sample position pointer
| 08-0f | Blend descriptors 0-7 (used with BLEND to indicate which output to blend with)
......