Commit f5771004 authored by Connor Abbott

bifrost: Describe 64-bit clause encoding

parent 0a95642d
......@@ -408,7 +408,7 @@ There is one last field in the header for write-after-read dependencies involvin
=== Instruction type
The "instruction type" and "next clause instruction type" fields tell whether the clause has a variable-latency instruction, and if it does, which kind. Unsurprisingly, the "next clause instruction type" field applies to the next clause to be executed. If the clause doesn't have any variable-latency instructions, then the whole scoreboarding mechanism is skipped -- the clause is always executed immediately and it never sets or clears any scoreboard bits.
The "instruction type" and "next clause instruction type" fields tell whether the clause has a variable-latency instruction, and if it does, which kind. Unsurprisingly, the "next clause instruction type" field applies to the next clause to be executed. If the clause doesn't have any variable-latency instructions, then the whole scoreboarding mechanism is skipped -- the clause is always executed immediately and it never sets or clears any scoreboard bits. The 64-bit instruction type is a bit special, since it changes the meaning of the entire clause, causing most operations to work on 64-bit quantities. For example, any `FMA.f32` instructions in a clause with a 64-bit instruction type will automatically be upgraded to `FMA.f64`. For more on the meaning of the 64-bit instruction type, see the <<64-bit-clauses,64-Bit Clauses>> part of the register field documentation.
[options="header"]
|============================
......@@ -416,6 +416,7 @@ The "instruction type" and "next clause instruction type" fields tell whether th
| 0 | no variable-latency instruction
| 5 | SSBO store
| 6 | SSBO load
| 15 | 64-bit (implicitly no variable-latency instruction)
|============================
TODO: fill out this table
......@@ -428,7 +429,7 @@ Now that we know how instructions and constants are packed into clauses, let's t
As the name suggests, this stage reads from and writes to the register file. The current instruction reads from the register file at the same time that the previous instruction writes to the register file. Thus, the field contains both reads from the current instruction and writes from the previous instruction. Presumably, the scheduler makes this happen by interleaving the execution of multiple clauses from different quads. It only executes one instruction from a given quad every 3 cycles, so that the register write phase of one instruction happens at the same time as the register read phase of the next. Of course, it's possible that the FMA and ADD stages take more than 1 cycle, and more threads are interleaved as a consequence; this is a microarchitectural decision that's not visible to us. The result is that a write to a register that's immediately read by the next instruction won't work, but that's never necessary anyways thanks to the passthrough sources detailed later.
The register file has four ports, two read ports, a read/write port, and a write port. Thus, up to 3 registers can be read during an instruction. These ports are represented directly in the instruction word, with a field for telling each port the register address to use. There are three outputs, corresponding to the three read ports, and two inputs, corresponding to the FMA and ADD results from the previous stage. The ports are controlled through what the ARM patent calls the "register access descriptor," which is a 4-bit entry that says what each of the ports should do. Finally, there is the uniform/const port, which is responsible for loading uniform registers and constants embedded in the clause. Note that the uniforms and constants share the same port, which means that only one uniform or one constant (but not both) can be loaded for an instruction. This port supplies 64 bits of data, though, so two 32-bit parts of the same 64-bit value can be accessed in the same instruction.
The register file has four ports, two read ports, a read/write port, and a write port. Thus, up to 3 registers can be read during an instruction, or 2 if the previous instruction writes 2 registers. These ports are represented directly in the instruction word, with a field for telling each port the register address to use. There are three outputs, corresponding to the three read ports, and two inputs, corresponding to the FMA and ADD results from the previous stage. The ports are controlled through what the ARM patent calls the "register access descriptor," which is a 4-bit entry that says what each of the ports should do. Finally, there is the uniform/const port, which is responsible for loading uniform registers and constants embedded in the clause. Note that the uniforms and constants share the same port, which means that only one uniform or one constant (but not both) can be loaded for an instruction. This port supplies 64 bits of data, though, so two 32-bit parts of the same 64-bit value can be accessed in the same instruction.
The format of the register part of the instruction word is as follows:
......@@ -467,6 +468,98 @@ Before we get to the actual format of the Control field, though, we need to desc
| 15 | Write ADD with Port 3, write FMA with Port 2
|============================
[#64-bit-clauses]
=== 64-bit clauses
When the 64-bit clause type is enabled, instructions operate on aligned 2-register pairs, where `Rn` contains the low 32 bits for each thread in a quad and `R(n+1)` contains the high 32 bits. One of the patents mentions an alternate mode where `Rn` contains the first two full 64-bit values for the quad and `R(n+1)` contains the second two, but so far this hasn't been observed. In this mode, the register field also has a totally different encoding. Ports 0-3 are similar, but now they load/store 64 bits each, like the uniform/const port (which remains the same). The format of the register field when this mode is enabled is:
[options="header"]
|===============================
| Field | Bits
| Uniform/const | 8
| Port 2 | 5
| Port 3 | 5
| Port 0 | 4
| Port 1 | 5
| Control | 5
| Unknown (=7) | 3
|===============================
Note that each port has one fewer bit, since the register must be 2-aligned. For example, setting the Port 0 field to 1 would mean loading the low 32 bits from `R2` and the high 32 bits from `R3`.
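As a small illustration of the alignment rule above, the following sketch maps a port field value to the two 32-bit registers it names. The function name `pair_registers` is made up for this example and does not come from any existing tool.

```python
def pair_registers(field_value):
    """Map a 2-aligned port field value to (low, high) 32-bit register numbers.

    Hypothetical helper: the field stores the register number divided by 2,
    so field value n names the pair R(2n) / R(2n+1).
    """
    low = 2 * field_value
    return low, low + 1


# The example from the text: a field value of 1 names R2 (low) and R3 (high).
print(pair_registers(1))
```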
The encoding of what each port does is now considerably more complicated. The previous scheme, where Port 1 may contain the actual control field, is expanded: the actual control field may now be spread across the Control, Port 1, and Port 3 fields!
First, we'll describe how ports 0 and 1 are encoded. The same scheme from before is used to save a bit from Port 0, where the fields are compared and both are subtracted from 31 (instead of 63) when `Port 0 > Port 1`. In addition, when `Port 0 == Port 1`, then only port 0 is actually active (port 1 is disabled). However, there are two cases we haven't described how to handle: when loading a single register pair which is at least `R32` (since this is too big for Port 0) and when loading no registers. Both are handled through a few special control values. There is a special value for Control to indicate that Port 1 is to hold a control value, and then the two cases are distinguished by the control value in Port 1. Based on the control value in Port 1, either Port 0 is disabled or 16 is implicitly added to its value, so that 0 loads `R32-33`.
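The ordinary case (no control value in Port 1) can be sketched as follows. This is only an illustration of the comparison trick described above, not code from an existing decoder; the function name and the use of "pair index" (register number divided by 2) are assumptions made for the example.

```python
def decode_read_ports(port0_field, port1_field):
    """Return the pair indices read through ports 0 and 1.

    Assumes the main Control field says both ports are plain reads,
    i.e. Port 1 does not hold a control value. A pair index n stands
    for registers R(2n) and R(2n+1).
    """
    if port0_field > port1_field:
        # Flipped encoding: both values were subtracted from 31 so that
        # a large port 0 index still fits in the 4-bit Port 0 field.
        return [31 - port0_field, 31 - port1_field]
    if port0_field == port1_field:
        # Equal fields mean only port 0 is active; port 1 is disabled.
        return [port0_field]
    return [port0_field, port1_field]


# Pair indices 20 and 25 (R40-41 and R50-51) are stored as 11 and 6.
print(decode_read_ports(11, 6))
```

The remaining two cases (a single pair at `R32` or above, and no reads at all) cannot be expressed this way, which is why they need the control value in Port 1.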
If there is a control value in Port 3, then it controls what port 2 does. Either Port 2 is disabled, or it is used to either load a register, store the FMA result, or store the ADD result. The only remaining cases are when both ports are used: either port 2 is used to load a register while port 3 stores the FMA or ADD result, or both ports are used to store the FMA/ADD results (port 2 stores FMA, port 3 stores ADD). These cases are indicated using the main Control field, in addition to a special case to indicate that there is a control value in Port 3. These cases are also specified in the Port 1 control field if it exists, although it's redundant.
Altogether, there are 6 possible cases indicated by the main Control field:
- Both port 0 and port 1 are used, and both port 2 and port 3 are used to store the FMA and ADD results.
- Both port 0 and port 1 are used, port 2 is used to load, and port 3 is used to store the ADD result.
- Same as above, but port 3 is used to store the FMA result.
- Both port 0 and port 1 are used, but port 2 usage is indicated by the control in port 3.
- Port 0 usage is indicated by the control in port 1, and both port 2 and port 3 are used to store the FMA and ADD results.
- Port 0 usage is indicated by the control in port 1, and port 2 usage is indicated by the control in port 3.
And 4 possible cases for port 1 as a control field:
- Add 16 to port 0, and both FMA and ADD are stored (no control in port 3)
- Port 0 is disabled, and both FMA and ADD are stored.
- Add 16 to port 0, and there is a control in port 3.
- Port 0 is disabled, and there is a control in port 3.
The actual encoding of the control field is shown below. The table shows all possible control values, even though each of them can only appear in one location. If a box is left blank, that means it's not specified by that control value (it is specified by the control value in another field). When multiple control values in multiple fields specify what the same field does, they must agree, or the encoding is invalid. The "allowed in" field specifies where this control field is allowed to be. In Port 0, "Read +16" means to add 16 to the value to get the actual register to be read.
[options="header"]
|===============================
| Value | Allowed in | Port 0 | Port 1 | Port 2 | Port 3
| 0 | Port 1 | Read +16 | Control | Write FMA | Write ADD
| 2 | Port 3 | | | Write ADD | Control
| 3 | Port 1 | Unused | Control | Write FMA | Write ADD
| 6 | Port 3 | | | Write FMA | Control
| 7 | Port 3 | | | Unused | Control
| 8 | Main | Read | Read | Write FMA | Write ADD
| 10 | Port 3 | | | Read | Control
| 12 | Port 1 | Read +16 | Control | | Control
| 15 | Port 1 | Unused | Control | | Control
| 17 | Main | Read | Read | Read | Write ADD
| 26 | Main | Read | Read | Read | Write FMA
| 27 | Main | Read | Read | | Control
| 29 | Main | | Control | Write FMA | Write ADD
| 31 | Main | | Control | | Control
|===============================
To decode the register field, first look up the entry corresponding to the value of the Control field (it had better be "allowed in Main"). If there are any fields in that entry that say "Control", look up their values in the same table (they had better have the appropriate "allowed in") and fill in the blanks with what they say. At the end, every field should have a consistent assignment.
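The decoding procedure can be sketched in a few lines. This is a minimal illustration, assuming the three fields have already been extracted from the instruction word; the names `CONTROL_TABLE` and `decode_control` are invented for the example, and the table is transcribed from the one above with `None` standing for a blank box.

```python
# value: (allowed in, Port 0, Port 1, Port 2, Port 3); None = left blank
CONTROL_TABLE = {
    0:  ("Port 1", "Read +16", "Control", "Write FMA", "Write ADD"),
    2:  ("Port 3", None, None, "Write ADD", "Control"),
    3:  ("Port 1", "Unused", "Control", "Write FMA", "Write ADD"),
    6:  ("Port 3", None, None, "Write FMA", "Control"),
    7:  ("Port 3", None, None, "Unused", "Control"),
    8:  ("Main", "Read", "Read", "Write FMA", "Write ADD"),
    10: ("Port 3", None, None, "Read", "Control"),
    12: ("Port 1", "Read +16", "Control", None, "Control"),
    15: ("Port 1", "Unused", "Control", None, "Control"),
    17: ("Main", "Read", "Read", "Read", "Write ADD"),
    26: ("Main", "Read", "Read", "Read", "Write FMA"),
    27: ("Main", "Read", "Read", None, "Control"),
    29: ("Main", None, "Control", "Write FMA", "Write ADD"),
    31: ("Main", None, "Control", None, "Control"),
}


def decode_control(main, port1_field, port3_field):
    """Return the roles of ports 0-3, or raise ValueError if invalid."""
    roles = [None, None, None, None]

    def merge(value, where):
        entry = CONTROL_TABLE.get(value)
        if entry is None or entry[0] != where:
            raise ValueError(f"control value {value} not allowed in {where}")
        for i, role in enumerate(entry[1:]):
            if role is None:
                continue  # blank box: specified by another control value
            if roles[i] is None:
                roles[i] = role
            elif roles[i] != role:
                raise ValueError(f"conflicting roles for port {i}")

    merge(main, "Main")
    if roles[1] == "Control":
        merge(port1_field, "Port 1")
    if roles[3] == "Control":
        merge(port3_field, "Port 3")
    if None in roles:
        raise ValueError("some port was left unspecified")
    return roles


# Main = 31 defers to both Port 1 (12) and Port 3 (7).
print(decode_control(31, 12, 7))
```

A combination such as Main = 29 with Port 1 = 12 is rejected, because the two entries disagree about what port 3 does.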
Now, we describe the encoding algorithm. First, we assign registers to ports in almost the same manner as before. If there are at least two registers to read, assign the first two to ports 0 and 1, with the smaller in port 0. If there are three registers to read, the third goes in port 2. If both FMA and ADD will write a register, they go in port 2 and port 3 respectively, as before. However, if only one stage writes a register, then port 2 is now preferred over port 3 -- only if port 2 is used for reading should port 3 hold the register to be written.
The rest of the encoding algorithm is described below.
* If both port 0 and port 1 are used:
** If port 0 reads `R32` or above, then subtract both from 31, similar to before, and store them.
** If port 3 is unused, set Main = 27 and set port 3 as below.
** If port 2 is used for reading:
*** If port 3 writes ADD, set Main = 17.
*** If port 3 writes FMA, set Main = 26.
** Otherwise, both port 2 and port 3 must be used for writing, so set Main = 8.
* Otherwise:
** If port 0 loads `R30` or below, set port 0 and port 1 to the same value, and then follow the instructions above as if both ports 0 and 1 were used.
** If both FMA and ADD are written, set Main = 29. Otherwise set Main = 31.
** If port 0 is used, set the value of the Port 0 field to the actual value minus 16, and:
*** If both FMA and ADD are written, set Port 1 = 0.
*** Otherwise, port 3 is unused, so set Port 1 = 12 and set port 3 as below.
** Otherwise:
*** If both FMA and ADD are written, set Port 1 = 3.
*** Otherwise, port 3 is unused, so set Port 1 = 15 and set port 3 as below.
* If port 3 is unused:
** If port 2 is unused, set Port 3 = 7.
** If port 2 writes ADD, set Port 3 = 2.
** If port 2 writes FMA, set Port 3 = 6.
** If port 2 reads, set Port 3 = 10.
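The port assignment and encoding steps above can be put together as the following sketch. It is an illustration under several assumptions rather than a reference encoder: registers are given as pair indices (register number divided by 2, so 0 to 31), the reads are distinct, unused fields are written as 0, the flip subtracts from 31 as in the description of ports 0 and 1, and the function name and returned dictionary layout are invented for the example.

```python
def encode_register_field(reads, fma_dest=None, add_dest=None):
    """Encode port fields for a 64-bit clause (hypothetical helper).

    reads: up to three distinct pair indices to read.
    fma_dest / add_dest: pair index written by the previous instruction's
    FMA / ADD stage, or None if that stage writes nothing.
    """
    reads = list(reads)
    writes = [w for w in (("fma", fma_dest), ("add", add_dest)) if w[1] is not None]
    if len(reads) > 3 or len(reads) + len(writes) > 4 or len(set(reads)) != len(reads):
        raise ValueError("cannot assign these accesses to four ports")

    # Port assignment: port 2 is preferred for a single write.
    port2 = ("read", reads[2]) if len(reads) == 3 else None
    port3 = None
    if len(writes) == 2:
        port2, port3 = writes
    elif len(writes) == 1:
        if port2 is None:
            port2 = writes[0]
        else:
            port3 = writes[0]

    # A single read of R30 or below is encoded as two equal fields.
    if len(reads) >= 2:
        pair01 = sorted(reads[:2])
    elif len(reads) == 1 and reads[0] <= 15:
        pair01 = [reads[0], reads[0]]
    else:
        pair01 = None

    if pair01 is not None:
        f0, f1 = pair01
        if f0 >= 16:  # R32 or above: flip so the value fits in 4 bits
            f0, f1 = 31 - f0, 31 - f1
        if port3 is None:
            main = 27
        elif port2[0] == "read":
            main = 17 if port3[0] == "add" else 26
        else:
            main = 8
    else:
        both_written = port3 is not None  # port 2 never reads in this branch
        main = 29 if both_written else 31
        if reads:
            f0, f1 = reads[0] - 16, (0 if both_written else 12)
        else:
            f0, f1 = 0, (3 if both_written else 15)

    if port3 is None:  # Port 3 holds a control value describing port 2
        f3 = {None: 7, "add": 2, "fma": 6, "read": 10}[port2[0] if port2 else None]
    else:
        f3 = port3[1]
    f2 = port2[1] if port2 else 0
    return {"Port 0": f0, "Port 1": f1, "Port 2": f2, "Port 3": f3, "Control": main}


# Read pair 20 (R40-41) only, write nothing: all three control fields are used.
print(encode_register_field([20]))
```

Running the decoding procedure on the resulting fields is a useful consistency check when experimenting with a sketch like this.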
=== Uniform/const port
Unlike the other ports, the uniform/const port always loads 64 bits at a time. If an FMA or ADD instruction only needs 32 bits of data, the high 32 bits or low 32 bits are selected later in the source field, described below.
The uniform/const bits describe what the uniform/const port should load. If the high bit is set, then the low 7 bits describe which pair of 32-bit uniform registers to load. For example, 10000001 would load from uniform registers 2 and 3. If the high bit isn't set, then the next-highest 3 bits indicate what 64-bit immediate to load, while the low 4 bits contain the low 4 bits of the constant. The mapping from bits to constants is a little strange:
......