... | ... | @@ -20,7 +20,7 @@ This would suggest the obvious improvement of surrounding the instruction with a |
|
|
|
|
|
TODO: test mixed workloads.
|
|
|
|
|
|
# Few lanes enabled.
|
|
|
# Few lanes enabled
|
|
|
|
|
|
With one lane enabled we only get ~14 Ginsns/sec, which is a far cry from the 169 Ginsns/sec one would expect with perfect scaling. In fact, going below 12 enabled lanes gives no further improvement. Even worse, in wave64 the 12-lane floor applies per half (unless a half has 0 lanes enabled, in which case that half can be disabled). So a mask of `exec_lo = 0x1` and `exec_hi = 0x1` only gives you ~7 Ginsns/sec.
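The numbers above are consistent with a simple cost model, sketched here in Python. This is only a fit to the figures quoted in this section, not a description of how the hardware actually schedules work: it assumes each non-empty 32-lane half is charged for at least 12 lanes, that an empty half costs nothing, and that cost grows linearly with the lane count above 12 (that last part is an assumption, nothing here measures it). The names `PEAK_LANE_RATE`, `half_cost` and `estimated_rate` are made up for illustration.

```python
# Rough cost model for the lane-count measurements above.
# Assumptions (not confirmed by the measurements beyond the quoted data points):
#   - a half with 1..12 enabled lanes is charged as if 12 lanes were enabled
#   - a half with 0 enabled lanes is skipped entirely
#   - above 12 lanes, cost is linear in the number of enabled lanes

PEAK_LANE_RATE = 169.0    # Ginsns/sec expected for one lane under perfect scaling
MIN_LANES_PER_HALF = 12   # observed floor below which nothing improves


def half_cost(mask32: int) -> int:
    """Lanes charged for one 32-lane half of the exec mask."""
    lanes = bin(mask32 & 0xFFFFFFFF).count("1")
    if lanes == 0:
        return 0  # the whole half can be disabled
    return max(lanes, MIN_LANES_PER_HALF)


def estimated_rate(exec_lo: int, exec_hi: int) -> float:
    """Estimated wave64 instruction rate in Ginsns/sec for a given exec mask."""
    cost = half_cost(exec_lo) + half_cost(exec_hi)
    if cost == 0:
        raise ValueError("no lanes enabled")
    return PEAK_LANE_RATE / cost


if __name__ == "__main__":
    print(f"exec_lo=0x1, exec_hi=0x0: {estimated_rate(0x1, 0x0):.1f} Ginsns/sec")
    print(f"exec_lo=0x1, exec_hi=0x1: {estimated_rate(0x1, 0x1):.1f} Ginsns/sec")
```

Under this model a single lane in one half lands at roughly 169 / 12 ≈ 14 Ginsns/sec, and one lane in each half at roughly 169 / 24 ≈ 7 Ginsns/sec, matching the two measurements above.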
|
|
|
|
... | ... | |