Manual instruction scheduling hasn't been useful on x86 since around 2002, since processors have deep instruction pipelines and reorder instructions aggressively.
I don't know the details of current processors, but around 2011, a 2x unroll (shift=1) would completely fill some execution unit (load/store or SSE arithmetic) for every possible Orc program. That expectation will trivially be false on modern processors if, for example, there are more than two load/store units, in which case a 4x unroll might offer additional benefit (and I expect it would).
IMO, it's quite reasonable to remove all the `compile_*.c` tests and the `orc_test_compile_*()` functions. Rationale:
@amyspark Looking through the code now.
```c
orc_x86_emit_mov_imm_reg (compiler, 4, 16, X86_EAX);
```
The magic value 16 in this line is the minimum optimal alignment for writes of SSE registers. That is, the 3-region method emits code so that in the middle region, writes are aligned to 16-byte addresses. And 16 was chosen because 16-byte alignment is better than 8-byte alignment and no worse than 32-byte alignment. This is reasonable for processors from 10-15 years ago, but I have no idea for 2023 processors.
HOWEVER, I'm not even sure that 16 byte alignment was better than 8 byte alignment on 2010-era processors. Cache line sizes were 16 bytes, so maybe being aligned to cache lines provided a minor benefit. But I have no specific recollection to justify 16 over 8. I'm guessing I made it 16 to solve the same problem you're running into here, which is that the 3-region pattern requires that this number be as large as the register size.
I'm guessing that if you change 16 to 32, it will start working. I don't see other cases of a magic "16" with that same meaning, but I did not search exhaustively.
I think the alternative to changing this to 32 is overhauling the 3-region method. It needs an overhaul anyway, since AVX has masked load/store and you can emit much faster code for regions 1/3. I worry that this starts to get tied up in a bunch of other refactors that Orc could benefit from.
benchmorc was meant to do exactly this, but was never completed. It was left in a prototype state. The patch below gets it closer to what you might expect as correct benchmark behavior, also redirecting to the SSE backend.
I don't recall the significance of the weights, although "ginger" would definitely be correct. These are names of actual machines, feathers was PowerPC and n900 was ARM.
I'm getting a score of 359 for SSE and 679 for AVX, so a pretty decent speedup. It appears that 100 is the speed of the backup function, which in this case would be emulation.
ORC: INFO: ../orc/orccpu-x86.c(297): orc_x86_cpuid_get_branding_string(): processor string 'Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz'
```diff
diff --git a/testsuite/benchmorc/benchmorc.c b/testsuite/benchmorc/benchmorc.c
index fd7514e..95e304a 100644
--- a/testsuite/benchmorc/benchmorc.c
+++ b/testsuite/benchmorc/benchmorc.c
@@ -68,9 +68,9 @@ main (int argc, char *argv[])
     double perf;
     double weight;
 
-    perf = orc_test_performance_full (programs[i], 0, NULL);
-    /* weight = weights_ginger[i]; */
-    weight = weights_feathers[i];
+    perf = orc_test_performance_full (programs[i], 0, "sse");
+    weight = weights_ginger[i];
+    /* weight = weights_feathers[i]; */
     /* weight = weights_n900[i]; */
 
     sum += weight * perf;
```
This is cool.
Currently, users reading the basic tutorials on the website have the option of choosing C, Python, or Node. However, there is only one tutorial for Python (and none for Node?), limiting the usefulness of the documentation. This change increases the coverage of the Python tutorials.
This is marked DRAFT because