Manual instruction scheduling hasn't been useful on x86 since around 2002, since processors have deep instruction pipelines and reorder instructions aggressively.
I don't know the details of current processors, but around 2011, a 2x unroll (shift=1) would completely fill some execution unit (load/store or SSE arithmetic) for every possible Orc program. That expectation will trivially be false on modern processors if, for example, there are more than two load/store units, in which case a 4x unroll might offer additional benefit (and I expect it would).
IMO, it's quite reasonable to remove all the `compile_*.c` tests and the `orc_test_compile_*()` functions. Rationale:
@amyspark Looking through the code now.
```c
orc_x86_emit_mov_imm_reg (compiler, 4, 16, X86_EAX);
```
The magic value 16 in this line is the minimum optimal alignment for writes of SSE registers. That is, the 3-region method emits code so that in the middle region, writes are aligned to 16-byte addresses. And 16 was chosen because 16-byte alignment is better than 8-byte alignment and no worse than 32-byte alignment. This is reasonable for processors from 10-15 years ago, but I have no idea for 2023 processors.
HOWEVER, I'm not even sure that 16 byte alignment was better than 8 byte alignment on 2010-era processors. Cache line sizes were 16 bytes, so maybe being aligned to cache lines provided a minor benefit. But I have no specific recollection to justify 16 over 8. I'm guessing I made it 16 to solve the same problem you're running into here, which is that the 3-region pattern requires that this number be as large as the register size.
I'm guessing that if you change 16 to 32, it will start working. I don't see other cases of a magic "16" with that same meaning, but I did not search exhaustively.
I think the alternative to changing this to 32 is overhauling the 3-region method. It needs an overhaul anyway, since AVX has masked load/store and you can emit much faster code for regions 1/3. I worry that this starts to get tied up in a bunch of other refactors that Orc could benefit from.
benchmorc was meant to do exactly this, but was never completed. It was left in a prototype state. The patch below gets it closer to what you might expect as correct benchmark behavior, also redirecting to the SSE backend.
I don't recall the significance of the weights, although "ginger" would definitely be correct. These are names of actual machines, feathers was PowerPC and n900 was ARM.
I'm getting a score of 359 for SSE and 679 for AVX, so a pretty decent speedup. It appears that 100 is the speed of the backup function, which in this case would be emulation.
ORC: INFO: ../orc/orccpu-x86.c(297): orc_x86_cpuid_get_branding_string(): processor string 'Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz'
```diff
diff --git a/testsuite/benchmorc/benchmorc.c b/testsuite/benchmorc/benchmorc.c
index fd7514e..95e304a 100644
--- a/testsuite/benchmorc/benchmorc.c
+++ b/testsuite/benchmorc/benchmorc.c
@@ -68,9 +68,9 @@ main (int argc, char *argv[])
     double perf;
     double weight;
 
-    perf = orc_test_performance_full (programs[i], 0, NULL);
-    /* weight = weights_ginger[i]; */
-    weight = weights_feathers[i];
+    perf = orc_test_performance_full (programs[i], 0, "sse");
+    weight = weights_ginger[i];
+    /* weight = weights_feathers[i]; */
     /* weight = weights_n900[i]; */
 
     sum += weight * perf;
```
This is cool.
Currently, users reading the basic tutorials on the website have the option of choosing C, Python, or Node. However, there is only one tutorial for Python (and none for Node?), limiting the usefulness of the documentation. This change increases the coverage of the Python tutorials.
This is marked DRAFT because