nir: Replacing GC with manual memory management
Early in the GLSL IR design, we opted to use ralloc for hierarchical memory management, with a mark-and-sweep-style operation to GC unused IR at the end of the compile pipeline. NIR inherited this decision. I think it would be worthwhile to re-evaluate it, since NIR is a much flatter IR, which makes memory management in compiler passes much easier.
The basic motivation I have is this pahole output from a debugoptimized amd64 build:
struct nir_alu_instr {
        nir_instr                  instr;                /*     0    32 */
        nir_op                     op;                   /*    32     4 */
        _Bool                      exact:1;              /*    36: 0   1 */
        _Bool                      no_signed_wrap:1;     /*    36: 1   1 */
        _Bool                      no_unsigned_wrap:1;   /*    36: 2   1 */

        /* XXX 5 bits hole, try to pack */
        /* XXX 3 bytes hole, try to pack */

        nir_alu_dest               dest;                 /*    40    72 */
        /* --- cacheline 1 boundary (64 bytes) was 48 bytes ago --- */
        nir_alu_src                src[];                /*   112     0 */

        /* size: 112, cachelines: 2, members: 7 */
        /* sum members: 108, holes: 1, sum holes: 3 */
        /* sum bitfield members: 3 bits, bit holes: 1, sum bit holes: 5 bits */
        /* last cacheline: 48 bytes */
};
struct ralloc_header {
        unsigned int               canary;               /*     0     4 */

        /* XXX 4 bytes hole, try to pack */

        struct ralloc_header *     parent;               /*     8     8 */
        struct ralloc_header *     child;                /*    16     8 */
        struct ralloc_header *     prev;                 /*    24     8 */
        struct ralloc_header *     next;                 /*    32     8 */
        void                       (*destructor)(void *); /*   40     8 */

        /* size: 48, cachelines: 1, members: 6 */
        /* sum members: 44, holes: 1, sum holes: 4 */
        /* last cacheline: 48 bytes */
} __attribute__((__aligned__(16)));
While ralloc_header has 8 bytes for canary+pad that won't be in a release build, we align the whole ralloc allocation to 16 bytes anyway, which puts us at 288 bytes for a 2-op NIR alu instr (16% memory overhead for ralloc). We're well past glibc's fast chunk size limit (https://sourceware.org/glibc/wiki/MallocInternals), which may be contributing to our malloc overhead pains.
How much runtime performance would we get if we weren't spending ~16% of our cache for NIR compiles on ralloc headers? How much would it reduce our maximum memory size? How much better would our caches work if we were freeing and reusing memory as we go instead of always generating new instructions and only freeing at the end of the compile?
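As a sketch of what freeing-and-reusing as we go could look like, a pass could recycle dead instructions through a simple free list instead of leaving garbage for an end-of-compile sweep. All of the names and types here are hypothetical, not actual NIR API:

```c
#include <stdlib.h>

/* Hypothetical fixed-size node standing in for a NIR instruction. */
typedef struct instr {
    struct instr *next; /* free-list link, reused while the node is dead */
    /* ... instruction payload would live here ... */
} instr;

typedef struct {
    instr *free_list;
} instr_pool;

/* Pop a recycled node if one exists, otherwise fall back to malloc. */
static instr *pool_alloc(instr_pool *p)
{
    if (p->free_list) {
        instr *i = p->free_list;
        p->free_list = i->next;
        return i;
    }
    return malloc(sizeof(instr));
}

/* A pass returns dead instructions immediately instead of waiting for
 * a GC sweep, so still-cache-hot memory is handed back out next. */
static void pool_free(instr_pool *p, instr *i)
{
    i->next = p->free_list;
    p->free_list = i;
}
```

A real version would need to cope with variable-size instructions (note the flexible src[] array above), so it would want per-size buckets or size classes rather than a single list.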
I wonder how hard it would really be to bite the bullet and stop doing GC. Given that we have asan these days, not just valgrind, and we have asan in at least some of our CI, the "undetected memory leaks" consideration is less strong than it was back when we made the GLSL IR decision.