amdgpu_bo_create can retry a single allocation 8 times on OOM
Suppose we're trying to do a small buffer allocation, but we're out of memory. amdgpu_bo_create tries to handle out-of-memory conditions by retrying failed allocations from the slab allocator, like this:
struct pb_slabs *slabs = get_slabs(ws, alloc_size);
entry = pb_slab_alloc(slabs, alloc_size, heap);
if (!entry) {
   /* Clean up buffer managers and try again. */
   amdgpu_clean_up_buffer_managers(ws);
   entry = pb_slab_alloc(slabs, alloc_size, heap);
}
if (!entry)
   return NULL;
The problem is that pb_slab_alloc calls into amdgpu_bo_slab_alloc when there are no free slabs, which recursively calls back into amdgpu_bo_create. Since each level of this recursion can call into the next level twice on the error path, the innermost allocation ends up being attempted up to 2^n times, where n is the depth of the recursion.
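The 2^n growth can be checked with a small standalone model. This is only a sketch, not the driver code: model_bo_create, count_innermost_attempts, and the innermost_attempts counter are hypothetical names, and the model assumes that every level needs a new slab and that the innermost (non-slab) allocation always fails.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical counter: number of times the innermost allocation was tried. */
static unsigned innermost_attempts;

/*
 * Hypothetical stand-in for amdgpu_bo_create. slab_levels is the number of
 * nested slab allocators between this request and the real allocation.
 * Assumption: the real allocation (slab_levels == 0) always fails (OOM).
 */
static bool model_bo_create(unsigned slab_levels)
{
    assert(slab_levels <= 16); /* keep the exponential model small */

    if (slab_levels == 0) {
        innermost_attempts++;
        return false; /* simulated out-of-memory */
    }

    /* First pb_slab_alloc: no free slab, so it recurses to create one. */
    if (model_bo_create(slab_levels - 1))
        return true;

    /* "Clean up buffer managers and try again": second recursive call. */
    return model_bo_create(slab_levels - 1);
}

/* Returns how often the innermost allocation runs for one failed request. */
static unsigned count_innermost_attempts(unsigned slab_levels)
{
    innermost_attempts = 0;
    (void)model_bo_create(slab_levels);
    return innermost_attempts;
}
```

Under these assumptions, count_innermost_attempts(n) doubles with each additional slab level, because every level calls the next one twice before reporting failure.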
A failed allocation can end up looking something like this:
begin 1224-byte allocation
  begin 8192-byte allocation
    begin 262144-byte allocation
      begin 2097152-byte allocation
      end 2097152-byte allocation (failure)
      clean_up_buffer_managers, 262144-byte allocation
      begin 2097152-byte allocation
      end 2097152-byte allocation (failure)
    end 262144-byte allocation (failure)
    clean_up_buffer_managers, 8192-byte allocation
    begin 262144-byte allocation
      begin 2097152-byte allocation
      end 2097152-byte allocation (failure)
      clean_up_buffer_managers, 262144-byte allocation
      begin 2097152-byte allocation
      end 2097152-byte allocation (failure)
    end 262144-byte allocation (failure)
  end 8192-byte allocation (failure)
  clean_up_buffer_managers, 1224-byte allocation
  begin 8192-byte allocation
    begin 262144-byte allocation
      begin 2097152-byte allocation
      end 2097152-byte allocation (failure)
      clean_up_buffer_managers, 262144-byte allocation
      begin 2097152-byte allocation
      end 2097152-byte allocation (failure)
    end 262144-byte allocation (failure)
    clean_up_buffer_managers, 8192-byte allocation
    begin 262144-byte allocation
      begin 2097152-byte allocation
      end 2097152-byte allocation (failure)
      clean_up_buffer_managers, 262144-byte allocation
      begin 2097152-byte allocation
      end 2097152-byte allocation (failure)
    end 262144-byte allocation (failure)
  end 8192-byte allocation (failure)
end 1224-byte allocation (failure)
This is more retries than necessary. I think we attempt the innermost allocation at most 8 times (2^3), since NUM_SLAB_ALLOCATORS is 3.
I noticed this while working on !18052. We likely need to fix this in multiple places, since other drivers that use pb_slab have also copied this code.