lima: issues with current tile heap max size
With a fairly trivial test application drawing a few hundred very large triangles, I was able to reproduce some of the kernel task errors users have reported: https://github.com/enunes/mesa-test-programs/blob/trianglelimit/egl-color.c#L104
Running egl-color-kms with LIMIT=500 at 1080p, I get errors such as:
```
[94639.988039] lima d00c0000.gpu: fail to save task state from egl-color-kms pid 74218: error task list is full
[94639.997308] lima d00c0000.gpu: gp task error int_state=0 status=aa
[94645.633945] lima d00c0000.gpu: fail to save task state from egl-color-kms pid 74219: error task list is full
[94645.643052] lima d00c0000.gpu: gp task error int_state=0 status=aa
```
Or
```
[94658.945039] [drm:lima_sched_timedout_job [lima]] *ERROR* lima job timeout
[94658.951106] lima d00c0000.gpu: fail to save task state from egl-color-kms pid 74222: error task list is full
[94658.960897] lima d00c0000.gpu: pp task error 0 int_state=0 status=1
[94658.967117] lima d00c0000.gpu: pp task error 1 int_state=0 status=1
[94658.973212] lima d00c0000.gpu: pp task error 2 int_state=0 status=1
[94659.329534] lima d00c0000.gpu: timeout wait pmu cmd
```
This appears to be caused by the current tile heap size limit of 16M. If I understand correctly, that makes sense: tile heap usage depends on how many tiles the geometry covers, and drawing large geometry in a single batch stresses exactly that. If I locally raise the growable heap limit, the test can draw correspondingly more triangles.
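To make the scaling concrete, here is a rough back-of-the-envelope sketch. The 16x16 tile size and the per-entry byte cost are my assumptions for illustration, not numbers taken from the hardware docs; the point is only that heap usage grows with primitives × covered tiles, so a few hundred full-screen triangles at 1080p already land in the neighbourhood of the 16M limit:

```c
/* Rough estimate of tile heap pressure, assuming 16x16 tiles and a
 * hypothetical BYTES_PER_TILE_ENTRY cost per primitive per covered
 * tile -- both values are assumptions, for illustration only. */
#include <stdio.h>

#define TILE_SIZE            16
#define BYTES_PER_TILE_ENTRY 4   /* assumed, not from the docs */

int main(void)
{
    unsigned width = 1920, height = 1080, triangles = 500;

    unsigned tiles_x = (width  + TILE_SIZE - 1) / TILE_SIZE;  /* 120 */
    unsigned tiles_y = (height + TILE_SIZE - 1) / TILE_SIZE;  /*  68 */
    unsigned tiles   = tiles_x * tiles_y;                     /* 8160 */

    /* Worst case: every triangle covers every tile on screen. */
    unsigned long long bytes =
        (unsigned long long)triangles * tiles * BYTES_PER_TILE_ENTRY;

    printf("~%llu MiB of tile heap for %u full-screen triangles\n",
           bytes >> 20, triangles);   /* ~15 MiB, right around 16M */
    return 0;
}
```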
I feel like we should split the job once we hit some threshold, but I'm not sure what the best threshold to watch would be. A large command buffer full of small geometry would not run into this issue, so command buffer size alone is not a great metric, and predicting jobs accurately enough to keep them under the current tile heap size seems non-trivial. See the sketch below for one possible direction.
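Just to sketch the idea (the struct, fields and functions below are placeholders, not the actual lima driver API): the draw path could accumulate a rough per-draw estimate of tile-list growth and flush the current job early once it crosses a budget below the 16M heap limit.

```c
/* Hypothetical split heuristic -- fake_job, fake_flush() and
 * maybe_split_job() are placeholders, not real lima code.  The idea:
 * estimate tile-list growth per draw and flush the job early when a
 * budget below the tile heap limit would be exceeded. */

#define TILE_HEAP_BUDGET     (12 * 1024 * 1024)  /* stay under 16M */
#define BYTES_PER_TILE_ENTRY 4                   /* assumed cost, as above */

struct fake_job {
   unsigned fb_tiles;            /* tiles covered by the framebuffer */
   unsigned long long est_heap;  /* estimated tile heap usage so far */
};

static void fake_flush(struct fake_job *job)
{
   /* Submit the accumulated GP/PP work and start a fresh job. */
   job->est_heap = 0;
}

/* Called once per draw, before recording its commands. */
static void maybe_split_job(struct fake_job *job, unsigned num_prims,
                            unsigned est_tiles_per_prim)
{
   /* Crude worst-case estimate: every primitive touches
    * est_tiles_per_prim tiles, clamped to the framebuffer size. */
   unsigned tiles = est_tiles_per_prim < job->fb_tiles ?
                    est_tiles_per_prim : job->fb_tiles;
   unsigned long long cost =
      (unsigned long long)num_prims * tiles * BYTES_PER_TILE_ENTRY;

   if (job->est_heap + cost > TILE_HEAP_BUDGET)
      fake_flush(job);

   job->est_heap += cost;
}
```

The hard part is the per-primitive tile estimate: without transforming vertices on the CPU we can only guess, and a fully pessimistic guess (whole framebuffer per draw) would split jobs far too aggressively for small geometry.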
#3467 (closed) seems to be resolved by increasing the tile heap size limit.
Maybe there is some heuristic we can put in place now to avoid the crashes, and if the lima job handling needs further work for an optimal solution we can do that later on? Thoughts? @yuq825 @anarsoul @rellla