nv30: can the driver be more aggressive in case of CACHE_ERROR?

Hi, currently running the piglit/CTS tests related to depth hardlocks the machine. (That's the easiest way to cause a hardlock, but probably not the only one.) It happens randomly and without warning; at some point we just see more and more of these:

Mar 15 21:44:29 Aquarius kernel: nouveau 0000:01:00.0: fifo: CACHE_ERROR - ch 0 [kworker/0:4[92]] subc 2 mthd 012c data 00000000
Mar 15 21:44:29 Aquarius kernel: nouveau 0000:01:00.0: fifo: CACHE_ERROR - ch 0 [kworker/0:4[92]] subc 2 mthd 0134 data 00000000
Mar 15 21:44:29 Aquarius kernel: nouveau 0000:01:00.0: fifo: CACHE_ERROR - ch 0 [kworker/0:4[92]] subc 2 mthd 0100 data 00000000
Mar 15 21:44:29 Aquarius kernel: nouveau 0000:01:00.0: fifo: CACHE_ERROR - ch 0 [kworker/0:4[92]] subc 2 mthd 0130 data 00000000

(these four mthd values repeat in a cycle)
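
To check whether it is always the same four methods that cycle, something like the following could tally them from a saved kernel log. This is only a rough sketch: the regex is written against the four lines quoted above and nothing else, and the names `CACHE_ERROR_RE`, `parse_cache_errors`, `count_methods` and `SAMPLE` are made up for illustration, not part of nouveau or mesa.

```python
import re
from collections import Counter

# Assumption: the format matches the nouveau fifo lines quoted in this report.
CACHE_ERROR_RE = re.compile(
    r"fifo: CACHE_ERROR - ch (?P<ch>\d+) \[(?P<proc>.+)\] "
    r"subc (?P<subc>\d+) mthd (?P<mthd>[0-9a-fA-F]+) data (?P<data>[0-9a-fA-F]+)"
)

# The four log lines from this report, used as sample input.
SAMPLE = [
    "Mar 15 21:44:29 Aquarius kernel: nouveau 0000:01:00.0: fifo: CACHE_ERROR - ch 0 [kworker/0:4[92]] subc 2 mthd 012c data 00000000",
    "Mar 15 21:44:29 Aquarius kernel: nouveau 0000:01:00.0: fifo: CACHE_ERROR - ch 0 [kworker/0:4[92]] subc 2 mthd 0134 data 00000000",
    "Mar 15 21:44:29 Aquarius kernel: nouveau 0000:01:00.0: fifo: CACHE_ERROR - ch 0 [kworker/0:4[92]] subc 2 mthd 0100 data 00000000",
    "Mar 15 21:44:29 Aquarius kernel: nouveau 0000:01:00.0: fifo: CACHE_ERROR - ch 0 [kworker/0:4[92]] subc 2 mthd 0130 data 00000000",
]


def parse_cache_errors(lines):
    """Return a (ch, subc, mthd, data) tuple for every CACHE_ERROR line."""
    errors = []
    for line in lines:
        m = CACHE_ERROR_RE.search(line)
        if m:
            errors.append(
                (int(m["ch"]), int(m["subc"]), int(m["mthd"], 16), int(m["data"], 16))
            )
    return errors


def count_methods(lines):
    """Count how often each (subc, mthd) pair shows up."""
    return Counter((subc, mthd) for _ch, subc, mthd, _data in parse_cache_errors(lines))
```

Calling `count_methods(open("kern.log"))` on a saved log (the file name is just an example) gives a count per (subc, mthd) pair, which should make it obvious if anything other than these four methods ever appears before the lockup.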

Later the machine stops responding; we can ssh into it, but there's nothing interesting in the logs. There's nothing weird in the ktsan logs either, so it's probably mesa's fault.

I talked with Karol about this. The consensus was that mesa is probably doing something weird/illegal. I don't have any experience debugging nouveau, so I'm opening this issue mainly to ask @skeggsb if he has an idea how to narrow down the problem.

The problem seems to happen on any nv30/nv40 GPU.
