I have two GPUs in my system: an integrated Intel GPU and a Sapphire Pulse Vega 56.
I boot with Intel as my primary GPU and use the Vega for VFIO (GPU passthrough) and GPU offloading.
What I'm trying to do is boot with the amdgpu driver bound to the Vega and rebind it to vfio-pci when I start a VM (qemu).
The problem occurs when I try to unbind Vega from amdgpu driver using this command:
echo -n "0000:03:00.0" > /sys/bus/pci/drivers/amdgpu/unbind
It results in a segfault with the following error in dmesg (full dmesg from boot to shutdown is attached):
[drm:amdgpu_pci_remove [amdgpu]] ERROR Device removal is currently not supported outside of fbcon
After that I'm unable to rebind the device to amdgpu or any other driver:
echo "0000:03:00.0" > /sys/bus/pci/drivers/amdgpu/bind
bash: echo: write error: No such device
Also, I'm unable to shut down properly. The shutdown process gets stuck at some point and only holding the button helps.
I've attached the relevant lspci -vvv output from before and after the unbind attempt, in case it's useful.
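For reference, the intended hand-over from amdgpu to vfio-pci can be sketched with the standard PCI sysfs files (unbind, driver_override, drivers_probe). This is only a minimal sketch of the sequence, not a workaround: on the affected kernels the unbind step is exactly where the failure occurs. The function name rebind_pci and the SYSFS_ROOT variable are made up for illustration; SYSFS_ROOT exists only so the steps can be dry-run against a scratch directory.

```shell
#!/bin/sh
# Sketch: move a PCI device from its current driver to another one via sysfs.
# SYSFS_ROOT is a hypothetical override for dry runs; on a real system it is /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# rebind_pci is an illustrative helper name, not an existing tool.
rebind_pci() {
    dev="$1"      # PCI address, e.g. 0000:03:00.0
    target="$2"   # driver to bind to, e.g. vfio-pci
    devdir="$SYSFS_ROOT/bus/pci/devices/$dev"

    [ -d "$devdir" ] || { echo "no such device: $dev" >&2; return 1; }

    # Unbind from the current driver, if one is attached.
    if [ -e "$devdir/driver" ]; then
        printf '%s' "$dev" > "$devdir/driver/unbind" || return 1
    fi

    # Restrict driver matching to the target, then ask the PCI core to probe.
    printf '%s' "$target" > "$devdir/driver_override" || return 1
    printf '%s' "$dev" > "$SYSFS_ROOT/bus/pci/drivers_probe" || return 1
}
```

On a real system this would be run as root with the target module already loaded, for example `rebind_pci 0000:03:00.0 vfio-pci`. Going back to amdgpu would presumably need driver_override cleared first (`echo > driver_override`) before probing again.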
Another thing I've tried is unbinding on kernel 4.19.60, and it just hangs after executing the command. I've attached the log of this attempt (the error is different from 5.2.1).
The result is the same but the backtrace seems a bit different. This was done with kernel 5.2.1.
I've tried suspend to RAM and another reset bug mitigation (which helps in other cases), but the GPU is still unusable after this failed attempt to unbind. I still can't rebind it to amdgpu or vfio-pci, and a clean shutdown is not happening.
I couldn't rebind my RX 470 or shut the system down cleanly after unbinding it on any kernel my NixOS has had since I got the card last winter. Reproduced OP's method on 4.19.64, got severe warnings and an oops, and "modprobe -r amdgpu" just hangs.
I'll do more testing, but it seems that unbind works with kernel 5.3-rc7.
There is still this error in the log:
[drm:amdgpu_pci_remove [amdgpu]] ERROR Device removal is currently not supported outside of fbcon
but without any backtraces, and unbind seems to succeed with and without X running (on the other GPU, of course).
It'd be nice to have confirmation from other people.
I confirm that on 5.3-rc7 I could unbind/bind the RX 470 multiple times and shut the system down cleanly afterwards. I got some warnings with a trace in dmesg; now I'm going to check whether this affects system stability and whether my goal of switching the Radeon-powered seat between a Linux desktop (without a persistent session, of course) and a virtual machine is now reachable.
Since my last comment I've used this about a dozen times to switch between the Linux desktop and a Windows VM. One time amdgpu crashed after resume from suspend, but I'm not sure if it was related to this bug, and I was still able to reboot after it.
However I still get this warning sometimes on unbind:
ERROR Device removal is currently not supported outside of fbcon
is printed unconditionally, without checking whether DRM nodes are being used by userspace clients. I wonder if it's possible to implement such a check and prevent the unbind if they are.
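Until something like that exists in the driver, a rough userspace approximation is to look for processes that still hold a DRM node open before attempting the unbind. The sketch below scans the per-process fd symlinks under /proc; the function name drm_users and the PROC_ROOT variable are hypothetical and only there to make it testable. It is not equivalent to a kernel-side check (a client can open the node right after the scan), and reading other users' fd tables needs root.

```shell
#!/bin/sh
# Sketch: print the PIDs that hold an open file under a DRM device path.
# PROC_ROOT is a hypothetical override for testing; normally it is /proc.
PROC_ROOT="${PROC_ROOT:-/proc}"

# drm_users is an illustrative helper name, not an existing tool.
drm_users() {
    prefix="${1:-/dev/dri/}"   # pass e.g. /dev/dri/card1 to check one node only
    for fd in "$PROC_ROOT"/[0-9]*/fd/*; do
        [ -L "$fd" ] || continue
        target=$(readlink "$fd") || continue
        case "$target" in
            "$prefix"*)
                pid=${fd#"$PROC_ROOT"/}   # strip the proc root, leaving PID/fd/N
                echo "${pid%%/*}"         # keep only the PID component
                ;;
        esac
    done | sort -un
}
```

With two GPUs, the default prefix also matches the nodes of the GPU that stays with the host, so the node belonging to the card being unbound would have to be passed explicitly.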
Fedora 31, 5.3.1 kernel, 5700XT - still seeing problems with unbinding from the AMDGPU driver.
I have video=efifb:off in my kernel parameters to keep the efifb from ever using the card.
After stopping X and unbinding from vtcon0 and vtcon1, attempting to unbind the driver from the card yields the following error. After that I cannot bind a new driver to the card, and I can't shut down cleanly.
[ 140.760872] fbcon: Taking over console
[ 140.773454] Console: switching to colour frame buffer device 320x90
[ 577.562635] Console: switching to colour dummy device 80x25
[ 679.403956] VFIO - User Level meta-driver version: 0.3
[ 679.410718] [drm:amdgpu_pci_remove [amdgpu]] ERROR Device removal is currently not supported outside of fbcon
[ 679.410938] [drm] amdgpu: finishing device.
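The console step mentioned above (unbinding vtcon0 and vtcon1 before touching the driver) can be sketched as a loop over the vtconsole class in sysfs, writing 0 to each bind file. The function name unbind_vtcons and the SYSFS_ROOT variable are made up for illustration, and the sketch assumes that every vtcon entry should be released, as in the steps described.

```shell
#!/bin/sh
# Sketch: detach all virtual terminal consoles before unbinding the GPU driver.
# SYSFS_ROOT is a hypothetical override for dry runs; on a real system it is /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# unbind_vtcons is an illustrative helper name, not an existing tool.
unbind_vtcons() {
    for con in "$SYSFS_ROOT"/class/vtconsole/vtcon*; do
        # Skip the unexpanded glob or entries without a bind file.
        [ -f "$con/bind" ] || continue
        echo 0 > "$con/bind" || return 1
    done
}
```

Run as root after stopping X. Writing 1 to the same files would be the way to hand the console back afterwards.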
This issue hasn't had any activity since 2020-01-18. The AMD driver stack changes rapidly and contains lots of shared code across products so it's possible that it has already been fixed. Please upgrade to a current stable kernel and userspace stack and try again. If you still experience this issue with the latest driver stack, please capture relevant logging and open a new issue referring back to this one.