"PyTorch - Illegal Instruction (core dumped)" when loading torch module - ROCm 6.0
Brief summary of the problem:
New installation of Ubuntu 22.04.4 LTS on an older dual-processor HP Z800 workstation with an RX 7800XT 16GB.
Followed the steps listed in the "Quick-Start install guide" (https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html#rocm-install-quick) to install the required drivers and ROCm packages. After rebooting the system, I followed the tutorial "Install PyTorch for ROCm" (https://rocm.docs.amd.com/projects/radeon/en/latest/docs/install/install-pytorch.html) to install Torch and Torchvision.
I used conda venvs to ensure I was using Python 3.10 (to match the ROCm packages), and to avoid any conflicts with system packages.
Python fails to load the torch module from the latest ROCm release with an "Illegal instruction (core dumped)" message:
(rocm-python310) amon@valdor:~/20240225-1$ python -V
Python 3.10.13
(rocm-python310) amon@valdor:~/20240225-1$ pip list |grep torch
torch 2.1.2+rocm6.0
torchvision 0.16.1+rocm6.0
(rocm-python310) amon@valdor:~/20240225-1$ python
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Illegal instruction (core dumped)
(rocm-python310) amon@valdor:~/20240225-1$ exit
This does not happen if I use the CPU-only torch module installed from pytorch.org; the module imports successfully and generates output:
(base) amon@valdor:~/20240225-1$ conda activate torchcpu-python310
(torchcpu-python310) amon@valdor:~/20240225-1$ python -V
Python 3.10.13
(torchcpu-python310) amon@valdor:~/20240225-1$ pip list | grep torch
torch 2.2.1
torchaudio 2.2.1
torchvision 0.17.1
(torchcpu-python310) amon@valdor:~/20240225-1$ python
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> x = torch.rand(5, 3)
>>> print(x)
tensor([[0.0522, 0.2051, 0.4779],
[0.8953, 0.8550, 0.0763],
[0.4734, 0.2974, 0.2028],
[0.9460, 0.5906, 0.2668],
[0.3418, 0.1116, 0.6357]])
>>>
Some Google searches pointed to possible issues with the lack of AVX support on older processors (like the Xeon X5675 in this machine, which predates AVX), so I'm guessing the AMD PyTorch packages are being built with support only for newer CPUs?
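To check the AVX theory on the affected machine, the CPU feature flags can be read from /proc/cpuinfo. The following is only a minimal sketch: the helper names `parse_cpu_flags` and `missing_features` are made up for illustration, and the list of features checked (avx, avx2, fma) is an assumption about what a newer build might require, not something confirmed for the ROCm wheels.

```python
from pathlib import Path


def parse_cpu_flags(cpuinfo_text):
    """Return the set of feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # The first "flags" line is enough; all cores normally report the same set.
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(flags, wanted=("avx", "avx2", "fma")):
    """List the wanted features that are absent (the wanted list is an assumption)."""
    return [name for name in wanted if name not in flags]


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        missing = missing_features(parse_cpu_flags(cpuinfo.read_text()))
        print("missing:", ", ".join(missing) if missing else "none")
    else:
        print("/proc/cpuinfo not found; this check is Linux-only")
```

If `avx` shows up as missing, that would be consistent with the "Illegal instruction" crash, although it does not prove which instruction the ROCm build actually trips on.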
To make things even more 'interesting', there seem to be multiple ways to install PyTorch and they don't quite match:
- https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/3rd-party/pytorch-install.html
- https://rocm.docs.amd.com/projects/radeon/en/latest/docs/install/install-pytorch.html
The first link suggests using Docker images, which are also broken with the same behavior on my hardware (I tried).
Hardware description:
- CPU: dual Xeon X5675 @ 3.07GHz
- GPU: Sapphire AMD RX 7800XT
- System Memory: 96GB DDR3-1333
- Display(s): Dell U2719DC
- Type of Display Connection: HDMI
System information:
- Distro name and Version: Ubuntu 22.04.4 LTS
- Kernel version: 6.5.0-21-generic
- Custom kernel: N/A
- AMD official driver version: amdgpu/6.3.6-1718217.22.04, 6.5.0-21-generic, x86_64: installed
How to reproduce the issue:
- Do fresh Ubuntu 22.04 install, followed by all updates
- Go to https://rocm.docs.amd.com/projects/radeon/en/latest/docs/install/install-radeon.html, follow steps to add repos and get amdgpu script
- Install driver and ROCm - "amdgpu-install -y --usecase=graphics,rocm"
- Reboot
- Create new conda environment for Python 3.10: "conda create --name rocm-python310 python=3.10"
- Follow guide to install PyTorch for Radeon GPUs: https://rocm.docs.amd.com/projects/radeon/en/latest/docs/install/install-pytorch.html
- Once the install is completed, try to verify the installation
- Get "Illegal instruction (core dumped)" message
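The final verification step can also be run in a child process, so the crash is reported as a signal name instead of taking down the interactive interpreter. This is a rough sketch rather than part of the AMD guide; `describe_exit` and `try_import` are hypothetical helper names. On POSIX systems `subprocess` reports a child killed by a signal as a negative return code, so an illegal instruction should show up as SIGILL.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable summary."""
    if returncode == 0:
        return "import succeeded"
    if returncode < 0:
        # On POSIX, a negative return code is the number of the terminating signal.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"


def try_import(module="torch"):
    """Import a module in a fresh interpreter and return the exit code."""
    result = subprocess.run(
        [sys.executable, "-c", f"import {module}"],
        capture_output=True,
        text=True,
    )
    return result.returncode


if __name__ == "__main__":
    print(describe_exit(try_import("torch")))
```

Running it inside each conda environment (rocm-python310 and torchcpu-python310) gives a quick side-by-side comparison of the two torch builds.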