I have a MacBook Pro with a Vega 20, which uses the amdgpu vega12 firmware, and when I boot any distro the graphics glitch and the computer freezes. If I install amdgpu-pro on Ubuntu it works flawlessly. Would you guys help me debug this and fix it upstream?
Please, let me know which kind of logs and information I can attach in order to start debugging this issue. I’ll be more than happy to be part of it!
amdgpu pro uses the same driver as upstream, just packaged so that you can install it on enterprise distros, so the code is the same. What driver package version did you use? What upstream kernels have you tried? Please include the dmesg output from the working and non-working cases.
Hi, let me add some more details.
If I start Ubuntu 20.04 with nomodeset and then install the AMDGPU-pro driver it works, even if I upgrade to a kernel like 5.10, for which the DKMS part of the driver does not compile.
If I try any other distro with the open source AMDGPU driver, it glitches and freezes right after it opens the WM.
I'm using the 5.10.12 kernel on Ubuntu with the amdgpu-pro driver 20.50 for the "working" dmesg output, and 5.10.30 on Manjaro with the open source driver for the "not working" dmesg output.
Can you narrow down which option helps? The bapm and aspm options don't do anything on kernel 5.10 with your board. Please check whether it's runpm or dpm that helps. If it's dpm, you can use the ppfeaturemask option to narrow down which dpm feature(s) are causing the problem.
Line 166 at this link starts the comments for and definition of PP_FEATURE_MASK, for 5.10 since that's what you're running. (More recently, one more mask has been added.)
To select the multiple features to mask (edit: as enabled) at once, add their values together to give them as a single kernel argument.
You may know this, but the numbers in the definition are in hex, and the number you give also needs to be in hex, signified by the 0x at the beginning of them. If you aren't familiar with this, just put in google the values you want to add together like this: 0x1+0x2+0x4+0x8, and it would tell you these add up to 0xF. (Not 0x15!)
All the mask values listed for 5.10 add up to 0x7FFFF.
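The hex arithmetic above can also be checked with a few lines of Python instead of a search engine. This is only a sketch: `combine` is a hypothetical helper, and the count of 19 single-bit masks is inferred from the 0x7FFFF total rather than read from the kernel source.

```python
def combine(*masks):
    """OR individual single-bit feature masks into one value."""
    combined = 0
    for m in masks:
        combined |= m
    return combined

# The four-value example from the comment above.
four_bits = combine(0x1, 0x2, 0x4, 0x8)
print(hex(four_bits))  # 0xf

# 19 single-bit masks, 0x1 through 0x40000 (count inferred from the 0x7FFFF total).
all_5_10 = combine(*(1 << bit for bit in range(19)))
print(hex(all_5_10))  # 0x7ffff
```

Because every mask is a distinct single bit, OR-ing them gives the same result as adding them.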
If it were me, I'd probably start with amdgpu.ppfeaturemask=0x7FFFF to first make sure that makes it work for you, just like amdgpu.dpm=0 does. Then, I'd start removing masks one by one, and putting any back in that made it not work, and move on to trying to remove other masks. If there's complicated interactions between which features are masked, it would be more complicated to narrow down than this simple process I've described, but hopefully that's not the case. (Edit: 2021-11-23 strikeout because the next comment from Alex clarifies my understanding of amdgpu.ppfeaturemask was backwards.)
If amdgpu.dpm=0 works but amdgpu.ppfeaturemask=0 [struck out: amdgpu.ppfeaturemask=0x7FFFF] doesn't, then I don't believe trying other combinations of masks to get values more [struck out: less] than 0x0 [struck out: 0x7FFFF] will find any that work, to potentially save yourself some time or at least set expectations. I'm guessing there must be some things dpm=0 does that ppfeaturemask can't do. (Edit: 2021-11-23 strikeout because the next comment from Alex clarifies my understanding of amdgpu.ppfeaturemask was backwards.) I'm not an amdgpu developer, so Alex or someone else may have something more useful to say.
Please try other values of ppfeaturemask. That will allow us to narrow down which feature(s) are causing problems. dpm=0 disables all power features. ppfeaturemask allows you to selectively enable power features. The features represented by the bits in ppfeaturemask are defined here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/include/amd_shared.h#n196
When the bits are set the feature is enabled. When the bits are cleared the features are disabled. For example setting ppfeaturemask=0xfff7bffe (i.e., clearing bit 0) will disable GFX clock dpm while leaving all other power features enabled. Setting ppfeaturemask=0xfff7bffc (i.e., clearing bits 0 and 1) will disable GFX clock dpm and memory clock dpm while leaving all other power features enabled.
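As a rough illustration of the bit clearing described above, the two example values can be reproduced in Python. The helper name `clear_bits` and the starting value 0xfff7bfff (the example values with bits 0 and 1 set again) are assumptions made for this sketch, not something taken from the driver.

```python
def clear_bits(mask, *bits):
    """Return mask with the given bit positions cleared (feature disabled)."""
    for bit in bits:
        mask &= ~(1 << bit)
    return mask

# Assumed starting point: the example values with bits 0 and 1 set again.
BASE = 0xfff7bfff

print(hex(clear_bits(BASE, 0)))     # 0xfff7bffe
print(hex(clear_bits(BASE, 0, 1)))  # 0xfff7bffc
```

The resulting value is what would be passed on the kernel command line, for example amdgpu.ppfeaturemask=0xfff7bffe.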
No, nothing. I can try more if it would be useful. But James said that, with the least restrictive combination not working, it would be very unlikely that any other combination would work. Is that so?
I apologize for the confusion I added here. I mistakenly believed that amdgpu.ppfeaturemask selectively disabled power features as bits were added in, and mistakenly believed amdgpu.dpm=0 and amdgpu.ppfeaturemask=0x7FFFF were nearly the same thing, both saying to have all of those features off. Alex's comment explaining that "ppfeaturemask allows you to selectively enable power features... When the bits are set the feature is enabled. When the bits are cleared the features are disabled" clarified that I had my understanding of amdgpu.ppfeaturemask completely backwards. Because of this, my last comment, which starts "If amdgpu.dpm=0 works...", and the last paragraph of my comment before that, which starts "If it were me, I'd...", should be disregarded. I'll edit the comments for clarity.
In an attempt to be helpful, I'll rewrite those portions of my previous comments to possibly correct them, to give you somewhere to start if Alex hasn't responded yet. But, I reiterate my previous warning that I'm not an amdgpu developer. I'm just another user. Something I say could be off, and anything Alex says takes precedence over what I say. :-)
If it were me, I'd probably start with amdgpu.ppfeaturemask=0 to make sure that works for you, and then amdgpu.ppfeaturemask=0xff7bfff to make sure that fails for you. Checking these is possibly unnecessary, but I usually like to do a sanity check at the extremes before iterating through a bunch of possibilities. (I'll note I'm not sure why the high value is 0xff7bfff. I'd think 0xfffff would do it, which is what the masks currently listed in the source add up to for enum PP_FEATURE_MASK, but maybe there's more to it. My larger number 0xff7bfff is based on Alex's 0xfff7bffe example but with bit 0 added back in. That seems to have the PP_OVERDRIVE_MASK = 0x4000 bit and the 0x80000 bit cleared, while PP_GFXOFF_MASK = 0x8000 stays set. I'm not sure whether that was intentional.)
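To see which feature bits a given value leaves cleared, a small check like the following can help. It is a sketch only: `cleared_bits` is a hypothetical helper, and the width of 20 bits is taken from the 20 features mentioned for enum PP_FEATURE_MASK.

```python
def cleared_bits(mask, width=20):
    """List the single-bit values (as hex strings) that are NOT set in mask."""
    return [hex(1 << bit) for bit in range(width) if not mask & (1 << bit)]

print(cleared_bits(0xff7bfff))   # ['0x4000', '0x80000']
print(cleared_bits(0xfff7bfff))  # ['0x4000', '0x80000']

# The two spellings differ only in the top four bits, above the listed features.
print(hex(0xff7bfff ^ 0xfff7bfff))  # 0xf0000000
```

By this check, 0xff7bfff and 0xfff7bfff agree on all 20 listed feature bits, and both leave 0x4000 and 0x80000 cleared.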
So, assuming you have a working and failing ppfeaturemask, there's a few ways to go.
I'd probably then verify amdgpu.ppfeaturemask=0xfffff failed. (If it worked, you'd have to play around with adding in higher values than are listed in the source following the 1/2/4/8 pattern you'll see in the numbers.) Then, I'd start with what works (presumably amdgpu.ppfeaturemask=0) and add in features one at a time, until you found one that made it fail. So, I might go through this list in order until I found one that failed:
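The cumulative values described here, with each feature bit added in one at a time, can be generated with a short sketch like this one. The function name is hypothetical, and the count of 20 bits is taken from the 20 features mentioned for enum PP_FEATURE_MASK.

```python
FEATURE_BITS = 20  # the 20 features mentioned for enum PP_FEATURE_MASK

def cumulative_masks(n_bits=FEATURE_BITS):
    """Masks with the lowest 1, 2, ..., n_bits feature bits set."""
    return [(1 << (i + 1)) - 1 for i in range(n_bits)]

# Print each candidate as a kernel argument, from 0x1 up to 0xfffff.
for mask in cumulative_masks():
    print(f"amdgpu.ppfeaturemask={mask:#x}")
```

Each printed value enables one more feature than the one before it, so the first value that fails points at the feature bit that was just added.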
Then, I might go through all the other feature bit masks and add them in, but keeping the one out that caused the failure. (Note that means no longer going through that list, but doing the math on each feature -- as that list is just each feature added in, one by one, at least for the 20 features listed for enum PP_FEATURE_MASK.) So, let's say for example 0x3FF worked but 0x7FF failed. In that case, that would mean enabling the feature PP_CLOCK_STRETCH_MASK = 0x400 caused a failure. I'd take 0x3FF which worked, and add in PP_OD_FUZZY_FAN_CONTROL_MASK = 0x800 and try 0xBFF next. I'd want to give as few features as possible that when enabled caused the issue, to narrow it down the most. Maybe you'd find you only needed to skip adding one value in to keep it working, or maybe you'd find a couple.
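The "keep the failing bit out and continue" step can be sketched as follows. The helper `next_candidate` and the default of 0xFFFFF for all listed bits are assumptions made for illustration.

```python
def next_candidate(working, failed_bits, all_bits=0xFFFFF):
    """Add the lowest untested feature bit to a working mask,
    skipping any bits already known to cause a failure."""
    remaining = all_bits & ~working & ~failed_bits
    if remaining == 0:
        return working  # nothing left to try
    lowest = remaining & -remaining  # isolate the lowest set bit
    return working | lowest

# 0x3FF worked and 0x7FF failed, so bit 0x400 is marked as failing.
print(hex(next_candidate(0x3FF, 0x400)))  # 0xbff
```

If 0xBFF then works, calling the helper again with 0xBFF as the working mask gives 0x1BFF as the next value to try.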
So far I tested with amdgpu.ppfeaturemask=0 and it works.
Then I went to amdgpu.ppfeaturemask=ff7bfff, and it not only worked but the native resolution (which is 2880x1800) went to 3360x2100 and it is working very well. Of course everything is too small, but with 200% scaling it is very clear and crisp. I will keep trying some combinations here and report my findings.
This issue hasn't had any activity since 2021-11-24. The AMD driver stack changes rapidly and contains lots of shared code across products so it's possible that it has already been fixed. Please upgrade to a current stable kernel and userspace stack and try again. If you still experience this issue with the latest driver stack, please capture relevant logging and open a new issue referring back to this one.