I'm using the Ryzen 7 7840HS CPU along with fully updated Fedora 39, running Linux 6.7.5 with mesa-va-drivers-freeworld, i.e. with full HW acceleration for H.264/H.265.
The CPU is running around 4700MHz (amd-pstate-epp driver running by default)
Power consumption is around 13W
These are truly insane numbers for a 5nm CPU.
Here's a comparison for the same video with the 8-year-old Intel Core i5-6200U, built on the very first iteration of Intel's 14 nm node:
Average CPU frequency is 700MHz
Average power consumption is 3.5W
Why is the new, super-advanced AMD part so terribly inefficient? That doesn't seem right.
I would expect it to consume at most 2-3W given all the advances in manufacturing, not four times more.
Addendum 1:
This
echo power | tee /sys/devices/system/cpu/cpufreq/*/energy_performance_preference
makes the CPU run at around 2300MHz and cuts power consumption roughly in half (8W), but that's still hugely inefficient. Unfortunately this workaround is no good for 4K 60fps VP9/AV1 videos because they start occasionally stuttering and skipping frames. I'm talking about YouTube.
Addendum 2:
Firefox needs media.ffmpeg.vaapi.enabled enabled in about:config in order to use HW video acceleration.
Addendum 3:
Under Windows 10 LTSC 2021 (21H2) with the VP9 extension installed and active:
For a 1440p VP9 60Hz stream on YouTube under Firefox 123:
I don't think there is much we can do in the driver to address this directly. Windows and Android handle most of this in their compositors. To address this we need the following:
1. Compositors need to be updated to use the overlay and compositing hardware in the display block to handle the CSC, scaling, and composition of the video. Currently most video players use graphics APIs like OpenGL to handle CSC and scaling. This uses a lot of power because GFX is powered up and the CPU is kept busy feeding GFX. If we use the display hardware, we can keep GFX powered down, and it also generates less CPU traffic since we don't have to keep feeding GFX (see the sketch after this list for one way to inspect what the display hardware exposes). Today this looks like:
Compressed video stream -> VCN (video decode) -> Raw YUV data -> GFX (CSC, scaling) -> Scaled RGB data -> DCN for display
What it should look like is:
Compressed video stream -> VCN (video decode) -> Raw YUV data -> DCN for CSC, scaling, and display
2. Make sure compositors allow direct flipping for full-screen video, similar to full-screen graphics for games. This avoids an extra copy. This is probably already handled in most cases.
3. The compositor or desktop environment should pick a less aggressive CPU profile while full-screen video playback is active. This is probably mostly nice-to-have since there should be less CPU load in general once 1 is implemented.
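For the curious: one way to check whether the display block exposes YUV-capable overlay planes at all is libdrm's modetest utility (naming the amdgpu driver here is an assumption about your setup):

modetest -M amdgpu -p

Overlay planes that advertise YUV formats such as NV12 are the ones a compositor could use to scan video out directly, skipping GFX.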
Does Wayland help achieve that, or is the display server being used inconsequential to the issue?
It's generally not possible to use multiple planes in X, so this would be Wayland-only.
Is it worth pinging mpv developers, so that at least this video player could become more frugal in terms of video playback?
What would need to happen would be for the media player to be able to ask the compositor if it can just hand it the raw YUV video data. If the compositor supports that and uses display planes to handle it, then the media player can just share the YUV images rather than RGB, cutting out the GFX work in the media player. Also, a lot of users like to use custom filters that media players provide via OpenGL which might not be possible with fixed function display hardware.
Can Firefox/Chrome achieve the same under Linux?
Sure via the same mechanism as a standalone media application talking to the compositor.
How does Windows come off a lot more power efficient in this regard? What's wrong with Linux?
MS and Android put a lot of work into their compositor and media APIs to enable these use cases out of the box. One vendor defines the ecosystem. On Linux you have lots of compositors and desktop environments, each with their own goals and schedules and features. The problem is that not all hardware supports this so most compositors and media players tend to focus on the solution that will serve the largest number of users across a wide range of hardware which is usually OpenGL. Windows and Android can mandate certain hardware features from the platforms they support.
In the best-case scenario, what could power consumption for e.g. 4K VP9/AV1 60fps stream playback be?
Regarding mpv: mpv --vo=dmabuf-wayland --hwdec=vaapi today will export the video and the OSD (UI/subtitles) as separate Wayland surfaces, and use VCN hw decode through vaapi.
However, most (if not all) wayland compositors today will GFX compose the exported surfaces to a single framebuffer before sending it to the display hardware.
I've compared mpv --vo=gpu with --vo=dmabuf-wayland and here on my side there's basically no difference. Both come at ~10.5W for a 1080p 60fps H.264 video.
That's under Wayfire 0.8.0-2.fc39.x86_64 with 1.33 scaling and a 2560x1600 120Hz display. Probably the compositor uses the GPU for scaling, which makes the GPU run at higher 3D frequencies and negates any benefit from using the dmabuf-wayland output. Even with scaling set to 1.0, there's barely any difference.
Looks like more work needs to be done.
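If you want to confirm the theory that the compositor keeps GFX clocked up, a quick check (assuming amdgpu exposes the iGPU as card0) is to watch the GFX clock levels while the video plays:

cat /sys/class/drm/card0/device/pp_dpm_sclk

The currently active level is marked with an asterisk; if it sits at the higher levels during playback, GFX is doing the compositing/scaling work.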
I've tried pure decoding, not sure if I did it right, but this command requires less than 1.5W on average:
This comes off at roughly 7W. Far from what I got with Intel (3.5W-4W, despite using the exact same software stack, i.e. Fedora + XFCE + Firefox), but better than the ~12W I get under X.org with AMD.
I still don't understand why Intel fares so much better, and I feel like there's some very large inefficiency on the AMD side.
Like I mentioned earlier, playing 1080p 60Hz H.264 videos in Firefox under AMD keeps the CPU fluctuating between 3.4 and 4.7GHz (!!) while on Intel I'm looking at 0.7GHz with no spikes.
You can test this with this URL, live 24 hours a day:
I'm not familiar with Intel hardware, but I suspect they may have a general-purpose blitter engine that can be used independently of GFX, which supports CSC and scaling and draws less power than GFX. They could use this in VAAPI rather than GFX.
It's not just the power consumption; one of my main use cases for hardware decoding is being able to watch 4K videos while doing CPU-intensive tasks, which is impossible even on top-of-the-line APUs: #2996 (comment 2266151)
Currently I see no advantage in hardware decoding because power consumption is the same (and thus battery life) and it doesn't help you while running heavy tasks.
If AMD's implementation needs architectural changes to be useful (because Intel doesn't have such problems) IMHO the involved upstreams should be made aware of that so that compositors/video players can eventually implement such features. Right now no one seems to be even aware of the issue.
AMD recently introduced a similar feature called VPE. I think it's also fixed-function hardware similar to Intel's SFC, which would change the current situation of AMDGPU using GFX for CSC and scaling in video playback. But I can't find any detailed documentation about this feature, or even the minimum supported hardware.
I also tested mpv --hwdec=vaapi --vo=dmabuf-wayland, but running on Weston, since it seems like wlroots-based compositors still don't work with this? (https://youtu.be/SMCMZwAiw2w?t=618) Anyway, I followed https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/3177 and noticed a 2 W difference between running on Hyprland vs Weston. Unfortunately it seems to be using my Nvidia dGPU, so I can't exactly tell if it would make a difference with AMD.
@agd5f While the scaling and compositing layers are definitely a concern, @marcoen's ffmpeg testing in the other ticket showed that, for decode only (without any compositing or scaling and the like), VAAPI on Linux used more than 3x the power it takes for decoding and compositing on Windows. Is that not also a driver issue?
Is it possible to test how much power the display pipeline takes by showing a pre-decoded video or something?
The display pipeline isn't the thing that's burning (lots of) power. The things that burn power are GPU and CPU. Windows has optimizations that use fixed-function HW (the display pipeline) for video playback via direct scanout surfaces. Most Linux compositors don't have the same optimizations (yet).
This obviously doesn't explain everything, but it's one piece of the puzzle.
@hwentland are you aware of ongoing efforts to use the fixed-function hw in (major) Linux compositors? Is AMD involved in those efforts? If you could provide relevant links, that'd be splendid.
If there are no ongoing efforts to support this in major compositors, is there at least a rough plan and documentation on the things that need to be done in order to get us there? Any hints/advice are much appreciated.
@leoli from my team is looking at it from the AMD side. He's mostly looked at sway/wlroots/libliftoff, but is now also keeping an eye on Weston. @leoli, can you post your PR and/or issue for your sway/wlroots work?
We're looking at sway and Weston since compositors are complex, we're new to the scene, and those are very accessible. There have been some minor talks with GNOME devs about support in mutter.
Related work has been done by Robert Mader. His mastodon account tends to have lots of news. His FOSDEM talk is a good overview.
Wow, thanks for the info, that's quite encouraging!
In particular https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/3177 seems interesting (wondering what's the state of non-YUV buffer support now...)
This gives me some hope things are about to improve.
Will definitely watch the FOSDEM video. Thanks again for the pointers!
- with sway using the scenegraph API (merged on latest)
- kernel patches to make the AMD cursor behave better, which I have yet to submit
There is still a lot left to tidy up before it's in any state to merge. I was initially looking at sway as a vehicle to tie everything together, but mutter/kwin/weston are definitely not out of the picture.
KWin 6.0 does direct scanout of fullscreen videos, with scaling and letterboxing when needed, but there's no overlay plane usage yet. That is planned for the near-ish future though.
One kind-of-blocker is that this only works for limited-range BT.601 videos until the color representation protocol is done. As the color management protocol is nearing completion, it shouldn't take that much longer for the representation one to be done either.
I tried doing this (https://mastodon.social/@rmader@floss.social/112015648160051940) on my Ryzen 7 6800H machine running the latest versions of Sway and PPD, and while playing back the 4K version of Big Buck Bunny, it only consumed 14-15 W! Way below the 21-25 W I often encounter using mpv/Firefox.
Steps:

Download files from here: https://cloud.silentundo.org/s/r8733siTjP4yRJp?dir=undefined&openfile=996972

Install:
- flatpak remote-add --user --if-not-exists gnome-nightly https://nightly.gnome.org/gnome-nightly.flatpakrepo
- flatpak install --user livi-[amd64|aarch64].flatpak

Run:
- flatpak run org.sigxcpu.Livi [bbb_sunflower_2160p_60fps_normal.mp4]

Btw, I encountered issues with permissions on Flatpak (it couldn't access my FS); solved it using flatpak override --user org.sigxcpu.Livi --filesystem=host
This isn't the full scope of the problem but I want to call it out because it will influence power consumption of the CPUs on the package.
The CPU is running around 4700MHz (amd-pstate-epp driver running by default)
When using amd-pstate, are you explicitly changing EPP values manually, or taking what is programmed by the firmware? The amd-pstate driver doesn't change the CPPC request MSR value from the boot firmware, so it will default to whatever the firmware programmed.
But userspace can explicitly set EPP values, such as the strings balance_power or balance_performance. If you use power-profiles-daemon, upgrade to 0.20 to make sure this happens.
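For reference, power-profiles-daemon ships a small CLI for this (the EPP mapping assumes the 0.20 behavior described above):

powerprofilesctl get
powerprofilesctl set balanced

where the available profiles are typically power-saver, balanced, and performance.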
I just don't understand why the CPU is involved so much, given that most of the work is performed by the GPU (at the very least decoding, which must be the most computationally expensive operation).
And what's interesting is that using "power", e.g.
echo power | tee /sys/devices/system/cpu/cpufreq/*/energy_performance_preference
may cut power consumption by at least 30 to 40%.
For AV1/VP9 4K 60Hz this doesn't work so well though as I get dropped frames but it absolutely works for 1080p H.264 60Hz streams.
I won't pretend that I understand how it all works but it looks like the CPU governor overshoots quite a lot (by default).
High CPU usage might indicate that the dedicated video decoder (VCN) isn't used, even though you've enabled media.ffmpeg.vaapi.enabled. You could also try to set media.hardware-video-decoding.enabled and media.hardware-video-decoding.force-enabled.
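One low-tech way to check whether VCN is actually busy during playback (assuming debugfs is mounted and the iGPU is DRM device 0) is amdgpu's power-state dump:

sudo cat /sys/kernel/debug/dri/0/amdgpu_pm_info

The output includes the VCN/UVD power state, which should show it active while a hardware-decoded video is playing.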
Even if VCN is being used I've seen high battery consumption from CPUs running other tasks when the EPP profile isn't actively set and it boots with firmware defaults. It's quite important for userspace to actively set one of the balanced EPP profiles.
Just tried this with PPD 0.20 set to power save, running Xorg since I couldn't run umr --gui on Wayland. I can confirm that vcn_dec_0 is indeed working, but power consumption is still as high as before.
I did want to investigate how much of the excess power draw was decoding and how much was the rest of presentation so I did a bunch of testing inspired by @marcoen over in the other ticket.
Here's the test setup: I am using Big Buck Bunny in the 480p24, 720p24, 1080p30 and 4k60 versions (all H.264) so others can reproduce it.
For the ffmpeg tests I use ffmpeg -re -i file -an -c:v rawvideo -f null /dev/null for sw decoding and ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -re -i file -an -c:v rawvideo -f null /dev/null for hw decoding, and for actual playback I am using Kodi with vaapi enabled and disabled. I am not 100% sure the accelerated ffmpeg command is entirely correct: up to the 1080p file it works as expected, with pretty much no CPU usage and a 1x rate, but with the 4k60 file it produces a ton of CPU usage and plays back at around a 0.3x rate. Not quite sure what's going on there.
To get the power readings I take the output of powerstat over 8 minutes and I let it settle back to baseline between runs.
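An invocation along these lines should match that methodology (assuming the powerstat from your distro supports RAPL via -R; the trailing arguments are sample delay and count):

sudo powerstat -R 1 480

That's 1-second samples for 8 minutes; the summary printed at the end includes the average power draw.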
The tests were done on my AMD Framework 13 on Arch with kernel 6.8.2 and everything up to date.
Average power above idle (W):

| | ffmpeg sw | ffmpeg hw | kodi sw | kodi hw | ffmpeg hw-sw | kodi hw-sw |
|---|---|---|---|---|---|---|
| 480p 24 | 0.45 | 2.74 | 2.13 | 3.66 | 2.29 | 1.53 |
| 720p 24 | 0.48 | 2.79 | 2.42 | 3.78 | 2.31 | 1.36 |
| 1080p 30 | 1.35 | 3.2 | 3.58 | 4.22 | 1.85 | 0.64 |
| 4k 60 | 10.48 | borked | 14.22 | 6.38 | n/a | -7.84 |
So, with and without presentation, hardware decoding uses significantly more power than software decoding at 1080p30 and below; above that, it helps a lot. From previous testing with 8k 60 YouTube videos, the difference there is over 30 W.
Now if we compare the ffmpeg and the Kodi values, we can get an upper limit for the impact of presentation (there is also other stuff going on, but it can't be more than full video playback minus decode-only, unless I am missing something).
| | sw | hw |
|---|---|---|
| 480p 24 | 1.68 | 0.92 |
| 720p 24 | 1.94 | 0.99 |
| 1080p 30 | 2.23 | 1.02 |
| 4k 60 | 3.74 | n/a |
Not quite sure what I am seeing here, but it looks like presentation hurts the accelerated case less.
So @agd5f, it still looks like the decode part alone uses excessive power; this part specifically could/would have to be addressed in the driver/firmware, right?
Hi @chestwood96, based on the suggestions in this ticket of using Wayland and playing with mpv --vo=dmabuf-wayland --hwdec=vaapi, I've done some measurements and the results are very promising. I'm using a very minimal Wayland setup using labwc at the moment.
I used the same video files as you and ran sudo powerstat -R -d 3 in the background. This measures CPU+GPU power every second for 1 minute and then prints the average. The actual power draw from the battery is a bit more of course.
For software decoding I play back with mpv --vo=gpu --hwdec=no. Here are the results:
Idle: 0.41 W on average (standard deviation 0.14)

| Video | SW | HW |
|---|---|---|
| 480p24 | 3.25 W (σ 0.06) | 3.59 W (σ 0.05) |
| 720p24 | 3.46 W (σ 0.10) | 3.62 W (σ 0.05) |
| 1080p30 | 6.09 W (σ 1.54) | 3.77 W (σ 0.05) |
| 4k60 | 12.20 W (σ 0.16) | 4.42 W (σ 0.05) |

All values are averages over the measurement window, with standard deviation in parentheses.
I don't know why the power consumption goes up and down so much for the 1080p30 video when doing SW decoding. I re-ran the test and got basically the same result.
I also don't know why my numbers are so much above idle compared to yours.
So far I'm quite happy with these results from where I started. I'm looking forward to testing the work that @leoli is doing.
Note that I'm running a self-built mpv, since there is a bug in the current release (v0.37.0) which caused some decoding error with the 4K60 video:
[ffmpeg/video] h264: get_buffer() failed
[ffmpeg/video] h264: decode_slice_header error
[ffmpeg/video] h264: no frame!
Error while decoding frame (hardware decoding)!
[ffmpeg/video] h264: get_buffer() failed
[ffmpeg/video] h264: decode_slice_header error
[ffmpeg/video] h264: no frame!
Error while decoding frame (hardware decoding)!
[ffmpeg/video] h264: get_buffer() failed
[ffmpeg/video] h264: decode_slice_header error
[ffmpeg/video] h264: no frame!
Error while decoding frame (hardware decoding)!
Attempting next decoding method after failure of h264-vaapi.
[autoconvert] HW-uploading to vaapi
[hwupload] upload yuv420p -> vaapi[nv12]
@marcoen Given we are taking different measurement values and using a different player, some of the numbers (especially the hw decoding ones) are quite close to each other, especially on the low end. Could you try without the dmabuf thing? I am quite curious whether the HW values would match even better. It would make sense that the savings from that increase with higher resolution. The low-end numbers are still sad, but 4-ish W for 4k60 is a lot more palatable; it may finally match my T480s XD.
Not quite sure why your sw numbers are higher; I don't know what Kodi uses, but it may not be the same as --vo=gpu.
Something else has been raised on the Framework forums regarding power consumption that might not be obvious: turning on scaling in GNOME or KDE does increase power consumption, since the graphics hardware has to be active to do those scaling operations.
I discovered that in Fedora, compared to 100%, setting the display scale to 200% increases power consumption by about 10%, while 125% to 175% increases it by about 40%.
KWin uses the display hardware for scaling instead of the GPU whenever that's possible (since 5.27), and with Gnome 46 it should happen in Gnome as well.
The increase in power consumption with 125% sounds like the application does not directly pass the video buffer to the compositor, though, but instead renders at 2x and is then downscaled for fractional scaling; rendering ca. 2.56x as many pixels ((2 / 1.25)² = 2.56) of course causes quite a bit of overhead.
I'm on KDE Plasma 6 myself, and while I haven't done any thorough testing, I can't discern any difference between scaling (125%) on and off on my FW13 internal display.
Which, coupled with what you said, would confirm that the DE is doing a good job.
I have been facing a similar problem on a Sapphire NITRO+ AMD Radeon RX 7900 XTX Vapor-X 24GB with a 7800X3D CPU and Smart Access Memory enabled. I use KDE+Wayland, usually with two display monitors attached and powered on (it doesn't seem to make a difference if I disconnect the additional monitors).
I've gone as far as testing mpv with vulkan decoding, and with vaapi.
Tested with 100% scaling and adaptive sync off on 3840x2160 monitor at 60hz.
mpv command line for vulkan:
env RADV_PERFTEST=video_decode mpv --vo=gpu-next --gpu-api=vulkan --hwdec=vulkan --gpu-context=waylandvk --player-operation-mode=pseudo-gui
mpv command line for vaapi:
env RADV_PERFTEST=video_decode mpv --hwdec=auto --hwdec-codecs=all --player-operation-mode=pseudo-gui
Note: confirmed settings were used by hitting i in mpv window to show display overlay statistics.
In my case the power used (as reported by CoreCtrl) goes to roughly 60-70 watts from 6-16 watts while not playing video.
I was hoping this would have been solved by now
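For cross-checking numbers reported by tools like CoreCtrl, amdgpu's hwmon interface can be read directly on most parts (assuming the card is card0; the value is in microwatts):

cat /sys/class/drm/card0/device/hwmon/hwmon*/power1_average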
My setup is a Lenovo IdeaPad 5 Pro with an AMD 7840HS and Radeon 780M. I have not been able to see HW decoding usage in nvtop. I am on Fedora (GNOME+Wayland) with mesa-va-drivers-freeworld. I tested it in Firefox, having enabled media.ffmpeg.vaapi.enabled and media.hardware-video-decoding.force-enabled. I only see video encoding usage:
Is it a configuration problem on my end that decoding is done on the software side?
There's a cpufreq variable called energy_performance_preference which by default is set to performance, and while watching online streams it results in the CPU consistently running at maximum (boost) frequencies and consuming a ton of energy.
When changing this preference to any of balance_performance, balance_power, or power, consumption drops approximately twofold, with power consumption scaling as balance_performance > balance_power > power.
I've not noticed any frame drops by using any of these options.
I wonder what's going on and why that's the case. I've settled on balance_performance in the end because the remaining two options are slow to ramp up frequencies, which results in higher latency and much worse performance for other tasks.
I wrote a script which alters this for all logical CPU cores:
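A minimal sketch of such a script, reusing the sysfs path from earlier in this thread (pass one of the EPP strings as the first argument; needs root):

#!/bin/sh
# Apply the given EPP preference (default: balance_performance) to all
# logical CPUs via the per-policy cpufreq interface.
pref="${1:-balance_performance}"
for f in /sys/devices/system/cpu/cpufreq/*/energy_performance_preference; do
    echo "$pref" > "$f"
done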