So, this problem curiously seems to be avoided by the POLARIS11 OpenCL support and is not specific to the Mesa version itself.
volatile has been legal in OpenCL code at least since OpenCL 1.1 and is used in the above projects to improve register allocation (without which AMD hardware gets a performance penalty).
mesa/clover does not touch the CLC code; that is processed by clang/llvm.
can you run with CLOVER_DEBUG=llvm,native CLOVER_DEBUG_FILE=dump-file and post the files?
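For anyone else hitting this, a minimal sketch of that invocation (the tuner binary here is just an example of a failing case; substitute whatever reproduces the problem for you):

```shell
# Sketch: make Clover dump the intermediate LLVM IR and the native ISA for
# every kernel it compiles. "dump-file" is an arbitrary base name; the binary
# named here is an example -- use whatever reproduces the failure.
CLOVER_DEBUG=llvm,native CLOVER_DEBUG_FILE=dump-file ./clblast_tuner_xgemm
```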
I just cloned/built CLBlast and it appears to be running ./clblast_tuner_xgemm (mentioned as the specific failing case in https://github.com/CNugteren/CLBlast/issues/298) successfully, so I don't know that the issue is polaris10 specific.
In my case, I've got Mesa 18.2.0-b21b38c4 (latest upstream as of the last hour or so), libclc r335280 (latest revision since late June), and llvm 7.0.0-svn as of r337934 (a few minutes ago).
I did notice that the LLVM version for both failing cases is different than the passing ones, so I went and downgraded to llvm 6.0.1... but it still works.
w/ LLVM 6.0.1 (first section of 578 tests):
Found best result 1.43 ms: 1505.0 GFLOPS
w/ LLVM 7.0.0svn:
Found best result 1.50 ms: 1428.2 GFLOPS
I'd agree with Jan that a dump of the llvm bitcode would be useful. Also, it may be interesting to try upgrading libclc or mesa to the latest upstream code to see if one of those has an effect.
You can see in my original comment that there are passes with LLVM 6.0.0 and failures for LLVM 6.0.1. So indeed, it can't be LLVM either.
I'm not sure if it's possible for libclc to be different between those, but it looks like it must be?
I asked one of the people suffering from this bug if he could provide the requested output. The affected distro appears to be Debian, so maybe we're looking at the version from this package: https://packages.debian.org/sid/libclc-amdgcn
My original theory (before I tested 6.0.1 myself) was that the spectre mitigation changes that went into 6.0.1 broke something about compilation of CL kernels, but that doesn't seem to have been the case (I did go through the entire commit history of 6.0.0->6.0.1 to try to identify things that looked suspicious).
I do run Ubuntu at home, so it would be feasible for me to at least install the libclc version they've got and give it a spin. LLVM/Mesa might be a bit more work, but if we run out of other options, I might give it a try. I'd have to downgrade my whole stack, but it's possible they're compiling with different flags/features which changes behavior somehow.
I think for now, it's worth giving the original reporter a bit of time to try to dump the bitcode and get us a bit more info that might help us reproduce/diagnose this since it seems there's something version/distro specific possibly at play. It might be useful to know the/an exact build/tag/revision that is failing for CLBlast as well, just to eliminate that. I'm assuming that the latest git master is broken, but feel free to tell me otherwise.
And to answer the implied question: libclc supports multiple LLVM versions, and we've supported LLVM 6.0.x in libclc for a while now (and still build against 3.9 - 7.0.0svn).
I reverted llvm to the 6.0.0 version packaged in the padoka ppa (from my 7.0svn build) and libclc to the package manager's version and managed to reproduce the failure.
Since mesa hadn't changed and LLVM didn't seem to be the issue, I upgraded libclc to the current upstream revision (still built using llvm 6.0.0), and it started working.
Figuring that debian's version included the git checkout date, I checked out the latest code as of both Mar 8 and Mar 12, but both of those worked as well.
So, I checked out the Debian sources and rebuilt their .debs on my system; those failed.
They've patched configure.py in libclc to add in some debian-specific build flags. When I copied those changes to upstream libclc as of 20180312 it started failing in the same way. Those same changes also break the current upstream libclc source.
The specific lines that Debian has patched into libclc's configure.py, and that break things, add the output of 'dpkg-buildflags --get CFLAGS' to the bitcode compilation flags.
For me, the output of 'dpkg-buildflags --get CFLAGS' is:
-g -O2 -fdebug-prefix-map=/home/awatry/src/libclc=. -fstack-protector-strong -Wformat -Werror=format-security
Something in there is breaking the bitcode compilation for libclc in debian.
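To narrow it down, one can filter the prime suspect out before the flags reach the bitcode build. A sketch, with the flag string hard-coded from the output above (a real run would substitute $(dpkg-buildflags --get CFLAGS); the machine-specific -fdebug-prefix-map entry is left out here):

```shell
# Stand-in for $(dpkg-buildflags --get CFLAGS), taken from the output above
# (the machine-specific -fdebug-prefix-map entry is omitted).
CFLAGS='-g -O2 -fstack-protector-strong -Wformat -Werror=format-security'

# Strip the most suspicious flag before passing the rest to the bitcode build:
BC_CFLAGS=$(printf '%s\n' "$CFLAGS" | sed 's/ *-fstack-protector-strong//')
echo "$BC_CFLAGS"
# -g -O2 -Wformat -Werror=format-security
```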
For background, this is the CFLAGS that Debian passes when compiling all C programs. From the GCC man page:
-fstack-protector
Emit extra code to check for buffer overflows, such as stack smashing
attacks. This is done by adding a guard variable to functions with
vulnerable objects. This includes functions that call "alloca", and
functions with buffers larger than 8 bytes. The guards are initialized when
a function is entered and then checked when the function exits. If a guard
check fails, an error message is printed and the program exits.
-fstack-protector-strong
Like -fstack-protector but includes additional functions to be protected ---
those that have local array definitions, or have references to local frame
addresses.
Of course GPUs are different, so this may not be appropriate. I'm not familiar with how OpenCL works, so please confirm what the most appropriate solution is for Debian, either:
- append -fno-stack-protector to the Debian flags, if the other flags are appropriate/relevant here
- omit all the flags completely and leave clang_bc_flags alone, if this is such a special thing that Debian's system-policy CFLAGS should be completely ignored
I guess a third possibility is that stack protectors are actually relevant for GPUs but Clang/LLVM is not generating correct code for those in this case.
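A sketch of the first option, assuming the flag string shown earlier: with both GCC and Clang the last flag on the command line wins, so appending -fno-stack-protector neutralizes the earlier -fstack-protector-strong.

```shell
# Stand-in for $(dpkg-buildflags --get CFLAGS); your distro's flags may differ.
CFLAGS='-g -O2 -fstack-protector-strong'
# Later flags override earlier ones, so this disables the stack protector:
CFLAGS="$CFLAGS -fno-stack-protector"
echo "$CFLAGS"
# -g -O2 -fstack-protector-strong -fno-stack-protector
```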
GPUs don't really have a stack (not for data anyway) and AMDGCN backend currently inlines all function calls anyway.
I'm not sure what kind of checks the flag adds.
If anyone can upload the different libclc bitcode it should be easy to spot.
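A sketch of what to look for in the disassembled bitcode: -fstack-protector would show up as __stack_chk_guard/__stack_chk_fail references or ssp* function attributes in the LLVM IR. The .bc path below is an assumption; adjust it to wherever your distro installs libclc's per-target files.

```shell
# Sketch: scan a built libclc bitcode file for stack-protector residue.
# The path is an assumption; llvm-dis ships with LLVM.
BC=/usr/lib/clc/tahiti-amdgcn-mesa-mesa3d.bc
if command -v llvm-dis >/dev/null 2>&1 && [ -f "$BC" ]; then
    # __stack_chk_* symbols or ssp* function attributes would be the
    # telltale signs of the flag leaking into the bitcode.
    llvm-dis -o - "$BC" | grep -E '__stack_chk|sspstrong' || echo "clean"
else
    echo "skipped: llvm-dis or $BC not available"
fi
```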
My guess would be that it adds some initialized global variables used in internal checks. This is illegal in CLC: only variables in the constant address space can have initializers.
IMO Debian should not be arbitrarily adding compilation flags unless they know what they're doing and they have tested the resulting package.
Local variables are stored in the private address space, which is backed either by the register file or by private buffers.
Using volatile to control where a variable is located is a rather hacky workaround for suboptimal register allocation/instruction scheduling.
Reporting an LLVM bug with a reproducer can help. It'd need to be reproducible using LLVM 7; there are no further releases of LLVM 5 or 6 planned.
I didn't get a chance to test it with LLVM 7 yet, but I didn't manage to find any related bugs in their tracker, so perhaps nobody has fixed it.
Hi,
Sorry for the late reply. Given the already strained time/resources, this is not a priority. The fix for this bug is for Debian to stop modifying libclc's CFLAGS.
There's a fix for this in the Debian experimental package for libclc. If you haven't already done so and this bug affects you, please test and give feedback on the Debian bug.