Memory errors/corruptions in cl-api

Jose Luis Cercos-Pita requested to merge sanguinariojoe/piglit:canary into main

Track and report memory errors in:

  • cl/get-platform-info
  • cl/get-device-info
  • cl/get-context-info
  • cl/get-command-queue-info
  • cl/get-mem-object-info
  • cl/get-image-info
  • cl/get-program-info
  • cl/get-program-build-info
  • cl/get-kernel-info
  • cl/get-kernel-arg-info
  • cl/get-kernel-work-group-info
  • cl/get-event-info

Some platforms write to param_value even when param_value_size is smaller than the size of the return type. For instance, I noticed (by means of Valgrind) that the proprietary AMDGPU OpenCL implementation performs an invalid write of exactly 1 byte in a clGetDeviceInfo() call.
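
As a minimal sketch of the behaviour the tests rely on: a clGet*Info()-style query is expected to reject an undersized buffer before touching it. Everything below is hypothetical (the query_u32() function, the QUERY_* codes and the value 42 are stand-ins for illustration, not code from any OpenCL implementation or from piglit).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical status codes standing in for CL_SUCCESS / CL_INVALID_VALUE. */
enum { QUERY_OK = 0, QUERY_INVALID_VALUE = -1 };

/*
 * Hypothetical stand-in for a clGet*Info()-style query that returns a
 * 32-bit value. The parameter names mirror the OpenCL ones.
 */
static int query_u32(size_t param_value_size, void *param_value,
                     size_t *param_value_size_ret)
{
    const uint32_t value = 42; /* illustrative payload */

    /* Reject an undersized buffer before writing a single byte to it. */
    if (param_value != NULL && param_value_size < sizeof(value))
        return QUERY_INVALID_VALUE;

    if (param_value != NULL)
        memcpy(param_value, &value, sizeof(value));

    /* The required size can be queried without providing a buffer. */
    if (param_value_size_ret != NULL)
        *param_value_size_ret = sizeof(value);

    return QUERY_OK;
}
```

The size check comes first, so a 2-byte buffer passed to this query is left untouched. The invalid write described above is what happens when an implementation skips or misorders that check.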

In that case, the memory of the test process gets corrupted, after which several things may happen:

  • The program finishes successfully, silently hiding the error
  • The program crashes immediately (I never actually experienced this), which at least lets the user know which platform is responsible
  • The program crashes at the end, even after reporting a passed test, due to 'double free/corruption'. The test is ultimately reported as crashed in the results
  • The program crashes at a random point while another platform is being tested, which is very confusing

So I suggest this classic solution: a canary that tracks when memory is wrongly written, without causing actual memory errors. Hopefully the same idea can be applied elsewhere, in other tests.
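
To illustrate the canary idea in isolation, here is a minimal sketch, not the code from this merge request. The names canary_alloc(), canary_damage() and buggy_write(), as well as CANARY_SIZE and CANARY_BYTE, are assumptions made up for the example. The buffer is over-allocated, the extra bytes are filled with a known pattern, and the pattern is checked after the call under test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define CANARY_SIZE 16   /* guard bytes placed after the usable region (assumed) */
#define CANARY_BYTE 0xA5 /* arbitrary fill pattern (assumed) */

/* Allocate `size` usable bytes followed by CANARY_SIZE guard bytes. */
static void *canary_alloc(size_t size)
{
    unsigned char *buf = malloc(size + CANARY_SIZE);
    if (buf == NULL)
        return NULL;
    memset(buf, 0, size);
    memset(buf + size, CANARY_BYTE, CANARY_SIZE);
    return buf;
}

/* Count the guard bytes that no longer hold the pattern (0 means intact). */
static size_t canary_damage(const void *ptr, size_t size)
{
    const unsigned char *guard;
    size_t i, damaged = 0;

    assert(ptr != NULL);
    guard = (const unsigned char *)ptr + size;
    for (i = 0; i < CANARY_SIZE; i++) {
        if (guard[i] != CANARY_BYTE)
            damaged++;
    }
    return damaged;
}

/*
 * Hypothetical misbehaving callee: it is told the buffer holds
 * `declared_size` bytes, yet it writes one byte past that limit.
 */
static void buggy_write(void *dst, size_t declared_size)
{
    ((unsigned char *)dst)[declared_size] = 0x01;
}
```

Because the stray byte lands inside the over-allocated guard region, the heap itself stays consistent and the test can report a failure for the offending platform instead of crashing later. A write that happens to store CANARY_BYTE, or that lands beyond the guard region, would go unnoticed, so this is a detection aid rather than a guarantee.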

Changes in the results:

After running piglit with the following 2 platforms:

  • Portable Computing Language (OpenCL 1.2 pocl 1.5-pre/tags/v1.4-RC2-0-g81763647, Release, LLVM 8.0.1, RELOC, SPIR, SLEEF, POCL_DEBUG)
  • AMD Accelerated Parallel Processing (OpenCL 2.1 AMD-APP (2766.4))

The following results are obtained:

cl/get-platform-info: "Pass" -> "Fail"

A memory error is now detected for the AMD Accelerated Parallel Processing platform; without the canary, it would silently corrupt the program.

cl/get-device-info: "Fail" -> "Fail"

In both cases, AMD Accelerated Parallel Processing failed due to this bug. With the canary, however, memory errors are also reported.

cl/get-context-info: "Pass" -> "Pass"

cl/get-command-queue-info: "Fail" -> "Fail"

In both cases AMD Accelerated Parallel Processing reported an unexpected CL_INVALID_COMMAND_QUEUE error while querying CL_QUEUE_SIZE. No memory errors were detected.

cl/get-mem-object-info: "Pass" -> "Pass"

cl/get-image-info: "Pass" -> "Pass"

cl/get-program-info: "Pass" -> "Pass"

cl/get-program-build-info: "Pass" -> "Pass"

cl/get-kernel-arg-info: "Crash" -> "Fail"

A separate memory corruption, caused by a bug, was producing the crashes. With that problem fixed, the test reports CL_SUCCESS where CL_KERNEL_ARG_INFO_NOT_AVAILABLE was expected, but the canary has not detected any memory corruption.

cl/get-kernel-info: "Pass" -> "Fail"

A memory error is now detected for the AMD Accelerated Parallel Processing platform; without the canary, it would silently corrupt the program.

cl/get-kernel-work-group-info: "Pass" -> "Fail"

A memory error is now detected for the AMD Accelerated Parallel Processing platform; without the canary, it would silently corrupt the program.

cl/get-event-info: "Pass" -> "Pass"

Edited by Jordan Justen