Performance compare to Intel® proprietary realization

ilia uploaded an attachment:

gcc dcttest.c -O3 -lOpenCL -lm -o dcttest

With 3 platforms available, run it on each one as

for i in 0 1 2; do ./dcttest $i; done

Attachment 118912, "Source code":
dcttest.c

Rong Yang @rongyang said:

I took a quick look at your kernel. There are two possible improvements:

Use native_cos instead of cos. This loses precision, so it is not an option if your program is precision sensitive.
Add some #pragma unroll hints. data[128] and res[128] are private arrays, and Beignet stores them in global memory. Once the loops are unrolled, every index into these arrays is a compile-time constant, so the compiler can promote the arrays to registers, which can improve performance significantly.

For more optimization tips, please refer to http://www.freedesktop.org/wiki/Software/Beignet/optimization-guide/.

#ifndef INFINITY
#define INFINITY 1.0/0
#endif
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

void dct_ii(float *x, float *X)
{
    float sum = 0.;
    #pragma unroll
    for (int n = 0; n < 128; ++n) {
        sum += x[n];
    }
    X[0] = sum;
    for (uint k = 1; k < 128; ++k) {
        sum = 0.;
        #pragma unroll
        for (int n = 0; n < 128; ++n) {
            sum += x[n] * native_cos((float)(M_PI * (n + .5) * k / 128));
        }
        X[k] = sum;
    }
}

__kernel void test_dct(__global float *gdata, __global float *gres)
{
    uint gid = get_global_id(0);
    uint idx = gid * 128;
    float data[128];
    float res[128];
    #pragma unroll
    for (uint i = 0; i < 128; i++) {
        data[i] = gdata[idx + i];
    }
    //for(uint i=5; i<=128; i++){
    dct_ii(data, res);
    //}
    #pragma unroll
    for (uint i = 0; i < 128; i++) {
        gres[idx + i] = res[i];
    }
}

assigned to @rongyang


Submitted by ilia
