Hello,
I am having difficulty parallelizing local memory store operations and would appreciate some help. The loads from local memory appear to be parallelized; however, in report.html's System Viewer tab the stores show a dependency chain: store(lmem[idx[0]] = result[0]) -> store(lmem[idx[1]] = result[1]) -> etc. I have experimented with setting the numbanks attribute, and the tool does bank the memory correctly, but this has no effect on the store dependencies.
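For reference, the banked declaration I experimented with looked roughly like this (a sketch from memory; numbanks(8) chosen to match the 8 accesses per unrolled iteration):

__local float2 __attribute__((numbanks(8), bankwidth(8))) lmem2[8192];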
This is a single work-item kernel, and the snippet below is part of a larger loop. Within a given iteration of that loop, the code never stores to the same address twice (no address collisions), but depending on the iteration number it may or may not have bank conflicts. The only way to guarantee zero bank conflicts across all iterations would be to use registers, but the lmem arrays are too large for that.
Given the above, I understand that on some loop iterations the stores will end up sequential (when they all hit the same bank); what I want is to take advantage of parallel stores on the iterations that have no bank conflicts.
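To make the conflict pattern concrete, here is a hypothetical pair of iterations (illustrative index values only, not my actual idx computation; assumes 8 banks with bankwidth(8), i.e. one float2 per bank word, and low-order bank selection so the bank of element i is i % 8):

/* Conflict-free iteration: idx = {0, 1, 2, 3, 4, 5, 6, 7}
 *   -> banks {0..7}, so all 8 stores could proceed in parallel.
 * Worst-case iteration:    idx = {0, 8, 16, 24, 32, 40, 48, 56}
 *   -> every index maps to bank 0, so the 8 stores must serialize.
 */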
Things I've tried on the store section:
- removing the #pragma unroll. This resulted in the compiler automatically unrolling the stores.
- #pragma unroll 1. This bottlenecks my algorithm to the point where I won't see any benefit from vectorization.
- swapping the order of #pragma ivdep and #pragma unroll. No effect.
I would have expected #pragma ivdep to resolve this; can someone please advise? (One further variant I have not yet tried, the array-qualified ivdep, is sketched after the snippet below.)
Thank you.
See the code snippet below:
__private float2 operand[8];
__private float2 result[8];
__private uint idx[8];
__local float2 __attribute__((bankwidth(8))) lmem1[8192];
__local float2 __attribute__((bankwidth(8))) lmem2[8192];
...
... Code that computes the idx array
...
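// Load loop: the report shows these 8 loads issued in parallel as expected.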
#pragma unroll
#pragma ivdep
for (uint ii = 0; ii < 8; ++ii)
{
    operand[ii] = lmem1[idx[ii]];
}
...
... Code that computes the result array
...
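// Store loop: the System Viewer shows these 8 stores chained sequentially.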
#pragma unroll
#pragma ivdep
for (uint ii = 0; ii < 8; ++ii)
{
    lmem2[idx[ii]] = result[ii];
}
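For completeness, the variant I plan to try next is the array-qualified form of ivdep on the store loop. A sketch (assuming #pragma ivdep array(...) is accepted on __local arrays here; lmem2 as declared above):

#pragma unroll
#pragma ivdep array(lmem2)
for (uint ii = 0; ii < 8; ++ii)
{
    lmem2[idx[ii]] = result[ii];
}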