Hello,
I'm trying to understand the relationship between channels, loop unrolling, and M20K RAM block usage.
For this purpose, I've created a simple program composed of three kernels:
- the first injects data into the channel
- the second accumulates it
- the third receives the result of the accumulation and stores it in global memory
Code:
__kernel void generator_float_vector(int N)
{
    int outer_loop_limit = (int)(N / U);
    //we cannot have double write
    for (int i = 0; i < outer_loop_limit; i++)
    {
        #pragma unroll
        for (int j = 0; j < U; j++)
            write_channel_intel(channel_float_vector, (float)(1.0));
    }
}

__kernel void consumer(int N)
{
    int outer_loop_limit = (int)(N / U);
    float acc_o = 0;
    float x[U];
    for (int i = 0; i < outer_loop_limit; i++)
    {
        float acc = 0;
        #pragma unroll
        for (int j = 0; j < U; j++)
            x[j] = read_channel_intel(channel_float_vector);
        #pragma unroll
        for (int j = 0; j < U; j++)
            acc += x[j];
        acc_o += acc;
    }
    write_channel_intel(channel_float_sink, acc_o);
}

__kernel void sink_single(__global float * restrict out)
{
    float r = read_channel_intel(channel_float_sink);
    *out = r;
}
The first and second kernels use loop unrolling to speed up computation. The unrolling factor is set by the constant U.
In the second kernel, I made the channel read explicit just for readability.
By varying U, I obtain different numbers of used RAM blocks (M20K) in the report.
The code is compiled with Quartus v18.0 for an Arria 10 board.
In particular:
U=4   RAM=16  (16 sink)
U=8   RAM=17  (1 consumer, 16 sink)
U=16  RAM=21  (5 consumer, 16 sink)
U=32  RAM=38  (22 consumer, 16 sink)
U=64  RAM=70  (54 consumer, 16 sink)
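To make the breakdown easier to read, here is a small host-side C sketch (not part of the kernels) that recomputes the non-sink share of each total and how much it grows from one value of U to the next. The figures are copied from the list above; the struct and function names are hypothetical and only serve this illustration.

```c
#include <assert.h>

/* Figures copied from the report values listed in the post. */
struct m20k_report { int u; int total; int sink; };

static const struct m20k_report REPORT[] = {
    {4, 16, 16}, {8, 17, 16}, {16, 21, 16}, {32, 38, 16}, {64, 70, 16},
};
enum { REPORT_LEN = sizeof(REPORT) / sizeof(REPORT[0]) };

/* M20K blocks not accounted for by the sink kernel for entry i. */
int non_sink_blocks(int i)
{
    assert(i >= 0 && i < REPORT_LEN);
    return REPORT[i].total - REPORT[i].sink;
}

/* Growth in non-sink blocks when moving from entry i-1 to entry i. */
int growth_from_previous(int i)
{
    assert(i >= 1 && i < REPORT_LEN);
    return non_sink_blocks(i) - non_sink_blocks(i - 1);
}
```

Calling `growth_from_previous` for entries 1 to 4 gives the step-by-step increase in blocks attributed to the consumer kernel as U doubles, which is the trend I am trying to explain.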
I believe that the 16 RAM blocks used by the sink kernel are due to the device RAM interface.
What I cannot understand is the number of RAM blocks used by the consumer kernel:
- from the programming guide, the compiler should try to use private memory (registers) if the data used is less than 64 bytes.
This should correspond to the case U=16 (a float being 4 bytes), but it doesn't seem so
- starting from U=16, the number of RAM blocks used increases with U, which should somehow be related to the unrolling
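The arithmetic behind the first point can be sketched in plain C (host-side, not part of the kernels). It computes the size of the private array `float x[U]` and compares it with the 64-byte figure from the programming guide. The function and macro names are hypothetical, and I am assuming the threshold is inclusive, which is what my U=16 expectation relies on.

```c
#include <assert.h>
#include <stddef.h>

/* An OpenCL float is 32 bits wide. */
#define FLOAT_BYTES 4
/* Threshold quoted from the programming guide; treated as inclusive (assumption). */
#define PRIVATE_REGISTER_LIMIT_BYTES 64

/* Bytes needed by the private array `float x[U]` in the consumer kernel. */
size_t private_array_bytes(int u)
{
    assert(u > 0);
    return (size_t)u * FLOAT_BYTES;
}

/* Returns 1 if x[U] is within the quoted threshold, 0 otherwise. */
int within_register_limit(int u)
{
    return private_array_bytes(u) <= PRIVATE_REGISTER_LIMIT_BYTES;
}
```

Under this reading, U=4, 8 and 16 stay within the threshold and U=32 and 64 exceed it, yet the report already shows RAM blocks in the consumer kernel at U=8 and U=16.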
Any suggestions on how to read these numbers?
Thanks