================================================================
for(uint t = 0; t < loop_cnt; t++) {
    // load data and weights into the local buffers
    for(uint w = 0; w < TILE_WIDTH; w++) {
        data[w] = read_channel_altera(data_in_ch);
    }
    for(uint h = 0; h < TILE_HEIGHT; h++) {
        weight[h] = read_channel_altera(weight_in_ch);
    }
    // compute the matrix tile multiplication using the PE (MAC) array
    #pragma unroll
    for(uint w = 0; w < TILE_WIDTH; w++) {
        float data_temp = data[w];
        #pragma unroll
        for(uint h = 0; h < TILE_HEIGHT; h++) {
            float weight_temp = weight[h];
            float temp = data_temp * weight_temp;
            // first iteration initializes the accumulator, later ones add to it
            if(t == 0)
                output[h * TILE_WIDTH + w] = temp;
            else
                output[h * TILE_WIDTH + w] = output[h * TILE_WIDTH + w] + temp;
        }
    }
}
// declare the output data to be enqueued in the Altera channel
lane output_lane;
for(uint w = 0; w < TILE_WIDTH; w++) {
    #pragma unroll
    for(uint h = 0; h < TILE_HEIGHT; h++) {
        // multiply by scale and add bias before moving it out
        output_lane.lane_data[h] = output[h * TILE_WIDTH + w] * scale[h] + bias[h];
    }
    write_channel_altera(output_ch, output_lane);
}
================================================================
Here is a snippet of my code. I am performing a tiled matrix multiplication and moving the data out through a channel once the accumulation is finished. However, on hardware the output is not fully accumulated: it appears to be moved out before the accumulation finishes. For example, if the correct output pattern is all 36s, the hardware result is a mix of values smaller than 36.

The compilation report seems to support this. With TILE_WIDTH 4 and TILE_HEIGHT 8, the number of simultaneous reads from the output local buffer should be 32, but the report shows 40. I believe the extra 8 come from the 8 simultaneous reads that move the data out after accumulation (32 + 8 = 40). So it looks like the accumulation and the move-out are happening at the same time! This is very strange, because the move-out should only happen after the accumulation is finished.

Below is the report for the local buffer output:
================================================================
- Local memory: Optimal. Requested size 128 bytes (rounded up to nearest power of 2), implemented size 128 bytes, stall-free, 40 reads and 32 writes.
  Additional information:
  - Banked on lowest dimension into 32 separate banks (this is a good thing).
  - Reducing accesses to exactly one read and one write for all on-chip memory systems may increase overall system performance.
================================================================
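For comparison, here is a minimal host-side C sketch of the result I expect from the kernel, which can be checked against the hardware output. The function name `reference_tile` and its flat array layout are hypothetical and made up for this illustration (they are not part of my kernel). It assumes each iteration consumes TILE_WIDTH data values followed by TILE_HEIGHT weights, in the same order as the channel reads in the snippet.

```c
#define TILE_WIDTH  4
#define TILE_HEIGHT 8

/* Hypothetical host-side reference model, not part of the kernel.
 * data   : loop_cnt * TILE_WIDTH  values, in channel read order
 * weight : loop_cnt * TILE_HEIGHT values, in channel read order
 * result : TILE_WIDTH * TILE_HEIGHT values, one "lane" of TILE_HEIGHT
 *          values per w, in the order the kernel writes them out */
void reference_tile(const float *data, const float *weight,
                    const float *scale, const float *bias,
                    unsigned loop_cnt, float *result)
{
    float output[TILE_HEIGHT * TILE_WIDTH] = {0};

    /* Accumulate the outer product of data and weight over all iterations. */
    for (unsigned t = 0; t < loop_cnt; t++) {
        for (unsigned w = 0; w < TILE_WIDTH; w++) {
            for (unsigned h = 0; h < TILE_HEIGHT; h++) {
                output[h * TILE_WIDTH + w] +=
                    data[t * TILE_WIDTH + w] * weight[t * TILE_HEIGHT + h];
            }
        }
    }

    /* Only after the accumulation is complete: apply scale and bias. */
    for (unsigned w = 0; w < TILE_WIDTH; w++) {
        for (unsigned h = 0; h < TILE_HEIGHT; h++) {
            result[w * TILE_HEIGHT + h] =
                output[h * TILE_WIDTH + w] * scale[h] + bias[h];
        }
    }
}
```

With all-ones data and weights, scale 1, bias 0, and loop_cnt = 36 (one assumed way to produce the all-36 pattern), every entry of `result` is 36, so any smaller value in the hardware output means a lane was written out before all 36 iterations were accumulated.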
Any advice would be greatly appreciated!