What is the best way to organize matrix operations in CUDA (in terms of performance)? For example, I want to calculate

C * C^(-1) * B^T + C,

where C and B are matrices. Should I write separate kernels for the multiplication, inversion, and transposition, or write one kernel for the whole expression? Which is fastest?

I think the answer depends on the size of your matrices. If a matrix fits in shared memory, I would probably use a single block to compute it, all within one kernel (possibly a larger kernel, where this computation is only one part of it). Then, if you have many such matrices and need to evaluate the expression above many times, you can hopefully process them in parallel and use all of the GPU's compute power. If your matrices are much larger, however, you will want multiple blocks for each operation (see the matrix-multiplication example in the CUDA programming guide). In that case you need a guarantee that all blocks have finished the multiplication before moving on to the next part of the expression, and for that you will need a separate kernel launch for each operation.
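The multi-kernel approach can be sketched as below. This is a minimal, hypothetical illustration, not a tuned implementation: the names (`N`, `matMul`, `matAdd`, `evaluate`) and the fixed square size are assumptions, and the inverse of C is taken as a precomputed input (in practice you would obtain it from a solver library such as cuSOLVER, or better, avoid forming it explicitly). The point it shows is that kernel launches on the same stream are serialized, so every block of one multiplication is guaranteed to finish before the next kernel starts, without any explicit synchronization inside the kernels.

```cuda
#include <cuda_runtime.h>

#define N 512    // assumed square matrix dimension (illustrative)
#define TILE 16

// Naive dense multiply: out = A * B. One thread per output element.
__global__ void matMul(const float *A, const float *B, float *out) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];
        out[row * N + col] = acc;
    }
}

// Elementwise add: out = A + B.
__global__ void matAdd(const float *A, const float *B, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N * N) out[i] = A[i] + B[i];
}

// Evaluate C * C^(-1) * B^T + C as three sequential launches.
// dCinv and dBT are assumed to be already computed on the device.
void evaluate(const float *dC, const float *dCinv, const float *dBT,
              float *dTmp1, float *dTmp2, float *dResult) {
    dim3 threads(TILE, TILE);
    dim3 blocks((N + TILE - 1) / TILE, (N + TILE - 1) / TILE);

    matMul<<<blocks, threads>>>(dC, dCinv, dTmp1);             // C * C^(-1)
    matMul<<<blocks, threads>>>(dTmp1, dBT, dTmp2);            // ... * B^T
    matAdd<<<(N * N + 255) / 256, 256>>>(dTmp2, dC, dResult);  // ... + C
}
```

In the small-matrix case you would instead fuse these steps into one kernel, using `__syncthreads()` between steps, which only works because a single block can synchronize internally.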