OpenCL Tutorials 2 - Memory Bandwidth
In this tutorial we will cover the different types of memory present on the GPU and how we can optimize bandwidth by using the most appropriate memory types and accessing data efficiently.
First of all, you should know that memory optimization is one of the most important issues when you care about performance. We don't want our GPU to sit idle waiting while data is being transferred from host to device.
1. Memory Hierarchy
Memory comes in several types, some bigger and slower, some smaller and faster. If we want our code to run fast, we need to know which types exist and how to make good use of them.
The following table shows an overview of the memory hierarchy and maps OpenCL names to their CUDA equivalents.
OpenCL name    | CUDA name                | Description
Private memory | Local memory / registers | No data sharing at all. Data written by a work-item (thread) is only visible to that same work-item.
Local memory   | Shared memory            | Can be accessed by the whole work-group (thread block); much faster than global memory.
Global memory  | Global memory            | All work-items (threads) within a kernel can access it.
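As a sketch of how these three memory spaces appear in OpenCL C (the kernel name and argument names here are my own, for illustration only), each space has its own address-space qualifier:

```c
// Illustrative kernel touching all three memory spaces.
__kernel void memory_spaces(__global const float *input,  // global memory
                            __global float *output,
                            __local  float *scratch)      // local memory (one buffer per work-group)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    float x = input[gid];   // 'x' lives in private memory (typically a register)

    scratch[lid] = x;       // visible to every work-item in this work-group
    barrier(CLK_LOCAL_MEM_FENCE);  // make local-memory writes visible group-wide

    output[gid] = scratch[lid];    // back to global memory
}
```

The `__local` buffer is usually passed from the host with `clSetKernelArg(kernel, 2, bytes, NULL)`, which allocates local memory without a host pointer.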
2. Memory Access
To get good performance in our programs, we'll want to minimize data transfers between host and device.
Accesses to global memory by the threads of a half-warp can be coalesced into a single transaction to obtain higher performance. For that, it is necessary that the k-th thread in the half-warp accesses the k-th word of the memory segment, even if not all the threads participate.
For devices of compute capability 1.2 or higher, any access pattern that fits into the segment size is coalesced into a single transaction.
For devices of compute capability 1.1 or lower, if the coalescing requirements are not met, each word is fetched in a separate transaction.
3. Example: Matrix Transpose
3.1. Uncoalesced Access
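A minimal sketch of a naive transpose kernel (the kernel and argument names are my own, assuming a row-major width x height input matrix): the reads walk along a row, so consecutive threads read consecutive words and the reads are coalesced, but consecutive threads write words that are `height` elements apart, so the writes are uncoalesced.

```c
// Naive transpose: coalesced reads, strided (uncoalesced) writes.
__kernel void transpose_naive(__global const float *in,
                              __global float *out,
                              int width, int height)
{
    int x = get_global_id(0);   // column in the input
    int y = get_global_id(1);   // row in the input

    if (x < width && y < height)
        out[x * height + y] = in[y * width + x];  // write stride = height words
}
```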
3.2. Coalesced Access
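The usual fix is to stage a tile of the matrix in local memory: read a tile with coalesced accesses, transpose it inside the (fast) local memory, and write it back with coalesced accesses. This sketch assumes a TILE x TILE work-group size (16 is my illustrative choice) and the same row-major layout as above; the names are mine, not from the tutorial.

```c
#define TILE 16  // assumed work-group size is TILE x TILE

// Tiled transpose: both global reads and global writes are coalesced.
__kernel void transpose_tiled(__global const float *in,
                              __global float *out,
                              int width, int height)
{
    // +1 padding avoids local-memory bank conflicts when reading columns.
    __local float tile[TILE][TILE + 1];

    int x = get_group_id(0) * TILE + get_local_id(0);
    int y = get_group_id(1) * TILE + get_local_id(1);

    // Coalesced read: consecutive threads read consecutive words of a row.
    if (x < width && y < height)
        tile[get_local_id(1)][get_local_id(0)] = in[y * width + x];

    barrier(CLK_LOCAL_MEM_FENCE);

    // Swap the block coordinates so the write is also row-contiguous.
    x = get_group_id(1) * TILE + get_local_id(0);
    y = get_group_id(0) * TILE + get_local_id(1);

    if (x < height && y < width)
        out[y * height + x] = tile[get_local_id(0)][get_local_id(1)];
}
```

The transpose itself happens in local memory (reading `tile` column-wise), which is cheap; the expensive global memory is only ever touched with contiguous, coalesced accesses.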
Please let me know if you have any doubts or comments.