OpenCL Tutorials 2 - Memory Bandwidth

In this tutorial we will cover the different types of memory available on the GPU and how we can optimize bandwidth by choosing the right memory type and accessing data efficiently.

First of all, you should know that memory optimization is one of the most important factors for performance. We don't want our GPU to sit idle while data is being transferred from host to device.

1. Memory Hierarchy

GPU memory is divided into several types, some larger and slower, some smaller and faster. To make our code run faster, we need to know which types exist and how to make good use of them.

The following table gives an overview of the memory hierarchy and maps OpenCL names to their CUDA equivalents.
OpenCL  | CUDA     | Overview
Private | Register | No data sharing at all. Data written by a thread is visible only to that thread.
Local   | Shared   | Can be accessed by the whole work-group (thread block); much faster than Global memory.
Global  | Global   | Can be accessed by all threads within a kernel.


2. Memory Access

To get good performance in our programs, we want to minimize data transfers between host and device.

2.1. Coalesced

Global memory accesses by the threads of a half-warp can be coalesced into a single transaction to obtain higher performance. On devices of compute capability 1.0 and 1.1, this requires that the k-th thread in the half-warp access the k-th word of an aligned segment, although not all threads need to participate. On devices of compute capability 1.2 or higher, any access pattern that fits within a single segment is coalesced.
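To make the two rules concrete, here is a minimal host-side sketch in plain C. It is a simplified model, not device code and not how the hardware is implemented. The function names `is_coalesced_strict` and `count_segments`, the half-warp size of 16, and the idea of passing the segment size as a parameter are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HALF_WARP 16  /* threads per half-warp (assumption for this model) */

/* Strict rule (simplified model of the k-th thread / k-th word condition):
 * every participating thread k must access word k of one segment
 * aligned to HALF_WARP * word_size bytes. Inactive threads are skipped. */
bool is_coalesced_strict(const uint64_t addr[HALF_WARP],
                         const bool active[HALF_WARP],
                         size_t word_size)
{
    assert(word_size > 0);
    const uint64_t seg_size = (uint64_t)HALF_WARP * word_size;
    bool have_seg = false;
    uint64_t first_seg = 0;

    for (int k = 0; k < HALF_WARP; ++k) {
        if (!active[k])
            continue;
        uint64_t seg = addr[k] / seg_size;     /* index of the aligned segment */
        uint64_t offset = addr[k] % seg_size;  /* byte offset inside it */
        if (offset != (uint64_t)k * word_size)
            return false;                      /* thread k is not on word k */
        if (have_seg && seg != first_seg)
            return false;                      /* threads span two segments */
        first_seg = seg;
        have_seg = true;
    }
    return true;
}

/* Relaxed rule (simplified model of the segment-based condition):
 * one transaction per distinct aligned segment touched. */
int count_segments(const uint64_t addr[HALF_WARP],
                   const bool active[HALF_WARP],
                   uint64_t seg_size)
{
    assert(seg_size > 0);
    uint64_t seen[HALF_WARP];
    int n = 0;

    for (int k = 0; k < HALF_WARP; ++k) {
        if (!active[k])
            continue;
        uint64_t seg = addr[k] / seg_size;
        bool found = false;
        for (int i = 0; i < n; ++i) {
            if (seen[i] == seg) {
                found = true;
                break;
            }
        }
        if (!found)
            seen[n++] = seg;
    }
    return n;
}
```

In this model, 16 threads reading 4-byte words in reverse order inside one aligned segment fail the strict check, yet they still touch only one segment, so the relaxed rule counts them as a single transaction. The segment size is left as a parameter because it depends on the device and the word size.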

2.2. Misaligned

On devices of compute capability 1.1 or lower, if the coalescing requirements are not met, each word is fetched in its own transaction, so a full half-warp issues 16 transactions instead of one.
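The cost of a misaligned start address on these older devices can be sketched with a small host-side model in plain C. This is only an illustration under stated assumptions: the function name `transactions_sequential` is hypothetical, the half-warp is taken to be 16 threads, and thread k is assumed to read the word at `base_addr + k * word_size`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HALF_WARP 16  /* threads per half-warp (assumption for this model) */

/* Simplified model for compute capability 1.1 or lower.
 * Thread k reads the word at base_addr + k * word_size.
 * Returns the number of memory transactions the half-warp issues:
 * one if the block of words starts on an aligned segment boundary,
 * otherwise one per thread. */
int transactions_sequential(uint64_t base_addr, size_t word_size)
{
    assert(word_size > 0);
    const uint64_t seg_size = (uint64_t)HALF_WARP * word_size;

    if (base_addr % seg_size == 0)
        return 1;          /* aligned: the whole half-warp is coalesced */
    return HALF_WARP;      /* misaligned: every word is fetched separately */
}
```

With 4-byte words the segment is 64 bytes in this model, so a base address of 256 gives one transaction, while shifting it by a single word to 260 gives 16.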

2.3. Strided

3. Example: Matrix Transpose

3.1. Uncoalesced Access

3.2. Coalesced Access

Please let me know if you have any questions or suggestions.

Last edited Aug 18, 2010 at 6:05 PM by bjurkovski, version 10

