In Part 1 of this series, I explained what GPU programming is on a high level. Now we can move on to exploring your first GPU algorithm: scan/prefix sum.
If you've ever tried programming a GPU before, you're probably familiar with vector addition. This is the use case where you have two large arrays of numbers, and you want to build a new array, representing the element-wise sum. This algorithm is normally used to demonstrate the power of GPU programming, by showcasing one way to exploit its massively parallel capabilities. However, although this example is useful, it misses out on the techniques involved in more advanced GPU programming, because it views parallel programming as simply a tool that exists to parallelize certain workloads, rather than a language or set of computational tools in its own right.
Here, we cover vector addition as a prelude to the more advanced algorithms we discuss later in this article. Briefly, we can write such a vector addition algorithm as follows:
Given two vectors (arrays), a and b, each of size N, we want to compute the sum, c. We can do this in series, using the CPU version below, or we can do this in parallel, using the GPU version that follows it. Note that we use Python-style syntax for the CPU, and C-style syntax for the GPU. This is because GPU computations are normally written in low-level C-like languages such as CUDA, while CPU-side computations are often handled by high-level Python wrappers such as Torch or Numba.
CPU version:
for i in range(N):
    c[i] = a[i] + b[i]
GPU version:
void add_kernel(int* a, int* b, int* c){
    // Each thread handles exactly one element of the output.
    int idx = get_global_id();
    c[idx] = a[idx] + b[idx];
}
Note that we no longer have a loop over array indices. Instead, we have something called a "global ID." On the GPU, we run many copies of the kernel program, each on a different thread, so we need some way to know which thread we're running on, and which piece of data that thread should work on. This program works because each thread can execute independently. Below is a rough sketch of what the host side of this launch might look like; after that, we will use these techniques to build a more advanced prefix sum algorithm.
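For completeness, here is a sketch of the host-side code that could launch this kernel. The gpu.copy and gpu.empty_like helpers are the same hypothetical convenience functions used in the prefix sum host program later in this article, and gpu.copy_back is a made-up name for transferring the result back to the CPU; real APIs such as CUDA, OpenCL, or Numba spell these steps differently.
a_gpu = gpu.copy(a)               # copy the input arrays into GPU memory
b_gpu = gpu.copy(b)
c_gpu = gpu.empty_like(a)         # allocate space for the result on the GPU
add_kernel(a_gpu, b_gpu, c_gpu)   # launch N threads, one per element
c = gpu.copy_back(c_gpu)          # copy the result back to the CPU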
Prefix sum
Let’s take an array, a : int array. Then, given this array, we want to generate a new array, ps : int array, such that for every index idx, we have ps[idx] = a[0] + a[1] + … + a[idx]. So, for example, if we have the array a = [1, 4, 7, 1, 3], we would have ps = [1, 5, 12, 13, 16]. The array ps is known as the "prefix sum" of a. The prefix sum might seem arbitrary, but it is interesting for two reasons:
- It comes up a lot in algorithm puzzles. So, if you're interested in learning parallel algorithms for LeetCode, this may come in handy.
- It is useful for certain parallel algorithms like QuickSort, for reasons we’ll explore in the next module.
The most efficient CPU version of a prefix sum calculation looks like this:
ps[0] = a[0]
for i in range(1, N):
    ps[i] = ps[i - 1] + a[i]
Note that this is asymptotically the most efficient possible implementation: we do O(N) operations, and we need at least O(N) operations just to fill the prefix sum array.
Let’s ignore the finite instruction length restrictions for now while we design our GPU algorithm. Take a moment to think of a naive way of parallelizing this process. Remember that each kernel needs to run independently of the others.
One way we can do this is as follows: for each index, we can write a kernel that simply sums all the elements up to and including that index, like this:
void kernel(int* a, int* ps){
    int idx = get_global_id();
    // Each thread independently sums every element up to and including its own index.
    ps[idx] = 0;
    for(int i = 0; i <= idx; i++){
        ps[idx] += a[i];
    }
}
However, notice that this procedure isn’t necessarily more efficient than the CPU version: even if all of our kernels run completely in parallel, the last kernel still has to do O(n) addition operations. In fact, this version might actually run slower in practice, because a single GPU thread is much slower than a single CPU core.
We’ll introduce some terminology here: let the work of a program be the total number of operations we must do, across all kernels. Let the span of a program be the number of operations that must happen one after another, i.e., the number of operations the longest-running kernel must do. In essence, the span tells us how long a given algorithm must take if run in parallel, and the work tells us how long the same algorithm must take if run sequentially.
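As a concrete example of this accounting, consider the naive kernel above: the thread at index idx performs idx + 1 additions, so
work = 1 + 2 + … + N = N(N + 1)/2 = O(N^2)
span = N = O(N)    (the chain of additions done by the last thread, idx = N - 1)
So the naive GPU version does strictly more total work than the O(N) CPU loop, even though its span is the same.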
A better algorithm is incredibly non-obvious, but if you're curious to discover one on your own, here are a few hints:
- This algorithm requires launching multiple kernels back to back; it cannot be done with one kernel alone.
- Most parallel algorithms fundamentally get faster by taking advantage of some kind of binary tree-like structure, and exploring both branches in parallel.
- This technique is very similar to that of a binary indexed tree, or Fenwick tree.
It turns out there exists an O(n log n) work and O(log n) span algorithm that can do this prefix sum. Here’s the intuition:
- We’re going to launch multiple kernels, back to back.
- For each index i, we want to conceptualize the idea of a “window,” representing the number of elements prior to i that we have summed up. For example, if we have the array a = [1, 4, 7, 1, 3] and our window_size = 2 elements, then ps = [1, 5, 11, 8, 4], since each index stores the sum of the current element and the element before it.
- At every iteration, we want to double this window size. Thus, using log(n) iterations (i.e., by launching log(n) kernels), we can successfully sum up all of the elements in the entire array. (A full worked example of this doubling follows this list.)
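Here is what the doubling looks like on that same example array, a = [1, 4, 7, 1, 3], after each iteration (each value is the sum of a window of the given size ending at that index, clipped at the start of the array):
window_size = 1:  ps = [1, 4, 7, 1, 3]
window_size = 2:  ps = [1, 5, 11, 8, 4]
window_size = 4:  ps = [1, 5, 12, 13, 15]
window_size = 8:  ps = [1, 5, 12, 13, 16]   (the full prefix sum)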
This is what the pseudocode for this algorithm looks like:
Kernel:
void psum_it(int* prev, int* next, int last_window_size){
    int idx = get_global_id();
    // If there is an element one full window to the left, merge its window into ours,
    // doubling the window size; otherwise our window already reaches the start of the array.
    if(idx - last_window_size >= 0)
        next[idx] = prev[idx] + prev[idx - last_window_size];
    else
        next[idx] = prev[idx];
}
Host program:
last_window_size = 1
current = gpu.copy(a)
next = gpu.empty_like(a)
for i in range(ceil(math.log2(N))):
    # currently building: window_size = last_window_size * 2
    # write into next (not in place), so no thread reads a value another thread has already overwritten
    psum_it(current, next, last_window_size)
    last_window_size *= 2
    current, next = next, current
# after the loop, current holds the completed prefix sum
Let’s go back to the example where the current window size is equal to 2. We have the following:
a = [1, 4, 7, 1, 3]
ps = [1, 5, 11, 8, 4]
Let’s also keep track of the range of indices that each element in ps has summed up:
range = [(0, 0), (0, 1), (1, 2), (2, 3), (3, 4)]
Consider the element at index = 3. Currently, it stores the elements at indices 2 and 3, so its window extends one to the left. When we double the window size, we want to consider the first element outside of the current window, i.e., the element at index = 3 - last_window_size = 3 - 2 = 1. Because the current window_size = 2, not only does ps[3] store 2 elements, but ps[1] also stores 2 elements. So, if we add ps[1] + ps[3], we have essentially doubled the window size: we add all the elements from index = 3 extending one to the left (indices 2 and 3), and then all the elements from index = 3 - 2 = 1 extending one to the left (indices 0 and 1).
If we repeat the above process, the window size will eventually become the entire array, and we will have calculated the prefix sum!
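If you'd like to check this for yourself, here is a small, purely CPU-side Python simulation of the same doubling loop (no GPU involved). The names mirror the pseudocode above; the inner for loop plays the role of the parallel threads, which is safe here because every iteration only reads from current and writes to nxt.
import math

def prefix_sum_doubling(a):
    N = len(a)
    current = list(a)
    nxt = [0] * N                      # "next" in the pseudocode above
    last_window_size = 1
    for _ in range(math.ceil(math.log2(N))):
        for idx in range(N):           # each idx corresponds to one GPU thread
            if idx - last_window_size >= 0:
                nxt[idx] = current[idx] + current[idx - last_window_size]
            else:
                nxt[idx] = current[idx]
        last_window_size *= 2
        current, nxt = nxt, current
    return current

print(prefix_sum_doubling([1, 4, 7, 1, 3]))   # prints [1, 5, 12, 13, 16]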
One crucial question is whether this algorithm still works in the cases where we can no longer double the window size, because there are not enough elements to the left and the window would run off the start of the array. Luckily, we can prove by induction that this is not a problem.
We prove the statement that, at a given window size 2^n, ps[i] = a[max(0, i - 2^n + 1)] + … + a[i] for all i; in other words, ps[i] covers the (inclusive) index range (max(0, i - 2^n + 1), i).
Consider our base case, where window_size = 1 (that is, n = 0). In this case, for all i, ps[i] = a[i], and the claimed range is (max(0, i - 1 + 1), i) = (i, i). So, the statement holds.
As our inductive hypothesis, assume that the statement holds for last_window_size = 2^(n-1). Let current_window_size = 2^n. For each index i, we have the following cases:
Case 1: i - 2^(n-1) < 0. In this case, we set ps_new[i] = ps_old[i], and by the inductive hypothesis, ps_old[i] already covers the range (0, i). Since i < 2^(n-1) < 2^n, this is exactly the target range (max(0, i - 2^n + 1), i) = (0, i).
Case 2: i - 2^(n-1) >= 0. In this case, we set ps_new[i] = ps_old[i] + ps_old[i - 2^(n-1)].
By the inductive hypothesis, ps_old[i] covers the range (i - 2^(n-1) + 1, i), and ps_old[i - 2^(n-1)] covers the range (max(0, i - 2^(n-1) - 2^(n-1) + 1), i - 2^(n-1)) = (max(0, i - 2^n + 1), i - 2^(n-1)). These two ranges are adjacent and do not overlap, so the new range covered by ps_new[i] is exactly (max(0, i - 2^n + 1), i), as required.
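To make Case 2 concrete, take n = 2 (so current_window_size = 4 and last_window_size = 2), i = 3, and the example array a = [1, 4, 7, 1, 3] from before:
ps_old[3] = 8, covering the range (3 - 2 + 1, 3) = (2, 3), i.e., a[2] + a[3] = 7 + 1
ps_old[3 - 2] = ps_old[1] = 5, covering (max(0, 3 - 4 + 1), 1) = (0, 1), i.e., a[0] + a[1] = 1 + 4
ps_new[3] = 8 + 5 = 13, covering (max(0, 3 - 4 + 1), 3) = (0, 3), which is indeed a[0] + a[1] + a[2] + a[3].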
Now, this proof for case 2 may not be satisfying, so here I’ll present another way of thinking about it that helped convince me that this algorithm is correct.
Intuitively, at every iteration, we’re jumping back by the biggest power of two that we can. It looks kind of like the guy in Figure 1 below:
Let's take a look at an example. Take the element at index = 100, which I chose because it is not a power of two. Let ps[pow][i] represent the i-th element of the ps array at a window size of 2^pow. Let’s say our last_window_size = 64, so pow = 6.
So, at the last iteration where we’re able to jump back, we set ps[7][100] = ps[6][100] + ps[6][36]. It is easy to convince ourselves that ps[6][100] stores all of the elements from index 100 down to index 100 - 64 + 1 = 37, because this window does not run off the left side of the array. But it might be much less clear that ps[6][36] stores all the elements from 0…36. So, let’s expand ps[6][36] a bit:
ps[7][100]
= ps[6][100] + ps[6][36]
= ps[6][100] + ps[5][36] + ps[5][4]
= ps[6][100] + ps[5][36] + ps[4][4] + 0
= ps[6][100] + ps[5][36] + ps[3][4] + 0
= ps[6][100] + ps[5][36] + ps[2][4] + ps[2][0]
Hopefully you can see that kind of “jump back as far as you can each time” structure here. (The final ps[2][0] term is just the element at index 0, where the chain of jumps ends.) Notice as well that the sizes of the jumps essentially express 100 = 64 + 32 + 4, which has a nice relationship to the binary representation of 100, namely 1100100.
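If you want to play with this, here is a tiny, purely illustrative Python snippet that greedily jumps back by the largest power of two it can, starting from index 100; the jump sizes it prints are exactly 64, 32, and 4.
idx = 100
for pow in range(6, -1, -1):                 # window sizes 64, 32, 16, ..., 1
    if idx - 2**pow >= 0:
        print(f"jump back by {2**pow}: {idx} -> {idx - 2**pow}")
        idx -= 2**pow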
This entire algorithm is incredibly non-obvious, but that's why we have it here as a technique that we can learn, and then apply to other problems later on, without coming up with it from scratch. Because these techniques are so foundational, they can become part of our toolbox. Next time, we will use this algorithm to design a GPU-accelerated quicksort routine.