In Part 1 of this series, I explained what GPU programming is at a high level. Now we can move on to exploring your first GPU algorithm: scan/prefix sum.

If you've ever tried programming a GPU before, you're probably familiar with vector addition. This is the use case where you have two large arrays of numbers, and you want to build a new array, representing the element-wise sum. This algorithm is normally used to demonstrate the power of GPU programming, by showcasing one way to exploit its massively parallel capabilities. However, although this example is useful, it misses out on the techniques involved in more advanced GPU programming, because it views parallel programming as simply a tool that exists to parallelize certain workloads, rather than a language or set of computational tools in its own right.

Here, we cover vector addition as a prelude to more advanced algorithms, which we discuss later in this article. Briefly, we can write such a vector adding algorithm as follows:

Given two vectors (arrays), `a` and `b`, each of size `N`, we want to compute the sum, `c`. We can do this in series, using the CPU version below, or we can do this in parallel, using the GPU version. Note that we use Python-style syntax for the CPU, and C++-style syntax for the GPU. This is because GPU computations are normally written in low-level C-like languages such as CUDA, and CPU computations are often handled by high-level Python wrappers such as Torch or Numba.

CPU version:

```
for i in range(N):
    c[i] = a[i] + b[i]
```

GPU version:

```
void add_kernel(int* a, int* b, int* c){
    int idx = get_global_id();
    c[idx] = a[idx] + b[idx];
}
```

Note that we no longer have a loop over array indices. Instead, we use something called a "global ID." On the GPU, we have many copies of the kernel program, each running on a different thread, so we need some way to know which thread we're running on, and how to map the current thread to the data it should access. This program works because each thread can execute independently. Now, we will use these techniques to build a more advanced prefix sum algorithm.
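If it helps to see this execution model concretely, here is a small Python sketch that mimics it. The `launch` helper and the explicit `idx` parameter are stand-ins for the GPU runtime and `get_global_id()`, introduced here purely for illustration:

```python
# A minimal Python sketch of the GPU execution model for vector addition.
# Each call to add_kernel plays the role of one GPU thread; `idx` stands in
# for the value get_global_id() would return on a real device.
def add_kernel(a, b, c, idx):
    c[idx] = a[idx] + b[idx]

def launch(kernel, n, *args):
    # On a real GPU these calls would run in parallel; here we loop,
    # which is safe only because each "thread" touches a distinct index.
    for idx in range(n):
        kernel(*args, idx)

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
c = [0] * 4
launch(add_kernel, 4, a, b, c)
print(c)  # → [11, 22, 33, 44]
```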

## Prefix sum

Let’s take an array, `a : int array`. Then, given this array, we want to generate a new array, `ps : int array`, such that for every index `idx`, we have `ps[idx] = a[0] + a[1] + … + a[idx]`. So, for example, if we have the array `a = [1, 4, 7, 1, 3]`, we would have `ps = [1, 5, 12, 13, 16]`. Note that `ps` is known as the "prefix sum" of `a`. The prefix sum might seem arbitrary, but it is interesting for two reasons:

- It comes up a lot in algorithm puzzles. So, if you're interested in learning parallel algorithms for LeetCode, this may come in handy.
- It is useful for certain parallel algorithms like QuickSort, for reasons we’ll explore in the next module.

The most efficient CPU version of a prefix sum calculation looks like this:

```
ps[0] = a[0]
for i in range(1, N):
    ps[i] = ps[i - 1] + a[i]
```

Note that this is the most efficient possible implementation, because we do O(N) operations, and we need at least O(N) operations to fill the prefix sum array.
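As a quick sanity check, the sequential loop above matches what Python's standard-library `itertools.accumulate` computes:

```python
from itertools import accumulate

a = [1, 4, 7, 1, 3]

# Sequential prefix sum, as in the loop above
ps = [0] * len(a)
ps[0] = a[0]
for i in range(1, len(a)):
    ps[i] = ps[i - 1] + a[i]

print(ps)                   # → [1, 5, 12, 13, 16]
print(list(accumulate(a)))  # same result via the standard library
```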

Let’s ignore finite instruction length restrictions for now while we design our GPU algorithm. Take a moment to think of a naive way of parallelizing this process. Remember that each kernel needs to be independent of the others.

One way we can do this is as follows: for each index, we can write a kernel that simply adds up the sum of all the elements before, like this:

```
void kernel(int* a, int* ps){
    int idx = get_global_id();
    ps[idx] = 0;
    for(int i = 0; i <= idx; i++){
        ps[idx] += a[i];
    }
}
```

However, notice that this procedure isn’t necessarily more efficient than the CPU version: even if all of our kernels run completely in parallel, the longest kernel still has to do O(n) addition operations. In fact, this version might actually run slower in practice, because an individual GPU core is much slower than a CPU core.

We’ll introduce some terminology here: let the **work** of a program be the total number of operations performed across all kernels, and let the **span** of a program be the number of operations performed by the **longest-running kernel**. In essence, the span determines how long a given algorithm takes if run fully in parallel, and the work determines how long the same algorithm takes if run sequentially.
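To make these definitions concrete, here is a small tally of work and span for the naive one-kernel-per-index approach above, where the kernel for index `i` performs `i + 1` additions:

```python
# Work and span of the naive prefix-sum kernels:
#   work = total operations across all kernels = n(n+1)/2 → O(n^2)
#   span = operations in the longest kernel    = n        → O(n)
n = 8
ops_per_kernel = [i + 1 for i in range(n)]
work = sum(ops_per_kernel)
span = max(ops_per_kernel)
print(work, span)  # → 36 8
```

So even with unlimited parallelism, the naive version is no faster than the sequential loop, and it does asymptotically more total work.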

A better algorithm is incredibly non-obvious, but if you're curious to discover one on your own, here are a few hints:

- This algorithm requires launching multiple kernels back to back, and cannot be done by using one kernel alone.
- Most parallel algorithms get faster fundamentally by taking advantage of some kind of binary tree-like structure, and exploring both branches in parallel.
- This technique is very similar to that of a binary indexed tree, or a Fenwick tree.

It turns out there exists an algorithm with O(n log n) work and O(log n) span that computes this prefix sum. Here’s the intuition:

- We’re going to launch multiple kernels, back to back.
- For each index `i`, we want to conceptualize the idea of a “window,” representing the number of elements ending at `i` that we have summed up so far. For example, if we have the array `a = [1, 4, 7, 1, 3]` and our `window_size = 2` elements, then `ps = [1, 5, 11, 8, 4]`, since each index stores the sum of the current element and the element before it.
- At every iteration, we want to double this window size. Thus, using log(n) iterations (i.e., by launching log(n) kernels back to back), we can successfully sum up all of the elements in the entire array.

This is what the pseudocode for this algorithm looks like:

Kernel:

```
void psum_it(int* prev, int* next, int last_window_size){
    int idx = get_global_id();
    if(idx - last_window_size >= 0)
        next[idx] = prev[idx] + prev[idx - last_window_size];
    else
        next[idx] = prev[idx];
}
```

Host program:

```
last_window_size = 1
current = gpu.copy(a)
next = gpu.empty_like(a)
for i in range(math.ceil(math.log2(N))):
    # this launch doubles the window: new window_size = last_window_size * 2
    psum_it(current, next, last_window_size)
    last_window_size *= 2
    current, next = next, current
```
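To check the logic, here is a plain-Python simulation of the kernel and host loop above. The sequential inner loop stands in for the parallel threads, and ordinary lists replace the hypothetical `gpu.copy`/`gpu.empty_like` buffers:

```python
import math

def psum_it(prev, nxt, last_window_size):
    # One kernel launch: every index runs this body "in parallel".
    for idx in range(len(prev)):  # each iteration = one thread
        if idx - last_window_size >= 0:
            nxt[idx] = prev[idx] + prev[idx - last_window_size]
        else:
            nxt[idx] = prev[idx]

def prefix_sum(a):
    n = len(a)
    current, nxt = list(a), [0] * n
    last_window_size = 1
    for _ in range(math.ceil(math.log2(n))):
        psum_it(current, nxt, last_window_size)
        last_window_size *= 2
        current, nxt = nxt, current  # ping-pong the two buffers
    return current

print(prefix_sum([1, 4, 7, 1, 3]))  # → [1, 5, 12, 13, 16]
```

Note the buffer swap at the end of each iteration: the kernel must read from one array and write to another, because a thread cannot safely read a value that another thread may be overwriting in the same launch.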

Let’s go back to the example where the current window size is equal to 2. We have the following:

`a = [1, 4, 7, 1, 3]`

`ps = [1, 5, 11, 8, 4]`

Let’s also keep track of the (inclusive) range of indices that each element in `ps` has summed up:

`range = [(0, 0), (0, 1), (1, 2), (2, 3), (3, 4)]`

Consider the element at `index = 3`. Currently, it stores the elements at indices `2` and `3`. So, its window extends one to the left. When we double the window size, we want to consider the first element outside of the current window, i.e., the element at `index = 3 - last_window_size = 3 - 2 = 1`. Because the current `window_size = 2`, not only does `ps[3]` store 2 elements, but `ps[1]` *also* stores 2 elements. So, if we add `ps[1] + ps[3]`, we have essentially doubled the window size, because we add all the elements from `index = 3` extending 1 to the left (indices `2` and `3`), and then we add all the elements from `index = 3 - 2 = 1` extending 1 to the left (indices `0` and `1`).

If we repeat the above process, the window size will eventually become the entire array, and we will have calculated the prefix sum!

One crucial question is whether this algorithm still works in the cases where we can no longer double the window size, because there are not enough elements left and we would leave the bounds of the array. Luckily, we can prove that this is not a problem using induction.

We prove the statement that, at a given window size `2^n`, `ps[i]` stores the sum of the elements in the inclusive index range `(max(0, i - 2^n + 1), i)`, for all `i`.

Consider our base case, where `window_size = 1` (i.e., `n = 0`). In this case, for all `i`, `ps[i] = a[i]`, which is exactly the range `(i, i)`. So, the statement holds.

As the inductive hypothesis (IH), assume that the statement holds for `last_window_size = 2^(n-1)`, and let `current_window_size = 2^n`. For each index `i`, we have the following cases:

**Case 1:** `i - 2^(n-1) < 0`. In this case, we set `ps_new[i] = ps_old[i]`, and by the IH, `ps_old[i]` covers the range `(0, i)`. Since `i < 2^(n-1) < 2^n`, this is exactly the range `(max(0, i - 2^n + 1), i)`.

**Case 2:** `i - 2^(n-1) >= 0`. In this case, we set `ps_new[i] = ps_old[i] + ps_old[i - 2^(n-1)]`. By the IH, `ps_old[i]` covers the range `(i - 2^(n-1) + 1, i)`, and `ps_old[i - 2^(n-1)]` covers the range `(max(0, i - 2^(n-1) - 2^(n-1) + 1), i - 2^(n-1)) = (max(0, i - 2^n + 1), i - 2^(n-1))`. These two ranges are adjacent, so the new range for `ps_new[i]` is exactly `(max(0, i - 2^n + 1), i)`.
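This invariant can also be checked empirically. Below is a plain-Python sketch (simulating the kernel launches sequentially) that asserts, before each launch with window size `w`, that `ps[i]` equals the sum of the last `w` elements ending at `i`, clamped at the left edge of the array:

```python
import math, random

# Empirically check the loop invariant: with window size w,
# current[i] == sum(a[max(0, i - w + 1) : i + 1]) for every i.
def check_invariant(a):
    n = len(a)
    current, nxt = list(a), [0] * n
    w = 1  # window size currently summed into `current`
    for _ in range(math.ceil(math.log2(n))):
        for i in range(n):
            assert current[i] == sum(a[max(0, i - w + 1):i + 1])
        for idx in range(n):  # one simulated kernel launch
            if idx - w >= 0:
                nxt[idx] = current[idx] + current[idx - w]
            else:
                nxt[idx] = current[idx]
        w *= 2
        current, nxt = nxt, current
    # After the final launch w >= n, so every window is clamped to index 0
    # and `current` holds the full prefix sums.
    assert current == [sum(a[:i + 1]) for i in range(n)]
    return True

random.seed(0)
print(check_invariant([random.randrange(10) for _ in range(37)]))  # → True
```

The array length 37 is an arbitrary non-power-of-two choice, so the clamped (out-of-bounds) cases are actually exercised.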

Now, this proof for case 2 may not be satisfying, so here I’ll present another way of thinking about it that helped convince me that this algorithm is correct.

Intuitively, at every iteration, we’re jumping back the biggest power of two that we can go. It looks kind of like the guy in Figure 1 below:

Let's take a look at an example. Take the element at `index = 100`, which I chose because it is not a power of two. Let `ps[pow][i]` represent the `i`th element of the `ps` array at a window size of `2^pow`. Let’s say our `last_window_size = 64`, so `pow = 6`.

So, at the last iteration where we’re able to jump back, we set `ps[7][100] = ps[6][100] + ps[6][36]`. We can be easily convinced that `ps[6][100]` stores all of the elements from index `100` down to `100 - 64 + 1 = 37`, because this window does not go out of range of the left side of the array. But, it might be much less clear that `ps[6][36]` stores all the elements from `0…36`. So, let’s expand `ps[6][36]` a bit:

```
ps[7][100]
= ps[6][100] + ps[6][36]
= ps[6][100] + ps[5][36] + ps[5][4]
= ps[6][100] + ps[5][36] + ps[4][4]             (4 - 16 < 0: no jump)
= ps[6][100] + ps[5][36] + ps[3][4]             (4 - 8 < 0: no jump)
= ps[6][100] + ps[5][36] + ps[2][4] + ps[2][0]
```

Hopefully you can see that kind of “jump back as far as you can each time” structure here. Notice as well that the jump distances essentially express `100 = 64 + 32 + 4`, which has a nice relationship to its binary number representation, `1100100`.
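That correspondence can be sketched directly: greedily subtracting the largest power of two that fits recovers exactly the set bits of the index.

```python
# The "jump back as far as you can" structure: from index i, repeatedly
# subtract the largest power of two that fits. The jump sizes recover the
# set bits of i's binary representation.
def jumps(i):
    out = []
    while i > 0:
        p = 1 << (i.bit_length() - 1)  # largest power of two <= i
        out.append(p)
        i -= p
    return out

print(jumps(100))  # → [64, 32, 4]
print(bin(100))    # → 0b1100100
```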

This entire algorithm is incredibly non-obvious, but that's why we have it here as a technique that we can learn, and then apply to other problems later on, without coming up with it from scratch. Because these techniques are so foundational, they can become part of our toolbox. Next time, we will use this algorithm to design a GPU-accelerated quicksort routine.