## Arbitrarily-sized Uniform Sampling

Suppose you have a dataset which is too big to process in its entirety. It might come from a firehose of an incoming data stream, or maybe it is a point cloud of Lidar data from a self-driving car project. You need to downsample to a usable level. I present an algorithm which, as far as I can tell, has not been suggested before, and I compare it to two other common algorithms with timing data. Spoiler alert: it is not the fastest algorithm.

### For-loop with step

A simple way of downsampling is to simply take every *n*-th point.
This is only uniform downsampling if the incoming data is uncorrelated with
its nearby points. In my case of the Lidar data, the point cloud was too
big to render smoothly when rotating it. I could keep the same vertex array(s)
and use a subsampled index array: while rotating, draw with the subsampled index
array, and afterwards draw with the full array. For this purpose, although
immediate neighbors in the dataset are correlated, the correlation falls off
rapidly with distance from a point.

```
std::vector<size_t> uniformStep(size_t maxIdx, size_t step) {
    std::vector<size_t> indices;
    size_t n = maxIdx / step;
    indices.reserve(n);
    for (size_t i = 0; i < n; ++i) {
        indices.push_back(step * i);
    }
    return indices;
}
```

This lets me subsample to n / 2, n / 3, ..., but what if I want something in between, such as keeping 3/4 of the points? Or what if I want an arbitrary number of points, like 1050273, because maybe my embedded system does everything in software and that count works out to exactly 30 fps?

### Randomly choose *n*

The standard approach is to pick a random number and, as long as that number hasn’t been picked yet and we still need more points, add it to our list.

```
std::vector<size_t> uniformRandom(size_t maxIdx, size_t n) {
    std::random_device rd;   // for the seed
    std::mt19937 gen(rd());  // Mersenne twister with seed
    std::uniform_int_distribution<size_t> uniform(0, maxIdx - 1);
    std::set<size_t> used;
    std::vector<size_t> indices;
    indices.reserve(n);
    while (indices.size() < n) {
        size_t idx = uniform(gen);
        if (used.find(idx) == used.end()) {
            indices.push_back(idx);
            used.insert(idx);
        }
    }
    return indices;
}
```

This is about 5 times slower, due to the random number generation and the set
lookups. Since we are using `size_t`, which hashes quickly, we could use a hashed
set, `unordered_set`, instead. The bigger problem is that if `n` is close to
`maxIdx`, the loop takes much longer, because a freshly drawn number tends to be
already chosen. If `n` is much smaller than `maxIdx`, so that we can assume the
random numbers won’t collide, this algorithm needs roughly O(n) draws. As `n`
approaches `maxIdx`, however, choosing each of the last indices requires roughly
`maxIdx` tries on average, since nearly all `maxIdx` values are already chosen;
summing over all the picks gives the coupon-collector bound of
O(maxIdx log maxIdx) expected draws.
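A sketch of the `unordered_set` variant mentioned above. The name `uniformRandomHashed` is mine, and reserving the set up front is an assumption about typical use, not something the original code did:

```cpp
#include <cstddef>
#include <random>
#include <unordered_set>
#include <vector>

// Same rejection-sampling loop, but membership tests are O(1) on average.
std::vector<size_t> uniformRandomHashed(size_t maxIdx, size_t n) {
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<size_t> uniform(0, maxIdx - 1);
    std::unordered_set<size_t> used;
    used.reserve(n);  // avoid rehashing as the set grows
    std::vector<size_t> indices;
    indices.reserve(n);
    while (indices.size() < n) {
        size_t idx = uniform(gen);
        if (used.insert(idx).second) {  // .second is true if idx was new
            indices.push_back(idx);
        }
    }
    return indices;
}
```

If sorted output is needed afterwards, a `std::sort` on the result adds the O(n log n) discussed below.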

Also, the output is not sorted, so if you need the indices sorted, that adds another O(n log n) on top of the subsampling algorithm. (Well, except that in this special case our values are integers, so we could potentially use a radix sort to get O(n).)

### For-loop, floating-point step

What if we could use a non-integer step value in the first function?
The idea here is to overlay the indices on the subsampled grid and take
the indices that land on the lines of the subsampled grid. I keep
a “dist” variable that holds a sort of conceptual distance from the last kept
point. It increments by one for each index, but also carries the error resulting
from comparing on integer boundaries. This algorithm usually produces exactly *n*
points, but occasionally floating-point error results in one fewer.

```
std::vector<size_t> uniformFloatStep(size_t maxIdx, size_t n)
{
    std::vector<size_t> indices;
    indices.reserve(n);
    // float can only accurately represent integers up to 2^24 = 16 million,
    // which is probably too small if we need to downsample, so use double.
    double step = double(maxIdx) / double(n);
    double dist = 0.0;
    indices.push_back(0);
    for (size_t i = 1; i < maxIdx; ++i) {
        if (dist >= step) {
            indices.push_back(i);
            dist -= step;
        }
        dist += 1.0;
    }
    return indices;
}
```

### Special case: maxIdx = n

There is a special case where `maxIdx == n`. Obviously this
is not useful for subsampling, but in graphics code it can be handy to
call the same function for the full index array of { 0, 1, 2, ..., maxIdx - 1 }
as for the subsampled one. Instead of a loop, it turns out there is
`std::iota()`. Note that this is “iota” (the Greek letter), not
“itoa” (int-to-ascii)! Apparently it is named after the function ι
(iota) in the APL language. For some values of
`maxIdx` it can be 3-4 times as fast as a loop, at least with clang 11.0
on a Core i9.
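For example, `std::iota` fills a range with sequentially increasing values. The wrapper name `fullIndices` is mine:

```cpp
#include <cstddef>
#include <numeric>  // std::iota
#include <vector>

// Build the full index array { 0, 1, 2, ..., maxIdx - 1 } without a hand-written loop.
std::vector<size_t> fullIndices(size_t maxIdx) {
    std::vector<size_t> indices(maxIdx);
    std::iota(indices.begin(), indices.end(), size_t(0));
    return indices;
}
```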

### Timings

| maxIdx = 100M | Big O | n = 10M | n = 1M | n = 100k |
| --- | --- | --- | --- | --- |
| uniform step | O(n) | 0.382 | 0.00363 | 0.000356 |
| uniform random | [O(n log n) → O(maxIdx log maxIdx)] + O(n log n) | 10.3 | 0.552 | 0.0282 |
| uniform random (hash set) | [O(n) → O(maxIdx log maxIdx)] + O(n log n) | 4.44 | 0.354 | 0.0166 |
| uniform float step | O(maxIdx) | 0.229 | 0.229 | 0.216 |

Timings are in seconds, measured with clang 11.0 using `clang -std=c++17 -O3` on a 2.9 GHz Core i9 with 32 GB of memory.

### Summary

The fastest algorithm is clearly the for loop with a step, since it is O(n).
However, if you need an exact number of evenly spaced points, or if `n`
is within a factor of 10 of `maxIdx`, my “float step” algorithm is
a good choice. Otherwise, and assuming you do not need the results sorted,
the standard uniform random algorithm using a hash set is the fastest.

**Uniform step**

- Fastest algorithm
- Does not produce exactly `n` results
- Evenly samples, which may be useful for something like audio data
- Returns indices in sorted order
- Easily used for a continuous stream

**Uniform random**

- Slow unless `n` is much less than `maxIdx` (by about two orders of magnitude)
- Returns exactly `n` results
- Produces uniform results even if the input data is correlated by locality
- Not sorted (although that may not be a problem)
- Could be adapted for a continuous stream

**Uniform float step**

- Consistent speed, but not usually the fastest
- Good results when `n` is >= 10% of `maxIdx`
- Usually returns exactly `n` results, sometimes one fewer
- Evenly samples, which may be useful for something like audio data
- Returns indices in sorted order
- Easily used for a continuous stream