## Arbitrarily-sized Uniform Sampling

Suppose you have a dataset which is too big to process in its entirety. It might come from a firehose of an incoming data stream, or maybe it is a point cloud of Lidar data from a self-driving car project. You need to downsample to a usable level. I present an algorithm which, as far as I can tell, has not been suggested before, and I compare it to two other common algorithms with timing data. Spoiler alert: it is not the fastest algorithm.

### For-loop with step

A simple way of downsampling is to simply take every *n*-th point.
This is only uniform downsampling if the incoming data is uncorrelated with
its nearby points. In my case of the Lidar data, the point cloud was too
big to render smoothly when rotating it. I could keep the same vertex array(s)
and use a subsampled index array: while rotating, draw with the subsampled index
array, and afterwards draw with the full array. For this purpose, although
immediate neighbors in the dataset are correlated, the correlation falls off
rapidly with distance from a point.

```
std::vector<size_t> uniformStep(size_t maxIdx, size_t step) {
    std::vector<size_t> indices;
    size_t n = maxIdx / step;
    indices.reserve(n);
    for (size_t i = 0; i < n; ++i) {
        indices.push_back(step * i);
    }
    return indices;
}
```

This lets me subsample to n / 2, n / 3, ..., but what if I want something in between, such as keeping 3/4 of the points? Or what if I want an arbitrary number of points, like 1050273, because maybe my embedded system does everything in software and that count works out to exactly 30 fps?

### Randomly choose *n*

The standard approach is to pick a random number and, as long as that number hasn’t been picked yet and we still need more points, add it to our list.

```
std::vector<size_t> uniformRandom(size_t maxIdx, size_t n) {
    std::random_device rd;   // for the seed
    std::mt19937 gen(rd());  // Mersenne twister with seed
    std::uniform_int_distribution<size_t> uniform(0, maxIdx - 1);
    std::set<size_t> used;
    std::vector<size_t> indices;
    indices.reserve(n);
    while (indices.size() < n) {
        size_t idx = uniform(gen);
        if (used.find(idx) == used.end()) {
            indices.push_back(idx);
            used.insert(idx);
        }
    }
    return indices;
}
```

This is about 5 times slower, due to the random number generation and the set
lookups. Since we are using `size_t`, which hashes quickly, we could use a hashed
set, `unordered_set`, instead. The bigger problem is that if `n` is close to
`maxIdx`, the loop takes much longer, because a freshly drawn number tends to be
already chosen. If `n` is much smaller than `maxIdx`, so that we can assume the
random numbers won’t collide, this algorithm needs roughly O(n) draws. As `n`
approaches `maxIdx`, however, choosing each of the last indices requires roughly
`maxIdx` tries on average, since nearly all `maxIdx` values are already chosen;
summing over all the picks gives the coupon-collector bound of
O(maxIdx log maxIdx) expected draws.
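A sketch of the `unordered_set` variant mentioned above. The name `uniformRandomHashed` is mine, and reserving the set up front is an assumption about typical use, not something the original code did:

```cpp
#include <cstddef>
#include <random>
#include <unordered_set>
#include <vector>

// Same rejection-sampling loop, but membership tests are O(1) on average.
std::vector<size_t> uniformRandomHashed(size_t maxIdx, size_t n) {
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<size_t> uniform(0, maxIdx - 1);
    std::unordered_set<size_t> used;
    used.reserve(n);  // avoid rehashing as the set grows
    std::vector<size_t> indices;
    indices.reserve(n);
    while (indices.size() < n) {
        size_t idx = uniform(gen);
        if (used.insert(idx).second) {  // .second is true if idx was new
            indices.push_back(idx);
        }
    }
    return indices;
}
```

If sorted output is needed afterwards, a `std::sort` on the result adds the O(n log n) discussed below.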

Also, the output is not sorted, so if you need the indices sorted, that adds another O(n log n) on top of the subsampling algorithm. (Well, except that in this special case our values are integers, so we could potentially use a radix sort to get O(n).)

### For-loop, floating-point step

What if we could use a non-integer step value in the first function?
The idea here is to overlay the indices on the subsampled grid and take
the indices that land on the lines of the subsampled grid. I keep
a “dist” variable that holds a sort of conceptual distance from the last kept
point. It increments by one for each index, but also carries the error resulting
from comparing on integer boundaries. This algorithm usually produces exactly *n*
points, but occasionally floating-point error results in one fewer.

```
std::vector<size_t> uniformFloatStep(size_t maxIdx, size_t n)
{
    std::vector<size_t> indices;
    indices.reserve(n);
    // float can only accurately represent integers up to 2^24 = 16 million,
    // which is probably too small if we need to downsample, so use double.
    double step = double(maxIdx) / double(n);
    double dist = 0.0;
    indices.push_back(0);
    for (size_t i = 1; i < maxIdx; ++i) {
        if (dist >= step) {
            indices.push_back(i);
            dist -= step;
        }
        dist += 1.0;
    }
    return indices;
}
```

### Special case: maxIdx = n

There is a special case where `maxIdx == n`. Obviously this
is not useful for subsampling, but in graphics code it can be handy to
call the same function for the full index array of { 0, 1, 2, ..., maxIdx - 1 }
as for the subsampled one. Instead of a loop, it turns out there is
`std::iota()`. Note that this is “iota” (the Greek letter), not
“itoa” (int-to-ascii)! Apparently it is named after the function ι
(iota) in the APL language. For some values of
`maxIdx` it can be 3-4 times as fast as a loop, at least with clang 11.0
on a Core i9.
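For example, `std::iota` fills a range with sequentially increasing values. The wrapper name `fullIndices` is mine:

```cpp
#include <cstddef>
#include <numeric>  // std::iota
#include <vector>

// Build the full index array { 0, 1, 2, ..., maxIdx - 1 } without a hand-written loop.
std::vector<size_t> fullIndices(size_t maxIdx) {
    std::vector<size_t> indices(maxIdx);
    std::iota(indices.begin(), indices.end(), size_t(0));
    return indices;
}
```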

### Timings

| maxIdx = 100M | Big O | n = 10M | n = 1M | n = 100k |
| --- | --- | --- | --- | --- |
| uniform step | O(n) | 0.382 | 0.00363 | 0.000356 |
| uniform random | [O(n log n) → O(maxIdx log maxIdx)] + O(n log n) | 10.3 | 0.552 | 0.0282 |
| uniform random (hash set) | [O(n) → O(maxIdx log maxIdx)] + O(n log n) | 4.44 | 0.354 | 0.0166 |
| uniform float step | O(maxIdx) | 0.229 | 0.229 | 0.216 |

Timings are in seconds, measured with clang 11.0 using `clang -std=c++17 -O3` on a 2.9 GHz Core i9 with 32 GB of memory.

### Summary

The fastest algorithm is clearly the for loop with a step, since it is O(n).
However, if you need an exact number of evenly spaced points, or if `n`
is within a factor of 10 of `maxIdx`, my “float step” algorithm is
a good choice. Otherwise, and assuming you do not need the results sorted,
the standard uniform random algorithm using a hash set is the fastest.

**Uniform step**

- Fastest algorithm
- Does not produce exactly `n` results
- Evenly samples, which may be useful for something like audio data
- Returns indices in sorted order
- Easily used for a continuous stream

**Uniform random**

- Slow unless `n` is much less than `maxIdx` (by about two orders of magnitude)
- Returns exactly `n` results
- Produces uniform results even if the input data is correlated by locality
- Not sorted (although that may not be a problem)
- Could be adapted for a continuous stream

**Uniform float step**

- Consistent speed, but not usually the fastest
- Good results when `n` is >= 10% of `maxIdx`
- Usually returns exactly `n` results, sometimes one fewer
- Evenly samples, which may be useful for something like audio data
- Returns indices in sorted order
- Easily used for a continuous stream