diff --git a/content/blog/load-balancing-within-openmp-static-schedules.md b/content/blog/load-balancing-within-openmp-static-schedules.md
new file mode 100644
index 0000000..2d0b0a7
--- /dev/null
+++ b/content/blog/load-balancing-within-openmp-static-schedules.md
@@ -0,0 +1,220 @@
+---
+title: "Load Balancing within OpenMP Static Schedules"
+date: 2023-05-05T21:42:56-04:00
+draft: false
+tags: []
+math: false
+medium_enabled: false
+---
+
+OpenMP allows C++ developers to create shared memory multi-processing applications.
+This is particularly beneficial for single program multiple data (SPMD) tasks.
+However, given a large batch of data to process,
+it is often the case that the data cannot be evenly split among all the available parallel threads.
+This creates an imbalance in how long each thread takes to finish.
+One popular solution is to use a dynamic scheduling algorithm,
+though this might not be desirable in all cases.
+What if we know of a unit of work that the idle threads can complete before the other threads finish?
+To take advantage of this potential opportunity,
+we need to understand how OpenMP distributes work within static schedules.
+
+Let's showcase static scheduling through an example.
+Imagine we have a vector of the fifty integers from 0 to 49 and
+we wish to find their sum without using a formula.
+
+```c++
+std::vector<int> numbers(50);
+std::iota(std::begin(numbers), std::end(numbers), 0);
+```
+
+Say we have eight threads.
+Let's initialize the data structures that will both keep track of the sum
+and keep track of the assignment of indices to each thread.
+
+```c++
+const int numThreads = 8;
+int sum = 0;
+std::vector<std::unordered_set<size_t>> assignments(numThreads, std::unordered_set<size_t>());
+```
+
+In standard static scheduling, we would write the parallel portion as follows:
+
+```c++
+#pragma omp parallel for schedule(static) reduction(+:sum)
+for (size_t i = 0; i < numbers.size(); i++)
+{
+    const int threadId = omp_get_thread_num();
+    assignments[threadId].insert(i);
+    sum += numbers[i];
+}
+```
+
+If we print out the assignments vector, it would look like the following:
+
+```
+Thread ID: 0 Allocations: 06 05 04 03 02 01 00
+Thread ID: 1 Allocations: 13 12 11 10 09 08 07
+Thread ID: 2 Allocations: 19 18 17 16 15 14
+Thread ID: 3 Allocations: 25 24 23 22 21 20
+Thread ID: 4 Allocations: 31 30 29 28 27 26
+Thread ID: 5 Allocations: 37 36 35 34 33 32
+Thread ID: 6 Allocations: 43 42 41 40 39 38
+Thread ID: 7 Allocations: 49 48 47 46 45 44
+```
+
+A couple of things to notice:
+
+- The first two threads each have one extra unit of work.
+This makes sense: if you divide fifty by eight, you have a remainder of two.
+In mathematical speak: `size % numThreads = 2`.
+This is an application of the pigeonhole principle
+(the sketch after this list makes the arithmetic concrete).
+
+- The portion of the array handed to each thread is contiguous.
+This takes advantage of the principle of locality:
+memory locations near a recent access are cached, since programs are likely to access nearby locations.
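+
+To make the remainder arithmetic concrete, here is a minimal standalone sketch
+(separate from the program we are building up; the fifty-element, eight-thread
+setup is simply this post's running example). It prints how many iterations each
+thread receives and where its chunk starts, using a closed form for the start
+index that is equivalent to the branching version derived below:
+
+```c++
+#include <algorithm>
+#include <cstddef>
+#include <iostream>
+
+int main()
+{
+    const std::size_t size = 50;
+    const std::size_t numThreads = 8;
+    const std::size_t leftOvers = size % numThreads;          // 2
+    const std::size_t elementsPerThread = size / numThreads;  // 6
+
+    for (std::size_t threadId = 0; threadId < numThreads; threadId++)
+    {
+        // Threads with id < leftOvers receive one extra element.
+        const std::size_t count = elementsPerThread + (threadId < leftOvers ? 1 : 0);
+        // Closed form: skip one extra slot for every lower-id thread that received a leftover.
+        const std::size_t start = threadId * elementsPerThread + std::min(threadId, leftOvers);
+        std::cout << "Thread " << threadId << ": " << count
+                  << " elements starting at index " << start << std::endl;
+    }
+    return 0;
+}
+```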
+
+To emulate this, we'll need to determine the start and end indices that each thread should consider.
+To help out, let's introduce two new variables.
+
+```c++
+const int leftOvers = numbers.size() % numThreads;
+const int elementsPerThread = numbers.size() / numThreads;
+```
+
+The rest of this example will live within a parallel block.
+Since we're not letting OpenMP distribute the loop iterations automatically,
+we'll have to write the pragma a little differently.
+
+```c++
+omp_set_dynamic(0);
+omp_set_num_threads(numThreads);
+#pragma omp parallel reduction(+:sum)
+{
+    const int threadId = omp_get_thread_num();
+    // Later code is within here
+}
+```
+
+The start and end indices need to take into account
+not only how many elements each thread processes,
+but also how many of the lower thread ids carry an extra unit of work.
+
+```c++
+size_t start;
+if (threadId > leftOvers) {
+    start = (threadId * elementsPerThread) + leftOvers;
+} else {
+    start = threadId * elementsPerThread + threadId;
+}
+size_t end;
+if (threadId + 1 > leftOvers) {
+    end = (threadId + 1) * elementsPerThread + leftOvers;
+} else {
+    end = (threadId + 1) * elementsPerThread + threadId + 1;
+}
+```
+
+Now we can ask each thread to iterate between its start and end indices.
+
+```c++
+for (size_t i = start; i < end; i++)
+{
+    assignments[threadId].insert(i);
+    sum += numbers[i];
+}
+```
+
+How do we know whether a thread is performing fewer units of statically allocated work?
+We check its thread id.
+Since thread ids 0 through `leftOvers - 1` are each assigned one extra unit of work,
+any thread whose id is greater than or equal to `leftOvers` is fair game.
+
+```c++
+if (threadId >= leftOvers && leftOvers > 0) {
+    // Do extra work here
+}
+```
+
+Now, of course, it's up to you to figure out what the right type of extra work is.
+If the extra work ends up taking longer than the work already allocated, the schedule will remain imbalanced.
+This technique, however, gives you another tool in your parallel computing toolbelt if the dynamic scheduling algorithm is suboptimal for your application.
+
+Full load balance code example:
+
+```c++
+#include <iostream>
+#include <iomanip>
+#include <numeric>
+#include <omp.h>
+#include <unordered_set>
+#include <vector>
+
+int main()
+{
+    std::vector<int> numbers(50);
+    std::iota(std::begin(numbers), std::end(numbers), 0);
+
+    const int numThreads = 8;
+    const int leftOvers = numbers.size() % numThreads;
+    const int elementsPerThread = numbers.size() / numThreads;
+
+    int sum = 0;
+    std::vector<std::unordered_set<size_t>> assignments(numThreads, std::unordered_set<size_t>());
+
+    omp_set_dynamic(0);
+    omp_set_num_threads(numThreads);
+    #pragma omp parallel reduction(+:sum)
+    {
+        const int threadId = omp_get_thread_num();
+        size_t start;
+        if (threadId > leftOvers) {
+            start = (threadId * elementsPerThread) + leftOvers;
+        } else {
+            start = threadId * elementsPerThread + threadId;
+        }
+        size_t end;
+        if (threadId + 1 > leftOvers) {
+            end = (threadId + 1) * elementsPerThread + leftOvers;
+        } else {
+            end = (threadId + 1) * elementsPerThread + threadId + 1;
+        }
+
+        for (size_t i = start; i < end; i++)
+        {
+            assignments[threadId].insert(i);
+            sum += numbers[i];
+        }
+
+        if (threadId >= leftOvers && leftOvers > 0) {
+            // Do extra work here
+        }
+    }
+
+    // Print reduced sum
+    std::cout << "sum = " << sum << std::endl;
+
+    // Print thread allocations
+    for (size_t i = 0; i < assignments.size(); i++)
+    {
+        std::cout << "Thread ID: " << i << " Allocations: ";
+        for (const auto identifier : assignments[i]) {
+            std::cout << std::setfill('0') << std::setw(2) << identifier << " ";
+        }
+        std::cout << std::endl;
+    }
+
+    std::cout << "Min Allocations per Thread: " << elementsPerThread << std::endl;
+    std::cout << "Extra: " << leftOvers << std::endl;
+
+    return 0;
+}
+```
+
+Note to future self: to compile an OpenMP application, you need to pass the `-fopenmp` flag to `g++`.
+
+```bash
+g++ -fopenmp filename.cpp -o executablename
+```
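+
+As a final point of comparison, the dynamic scheduling mentioned at the start of the post
+lets the runtime hand out iterations on demand rather than fixing the assignment up front.
+The following is a minimal sketch of that alternative for the same sum, not the
+load-balancing technique above, just the baseline it is being contrasted with:
+
+```c++
+#include <cstddef>
+#include <iostream>
+#include <numeric>
+#include <vector>
+
+int main()
+{
+    std::vector<int> numbers(50);
+    std::iota(std::begin(numbers), std::end(numbers), 0);
+
+    int sum = 0;
+
+    // A thread that finishes its chunk early simply grabs more iterations
+    // instead of sitting idle, at the cost of scheduling overhead.
+    #pragma omp parallel for schedule(dynamic) reduction(+:sum)
+    for (std::size_t i = 0; i < numbers.size(); i++)
+    {
+        sum += numbers[i];
+    }
+
+    std::cout << "sum = " << sum << std::endl;
+    return 0;
+}
+```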