|dc.description.abstract||The largest supercomputers have millions of independent processors, and concurrency levels are rapidly increasing. For ideal efficiency, developers of the simulations that run on these machines must ensure that computational work is evenly balanced among processors. Assigning work evenly is challenging because many large modern parallel codes simulate behavior of physical systems that evolve over time, and their workloads change over time. Furthermore, the cost of imbalanced load increases with scale because most large-scale scientific simulations today use a Single Program Multiple Data (SPMD) parallel programming model, and an increasing number of processors will wait for the slowest one at the synchronization points.
To address load imbalance, many large-scale parallel applications use dynamic load balance algorithms to redistribute work evenly. The research objective of this dissertation is to develop methods to decide when and how to load balance the application, and to balance it effectively and affordably. We measure and evaluate the computational load of the application, and develop strategies to decide when and how to correct the imbalance. Depending on the simulation, a fast, local load balance algorithm may be suitable, or a more sophisticated and expensive algorithm may be required. We developed a model for comparison of load balance algorithms for a specific state of the simulation that enables the selection of a balancing algorithm that will minimize overall runtime.
Dynamic load balancing of parallel applications becomes more critical at scale, while also being expensive. To make the load balance correction affordable at scale, we propose a lazy load balancing strategy that evaluates the imbalance and computes the new assignment of work to processes asynchronously to the main application computation. We decouple the load balance algorithm from the application and run it on potentially fewer, separate processors. In this Multiple Program Multiple Data (MPMD) configuration, the load balance algorithm can execute concurrently with the application and with higher parallel efficiency than if it were run on the same processors as the simulation. Work is reassigned lazily as directions become available, and the application need not wait for the load balance algorithm to complete. We show that we can save resources by running a load balance algorithm at higher parallel efficiency on a smaller number of processors. Using our framework, we explore the trade-offs of load balancing configurations and demonstrate a performance improvement of up to 46%.||