Distributed Modeling: Ring-Reduce vs. All-Reduce
Updated: Jan 20
In this blog post, I’d like to share some of the insights from my work at the High-Performance Computing (HPC) Texas Advanced Computing Center (TACC) cluster (circa 2010, TACC had "Lonestar" cluster with 5200 2Gb-nodes) within the HPC Technology Group at Conoco-Phillips Corp. Specifically, I want to address the issue of Data vs. Model distributed computations.
I noticed that in discussions of All-Reduce vs. Ring-Reduce distribution we often do not consider the price we would need to pay for Ring-Reduce. All-Reduce essentially is equivalent to non-distributed training, i.e. we should expect same results with both approaches, but models size will be limited by one's node memory. Ring-Reduce allows us to train large models that do not fit in a single node memory. However, there is a computational price that is needed to be paid and there are methods to deal with it in order to minimize this price.
Here is an intuitive explanation of Ring-Reduce using a big matrix solving notation that illustrates the 'price' that one pays both in terms of convergence rate and attainable convergence accuracy.
First of all, there is no model distribution with a synchronous All-Reduce, which reproduces the single node computations (up to shuffling in case of deep learning batch training). The main intuition behind introducing Ring-Reduce for distributed computations is that Ring-Reduce will reproduce one's single node computations for a diagonal block matrix (diagram below, top), with a trivial model-nodes distribution which requires no intra-node communication.
We will need to improve Ring-Reduce for a non-diagonal but sparse matrix. Model distribution is still possible, but now each node will need to communicate to its neighbor. Notice that each node will have to deal with a 'delayed' part of the model computed from its neighbor on the previous iteration. This deviates from the model update scheme for a single node, when the whole model is updated at step n+1 using model at step n.
I illustrate All-reduce vs. Ring-reduce using a small MPI c-program (Github) : https://github.com/romanonly/romankazinnik_blog/blob/master/distributed/model_gs_mpi/README.md
To introduce some complexity, I have chosen a non-diagonally-dominant matrix with a non-guaranteed convergence Gauss-Seidel. As a result, we will be able to observe Ring-Reduce not being able to converge for some ranges of matrix size n (<300).
On the right there is a synchronous All-Reduce MPI matrix solver run that attains absolute up to arbitrary epsilon convergence with Gauss-Seidel iterations.
Ring-Reduce (left) is able to distribute large models (large matrices) and provides a straightforward asynchronous modification. However, an equivalent Gauss-Seidel solver may not attain an absolute arbitrary-epsilon convergence (for example one can see it for n=100), and the convergence error is by magnitude larger. Ring-Reduce requires a larger absolute amount of computations, i.e., in my example each node will make up to ten inter-node iterations between the intra-node communications.
All-Reduce vs. Ring-reduce is implemented in MPI with non-blocking synchronous MPI Reduce and point-to-point Send and Receive on each node:
To sum it up:
1. Ring-Reduce allows a straightforward optimization both for inter-node communications and asynchronous nodes parallelization.
2. While synchronous All-Reduce guarantees to reproduce single-node convergence, Ring-Reduce may not converge to the arbitrary solution precision.
3. Ring-Reduce performance greatly depends on the linear equations sparsity. For a general non-sparse model no model distribution could be attained.