
MegaBlocks: Efficient Sparse Training with Mixture-of-Experts

2023/12/14

Papers Read on AI

Topics
Trevor Gale, D. Narayanan, C. Young, M. Zaharia: The MegaBlocks system reformulates Mixture-of-Experts (MoE) computation in terms of block-sparse operations, removing the restrictions that existing frameworks place on dynamic routing and avoiding the computation and memory wasted on token dropping or padding. The system maps efficiently onto modern hardware, delivering end-to-end training speedups of up to 40% over MoEs trained with the state-of-the-art Tutel library and 2.4x over DNNs trained with the highly optimized Megatron-LM framework. MegaBlocks uses block-sparse matrix multiplication to handle the dynamic, load-imbalanced computation in MoE layers and develops high-performance GPU kernels that support all combinations of transposed and non-transposed inputs. The system also supports distributed MoE training with both data parallelism and expert model parallelism. Trevor Gale, D. Narayanan, C. Young, M. Zaharia: For avoiding token dropping, MegaBlocks' approach significantly outperforms Tutel's dynamic-capacity-factor technique. Experiments show substantial end-to-end training speedups for MoE training: compared with Tutel, the MoE-XS, MoE-Small, and MoE-Medium models train 1.38x, 2.0x, and 4.35x faster, respectively. Compared with Transformer models trained with Megatron-LM, dMoE models trained with MegaBlocks also train 1.8x to 2.4x faster. MegaBlocks' block-sparse matrix multiplication kernels perform on par with cuBLAS batched matrix multiplication, realizing 98.6% of its throughput.


Key Insights

What are the limitations of current frameworks for Mixture-of-Experts (MoE) training?

Current frameworks restrict dynamic routing in MoE layers, forcing a tradeoff between model quality and hardware efficiency. Users must choose between dropping tokens or wasting computation and memory on padding.

How does MegaBlocks address the limitations of existing MoE frameworks?

MegaBlocks reformulates MoE computation in terms of block-sparse operations and develops new GPU kernels to handle dynamic routing efficiently. It avoids dropping tokens and maps well to modern hardware.

What are the performance improvements of MegaBlocks compared to other frameworks?

MegaBlocks achieves up to 40% end-to-end training speedups over MoEs trained with the state-of-the-art Tutel library, and 2.4x speedups over DNNs trained with the highly optimized Megatron-LM framework.

Why is fine-grained sparse computation less efficient on hardware accelerators like GPUs and TPUs?

Hardware accelerators like GPUs and TPUs are optimized for dense computation, making fine-grained sparse computation less efficient due to the irregularity of sparse operations.

What is the role of block-sparse operations in MegaBlocks?

Block-sparse operations allow MegaBlocks to handle the dynamic and load-imbalanced computation in MoE layers efficiently, enabling the system to process all tokens without dropping any.

What is the significance of the block size in MegaBlocks' implementation?

The block size is chosen to ensure high arithmetic intensity and efficient use of GPU resources. A 128x128 block size was selected based on performance benchmarks, enabling high throughput on modern GPUs.

How does MegaBlocks handle dynamic routing in MoE layers?

MegaBlocks uses block-sparse matrix multiplication kernels that can handle variable numbers of tokens assigned to experts, ensuring no tokens are dropped and enabling efficient computation.

What is the impact of token dropping on model quality?

Token dropping significantly reduces model quality. For example, an MoE model avoiding token dropping achieved a 0.26 reduction in validation loss, compared to 0.15 for a model with token dropping.

What are the challenges in implementing MoE layers on TPUs?

TPUs require static tensor shapes and struggle with fine-grained operations like scatters and gathers, making it difficult to implement dynamic routing in MoE layers directly on TPUs.

How does MegaBlocks compare to Tutel in terms of memory usage?

MegaBlocks reduces memory usage compared to Tutel, which increases memory requirements due to padding. This allows MegaBlocks to use larger micro-batch sizes, improving hardware efficiency.

Chapters
Current MoE training frameworks limit dynamic routing in MoE layers due to software and hardware constraints. This forces a trade-off between model quality and hardware efficiency, requiring users to choose between dropping tokens or wasting computation on padding. The paper introduces Megablocks to address these limitations.
  • Current MoE frameworks restrict dynamic routing due to software and hardware limitations.
  • This leads to a trade-off between model quality and hardware efficiency.
  • Users must choose between dropping tokens or wasting computation on padding.

Shownotes Transcript


Papers Read on AI with Rob, keeping you up to date with the latest research. This reading is brought to you by Mars Race, stake a claim on the red planet. Available on Android and iOS.

MegaBlocks: Efficient Sparse Training with Mixture-of-Experts. Authored 2022 by Trevor Gale, D. Narayanan, C. Young, M. Zaharia. Trevor Gale (1), Deepak Narayanan (2), Cliff Young (3), Matei Zaharia (1). Abstract. We present MegaBlocks, a system for efficient Mixture-of-Experts (MoE) training on GPUs.

Our system is motivated by the limitations of current frameworks, which restrict the dynamic routing in MoE layers to satisfy the constraints of existing software and hardware. These formulations force a trade-off between model quality and hardware efficiency, as users must choose between dropping tokens from the computation or wasting computation and memory on padding.

To address these limitations, we reformulate MoE computation in terms of block-sparse operations and develop new block-sparse GPU kernels that efficiently handle the dynamism present in MoEs. Our approach never drops tokens and maps efficiently to modern hardware, enabling end-to-end training speedups of up to 40% over MoEs trained with the state-of-the-art Tutel library and 2.4x over DNNs trained with the highly optimized Megatron-LM framework.

1. Introduction. Exploiting sparsity in the weights, activations, and input data of deep neural networks (DNNs) is an effective technique for reducing the amount of computation that is needed to achieve a given model quality (Han et al., 2015; Gale et al., 2019).

The past decade has seen significant progress in algorithms and high-performance software to make sparsity practically useful (Gray et al., 2017; Narang et al., 2017; Kalchbrenner et al., 2018; Elsen et al., 2020; Gale et al., 2020). One area that remains a challenge for sparsity is model training on accelerators.

DNNs are most commonly trained on hardware accelerators like GPUs (NVIDIA, 2020) and TPUs (Jouppi et al., 2017), which exploit the regularity of dense computation to deliver high performance. Consequently, fine-grained sparse computation is less efficient on these processors.

To enable efficient computation on accelerators, structure can be enforced on the sparse matrices (Narang et al., 2017; Gray et al., 2017; Yao et al., 2019). An emerging class of models with underlying structured sparsity is Mixture-of-Experts (MoEs) (Shazeer et al., 2017). Each layer in an MoE is a collection of experts, which are themselves small DNNs.

As data is passed through the MoE layers, each token is dynamically routed to a subset of the experts for computation. By exploiting this sparse computation, MoEs have reduced training times by as much as 4x for applications in natural language processing and computer vision (Artetxe et al., 2021; Riquelme et al., 2021).

Affiliations: (1) Stanford University, Stanford, California, USA; (2) Microsoft Research, Redmond, Washington, USA; (3) Google Research, Mountain View, California, USA. Correspondence to: Trevor Gale <tgale@cs.stanford.edu>.

These gains have translated to new levels of scale for model training, pushing model sizes past 1 trillion parameters (Artetxe et al., 2021; Du et al., 2021; Fedus et al., 2022). The challenge in computing MoEs efficiently is handling the dynamic routing and load-imbalanced computation that are fundamental to these architectures. However, existing hardware and software for deep learning make it difficult to meet this challenge.

For example, TPUs and their XLA compiler require all tensor shapes to be known statically and often struggle with fine-grained operations like scatters and gathers. These constraints make it difficult to implement MoEs directly on TPUs. While GPUs are more flexible, the sparse computation in MoEs does not map cleanly to the software primitives supported in major frameworks and libraries.

State-of-the-art frameworks for MoE training sidestep these challenges by placing rigid constraints on MoE routing. In order to remove the dynamism from the computation, the set of tokens mapped to each expert is trimmed or padded to a user-specified size (Lepikhin et al., 2020; Fedus et al., 2022; Hwang et al., 2022).

This Procrustean formulation introduces a trade-off between model quality and hardware efficiency, as users must decide whether to drop tokens or waste computation and memory on padding. This decision is often made through hyperparameter tuning, which increases the complexity of using MoEs. To address these challenges, we develop an approach for MoE routing and computation based on sparse primitives.

Our approach never drops tokens and maps efficiently to modern GPUs, enabling end-to-end training speedups of up to 40% and 2.4x over state-of-the-art frameworks for MoE and DNN training, respectively. We make the following specific contributions: We show how the computation in an MoE layer can be expressed as block-sparse operations to accommodate the imbalanced assignment of tokens to experts. We use this formulation to train dropless MoEs (dMoEs).

We develop high-performance GPU kernels for block-sparse matrix products that efficiently handle dynamic MoE computation. Our kernels use two techniques, blocked CSR-COO encoding and transpose indices, to enable efficient matrix products with sparse inputs and outputs in transposed or non-transposed order. We have implemented these techniques in a system called MegaBlocks, which builds on the state-of-the-art Megatron-LM library for training Transformer models (Shoeybi et al., 2019).

We evaluate our system through both micro-benchmarks and end-to-end training of Transformer language models. 2. Background: MoE Layers. MoE layers are made up of many experts, which are themselves small neural networks. Each token is dynamically routed to a subset of the experts for computation based on scores computed by a router. The experts are commonly defined to be small multi-layer perceptrons (MLPs).

It is typical for tokens to be sent to a small number of experts, often between 1 and 4 (Fedus et al., 2022). MoE layers are often interleaved with other DNN layers and are most commonly used to replace the feed-forward network (FFN) layers in Transformers (Shazeer et al., 2017; Fedus et al., 2022). (For natural language the data is commonly called tokens; for vision, the data is typically pixels or patches (Dosovitskiy et al., 2021). For simplicity, we use the term token throughout this paper.) This hybrid architecture has demonstrated strong results on both natural language and vision tasks (Du et al., 2021; Riquelme et al., 2021). It is conjectured that these improvements are a result of experts specializing to different parts of the data distribution (Shazeer et al., 2017).

We illustrate an MoE layer in Figure 1 and describe it in detail in the remainder of this section. 2.1 Routing. The first stage of an MoE layer is the router, which is responsible for determining the assignment of tokens to experts. In addition to expert assignments, MoE routers also produce probabilities for each assignment that reflect the confidence of the mapping.

These weights are encoded as a matrix of scores for each token-expert pair, which are used to linearly combine the top-k expert outputs for each token. The most common style of MoE routing is the learned router proposed by Shazeer et al. (2017). In this router, the tokens are projected from hidden_size elements to num_experts scores by multiplying with a weight matrix that is learned jointly with the other model parameters.

The scores are normalized with a softmax and the routing decisions are made by greedily selecting the top-k scoring experts for each token.
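As a concrete illustration of this learned routing step, here is a minimal PyTorch sketch of a top-k router following the description above (a learned projection, a softmax, then greedy top-k selection). The class and variable names are illustrative, not the MegaBlocks implementation.

```python
import torch
import torch.nn as nn


class TopKRouter(nn.Module):
    def __init__(self, hidden_size: int, num_experts: int, top_k: int = 1):
        super().__init__()
        # The router projection is learned jointly with the rest of the model.
        self.w_router = nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, tokens: torch.Tensor):
        # tokens: [num_tokens, hidden_size]
        scores = self.w_router(tokens).softmax(dim=-1)    # [num_tokens, num_experts]
        probs, expert_ids = scores.topk(self.top_k, dim=-1)
        # probs later weight the expert outputs; expert_ids drive the permutation.
        return probs, expert_ids


router = TopKRouter(hidden_size=512, num_experts=64, top_k=1)
probs, expert_ids = router(torch.randn(8, 512))
```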

2.2 Permutation. State-of-the-art MoE implementations aim to compute all expert layers in parallel in order to make effective use of the parallelism available on GPUs and TPUs (Lepikhin et al., 2020; Fedus et al., 2022; Hwang et al., 2022). The standard primitive used by implementations is batched matrix multiplication, which computes a set of matrix products of the same shape. However, mapping MoE computation to this primitive is non-trivial. In order to respect the shape constraints of batched matrix multiplication, the experts must be constrained to have weight matrices of the same shape and the number of tokens assigned to each expert must be equal.

The latter constraint is particularly problematic because the learned routing algorithm described above provides no guarantees of a load-balanced assignment of tokens to experts. In order to satisfy this constraint, prior work has defined a fixed expert capacity, which is the number of tokens that each expert can be assigned (Lepikhin et al., 2020; Fedus et al., 2022). If the number of tokens assigned to an expert exceeds its capacity, the extra tokens are dropped.

That is to say, they are not passed to any expert for computation, and the model relies on a residual connection to reintroduce the dropped tokens' representations after the MoE layer. If an expert layer is not assigned enough tokens to fill its capacity, its set of tokens is padded to fill the remaining space.

Expert capacity is typically specified in terms of a capacity factor hyperparameter, which is a multiplier on the expected number of tokens that would be assigned to each expert under a perfectly uniform distribution: expert_capacity = capacity_factor * num_tokens / num_experts. The capacity factor can be thought of as a parameter that reduces the chance of dropping a token.
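A small worked example of this formula, with illustrative numbers:

```python
# Worked example of the expert-capacity formula above; the numbers are illustrative.
num_tokens = 8192
num_experts = 64
capacity_factor = 1.5

expert_capacity = int(capacity_factor * num_tokens / num_experts)   # 192
# Each expert is assigned at most 192 tokens; assignments beyond that are
# dropped, and experts that receive fewer tokens are padded up to 192.
```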

This hyperparameter represents a trade-off between additional computation and model quality. As such, it is desirable to minimize the amount of load imbalance in the assignment of tokens to experts. The typical mechanism for doing so is an auxiliary load-balancing loss. In addition to enabling batched computation of the expert layers, these constraints allow all tensor shapes to be known statically, which is required by TPUs and XLA. 2.3 Computation. Once the data has been permuted, the experts can be computed in parallel.

For models where the experts are MLPs, this entails computing each layer for all experts using batched matrix multiplication. For convolutional experts, the layers can be computed with grouped convolutions. 2.4. Unpermutation. After the experts are computed, the resulting feature vectors are unpermuted such that their ordering matches that of the input to the layer.

The last step in MoE computation is to scale the output tokens by the scores with which they were assigned to their respective experts. When tokens are routed to more than one expert, these weighted results are summed to produce the final layer output for each token.
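The sketch below illustrates this scaling and combination step, with dense tensors standing in for the un-permuted expert outputs; all shapes and names are illustrative.

```python
import torch

num_tokens, hidden, top_k = 8, 256, 2
expert_outputs = torch.randn(num_tokens, top_k, hidden)  # un-permuted expert outputs
probs = torch.rand(num_tokens, top_k).softmax(dim=-1)    # router scores per assignment

# Scale each expert output by its routing score and sum over the top-k experts.
layer_output = (probs.unsqueeze(-1) * expert_outputs).sum(dim=1)  # [num_tokens, hidden]
```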

3. Motivation: Token Dropping in MoEs. Despite the use of load-balancing losses, prior work has shown that token routing is still highly imbalanced (Hwang et al., 2022). To quantify the effect of token dropping on model quality, we trained MoE language models on The Pile (Gao et al., 2020) with a range of capacity factors. We train Transformer MoEs similar to those used by Fedus et al. (2022), where each model is a Transformer with the FFN layers replaced with 64-expert MoE layers, where each expert is a two-layer MLP matching the original FFN dimensions.

We used top-1 routing and based our MoE model dimensions on the Transformer-Small model described in Table 1. All models were trained using the tokenization from GPT-2 (Radford et al., 2019) for 10B tokens with sequence length 1024, the Adam optimizer, and the learning rate and gradient clipping settings from Shoeybi et al. (2019).

We trained all models on a single A100 GPU with a batch size of 512 sequences. We trained MoEs with capacity factors of 1, 1.5, and 2, as well as the dynamic capacity factor technique proposed in Tutel (Hwang et al., 2022), where the capacity factor is set dynamically to the minimum value that would avoid token dropping. As a baseline, we trained standard Transformer models across a range of sizes.

All Transformer and MoE models have a vocabulary size of 51,200, sequence length 1024, and an attention head size of 64. Our model configurations are summarized in Table 1 and the results of the experiments are shown in Figure 2. For these models, we observed that the impact of token dropping is significant.

While the MoE with a capacity factor of 1 achieved a 0.15 reduction in validation loss, the MoE that avoided dropping tokens provided a reduction of 0.26, 1.73x larger than the gain of the former model and enough to exceed the quality of Transformer-Medium. While dropping tokens reduces model quality, increasing the capacity factor comes at the cost of additional computation and memory. In this example, MoE layer math operations increased by over 2x in order to avoid dropping tokens.

Hwang et al. (2022) showed that some MoEs require capacity factors as high as 11 in order to avoid dropping tokens, and found other models where the capacity factor necessary to avoid dropping tokens spiked unpredictably during training. In addition to the computational overhead of increasing the capacity factor, having to tune an additional hyperparameter can significantly increase the number of models that need to be trained for a target task.

This is particularly cumbersome for large neural networks, where the cost to train a single model can run into the hundreds of thousands of dollars (MosaicML, 2022). Possibly as a result of this, some large studies on MoEs have declined to explore different capacity factors at all (Artetxe et al., 2021; Clark et al., 2022). 4. No Token Left Behind with Block Sparsity.

This section describes how we formulate MoE layer computation in terms of block-sparse computation in order to avoid dropping tokens. The motivation for using block-sparse primitives to express MoE computation is manifold. First, as we show below, block-sparse matrices are a natural and flexible way of describing the dynamic and load-imbalanced computation in MoEs. Second, block sparsity maps efficiently to hardware accelerators built around systolic array matrix multipliers like GPUs and TPUs.

Because of the coarse granularity of MoE experts, we can select a block size for our implementation that is large enough to enable the computation to realize high fractions of peak device throughput. Lastly, block-sparse kernels like matrix multiplication and convolution are general-purpose primitives that are useful across a range of applications (Narang et al., 2017; Gray et al., 2017; Child et al., 2019; Elsen et al., 2020).

This makes investment in high-performance kernels more practical, as work can be amortized across target tasks. We could similarly invest in variable-sized batched matrix multiplication kernels, but the utility of this would be limited to MoE architectures as they are designed today. In addition to these considerations, the block-sparse formulation of MoEs exposes a new perspective on these algorithms as a form of dynamic, structured, activation sparsity.

(The name "No Token Left Behind" references the technique briefly discussed by Fedus et al. (2022), which was an unsuccessful attempt to regain the quality lost from dropping tokens.) This perspective draws parallels to much of the literature on sparse training algorithms and opens up the opportunity to further improve MoEs with insights from this adjacent field.

Preliminaries: Sparse Matrix Product Notation. In the remainder of this paper, we often refer to matrix multiplication where one of the three matrices (the two inputs and one output) is sparse and the others are dense. We borrow the notation from Triton (Tillett et al., 2019) to describe these different operations. Each operation is described with a three-character string where each character is either "S" for sparse or "D" for dense. The order of characters is output, followed by the left input, followed by the right input.

For example, the product of two dense matrices with a sparse output is SDD, which is also referred to as sampled dense-dense matrix multiplication (SDDMM). This notation is useful to distinguish operations like DSD and DDS, which are different forms of sparse matrix-dense matrix multiplication (SpMM). A superscript T indicates transposition of the input argument. For example, SDD^T indicates an SDD where the right-hand input matrix is transposed.
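To make the notation concrete, the following PyTorch sketch mimics each operation with dense tensors and elementwise masks standing in for the sparse operands. It only illustrates which operand is sparse in each product; it is not how the block-sparse kernels are implemented.

```python
import torch

m, k, n = 256, 128, 256
A, B = torch.randn(m, k), torch.randn(k, n)
out_mask = (torch.rand(m, n) > 0.5).float()   # sparsity pattern of an output
lhs_mask = (torch.rand(m, k) > 0.5).float()   # sparsity pattern of a left input

sdd   = (A @ B) * out_mask                      # SDD: dense inputs, sparse output (SDDMM)
dsd   = (A * lhs_mask) @ B                      # DSD: sparse left input, dense output (SpMM)
sdd_t = (A @ torch.randn(n, k).t()) * out_mask  # SDD^T: right-hand input used in transposed form
```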

4.1 Expert Computation with Block Sparsity. The key insight behind our method is shown in Figure 3. Rather than the prevailing approach of computing the experts within an MoE layer using batched matrix multiplication, we could equivalently compute the experts with an SDD where the output sparse matrix has block-diagonal structure, as shown in Figure 3B. In this formulation, allowing for a load-imbalanced assignment of tokens to experts is analogous to allowing for the blocks in the block-diagonal matrix to have a variable number of rows.

To achieve this, we propose to compute each block as many smaller fixed-size blocks using block-sparse matrix multiplication, as shown in Figure 3C. To construct multi-layer experts, we can iterate between SDD and DSD operations (see Figure 6).
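The equivalence between per-expert matrix products and this block-diagonal formulation can be seen in a small dense-math sketch. The einsum calls below stand in for the SDD and DSD block-sparse products; all names and sizes are illustrative, and this is not the MegaBlocks kernel code.

```python
import torch

hidden, ffn_hidden, num_experts = 256, 512, 4
tokens_per_expert = torch.tensor([3, 0, 5, 2])          # load-imbalanced routing
x = torch.randn(int(tokens_per_expert.sum()), hidden)   # tokens grouped by expert
W1 = torch.randn(num_experts, hidden, ffn_hidden)
W2 = torch.randn(num_experts, ffn_hidden, hidden)

# The expert id of each grouped token defines the block-diagonal sparsity pattern.
expert_of_token = torch.repeat_interleave(torch.arange(num_experts), tokens_per_expert)

# "SDD": for each token, only the block column of its own expert is computed.
h = torch.relu(torch.einsum("th,thf->tf", x, W1[expert_of_token]))
# "DSD": each token's hidden block is multiplied by its expert's second weight matrix.
y = torch.einsum("tf,tfh->th", h, W2[expert_of_token])
```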

In this formulation, we could also relax the constraint on the number of columns in each block to build MoE layers with variable-sized experts, as is shown in Figure 3C. While this is an interesting direction for future work, we did not explore these configurations, as more research is needed to identify how this capability can be used to increase efficiency.

With sufficiently large blocks, block-sparse matrix multiplication is capable of reaching high fractions of peak throughput on modern GPUs (Gray et al., 2017; NVIDIA, 2021).

The coarse-grained sparsity in MoEs lends itself to this requirement. In Transformer models using MoE FFN layers, the number of columns in the blocks shown in Figure 3B corresponds to the FFN hidden size, which is commonly between 1024 and 8192 (Vaswani et al., 2017; Radford et al., 2019; Brown et al., 2020).

The number of rows in these blocks corresponds to the number of tokens assigned to each expert, which is expected to be equal to the number of tokens divided by the number of experts under a uniform distribution. This can range from a few thousand to tens of thousands of tokens per expert (Lepikhin et al., 2020; Artetxe et al., 2021; Fedus et al., 2022).

These coarse-grained blocks are many times larger than the largest tile dimensions used for dense matrix multiplication kernels, which gives us the flexibility to select a block size that can match their throughput. 5. MegaBlocks: A Framework for Efficient MoE Training. We implemented our techniques in a system called MegaBlocks, which builds on Megatron-LM (Shoeybi et al., 2019) and PyTorch (Paszke et al., 2019).

In addition to high-performance dropless MoE layers, our system supports distributed training of MoEs with both data and expert model parallelism. This section discusses the design of our dMoE implementation, including our block-sparse kernels, and other considerations for building an efficient system. Section 5.1.1 discusses the limitations of existing block-sparse kernels.

Section 5.1.2 analyzes the effects of the block size on block sparse product performance. Section 5.1.3 describes our hybrid-blocked CSR-COO sparse matrix format, which enables efficient matrix products with sparse input and output operands. Section 5.1.4 introduces transpose indices as a mechanism for efficient iteration over block sparse matrices in transposed order.

Lastly, Section 5.2 discusses efficient routing and permutation for dMoEs. Preliminaries: Matrix Multiplication on GPUs. Matrix multiplication kernels on GPUs exploit tiling, where the output matrix is broken up into statically sized two-dimensional blocks of values (NVIDIA, 2022c). The computation of these tiles can be parallelized, and the individual tiles can be sized to trade off arithmetic intensity and parallelism.

The group of threads assigned to a tile is called a threadblock. 5.1 Efficient Block-Sparse Kernels for MoEs. To train MoEs with block-sparse kernels, we need primitives for the forward and backward passes. Consider an MoE FFN layer where each expert is a two-layer MLP. For this configuration, the forward pass requires an SDD operation followed by a DSD (Figure 6).

For the backward pass, we compute SDD^T and DS^T D for the second-layer data gradient and weight gradient, respectively, followed by DSD^T and DD^T S for the first-layer data gradient and weight gradient, respectively.
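For intuition, the sketch below traces these products through a dense two-layer expert MLP; the comments note which flavor of sparse product each line would become when the intermediate activation matrix is block sparse. Everything here is a dense stand-in with made-up sizes, not the MegaBlocks kernels.

```python
import torch

t, h, f = 512, 256, 1024      # tokens, hidden size, FFN hidden size
x = torch.randn(t, h)
W1 = torch.randn(h, f)
W2 = torch.randn(f, h)

a = x @ W1                    # forward, layer 1: SDD ("a" is block diagonal in the real setup)
y = a @ W2                    # forward, layer 2: DSD

dy = torch.randn_like(y)      # incoming gradient
da = dy @ W2.t()              # layer-2 data gradient:   SDD^T  (right input transposed)
dW2 = a.t() @ dy              # layer-2 weight gradient: DS^T D (sparse left input transposed)
dx = da @ W1.t()              # layer-1 data gradient:   DSD^T
dW1 = x.t() @ da              # layer-1 weight gradient: DD^T S (sparse right input)
```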

5.1.1 Existing Block-Sparse Primitives. We considered two existing libraries for block-sparse matrix multiplication on GPUs: NVIDIA cuSPARSE (NVIDIA, 2022b) and Triton blocksparse (Tillett et al., 2019). cuSPARSE supports the Blocked-ELL sparse matrix format for DSD. However, as of CUDA 11.8 this operation does not support transposition of the sparse matrix input. cuSPARSE also provides no SDD primitive with a Blocked-ELL matrix. In addition to these limitations, the Blocked-ELL format requires that all rows in the sparse matrix have the same number of nonzeros, which would defeat our goal of supporting load-imbalanced matrices.

Triton blocksparse supports SDD, DSD, and DDS, as well as all combinations of transposed and non-transposed inputs. However, these primitives assume that the topology of the sparse matrices does not change between invocations (likely because they were written for applications like sparse attention, where the sparse matrix topology is determined prior to training; Child et al., 2019). The library API takes a bitmask describing the sparse operand and then pre-computes lookup tables and block groupings to accelerate computation. For our use case, the sparse matrix topology varies across every iteration of training and every MoE layer in the model.

In order to use blocksparse, we would have to pay the cost of these preprocessing steps repeatedly. Based on this analysis, we opted to write our own block-sparse primitives in order to tailor them to the dynamism of MoE expert computation. We implemented SDD, DSD, and DDS operations targeting NVIDIA GPUs. Our kernels support all combinations of transposed and non-transposed inputs. The remainder of this section details the design and implementation of our kernels.

5.1.2 Selecting Block Size for MoEs. In order to efficiently use modern GPUs, we want to use sparse blocks that have sufficient arithmetic intensity to keep matrix multiplication units busy. Large blocks are also desirable to amortize the cost of storing and operating on sparse matrix metadata, since metadata like column indices only needs to be kept for each block of non-zeros. To select our target block size, we studied the performance of dense matrix multiplication kernels from NVIDIA CUTLASS (NVIDIA, 2022c) with different tile dimensions.

We benchmarked mixed-precision (FP16 with FP32 accumulation) matrix multiplication on square matrices with power-of-two side lengths from 512 to 16384 and every set of tile dimensions supported in CUTLASS. For rectangular tiles, we show only the configurations where the first tile dimension is larger, as we found these to slightly outperform the alternative ordering for these problems.

We ran all benchmarks on an A100 SXM4 80GB GPU with CUDA 11.5 and CUTLASS 2.5. These benchmarks are shown in Figure 4. Across these benchmarks, we observed that 128x128 tiles consistently perform on par with or better than other configurations. Anecdotally, we observe that this same configuration is commonly selected by NVIDIA cuBLAS (NVIDIA, 2022a) for the dense Transformer models we studied.

Based on this analysis, we opted to use 128x128 block sparsity. While the tile dimensions of a block sparse matrix multiplication and the block size in the sparse matrix do not need to be equal, we found that for 128x128 blocks the highest performing tile dimensions in our workloads were also 128x128.

To implement our kernels, we extended CUTLASS (NVIDIA, 2022c) to support block-sparse matrices and reused its machinery for high-performance matrix multiplication with different data types and GPU architectures. 5.1.3 Computing Sparse Outputs with Hybrid Blocked-CSR-COO. We use blocked compressed sparse row (BCSR) as our primary sparse matrix format.

BCSR makes it simple to iterate across the non-zeros in a row, which is necessary for operations like DSD and DDS^T. Iterating over blocks also has minimal overhead with BCSR, as identifying a block's position in the matrix only requires a single load of its column index. We discuss our approach for efficiently iterating across the non-zeros in a column with this format in Section 5.1.4. One challenge with BCSR sparse matrices is efficiently computing SDD operations in parallel.

On kernel launch, each thread block needs to identify the row and column of its output blocks so that it knows which rows and columns of the input matrices are needed to compute it. Because BCSR only encodes column indices for each block, identifying the row index of a non-zero block requires a search through the row offsets. One solution to this problem is to launch the maximum number of thread blocks that could be needed to compute each row of the output if it were fully dense.

On startup, each threadblock can check whether its column offset is out of range for the number of non-zeros in its row and return if there is no work to do. Gale et al. (2020) showed that the overhead introduced by launching extra thread blocks was negligible for moderately sparse matrices (50-90% zeros). We experimented with this approach but observed that for MoEs the cost of launching these unused thread blocks was significant, particularly for models with high expert counts, where the level of sparsity in the block-sparse matrices is very high.

To efficiently parallelize SDD, we additionally materialize the row indices for each nonzero block so that thread blocks can trivially look up the coordinates of sparse blocks in the output matrix. The storage required for this additional metadata is negligible since we only need to store one index per 16,384 nonzero values in a 128x128 block.

Even with this additional metadata, we maintain the row-wise ordering of non-zero blocks, so the matrix can be operated on as either BCSR or blocked coordinate format (BCOO). We illustrate this hybrid blocked-CSR-COO encoding in Figure 5.
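As a rough picture of what this metadata looks like, the sketch below builds BCSR row offsets and column indices for a small block-diagonal pattern and then materializes the per-block row indices (the COO part). The field names and sizes are illustrative, not the internal MegaBlocks layout.

```python
import torch

block = 128
ffn_hidden = 1024
cols_per_expert = ffn_hidden // block            # nonzero block columns per block row
blocks_per_expert = torch.tensor([2, 0, 3, 1])   # padded token blocks owned by each expert
num_experts = blocks_per_expert.numel()

# Expert that owns each block row of the block-diagonal sparse matrix.
expert_of_block_row = torch.repeat_interleave(torch.arange(num_experts), blocks_per_expert)
num_block_rows = int(expert_of_block_row.numel())

# BCSR part: per-row offsets into the nonzero-block array, plus the column
# index of every nonzero block stored in row-major order.
nnz_per_row = torch.full((num_block_rows,), cols_per_expert)
row_offsets = torch.cat([torch.zeros(1, dtype=torch.long), nnz_per_row.cumsum(0)])
column_indices = (expert_of_block_row[:, None] * cols_per_expert
                  + torch.arange(cols_per_expert)[None, :]).flatten()

# COO addition: materialize the row index of every nonzero block so an SDD
# threadblock can look up its output coordinates directly instead of searching
# row_offsets. One extra integer per 128x128 = 16,384 values is negligible.
row_indices = torch.repeat_interleave(torch.arange(num_block_rows), nnz_per_row)
```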

5.1.4 Block-Sparse Transposition with Transpose Indices. Computing forward and backward passes for model training requires sparse matrix transposition. However, iterating over BCSR matrices in transposed order requires searching through each row to identify whether the block in the target column is non-zero (Buluç et al., 2009). We could materialize a transposed version of the sparse matrix explicitly, but this would incur runtime and storage costs, as all of the non-zero values in the matrix would need to be copied.

To enable efficient iteration over BCSR matrices in transposed order, we construct the metadata for the transposed matrix but do not explicitly transpose the non-zero values. Instead, we construct an array of indices, one for each non-zero block, which are stored in transposed order and contain the offset of each non-zero block in memory. This additional metadata allows efficient iteration through the matrix in transposed order with a layer of indirection, as shown in figure 5.

This idea is similar to a secondary index in a database, which allows efficient access to entries in a different order than the primary index. Similar to our hybrid blocked-CSR-COO encoding, this technique relies on the fact that storage and computation are many times cheaper for metadata than they are for non-zero values, thanks to our large block sizes.
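A minimal sketch of constructing such transpose indices, reusing the illustrative BCSR arrays from the previous sketch: sorting block positions by (column, row) gives the order a transposed traversal needs, without moving any non-zero values.

```python
import torch

# Column and row index of each nonzero block, stored in row-major (BCSR) order;
# the values are illustrative.
column_indices = torch.tensor([0, 1, 0, 1, 2, 3, 2, 3])
row_indices    = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])

# A nonzero block's offset in memory is simply its position in the BCSR
# ordering, so sorting those positions by (column, row) yields one index per
# block that lets a kernel walk the same storage in transposed order, a layer
# of indirection much like a secondary index in a database.
num_block_rows = int(row_indices.max()) + 1
transpose_indices = torch.argsort(column_indices * num_block_rows + row_indices)
```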

5.2 Efficient Routing and Permutation. As currently implemented, our block-sparse matrix multiplication kernels require the number of tokens assigned to each expert to be a multiple of the block size. In order to respect this constraint, we pad each group of tokens with zeros to the nearest multiple of 128 and fuse this operation into custom permutation kernels.
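A small sketch of this padding arithmetic with illustrative token counts; in MegaBlocks the padding is fused into the permutation kernels rather than done as separate tensor ops.

```python
import torch

block = 128
tokens_per_expert = torch.tensor([300, 0, 517, 131])        # from the router
padded_per_expert = ((tokens_per_expert + block - 1) // block) * block
# padded_per_expert == [384, 0, 640, 256]

# Offsets of each expert's padded token group in the permuted buffer; the gap
# between an expert's real tokens and its padded size is filled with zeros.
offsets = torch.cat([torch.zeros(1, dtype=torch.long), padded_per_expert.cumsum(0)])
```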

We could remove this constraint by supporting partial blocks at the fringes of the problem similar to how matrix multiplication handles matrices that are not divisible by the tile dimensions. However, the performance impact of this feature would be minimal given we expect the number of tokens assigned to each expert to be thousands or tens of thousands.

Table 3. Micro-batch sizes used for model training. We used the largest micro-batch size that fit in memory for all experiments.
Megatron-LM: Transformer-XS 64, Transformer-Small 32, Transformer-Medium 16, Transformer-Large 16, Transformer-XL 8.
MegaBlocks: dMoE-XS 64, dMoE-Small 32, dMoE-Medium 8.
Tutel: dMoE-XS 32, dMoE-Small 8, dMoE-Medium 1.

Once the expert assignments have been computed by the router, we create the metadata for the block-sparse matrix using a custom CUDA kernel. We additionally construct the transposed metadata at this time to amortize the cost over the multiple block-sparse matrix multiplications that use it across forward and backward computation.
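The sketch below mimics this step at the Python level: a helper with an illustrative name turns the router's expert assignments into the per-expert block counts from which the sparse-matrix metadata (including the transposed metadata) would be derived, once per MoE layer.

```python
import torch

def build_sparse_topology(expert_ids: torch.Tensor, num_experts: int, block: int = 128):
    """Illustrative stand-in for the custom CUDA kernel described above."""
    # Histogram of routed tokens, then padded block counts per expert.
    tokens_per_expert = torch.bincount(expert_ids, minlength=num_experts)
    blocks_per_expert = (tokens_per_expert + block - 1) // block
    # The real kernel would emit row offsets, column indices, the COO row
    # indices, and the transpose indices here, all derived from these counts.
    return blocks_per_expert

topology = build_sparse_topology(torch.randint(0, 64, (4096,)), num_experts=64)
# The topology is built once per MoE layer and reused by every block-sparse
# product in that layer's forward and backward passes.
```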

6. Experiments. This section analyzes the performance of our system compared to the state-of-the-art libraries Microsoft Tutel (Hwang et al., 2022) and NVIDIA Megatron-LM (Shoeybi et al., 2019) for training Transformer MoEs and standard Transformers, respectively. In order to ensure fair comparisons, we extended Megatron-LM to additionally support MoE training using Tutel's MoE layer.

All experiments were conducted on NVIDIA A100 SXM4 80GB GPUs with CUDA 11.5 and CUTLASS 2.5, and used mixed-precision training (Micikevicius et al., 2018) as implemented in Megatron-LM.

6.1 MoE Training Without Dropping Tokens. To assess the efficiency of our technique for avoiding token dropping, we compared to the dMoE method proposed by Hwang et al. (2022), where the capacity factor is set dynamically to the minimum value that avoids token dropping. We trained decoder-only Transformer language models on The Pile (Gao et al., 2020) with the same hyperparameters described in Section 3.

For Transformer MoEs, we trained models scaled from our XS, Small, and Medium models, with each FFN layer replaced with a 64-expert MoE layer using top-1 routing.

Figure 7. MegaBlocks dMoEs (E64, top-1), Tutel dMoEs (E64, top-1), and Megatron-LM Transformers trained on The Pile, plotted as validation loss versus training time in days.

MegaBlocks uses block-sparse operations to handle the dynamic and load-imbalanced computation in MoEs, which enables 1.38x, 2.0x, and 4.35x end-to-end training speedups for MoE-XS, MoE-Small, and MoE-Medium, respectively, compared to the padding-based approach used by Tutel. The advantage of our approach increases with the size of the model, as the memory requirements of padding expert batches force Tutel to use smaller micro-batch sizes, which decreases hardware efficiency.

Compared to dense Transformer language models, MegaBlocks achieves 1.8x to 2.4x end-to-end training speedups for the same validation loss across these models. We also trained standard Transformer models from 46M to 1.3B parameters, equivalent to Transformer-Base (Vaswani et al., 2017) up to GPT-3 XL (Brown et al., 2020), as a dense baseline.

We trained all models on 8 A100 SXM4 80GB GPUs using 8-way expert model parallelism for MoE layers and data parallelism for all other layers. We use gradient accumulation for all models and train with a batch size of 512 sequences and the largest micro-batch size that does not run out of memory (Narayanan et al., 2021a).

Our model configurations are summarized in Tables 1 and 2. For each model, we report the end-to-end training time and final loss achieved on a validation set in Figure 7. Compared to the prevalent padding-based approach for avoiding token dropping, our technique for adaptive MoE computation with block sparsity enables end-to-end training speedups of 1.38x, 2.0x, and 4.35x for MoE-XS, MoE-Small, and MoE-Medium, respectively.

In addition to computational overhead, the padding-based approach implemented in Tutel significantly increases the amount of memory required to store activations in the MoE layers. This is particularly problematic because MoEs already require many times more storage for their large weight matrices compared to standard Transformers.

For these models, we observed that this increase in memory usage reduced the maximum micro-batch size that Tutel could use by 2x, 4x, and 8x compared to MegaBlocks for MoE-XS, MoE-Small, and MoE-Medium, respectively. This in turn increases training time because of reduced hardware efficiency. As a result, we observe that the advantage of MegaBlocks over Tutel grows with model size. The micro-batch sizes used for each model configuration are shown in Table 3.

Compared to Transformer models trained with Megatron-LM, dMoEs trained with MegaBlocks reduce the training time required to reach a given validation loss by 1.8x to 2.4x. The variation in this comparison is primarily a result of the increased weight memory usage of MoE models, which forced MegaBlocks to use a 2x smaller micro-batch size for MoE-Medium than the analogous Transformer model. These results highlight the importance of reducing memory usage in MoEs as a direction for future research.

For these Transformer models, we observed that Megatron-LM sustains between 21% and 48% of the 2.5-petaFLOP peak throughput of this 8-GPU system, with efficiency increasing with model size. The speedups achieved by MegaBlocks over this state-of-the-art framework demonstrate the efficiency of our system and the efficacy of MoEs. 6.2 MoE Training with Token Dropping. We additionally compare our dMoE models to token-dropping MoEs trained with Tutel.

In order to find the most efficient configurations, we trained MoE-XS, MoE-Small, and MoE-Medium models with capacity factors of 1x, 1.5x, and 2x, for a total of 9 additional models. For these configurations, all token-dropping MoE models were able to use the same micro-batch size as the analogous dMoE without running out of GPU memory. We report the end-to-end training time and validation loss for these models, along with our dMoE and standard Transformer results, in Figure 8.

Comparing MoEs and dMoEs for the same... 6.3 Block-Sparse Matrix Multiplication Performance. To assess the quality of our block-sparse matrix multiplication kernels, we benchmarked the problem configurations used in training the MoE-XS, MoE-Small, and MoE-Medium models and compared to cuBLAS batched matrix multiplication. This includes the forward pass, backward weights, and backward data operations for the two layers in each FFN layer.

In total, we benchmark 18 problems: 6 problems for each of the 3 models. To allow for comparison with batched matrix multiplication, we benchmark each problem with a uniform distribution of tokens to experts and the same micro-batch sizes listed in Table 3. These benchmarks can be viewed as an ablation assessing the overhead that would be introduced if one were to use our block-sparse kernels to implement a standard, token-dropping MoE.

For each problem, we averaged throughput over 100 executions. We do not include the time taken to construct the sparse matrix metadata in these benchmarks, as these operations amortize over all six problems within an FFN layer. The results of these benchmarks are shown in Figure 9. On these problems, we observe that our block-sparse kernels are able to realize 98.6% of the throughput of cuBLAS, with a standard deviation of 4%. The maximum relative throughput was 104% and the minimum was 91%.

Overall, our kernels slightly outperformed cuBLAS on half of the problems and slightly underperformed on the other half. While benchmarking CUTLASS, we observed that altering the order in which tiles of the output matrix are computed can change the throughput of the operation by as much as 10% due to L2 caching effects. We believe that most of the performance discrepancy in these results can be attributed to the reordering of computation that occurs with block-sparse matrices, although further investigation is needed.

One case where we note additional overhead is in the DS^T D operations used to compute weight gradients. Because we use a secondary index to iterate over the sparse operand in transposed order, the access patterns when iterating through this matrix exhibit little spatial locality, which in turn reduces the throughput of the overall operation.

While this is an interesting problem for further study, the overall impact on model performance is minimal because of the limited opportunity for improvement (less than 10%), combined with the relatively small amount of end-to-end runtime that these two operations represent.

7. Related Work. MoE Routing. Improved routing algorithms for MoEs are an active area of research. BASE layers formulate MoE routing as a linear assignment problem, trying to maximize the aggregate token-expert affinities under the constraint of a perfectly balanced assignment (Lewis et al., 2021). This method guarantees no tokens are dropped by rerouting tokens to different experts as needed.

Clark et al. (2022) found that BASE layers can incur significant runtime overhead and proposed an approximate version using the Sinkhorn algorithm. Because their approximation is no longer guaranteed to avoid token dropping, Clark et al. (2022) use a capacity factor of 2 for all experiments. Other techniques have been proposed to statically decide token-to-expert mappings ahead of time based on hash functions (Roller et al., 2021).

However, Clark et al. observed that this approach did not perform as well as the other routing algorithms they studied. More recently, Zhou et al. proposed to reverse the routing problem such that each expert selects its top-k scoring tokens. While this guarantees a load-balanced assignment of tokens to experts, this method still suffers from token dropping because the same token can be selected by multiple experts.

We expect that improved routing algorithms complement our method for efficient and flexible expert computation. Exploring how these methods could be combined is an interesting direction for future research. High-Performance MoEs. To scale MoE training, Tutel implements optimized distributed communication primitives for MoEs and techniques for hiding the communication costs of expert model parallelism (Hwang et al., 2022).

He et al. (2022) proposed FasterMoE, a system for distributed training of MoEs based on efficient communication strategies and changes to the MoE routing algorithm to avoid network congestion. Our implementation could additionally benefit from these techniques, particularly for large-scale distributed training. Sparse Kernels. Sparse matrix formats that allow for efficient transposed access are well studied (Buluç et al., 2009; Smith and Karypis, 2015; Li et al., 2018).

Exploring how these formats can be adapted to large block sparsity on modern GPUs is an interesting direction for future research. 8. Conclusion. We introduced MegaBlocks, a system for efficient MoE training on GPUs. Our system is based on a reformulation of MoEs in terms of block-sparse operations and new block-sparse GPU kernels that efficiently handle the dynamism present in MoEs.

Our approach never drops tokens and maps efficiently to modern hardware accelerators, enabling end-to-end training speedups of up to 40% over MoEs trained with the state-of-the-art Tutel library and 2.4x over DNNs trained with the highly optimized Megatron-LM framework.

References.

Artetxe, M., Bhosale, S., Goyal, N., Mihaylov, T., Ott, M., Shleifer, S., Lin, X. V., Du, J., Iyer, S., Pasunuru, R., Anantharaman, G., Li, X., Chen, S., Akin, H., Baines, M., Martin, L., Zhou, X., Koura, P. S., O'Horo, B., Wang, J., Zettlemoyer, L., Diab, M., Kozareva, Z., and Stoyanov, V. Efficient large-scale language modeling with mixtures of experts. CoRR, abs/2112.10684, 2021.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020 (NeurIPS 2020), December 6-12, 2020, virtual, 2020.

Buluç, A., Fineman, J. T., Frigo, M., Gilbert, J. R., and Leiserson, C. E. Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In SPAA 2009: Proceedings of the 21st Annual ACM Symposium on Parallelism in Algorithms and Architectures, Calgary, Alberta, Canada, August 11-13, 2009, pp. 233-244. ACM, 2009. doi: 10.1145/1583991.1584053.

Child, R., Gray, S., Radford, A., and Sutskever, I. Generating long sequences with sparse transformers. CoRR, abs/1904.10509, 2019.

Clark, A., de las Casas, D., Guy, A., Mensch, A., Paganini, M., Hoffmann, J., Damoc, B., Hechtman, B., Cai, T., Borgeaud, S., van den Driessche, G., Rutherford, E., Hennigan, T., Johnson, M., Millican, K., Cassirer, A., Jones, C., Buchatskaya, E., Budden, D., Sifre, L., Osindero, S., Vinyals, O., Rae, J. W., Elsen, E., Kavukcuoglu, K., and Simonyan, K. Unified scaling laws for routed language models. CoRR, abs/2202.01169, 2022.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations (ICLR 2021), Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.

Du, N., Huang, Y., Dai, A. M., Tong, S., Lepikhin, D., Xu, Y., Krikun, M., Zhou, Y., Yu, A. W., Firat, O., Zoph, B., Fedus, L., Bosma, M., Zhou, Z., Wang, T., Wang, Y. E., Webster, K., Pellat, M., Robinson, K., Meier-Hellstern, K., Duke, T., Dixon, L., Zhang, K., Le, Q. V., Wu, Y., Chen, Z., and Cui, C. GLaM: Efficient scaling of language models with mixture-of-experts, 2021.

Elsen, E., Dukhan, M., Gale, T., and Simonyan, K. Fast sparse ConvNets. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA, June 13-19, 2020, pp. 14617-14626. Computer Vision Foundation / IEEE, 2020. doi: 10.1109/CVPR42600.2020.01464.

Fedus, W., Zoph, B., and Shazeer, N. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1-39, 2022.

Gale, T., Elsen, E., and Hooker, S. The state of sparsity in deep neural networks. CoRR, abs/1902.09574, 2019.

Gale, T., Zaharia, M., Young, C., and Elsen, E. Sparse GPU kernels for deep learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2020), Virtual Event / Atlanta, Georgia, USA, November 9-19, 2020. IEEE/ACM, 2020.

Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.

Gray, S., Radford, A., and Kingma, D. P. Blocksparse GPU kernels. https://blog.openai.com, 2017.

Han, S., Pool, J., Tran, J., and Dally, W. J. Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 2015.

He, J., Zhai, J., Antunes, T., Wang, H., Luo, F., Shi, S., and Li, Q. FasterMoE: Modeling and optimizing training of large-scale dynamic pre-trained models. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '22), pp. 120-134, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450392044. doi: 10.1145/3503221.3508418.

Hwang, C., Cui, W., Xiong, Y., Yang, Z., Liu, Z., Hu, H., Wang, Z., Salas, R., Jose, J., Ram, P., Chau, J., Cheng, P., Yang, F., Yang, M., and Xiong, Y. Tutel: Adaptive mixture-of-experts at scale, 2022.

Thanks for listening to this reading. For the entire paper, and more, check out our homepage.