Current frameworks restrict the dynamic routing in MoE layers, forcing a tradeoff between model quality and hardware efficiency: users must either drop tokens from the computation or waste computation and memory on padding.
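The tradeoff shows up in how conventional MoE frameworks size their expert buffers. The sketch below is illustrative only (the routing distribution and capacity factor are made up, not taken from the paper): a fixed per-expert capacity means any tokens routed beyond it are dropped, while experts that receive fewer tokens leave padded slots unused.

```python
# Illustrative sketch (not MegaBlocks code): how a fixed expert capacity forces
# either token dropping or padding in a conventional MoE layer.
import numpy as np

num_tokens, num_experts, capacity_factor = 1024, 8, 1.0
capacity = int(capacity_factor * num_tokens / num_experts)  # slots per expert

rng = np.random.default_rng(0)
# Hypothetical, imbalanced top-1 routing decisions for each token.
assignments = rng.choice(
    num_experts, size=num_tokens,
    p=[0.30, 0.20, 0.14, 0.10, 0.10, 0.08, 0.05, 0.03],
)

counts = np.bincount(assignments, minlength=num_experts)
dropped = np.maximum(counts - capacity, 0).sum()   # tokens over capacity are discarded
padding = np.maximum(capacity - counts, 0).sum()   # unused slots are padded with zeros

print(f"capacity per expert:   {capacity}")
print(f"tokens dropped:        {dropped}")
print(f"padded (wasted) slots: {padding}")
```

Raising the capacity factor shrinks the dropped count but grows the padding, which is exactly the quality-versus-efficiency tradeoff described above.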
MegaBlocks reformulates MoE computation in terms of block-sparse operations and introduces new block-sparse GPU kernels that handle dynamic routing efficiently. It never drops tokens and maps well to modern hardware.
MegaBlocks achieves end-to-end training speedups of up to 40% over MoEs trained with the state-of-the-art Tutel library and 2.4x over dense neural networks (DNNs) trained with the highly optimized Megatron-LM framework.
Hardware accelerators like GPUs and TPUs are optimized for dense computation, making fine-grained sparse computation less efficient due to the irregularity of sparse operations.
Block-sparse operations allow MegaBlocks to handle the dynamic and load-imbalanced computation in MoE layers efficiently, enabling the system to process all tokens without dropping any.
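To see why block sparsity fits this workload, the following minimal reference sketch (plain PyTorch with assumed shapes and a top-1 router, not the MegaBlocks kernels) groups tokens by expert and runs variable-sized per-expert matmuls; concatenated, these are exactly the nonzero blocks of a block-diagonal product whose blocks grow or shrink with the routing, so every token is processed.

```python
# Reference sketch of "dropless" expert computation as a block-diagonal matmul.
import torch

num_tokens, hidden, ffn_hidden, num_experts = 512, 256, 512, 4
x = torch.randn(num_tokens, hidden)
w1 = torch.randn(num_experts, hidden, ffn_hidden)        # per-expert weights
assignments = torch.randint(num_experts, (num_tokens,))  # top-1 routing (imbalanced in general)

# Sort tokens by expert so each expert's tokens are contiguous rows.
order = torch.argsort(assignments)
x_sorted, experts_sorted = x[order], assignments[order]
counts = torch.bincount(experts_sorted, minlength=num_experts)

# Variable-sized per-expert matmuls: the nonzero blocks of the block-sparse product.
outputs, start = [], 0
for e in range(num_experts):
    end = start + counts[e].item()
    outputs.append(x_sorted[start:end] @ w1[e])
    start = end
y_sorted = torch.cat(outputs)

# Scatter results back to the original token order.
y = torch.empty_like(y_sorted)
y[order] = y_sorted
print(y.shape)  # (num_tokens, ffn_hidden) -- every token processed, none dropped
```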
The block size is chosen to ensure high arithmetic intensity and efficient use of GPU resources. A 128x128 block size was selected based on performance benchmarks, enabling high throughput on modern GPUs.
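A rough data-reuse argument (not taken from the paper's benchmarks) shows why larger blocks help: the FLOPs in a b×b output tile grow with b², while operand traffic grows roughly linearly in b, so arithmetic intensity rises with block size. The sketch below assumes fp16 operands and ignores caching and shared-memory effects.

```python
# Back-of-the-envelope arithmetic intensity for one b x b output tile of a
# b x k by k x b matmul (illustrative only).
def tile_arithmetic_intensity(block, k=4096, bytes_per_elem=2):
    flops = 2 * block * block * k                                        # multiply-accumulates
    bytes_moved = (block * k + k * block + block * block) * bytes_per_elem
    return flops / bytes_moved

for b in (16, 32, 64, 128):
    print(f"block={b:3d}  FLOPs/byte ~ {tile_arithmetic_intensity(b):.1f}")
```

Intensity grows roughly as b/2 FLOPs per byte, which is why a large 128x128 tile keeps the GPU's math units busy rather than memory-bound.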
MegaBlocks uses block-sparse matrix multiplication kernels that can handle variable numbers of tokens assigned to experts, ensuring no tokens are dropped and enabling efficient computation.
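One way to picture how such a kernel copes with variable expert sizes is to rebuild the sparsity metadata from the routing counts at every step. The helper below is a hypothetical sketch (its names and layout are not the real MegaBlocks topology format): each expert's token count is rounded up to whole 128-row blocks, and the coordinates of the nonzero blocks in the resulting block-diagonal product are enumerated.

```python
# Sketch of deriving block-sparse matmul metadata from routing results.
import math

BLOCK = 128

def topology_from_counts(tokens_per_expert, ffn_hidden):
    """Return (block_row, block_col) coordinates of the nonzero 128x128 blocks
    in the tokens x (num_experts * ffn_hidden) block-diagonal product."""
    cols_per_expert = ffn_hidden // BLOCK
    blocks, row_start = [], 0
    for expert, count in enumerate(tokens_per_expert):
        row_blocks = math.ceil(count / BLOCK)          # varies per expert, per batch
        col_start = expert * cols_per_expert
        for r in range(row_blocks):
            for c in range(cols_per_expert):
                blocks.append((row_start + r, col_start + c))
        row_start += row_blocks
    return blocks

blocks = topology_from_counts(tokens_per_expert=[300, 20, 512, 90], ffn_hidden=512)
print(len(blocks), blocks[:6])
```

Because the pattern is derived from the actual assignments, experts with many tokens simply get more row blocks instead of overflowing a fixed buffer.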
Token dropping significantly reduces model quality. For example, an MoE model that avoided token dropping achieved a 0.26 reduction in validation loss, compared to a 0.15 reduction for a comparable model that dropped tokens.
TPUs require static tensor shapes and struggle with fine-grained operations like scatters and gathers, making it difficult to implement dynamic routing in MoE layers directly on TPUs.
MegaBlocks uses less memory than Tutel, whose padding inflates memory requirements. This allows MegaBlocks to run larger micro-batch sizes, improving hardware efficiency.
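A back-of-the-envelope comparison (with made-up token counts and sizes, not measured numbers) illustrates the gap: padding every expert to a fixed capacity allocates for the worst case, while rounding each expert's tokens up to 128-row blocks tracks the actual load.

```python
# Illustrative activation-memory comparison for the first expert matmul:
# fixed-capacity padding versus a dropless, block-rounded layout.
import math

tokens_per_expert = [900, 350, 120, 2100, 60, 300, 150, 116]   # skewed routing (hypothetical)
ffn_hidden, bytes_per_elem, BLOCK = 4096, 2, 128
num_tokens = sum(tokens_per_expert)

capacity = int(2.0 * num_tokens / len(tokens_per_expert))       # capacity factor 2
padded_rows   = capacity * len(tokens_per_expert)               # fixed-shape, padded buffers
dropless_rows = sum(math.ceil(c / BLOCK) * BLOCK for c in tokens_per_expert)

to_mib = lambda rows: rows * ffn_hidden * bytes_per_elem / 2**20
print(f"padded:   {padded_rows} rows, {to_mib(padded_rows):.1f} MiB")
print(f"dropless: {dropless_rows} rows, {to_mib(dropless_rows):.1f} MiB")
# Note: with routing this skewed, the 2100-token expert still exceeds its
# 1024-slot capacity in the padded layout (tokens dropped), while the
# dropless layout processes all tokens with less memory.
```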
We present MegaBlocks, a system for efficient Mixture-of-Experts (MoE) training on GPUs. Our system is motivated by the limitations of current frameworks, which restrict the dynamic routing in MoE layers to satisfy the constraints of existing software and hardware. These formulations force a tradeoff between model quality and hardware efficiency, as users must choose between dropping tokens from the computation or wasting computation and memory on padding. To address these limitations, we reformulate MoE computation in terms of block-sparse operations and develop new block-sparse GPU kernels that efficiently handle the dynamism present in MoEs. Our approach never drops tokens and maps efficiently to modern hardware, enabling end-to-end training speedups of up to 40% over MoEs trained with the state-of-the-art Tutel library and 2.4x over DNNs trained with the highly-optimized Megatron-LM framework.
2022: Trevor Gale, Deepak Narayanan, Cliff Young, Matei Zaharia