Efficient Inference of Mixture-of-Experts (MoE)-based Large Models with Theoretical Guarantees

Abstract: Mixture-of-Experts (MoE) architectures have emerged as a powerful paradigm for scaling large models by routing inputs to specialized subnetworks (experts), achieving impressive performance with reduced computation during training. However, efficient inference of MoE models remains challenging due to memory and computational overhead, especially when deployed in resource-constrained environments. In this talk, I will first introduce a provably efficient expert pruning method for fine-tuned MoE models, which preserves test-time accuracy by pruning the experts whose router norms change least from their pre-trained values. This technique significantly reduces inference cost and memory footprint without sacrificing accuracy, and it is validated on large-scale vision MoEs such as V-MoE and E3-MoE across standard benchmarks including CIFAR-10, CIFAR-100, and ImageNet.
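
The pruning criterion described above can be illustrated with a minimal sketch. It is not the speaker's implementation. It assumes each expert has one routing vector (a row of a router weight matrix), and that "change in router norm" means the signed difference between the fine-tuned and pre-trained l2 norms. The function names `router_norm_change` and `experts_to_keep` are hypothetical.

```python
import numpy as np

def router_norm_change(pretrained_router, finetuned_router):
    """Per-expert change in the l2 norm of the routing vector.

    Both arrays have shape (num_experts, hidden_dim); row e is assumed
    to be the routing vector of expert e (an assumption of this sketch).
    """
    pre = np.linalg.norm(pretrained_router, axis=1)
    post = np.linalg.norm(finetuned_router, axis=1)
    return post - pre

def experts_to_keep(pretrained_router, finetuned_router, num_keep):
    """Sorted indices of the num_keep experts with the largest norm change.

    The remaining experts (smallest change) are the pruning candidates.
    """
    change = router_norm_change(pretrained_router, finetuned_router)
    order = np.argsort(-change, kind="stable")  # largest change first
    return np.sort(order[:num_keep])

# Toy example: 4 experts, hidden size 3; fine-tuning scales each routing vector.
pre = np.ones((4, 3))
post = pre * np.array([[1.0], [2.0], [1.1], [3.0]])
kept = experts_to_keep(pre, post, num_keep=2)
```

In this toy setup, experts 0 and 2 barely move during fine-tuning, so they are the ones a norm-change criterion would prune first.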

Next, I will present a new expert-wise mixed-precision quantization strategy that also leverages router norm changes to assign precision levels. We theoretically show that experts with small router norm shifts, though less frequently activated, encode rare but important features and benefit from higher precision, while other experts can be quantized more aggressively. Applied to large language MoE models such as Switch Transformer and Mixtral 8×7B, this approach achieves state-of-the-art performance in ultra-low-bit regimes (sub-3-bit), outperforming prior methods in both accuracy and inference efficiency.
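
A minimal sketch of expert-wise precision assignment, under stated assumptions, may help make the idea concrete. It assumes a per-expert router norm change has already been computed, and it uses a simple two-level scheme (a fixed number of high-precision experts, all others low precision). The function name `assign_bit_widths`, the two-level scheme, and the 4-bit and 2-bit defaults are illustrative choices, not details taken from the talk.

```python
import numpy as np

def assign_bit_widths(norm_change, num_high, high_bits=4, low_bits=2):
    """Give high_bits to the num_high experts with the smallest router
    norm change and low_bits to every other expert.

    norm_change: 1-D sequence with one value per expert.
    Returns an integer array of per-expert bit widths.
    """
    norm_change = np.asarray(norm_change, dtype=float)
    bits = np.full(norm_change.shape, low_bits, dtype=int)
    smallest = np.argsort(norm_change, kind="stable")[:num_high]
    bits[smallest] = high_bits
    return bits

# Toy example: 4 experts, one of which keeps higher precision.
change = [0.0, 1.7, 0.2, 3.5]
bits = assign_bit_widths(change, num_high=1)
average_bits = bits.mean()
```

With one 4-bit expert and three 2-bit experts, the average width is 2.5 bits per expert, which is how an expert-wise scheme can stay in a sub-3-bit regime while protecting selected experts.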

Bio: Meng Wang is a Professor in the Department of Electrical, Computer, and Systems Engineering at Rensselaer Polytechnic Institute (RPI). She received B.S. and M.S. degrees from Tsinghua University, China, in 2005 and 2007, respectively, and a Ph.D. degree from Cornell University, Ithaca, NY, USA, in 2012. Prior to joining RPI in December 2012, she was a postdoctoral research scholar at Duke University. Her research areas include machine learning and data analytics, energy systems, signal processing, and optimization. She is a recipient of the Young Investigator Program (YIP) Awards from the Air Force Office of Scientific Research (AFOSR) in 2019 and the Army Research Office (ARO) in 2017. At Rensselaer, she received the James M. Tien '66 Early Career Award and Grant for Faculty in 2022 and the School of Engineering Research Excellence Award in 2018. She has been an Associate Editor of IEEE Transactions on Smart Grid since 2020 and an Associate Editor of IEEE Transactions on Signal Processing since 2024. She was a guest editor of the IEEE Journal of Selected Topics in Signal Processing Special Issue on Signal and Information Processing for Critical Infrastructures in 2018.

Trustworthiness of Machine-Learning-Based Systems (TrustML) Research Cluster
