Title: Efficient Inference of Mixture-of-Experts (MoE)-based Large Models with Theoretical Guarantees

Abstract: Mixture-of-Experts (MoE) architectures have emerged as a powerful paradigm for scaling large models by routing inputs to specialized subnetworks (experts), achieving strong performance with reduced computation during training. However, efficient inference of MoE models remains challenging because of memory and computational overhead, especially in resource-constrained deployments. In this talk, I will first introduce a provably efficient expert pruning method for fine-tuned MoE models, which preserves test-time accuracy by pruning the experts whose router norm has changed least from its pre-trained value. This technique significantly reduces inference cost and memory footprint and is validated on large-scale vision MoEs such as V-MoE and E3-MoE across standard benchmarks including CIFAR-10, CIFAR-100, and ImageNet.

Next, I will present a new expert-wise mixed-precision quantization strategy that also leverages router norm changes to assign precision levels. We theoretically show that experts with small router norm shifts, though less frequently activated, encode rare but important features and benefit from higher precision, while other experts can be quantized more aggressively. Applied to large language MoE models such as Switch Transformer and Mixtral 8×7B, this approach achieves state-of-the-art performance in ultra-low-bit regimes (sub-3-bit), outperforming prior methods in both accuracy and inference efficiency.
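As a rough illustration of the router-norm criterion described in the abstract, the following is a minimal sketch and not the speaker's implementation. It assumes each expert has one router weight vector, that "change in router norm" means the absolute difference between the fine-tuned and pre-trained L2 norms, and that precision is chosen from two hypothetical bit widths. The function names, array shapes, and bit widths are all illustrative assumptions.

```python
import numpy as np


def router_norm_change(pretrained, finetuned):
    """Per-expert absolute change in router L2 norm.

    Both inputs are arrays of shape (num_experts, d_model), one router
    weight vector per expert (an assumed layout).
    """
    pre = np.linalg.norm(pretrained, axis=1)
    post = np.linalg.norm(finetuned, axis=1)
    return np.abs(post - pre)


def experts_to_prune(change, num_prune):
    """Indices of the num_prune experts with the smallest norm change."""
    return np.argsort(change, kind="stable")[:num_prune]


def assign_bits(change, num_high, high_bits=3, low_bits=2):
    """Give the num_high experts with the smallest norm change the higher
    bit width and all others the lower one (bit widths are hypothetical)."""
    bits = np.full(change.shape[0], low_bits)
    bits[np.argsort(change, kind="stable")[:num_high]] = high_bits
    return bits
```

The two helpers correspond to the two separate settings in the abstract: ranking experts for pruning in a fine-tuned vision MoE, and ranking experts for precision assignment in a language MoE. They are not meant to be applied to the same model together.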

Bio: Meng Wang is a Professor in the Department of Electrical, Computer, and Systems Engineering at Rensselaer Polytechnic Institute. She received B.S. and M.S. degrees from Tsinghua University, China, in 2005 and 2007, respectively, and a Ph.D. degree from Cornell University, Ithaca, NY, USA, in 2012. Prior to joining RPI in December 2012, she was a postdoctoral research scholar at Duke University. Her research areas include machine learning and data analytics, energy systems, signal processing, and optimization. She is a recipient of the Young Investigator Program (YIP) Awards from the Air Force Office of Scientific Research (AFOSR) in 2019 and the Army Research Office (ARO) in 2017. At Rensselaer, she received the James M. Tien '66 Early Career Award and Grant for Faculty in 2022 and the School of Engineering Research Excellence Award in 2018. She has been an Associate Editor of IEEE Transactions on Smart Grid since 2020 and an Associate Editor of IEEE Transactions on Signal Processing since 2024. She was a guest editor of the IEEE Journal of Selected Topics in Signal Processing Special Issue on Signal and Information Processing for Critical Infrastructures in 2018.

