
tinyML Asia – Jungwook Choi: Quantization Techniques for Efficient Large Language Model Inference



Quantization Techniques for Efficient Large Language Model Inference
Jungwook CHOI
Assistant Professor
Hanyang University

The Transformer is a representation learning model that employs self-attention to extract features from input data. Introduced in 2017, it quickly became the key technique for Neural Machine Translation (NMT) in Natural Language Processing (NLP). Central to its design is the Multi-Head Attention mechanism, which enables versatile representation learning through layered feature extraction. Its value as a pre-trained feature extractor has grown steadily, particularly in pre-trained language models, where performance has been observed to improve with model size, dataset volume, and training compute, giving rise to Large Language Models (LLMs) with hundreds of billions of parameters. The Transformer's applications have also broadened to computer vision and speech recognition. However, running such massive Transformer models requires extensive computation and data movement during inference, resulting in high cost and power consumption. This seminar presents advanced techniques such as weight and activation quantization, which improve the efficiency of these models by reducing their computational and memory demands for deployment on edge devices.
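To make the weight-quantization idea mentioned in the abstract concrete, here is a minimal sketch of per-channel symmetric INT8 weight quantization in NumPy. This is an illustrative assumption of one common approach, not the speaker's method; the function names (`quantize_weights_int8`, `dequantize`) and the toy matrix sizes are hypothetical.

```python
import numpy as np

def quantize_weights_int8(w: np.ndarray):
    """Per-output-channel symmetric INT8 quantization (illustrative sketch).

    w: float32 weight matrix of shape (out_features, in_features).
    Returns INT8 weights and a per-channel float scale such that
    w is approximately w_int8 * scale.
    """
    # The largest magnitude in each output channel sets its quantization range.
    max_abs = np.abs(w).max(axis=1, keepdims=True)      # shape (out_features, 1)
    scale = max_abs / 127.0                             # map [-max_abs, max_abs] to [-127, 127]
    scale = np.where(scale == 0.0, 1.0, scale)          # avoid division by zero for all-zero rows
    w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_int8, scale.astype(np.float32)

def dequantize(w_int8: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximate float32 weight matrix from the INT8 values."""
    return w_int8.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(8, 16)).astype(np.float32)     # toy weight matrix
    w_q, s = quantize_weights_int8(w)
    err = np.abs(w - dequantize(w_q, s)).max()
    print(f"max abs quantization error: {err:.4f}")
```

Storing the weights as INT8 with a per-channel scale cuts weight memory roughly 4x versus float32, which is the kind of memory and bandwidth saving the talk targets for edge deployment; activation quantization follows the same principle but must handle values produced at run time.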
