GenAI on the Edge Forum: Optimizing Large Language Model (LLM) Inference for Arm CPUs



Dibakar GOPE, Principal Engineer, Machine Learning & AI, Arm

Large language models (LLMs) have transformed how we think about language understanding and generation, enthralling researchers and developers alike. Enabling efficient execution of LLMs on commodity Arm CPUs will expand their reach to billions of compact devices, such as smartphones and other edge devices. In this talk, we will present a set of optimizations to accelerate LLM inference on Arm CPUs. These optimizations span low-precision matrix multiplication and compression techniques that reduce memory traffic and improve overall model performance. In particular, we will discuss how developers can employ the SDOT and SMMLA instructions available on Arm CPUs in conjunction with 4-bit quantization schemes to bring efficient LLM inference everywhere, from smartphones to edge devices.
