TILE-MPQ: Design Space Exploration of Tightly Integrated Layer-WisE Mixed-Precision Quantized Units for TinyML Inference
Xiaotian ZHAO, PhD Student, Shanghai Jiao Tong University
Deep learning and machine learning have enjoyed great popularity and achieved huge success in the last decade. With the wide adoption of inference on tinyML or Edge AI platforms, it becomes increasingly critical to develop optimal models that fit a limited hardware budget [1]. However, state-of-the-art models with higher accuracy also come with large sets of parameters, which require a significant amount of computing resources and memory footprint. As a result, this limits the deployment of real-time DNN models on the edge, for applications such as real-time health monitoring, autonomous driving, real-time translation, and various other domains. Quantization has been demonstrated as one of the most efficient model compression solutions for reducing the size and runtime of a network via reduction of bit-width [2], [3]. Quantization reduces the precision of network components so that the model becomes lighter without a “noticeable” difference in efficacy. The most straightforward way of applying quantization is uniform quantization, where the resulting quantized values (aka quantization levels) are uniformly spaced. For example, uniform INT8 has been widely used in AI inference models [4]. As the demand for inference speed keeps increasing and on-device storage becomes more limited, lower-precision quantization schemes such as INT4 have been employed. But a uniform scheme (e.g., quantizing all layers to 4 bits) can cause significant accuracy loss.
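To make the uniform scheme concrete, here is a minimal NumPy sketch of symmetric per-tensor uniform quantization at a chosen bit-width. The scaling convention shown is one common choice, not necessarily the one used in this work.

```python
import numpy as np

def uniform_quantize(x, num_bits=8):
    """Symmetric uniform quantization: 2**num_bits evenly spaced levels."""
    qmax = 2 ** (num_bits - 1) - 1           # e.g. 127 for INT8, 7 for INT4
    scale = np.abs(x).max() / qmax           # single scale for the whole tensor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q8, s8 = uniform_quantize(w, 8)   # INT8: small reconstruction error
q4, s4 = uniform_quantize(w, 4)   # INT4: half the storage, larger error
print(np.abs(w - dequantize(q8, s8)).mean(),
      np.abs(w - dequantize(q4, s4)).mean())
```

Running this shows the trade-off described above: INT4 halves the storage of INT8 but pays for it with a visibly larger mean reconstruction error.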
To further reduce hardware consumption and compress the model, mixed-precision quantization (MPQ) was proposed [5]–[9], where some layers maintain higher precision while others are kept at lower precision. MPQ therefore strikes a better balance between accuracy and hardware cost [10]–[16]. However, many of these techniques are purely accuracy driven, without considering hardware implementation costs. When it comes to edge inference units, the limits set by hardware resources require a very fine-grained design-space search process that can provide explainable trade-offs. This certainly applies to the problem of finding the best mixed-precision quantization scheme at the edge, which needs to be both hardware and application friendly. While a few existing MPQ solutions introduce hardware awareness into the search process, the search is either computing intensive or model specific. Thus, a light and explainable hardware-aware MPQ methodology is required. In this work,
we propose a novel MPQ search algorithm that first “samples” the layer-wise sensitivity with respect to a newly proposed metric, which incorporates both accuracy (from the application perspective) and a proxy of cost (from the hardware perspective). Based on the hardware budget and accuracy requirements, a candidate search space is determined, and the optimal MPQ scheme is then explored within it. Evaluation results show that the proposed solution achieves 3%-11% higher inference accuracy at a similar hardware cost compared to state-of-the-art MPQ strategies. At the hardware level, we propose a new processing-in-memory (PIM) architecture that tightly integrates the optimal MPQ policies into the processor pipeline through Instruction Set Architecture (ISA) and micro-architecture co-design. To summarize, this work makes the following contributions.
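The exact metric is not reproduced in this abstract. Purely as an illustration of the idea, a combined score of this kind might weight a scheme's accuracy loss against a simple hardware-cost proxy such as bit-weighted parameter count; all names and the trade-off knob `lam` below are hypothetical.

```python
def cost_proxy(bit_widths, layer_params):
    """Hypothetical hardware-cost proxy: total weight storage in bits."""
    return sum(b * p for b, p in zip(bit_widths, layer_params))

def mpq_score(acc_drop, bit_widths, layer_params, baseline_bits=8, lam=1.0):
    """Illustrative combined metric (assumed form, lower is better).

    acc_drop     -- accuracy loss of this MPQ scheme vs. the full-precision model
    bit_widths   -- per-layer bit-width assignment, e.g. [8, 4, 4, 8]
    layer_params -- parameter count of each layer
    lam          -- assumed knob trading accuracy against hardware cost
    """
    full = cost_proxy([baseline_bits] * len(layer_params), layer_params)
    rel_cost = cost_proxy(bit_widths, layer_params) / full
    return acc_drop + lam * rel_cost

# Two candidate schemes for a small 4-layer model:
params = [1_728, 36_864, 73_728, 5_120]
print(mpq_score(0.010, [8, 4, 4, 8], params))  # keep first/last layers at INT8
print(mpq_score(0.035, [4, 4, 4, 4], params))  # uniform INT4
```

Any metric of this shape lets the search rank candidate schemes by a single number that reflects both the application view (accuracy) and the hardware view (cost).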
We define a new metric that takes hardware cost and accuracy loss into consideration simultaneously while performing the search for optimal MPQ schemes.
We propose a light MPQ search methodology that first analyzes single-layer sensitivity and then narrows down the search space (a schematic sketch follows this list). The proposed strategy outperforms other existing hardware-aware MPQ search solutions for lightweight neural network models.
We look into the adaptation of the optimal MPQ schemes at the hardware level and propose a tightly integrated layer-wise quantized mixed-precision unit that resides in the CPU pipeline. The customized architecture allows seamless processing of both AI and non-AI tasks on the same hardware.
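The sketch below mirrors only the shape of the described two-stage flow; the actual TILE-MPQ algorithm is detailed in the talk. The helper `evaluate_with_bits` is hypothetical: it would quantize a single layer to a trial bit-width, keep all other layers at full precision, and return validation accuracy.

```python
from itertools import product

BITS = (2, 4, 8)  # assumed candidate bit-widths

def sample_sensitivity(n_layers, evaluate_with_bits, baseline_acc):
    """Stage 1: quantize one layer at a time and record the accuracy drop."""
    sens = {}
    for layer in range(n_layers):
        for b in BITS:
            sens[(layer, b)] = baseline_acc - evaluate_with_bits(layer, b)
    return sens

def narrow_candidates(sens, n_layers, tol=0.01):
    """Stage 2: per layer, keep only bit-widths whose single-layer drop is
    within tol; fall back to the highest precision if none qualifies."""
    return [
        [b for b in BITS if sens[(layer, b)] <= tol] or [max(BITS)]
        for layer in range(n_layers)
    ]

def search(candidates, score_fn):
    """Exhaustively score the (now much smaller) candidate space."""
    return min(product(*candidates), key=score_fn)
```

Because the single-layer sweep costs only `n_layers * len(BITS)` evaluations, the expensive combinatorial search runs only over the pruned per-layer candidate lists rather than the full exponential space.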
In this talk, we will discuss the details of the proposed search methodology for layer-wise hardware-aware MPQ, including the design space exploration, the hardware architecture design methodology, and the evaluation results.