tinyML Summit 2021 https://www.tinyml.org/event/summit-2021
Hardware aware Dynamic Inference Technology
Urmish THAKKER, Principal Engineer, SambaNova Systems Inc
There has been a recent surge of research into dynamic inference techniques that reduce the cost of inference without sacrificing model accuracy. These techniques are based on the assumption that not all parts of the output feature map (OFM) are equally important for every input. The parts of the OFM deemed unimportant for a given input can be skipped entirely or computed at lower precision, reducing the number of computations. This can make a large, highly accurate network faster at inference time. However, we show that two popular methods, which optimize different aspects of the OFM (channel and spatial), lead to sparse matrix multiplication during inference on a CPU, which can result in poor run-time characteristics in spite of the reduced number of MAC operations. We show a way to make these techniques aware of the SIMD vector length, producing block-sparse matrices that run more efficiently on hardware with vector compute units. Our technique allows these models to create blocks of vector length 2, 4, and 8 with minimal loss in accuracy, beating traditional pruning methods by a large margin on an image classification task.
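To make the idea of vector-length-aware skipping concrete, here is a minimal NumPy sketch. It is an illustration only, not the method presented in the talk: the abstract does not describe how channel importance is computed, so the random `scores`, the helpers `block_channel_mask` and `masked_matvec`, and the `keep_ratio` parameter are all hypothetical. The sketch groups adjacent output channels into blocks of `vec_len`, keeps or skips each block as a whole, and computes only the kept blocks.

```python
import numpy as np


def block_channel_mask(scores, vec_len, keep_ratio):
    """Return a boolean mask over output channels that keeps or drops
    whole groups of `vec_len` adjacent channels (hypothetical helper)."""
    n_channels = scores.shape[0]
    assert n_channels % vec_len == 0, "channel count must be a multiple of vec_len"
    # Score each block by summing the scores of its channels.
    block_scores = scores.reshape(-1, vec_len).sum(axis=1)
    n_blocks = block_scores.shape[0]
    n_keep = max(1, int(round(keep_ratio * n_blocks)))
    # Indices of the highest-scoring blocks.
    keep = np.argsort(block_scores)[-n_keep:]
    block_mask = np.zeros(n_blocks, dtype=bool)
    block_mask[keep] = True
    # Expand the per-block decision back to a per-channel mask.
    return np.repeat(block_mask, vec_len)


def masked_matvec(weights, x, mask, vec_len):
    """Compute weights @ x only for kept blocks; skipped rows stay zero.
    Each kept block covers `vec_len` contiguous rows, the shape a SIMD
    unit of that width can process without gather/scatter."""
    out = np.zeros(weights.shape[0], dtype=weights.dtype)
    for start in range(0, weights.shape[0], vec_len):
        if mask[start]:  # one decision per block
            out[start:start + vec_len] = weights[start:start + vec_len] @ x
    return out


rng = np.random.default_rng(0)
weights = rng.standard_normal((16, 32))  # 16 output channels, 32 inputs
x = rng.standard_normal(32)
scores = rng.random(16)  # stand-in for an input-dependent importance score

mask = block_channel_mask(scores, vec_len=4, keep_ratio=0.5)
y = masked_matvec(weights, x, mask, vec_len=4)
```

With per-channel decisions instead, the kept rows could be scattered arbitrarily, which is the unstructured sparsity the abstract identifies as the cause of poor CPU run-time. Constraining decisions to blocks of 2, 4, or 8 channels trades some selection freedom for contiguous work.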