
tinyML EMEA 2022 Manuele Rusci: Continual On-device Learning on Multi-Core RISC-V MicroControllers



Continual On-device Learning on Multi-Core RISC-V MicroControllers
Manuele RUSCI, Embedded Machine Learning Engineer, GreenWaves Technologies

In recent years, the combination of novel hardware technologies and optimized software tools has contributed to the success stories of many Deep Learning (DL)-powered MicroController (MCU)-based sensors. "Train-once-deploy-everywhere" has become the most widespread design paradigm: DL models are initially trained on pre-collected data using High Performance Computing facilities and then deployed on MCUs at scale. However, this approach shows severe weaknesses when real-world data differ from the training data, causing online mispredictions and failures on deployed sensors. To overcome this issue, future smart sensor devices are expected to continuously adapt their local Deep Learning models, i.e. update their coefficients, based on data coming from an ever-changing environment. This demands on-device adaptation capabilities, which belong to the Continual Learning domain and are currently out of the scope of TinyML devices.
In this context, this talk will first review the system architecture requirements for bringing efficient Continual Learning methods to low-power sensor platforms. With a focus on multi-core RISC-V MCU systems, we analyze the trade-off between memory, energy consumption and accuracy of backpropagation-based learning techniques that make use of Latent Replays (LRs) to prevent the catastrophic forgetting phenomenon. More specifically, we show how to effectively map Latent Replay based learning on a multi-core device and how to reduce the LR memory requirement by means of low-bitwidth quantization.
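As a minimal illustration of the idea (not the actual implementation presented in the talk), the C sketch below stores latent activations produced by a frozen front-end as int8 values with a shared scale, and dequantizes them on the fly when they are replayed alongside new samples to update the trainable layers. Buffer sizes, names and the simple per-buffer quantization scheme are assumptions made for the example.

#include <stdint.h>
#include <stdlib.h>

/* Hypothetical Latent Replay (LR) buffer: latent activations from the
 * frozen front-end are kept in low-bitwidth (int8) form to cut memory,
 * and dequantized only when replayed during backend updates. */

#define LR_SLOTS 64     /* number of stored latent vectors (assumption)  */
#define LR_FEATS 256    /* features per latent vector (assumption)       */

typedef struct {
    int8_t data[LR_SLOTS][LR_FEATS];  /* quantized latent activations    */
    float  scale;                     /* shared dequantization scale     */
    int    count;                     /* number of valid entries         */
} lr_buffer_t;

/* Quantize one float latent vector into the buffer (reservoir-style slot
 * selection once the buffer is full; rounding is simplified). */
static void lr_store(lr_buffer_t *buf, const float *latent)
{
    int slot = (buf->count < LR_SLOTS) ? buf->count++ : rand() % LR_SLOTS;
    for (int i = 0; i < LR_FEATS; i++) {
        int q = (int)(latent[i] / buf->scale);
        if (q >  127) q =  127;
        if (q < -128) q = -128;
        buf->data[slot][i] = (int8_t)q;
    }
}

/* Dequantize one stored latent vector so it can be mixed with fresh
 * activations in the mini-batch used to update the trainable layers. */
static void lr_fetch(const lr_buffer_t *buf, int slot, float *out)
{
    for (int i = 0; i < LR_FEATS; i++)
        out[i] = buf->data[slot][i] * buf->scale;
}

Lowering the bitwidth of the stored latent vectors (e.g. from 32-bit floats to 8 bits) directly shrinks the dominant LR memory cost, at the price of a small quantization error on the replayed activations.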
In the second part of the talk, we focus on the on-device learning task and we present PULP-TrainLib, the fastest compute library to run backpropagation based learning on MCUs. The library includes a set of parallel software DNN primitives for the execution of the forward and backward steps. Results on an 8-core RISC-V MCU show that our auto-
tuned primitives improve MAC/clk by up to 2.4× compared to “one-size-fits-all” matrix multiplication, achieving up to 4.39 MAC/clk – 36.6× better than a commercial STM32L4 MCU executing the same DNN layer training workload. Furthermore, our strategy proves to be 30.7× faster than AIfES, a state-of-the-art training library for MCUs, while training a complete TinyML model.
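To give a feel for how such parallel primitives can work, here is a minimal C sketch (not the actual PULP-TrainLib API) of the matrix multiplication at the core of the forward and backward steps, split across the cluster cores by output rows. The core count, argument layout and function names are assumptions made for the example.

/* Minimal sketch of a row-parallel matrix multiplication C = A * B,
 * the dominant kernel of forward and backward training steps.
 * Each core computes a contiguous chunk of the output rows;
 * core_id in [0, NUM_CORES) would come from the platform runtime. */

#define NUM_CORES 8   /* assumption: 8-core RISC-V cluster */

typedef struct {
    const float *A;   /* [M x K] activations or gradients */
    const float *B;   /* [K x N] weights                  */
    float       *C;   /* [M x N] output                   */
    int M, N, K;
} matmul_args_t;

static void matmul_parallel(const matmul_args_t *a, int core_id)
{
    int rows_per_core = (a->M + NUM_CORES - 1) / NUM_CORES;
    int start = core_id * rows_per_core;
    int stop  = start + rows_per_core;
    if (stop > a->M) stop = a->M;

    for (int i = start; i < stop; i++) {
        for (int j = 0; j < a->N; j++) {
            float acc = 0.0f;
            for (int k = 0; k < a->K; k++)
                acc += a->A[i * a->K + k] * a->B[k * a->N + j];
            a->C[i * a->N + j] = acc;
        }
    }
}

In practice, the best loop ordering, unrolling and data layout depend on the matrix shapes of each layer, which is why auto-tuning the primitive per layer pays off compared to a single generic kernel.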
