tinyML Asia 2022



An All-Digital Reconfigurable SRAM-Based Compute-in-Memory Macro for TinyML Devices
Runxi WANG, Ph.D. Student, University of Michigan-Shanghai Jiao Tong University Joint Institute

Tiny machine learning (TinyML), which targets energy efficiency and low cost in AI-to-device integration, has proven to be an effective approach for data-intensive smart IoT devices. However, device heterogeneity and MCU constraints are the two main challenges to its wider adoption [1]. Conventional computer architectures such as the von Neumann architecture separate memory from computation and therefore incur significant data-transfer costs on TinyML devices. To overcome this barrier, the emerging compute-in-memory (CIM) architecture sheds new light by eliminating the boundary between memory and computation. This is especially beneficial for TinyML models that run inference on power-constrained devices. Among the different flavours of CIM architectures, SRAM-based CIM has gained increasing interest due to its compatibility with advanced technology nodes and its performance advantages [2].

Device heterogeneity has brought tremendous advantages to today's computing landscape, yet it also leads to costly design efforts and longer time to market. This is even more critical in architecting CIM, where customization efforts are required [1]. History shows that reconfigurability offers great opportunities for balancing performance tradeoffs against design costs, and it has enabled generations of new architectures that can be reconfigured at different levels. Existing CIM architectures, however, were usually designed under the constraints of very specific applications. For example, [3] proposed a 12T cell design useful for bit-wise XNOR calculation; [4] proposed two peripheral designs for bit-serial and bit-parallel vector accelerators, targeting throughput-oriented and efficiency-oriented scenarios respectively; and [5] examined the full-system design of a DNN computing cache architecture. A few recent CIM works have attempted to integrate reconfigurability: [6] presented a system-level design that offers reconfigurable precision support for cloud computing, and [7] proposed a CIM macro supporting reconfigurable bit-wise operations. These works demonstrate the feasibility of adopting reconfigurable CIM architectures in AI applications and the great potential of integrating more functionalities on a single macro.

Inspired by these works, we propose to develop a cross-layer reconfigurable CIM solution, where opportunities are exploited across cells, arrays, peripherals, architecture, and system design. In this talk, we present the first step on that roadmap: a novel all-digital SRAM cell macro that supports multiple operators commonly used in TinyML models in an all-in-one fashion.

Most compute-in-memory cell designs integrate the computing logic mainly in the memory array peripherals. This approach depends heavily on the operands' physical locations, which limits the achievable performance improvement. To allow more flexibility in operand placement while still guaranteeing low latency for simple arithmetic operations, this work integrates part of the computation workload, along with the reconfigurability, inside the SRAM cell itself. We introduce a novel 10T multi-logic SRAM cell that supports at least five basic operation modes: writing, reading, and bit-wise AND/OR/XOR logical operations. More complex functions are then built upon these cell-level operations with extended circuitry, as illustrated by the sketch below. Table I lists the function modes currently supported by the proposed CIM macro.
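To make the all-in-one operation set concrete, the following Python sketch models the five cell-level modes behaviorally. It is a minimal illustration under our own assumptions, not the 10T circuit itself: the `CIMArray` class and its method names are hypothetical.

```python
# Behavioral sketch (not the authors' circuit): models the five cell-level
# operation modes of a bit-wise compute-in-memory array. The class name
# `CIMArray` and its API are hypothetical, for illustration only.

class CIMArray:
    """An array of word-sized rows supporting in-place bit-wise compute."""

    def __init__(self, rows: int, width: int = 8):
        self.width = width
        self.mask = (1 << width) - 1
        self.mem = [0] * rows

    def write(self, row: int, value: int) -> None:
        # Mode 1: conventional SRAM write.
        self.mem[row] = value & self.mask

    def read(self, row: int) -> int:
        # Mode 2: conventional SRAM read.
        return self.mem[row]

    def compute(self, row_a: int, row_b: int, op: str) -> int:
        # Modes 3-5: activate two rows at once; the array resolves a
        # bit-wise AND, OR, or XOR of the stored words in place.
        a, b = self.mem[row_a], self.mem[row_b]
        if op == "and":
            return a & b
        if op == "or":
            return a | b
        if op == "xor":
            return a ^ b
        raise ValueError(f"unsupported mode: {op}")


# Example: the two operands are combined without leaving the array.
arr = CIMArray(rows=4)
arr.write(0, 0b1100)
arr.write(1, 0b1010)
print(bin(arr.compute(0, 1, "xor")))  # 0b110
```

The point of the model is that `compute` takes row addresses rather than values, mirroring how in-cell logic removes the read-out, compute, write-back round trip of a conventional memory hierarchy.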

It is worth mentioning that a pipelined scheme is employed at the peripherals to further improve performance. The design also reads two adjacent cells concurrently and splits the carry-ripple adder into two stages to reduce the latency of addition operations (see the sketch after this paragraph). The proposed design has been implemented and simulated in Cadence Virtuoso with an industry-standard 28nm process node. Since it is purely digital, the design is synthesizable and compatible with the standard top-down digital implementation flow. In the talk, we will present detailed simulation results, the design methodology, and comparisons against other state-of-the-art designs. The final goal of this project is to deliver a system-level compute-in-memory design approach that can support a wide range of TinyML algorithms.
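The Python sketch below illustrates only the latency idea behind the two-stage adder split, under our own interpretation of the abstract: the low half of each word ripples through one stage, the carry is latched at the stage boundary, and the high half consumes that carry in the next stage. The function names and the 8-bit width are illustrative assumptions, not the taped-out circuit.

```python
# Sketch of the two-stage carry-ripple split (assumed interpretation):
# each pipeline stage ripples through only bits/2 positions, halving the
# critical path of an addition at the cost of one extra latch.

def ripple_add(a: int, b: int, carry_in: int, bits: int) -> tuple[int, int]:
    """Ripple-carry addition over `bits` positions; returns (sum, carry_out)."""
    total, carry = 0, carry_in
    for i in range(bits):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        total |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | (carry & (ai ^ bi))
    return total, carry


def two_stage_add(a: int, b: int, bits: int = 8) -> int:
    half = bits // 2
    lo_mask = (1 << half) - 1
    # Stage 1: add the low halves; latch the partial sum and the carry.
    lo_sum, carry = ripple_add(a & lo_mask, b & lo_mask, 0, half)
    # Stage 2: add the high halves one cycle later, consuming the latched carry.
    hi_sum, carry_out = ripple_add(a >> half, b >> half, carry, half)
    return (carry_out << bits) | (hi_sum << half) | lo_sum


assert two_stage_add(200, 100) == 300
assert two_stage_add(255, 255) == 510
```

In a pipelined array, stage 1 of the next addition can start while stage 2 of the current one finishes, which is where the throughput benefit of the split comes from.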
