GenAI on the Edge Forum



ViT@Edge: Distilled Vision Transformer based Foundation Model for Efficient Edge Deployment
Hasib-Al RASHID, Ph.D. Student, University of Maryland

I. MOTIVATION AND PROBLEM FORMULATION

The rise of large-scale foundation models built on transformer architectures has revolutionized AI capabilities across image recognition (Vision Transformers, ViTs [1]) and natural language processing (e.g., ChatGPT [2]). While these models demonstrate remarkable performance, their massive size and computational requirements present a fundamental obstacle to deployment on resource-constrained edge devices. For instance, ViT-Base [1] contains 86 million parameters, resulting in a 344 MB model at 32-bit precision, far too large for embedded systems. Our goal is to develop innovative compression techniques that drastically reduce the footprint of foundation transformer models, enabling their widespread adoption in edge and tinyML applications without compromising their breakthrough capabilities.
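
As a back-of-the-envelope check on that figure, the short Python sketch below converts a parameter count into raw weight-storage size at different bit widths; the 86 million-parameter count is the published ViT-Base figure, and the rest is simple arithmetic (activations and runtime overhead are ignored).

def model_size_mb(num_params: int, bits_per_param: int) -> float:
    """Raw weight-storage size in megabytes (weights only, no overhead)."""
    return num_params * bits_per_param / 8 / 1e6

vit_base_params = 86_000_000  # published ViT-Base parameter count [1]
print(f"fp32: {model_size_mb(vit_base_params, 32):.0f} MB")  # ~344 MB
print(f"int8: {model_size_mb(vit_base_params, 8):.0f} MB")   # ~86 MB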

II. PROPOSED VIT@EDGE

ViT@Edge proposes a novel solution that leverages the strengths of Transformer models within edge computing environments. Transformers, foundational across NLP, computer vision, and multimodal domains, offer superior capabilities in modeling long-range dependencies. However, their complexity and the quadratic computational cost of the self-attention mechanism present significant challenges for real-world, industrial deployment, particularly in resource-constrained settings. Convolutional Neural Networks (CNNs), on the other hand, are prized for their efficiency and practicality in industrial applications thanks to their translation-equivariance inductive bias, but they fall short in capturing global information. ViT@Edge merges the global information processing power of Transformers with the efficiency and practical deployment capabilities of CNNs.

Given the distinct advantages and limitations of Vision Transformers and CNNs, Knowledge Distillation (KD) between these two architectures emerges as a compelling area of study [3]–[5]. KD provides a pathway for distilling knowledge from large CNN models into ViT models, obviating the need for the extensive labeled datasets that ViTs otherwise require to compensate for their weak inductive bias. This solution addresses the computational hurdles of traditional Transformer models, making them viable for edge computing applications without sacrificing the comprehensive data understanding that Transformers provide. The approach not only optimizes computational resources but also maintains the adaptability and performance of foundation models across applications.
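
One concrete instantiation of KD between the two architectures is classical logit-level distillation, in which the student is trained against the teacher's softened class probabilities in addition to the ground-truth labels. The following is a minimal, generic PyTorch sketch of such a loss; the temperature T and mixing weight alpha are illustrative hyperparameters, not values from this work.

import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            labels: torch.Tensor,
            T: float = 4.0,
            alpha: float = 0.7) -> torch.Tensor:
    """Logit-level distillation: KL divergence between softened teacher and
    student distributions, blended with hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # scale by T^2 to keep gradient magnitudes comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard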

III. EXPECTED VIT@EDGE RESULTS

We have shown in [6] that combining vanilla knowledge distillation with uniform 8-bit quantization yields a 296× memory reduction for a CNN-based multimodal pose classification task, and that our real-time deployment on a Raspberry Pi 4B achieves a power efficiency of 303.93 GOP/s/W. With the proposed ViT@Edge, we expect comparable or greater memory compression and power efficiency when deploying on real-time edge processors and devices.
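
For context, the memory reduction in [6] combines distillation into a smaller student with uniform 8-bit quantization of its weights. The sketch below is a minimal per-tensor affine 8-bit quantizer in PyTorch, given as an illustration of the general technique rather than the exact scheme used in [6].

import torch

def quantize_uint8(w: torch.Tensor):
    """Per-tensor affine (asymmetric) 8-bit quantization of a weight tensor."""
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min).clamp(min=1e-8) / 255.0
    zero_point = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale + zero_point), 0, 255).to(torch.uint8)
    return q, scale, zero_point

def dequantize_uint8(q, scale, zero_point) -> torch.Tensor:
    """Recover an approximate floating-point tensor from its 8-bit encoding."""
    return (q.float() - zero_point) * scale

# 8-bit storage shrinks a float32 weight tensor by roughly 4x, on top of
# whatever reduction distillation into a smaller student already provides.
w = torch.randn(256, 256)
q, s, z = quantize_uint8(w)
print((w - dequantize_uint8(q, s, z)).abs().max())  # small quantization error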

IV. ACTIONS CALL FOR EFFICIENT EDGE DEPLOYMENT

Our approach enhances CNNs by infusing them with global insights from vision transformers through representation-level distillation. This method goes beyond traditional logit-based distillation, tapping into the deeper, interdependent knowledge contained in transformer representations (a minimal sketch of such a loss follows the list below). It is particularly effective for models trained via self-supervised methods, offering a versatile and task-agnostic solution. This task aims to:

• Conduct a comprehensive study of KD between Transformers and CNNs to leverage the strengths of both architectures.

• Develop methodologies for effective knowledge transfer from vision-transformer models to CNNs, focusing on both logits and representation levels.

• Explore and evaluate the impact of transferring global information from vision transformers to CNNs on various applications.
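
The sketch referenced above is given here: the CNN student's spatial feature map is linearly projected into the ViT teacher's token dimension and matched with a mean-squared-error loss. The layer choice, the 1x1-convolution projection, and the assumption that the teacher's patch tokens align with the student's spatial grid are all illustrative, not prescriptions of the final method.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillLoss(nn.Module):
    """Representation-level distillation: project the CNN student's spatial
    feature map into the ViT teacher's token dimension and match with MSE."""

    def __init__(self, student_channels: int, teacher_dim: int):
        super().__init__()
        # 1x1 convolution acts as a learned linear projection per spatial location
        self.proj = nn.Conv2d(student_channels, teacher_dim, kernel_size=1)

    def forward(self, student_feat: torch.Tensor, teacher_tokens: torch.Tensor) -> torch.Tensor:
        # student_feat:   (B, C, H, W) from the CNN student
        # teacher_tokens: (B, H*W, D) ViT patch tokens (CLS token removed),
        #                 assumed to align with the student's HxW grid
        s = self.proj(student_feat)        # (B, D, H, W)
        s = s.flatten(2).transpose(1, 2)   # (B, H*W, D)
        return F.mse_loss(s, teacher_tokens)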

The exploration of Knowledge Distillation between Transformer models and Convolutional Neural Networks serves as a bridge to amalgamate the strengths of both architectures, paving the way for innovative solutions in various domains.
