Conferences

cpuC: a dynamic reconfigurable architecture for CNNs Acceleration
Presented at IEEE MCSoC 2025

cpuC is a reconfigurable compute architecture designed to accelerate convolutional neural networks (CNNs) on embedded and edge devices. CNNs are now ubiquitous in cameras, autonomous systems, mobile devices, and IoT hardware, yet on these platforms GPUs are often too power-hungry and CPUs lack sufficient parallelism. cpuC aims to bridge this gap by delivering GPU-like throughput at a fraction of the power and area, making CNN acceleration practical in low-power environments.

Unlike most coarse-grained reconfigurable arrays (CGRAs), which rely on a fixed 2D mesh of processing elements, cpuC is built around a dynamically reconfigurable crossbar. Its functional units (registers, adders, multipliers, and multi-port memory interfaces) can be physically reconnected every clock cycle, so a single multi-operation instruction can express an arbitrary dataflow graph and keep all resources busy. This removes the connectivity limitations common in mesh-based CGRAs and allows larger, more expressive parallel instructions.
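To make the crossbar idea concrete, here is a toy Python model. It is an illustration under simplifying assumptions, not cpuC's actual instruction format or hardware: each instruction is a routing table telling the crossbar which register or unit output feeds each functional-unit input, so a chained computation such as `r0*r1 + r2*r3` completes in one instruction, and the next instruction can rewire everything. The unit names (`mul0`, `mul1`, `add0`), the `routes`/`writeback` format, and the register names are hypothetical.

```python
import operator
from graphlib import TopologicalSorter
from typing import Callable, Dict, Tuple

# Hypothetical functional units (a real cpuC datapath has its own unit mix).
UNITS: Dict[str, Callable[[int, int], int]] = {
    "mul0": operator.mul,
    "mul1": operator.mul,
    "add0": operator.add,
}

def execute(instr: dict, regs: Dict[str, int]) -> Dict[str, int]:
    """Run one multi-operation instruction (one clock cycle in this toy model).

    instr["routes"]    maps a unit name to the sources of its inputs; each source
                       is a register name or another unit's output.
    instr["writeback"] maps a destination register to the source it latches.
    """
    routes: Dict[str, Tuple[str, ...]] = instr["routes"]
    # Track dependencies between units only; register sources are always ready.
    deps = {unit: {s for s in srcs if s in routes} for unit, srcs in routes.items()}
    values = dict(regs)
    # Evaluate units in dataflow order; a combinational loop raises graphlib.CycleError.
    for unit in TopologicalSorter(deps).static_order():
        values[unit] = UNITS[unit](*(values[s] for s in routes[unit]))
    new_regs = dict(regs)
    for reg, src in instr["writeback"].items():
        new_regs[reg] = values[src]
    return new_regs

regs = {"r0": 2, "r1": 3, "r2": 4, "r3": 5, "r4": 0}

# Cycle 1: r4 <- r0*r1 + r2*r3, a single instruction using all three units.
regs = execute({"routes": {"mul0": ("r0", "r1"), "mul1": ("r2", "r3"),
                           "add0": ("mul0", "mul1")},
                "writeback": {"r4": "add0"}}, regs)
# Cycle 2: the crossbar is rewired so that r1 <- r4 + r0.
regs = execute({"routes": {"add0": ("r4", "r0")}, "writeback": {"r1": "add0"}}, regs)
print(regs["r4"], regs["r1"])  # 26 28
```

In a fixed mesh, `add0` could only consume results from neighboring elements; here any unit can read any register or unit output, which is the flexibility the crossbar design is meant to provide.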

In evaluation, cpuC achieved a ~10× speedup over an optimized Intel i5-2400 CPU baseline and a ~36× speedup over direct convolution implementations, while consuming ~0.15 W in FPGA synthesis. The architecture was validated on real CNN workloads, including YOLOv7, Fast R-CNN (ResNet-50), and LeNet. The work also introduces pipeline-convolution, a practical optimization that significantly reduces memory and compute requirements while preserving inference quality.

Using Multiple Clocks in High-Level Synthesis to Overcome Unbalanced Clock Cycles
Presented at IEEE MCSoC 2023

This work proposes a novel High-Level Synthesis (HLS) scheduling technique that uses multiple overlapping clocks instead of a single global clock. In traditional HLS scheduling, one clock period must accommodate the slowest computations, so faster ones finish early and wait for the next edge. By aligning clock edges with computation layers and flattening control structures, the proposed approach significantly reduces this idle time caused by latency imbalance.
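The effect of latency imbalance can be seen in a deliberately simplified model. This is an illustrative assumption, not the paper's scheduler: a design is treated as a sequence of computation layers with fixed latencies. With one global clock, every layer is charged the period of the slowest layer; in the idealized multi-clock case, each layer ends on its own clock edge. The layer latencies and function names below are hypothetical.

```python
from typing import Sequence

def single_clock_time(layer_latencies_ns: Sequence[int]) -> int:
    """One global clock: the period must cover the slowest layer, and every
    layer occupies a full period even if it finishes early."""
    period = max(layer_latencies_ns)
    return period * len(layer_latencies_ns)

def idle_time(layer_latencies_ns: Sequence[int]) -> int:
    """Total time layers spend waiting for the next global clock edge."""
    period = max(layer_latencies_ns)
    return sum(period - lat for lat in layer_latencies_ns)

def multi_clock_time(layer_latencies_ns: Sequence[int]) -> int:
    """Idealized multi-clock case: each layer's clock edge arrives exactly when
    the layer finishes (ignores clock-generation and synchronization overhead)."""
    return sum(layer_latencies_ns)

layers = [2, 7, 3]  # hypothetical per-layer latencies in ns
print(single_clock_time(layers), multi_clock_time(layers), idle_time(layers))  # 21 12 9
```

The 9 ns difference in this example is exactly the idle time that per-layer clock edges aim to recover; actual gains also depend on clock-generation overhead and on the control-structure flattening, neither of which this sketch models.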

The method was implemented in LLVM and compared against a commercial HLS tool (Vitis). Across benchmark applications, it achieved up to 16× faster execution and up to 2× lower power consumption.

🔗 Paper: link
