Conventional neural accelerators rely on isolated, self-sufficient functional units that each perform an atomic operation and communicate their results through operand delivery and aggregation logic. Each unit processes all the bits of its operands atomically and produces all the bits of its result in isolation.
Engineers from UC San Diego (Hadi Esmaeilzadeh and Soroush Ghodrati) have designed a neural accelerator that uses a new hardware implementation of the vector dot-product operation. This compute engine can also be reconfigured dynamically to perform vector dot-products at flexible bit-widths. The compute engine is then integrated into a conventional architecture to accelerate deep neural networks (a neural accelerator). The building block of the UCSD researchers' neural accelerator is a Composable Vector Unit, a collection of Narrower-Bitwidth Vector Engines that are dynamically composed or decomposed at bit granularity.
This invention can be used in any neural processing unit to accelerate neural networks. It can also be used in other computer hardware for applications that require tensor operations, such as linear algebra and digital signal processing. Relevant areas include artificial intelligence, mobile devices, hardware acceleration, deep neural networks, and robotics.
This invention leverages a design in which each unit is responsible for only a slice of the bit-level operations, so that the benefits of bit-level parallelism are interleaved and combined with the abundant data-level parallelism in deep neural networks. A dynamic collection of these units cooperates at runtime to generate the bits of the results collectively. Such cooperation requires extracting a new grouping of the bits, which is only possible if the operands and operations are vectorizable. The abundance of Data-Level Parallelism, together with mostly repeated execution patterns, provides a unique opportunity to define and leverage this new dimension of Bit-Parallel Vector Composability. The design intersperses bit-level parallelism within data-level parallelism and dynamically interweaves the two. As such, the building block of the neural accelerator is a Composable Vector Unit, a collection of Narrower-Bitwidth Vector Engines that are dynamically composed or decomposed at bit granularity.
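The arithmetic behind composing narrow engines into a wider dot-product can be illustrated with a short software sketch. It is not the inventors' hardware design: the function names (`split_into_slices`, `narrow_dot`, `composed_dot`), the 2-bit slice width, and the restriction to unsigned operands are all assumptions made for illustration. The sketch relies only on the identity that, if each operand is written as a sum of bit slices weighted by powers of two, the full dot-product equals the shifted sum of the dot-products of the slice vectors.

```python
from itertools import product


def split_into_slices(value, total_bits, slice_bits):
    """Split an unsigned integer into slices of slice_bits, least significant first."""
    mask = (1 << slice_bits) - 1
    return [(value >> shift) & mask for shift in range(0, total_bits, slice_bits)]


def narrow_dot(xs, ws):
    """Hypothetical stand-in for one narrow vector engine: a dot-product of slice vectors."""
    return sum(x * w for x, w in zip(xs, ws))


def composed_dot(xs, ws, x_bits=8, w_bits=8, slice_bits=2):
    """Compose a wide dot-product from narrow ones (unsigned operands assumed)."""
    if x_bits % slice_bits or w_bits % slice_bits:
        raise ValueError("operand bit-widths must be multiples of slice_bits")
    x_slices = [split_into_slices(x, x_bits, slice_bits) for x in xs]
    w_slices = [split_into_slices(w, w_bits, slice_bits) for w in ws]
    total = 0
    # One narrow dot-product per (input slice, weight slice) pair.
    for i, j in product(range(x_bits // slice_bits), range(w_bits // slice_bits)):
        xi = [s[i] for s in x_slices]  # i-th slice of every input element
        wj = [s[j] for s in w_slices]  # j-th slice of every weight element
        # Shift by the combined significance of the two slices, then accumulate.
        total += narrow_dot(xi, wj) << (slice_bits * (i + j))
    return total


if __name__ == "__main__":
    inputs = [200, 17, 255, 3]
    weights = [9, 14, 1, 250]
    print(composed_dot(inputs, weights) == sum(x * w for x, w in zip(inputs, weights)))
```

In this sketch, an 8-bit by 8-bit dot-product with 2-bit slices uses 16 narrow dot-products, an 8-bit by 4-bit one uses 8, and a 2-bit by 2-bit one uses a single narrow dot-product. That is one way to see how the same pool of narrow engines could be regrouped for different bit-widths; how the actual hardware schedules and combines them is not specified here.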
The compute engine was integrated into a systolic array architecture to evaluate the effectiveness of the approach in accelerating deep neural networks. Using six diverse CNN and LSTM deep networks, the inventors evaluated this design style across four design points: with and without algorithmic bitwidth heterogeneity, and with and without a high-bandwidth off-chip memory. Across these four design points, Bit-Parallel Vector Composability brings a 1.4x to 3.5x speedup and a 1.1x to 2.7x energy reduction. The inventors also compared their design style to the Nvidia RTX 2080 Ti GPU, which also supports INT-4 execution. The benefits range from a 28.0x to a 33.7x improvement in Performance-per-Watt.