Advanced artificial intelligence (AI) processing, such as recognizing the surrounding environment, deciding on actions, and controlling motion, is required in many areas of society, including factories, logistics, medical care, service robots operating in cities, and security cameras.

These systems need to handle advanced AI processing in real time, and the processing must be embedded within the device itself so that it can respond quickly to a constantly changing environment. At the same time, AI chips must consume less power while performing advanced AI processing in embedded devices with strict limitations on heat generation.

To meet these market needs, Renesas developed DRP-AI (Dynamically Reconfigurable Processor for AI), an AI accelerator for high-speed AI inference that combines the low power consumption and flexibility required by edge devices. This reconfigurable AI processor technology is embedded in the RZ/V series of MPUs targeted at AI applications. The next generation of DRP-AI, DRP-AI3, achieves power efficiency approximately 10 times higher than the previous generation to support the further evolution of AI and the sophisticated requirements of robotics and automation applications.

DRP-AI3 Accelerator Features – High-Speed, Low-Power Hardware Based on Pruned AI Models

DRP-AI3 combines hardware and software to deliver a heterogeneous architecture for an AI-MPU.

Figure 1: Cooperative Design of the Hardware and Software for DRP-AI3

DRP-AI3 is a hardware architecture that supports the main model compression technologies of bit-count reduction (INT8 quantization) and pruning. The flexibility of DRP-AI3 allows it to accelerate randomly pruned models, which is difficult to achieve with existing hardware. Processing time can be reduced to as little as 1/16 and power consumption to about 1/8 of what they were before pruning.

Figure 2: Model Compression Technology Applied to DRP-AI3

DRP-AI3 introduces high-speed, low-power techniques that support the following major AI model compression methods (a brief sketch of both follows the list):

  1. Quantization: Reduces the bit width of the neural network's weight information (weights) and the input/output data (feature maps) of each layer, changing from the 16-bit floating-point arithmetic of conventional DRP-AI to 8-bit integer arithmetic (INT8).
  2. Pruning: A technique to skip calculations by setting weight information (branches) that do not affect recognition accuracy to zero.
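As a concrete illustration of the two techniques, the short NumPy sketch below applies a simple symmetric INT8 quantization to an example FP32 weight matrix and then zeroes its smallest-magnitude weights. The scale choice and the 90% pruning ratio are illustrative assumptions, not the DRP-AI3 tool flow.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)      # example FP32 weight matrix

# (1) Quantization: map the weights to INT8 with a simple per-tensor symmetric scale
scale = np.abs(w).max() / 127.0                        # illustrative scale choice
w_int8 = np.clip(np.round(w / scale), -128, 127).astype(np.int8)

# (2) Pruning: zero the smallest-magnitude weights (90% pruning ratio, illustrative)
threshold = np.quantile(np.abs(w), 0.90)
mask = np.abs(w) >= threshold
w_pruned = np.where(mask, w, np.float32(0.0))

print("storage: INT8", w_int8.nbytes, "bytes vs FP16", w.astype(np.float16).nbytes, "bytes")
print("non-zero weights after pruning:", int(mask.sum()), "of", w.size)
```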

For (1) quantization, power consumption is ideally expected to fall to roughly half or less of conventional DRP-AI (16-bit processing), since the size of the arithmetic units and the amount of data access scale down with the number of bits. For (2) pruning, how much weight information can be removed depends on the AI model, but if, for example, 90% pruning can be achieved, roughly 10 times higher speed and lower power consumption can be expected.

A major challenge with current AI hardware is that it cannot efficiently process AI models, especially (2) pruned AI models. AI hardware is generally based on a SIMD (Single Instruction, Multiple Data) architecture, which performs many multiply-accumulate (MAC) operations in parallel to efficiently process large neural networks. Because the weights that do not affect recognition accuracy are located randomly within the matrix, the parallel MAC operations are still carried out even when some of the weights inside them become zero, since zero and non-zero weights are processed together. As a result, pruning branches does not reduce the number of computations (Figure 3).
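The effect can be seen in a few lines of code: in a dense, SIMD-style matrix multiply, the zeroed weights still occupy compute lanes and are multiplied anyway, so randomly pruning 90% of the weights does not reduce the operation count. The sketch below is a generic illustration, not a model of any particular accelerator.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=(128, 128)).astype(np.float32)
x = rng.normal(size=(128,)).astype(np.float32)

# Randomly prune 90% of the weights (the kept 10% sit at random positions)
mask = rng.random(w.shape) < 0.10
w_sparse = np.where(mask, w, np.float32(0.0))

# A dense, SIMD-style matrix-vector multiply still performs every MAC, zeros included
y = w_sparse @ x                        # 128 * 128 multiply-accumulates regardless of sparsity

macs_dense = w.size                     # MACs actually performed by dense hardware
macs_if_skipped = int(mask.sum())       # MACs needed if zero weights could be skipped
print(f"dense MACs: {macs_dense}, MACs with zero-skipping: {macs_if_skipped}")
```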

Figure 3: Pruned Model Processing with General Parallel Architecture

Structured pruning, in which values are set to zero in regular patterns (e.g., an entire column of a weight matrix) so that parallelism is not compromised, is a well-known pruning method in AI hardware. However, this method cannot achieve a high pruning rate, because its constraints differ significantly from the inherently random distribution of insignificant weights. Renesas therefore developed a flexible N:M pruning method that can skip operations even under more random pruning.

As shown in Figure 4, this technology divides the original weight matrix into groups of M rows and reconstructs each group into a smaller N-row weight matrix containing only the significant weights extracted from that group. Parallel operations are then performed on the new weight matrix groups. In this process, DRP-AI3 has a new function that allows the number of operation cycles to be adjusted freely by switching the value of N for each weight matrix group, making it possible to optimally skip operations for the locally varying pruning rates found in real AI models, as shown in Figure 4. The ability to finely vary N also allows the pruning rate of the entire weight matrix to be tuned in detail, enabling optimal pruning according to the user's power consumption, operating speed, and recognition accuracy requirements.
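The sketch below expresses the basic idea in NumPy: within each group of M weights, only the N largest-magnitude weights are kept, and N may vary from group to group. The grouping along the input dimension and the magnitude-based selection are simplifying assumptions for illustration; the actual DRP-AI3 grouping, selection, and packing are hardware- and tool-specific.

```python
import numpy as np

def flexible_nm_prune(w, m, n_per_group):
    """Within each group of m consecutive weights, keep only the n largest-magnitude
    weights and zero the rest. n may differ per group, mirroring the locally varying
    pruning rates described in the text (illustrative sketch only)."""
    out_dim, in_dim = w.shape
    assert in_dim % m == 0 and len(n_per_group) == in_dim // m
    groups = w.reshape(out_dim, in_dim // m, m).copy()
    for g, n in enumerate(n_per_group):
        block = groups[:, g, :]                            # (out_dim, m) weights in this group
        keep = np.argsort(np.abs(block), axis=1)[:, -n:]   # indices of the N largest per row
        mask = np.zeros_like(block, dtype=bool)
        np.put_along_axis(mask, keep, True, axis=1)
        groups[:, g, :] = np.where(mask, block, 0.0)
    return groups.reshape(out_dim, in_dim)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 16)).astype(np.float32)
# M = 8 with N varying per group: 2:8 in the first group, 4:8 in the second
w_nm = flexible_nm_prune(w, m=8, n_per_group=[2, 4])
print("non-zero weights per group:",
      [int(np.count_nonzero(w_nm.reshape(4, 2, 8)[:, g, :])) for g in range(2)])
```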

Figure 4: Compression of a Pruned Model Using DRP-AI3

This technology reduces the weight data size and the number of processing cycles of AI models to as little as 1/10 and 1/16, respectively, resulting in a significant improvement in processing efficiency compared with conventional AI accelerator configurations (Figure 5).

Figure 5: Comparison of Pruned Model Processing Performance by Accelerator

Software Features for Generating and Implementing Pruned Models

A pruning flow such as the one shown in Figure 6 is generally applied to improve the pruning rate while suppressing the degradation of recognition accuracy.

Figure 6: DRP-AI3 AI Model Compression and Implementation Flow

Generally, after the initial training, pruning points are selected so as to have the least impact on recognition accuracy. Renesas developed a pruning tool (DRP-AI Extension Pack) that selects pruning points satisfying the architectural constraints of the DRP-AI3 pruning hardware described above. Users can apply DRP-AI3's characteristic flexible N:M pruning simply by specifying the pruning rate.

To further ease the introduction of pruning, the above tool is provided on top of OSS AI frameworks (PyTorch, TensorFlow), enabling the pruning and retraining steps in Figure 6 to be added to a user's existing training scripts with just a few lines of code.
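To give a rough sense of what "a few lines" looks like, the snippet below uses PyTorch's built-in torch.nn.utils.prune module to add magnitude pruning to an ordinary training script. This is only an analogy: the actual DRP-AI Extension Pack provides its own calls, which additionally enforce the flexible N:M hardware constraints described above.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# An existing user model and training loop are assumed; only the pruning-related
# lines are new. The generic PyTorch pruning API is used here purely as an analogy.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 16, 3))

# Added lines: prune 90% of each convolution's weights by magnitude
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.9)

# ... the user's retraining (fine-tuning) loop runs here unchanged ...

# Added lines: make the pruning permanent before exporting the model
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        prune.remove(module, "weight")
```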

In addition, pruned AI models generated by the pruning tool can be converted by DRP-AI TVM for simultaneous INT8 quantization and compilation.

DRP-AI TVM is a tool that converts trained AI models into a format executable on Renesas AI-MPUs. It is based on Apache TVM, an OSS machine learning compiler framework, and assigns the operations in each layer of an AI model that DRP-AI can execute to DRP-AI, while allocating the operations DRP-AI cannot execute to the CPU. This style of computing, in which multiple processors are used together, is called heterogeneous computing, and it greatly expands the range of AI models that can be executed.
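For orientation, the sketch below shows the generic Apache TVM flow that DRP-AI TVM builds on: import a trained model, compile it for a target, and create an executable module. The Renesas-specific backend that partitions layers between DRP-AI and the CPU is not shown, and the ONNX file name, input name, and shape are placeholders.

```python
# Generic Apache TVM compilation flow (the OSS base of DRP-AI TVM). The DRP-AI
# backend that offloads supported layers to the accelerator is provided by Renesas
# and is not shown here; "model.onnx" and the input name/shape are placeholders.
import onnx
import tvm
from tvm import relay
from tvm.contrib import graph_executor

onnx_model = onnx.load("model.onnx")
shape_dict = {"input": (1, 3, 224, 224)}               # assumed input name and shape
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# Without a DRP-AI backend, everything is compiled for the CPU target
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)

dev = tvm.cpu(0)
module = graph_executor.GraphModule(lib["default"](dev))
```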

Through this hardware-software co-design, Renesas provides a software environment that minimizes the time and effort users need to introduce pruning, while making the most of the DRP-AI hardware architecture and improving pruning efficiency.

DRP-AI3 Accelerator Features – Heterogeneous Architecture in which DRP-AI, DRP, and CPU Operate Cooperatively

Service robots, for example, require advanced AI processing to recognize the surrounding environment. On the other hand, non-AI algorithm-based processing is also required for deciding and controlling the robot's behavior. However, current embedded CPUs lack sufficient resources to perform these various types of processing in real time. Renesas solved this problem by developing a heterogeneous architecture technology that enables the dynamically reconfigurable processor (DRP), AI accelerator (DRP-AI), and CPU to work together.

As shown in Figure 7, the DRP can execute applications while dynamically switching the circuit connections between the arithmetic units on the chip at each operating clock cycle, according to the content being processed. Since only the necessary arithmetic circuits are used, the DRP consumes less power than CPU processing and can achieve higher speeds. Furthermore, whereas CPU performance degrades with frequent external memory accesses caused by cache misses and other factors, the DRP can build the necessary data paths in hardware ahead of time, resulting in less variation in operating speed (jitter) due to memory accesses.

The DRP also has a dynamic loading function that switches the circuit connection information each time the algorithm changes, enabling processing with limited hardware resources, even in robotic applications that require processing of multiple algorithms.

The DRP is particularly effective for streaming data processing such as image recognition, where parallelization and pipelining directly improve performance. On the other hand, CPU software processing may be more suitable for programs such as robot behavior decision and control, which must change their conditions and processing details in response to changes in the surrounding environment. Renesas' heterogeneous architecture technology distributes the processing to the most suitable engine and allows the DRP and CPU to operate in a coordinated manner.

Figure 7: Flexible Dynamically Reconfigurable Processor (DRP) Features

An overview of the MPU and AI accelerator (DRP-AI) architecture is shown in Figure 8.

Figure 8: DRP-AI3-based Heterogeneous Architecture Configuration

Evaluation Results

A prototype test chip achieved a maximum of 8 TOPS (8 trillion operations per second) for the processing performance of the AI accelerator. By reducing the number of operation cycles in proportion to the amount of pruning, we could achieve AI model processing performance equivalent to a maximum of 80 TOPS when compared to models before pruning (Note 1). This is about 80 times higher than the processing performance of the conventional DRP-AI, a significant performance improvement to keep pace with the rapid evolution of AI (Figure 9).
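The arithmetic behind the "equivalent" figure is straightforward under the assumption, described above, that operation cycles shrink in proportion to the weights removed; the 90% pruning rate below is illustrative.

```python
# Rough arithmetic behind the "equivalent" performance figure: if the number of
# operation cycles shrinks in proportion to the weights removed, a physical 8 TOPS
# engine running a 90%-pruned model behaves like a dense engine of 8 / (1 - 0.9) TOPS.
physical_tops = 8.0
pruning_rate = 0.9            # illustrative; the achievable rate depends on the AI model
effective_tops = physical_tops / (1.0 - pruning_rate)
print(f"dense-equivalent performance: {effective_tops:.0f} TOPS")
```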

Figure 9: Comparison of Measured Peak Performance of DRP-AI

As AI processing speeds up, the processing time for non-AI image processing is becoming a relative bottleneck. In AI-MPUs, a portion of the image processing program is offloaded to the DRP, thereby contributing to the improvement of the overall system processing time (Figure 10).

Figure 10: Heterogeneous Architecture Speeds Up Image Recognition Processing

An evaluation board equipped with the prototype chip was able to perform the same real-time AI processing without a fan as existing products on the market that are equipped with fans (Figure 11).

Figure 11: Comparison of Heat Generation between a Fanless DRP-AI Test Board and a GPU with Fan

Conclusion

Renesas DRP-AI3 – an advanced version of DRP-AI (Dynamically Reconfigurable Processor for AI) – is a unique AI accelerator that combines the low power consumption and flexibility required by endpoints with processing capability for lightweight (pruned) AI models, delivering roughly 10 times the power efficiency (10 TOPS/W) of previous models.

Visit www.renesas.com/rzv2h to learn more about the device and Renesas DRP-AI.
