Renesas has developed the RZ/V2H, a unique AI processor that combines the low power and flexibility required by endpoints. It has the processing capabilities for pruning AI models and is also 10 times more power efficient than previous products.
By Shingo Kojima, Renesas Electronics
As the working population decreases due to falling birthrates and a growing proportion of elderly people, advanced artificial intelligence (AI) processing, such as recognition of the surrounding environment, decision of actions, and motion control, is needed in many parts of society: factories, logistics, medical care, service robots operating in the city, and security cameras.
The system must be embedded in the devices so that it can respond quickly to a constantly changing environment. AI chips must consume less power so that they do not generate too much heat.
To meet these market needs, Renesas developed DRP-AI3. It is a dynamically reconfigurable processor for high-speed AI inference processing that combines low power with the flexibility required by edge devices. This reconfigurable AI accelerator processor technology, cultivated over many years, is embedded in the RZ/V series of MPUs targeted at AI applications.
This article introduces how the RZ/V2H solves heat generation challenges, enables high real-time processing speed, and improves the performance and reduces the power consumption of AI-equipped products.
The full article, published in issue 1/2024 of ETNdigi magazine, follows below.
RUN AI MODELS WITH VERY LOW POWER
As the working population decreases due to falling birthrates and an aging population, advanced artificial intelligence (AI) processing, such as recognition of the surrounding environment, action decisions, and motion control, will be required in many areas of society, including factories, logistics, medical care, service robots operating in cities, and security cameras. Systems will need to handle this advanced AI processing in real time across various types of programs. In particular, the processing must be embedded within the device to enable a quick response to a constantly changing environment. Because embedded devices have strict limits on heat generation, AI chips must consume less power while performing advanced AI processing.
To meet these market needs, Renesas developed DRP-AI3 (Dynamically Reconfigurable Processor for AI3) as an AI accelerator for high-speed AI inference processing that combines the low power and flexibility required by edge devices. This reconfigurable AI accelerator processor technology, cultivated over many years, is embedded in the RZ/V series of MPUs targeted at AI applications.
The RZ/V2H is a new high-end product of the RZ/V series, achieving power efficiency approximately 10 times higher than that of previous products. The RZ/V2H is able to respond to the further evolution of AI and the sophisticated requirements of applications such as robots. This article introduces how the RZ/V2H solves heat generation challenges, enables high real-time processing speed, and realizes higher performance and lower power consumption for AI-equipped products.
EFFICIENT PROCESSING OF AI MODELS
Pruning is a typical technique for improving AI processing efficiency: it omits calculations that do not significantly affect recognition accuracy. However, such calculations are usually scattered randomly throughout an AI model. This creates a mismatch between the regular parallelism of hardware processing and the randomness of pruning, which makes processing inefficient.
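The mismatch can be illustrated with a toy model (an assumption made for illustration, not a description of any Renesas hardware): several parallel lanes each process one row of weights and must wait for the slowest lane, so randomly scattered zeros save fewer cycles than their count suggests. The function names and the weight block below are hypothetical.

```python
def lockstep_cycles(rows):
    """Cycles needed when every lane skips its own zeros but all lanes
    advance together, so the busiest lane sets the pace."""
    return max(sum(1 for w in row if w != 0) for row in rows)


def ideal_cycles(rows):
    """Cycles needed if the remaining work could be spread evenly over lanes."""
    nonzeros = sum(1 for row in rows for w in row if w != 0)
    lanes = len(rows)
    return -(-nonzeros // lanes)  # ceiling division


# Hypothetical 4-lane weight block: half of the 32 weights are zero,
# but the zeros are unevenly distributed across the lanes.
weights = [
    [3, 0, 1, 4, 2, 5, 7, 6],   # 7 nonzeros
    [0, 0, 0, 2, 0, 0, 1, 0],   # 2 nonzeros
    [1, 0, 0, 0, 0, 3, 0, 0],   # 2 nonzeros
    [0, 2, 0, 5, 1, 0, 4, 6],   # 5 nonzeros
]

dense = len(weights[0])
print(dense, lockstep_cycles(weights), ideal_cycles(weights))
```

In this toy block, removing half of the weights saves only one of the eight dense cycles, because the densest lane still holds seven nonzero weights.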
To solve this issue, Renesas optimized its unique DRP-based AI accelerator (DRP-AI) for pruning. By analyzing how pruning pattern characteristics and pruning methods relate to recognition accuracy in typical image recognition AI models (CNN models), Renesas identified an AI accelerator hardware structure that can achieve both high recognition accuracy and an efficient pruning rate, and applied it to the DRP-AI3 design. In addition, software was developed to make AI models lighter in a way optimized for the DRP-AI3. This software converts the random pruning model configuration into highly efficient parallel computing, resulting in higher-speed AI processing. In particular, Renesas' highly flexible pruning support technology (flexible N:M pruning technology), which can dynamically change the number of cycles in response to changes in the local pruning rate in AI models, allows fine control of the pruning rate according to the power consumption, operating speed, and recognition accuracy that users require.
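As a minimal sketch of the general idea behind N:M pruning (not Renesas' implementation, whose details the article does not give), the snippet below keeps the N largest-magnitude weights in every group of M consecutive weights and zeroes the rest. Allowing a different N per group mimics, in a simplified way, a locally varying pruning rate. The function name `prune_n_of_m` and the sample weights are hypothetical.

```python
def prune_n_of_m(weights, m, n_per_group):
    """Keep the n largest-magnitude weights in each group of m; zero the rest.

    n_per_group holds one n for each group, so the pruning rate can vary locally.
    """
    if len(weights) % m != 0:
        raise ValueError("len(weights) must be a multiple of m")
    groups = len(weights) // m
    if len(n_per_group) != groups:
        raise ValueError("need one n value per group")

    pruned = []
    for g, n in enumerate(n_per_group):
        group = weights[g * m:(g + 1) * m]
        # Indices of the n largest magnitudes within this group.
        keep = set(sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.8, 0.01]
# Group 0 keeps 2 of 4 weights, group 1 keeps only 1 of 4.
print(prune_n_of_m(weights, m=4, n_per_group=[2, 1]))
```

Because every group has a known, bounded number of surviving weights, the zeros are no longer scattered arbitrarily. In a unit that processes one kept weight per cycle (an assumption of this toy model), the first group would take 2 cycles and the second 1, instead of 4 each.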
Figure 1: Flexible Dynamically Reconfigurable Processor (DRP) Features.
Heterogeneous Architecture Features in which DRP-AI3, DRP, and CPUs Operate Cooperatively
- Multi-threaded and pipelined processing with AI accelerator (DRP-AI3), DRP, and CPUs
- Low-jitter, high-speed robot applications with DRP (dynamically reconfigurable wired logic hardware)
Service robots, for example, require advanced AI processing to recognize the surrounding environment. On the other hand, algorithm-based processing that does not use AI is also required for deciding and controlling the robot's behavior. However, current embedded processors (CPUs) lack sufficient resources to perform these various types of processing in real time. Renesas solved this problem by developing a heterogeneous architecture technology that enables the dynamically reconfigurable processor (DRP), AI accelerator (DRP-AI3), and CPU to work together.
As shown in Figure 1, the dynamically reconfigurable processor (DRP) can execute applications while dynamically switching the circuit connection configuration of the arithmetic units on the chip at each operating clock according to the content to be processed. Since only the necessary arithmetic circuits are used, the DRP consumes less power than CPU processing and can achieve higher speed.
Furthermore, compared to CPUs, where frequent external memory accesses due to cache misses and other causes will degrade performance, the DRP can build the necessary data paths in hardware ahead of time, resulting in less performance degradation and less variation in operating speed (jitter) due to memory accesses.
The DRP also has a dynamic reconfiguration function that switches the circuit connection information each time the algorithm changes, enabling processing with limited hardware resources even in robotic applications that require multiple algorithms.
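The idea of reusing a fixed pool of arithmetic units for several algorithms can be pictured with a small software analogy. This is only a conceptual sketch: the class `ToyReconfigurableFabric`, its methods, and the two sample datapaths are hypothetical, and real DRP hardware switches wired logic rather than calling functions.

```python
class ToyReconfigurableFabric:
    """Software analogy only: a fixed pool of units whose wiring is swapped."""

    def __init__(self, num_units):
        self.num_units = num_units
        self.configs = {}      # name -> list of per-sample operations
        self.active = None

    def store(self, name, datapath):
        """Register a datapath; it must fit into the available units."""
        if len(datapath) > self.num_units:
            raise ValueError("datapath needs more units than the fabric has")
        self.configs[name] = datapath

    def load(self, name):
        """Switch the connection configuration to another stored datapath."""
        self.active = self.configs[name]

    def stream(self, samples):
        """Push each sample through the currently wired chain of units."""
        out = []
        for x in samples:
            for op in self.active:
                x = op(x)
            out.append(x)
        return out


fabric = ToyReconfigurableFabric(num_units=3)
fabric.store("scale_offset", [lambda x: x * 2, lambda x: x + 1])
fabric.store("threshold", [lambda x: x - 4, lambda x: 1 if x > 0 else 0])

fabric.load("scale_offset")
first = fabric.stream([1, 2, 3])    # same units, first algorithm
fabric.load("threshold")
second = fabric.stream([3, 5, 9])   # same units, rewired for a second algorithm
print(first, second)
```

Both algorithms run on the same three units. Only the stored configuration changes, which is the property that lets limited hardware serve multiple algorithms.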
The DRP is particularly effective in processing streaming data such as image recognition, where parallelization and pipelining directly improve performance. On the other hand, programs such as robot behavior decision and control must adapt their conditions and processing details to changes in the surrounding environment. CPU software processing may be more suitable for this than hardware processing such as in the DRP. It is important to distribute processing to the right places and to operate in a coordinated manner. Renesas' heterogeneous architecture technology allows the DRP and CPU to work together.
An overview of the MPU and AI accelerator (DRP-AI3) architecture is shown in Figure 2. Robotic applications use a sophisticated combination of AI-based image recognition and non-AI decision and control algorithms. Therefore, a configuration with a DRP for AI processing (DRP-AI3) and a DRP for non-AI algorithms will significantly increase the throughput of the robotic application.
Figure 2: DRP-AI3-based Heterogeneous Architecture Configuration.
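To see why cooperative, pipelined operation raises throughput, consider a rough timing model with hypothetical per-frame stage times (the numbers are illustrative assumptions, not measurements). Run back to back on one processor, the stage times add up. Run as a pipeline on separate units, the steady-state frame interval is set by the slowest stage.

```python
# Hypothetical per-frame stage times in milliseconds (illustrative only).
stages_ms = {
    "pre-processing (DRP)": 4.0,
    "AI inference (DRP-AI3)": 10.0,
    "decision and control (CPU)": 6.0,
}


def sequential_interval(stages):
    """Time between frames when one unit runs every stage in turn."""
    return sum(stages.values())


def pipelined_interval(stages):
    """Steady-state time between frames when each stage has its own unit."""
    return max(stages.values())


seq = sequential_interval(stages_ms)
pipe = pipelined_interval(stages_ms)
print(f"sequential: {seq} ms/frame, pipelined: {pipe} ms/frame, "
      f"throughput gain: {seq / pipe:.1f}x")
```

Pipelining does not shorten the latency of a single frame, which remains the sum of the stage times. It raises the rate at which frames complete.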
EVALUATION OF PROCESSING PERFORMANCE
The RZ/V2H, equipped with this technology, achieves a maximum AI accelerator processing performance of 8 TOPS (8 trillion sum-of-products operations per second). Furthermore, for AI models that have been pruned, the number of operation cycles can be reduced in proportion to the amount of pruning, giving AI model processing performance equivalent to a maximum of 80 TOPS when compared with models before pruning. This is about 80 times higher than the processing performance of the previous RZ/V products, a significant improvement that can keep pace with the rapid evolution of AI (Figure 3).
Figure 3: Comparison of Measured Peak Performance of DRP-AI3.
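The relation between the dense peak figure and the pruned-equivalent figure can be written out as a short calculation. It assumes, as the text states, that operation cycles shrink in proportion to the pruned share. The function and parameter names are illustrative, and the 1-in-10 ratio is simply the value that reproduces the 80 TOPS figure from 8 TOPS.

```python
def equivalent_tops(dense_tops, kept, total):
    """Dense-equivalent throughput when only `kept` of every `total`
    operations still have to be executed."""
    if not 0 < kept <= total:
        raise ValueError("require 0 < kept <= total")
    return dense_tops * total / kept


print(equivalent_tops(8, kept=10, total=10))  # no pruning
print(equivalent_tops(8, kept=1, total=10))   # 9 of 10 operations pruned
```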
At the same time, as AI processing speeds up, the processing time for algorithm-based image processing that does not use AI, such as pre- and post-processing around AI inference, is becoming a relative bottleneck. In AI-MPUs, a portion of the image processing program is offloaded to the DRP, which helps shorten the overall system processing time (Figure 4).
Figure 4: Heterogeneous Architecture Speeds Up Image Recognition Processing (Measured by Test Chip).
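The bottleneck effect follows from simple arithmetic, sketched below with hypothetical timings (assumed for illustration; the measured values are those in Figure 4). When only the AI stage is accelerated, the unaccelerated pre- and post-processing comes to dominate the total, which is why offloading part of it to the DRP pays off.

```python
def total_ms(pre_post_ms, ai_ms):
    """End-to-end time for one frame: non-AI image processing plus AI inference."""
    return pre_post_ms + ai_ms


# Hypothetical baseline: 20 ms of pre/post-processing on the CPU, 80 ms of AI.
pre_post, ai = 20.0, 80.0

baseline = total_ms(pre_post, ai)               # 100 ms
fast_ai = total_ms(pre_post, ai / 10)           # AI stage 10x faster: 28 ms
fast_both = total_ms(pre_post / 4, ai / 10)     # pre/post also 4x faster: 13 ms

share = pre_post / fast_ai
print(f"pre/post share after AI speed-up: {share:.0%}")
print(f"overall speed-up: {baseline / fast_ai:.1f}x -> {baseline / fast_both:.1f}x")
```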
In terms of power efficiency, the performance evaluation of the AI accelerator demonstrated power efficiency among the world's best (approximately 10 TOPS per watt) when running major AI models (Figure 5).
Figure 5: Power Efficiency of Real AI Models (Measured by Test Chip).
We also showed that the same real-time AI processing could be performed on a fanless evaluation board equipped with the RZ/V2H at temperatures comparable to those of competitor products equipped with fans (Figure 6).
Figure 6: Comparison of Heat Generation between a Fanless RZ/V2H Board and a GPU with Fan.
EXAMPLES OF APPLICATIONS
For example, SLAM (Simultaneous Localization and Mapping), one of the typical robot applications, has a complex configuration that requires multiple program processes for robot position recognition in parallel with environment recognition by AI processing. The Renesas DRP enables the robot to switch programs instantaneously, and parallel operation with an AI accelerator and CPU has proven to be about 17 times faster than CPU operation alone while reducing power consumption to 1/12 the level of CPU operation alone.
CONCLUSIONS
Renesas developed the RZ/V2H, a unique AI processor that combines the low power and flexibility required by endpoints with processing capabilities for pruned AI models, and that is about 10 times more power efficient (approximately 10 TOPS/W) than previous products.
Renesas will release products in a timely manner in response to the evolution of AI, which is expected to become increasingly sophisticated, and will contribute to the deployment of systems that let endpoint products respond smartly and in real time.