25–27 Mar 2025
European Space Research and Technology Centre (ESTEC)
Europe/Amsterdam timezone
Draft Agenda published

Fast SEU Detection and Recovery in FPGA-Based AI Accelerators

26 Mar 2025, 12:10
25m
Newton 1 and 2 (European Space Research and Technology Centre (ESTEC))

Keplerlaan 1 2201AZ Noordwijk ZH The Netherlands
Oral presentation Fault Tolerance Methodologies and Tools

Speaker

Eleonora Vacca (Politecnico di Torino)

Description

The increasing complexity of deep learning models has created a demand for high-performance computing platforms that can execute inference tasks efficiently. The flexibility of FPGAs has made them an appealing choice for accelerating such tasks. More recently, interest has also grown in RISC-V-based solutions combined with dedicated AI accelerators to enhance computational capabilities. While these platforms successfully address performance requirements, their implementation in reconfigurable logic for safety-critical applications, such as space missions, introduces reliability challenges. Specifically, Single Event Upsets (SEUs) in the Configuration RAM (CRAM) can alter circuit behavior, potentially leading to system failure. Traditional redundancy-based fault-tolerance techniques are impractical for Deep Neural Network (DNN) accelerators due to limited hardware resources and the inherently high parallelism of their datapaths. Therefore, fast detection of radiation-induced faults is critical to prevent mission-compromising consequences, along with an efficient recovery mechanism to minimize system downtime.
To address these challenges, we propose an FPGA-based heterogeneous computing platform that integrates the NEORV32 RISC-V processor with a systolic-array-based DNN accelerator, implemented on an AMD KCU105 device. Our system features a built-in self-test and self-recovery mechanism that leverages algorithm-based fault tolerance (ABFT) to detect errors in the accelerator’s datapath during neural network execution. By extending the accelerator’s ISA, inference can run in either standard or testing mode, enabling fault detection on the order of a few clock cycles rather than seconds. When a fault is detected, we exploit the FPGA’s partial reconfigurability to trigger dynamic partial reconfiguration (DPR) and reload only the affected section of the bitstream, preserving ongoing computations while restoring the faulty accelerator.
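The abstract does not say which algorithm-based fault tolerance scheme the accelerator uses. As a minimal software sketch, assuming the classic checksum-augmented matrix multiplication often associated with ABFT, the detection idea looks roughly like the following. All function names are illustrative and not taken from the presented platform, and the plain Python product merely stands in for the systolic array.

```python
def matmul(a, b):
    """Plain integer matrix product, standing in for the systolic array."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def with_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append one extra column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def check_product(c_full):
    """Return (bad_rows, bad_cols) whose checksums disagree with the data."""
    data = [row[:-1] for row in c_full[:-1]]
    bad_rows = [i for i, row in enumerate(data) if sum(row) != c_full[i][-1]]
    bad_cols = [j for j, col in enumerate(zip(*data)) if sum(col) != c_full[-1][j]]
    return bad_rows, bad_cols


# Hypothetical 2x2 integer operands (e.g. quantized weights and activations).
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

# Multiplying the augmented operands yields the product plus its checksums.
c_full = matmul(with_column_checksum(a), with_row_checksum(b))
clean = check_product(c_full)
print(clean)  # ([], [])

# Emulate a datapath fault by flipping one bit of a single output element.
c_full[1][0] ^= 1 << 3
faulty = check_product(c_full)
print(faulty)  # ([1], [0])
```

With integer arithmetic the checksums match exactly in the fault-free case, so any mismatch flags an error, and the intersecting row and column point at the corrupted element. The extra row and column are the price of detection, which is one plausible reason to make the checked mode selectable rather than always on.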
On the AMD KCU105 device, the proposed platform reduces system downtime by up to 900× compared to full-device reconfiguration, ensuring rapid fault recovery and enhancing system availability. Additionally, our approach limits the worst-case inference execution overhead to 30%, a significant improvement over traditional methods, which can incur up to 96% overhead.
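For readers who want to relate the two figures above to raw measurements, the following sketch shows how such ratios are commonly computed. The function names and input values are placeholders chosen for illustration only. They are not measurements from this work.

```python
def downtime_reduction(full_reconfig_us, partial_reconfig_us):
    """Factor by which partial reconfiguration shortens recovery downtime."""
    return full_reconfig_us / partial_reconfig_us


def execution_overhead(standard_cycles, testing_cycles):
    """Relative slowdown of testing-mode inference versus standard mode."""
    return (testing_cycles - standard_cycles) / standard_cycles


# Placeholder inputs, not data from the presented platform.
print(downtime_reduction(2_000_000, 4_000))  # 500.0
print(execution_overhead(1_000, 1_250))      # 0.25
```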

Affiliation of author(s)

Politecnico di Torino

Track Reconfiguration

Primary authors

Eleonora Vacca (Politecnico di Torino)
Mr Giorgio Cora (Politecnico di Torino)
Prof. Luca Sterpone (Politecnico di Torino)
