SEFUW: SpacE FPGA Users Workshop, 6th Edition

Name: SEFUW: SpacE FPGA Users Workshop, 6th Edition
Start: 2025-03-25T08:00:00+01:00
End: 2025-03-27T22:50:00+01:00
Location: European Space Research and Technology Centre (ESTEC)

25–27 Mar 2025

Europe/Amsterdam timezone

Draft Agenda published

For information please write to

sefuw@esa.int

Politecnico di Torino - Fast SEU Detection and Recovery in FPGA-Based AI Accelerators

26 Mar 2025, 12:10

25m

Newton 1 and 2 (European Space Research and Technology Centre (ESTEC))

Keplerlaan 1 2201AZ Noordwijk ZH The Netherlands

Oral presentation, Fault Tolerance Methodologies and Tools

Eleonora Vacca (Politecnico di Torino)

The increasing complexity of deep learning models has created demand for high-performance computing platforms that can execute inference tasks efficiently. The flexibility of FPGAs has made them an appealing choice for accelerating such tasks. Recently, there has also been growing interest in RISC-V-based solutions combined with dedicated AI accelerators to enhance computational capabilities. While these platforms meet performance requirements, implementing them in reconfigurable logic for safety-critical applications, such as space missions, introduces reliability challenges. Specifically, Single Event Upsets (SEUs) in the Configuration RAM (CRAM) can alter circuit behavior, potentially leading to system failure. Traditional redundancy-based fault-tolerance techniques are impractical for Deep Neural Network (DNN) accelerators because of limited hardware resources and the inherently high parallelism of their datapaths. Fast detection of radiation-induced faults is therefore critical to prevent mission-compromising consequences, together with an efficient recovery mechanism that minimizes system downtime.
To address these challenges, we propose an FPGA-based heterogeneous computing platform that integrates the NEORV32 RISC-V processor with a systolic-array-based DNN accelerator, implemented on an AMD KCU105 device. Our system features a built-in self-test and self-recovery mechanism that leverages algorithm-based fault tolerance to detect errors in the accelerator’s datapath during neural network execution. By extending the accelerator’s ISA, inference can run in either standard or testing mode, enabling fault detection on the order of a few clock cycles rather than seconds. When a fault is detected, we use the FPGA’s dynamic partial reconfiguration (DPR) capability to reload only the affected section of the bitstream, preserving ongoing computations while restoring the faulty accelerator.
On the AMD KCU105 device, the proposed platform reduces system downtime by up to 900× compared with full-device reconfiguration, ensuring rapid fault recovery and enhancing system availability. Additionally, our approach limits worst-case inference execution overhead to 30%, a significant improvement over traditional methods, which can incur up to 96% overhead.
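The abstract does not say which checksum scheme the algorithm-based fault tolerance uses. As an illustration only, the following is a minimal sketch assuming the classic column-checksum check for matrix multiplication, written in plain Python with integer matrices (as in quantized inference). The names `matmul`, `checked_matmul`, and `faulty_datapath` are hypothetical and are not taken from the presented platform.

```python
from typing import Callable, List, Tuple

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain integer matrix product, standing in for the accelerator datapath."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def checked_matmul(
    a: Matrix, b: Matrix, datapath: Callable[[Matrix, Matrix], Matrix] = matmul
) -> Tuple[Matrix, bool]:
    """Compute a @ b on `datapath` and verify it with column checksums.

    Returns (result, fault_detected). This is an assumed scheme, not the
    authors' implementation.
    """
    # Column sums of A form a 1 x K checksum row.
    checksum_row = [sum(col) for col in zip(*a)]
    # Multiplying the checksum row by B predicts the column sums of C.
    expected = matmul([checksum_row], b)[0]
    c = datapath(a, b)
    observed = [sum(col) for col in zip(*c)]
    # Any mismatch means at least one output element is corrupted.
    return c, observed != expected


def faulty_datapath(a: Matrix, b: Matrix) -> Matrix:
    """Hypothetical fault model: flip one bit in one output element."""
    c = matmul(a, b)
    c[1][0] ^= 1 << 4
    return c


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
clean_result, clean_fault = checked_matmul(a, b)
bad_result, bad_fault = checked_matmul(a, b, datapath=faulty_datapath)
```

In this sketch the check costs one extra vector-matrix product and one pass of column sums per matrix product, which is why checksum approaches are generally much cheaper than duplicating the whole datapath. A detected mismatch would be the trigger for the recovery step described in the abstract.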

Affiliation of author(s)

Politecnico di Torino

Track: Reconfiguration

Eleonora Vacca (Politecnico di Torino), Mr Giorgio Cora (Politecnico di Torino), Prof. Luca Sterpone (Politecnico di Torino)

SEFUW.pptx

SEFUW_Eleonora_Vacca.pdf
