Universidade Federal do Rio Grande do Sul, Brazil Graduate Program in Microelectronics www.ufrgs.br/pgmicro



#### Enhancements on Fault Injection for Xilinx 7 Series and UltraScale+ SRAM-Based FPGAs

Fabio Benevenuti, Fernanda Lima Kastensmidt

5<sup>th</sup> SEFUW (SpacE FPGA Users Workshop), March 2023, ESTEC, Noordwijk, The Netherlands

#### Content



- Reliability and fault injection
- Investigation on Xilinx 7 Series & UltraScale+
- Improvements on UFRGS fault injector

# **Motivation**

- Physical fault injection
  - Real hardware
  - Radioactive sources, particle accelerators, laser
  - Costly facilities, controlled environment
- Emulation-based fault injection
  - Real hardware, but exploiting test & configuration circuitry to manipulate the device and emulate radiation effects
    - JTAG, SelectMAP, PCAP, ICAP, ...
  - Lower cost, no complex facilities
  - May be focused on modules of interest
  - Application running near or at nominal speed in real hardware
- Simulation-based fault injection
  - Hardware or circuit models used to simulate faults
  - Detailed observations
  - Available at early stages of engineering, even before real hardware exists
  - Possibly the lowest cost (mostly software), but also the slowest








#### Reliability & Fault Injection


# Fault space


- Fault location: where
  - User/Application data: BRAM, flip-flops, LUTRAM/shift-registers,...
  - Configuration memory: LUT equation, DSP opmode, INT/PIP switchbox routing,...

- Fault type: emulated effect
  - Single bit-flip (SBU-SEU), multiple bit-flips (MBU-SEU)
- Fault time: when
  - Important for dynamic data (BRAM, flip-flop), subject to temporal masking
  - Also important for CRAM when memory scrubbing is active
  - Less important for persistent data (CRAM) without scrubbing






# Qualification metrics

- Essential and critical bits
  - Captures only a static behavior of the design
  - There are special conditions where a non-essential bit may become a critical bit



(Xilinx, 2012)

# Qualification metrics

- AVF, architectural vulnerability factor
  - May require extensive scan of the whole processing cycle to describe AVF in terms of sensitive surface over residency time
- Mean metrics, single-point statistics
  - Cross-section, Mean time between failure (MTBF): not good for comparison in the presence of mitigations/redundancy (TMR)
  - Mean execution between failure (MEBF), Mean workload between failure (MWBF): not good for comparison when the design runs below 100% duty cycle
- Reliability curves, mission time
  - Captures the dynamic and cumulative effect over time
  - Allows for focusing on the high-reliability zone (experiment truncation/censoring)
  - Allows for extraction of mean metrics
  - Allows cross-validation between radiation and emulated fault injection

## Reliability curves



- Accumulate faults (or fluence) until failure (or until limit of interest)
  - Collect a number N of events, recording time t, fluence  $\Phi$  or number of accumulated faults for each event
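The accumulate-until-failure procedure above can be sketched as an empirical reliability curve. This is a minimal sketch, not the authors' tooling: the function name `reliability_curve` and the convention of recording `None` for a run that reached the limit of interest without failing are assumptions made for illustration.

```python
from bisect import bisect_right


def reliability_curve(faults_to_failure):
    """Return R(f), the fraction of runs still working after f accumulated faults.

    faults_to_failure holds one entry per run: the accumulated fault count at
    which the run failed, or None when the run was stopped at the limit of
    interest without failing (censored run, an assumed convention).
    """
    n = len(faults_to_failure)
    failures = sorted(f for f in faults_to_failure if f is not None)

    def r(f):
        # Runs that failed at or before f faults no longer count as reliable.
        failed = bisect_right(failures, f)
        return (n - failed) / n

    return r


# Hypothetical campaign of five runs; the last one never failed.
runs = [3, 10, 10, 25, None]
R = reliability_curve(runs)
for f in (0, 3, 10, 25):
    print(f, R(f))
```

The same function applies to fluence or time instead of fault counts, as long as every run uses the same unit. With censored runs the curve is only meaningful below the limit of interest.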



#### Fault injectors for Xilinx FPGAs



#### Not an exhaustive list

|                        | Fault Injector                          | Data manipulation                               | Target device    | Access<br>interface |
|------------------------|-----------------------------------------|-------------------------------------------------|------------------|---------------------|
| Cumulative,<br>R curve | Antoni et al. (2000)                    | Modifies bitstream file before loading in FPGA. | Xilinx Virtex    | JTAG/<br>MultiLINX  |
|                        | Johnson et al. (2003)                   | Bitstream may be 1 single frame.                | Xilinx Virtex    | JTAG/<br>MultiLINX  |
|                        | Alderighi et al. (2007)<br>FLIPPER      |                                                 | Xilinx Virtex-II | SelectMAP/<br>JTAG  |
|                        | Napoles et al. (2007)<br>FT-UNSHADES    |                                                 | Xilinx Virtex-II | SelectMAP           |
|                        | Mogollon et al. (2011)<br>FT-UNSHADES2  | Uses one device/FPGA to manipulate bitstream    | Xilinx Virtex-5  | SelectMAP           |
|                        | Alderighi et al. (2014)<br>FLIPPER2     | inside the target FPGA.                         | Xilinx Virtex-4  | SelectMAP/<br>JTAG  |
|                        | Harward et al. (2015)<br>BYU XRTC-V5FI  |                                                 | Xilinx Virtex-5  | SelectMAP           |
|                        | Thurlow et al. (2019)<br>BYU TURTLE     |                                                 | Xilinx 7 Series  | JTAG                |

#### Fault injectors for Xilinx FPGAs



#### Not an exhaustive list

|               | Fault Injector                  | Data manipulation                                                                  | Target device                    | Access<br>interface |
|---------------|---------------------------------|------------------------------------------------------------------------------------|----------------------------------|---------------------|
| Sampling for  | Sterpone et al. (2007)          | Instrumented design, modifies bitstream from within                                | Xilinx Virtex-II                 | ICAP                |
|               | Nazar et al. (2012)<br>UFRGS    | FPGA.                                                                              | Xilinx Virtex-5                  | ICAP                |
|               | Tarrillo et al. (2015)<br>UFRGS | Replays SEUs database collected from radiation<br>experiment.                      | Xilinx Virtex-5                  | ICAP                |
|               | Leipnitz et al. (2016)<br>UFRGS | High speed fault injection controller through PCI-Express interface.               | Xilinx Virtex-5                  | ICAP                |
| MBU           | Nunes et al. (2015)<br>FIRED    | Instrumented design, modifies bitstream from within<br>FPGA.                       | Xilinx Virtex-5                  | ICAP                |
| Complete list | Villalta et al. (2014)          | Modifies bitstream from Arm Cortex-A software<br>using processor access port.      | Xilinx Zynq-<br>7000             | PCAP                |
|               | Tonfat et al. (2016)<br>UFRGS   | Exhaustive scan for critical bits.                                                 | Xilinx Artix-7                   | ICAP                |
|               | Gomez-Cornejo et al. (2017)     | Modifies BRAM content from Arm Cortex-A<br>software using processor access port.   | Xilinx Zynq-<br>7000             | PCAP                |
| Cumulative,   | Bozzoli et al. (2018)<br>PyXEL  | Produces bitstream file variants emulating SEU<br>effect.                          | Xilinx 7 Series                  | SelectMAP/<br>JTAG  |
| R curve       | This work<br>UFRGS              | Cumulative fault injection, SBU+MBU, coexistence<br>with scrubbing.                | Xilinx 7 Series<br>& UltraScale+ | ICAP                |



#### Understanding Xilinx 7 Series & UltraScale+


#### Investigation on Xilinx FPGAs










## Floorplan reachable by fault injector



Example Zynq-7000 XC7Z030



# Floorplan reachable by fault injector







Two ROIs, less than 0.02 mm<sup>2</sup>

9.1 mm



- Laser energies 300 pJ and 220 pJ
- One laser shot every ~4 s, one readback every ~16 s
- Step ~1 μm horizontal, 5 μm vertical



- Each color a different readback file
  - Run 1
  - Run 2
- Time/position of artifacts

Die *x* dimension (column/framewise)














#### 7 Series static tests





Figure panels: CRAM 1-1-1, CRAM 2-2-2, BRAM 1-1-1, BRAM 2-1-2, CRAM 2-1-2



| Type of memory | Type of SEU | α particles | Heavy ions | Neutrons, 14 MeV 0° | Neutrons, 14 MeV 180° | Neutrons, (epi)thermal |
|----------------|-------------|-------------|------------|---------------------|-----------------------|------------------------|
| BRAM           | SBU 1-1-1   | 100.0%      | 82.0%      | 93.4%               | 97.1%                 | 95.4%                  |
|                | MBU 2-1-2   | -           | 16.2%      | 4.7%                | 2.9%                  | -                      |
|                | MBU 1-2-2   | -           | -          | -                   | -                     | 4.5%                   |
|                | Others      | -           | 1.8%       | 1.9%                | -                     | 0.1%                   |
| CRAM           | SBU 1-1-1   | 97.6%       | 38.1%      | 76.7%               | 79.9%                 | 78.1%                  |
|                | MBU 2-2-2   | 2.4%        | 41.9%      | 16.9%               | 15.5%                 | 0.0%                   |
|                | MBU 2-1-2   | 0.0%        | 4.4%       | 3.5%                | 2.1%                  | 0.0%                   |
|                | MBU 1-2-2   | -           | -          | 0.3%                | 1.5%                  | 17.8%                  |
|                | MBU 2-2-3   | -           | 3.0%       | 1.3%                | 0.5%                  | 0.0%                   |
|                | MBU 2-2-4   | -           | 0.2%       | -                   | 0.5%                  | -                      |
|                | MBU 2-3-4   | -           | 8.3%       | 0.6%                | -                     | 0.0%                   |
|                | MBU 2-3-5   | -           | 0.6%       | 0.3%                | -                     | 0.0%                   |
|                | Others      | -           | 3.4%       | 0.3%                | -                     | 4.1%                   |

Slide annotation on the CRAM MBU rows: challenge to scrubber.

## UltraScale+ (16 nm FinFET)



- Main changes in floorplan
  - Same number of LUTs and flip-flops in a CLB, but now all in one slice instead of two
  - Floorplan simplification
  - Switch box columns are now independent from CLB/DSP/BRAM logic
  - The concept of TOP and BOTTOM rows no longer exists
  - New layout of BRAM bits (sliced in 256 frames instead of 128)
- Since UltraScale (20 nm planar)
  - More interleaving on CRAM: aggressive reduction of intraframe MBU
    - Positive impact on scrubbing

Figure labels: UltraScale+, CLB, CLB, Switch box





#### Enhancements on UFRGS Fault Injector




- Emulate cumulative effect of SEUs
  - Build a reliability curve (CDF) similar to the one obtained from radiation
- Accelerate fault injection campaign






Test results for two implementations of *study-case CNN* for aerial image classification (RADECS 2021)






- An estimate of the number of critical bits can still be obtained with the random-accumulated methodology
  - R(t=1) ~ Nazar et al. (2012) random sampling
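One way to read the relation above: the first fault of each accumulated run is a single random fault, so the share of runs that fail on their very first fault estimates the share of critical bits in the injection area. The sketch below makes that arithmetic explicit. It assumes uniformly random fault locations, and the function name and sample data are hypothetical.

```python
def estimate_critical_bits(faults_to_failure, total_bits):
    """Estimate the critical-bit fraction and count from accumulated runs.

    faults_to_failure: accumulated fault count at failure, one entry per run.
    total_bits: number of bits in the fault injection area.
    Assumes each run's first fault hits a uniformly random bit of that area.
    """
    n = len(faults_to_failure)
    failed_on_first = sum(1 for f in faults_to_failure if f == 1)
    fraction = failed_on_first / n           # equals 1 - R(1)
    count = round(failed_on_first * total_bits / n)
    return fraction, count


# Hypothetical data: 3 of 10 runs failed on the first injected fault.
fraction, count = estimate_critical_bits([1, 4, 1, 9, 2, 7, 1, 3, 12, 5], 1000)
print(fraction, count)
```

The estimate is only as good as the number of runs: a campaign sized for a reliability curve typically has far fewer first-fault samples than a dedicated random sampling campaign.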




Test results for different blocks of a study-case MLP (SBCCI 2018)




# Asynchronous Fault Injection

- Inject faults in designs while the Xilinx native scrubber (FRAME\_ECC) is active
- Fault may be injected <u>during</u> the DUT processing cycle
  - And scrubbed at any point in time of the processing cycle




Test results for two implementations of *study-case CNN* for traffic sign classification (RADECS 2019, JICS 2021)

Figure panels: mission time for $R(f) \ge 90\%$ from fault injection, and mission time for $R(\Phi) \ge 90\%$ from heavy ions, for implementations using the Xilinx scrubber. Axes: accumulated faults injected and design variant (Float, Q16, Q10).


Test results for experimental design implementing a softcore microprocessor with different levels of fault mitigation (TNS 2019)



Figure panels: dynamic cross section from heavy ions, and error rate from fault injection (axis label: failure $\tau_{failure}$, faults injected).





- All bit-flips of an MBU must be seen as a single SEU
  - Instrument a clock gate for the design under test
- The scrubber must not correct a partially injected SEU
  - Suspend the scrubber during MBU injection (FPGA control registers)
- A table of the most frequent MBU geometries was embedded into the fault injection module
  - Single command, faster communication
- It is up to the fault injection campaign scripting to decide the ratio of SBU and MBU, and their geometries, to emulate the targeted radiation environment
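The points above can be sketched with a toy model. Everything here is hypothetical and for illustration only: `ToyDevice` is an in-memory stand-in rather than an ICAP interface, the entries of `GEOMETRIES` are not the geometries embedded in the UFRGS injector, and the weights in `PROFILE` are made up. A real campaign would derive the weights from a characterization of the targeted environment.

```python
import random
from contextlib import contextmanager

# Hypothetical shapes: (frame, bit) offsets relative to the first upset bit.
GEOMETRIES = {
    "SBU": [(0, 0)],
    "MBU_2_frames": [(0, 0), (1, 0)],
    "MBU_2x2": [(0, 0), (0, 1), (1, 0), (1, 1)],
}
# Hypothetical SBU/MBU ratio chosen by the campaign script.
PROFILE = {"SBU": 0.6, "MBU_2_frames": 0.3, "MBU_2x2": 0.1}


class ToyDevice:
    """In-memory stand-in for the FPGA configuration memory."""

    def __init__(self):
        self.flipped = set()
        self.clock_enabled = True
        self.scrubber_enabled = True

    @contextmanager
    def frozen(self):
        # Gate the DUT clock and suspend the scrubber, then restore both.
        self.clock_enabled = False
        self.scrubber_enabled = False
        try:
            yield
        finally:
            self.scrubber_enabled = True
            self.clock_enabled = True

    def flip(self, frame, bit):
        # A partially injected MBU must not be visible to DUT or scrubber.
        if self.clock_enabled or self.scrubber_enabled:
            raise RuntimeError("partial SEU would be visible")
        self.flipped ^= {(frame, bit)}


def inject_event(device, shape, frame, bit):
    """Inject every bit-flip of one SEU while the device is frozen."""
    with device.frozen():
        for d_frame, d_bit in GEOMETRIES[shape]:
            device.flip(frame + d_frame, bit + d_bit)


def run_campaign(device, n_events, n_frames, bits_per_frame, seed=0):
    """Draw shapes according to PROFILE and inject them at random locations."""
    rng = random.Random(seed)
    shapes = list(PROFILE)
    weights = [PROFILE[s] for s in shapes]
    history = []
    for _ in range(n_events):
        shape = rng.choices(shapes, weights=weights, k=1)[0]
        frame = rng.randrange(n_frames - 1)      # leave room for +1 offsets
        bit = rng.randrange(bits_per_frame - 1)
        inject_event(device, shape, frame, bit)
        history.append((shape, frame, bit))
    return history
```

Keeping the geometry table next to the injection routine mirrors the single-command idea: the script only names a shape and a location, and all bit-flips of that shape are applied in one frozen window.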



- Test results for experimental design of *matrix multiplication generated by high-level synthesis (HLS)*



$$R_{Weibull,\alpha,\beta}(t) = e^{-\left(\frac{t}{\alpha}\right)^{\beta}} \qquad MT_{\{R(t) \ge r\}} = \alpha \left(-\ln r\right)^{\frac{1}{\beta}}$$
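The two expressions above translate directly into code. The sketch below evaluates the Weibull reliability and the mission time for a target reliability $r$. The parameter values in the usage lines are hypothetical, not fitted values from the experiments.

```python
import math


def weibull_reliability(t, alpha, beta):
    """R(t) = exp(-(t / alpha) ** beta), with scale alpha and shape beta."""
    return math.exp(-((t / alpha) ** beta))


def mission_time(r, alpha, beta):
    """Largest t with R(t) >= r, i.e. alpha * (-ln r) ** (1 / beta)."""
    return alpha * (-math.log(r)) ** (1.0 / beta)


# Hypothetical fit: alpha = 100 accumulated faults, beta = 1.5.
mt90 = mission_time(0.90, alpha=100.0, beta=1.5)
print(mt90, weibull_reliability(mt90, alpha=100.0, beta=1.5))
```

Substituting the mission time back into $R(t)$ returns the target reliability, which is a quick sanity check on any fitted pair $(\alpha, \beta)$.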


 Test results for experimental design implementing a softcore microprocessor with different levels of fault mitigation (TNS 2019)



Figure panels: heavy ions <sup>16</sup>O, and fault injection with single bit-flips. Axes: reliability (0% to 100%) versus accumulated faults injected (F/kf, 1 to 1000). Series: Unmit. No Scrub., Unmit. Scrub., CGTMR No Scrub., CGTMR Scrub., FGDTMR No Scrub., FGDTMR Scrub.

Ratio between radiation and fault injection with and without MBU

# Fault injection on flip-flops

- Reuse the mechanisms for capture, readback, partial reconfiguration mask and reset after reconfiguration (RAR)
- 7 Series only, RAR is different for UltraScale+
- Can be used concurrently with the CRAM fault injection



- Test results for experimental design of *softcore* MicroBlaze and miniMIPS microprocessors onboard NanoSatC-BR2 cubesat payload (RAW 2019)
  - Software hardened by SIHFT techniques
  - Faults injected selectively on CLB flip-flops only



## Port of fault injector to UltraScale+




#### Main changes

- Different number of bits inside a frame
- Switch box columns addressed independently
- New semantics for some FPGA registers
- Minor changes on ICAP hardware block interface
  - Better coordination among multiple ICAP instances






Test results for experimental design of *matrix multiplication generated by high-level synthesis (HLS)*

| Matrix multiplication benchmark application | 7 Series, Zynq-7000 SoC (Z030)         | UltraScale+, Zynq MPSoC (ZU3EG)       |
|---------------------------------------------|----------------------------------------|---------------------------------------|
| Clock                                       | 100 MHz                                | 100 MHz                               |
| WNS (FGDTMR)                                | 0.7 ns                                 | 3.1 ns                                |
| Unhardened: LUT                             | 487                                    | 430                                   |
| Unhardened: FF                              | 564                                    | 564                                   |
| Unhardened: CARRY4 / CARRY8                 | 59                                     | 28                                    |
| Unhardened: BRAM                            | 3                                      | 3                                     |
| Unhardened: DSP                             | 7                                      | 7                                     |
| CGTMR: LUT                                  | 1691                                   | 1536                                  |
| CGTMR: FF                                   | 1692                                   | 1692                                  |
| CGTMR: CARRY4 / CARRY8                      | 177                                    | 75                                    |
| CGTMR: BRAM                                 | 9                                      | 9                                     |
| CGTMR: DSP                                  | 21                                     | 21                                    |
| FGDTMR: LUT                                 | 8383                                   | 8381                                  |
| FGDTMR: FF                                  | 3276                                   | 3276                                  |
| FGDTMR: CARRY4 / CARRY8                     | 132                                    | 66                                    |
| FGDTMR: BRAM                                | 9                                      | 9                                     |
| FGDTMR: DSP                                 | 18                                     | 18                                    |
| FI pblock                                   | 1166 frames x 101 words = 3768512 bits | 1330 frames x 93 words = 3958080 bits |
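The bit counts of the fault injection pblock in the table follow from the frame geometry, with 32-bit configuration words. The sketch below reproduces that arithmetic and maps a linear bit index back to a position inside the pblock. The helper names are hypothetical, and the positions are relative indexes, not real frame addresses.

```python
WORD_BITS = 32  # configuration frame words are 32 bits wide


def pblock_bits(frames, words_per_frame, word_bits=WORD_BITS):
    """Total number of configuration bits covered by the pblock."""
    return frames * words_per_frame * word_bits


def locate(bit_index, words_per_frame, word_bits=WORD_BITS):
    """Map a linear bit index to a relative (frame, word, bit) position."""
    frame, rest = divmod(bit_index, words_per_frame * word_bits)
    word, bit = divmod(rest, word_bits)
    return frame, word, bit


print(pblock_bits(1166, 101))   # 7 Series pblock from the table
print(pblock_bits(1330, 93))    # UltraScale+ pblock from the table
print(locate(3768511, 101))     # last bit of the 7 Series pblock
```

Drawing a uniformly random integer below `pblock_bits(...)` and passing it through `locate` is one simple way to pick a random fault location over the whole injection area.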


# Final discussion



- Legacy features of UFRGS fault injector ported successfully to UltraScale+
- Tighter integration of fault injector with clock gating and FPGA control registers allowed coexistence of scrubbing and MBU emulation
- Fault injector support for MBU improved fidelity to radiation results
  - MBU breaks scrubbing
  - MBU breaks fine-grained distributed TMR
  - Without MBU, the fault injector is overly optimistic
- Fault injector operates with a general interface
  - It is up to the campaign planning and scripting to emulate the MBU profile of the targeted environment

#### Future work

- Open to new experiments:
  - UltraScale+ new reset-after-reconfiguration (RAR, PR) policies
  - UltraScale+ softcore-only scrubber (SEM IP)
- Port fault injector to Xilinx Versal ACAP
  - Keep up with new product family and technology (FinFET 7 nm)



## Thank you for your attention!

Contact details

Name: Fernanda Lima Kastensmidt, Head of Fault Tolerance & Reliability Team

Affiliation: Universidade Federal do Rio Grande do Sul (UFRGS)

- Email: fglima@inf.ufrgs.br
- Link: www.inf.ufrgs.br/~fglima