Modern programmable SoCs, such as the Xilinx Zynq-7000 APSoC FPGAs, provide an attractive COTS platform for building high-performance, miniaturized systems in space avionics. However, since SRAM FPGAs are vulnerable to radiation-induced effects, fault tolerance techniques must be developed to support their adoption in critical applications. Mitigation approaches for single-event effects (SEEs) usually combine redundancy techniques, e.g. triple modular redundancy (TMR), with memory scrubbing to correct upsets in the configuration memory.
We present a configuration memory scrubbing approach based on a two-dimensional (2D) Error Detection and Correction (EDC) coding scheme that combines: a) the Xilinx embedded (internal) frame-level ECC code (vertical direction) and b) an (external) interframe, interleaved parity code (horizontal direction). The internal ECC detects all single-bit upsets (SBUs) and the vast majority of multiple-bit upsets (MBUs) per frame, but error correction is guaranteed only for SBUs. The internal ECC mechanism, based on the built-in Xilinx 7-series Readback CRC, achieves fast error detection without extra cost. The 2D coding scheme, on the other hand, guarantees the detection and correction of all SBUs and the vast majority of MBUs. The proposed scheme eliminates the need to store the golden bitstream externally; only the parity bits need to be stored in a rad-hard memory.
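The horizontal (interframe) parity idea can be illustrated with a minimal Python sketch. It is not the implementation described here: it assumes plain word-wise XOR parity over a group of frames, models a frame as a short list of 32-bit words, and uses hypothetical names (`xor_frames`, `compute_parity`, `repair_frame`). It also assumes that the frame-level ECC has already identified which single frame in the group is corrupted.

```python
from functools import reduce
from typing import List

Frame = List[int]  # toy model: a frame is a list of 32-bit words


def xor_frames(frames: List[Frame]) -> Frame:
    """Word-wise XOR of equally sized frames."""
    return [reduce(lambda a, b: a ^ b, words) for words in zip(*frames)]


def compute_parity(group: List[Frame]) -> Frame:
    """Parity frame for one group (assumption: simple XOR parity)."""
    return xor_frames(group)


def repair_frame(group: List[Frame], parity: Frame, bad_index: int) -> Frame:
    """Rebuild the frame at bad_index from the stored parity and the
    remaining, error-free frames of the same group."""
    others = [f for i, f in enumerate(group) if i != bad_index]
    return xor_frames(others + [parity])
```

In this sketch only the parity frame is kept outside the device. A frame flagged by the ECC is rebuilt from its group peers, however many of its bits were flipped, which is why no golden copy of the frame is needed.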
To evaluate our approach, we carried out a radiation experiment in collaboration with ESA-ESTEC at the CERN SPS North Area in November 2018, using ultra-high-energy heavy ions. The test was performed at an energy of 150A GeV/c for two LETs (8.8 and 12.45 MeV·cm^2/mg) and a total effective fluence of more than 10^6 ions/cm^2. The outcomes of the radiation experiment were two-fold: first, the proposed scrubbing scheme achieved 100% error correction coverage of the single and multiple upsets observed; second, the offline analysis of the results yielded useful insights into the topology of the multiple cell upsets (MCUs), which guided us in fine-tuning the 2D ECC algorithm. The configuration frames of the Zynq-7000 FPGA are divided into parity clusters, both to enable the correction of MCUs that extend into adjacent frames, as observed in the experiment, and to reduce the storage requirements of the parity data.
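How clustering and interleaving interact can be sketched as follows. This is an illustration under assumed parameters, not the tuned algorithm: the cluster size, the interleaving depth, the modulo-based mapping, and the names `parity_group`, `recoverable` and `parity_overhead` are all hypothetical. The sketch further assumes that each parity group can rebuild at most one corrupted frame.

```python
from typing import Iterable, Tuple


def parity_group(frame_index: int, cluster_size: int, depth: int) -> Tuple[int, int]:
    """Map a frame to (cluster, lane). Adjacent frames in a cluster land
    in different lanes, so they are protected by different parity frames."""
    cluster = frame_index // cluster_size
    lane = (frame_index % cluster_size) % depth
    return cluster, lane


def recoverable(upset_frames: Iterable[int], cluster_size: int, depth: int) -> bool:
    """True if no two corrupted frames share a parity group
    (assumption: one rebuildable frame per group)."""
    groups = [parity_group(f, cluster_size, depth) for f in upset_frames]
    return len(groups) == len(set(groups))


def parity_overhead(cluster_size: int, depth: int) -> float:
    """Parity frames stored per configuration frame (depth <= cluster_size)."""
    return depth / cluster_size
```

With these assumed parameters, an upset spanning up to `depth` adjacent frames touches each parity group at most once, while a larger `cluster_size` lowers the amount of parity data to be stored. Choosing the two values is a trade-off between coverage of wide MCUs and storage.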
We have implemented the proposed approach using an external board in the role of external scrubber (for prototyping purposes we use a Zybo board, but we plan to migrate to a radiation-tolerant microcontroller). The external scrubber communicates with the on-chip logic (ECC mechanism) through the JTAG port. On the scrubber, the ARM SoC runs the error correction algorithm and various configuration memory access functions provided in a software library, a flash memory stores the parity data, and the FPGA fabric implements the low-level JTAG functions. This solution combines the hardware speed and popularity of the JTAG interface with the software versatility provided by the embedded processor.
Moreover, we aim to implement the proposed approach as a full-hardware solution by integrating the 2D error correction algorithm on-chip, in the reconfigurable logic. This solution will reduce the error correction latency, thereby improving system availability, and will provide a self-healing system that eliminates the need for an external controller.