EDHPC 2023 European Data Handling & Data Processing Conference for Space 2 - 6 October 2023 | Juan-Les-Pins | France



# High-Level Synthesis (HLS)-Based On-board Payload Data Processing considering the Roofline Model

Seungah Lee, Ruben Salvador, Angeliki Kritikakou, Olivier Sentieys, Julien Galizzi, and Emmanuel Casseau











### Payload data processing



[1] Gaia Collaboration, "The Gaia mission," A&A, vol. 595, p. A1, Nov. 2016.







[1] Gaia Collaboration, "The Gaia mission," A&A, vol. 595, p. A1, Nov. 2016.





### Current data processing hardware

- Processing hardware
  - Frontgrade LEON processors and Microsemi FPGA

| Mission       | Launch year | Payload data processing hardware                     | Reference |
|---------------|-------------|------------------------------------------------------|-----------|
|               |             | GR712RC LEON3FT ASIC,                                |           |
| Plato         | 2026*       | 26* MDPA LEON2FT ASIC,                               |           |
|               |             | LEON3FT with RTAX2000 FPGA                           |           |
| JUICE         | 2023        | GR712RC LEON3FT ASIC                                 | [3]       |
| Solar Orbiter | 2020        | LEON3FT with RTAX4000 FPGA                           | [4]       |
| Cheops        | 2019        | GR712RC LEON3FT ASIC                                 | [5]       |
| BepiColombo   | 2018        | HIREC-MIPS-HR5000 CPU,<br>RTAX2000 FPGA              | [6]       |
| Gaia          | 2013        | Custom pre-processing board,<br>SCS750 PowerPC board | [7]       |

\*Estimated launch schedule



### Future data processing hardware



#### Research focus: FPGA design based on High-Level Synthesis (HLS)



### Space science algorithms

|                                          | Classification          | Sub-classification      | Nb. of users |
|------------------------------------------|-------------------------|-------------------------|--------------|
|                                          | Fourier transform       | FFT, IFFT, DFT          | 5            |
| l'Observatoire                           | Filter                  | IIR                     | 4            |
| de Paris                                 |                         | CIC                     | 4            |
| Survey results from                      |                         | Kalman                  | 1            |
| payload teams                            | Comprossion             | CCSDS 121               | 3            |
| Irène Joliot-Curie                       |                         | CCSDS 122               | 3            |
| Laboratoire de Physique<br>des 2 Infinis | Compression             | CCSDS 123               | 3            |
| des 2 Infinis                            |                         | CCSDS 124               | 3            |
|                                          | Optimization            | Interpolation           | 3            |
|                                          |                         | Fitting and correlation | 2            |
| CNIS                                     |                         | Gradient descent        | 2            |
| LPC2E                                    | Histogram               |                         | 1            |
|                                          | Digital Elevation Model |                         | 1            |

Selected key algorithm: 2-Dimensional Fast Fourier Transform (2-D FFT)



# **Optimization methodology**

- High-Level Synthesis (HLS)-based hardware acceleration architectures
  - Combination of Design Space Exploration (DSE) and the roofline model [8]



[8] M. Siracusa et al., "A Comprehensive Methodology to Optimize FPGA Designs via the Roofline Model," IEEE Trans Comput, vol. 71, no. 8, pp. 1903–1915, Aug. 2022

EDHPC 2023: High-Level Synthesis(HLS)-Based On-board Payload Data Processing considering the Roofline Model, Lee et al,



# High-Level Synthesis (HLS)

- HLS: refactor C/C++ using pragmas and directives
- Tools: Vitis HLS, Catapult HLS, Bambu HLS...
- Advantages
  - Generate architectures faster and efficiently
  - Adaptive to missions and design reviews
  - For payload teams with limited HDL experts

| Optimization type  | Vitis HLS pragmas           |
|--------------------|-----------------------------|
| Loop unrolling     | #pragma HLS unroll          |
| Loop pipelining    | #pragma HLS pipeline        |
| Task pipelining    | #pragma HLS dataflow        |
| Array partitioning | #pragma HLS array_partition |





### Roofline performance model

- Characterize designs using the limits of bandwidth and performance on a given architecture [9].
- Originally appropriate to multicore CPUs and GPUs, but extended to FPGAs.
- The FPGA roofline model shall consider the reconfigurable characteristics of FPGA [10].





### Use case: 2-D FFT

• Algorithm requirements

| Item           | Description                            | Note                      |  |
|----------------|----------------------------------------|---------------------------|--|
| Algorithm      | 2-D FFT                                | SVOM ECLAIRs payload [11] |  |
| FFT size       | 200 × 200 (256 × 256 )                 |                           |  |
| Data type      | Single-precision floating point (FP32) |                           |  |
| Execution time | < 100 ms                               |                           |  |

• FPGA platform specifications

| Item                                      | Description                                                  | Note                                               |
|-------------------------------------------|--------------------------------------------------------------|----------------------------------------------------|
| Digital Signal Processing<br>(DSP) blocks | 2520 blocks (DSP48E2)<br>Multiplier input: 27 $	imes$ 18 bit | Xilinx Zynq UltraScale+<br>(XCZU9EG) embedded FPGA |
| Block RAM (BRAM)                          | 32.1 Mb                                                      |                                                    |
| Interface                                 | Advanced eXtensible Interface (AXI)                          |                                                    |



### Theoretical FPGA roofline model

#### • Computational ceilings (C)

 $C = \frac{available \, DSP \, blocks \times clock \, frequency}{required \, DSP \, blocks \, per \, operation}$ 

- DSP blocks per operation (DSP48E2)
  - 27-bit data: 1 DSP block per multiplication
  - FP32 data: 3 DSP blocks per FP multiplication
- I/O Bandwidth ceiling (BW)

 $BW = min(BW_{PS_DDR4}, BW_{AXI})$   $BW_{PS_DDR4} = DDR4 \ transfer \ rate \times DDR4 \ width$  $BW_{AXI} = clk \ freq. \times transfer \ data \ bitwidth \times ports$ 







### Theoretical FPGA roofline model





EDHPC 2023: High-Level Synthesis(HLS)-Based On-board Payload Data Processing considering the Roofline Model, Lee et al, 2-6.10.2023, Juan-les-Pins, France

### Theoretical FPGA roofline model

- Computational ceilings (C)
  - = available DSP blocks × clock frequency
    - required DSP blocks per operation
- DSP blocks per operation (DSP48E2)
  - 27-bit data: 1 DSP block per multiplication
  - FP32 data: 3 DSP blocks per FP multiplication



 $BW = min(BW_{PS_DDR4}, BW_{AXI})$   $BW_{PS_DDR4} = DDR4 \ transfer \ rate \times DDR4 \ width$  $BW_{AXI} = clk \ freq. \times transfer \ data \ bitwidth \times ports$ 







# Parallelism and pipelining

- AMD-Xilinx open-source DSP library
  - HLS-based 2-D FFT
- Parallelism and pipelining
  - Loop unrolling combined with array partitioning
  - Loop pipelining and task pipelining
- Optimization (FP32)
  - Reference code: DSP blocks utilization > 100 %
  - Modification: loop pipelining (#pragmas HLS pipeline)
  - Reduction of resource utilization

| C synthesis | C/RTL         | PTL cynthosis     |
|-------------|---------------|-------------------|
|             | Co-simulation | <br>RTL synthesis |

| FFT size   | 64 x 64    |         | 256x256    |         |
|------------|------------|---------|------------|---------|
| Data type  | 27-bit FxP | FP32    | 27-bit FxP | FP32    |
| DSP blocks | 196        | 928     | 308        | 1276    |
| DSP DIOCKS | (7.8%)     | (37%)   | (12%)      | (51%)   |
|            | 48         | 32      | 618        | 528     |
| BRAM       | (2.6%)     | (1.8%)  | (34%)      | (29%)   |
| LUT        | 58843      | 109870  | 89508      | 157710  |
|            | (21%)      | (40%)   | (33%)      | (58%)   |
| FF         | 82129      | 199467  | 103394     | 262969  |
|            | (15%)      | (36%)   | (19%)      | (51%)   |
| Frequency  | 225 MHz    | 227 MHz | 224 MHz    | 217 MHz |



## FPGA roofline model for the 2-D FFT



nría

EDHPC 2023: High-Level Synthesis(HLS)-Based On-board Payload Data Processing considering the Roofline Model, Lee et al,

2-6.10.2023, Juan-les-Pins, France

### FPGA roofline model for the 2-D FFT





EDHPC 2023: High-Level Synthesis(HLS)-Based On-board Payload Data Processing considering the Roofline Model, Lee et al, 2-6.10.2023, Juan-les-Pins, France



FPGA roofline model for the 2-D FFT

Computational intensity (ops/Byte)

Time requirement: < 100 ms

nnía

EDHPC 2023: High-Level Synthesis(HLS)-Based On-board Payload Data Processing considering the Roofline Model, Lee et al,

2-6.10.2023, Juan-les-Pins, France

### Conclusion

- Analysis of current and future on-board data processing hardware
- Survey on payload data processing algorithms
- Design methodology considering the roofline model with HLS-based DSE
- The 2-D FFT case study: 30-65 faster processing than the algorithm requirement
- On-going work: HW/SW benchmark and roofline model for Zynq UltraScale+ (Arm CPU&FPGA)

Thank you for your attention! Main speaker: **Seungah Lee** (<u>seungah.lee@irisa.fr</u>)

Acknowledgement: this research was supported by CNES and University of Rennes thanks to PhD funding granted to Seungah Lee.



### References

[1] Gaia Collaboration, "The Gaia mission," A&A, vol. 595, p. A1, Nov. 2016.

[2] P. Plasson, G. Brusq, F. Singhoff, H. N. Tran, S. Rubini, and P. Dissaux, "PLATO N-DPU on-board software: an ideal candidate for multicore scheduling analysis," in 11th European Congress ERTSS Embedded Real Time Software and System, Toulouse, France, 2022.

[3] F. Torelli, "Common DPU and Basic Software for JUICE Instruments," in European Workshop on On-Board Data Processing (OBDP2019), European Space Research and Technology Centre (ESTEC), Feb. 2019.

[4] M. Maksimovic et al., "The Solar Orbiter Radio and Plasma Waves (RPW) instrument," Astronomy & Astrophysics, vol. 642, p. A12, Oct. 2020.

[5] W. Benz et al., "The CHEOPS mission," Experimental Astronomy, vol. 51, no. 1, pp. 109–151, Feb. 2021.

[6] Y. Kasaba et al., "Mission Data Processor Aboard the BepiColombo Mio Spacecraft: Design and Scientific Operation Concept," Space Science Reviews, vol. 216, no. 3, p. 34, Apr. 2020.

[7] J. H. J. d. Bruijne, M. Allen, S. Azaz, A. Krone-Martins, T. Prod'homme, and D. Hestroffer, "Detecting stars, galaxies, and asteroids with Gaia," Astronomy & Astrophysics, vol. 576, p. A74, Apr. 2015.

[8] M. Siracusa et al., "A Comprehensive Methodology to Optimize FPGA Designs via the Roofline Model," IEEE Trans Comput, vol. 71, no. 8, pp. 1903–1915, Aug. 2022.

[9] S. Williams, A. Waterman, and D. Patterson, "Roofline: an insightful visual performance model for multicore architectures," Communications of the ACM, vol. 52, no. 4, pp. 65–76, Apr. 2009.

[10] B. da Silva Gomes, A. Braeken, E. D'Hollander, and A. Touhafi, "Performance modeling for FPGAs: extending the roofline model with high-level synthesis tools," International Journal of Reconfigurable Computing, vol. 2013, pp. 1–10, 2013.

[11] S. Schanne et al., "A Scientific Trigger Unit for space-based real-time gamma ray burst detection I - Scientific software model and simulations," IEEE Nuclear Science Symposium and Medical Imaging Conference (2013 NSS/MIC), Oct. 2013, pp. 1–5



# Appendix



### Roofline performance model

- Characterize designs using the limits of bandwidth and performance on a given architecture [9].
- Originally appropriate to multicore CPUs and GPUs, but extended to FPGAs.
- The FPGA roofline model shall consider the reconfigurable characteristics of FPGA [10].



Computational intensity (ops/Byte)

