



# FPGA-Based Multi-Threading for On-board Processing

Pasquale Lombardi (Syderal), Andrea Guerrieri (EPFL), Bilel Belhadj (Syderal)

9-11 of April 2018





### Content

- Spacecraft on-board computing trend and needs
- Introduction
- Implementation
- Proof of concept test case


# Spacecraft On-board Computing Trend

- Big data challenge
  - Increasing data volume
  - Requiring data reduction before downloading
    - On-board selection of useful data



Credit: Attunity.com



- Miniaturization allows for smaller satellites
  - Similar observational capabilities
  - Lower energy/power capacity
  - Lower download bandwidth






# Spacecraft On-board Key Computing Needs

- Scalable and energy efficient





Credit : Helios Touch



Credit : mytypewriter.com

- Fast adoption of new technologies




### FBMT basic elements and their relevance

- FPGA (Field Programmable Gate Array) is an attractive technology for space missions
  - Unbeatable flexibility
  - Best performance-to-power ratio
- Multi-threading is effectively used in software to
  - Exploit multi-core as well as single-core platforms
  - Improve responsiveness
  - Improve real-time performance
- Heterogeneous computing is a new trend for achieving
  - High performance
  - Energy efficiency







# FBMT conceptual basis

- Do heterogeneous computing on a CPU + FPGA platform
  - Increase performance and power efficiency
- Use abstraction to handle threads on FPGA hardware
  - Make FPGA hardware threading as easy as software threading
- Use dynamic partial reconfiguration to allocate FPGA resources to threads
  - Maximize hardware utilization by reusing FPGA resources for different functions




### FBMT benefits and architecture

- Does not require a high-performance CPU
  - Computationally intensive tasks are executed by the FPGA
- Allows the design to be ported across different technologies and component quality grades
  - e.g. due to technology obsolescence or mission quality requirements
- Allows for scalability of system performance
  - More than one FPGA can be operated by the same CPU; the CPU can be multicore
- Allows for graceful degradation of the system
  - Either the CPU or the FPGA can be reconfigured to replace any failed functionality in the system, even at degraded performance








# FBMT programming

#### 

- Improved responsiveness
- Optimized exploitation of parallel resources
- Improved performance

#### Allows for independent tasks and parallel execution

- Threaded tasks can execute independently of each other and possibly in parallel
- Most operating systems support threads
- FBMT supports FPGA hardware threads in the same way as software threads








# **IMPLEMENTATION**













### Architectural Overview

#### Key Ideas

- Manage the FPGA resources by software: Execution Controller (EC) (Flexibility)
- Logical separation between the Host Machine (HM) and the EC (Portability and Scalability)
- HM and EC are both part of the Processing System (PS), implemented as hard IPs
- Modular and scalable FPGA design (Reusability and Power Scalability)

#### Advantages

- The HM and EC continue running even during a full FPGA reconfiguration
- No halt or reboot is needed to change the size and the number of the partially reconfigurable regions (PRRs)
- Could perform single-event effect (SEE) mitigation techniques such as partial and full FPGA scrubbing






### Demonstrator: Xilinx ZC706 Development Board







### Internal Block Diagram








### Execution Controller

Management software running on the ARM Cortex-A9 (CPU1) of the Zynq SoC.

#### Main functions

Executes a sequence of predefined operations for FPGA management:

- Exchanges information with the HM (asynchronous commands and status)
- Loads partial/full bitstreams and configures the PL
- Implements non-preemptive scheduling algorithms
- Monitors power consumption and temperature of the SoC
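
The non-preemptive scheduling listed above could look roughly like the following first-come, first-served sketch. It only illustrates the idea that a hardware thread keeps its reconfigurable region until it finishes. All names (`ec_sched`, `ec_submit`, `ec_dispatch`, `ec_complete`), the number of regions and the queue capacity are assumptions made for illustration and are not taken from the actual EC code.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PRR   4   /* assumed number of reconfigurable regions */
#define QUEUE_CAP 16  /* assumed capacity of the pending-request queue */

typedef struct {
    int busy;       /* 1 while a hardware thread occupies the region */
    int thread_id;  /* id of the running thread, -1 when idle */
} prr_slot;

typedef struct {
    prr_slot slots[NUM_PRR];
    int queue[QUEUE_CAP];  /* FIFO ring buffer of waiting thread ids */
    size_t head, count;
} ec_sched;

void ec_init(ec_sched *s) {
    for (size_t i = 0; i < NUM_PRR; i++) {
        s->slots[i].busy = 0;
        s->slots[i].thread_id = -1;
    }
    s->head = 0;
    s->count = 0;
}

/* Queue a request; returns 0 on success, -1 when the queue is full. */
int ec_submit(ec_sched *s, int thread_id) {
    assert(s != NULL);
    if (s->count == QUEUE_CAP) return -1;
    s->queue[(s->head + s->count) % QUEUE_CAP] = thread_id;
    s->count++;
    return 0;
}

/* Hand idle regions to waiting threads in arrival order.
 * Running threads are never evicted (non-preemptive).
 * Returns the number of threads started. */
int ec_dispatch(ec_sched *s) {
    int started = 0;
    for (size_t i = 0; i < NUM_PRR && s->count > 0; i++) {
        if (s->slots[i].busy) continue;
        s->slots[i].busy = 1;
        s->slots[i].thread_id = s->queue[s->head];
        s->head = (s->head + 1) % QUEUE_CAP;
        s->count--;
        started++;
        /* A real controller would load the partial bitstream here. */
    }
    return started;
}

/* Mark a region idle once its hardware thread has finished. */
void ec_complete(ec_sched *s, size_t slot) {
    assert(slot < NUM_PRR);
    s->slots[slot].busy = 0;
    s->slots[slot].thread_id = -1;
}
```

With four regions and six queued requests, the first call to `ec_dispatch` starts four threads; the remaining two wait until `ec_complete` frees a region.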

#### Advantages

- Written in ANSI C for maximum portability
- High-performance, real-time execution
- True concurrency of operation with respect to the HM






### Host Machine

Embedded Linux on the ARM Cortex-A9 (CPU0) of the Zynq SoC

- Application code in C/C++
- Manages hardware threads through the APIs

#### APIs

Set of predefined functions to control and use the FPGA resources:

- Simple for a software developer familiar with Pthreads
- Hides the complexity and details of the FPGA








# List of APIs

| API               | Description                                       |
|-------------------|---------------------------------------------------|
| fbmt_initialize() | Set the maximum number of concurrent HW Threads   |
| fbmt_create()     | Create one or more HW Threads                     |
| fbmt_join()       | Wait until the end of one or more HW Threads      |
| fbmt_cancel()     | Terminate the execution of one or more HW Threads |
| fbmt_set_sched()  | Select the scheduling algorithm                   |
| fbmt_malloc()     | Allocate virtual memory space for HW Thread       |
| fbmt_free()       | Free virtual memory space for HW Thread           |
| fbmt_mutex()      | Mutual Exclusion for HW Thread                    |





# APIs Usage Example
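
A minimal sketch of the call sequence that the API list suggests: initialize, allocate, create, join, free. The slides give only function names, not signatures, so the code below uses software stand-ins with a `mock_` prefix. Every signature, the `mock_kernel` type and the `add_one` kernel are assumptions for illustration, and the stand-in runs the kernel on the CPU rather than on the FPGA.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins: the real FBMT signatures are not given. */
#define MOCK_MAX_THREADS 8

typedef void (*mock_kernel)(int *buf, size_t n);

static size_t mock_limit = 0;    /* set by mock_initialize */
static size_t mock_running = 0;  /* threads created but not yet joined */

/* Stand-in for fbmt_initialize(): set the concurrency limit. */
int mock_initialize(size_t max_threads) {
    if (max_threads == 0 || max_threads > MOCK_MAX_THREADS) return -1;
    mock_limit = max_threads;
    mock_running = 0;
    return 0;
}

/* Stand-ins for fbmt_malloc() and fbmt_free(). */
int *mock_malloc(size_t n) { return (int *)calloc(n, sizeof(int)); }
void mock_free(int *buf) { free(buf); }

/* Stand-in for fbmt_create(): runs the kernel immediately in software.
 * A hardware thread would instead run on the FPGA asynchronously. */
int mock_create(mock_kernel k, int *buf, size_t n) {
    assert(k != NULL && buf != NULL);
    if (mock_running >= mock_limit) return -1;
    mock_running++;
    k(buf, n);
    return (int)(mock_running - 1);  /* thread handle */
}

/* Stand-in for fbmt_join(): the work is already done in this mock. */
int mock_join(int handle) {
    if (handle < 0 || mock_running == 0) return -1;
    mock_running--;
    return 0;
}

/* Example kernel: increment every element. */
void add_one(int *buf, size_t n) {
    for (size_t i = 0; i < n; i++) buf[i] += 1;
}

/* Pthreads-like lifecycle; returns the sum of the processed buffer. */
int run_example(void) {
    if (mock_initialize(2) != 0) return -1;
    int *data = mock_malloc(4);
    if (data == NULL) return -1;
    int t = mock_create(add_one, data, 4);
    if (t < 0 || mock_join(t) != 0) { mock_free(data); return -1; }
    int sum = data[0] + data[1] + data[2] + data[3];
    mock_free(data);
    return sum;
}
```

The shape mirrors a Pthreads program (`pthread_create` followed by `pthread_join`), which is the familiarity the API is said to rely on.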








### Performance

#### **APIs Performance**

| Description                | API               | Latency [ms] |
|----------------------------|-------------------|--------------|
| Host-FPGA command/response | fbmt_create()     | 1            |
| HW Thread instantiation    | fbmt_create()     | 60*          |
| FPGA initialization        | fbmt_initialize() | 700*         |

\*Partial or full bitstream configuration: latency depends on the size of the Hardware Thread or of the FPGA, respectively

FPGA configuration throughput: 210 Mib/s
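
The relation between bitstream size, throughput and configuration time can be sketched as below. This assumes that "Mib/s" means mebibits per second and that latency is dominated by the bitstream transfer; the function names are illustrative. Under those assumptions, 60 ms at 210 Mib/s corresponds to at most about 12.6 Mib (roughly 1.6 MiB) and 700 ms to at most about 147 Mib (roughly 18.4 MiB). Any fixed overhead included in the measured latencies would make the real bitstreams smaller.

```c
#include <assert.h>

/* Transfer time in milliseconds for a bitstream of size_mib mebibits
 * at a configuration throughput of rate_mib_per_s mebibits per second. */
double config_time_ms(double size_mib, double rate_mib_per_s) {
    assert(rate_mib_per_s > 0.0);
    return 1000.0 * size_mib / rate_mib_per_s;
}

/* Largest bitstream, in mebibits, that fits in time_ms at the given rate. */
double bitstream_size_mib(double time_ms, double rate_mib_per_s) {
    assert(time_ms >= 0.0);
    return rate_mib_per_s * time_ms / 1000.0;
}
```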

#### SoC Layout



- C++ (APIs) / C (EC)
- VHDL (no vendor IPs)
- 200 MHz





# Platform's Main Features

- Continues to operate during a full FPGA reconfiguration
- Can change the size and the number of concurrent Hardware Threads at run-time
- Manages the Hardware Threads using scheduling algorithms similar to those used for software threads
- Provides virtual memory support to Hardware Threads
- Monitors power consumption and temperature of the SoC and adapts the scheduling accordingly
- Could autonomously perform SEE mitigation techniques such as full and partial FPGA scrubbing
- Source code is fully portable and scalable over different platforms





# PROOF OF CONCEPT TEST CASE





# **Cloud Detection Application**

### Motivation

- On-board processing of multi-spectral images
  - 66% of cloudy pixels are considered useless data
- Save storage resources and downlink capacity

### Sentinel-2 multi-spectral images

- Copernicus Open Access Hub
- Download S2 Level-1C products
- Process Scene Classification Algorithm
- Accelerate execution with FBMT Hardware Threads

### Data volume

- Real-world, high-volume data
- High memory throughput
- High processing power







# **Cloud Detection Algorithm**

#### Input Data

- 9 spectral bands of S2 MSI ~ 250 MB
- Covers 100x100 km of the Earth's surface
  - 60 meters resolution per pixel

#### Output Data

- Scene Classification Mask
  - Cloud Mask
  - Snow Mask

#### Threshold-based Algorithm

- Compare reflectance to thresholds
- Two main filter sequences
  - Cloud Detection Sequence (CDS)
  - Snow Detection Sequence (SDS)

#### Two accelerator designs

- Accelerator 1 includes CDS and SDS
- Accelerator 2 includes CDS only
- Dynamic reconfiguration of accelerator designs according to image content
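
A threshold-based classifier with two filter sequences can be sketched in software as follows. This is a simplified illustration, not the Sentinel-2 scene classification algorithm: it uses two bands instead of nine, the threshold values are placeholders, and the names (`pixel`, `cds_is_cloud`, `sds_is_snow`, `classify_image`) are assumptions. The only physical fact relied on is that clouds tend to be bright in both visible and short-wave infrared bands, while snow is bright in the visible and dark in the short-wave infrared.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { CLASS_CLEAR = 0, CLASS_CLOUD = 1, CLASS_SNOW = 2 } pixel_class;

/* Placeholder thresholds on reflectance (0..1), for illustration only. */
#define VIS_BRIGHT  0.30f
#define SWIR_BRIGHT 0.20f

typedef struct {
    float visible;  /* reflectance in a visible band */
    float swir;     /* reflectance in a short-wave infrared band */
} pixel;

/* Cloud Detection Sequence: bright in both bands. */
int cds_is_cloud(pixel p) {
    return p.visible > VIS_BRIGHT && p.swir > SWIR_BRIGHT;
}

/* Snow Detection Sequence: bright in the visible, dark in the SWIR. */
int sds_is_snow(pixel p) {
    return p.visible > VIS_BRIGHT && p.swir <= SWIR_BRIGHT;
}

/* accelerator == 1 runs CDS and SDS, accelerator == 2 runs CDS only.
 * Writes one class per pixel into mask and returns the cloud count. */
size_t classify_image(const pixel *in, unsigned char *mask, size_t n,
                      int accelerator) {
    assert(accelerator == 1 || accelerator == 2);
    size_t clouds = 0;
    for (size_t i = 0; i < n; i++) {
        pixel_class c = CLASS_CLEAR;
        if (accelerator == 1 && sds_is_snow(in[i])) {
            c = CLASS_SNOW;
        } else if (cds_is_cloud(in[i])) {
            c = CLASS_CLOUD;
            clouds++;
        }
        mask[i] = (unsigned char)c;
    }
    return clouds;
}
```

Choosing `accelerator == 2` for scenes where no snow is expected skips the snow sequence entirely, which is the kind of content-dependent choice that dynamic reconfiguration makes possible in hardware.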








### **Processing Examples**







## Inside PR regions: Accelerator Architecture

#### Common Modules

- DMA engines, arbiters, AXI interfaces
- Internal buses, configuration registers

#### Custom Processing Elements (PEs)

- From simple addition to user-defined computation
- Could communicate together
- May be heterogeneous

#### Data-Triggered Configuration

- Computation is a side effect of data transfer
- Provides opportunities for data processing parallelism

#### Configurable parameters

- PE number, DMA size, Register map, ...
- AXI data bus width, AXI burst length
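
The idea that computation is a side effect of data transfer can be modelled in software as below: writing a word to a PE is what triggers its operation, and a DMA burst is simply a sequence of such writes spread over the available PEs. This is a behavioural sketch only. The `pe_model` structure, the round-robin distribution and the two example operations are assumptions, not the actual accelerator design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*pe_op)(uint32_t);

typedef struct {
    pe_op op;         /* computation performed by this PE */
    uint32_t *out;    /* destination buffer for results */
    size_t written;   /* results produced so far */
    size_t capacity;  /* size of the destination buffer */
} pe_model;

/* Writing a word to the PE triggers the computation.
 * Returns 0 on success, -1 when the output buffer is full. */
int pe_write(pe_model *pe, uint32_t word) {
    assert(pe != NULL && pe->op != NULL);
    if (pe->written == pe->capacity) return -1;
    pe->out[pe->written++] = pe->op(word);
    return 0;
}

/* Model a DMA burst as a loop of writes, distributed round-robin
 * over the PEs. Returns the number of words transferred. */
size_t dma_stream(pe_model *pes, size_t num_pes,
                  const uint32_t *src, size_t n) {
    assert(num_pes > 0);
    size_t sent = 0;
    for (size_t i = 0; i < n; i++) {
        if (pe_write(&pes[i % num_pes], src[i]) != 0) break;
        sent++;
    }
    return sent;
}

/* Two different operations, to illustrate heterogeneous PEs. */
uint32_t add_ten(uint32_t x) { return x + 10u; }
uint32_t twice(uint32_t x) { return 2u * x; }
```

No explicit "start" command appears anywhere in the model: the transfer itself drives the processing, which is what lets several PEs work on the same stream in parallel.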








### **Cloud Detection Example**

RGB image



Class Mask -- Software-Hardware mismatch



Class Mask -- Software processing












### Performance

- Software version
  - ARM Cortex-A9 hard core, 1 GHz
  - Linux OS, 1 GB of RAM

#### 

- FPGA utilization (4 PRRs, 32 bits)
  - Static Design ~ 5%
  - Total Design ~ 38%
- Accelerator DDR3 throughput
  - Read ~ 2 GB/s (max. 6 GB/s)
  - Write ~ 2.5 GB/s (max. 8 GB/s)









# **CONCLUSION AND PERSPECTIVE**





### Conclusion

FPGA-based multi-threading system for on-board processing

- Opens opportunities for flexible parallel computing
- Best performance-to-power ratio

#### 

- Implemented on Xilinx Zynq SoC FPGA
  - Leverages dynamic partial reconfiguration technology
- Full abstraction of FPGA thread creation and synchronization
- Portable and industry-friendly design

#### Cloud detection test case

- Proof of FPGA-based multi-threading concepts
- Processing element based accelerator design
- Performance
  - x40 faster
  - x12 more energy efficient
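
A back-of-the-envelope reading of the two performance figures, assuming that both are relative to the software version and that "energy efficient" refers to energy per processed image:

$$
E = P \cdot t, \qquad
\frac{P_{\mathrm{hw}}}{P_{\mathrm{sw}}}
= \frac{E_{\mathrm{hw}} / t_{\mathrm{hw}}}{E_{\mathrm{sw}} / t_{\mathrm{sw}}}
= \frac{t_{\mathrm{sw}} / t_{\mathrm{hw}}}{E_{\mathrm{sw}} / E_{\mathrm{hw}}}
= \frac{40}{12} \approx 3.3
$$

Under those assumptions, the accelerated run would draw about 3.3 times the average power of the software-only run, but for one fortieth of the time.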





### Perspectives

#### Develop a library of accelerators

- High-level synthesis
- Automatic RHBD (Radiation Hardened By Design) design generation

#### Scalability roadmap

- Multi-FPGA system
- Compact backplane technology
- High-speed links

#### 

- System-level radiation analysis
- COTS component selection
- Radiation tests








### For further information or feedback

Syderal

- Pasquale Lombardi, pasquale.lombardi@syderal.ch
- Bilel Belhadj, bilel.belhadj@syderal.ch

EPFL LAP

- Andrea Guerrieri, andrea.guerrieri@epfl.ch
- Sahand Kashani-Akhavan, sahand.kashani-akhavan@epfl.ch
- Paolo Ienne, paolo.ienne@epfl.ch