### Using eFPGA for LEON2-FT Instruction Set Extensions





### **SEFUW2023**

March 16, 2023

### Martin Daněk, Roman Bartosiński, daiteq







- Concept of custom instructions with the SWAR unit
- SWAR in eFPGA
- SWAR for CCSDS121
  - Speedup due to SWAR
  - Implementation in Menta eFPGA
- Conclusions

This work has been performed under an ESA contract 4000122242/17/NL/LF.

Using eFPGA for LEON2-FT ISE

## Overview







### SWAR = SIMD-within-a-register

- Term introduced by R.J.Fisher in 2003
- Initial study of SWAR for GNSS in LEON2-FT performed by H.A.Bridonneau at ESTEC in 2014
- SWAR implemented in LEON2-FT (2019) and NOEL-V (2022) by daiteq, including support in LLVM

## SWAR history









- Find ways to increase processing with minimal changes to the processor microarchitecture
- E.g. for LEON2-FT:



## Motivation for SWAR

Consider low-precision integer numbers, e.g. 2-bit GNSS samples









### SWAR "skeletons"

[Figure: the two SWAR skeletons, REDUCE and MAP & ACCUMULATE. Labels: SWAR OP #1 (E0-E3), SWAR OP #2 (E4-E7), RESULT1-RESULT8. **ACx** denotes an accumulator.]





Type definition

`typedef unsigned int u16x2b __attribute__((subword(2, 16)));`

`u16x2b lvecL[50], lvecM[50], lvecN[50];`

- Either inferring SWAR operations
- Or explicitly configuring SWAR operations


## SWAR C-level API

```
/* inferred SWAR operation: lane-wise multiplication */
for (int i=0;i<50;i++) {
  lvecN[i] = lvecL[i] * lvecM[i];
}
/* inferred SWAR operation: accumulation */
unsigned int accum = 0;
for (int i=0;i<50;i++) {
  accum += lvecL[i];
}
```

```
set_swar_op(0x4);
for (int i=0;i<50;i++) {
  op_swar(lvecL[i], lvecM[i], lvecN[i]);
}
```

```
set_swar_op(0x10);
for (int i=0;i<50;i++) {
  op_swar(lvecL[i], 0x0, tmp);
  accum += tmp;
}
```





### A minimalist approach using generic instructions:

- SWAR
  - exec SWAR operation
- SWARcc
  - exec SWAR operation -> flags
- WRASR
  - set SWAR operation
  - set SWAR accumulator for readback
- RDASR
  - read the selected SWAR accumulator

### SWAR machine-level API











## LEON2-FT + SWAR

ARITH: columns are op3(5-4), rows are op3(3-0)

| op3(3-0) / op3(5-4) | 00   | 01     | 10              | 11        |
|---------------------|------|--------|-----------------|-----------|
| 0000                | ADD  | ADDcc  | TADDcc          | WRASR/WRY |
| 0001                | AND  | ANDcc  | TSUBcc          | WRPSR     |
| 0010                | OR   | ORcc   | TADDccTV        | WRWIM     |
| 0011                | XOR  | XORcc  | TSUBccTV        | WRTBR     |
| 0100                | SUB  | SUBcc  | MULScc          | FPop1     |
| 0101                | ANDN | ANDNcc | SLL             | FPop2     |
| 0110                | ORN  | ORNcc  | SRL             | CPop1     |
| 0111                | XNOR | XNORcc | SRA             | CPop2     |
| 1000                | ADDX | ADDXcc | RDASR/RDY/STBAR | JMPL      |
| 1001                | SWAR | SWARcc | RDPSR           | RETT      |
| 1010                | UMUL | UMULcc | RDWIM           | TICC      |
| 1011                | SMUL | SMULcc | RDTBR           | FLUSH     |
| 1100                | SUBX | SUBXcc |                 | SAVE      |
| 1101                |      |        |                 | RESTORE   |
| 1110                | UDIV | UDIVcc |                 |           |
| 1111                | SDIV | SDIVcc |                 |           |

[Figure labels: SWAR CFG, SWAR ACC SEL]









### True user-defined instructions with SWAR

The idea: enable users to define their own application-specific machine-level instructions in LEON2-FT ASIC (or NOEL-V ASIC).

How: implement the SWAR unit as an eFPGA.

Current evaluation: main focus on satellite applications

- GNSS tracking loop
- GPP FIR filter
- Image compression CCSDS121









- Correlation 2b, 3b, 4b
- Demodulation (multiplication) 2b, 3b, 4b
- Sine/cosine lookup 32b argument -> 2b, 3b, 4b value
- ALU add, sub, mul, shr 8b, 16b
- Entropy coding 8b, 16b, 32b

## SWAR modules











### The concept: eFPGA = SWAR unit










SWAR accumulators














| SHyLoC case | SWAR     | Input     | Block size | k-split (shift right) |
|-------------|----------|-----------|------------|-----------------------|
| 1           | CCSDS121 | 32b words | 64 words   | 1,2,...,29            |
| 2           | CCSDS121 | 16b words | 64 words   | 1,2,...,13            |
| 3           | CCSDS121 | 8b words  | 64 words   | 1,2,...,5             |


### SWAR for CCSDS121


### CCSDS121 speedup w/ SWAR (shr)

| TestID       | Nx  | Ny  | Nz | D  | J  | ICount_orig | ICount_swar | Improvement [%] |
|--------------|-----|-----|----|----|----|-------------|-------------|-----------------|
| img1_14_16_0 | 512 | 512 | 32 | 14 | 16 | 10208501256 | 8740487560  | 14.38           |
| img1_14_32_0 | 512 | 512 | 32 | 14 | 32 | 9696999386  | 7964748186  | 17.86           |
| img1_14_64_0 | 512 | 512 | 32 | 14 | 64 | 9439838250  | 7575468298  | 19.75           |
| img1_14_8_1  | 512 | 512 | 32 | 14 | 8  | 12081073145 | 11141437689 | 7.78            |
| img1_14_16_1 | 512 | 512 | 32 | 14 | 16 | 10974968387 | 9506848963  | 13.38           |
| img1_14_32_1 | 512 | 512 | 32 | 14 | 32 | 10418597848 | 8686236440  | 16.63           |
| img1_14_64_1 | 512 | 512 | 32 | 14 | 64 | 10141518742 | 8277036342  | 18.38           |
| img2_14_8_0  | 512 | 512 | 32 | 14 | 8  | 11127286428 | 10187747740 | 8.44            |
| img2_14_16_0 | 512 | 512 | 32 | 14 | 16 | 10107851497 | 8639837801  | 14.52           |
| img2_14_32_0 | 512 | 512 | 32 | 14 | 32 | 9593914977  | 7861663777  | 18.06           |
| img2_14_64_0 | 512 | 512 | 32 | 14 | 64 | 9333788568  | 7469418616  | 19.97           |
| img2_14_8_1  | 512 | 512 | 32 | 14 | 8  | 11973732857 | 11034055740 | 7.85            |
| img2_14_16_1 | 512 | 512 | 32 | 14 | 16 | 10886353712 | 9418234288  | 13.49           |
| img2_14_32_1 | 512 | 512 | 32 | 14 | 32 | 10340108590 | 8607747182  | 16.75           |
| img2_14_64_1 | 512 | 512 | 32 | 14 | 64 | 10067925373 | 8203442973  | 18.52           |

Execution in QEMU:

- ICount\_orig - instruction count for SHyLoC w/o SWAR
- ICount\_swar - instruction count for SHyLoC w/ SWAR





## eFPGA for SWAR - sampled SWAR configurations

- Generate a common architecture for all configurations
- Achieve frequency at least 100MHz in GF22FDX UHDGP

| config | swar_unit description                  | OPS | SinCos | Accumulators |
|--------|----------------------------------------|-----|--------|--------------|
| 0      | hand-optimized for GNSS, all bitwidths | CDL | Y      | No           |
| 1      | hand-optimized for GNSS, 2b only       | CDL | Y      | No           |
| 2      | hand-optimized for GNSS, 3b only       | CDL | Y      | No           |
| 3      | hand-optimized for GNSS, 4b only       | CDL | Y      | No           |
| 4      | swar_alu w/ swar_acc video - 2x16b     | AMS | N      | No           |
| 5      | swar_alu w/ swar_acc audio - 4x8b      | AMS | N      | No           |
| 6      | swar_alu w/ swar_acc gen 3x10b         | AMS | N      | No           |
| 7      | swar_alu w/ swar_acc video - 2x16b     | AMS | N      | 2x38b        |
| 8      | swar_alu w/ swar_acc audio - 4x8b      | AMS | N      | 4x22b        |
| 9      | swar_alu w/ swar_acc gen 3x10b         | AMS | N      | 3x20b        |
| 10     | ALU SHR 1x 32b                         | S   | N      | 1x38b        |
| 11     | ALU SHR 2x 16b                         | S   | N      | 1x22b        |
| 12     | ALU SHR 4x 8b                          | S   | N      | 1x15b        |


OPS:

- C correlation
- D demodulation
- L sine/cosine lookup
- A addition
- M multiplication
- S shift right









## Menta Origami Programmer

[Screenshot: Menta Origami Programmer showing the resources summary and the net list for SWAR config 9 (nets `\config_9.i_swar.g_acc.swacc0.r_reg.q[...]`). Fabric resources visible in the summary: 864 LUTs, 18 MNT_DSP_I16_32P blocks, 381 input ports and 382 output ports.]

SWAR configurations are described in VHDL/Verilog, and implemented in Origami Programmer.

The tool generates simulation netlists (Verilog), configuration bitstream, and implementation reports.




### eFPGA for SWAR - operating frequencies for implemented configurations

| Parameter / Architecture | base | dsp_v1 | dsp_v2 | dsp_v3 | dsp_v4 |
|-----------------|--------------|---------|---------|---------|---------|
| CLK/S/R         | 1            | 1       | 1       | 1       | 1       |
| #LUT6           | 1152         | 1296    | 1008    | 1440    | 864     |
| DSP type        | I24_48_F32P  | I16_32P | I16_32P | I8_16P  | I16_32P |
| DSP             | 24           | 27      | 21      | 18      | 18      |
| INs             | 477          | 477     | 413     | 477     | 381     |
| OUTs            | 478          | 478     | 414     | 478     | 382     |
| GCLK            | 1            | 1       | 1       | 1       | 1       |
| GS/GR           | 1            | 1       | 1       | 1       | 1       |
| Area [%]        | 100          | 66      | 52      | 63      | 44      |
| Pstat [%]       | 100          | 112     | 87      | 99      | 99      |
| **SWAR implementations** |  |  |  |  |  |
| config1 [MHz]   | 126.378      | 117.691 | 137.039 | 111.696 | 128.925 |
| config2 [MHz]   | 117.681      | 119.252 | 129.957 | 108.532 | 128.155 |
| config3 [MHz]   | 118.010      | 126.364 | 126.921 | 136.267 | 117.446 |
| config4 [MHz]   | 91.387       | 90.573  | 93.188  | 82.643  | 103.406 |
| config5 [MHz]   | 85.159       | 91.886  | 86.579  | 86.100  | 93.158  |
| config6 [MHz]   | 79.218       | 85.303  | 81.046  | 80.302  | 84.098  |
| config7 [MHz]   | 85.333       | 97.586  | 94.341  | 85.010  | 94.730  |
| config8 [MHz]   | 88.389       | 96.574  | 101.720 | 90.525  | 94.287  |
| config9 [MHz]   | 85.022       | 93.203  | 97.269  | 90.707  | 93.675  |
| config10 [MHz]  | 199.848      | 207.524 | 204.413 | 175.645 | 203.886 |
| config11 [MHz]  | 146.017      | 161.346 | 156.018 | 131.403 | 155.925 |
| config12 [MHz]  | 157.286      | 152.539 | 167.408 | 157.389 | 156.068 |


Target tech - gf22fdxuhdgp

All configurations used identical pinout.

Architecture:

- base - auto-generated
- dsp\_vx - hand-tuned







- Using custom instructions in LEON2-FT can significantly improve performance, e.g., ~1.25x fewer executed instructions for CCSDS121 (compiled, swar shr version), ~2x faster FIR filter (compiled), >2x faster GNSS tracking loop (hand-crafted).
- For the considered SWAR configurations, realistically achievable eFPGA frequencies are slightly below 100MHz (gf22fdxuhdgp).
- Certain SWAR configurations may need to execute in 2 or more pipeline cycles so as not to impose a low clock frequency on the whole LEON2-FT => introduce support for SWAR idle cycle insertion.
  - The impact on performance is expected to be low, since the proportion of SWAR instructions is unlikely to be close to 100% in real applications. Key factor: (clock cycles per SWAR instruction) / (clock cycles per replaced kernel)
- Reconfiguration time for the dsp\_v4 fabric is ~106us (10580 cycles @100MHz over SPI-1). Consider multi-context configuration, e.g. using two multiplexed eFPGAs.
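The numbers in the last bullets can be checked with a short calculation. The reconfiguration time follows directly from the quoted cycle count and clock. The idle-cycle estimate is only a rough model and rests on assumptions that are not in the slides: every non-SWAR instruction takes one cycle, a fraction $p$ of the executed instructions are SWAR instructions, and each of those takes $c$ cycles.

$$
t_{\mathrm{reconf}} = \frac{10580\ \text{cycles}}{100\ \text{MHz}} = 105.8\ \mu\text{s} \approx 106\ \mu\text{s}
$$

$$
\frac{\text{cycles with idle insertion}}{\text{cycles without}} = (1 - p) + p\,c = 1 + p\,(c - 1)
$$

For example, $p = 0.1$ and $c = 2$ give a factor of $1.1$, i.e. 10% more cycles, which would still leave a net gain whenever each SWAR instruction replaces a kernel of several ordinary instructions.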

### Conclusions















daiteq

### THANK YOU





